YouTube Transcription Generator (VLM Run)
Generate transcriptions from YouTube videos using vlmrun for speech-to-text and optional timestamps. This skill:
- Downloads the YouTube video (or audio) with yt-dlp.
- Transcribes the video with vlmrun (Orion visual AI).
- Saves the transcript to a file (plain text or with timestamps).
Refer to vlmrun-cli-skill for vlmrun CLI setup, environment variables, and all vlmrun chat options.
How the assistant should use this skill
- Check .env for API key
  - Ensure .env (or .env.local) contains VLMRUN_API_KEY.
  - If missing, instruct the user to set it before running any vlmrun commands.
- Use vlmrun for transcription only
  - For transcription (and optional timestamps), use the vlmrun CLI with a video file as input (-i <video>).
  - vlmrun accepts video files (e.g. .mp4). For YouTube, the skill first downloads the video with yt-dlp, then passes the file to vlmrun.
- Workflow
  - The user provides a YouTube URL (and optionally an output path).
  - Download the video (or audio only, for a smaller and faster download) with yt-dlp.
  - Run: vlmrun chat "Transcribe this video with timestamps for each section. Output the full transcript in a clear, readable format." -i <downloaded_file> -o <output_dir>
  - Capture vlmrun's response and save it as the transcript file (e.g. transcript.txt).
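
The two command-line steps of this workflow can be sketched in Python as argument lists. This is a minimal sketch, not the contents of scripts/run_transcription.py: the function name plan_commands and the fixed file name video.mp4 are illustrative assumptions, while the format selector, prompt, and flags are the ones quoted in this document.

```python
from pathlib import Path

# Format selector and prompt are copied from this document; the function
# name and the fixed "video.mp4" file name are illustrative assumptions.
VIDEO_FORMAT = "bv*[ext=mp4]+ba/best[ext=mp4]/best"
PROMPT = (
    "Transcribe this video with timestamps for each section. "
    "Output the full transcript in a clear, readable format."
)


def plan_commands(url: str, out_dir: str) -> list:
    """Return the argv lists for the download step and the transcription step."""
    video = str(Path(out_dir) / "video.mp4")
    download = ["yt-dlp", "-f", VIDEO_FORMAT, "-o", video, url]
    transcribe = ["vlmrun", "chat", PROMPT, "-i", video, "-o", out_dir]
    return [download, transcribe]


if __name__ == "__main__":
    # Print the planned commands instead of running them.
    for argv in plan_commands("https://www.youtube.com/watch?v=VIDEO_ID", "./output"):
        print(argv)
```

Passing each list to subprocess.run, in order, would execute the workflow. Keeping the commands as lists avoids shell quoting problems with the prompt and the format selector.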
Prerequisites
- Python 3.10+
- VLMRUN_API_KEY (required for vlmrun)
- vlmrun CLI (vlmrun[cli])
- yt-dlp (for downloading YouTube videos)
See vlmrun-cli-skill for detailed vlmrun usage and examples (including video transcription).
Installation & Setup
From the youtube-transcription-generator directory:
Windows (PowerShell):
    cd path\to\youtube-transcription-generator
    uv venv
    .venv\Scripts\Activate.ps1
    uv pip install -r requirements.txt
macOS/Linux:
    cd path/to/youtube-transcription-generator
    uv venv
    source .venv/bin/activate
    uv pip install -r requirements.txt
Copy .env_template to .env and set VLMRUN_API_KEY.
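
To confirm the key is visible before running anything, a small check along these lines may help. It is a sketch under assumptions: the helper names are hypothetical, the lookup order (shell, then .env, then .env.local) is a choice made here rather than documented vlmrun behavior, and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_api_key(project_dir: str = ".") -> Optional[str]:
    """Look for VLMRUN_API_KEY in the shell, then .env, then .env.local (assumed order)."""
    key = os.environ.get("VLMRUN_API_KEY")
    if key:
        return key
    for name in (".env", ".env.local"):
        key = read_env_file(Path(project_dir) / name).get("VLMRUN_API_KEY")
        if key:
            return key
    return None
```

If find_api_key() returns None, the key is missing from all three places and should be set before any vlmrun command is run.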
Quick Start: Transcribe a YouTube Video
Option A: Run the script (recommended)
    # From the youtube-transcription-generator directory, with the venv activated
    python scripts/run_transcription.py "https://www.youtube.com/watch?v=VIDEO_ID" -o ./output
This will:
- Download the video with yt-dlp into the output directory.
- Run vlmrun to transcribe the video.
- Save the transcript as output/transcript.txt (and keep artifacts in output/).
Option B: Manual vlmrun (after downloading the video yourself)
    # 1) Download with yt-dlp
    yt-dlp -f "bv*[ext=mp4]+ba/best[ext=mp4]/best" -o video.mp4 "https://www.youtube.com/watch?v=VIDEO_ID"

    # 2) Transcribe with vlmrun (see vlmrun-cli-skill for options)
    vlmrun chat "Transcribe this video with timestamps for each section. Output the full transcript in a clear, readable format." -i video.mp4 -o ./output
Capture the vlmrun stdout and save it as your transcript, or use --json if you need structured output.
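
One way to capture stdout and save it is sketched below. The helper name save_stdout is hypothetical, and the sketch assumes that everything the command prints to stdout belongs in the transcript.

```python
import subprocess
from pathlib import Path


def save_stdout(argv: list, transcript_path: str) -> Path:
    """Run argv, raise if it exits non-zero, and write its stdout to a file."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    path = Path(transcript_path)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(result.stdout, encoding="utf-8")
    return path
```

For the manual flow this could be called as save_stdout(["vlmrun", "chat", "<prompt>", "-i", "video.mp4", "-o", "./output"], "output/transcript.txt"). With check=True a failed run raises subprocess.CalledProcessError instead of writing an empty transcript.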
Prompt variants for vlmrun
- With timestamps:
  "Transcribe this video with timestamps for each section. Output the full transcript in a clear, readable format."
- Plain transcript only:
  "Transcribe everything said in this video. Output only the spoken text, no timestamps."
- Structured (e.g. JSON):
  Use --json and ask for a structured format in the prompt (e.g. a list of { "time": "...", "text": "..." }).
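
The variants above can be selected programmatically, as in the following sketch. The names PROMPTS, chat_args, and format_segments are hypothetical, and the wording of the structured prompt is an assumption (the text above only describes it). The exact shape of vlmrun's --json output is not specified here, so format_segments assumes its input is exactly the requested JSON list and nothing else.

```python
import json

PROMPTS = {
    "timestamps": (
        "Transcribe this video with timestamps for each section. "
        "Output the full transcript in a clear, readable format."
    ),
    "plain": (
        "Transcribe everything said in this video. "
        "Output only the spoken text, no timestamps."
    ),
    # Assumed wording: the document describes this variant but does not quote it.
    "structured": (
        "Transcribe this video. Return a JSON list of objects "
        'like {"time": "...", "text": "..."}.'
    ),
}


def chat_args(mode: str, video: str, out_dir: str) -> list:
    """Build the vlmrun argv for one variant; --json is added only for 'structured'."""
    argv = ["vlmrun", "chat", PROMPTS[mode], "-i", video, "-o", out_dir]
    if mode == "structured":
        argv.append("--json")
    return argv


def format_segments(raw: str) -> str:
    """Render a JSON list of {"time", "text"} items as '[time] text' lines."""
    segments = json.loads(raw)
    return "\n".join(f"[{s['time']}] {s['text']}" for s in segments)
```

If the real output wraps the list in an envelope or adds extra fields, format_segments would need to unwrap it first.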
Workflow checklist
- Confirm vlmrun is installed and VLMRUN_API_KEY is set (see vlmrun-cli-skill).
- Install dependencies: uv pip install -r requirements.txt (includes vlmrun[cli] and yt-dlp).
- Run python scripts/run_transcription.py <youtube_url> -o ./output, or download the video and run vlmrun manually.
- Find the transcript in the output directory (e.g. output/transcript.txt).
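
The first checklist item can be automated with a short preflight check. This is a sketch with a hypothetical function name. It only inspects PATH and the current shell environment, so a key that lives only in .env is reported as unset.

```python
import os
import shutil


def preflight(tools=("vlmrun", "yt-dlp"), env_var="VLMRUN_API_KEY") -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [
        f"{tool} not found on PATH" for tool in tools if shutil.which(tool) is None
    ]
    if not os.environ.get(env_var):
        problems.append(f"{env_var} is not set in the current shell")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("-", problem)
```

Run it with the venv activated, because shutil.which only finds vlmrun and yt-dlp when the venv's scripts directory is on PATH.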
Troubleshooting
- vlmrun not found
  Activate the venv and run: uv pip install "vlmrun[cli]". See vlmrun-cli-skill.
- Authentication errors
  Verify VLMRUN_API_KEY in .env or the current shell.
- yt-dlp fails
  Update yt-dlp: uv pip install -U yt-dlp. Check that the URL is a valid public YouTube video.
- Large or long videos
  Use audio-only download in the script (e.g. -f bestaudio) to reduce size and speed up transcription.
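
For the audio-only case, the download arguments might be built as in the sketch below. The function name and the audio_only parameter are hypothetical and may not match how scripts/run_transcription.py exposes this option. The sketch uses yt-dlp's %(id)s and %(ext)s output-template fields because an audio-only stream is often not an .mp4 file.

```python
def download_args(url: str, out_dir: str, audio_only: bool = False) -> list:
    """Build a yt-dlp argv; audio_only switches to the bestaudio selector."""
    selector = "bestaudio" if audio_only else "bv*[ext=mp4]+ba/best[ext=mp4]/best"
    # Let yt-dlp fill in the real extension (for example .webm or .m4a)
    # instead of hard-coding .mp4 in the output name.
    template = f"{out_dir}/%(id)s.%(ext)s"
    return ["yt-dlp", "-f", selector, "-o", template, url]
```

The downloaded file name then depends on the video ID and the stream's container, so the path passed to vlmrun with -i has to be looked up in the output directory rather than assumed.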