Whisper Transcribe
Transcribe and translate audio/video files locally using OpenAI Whisper. Supports 99 languages, runs entirely on your machine.
Prerequisites
Run once to install dependencies:
pip install openai-whisper --quiet
pip install transformers accelerate --quiet  # For HuggingFace fine-tuned models
ffmpeg is required for audio processing:
brew install ffmpeg # macOS
Step-by-Step Workflow
For ANY transcription/translation request, follow these steps:
Step 1: Check dependencies
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/check_deps.py
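The contents of check_deps.py are not shown here. As a rough sketch of what such a check could involve, assuming it only verifies the packages and the ffmpeg binary named under Prerequisites, something like the following would do. The function name check_dependencies is hypothetical.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a mapping of dependency name -> whether it appears available.

    Hypothetical helper; the real check_deps.py may check more or differently.
    """
    # pip package name -> importable top-level module name.
    modules = {
        "openai-whisper": "whisper",
        "transformers": "transformers",
        "accelerate": "accelerate",
    }
    # find_spec returns None when a top-level module cannot be located.
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }
    # ffmpeg is an external binary, so look for it on PATH instead.
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```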
Step 2: Determine intent and run the appropriate command
User wants to transcribe audio/video to text:
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE_PATH>" --output-dir ~/Downloads
User wants to translate audio/video to English:
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE_PATH>" --output-dir ~/Downloads
User wants to detect the language:
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py detect "<FILE_PATH>"
User wants file info without transcribing:
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py info "<FILE_PATH>"
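The mapping from intent to command above can be sketched as a small helper that builds an argument list. This is an illustration only: build_command is a hypothetical name, and it assumes --model is accepted by translate as well as transcribe, which the examples in this document do not confirm.

```python
import os

PYTHON = "/usr/local/opt/python@3.11/bin/python3.11"
SCRIPT = os.path.expanduser(
    "~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py"
)


def build_command(action, file_path, output_dir=None, language=None, model=None):
    """Build the argument list for one of the four subcommands shown above."""
    if action not in {"transcribe", "translate", "detect", "info"}:
        raise ValueError(f"unknown action: {action!r}")
    cmd = [PYTHON, SCRIPT, action, file_path]
    # detect and info take only the file path in the examples above, so the
    # optional flags are attached to transcribe/translate only.
    if action in {"transcribe", "translate"}:
        if output_dir:
            cmd += ["--output-dir", os.path.expanduser(output_dir)]
        if language:
            cmd += ["--language", language]
        if model:
            cmd += ["--model", model]
    return cmd
```

Passing the resulting list to subprocess.run (rather than a single shell string) avoids quoting problems when the file path contains spaces.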
Step 3: Report results
Tell the user:
- Detected language and confidence
- The full transcription text
- Where output files were saved (text, SRT subtitles, JSON)
- Processing time
- If translated: both original language and English translation
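Since the script prints JSON to stdout and status messages to stderr, the report could be assembled along these lines. This is a sketch under stated assumptions: the JSON schema is not documented here, so every key name below (language, confidence, text, translation, files, processing_time) is hypothetical and should be adapted to the actual output.

```python
import json
import subprocess


def run_and_parse(cmd):
    """Run a command and parse its stdout as JSON; stderr is captured separately."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


def format_report(result):
    """Turn a result dict into the summary lines listed above.

    All key names are hypothetical placeholders for the script's real schema.
    """
    lines = []
    if "language" in result:
        confidence = result.get("confidence")
        suffix = f" ({confidence:.0%} confidence)" if confidence is not None else ""
        lines.append(f"Detected language: {result['language']}{suffix}")
    if "text" in result:
        lines.append(f"Transcription: {result['text']}")
    if "translation" in result:
        lines.append(f"English translation: {result['translation']}")
    for kind, path in result.get("files", {}).items():
        lines.append(f"Saved {kind}: {path}")
    if "processing_time" in result:
        lines.append(f"Processing time: {result['processing_time']:.1f}s")
    return "\n".join(lines)
```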
All Commands
# Transcribe audio/video (auto-detects language, saves .txt + .srt + .json)
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --output-dir ~/Downloads

# Transcribe with a specific source language (faster, skips detection)
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --language te

# Transcribe with a larger model for better accuracy
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --model medium

# Transcribe with a specific HuggingFace fine-tuned model
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --hf-model "vasista22/whisper-telugu-large-v2"

# Translate any language to English
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE>" --output-dir ~/Downloads

# Translate with known source language
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE>" --language te

# Detect language of audio
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py detect "<FILE>"

# Show audio file metadata
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py info "<FILE>"
Indian Language Fine-Tuned Models
The skill supports 12 Indian languages with fine-tuned Whisper models from two sources:
- vasista22 (IIT Madras Speech Lab): hosted on HuggingFace, plug-and-play
- AI4Bharat IndicWhisper: downloaded as a ZIP and cached locally at ~/.cache/indicwhisper/
Auto-routing: Just pass --language <code> — the best model is selected automatically:
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --language te
Manual override: Use --hf-model to specify any HuggingFace Whisper model:
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --hf-model "vasista22/whisper-telugu-large-v2"
vasista22 Models (HuggingFace — auto-downloaded)
| Language | Code | Model |
|---|---|---|
| Telugu | te | vasista22/whisper-telugu-large-v2 |
| Hindi | hi | vasista22/whisper-hindi-large-v2 |
| Kannada | kn | vasista22/whisper-kannada-medium |
| Gujarati | gu | vasista22/whisper-gujarati-medium |
| Tamil | ta | vasista22/whisper-tamil-medium |
Models by vasista22 (IIT Madras Speech Lab), funded by Bhashini / MeitY.
AI4Bharat IndicWhisper Models (ZIP download — cached locally)
These models are fine-tuned on Whisper-medium using the Vistaar dataset.
First use downloads the model ZIP (~500-800 MB) and caches it at ~/.cache/indicwhisper/<language>/.
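To tell a user in advance whether a first-use download is likely, the cache location can be inspected. The sketch below assumes the <language> directory is named after the language code (for example bn), which this document does not state explicitly; both function names are hypothetical.

```python
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "indicwhisper"


def indicwhisper_cache_dir(language_code, root=CACHE_ROOT):
    """Directory where a language's model is assumed to be cached."""
    return root / language_code


def is_cached(language_code, root=CACHE_ROOT):
    """True when the language directory exists and contains at least one entry."""
    model_dir = indicwhisper_cache_dir(language_code, root)
    return model_dir.is_dir() and any(model_dir.iterdir())
```

If is_cached("bn") returns False, the next Bengali transcription should be expected to download the ZIP first.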
| Language | Code | Source |
|---|---|---|
| Bengali | bn | IndicWhisper (AI4Bharat) |
| Malayalam | ml | IndicWhisper (AI4Bharat) |
| Marathi | mr | IndicWhisper (AI4Bharat) |
| Odia | or | IndicWhisper (AI4Bharat) |
| Punjabi | pa | IndicWhisper (AI4Bharat) |
| Sanskrit | sa | IndicWhisper (AI4Bharat) |
| Urdu | ur | IndicWhisper (AI4Bharat) |
Models by AI4Bharat (IIT Madras), MIT licensed.
Priority
When a language has models from both sources (e.g. Hindi, Gujarati, Kannada, Tamil), the vasista22 HuggingFace model is preferred. IndicWhisper is used for languages not covered by vasista22.
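The routing and priority rules described in this section could be expressed roughly as follows. This is a minimal sketch built from the two tables above, not the script's actual code; select_model and the source labels it returns are hypothetical.

```python
# Language code -> vasista22 model, copied from the table above.
VASISTA22_MODELS = {
    "te": "vasista22/whisper-telugu-large-v2",
    "hi": "vasista22/whisper-hindi-large-v2",
    "kn": "vasista22/whisper-kannada-medium",
    "gu": "vasista22/whisper-gujarati-medium",
    "ta": "vasista22/whisper-tamil-medium",
}
# Language codes served by AI4Bharat IndicWhisper, from the table above.
INDICWHISPER_LANGUAGES = {"bn", "ml", "mr", "or", "pa", "sa", "ur"}


def select_model(language=None, hf_model=None, default_model="base"):
    """Return a (source, model) pair following the stated priority."""
    # 1. An explicit --hf-model always wins (manual override).
    if hf_model:
        return ("huggingface", hf_model)
    # 2. vasista22 is preferred whenever it covers the language.
    if language in VASISTA22_MODELS:
        return ("huggingface", VASISTA22_MODELS[language])
    # 3. IndicWhisper handles the remaining Indian languages.
    if language in INDICWHISPER_LANGUAGES:
        return ("indicwhisper", language)
    # 4. Everything else falls back to a stock Whisper model.
    return ("openai-whisper", default_model)
```

Checking vasista22 before IndicWhisper is what implements the preference: a language present in both sources never reaches the third branch.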
Model Sizes
| Model | Parameters | Speed | Accuracy | Best for |
|---|---|---|---|---|
| tiny | 39 M | Fastest | Low | Quick drafts, clear speech |
| base | 74 M | Fast | Good | Default, good balance |
| small | 244 M | Moderate | Better | Noisy audio, accented speech |
| medium | 769 M | Slow | Great | Non-English, complex audio |
| large | 1550 M | Slowest | Best | Maximum accuracy, rare languages |
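For reference, choosing a model size and task with the openai-whisper Python package looks roughly like this. It is a sketch of the library calls (whisper.load_model and model.transcribe), not of how whisper_transcribe.py is implemented; decode_options and run_whisper are hypothetical helper names.

```python
def decode_options(task="transcribe", language=None):
    """Build keyword arguments for model.transcribe()."""
    if task not in {"transcribe", "translate"}:
        raise ValueError(f"unsupported task: {task!r}")
    options = {"task": task}
    # Leaving language unset lets Whisper auto-detect it.
    if language:
        options["language"] = language
    return options


def run_whisper(path, model_name="base", task="transcribe", language=None):
    """Load a model by size name and run it on one file."""
    import whisper  # lazy import; provided by the openai-whisper package

    model = whisper.load_model(model_name)  # e.g. "tiny", "base", "medium"
    # The result is a dict with "text", "segments" and "language" keys.
    return model.transcribe(path, **decode_options(task, language))
```

A call such as run_whisper("talk.mp3", model_name="medium", task="translate") trades speed for accuracy, in line with the table above.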
Supported Languages (selection)
English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Telugu, Tamil, Bengali, Turkish, Ukrainian, Vietnamese, Thai, Indonesian, Swedish, and 70+ more.
Important Notes
- Default output location is ~/Downloads
- All output is JSON to stdout, status messages go to stderr
- Three output files per transcription: .txt (plain text), .srt (subtitles), .json (structured)
- Works with both audio files (mp3, wav, m4a, ogg, flac) and video files (mp4, mkv, webm, mov)
- Video files have audio automatically extracted before transcription
- Translation always outputs English (this is a Whisper limitation)
- First run downloads the model weights (the default base model has about 74 M parameters); subsequent runs use the cache
- Runs 100% locally: no internet needed after model download, no API keys
- Use --model medium or --model large for better accuracy on non-English or noisy audio
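To make the .srt output concrete, here is one way Whisper-style segments (dicts with start, end and text, as returned in the segments list of a transcription result) can be rendered as SubRip subtitles. The script's own writer is not shown in this document, so treat this as an illustrative sketch; both function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SubRip style)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```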