OpenAI Whisper
Whisper is a general-purpose speech recognition model supporting transcription, translation, and language detection for 99 languages.
Quick Start
Basic Transcription
import whisper
model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
Using Scripts
# Basic transcription
python scripts/transcribe.py audio.mp3
# Translation to English
python scripts/translate.py japanese_audio.mp3 --model medium
# Batch processing
python scripts/batch_transcribe.py audio1.mp3 audio2.mp3 audio3.mp3
# Language detection
python scripts/detect_language.py audio.mp3
# Fallback to lighter models on error
python scripts/fallback_transcribe.py audio.mp3 --model large
Common Tasks
Transcribe Audio
Simple transcription:
import whisper
model = whisper.load_model("small")
result = model.transcribe("meeting.mp3")
print(result["text"])
With language specification:
result = model.transcribe("meeting.mp3", language="ja") # Japanese
Generate subtitles:
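import os
from whisper.utils import get_writer
result = model.transcribe("video.mp4")
# Save as SRT; the writer does not create the output directory itself
os.makedirs("./output", exist_ok=True)
writer = get_writer("srt", "./output")
writer(result, "video.mp4")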
Script approach:
python scripts/transcribe.py meeting.mp3 --model small --language ja --output_format srt
Translate to English
Translate non-English audio to English text:
model = whisper.load_model("medium") # turbo doesn't support translation
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"]) # English translation
Script approach:
python scripts/translate.py chinese_audio.mp3 --model medium --language zh
Detect Language
import whisper
model = whisper.load_model("turbo")
audio = whisper.load_audio("unknown.mp3")
audio = whisper.pad_or_trim(audio)
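# n_mels must match the model: turbo and large-v3 use 128 mel bins, not the default 80
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)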
_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print(f"Detected language: {detected}")
Script approach:
python scripts/detect_language.py unknown.mp3 --top_k 5
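The `probs` value returned by `model.detect_language()` for a single spectrogram is a mapping from language code to probability, so listing several candidates is a plain sort. The sketch below assumes that shape; `top_languages` is a hypothetical helper, not part of Whisper, and the bundled script's `--top_k` option may be implemented differently.

```python
def top_languages(probs, k=5):
    """Return the k most probable (language_code, probability) pairs."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Illustrative values only, not real model output.
sample_probs = {"en": 0.05, "ja": 0.80, "zh": 0.10, "ko": 0.05}
for code, probability in top_languages(sample_probs, k=3):
    print(f"{code}: {probability:.2f}")
```

With a real model, pass the `probs` dictionary from the snippet above in place of `sample_probs`.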
Batch Processing
Process multiple files efficiently:
import whisper
model = whisper.load_model("turbo") # Load once
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
for audio_file in files:
    result = model.transcribe(audio_file)
    print(f"{audio_file}: {result['text']}")
Script approach:
python scripts/batch_transcribe.py *.mp3 --model turbo --output_dir ./transcripts
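When a batch should keep going after a bad file and write one transcript per input, the loop can be wrapped as below. This is a minimal sketch, not the implementation of `scripts/batch_transcribe.py`: `transcribe_many` is a hypothetical helper, and the `transcribe` argument is assumed to be any callable that takes a path and returns a dict with a `"text"` key (for example `model.transcribe` from an already loaded model).

```python
from pathlib import Path


def transcribe_many(paths, transcribe, output_dir):
    """Write <stem>.txt for each input; return {path: error message} for failures."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = {}
    for path in paths:
        try:
            result = transcribe(str(path))
        except Exception as exc:  # keep going; report the failure at the end
            failures[str(path)] = str(exc)
            continue
        target = out / (Path(path).stem + ".txt")
        target.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return failures
```

Passing `model.transcribe` keeps the "load once" property of the loop above, since the model is created outside the helper.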
Handle Long Audio
For recordings longer than 30 minutes, process in chunks or use the scripts directly (they handle long audio automatically):
python scripts/transcribe.py long_meeting.mp3 --model small
For custom chunking in Python, see references/advanced_usage.md.
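As a rough idea of what custom chunking involves, the sketch below computes overlapping chunk boundaries in samples. It assumes audio loaded with `whisper.load_audio()`, which returns a 16 kHz mono array; `chunk_bounds` and its default chunk and overlap lengths are hypothetical choices for illustration, not what the bundled scripts or references/advanced_usage.md necessarily use.

```python
SAMPLE_RATE = 16_000  # whisper.load_audio() resamples to 16 kHz mono


def chunk_bounds(n_samples, chunk_seconds=600, overlap_seconds=5, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample offsets of overlapping chunks covering n_samples."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    size = chunk_seconds * sample_rate
    step = (chunk_seconds - overlap_seconds) * sample_rate
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + size, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # last chunk reached the end of the audio
        start += step
    return bounds
```

Each slice `audio[start:end]` can then be passed to `model.transcribe()`, which accepts an array as well as a file path. Add `start / SAMPLE_RATE` to the segment timestamps of each chunk, and expect text in the overlapping regions to appear twice unless it is deduplicated.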
Word-Level Timestamps
Extract precise timing for each word:
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
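Word timings are often regrouped into short caption cues. The sketch below assumes each word is a dict with `word`, `start`, and `end` keys, as in the loop above, and that `word` carries its own leading space. `group_words` and its thresholds are hypothetical and only illustrate one possible grouping rule.

```python
def group_words(words, max_words=7, max_gap=0.8):
    """Group word dicts into cues, splitting on long pauses or when a cue is full."""
    cues, current = [], []
    for word in words:
        too_long = len(current) >= max_words
        paused = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or paused):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)
    return [
        {
            "start": cue[0]["start"],
            "end": cue[-1]["end"],
            "text": "".join(w["word"] for w in cue).strip(),
        }
        for cue in cues
    ]
```

To use it on a transcription, call it once per segment with `segment.get("words", [])`.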
Improve Accuracy with Context
Provide domain-specific context:
result = model.transcribe(
    "technical_meeting.mp3",
    initial_prompt="This is a technical meeting discussing APIs, databases, and cloud infrastructure."
)
Model Selection
Choose the right model for your use case:
| Model | Use Case | Speed | Accuracy |
|---|---|---|---|
| tiny | Real-time, resource-constrained | Fastest | Lowest |
| base | Quick drafts, low-resource | Very fast | Low |
| small | Recommended default | Fast | Good |
| medium | High accuracy needed | Moderate | High |
| large | Maximum accuracy | Slow | Highest |
| turbo | Speed-critical production | Very fast | High |
Notes:
- Use `.en` variants (e.g., `small.en`) for English-only audio
- `turbo` model does NOT support translation
- For translation, use `medium` or `large`
See references/models.md for complete model comparison and selection guide.
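The notes above can be folded into a small helper that turns a requested size and task into a model name. This is a sketch under stated assumptions: `resolve_model` is hypothetical, it swaps `turbo` for `medium` when translating (one of the two options the notes allow), and it applies `.en` only to the sizes that ship an English-only variant (`tiny`, `base`, `small`, `medium`).

```python
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}  # sizes with a ".en" variant
NO_TRANSLATION = {"turbo"}  # per the notes above


def resolve_model(size="small", task="transcribe", english_only=False):
    """Pick a model name that is valid for the task and audio language."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    if task == "translate":
        # Translation needs a multilingual model, so ".en" variants are never used here.
        return "medium" if size in NO_TRANSLATION else size
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size
```

The returned name can be passed straight to `whisper.load_model()`.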
Output Formats
Whisper supports multiple output formats:
- txt: Plain text transcript
- vtt: WebVTT subtitles (web video)
- srt: SubRip subtitles (most video players)
- tsv: Tab-separated with timestamps (data analysis)
- json: Complete metadata (APIs, archival)
import os
from whisper.utils import get_writer
result = model.transcribe("audio.mp3")
# Save in multiple formats; the writers do not create the output directory
os.makedirs("./output", exist_ok=True)
for fmt in ["txt", "srt", "json"]:
    writer = get_writer(fmt, "./output")
    writer(result, "audio.mp3")
See references/output_formats.md for detailed format specifications.
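To make the SRT layout concrete, the sketch below renders segments by hand: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the text, and a blank line. It assumes segments shaped like `result["segments"]` (dicts with `start`, `end`, and `text`). `srt_timestamp` and `to_srt` are hypothetical helpers for illustration; `get_writer("srt", ...)` remains the supported way to produce files.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT text, with cues separated by a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```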
Error Handling and Fallback
Use the fallback script for automatic retry with lighter models:
# Tries large → turbo → medium → small → base → tiny
python scripts/fallback_transcribe.py audio.mp3 --model large
This automatically falls back to lighter models if memory errors occur.
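The same idea can be sketched in Python. The helper below is an assumption about how such a fallback could look, not the code of `scripts/fallback_transcribe.py`: `transcribe_with_fallback` is hypothetical, and the loader is passed in as an argument (in practice `whisper.load_model`) so the retry logic stays independent of the library.

```python
FALLBACK_ORDER = ["large", "turbo", "medium", "small", "base", "tiny"]


def transcribe_with_fallback(audio_path, load_model, start="large", order=FALLBACK_ORDER):
    """Try models from `start` onward; return (model_name, result) for the first success."""
    last_error = None
    for name in order[order.index(start):]:
        try:
            model = load_model(name)
            return name, model.transcribe(audio_path)
        except (MemoryError, RuntimeError) as exc:
            # PyTorch reports CUDA out-of-memory as a RuntimeError (or a subclass of it).
            last_error = exc
    raise RuntimeError(f"all models failed for {audio_path}") from last_error
```

A call such as `transcribe_with_fallback("audio.mp3", whisper.load_model)` would return the name of the model that finally succeeded along with its result, which is worth logging because accuracy drops with each step down.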
Language Support
Whisper supports 99 languages including:
High-accuracy languages: English, Spanish, French, German, Italian, Portuguese, Russian, Polish, Dutch, Turkish, Japanese, Korean, Chinese
All supported languages: See references/languages.md for complete list and language codes.
Advanced Features
For advanced usage, see references/advanced_usage.md:
- Temperature fallback strategies
- Initial prompts and context
- Word-level timestamps
- Hallucination detection
- Quality metrics and filtering
- Chunk processing for long audio
- Device management (GPU/CPU)
- Beam search and sampling
- VAD (Voice Activity Detection)
API Reference
For complete API documentation, see references/api_reference.md:
- `whisper.load_model()` - Load Whisper models
- `model.transcribe()` - Transcribe or translate audio
- `model.detect_language()` - Detect audio language
- `whisper.load_audio()` - Load audio files
- `whisper.utils.get_writer()` - Write output formats
- All parameters and return values
Bundled Scripts
This skill includes production-ready scripts in scripts/:
- transcribe.py: Basic transcription with all options
- translate.py: Translation to English
- batch_transcribe.py: Process multiple files
- detect_language.py: Language detection
- fallback_transcribe.py: Auto-fallback on errors
All scripts support --help for full usage information:
python scripts/transcribe.py --help
Installation
Whisper requires installation:
pip install -U openai-whisper
# Also requires ffmpeg
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows:
# Download from https://ffmpeg.org/download.html
Best Practices
- Start with `small` - Good balance of speed and accuracy
- Specify language - Better accuracy than auto-detection
- Use appropriate model - Don't use `large` when `small` suffices
- Provide context - Use `initial_prompt` for domain-specific content
- Enable word timestamps - Only when needed (adds processing time)
- Batch processing - Reuse loaded model for multiple files
- Monitor quality - Check `avg_logprob` and `compression_ratio` in JSON output
- Translation limitation - Use `medium` or `large` (not `turbo`) for translation
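For the quality check, each entry in `result["segments"]` carries `avg_logprob` and `compression_ratio`. The sketch below separates segments that look unreliable. `flag_segments` is a hypothetical helper; its default thresholds mirror the defaults of `transcribe()` (`logprob_threshold=-1.0`, `compression_ratio_threshold=2.4`) and are a starting point, not a recommendation for every recording.

```python
def flag_segments(segments, min_avg_logprob=-1.0, max_compression_ratio=2.4):
    """Split segments into (kept, suspect) using confidence and repetition checks."""
    kept, suspect = [], []
    for seg in segments:
        low_confidence = seg["avg_logprob"] < min_avg_logprob
        repetitive = seg["compression_ratio"] > max_compression_ratio  # often looping text
        if low_confidence or repetitive:
            suspect.append(seg)
        else:
            kept.append(seg)
    return kept, suspect
```

Suspect segments are candidates for manual review or for a second pass with a larger model rather than for silent deletion.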
Example Workflows
Meeting Transcription
import os
import whisper
from whisper.utils import get_writer
# Load model
model = whisper.load_model("small")
# Transcribe with context
result = model.transcribe(
    "meeting.mp3",
    language="en",
    initial_prompt="Business meeting with Alice, Bob, and Charlie discussing Q4 goals."
)
# Save as text and subtitles; the writers do not create the output directory
os.makedirs("./output", exist_ok=True)
for fmt in ["txt", "srt"]:
    writer = get_writer(fmt, "./output")
    writer(result, "meeting.mp3")
print(f"Transcription complete: {len(result['text'])} characters")
Video Subtitle Generation
# Generate SRT subtitles with word-level timing
python scripts/transcribe.py video.mp4 \
--model turbo \
--output_format srt \
--word_timestamps \
--output_dir ./subtitles
Multilingual Batch Processing
# Process all MP3 files with auto-detection
python scripts/batch_transcribe.py *.mp3 \
--model small \
--output_format all \
--continue_on_error
High-Accuracy Professional Transcription
import whisper
model = whisper.load_model("large")
result = model.transcribe(
    "interview.mp3",
    language="en",
    temperature=(0.0, 0.2),  # Low temperature for consistency
    word_timestamps=True,
    initial_prompt="Professional interview discussing career and achievements."
)
# Filter high-confidence segments only
high_conf = [s for s in result["segments"] if s["avg_logprob"] > -0.3]