OpenAI Whisper

Whisper is a general-purpose speech recognition model that handles transcription, translation into English, and language detection across 99 languages.

Quick Start

Basic Transcription

python

import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])

Using Scripts

bash

# Basic transcription
python scripts/transcribe.py audio.mp3

# Translation to English
python scripts/translate.py japanese_audio.mp3 --model medium

# Batch processing
python scripts/batch_transcribe.py audio1.mp3 audio2.mp3 audio3.mp3

# Language detection
python scripts/detect_language.py audio.mp3

# Fallback to lighter models on error
python scripts/fallback_transcribe.py audio.mp3 --model large

Common Tasks

Transcribe Audio

Simple transcription:

python

import whisper

model = whisper.load_model("small")
result = model.transcribe("meeting.mp3")
print(result["text"])

With language specification:

python

result = model.transcribe("meeting.mp3", language="ja")  # Japanese

Generate subtitles:

python

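import os
from whisper.utils import get_writer

result = model.transcribe("video.mp4")

# get_writer does not create the output directory, so make sure it exists
os.makedirs("./output", exist_ok=True)

# Save as SRT (written to ./output/video.srt)
writer = get_writer("srt", "./output")
writer(result, "video.mp4")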

Script approach:

bash

python scripts/transcribe.py meeting.mp3 --model small --language ja --output_format srt

Translate to English

Translate non-English audio to English text:

python

model = whisper.load_model("medium")  # turbo doesn't support translation
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"])  # English translation

Script approach:

bash

python scripts/translate.py chinese_audio.mp3 --model medium --language zh

Detect Language

python

import whisper

model = whisper.load_model("turbo")
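audio = whisper.load_audio("unknown.mp3")
audio = whisper.pad_or_trim(audio)  # language detection uses a single 30-second window
# Pass n_mels from the model: turbo and large-v3 expect 128 mel bins, not the default 80
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)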

_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print(f"Detected language: {detected}")

Script approach:

bash

python scripts/detect_language.py unknown.mp3 --top_k 5
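
The `--top_k` idea can also be reproduced in Python from the `probs` dictionary returned by `model.detect_language()`. The following is a minimal sketch; `top_languages` is a hypothetical helper, not part of Whisper, and the probabilities in the demo are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_languages(probs: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most probable (language_code, probability) pairs, highest first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Illustrative values only; real ones come from model.detect_language(mel).
example_probs = {"en": 0.91, "de": 0.04, "nl": 0.03, "fr": 0.02}

for code, prob in top_languages(example_probs, k=3):
    print(f"{code}: {prob:.1%}")
```

A low top probability, or two candidates close together, is a hint that the language should be set explicitly with `language=` instead of trusting auto-detection.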

Batch Processing

Process multiple files efficiently:

python

import whisper

model = whisper.load_model("turbo")  # Load once
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

for audio_file in files:
    result = model.transcribe(audio_file)
    print(f"{audio_file}: {result['text']}")

Script approach:

bash

python scripts/batch_transcribe.py *.mp3 --model turbo --output_dir ./transcripts
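
In a longer batch, one unreadable file should not abort the whole run. The sketch below shows one way to collect failures and keep going; `transcribe_many` is a hypothetical helper, and the transcription callable is passed in so the loop can be tried without loading a model.

```python
from typing import Callable, Dict, Iterable, Tuple


def transcribe_many(
    files: Iterable[str],
    transcribe: Callable[[str], dict],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `transcribe` on each file and return (texts, errors) keyed by file name."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in files:
        try:
            texts[path] = transcribe(path)["text"].strip()
        except Exception as exc:
            # Record the failure and move on to the next file.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

With a loaded model, the call would look like `texts, errors = transcribe_many(files, model.transcribe)`, which reuses the single model instance for every file.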

Handle Long Audio

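model.transcribe() reads the whole file into memory and then steps through it with a sliding 30-second window, so long files work without manual splitting. For recordings longer than 30 minutes, process in chunks or use the scripts directly (they handle long audio automatically):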

bash

python scripts/transcribe.py long_meeting.mp3 --model small

For custom chunking in Python, see references/advanced_usage.md.
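
As a rough illustration of what custom chunking involves, the sketch below computes overlapping sample ranges for a long recording. It is not the approach used by the bundled scripts or by references/advanced_usage.md; `chunk_spans` and its default chunk and overlap lengths are assumptions chosen for the example. The 16 kHz rate matches the audio array returned by `whisper.load_audio()`.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # whisper.load_audio() returns 16 kHz mono samples


def chunk_spans(
    total_samples: int,
    chunk_seconds: float = 600.0,
    overlap_seconds: float = 5.0,
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover the audio with overlapping chunks."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")
    spans = []
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        spans.append((start, end))
        if end == total_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return spans
```

Each span can be passed to the model as `model.transcribe(audio[start:end])`. Segment timestamps are then relative to the chunk, so add `start / SAMPLE_RATE` to recover positions in the full recording, and expect some duplicated text inside the overlap that has to be merged.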

Word-Level Timestamps

Extract precise timing for each word:

python

result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
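
Word timings are often regrouped into short caption cues. The following sketch assumes the result layout shown above (each segment carries a `words` list with `word`, `start`, and `end`) and that each word string carries its own leading space where the language uses one; `words_to_cues` is a hypothetical helper.

```python
from typing import Dict, List


def words_to_cues(segments: List[dict], max_words: int = 7) -> List[Dict[str, object]]:
    """Group word-level timestamps into cues of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues = []
    for segment in segments:
        words = segment.get("words", [])  # absent when word_timestamps was not enabled
        for i in range(0, len(words), max_words):
            group = words[i : i + max_words]
            cues.append(
                {
                    "start": group[0]["start"],
                    "end": group[-1]["end"],
                    # Words keep their own spacing, so join without a separator.
                    "text": "".join(w["word"] for w in group).strip(),
                }
            )
    return cues
```

Cues never span two segments here, which keeps sentence boundaries chosen by the model intact at the cost of some short trailing cues.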

Improve Accuracy with Context

Provide domain-specific context:

python

result = model.transcribe(
    "technical_meeting.mp3",
    initial_prompt="This is a technical meeting discussing APIs, databases, and cloud infrastructure."
)
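
When the context comes from a glossary rather than a hand-written sentence, the prompt can be assembled programmatically. This is a sketch under assumptions: `build_initial_prompt` is a hypothetical helper, and the 600-character cap is an arbitrary choice, not a Whisper limit. Whisper does condition on only a limited window of prompt tokens, so keeping prompts short is the safer habit.

```python
from typing import Iterable


def build_initial_prompt(topic: str, terms: Iterable[str], max_chars: int = 600) -> str:
    """Compose a short prompt naming the topic and listing distinct key terms."""
    prompt = topic.strip().rstrip(".")
    seen = set()
    kept = []
    for term in terms:
        term = term.strip()
        if not term or term.lower() in seen:
            continue  # skip blanks and case-insensitive duplicates
        candidate = ", ".join(kept + [term])
        if len(f"{prompt}. Terms: {candidate}.") > max_chars:
            break  # stop before the prompt exceeds the cap
        kept.append(term)
        seen.add(term.lower())
    if not kept:
        return f"{prompt}."
    return f"{prompt}. Terms: {', '.join(kept)}."
```

The returned string is passed as `initial_prompt=` to `model.transcribe()`. Spelling product names and acronyms exactly as they should appear in the transcript is the main point of the term list.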

Model Selection

Choose the right model for your use case:

Model	Use Case	Speed	Accuracy
`tiny`	Real-time, resource-constrained	Fastest	Lowest
`base`	Quick drafts, low-resource	Very fast	Low
`small`	Recommended default	Fast	Good
`medium`	High accuracy needed	Moderate	High
`large`	Maximum accuracy	Slow	Highest
`turbo`	Speed-critical production	Very fast	High

Notes:

•Use .en variants (e.g., small.en) for English-only audio; they exist for tiny, base, small, and medium only
•turbo model does NOT support translation
•For translation, use medium or large

See references/models.md for complete model comparison and selection guide.
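
The notes above can be encoded as a small selection function. This sketch covers only the rules stated in this section (no turbo for translation, medium or large for translation, .en variants for English-only audio); `pick_model` is a hypothetical helper and does not replace the fuller guidance in references/models.md.

```python
# Sizes that ship with an English-only ".en" variant.
EN_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model(task: str = "transcribe", english_only: bool = False, size: str = "small") -> str:
    """Return a model name that respects the translation and .en rules above."""
    if task not in {"transcribe", "translate"}:
        raise ValueError(f"unknown task: {task!r}")
    if task == "translate":
        # Translation needs a multilingual medium or large model; turbo is excluded.
        return size if size in {"medium", "large"} else "medium"
    if english_only and size in EN_ONLY_SIZES:
        return f"{size}.en"
    return size
```

For example, `pick_model("translate", size="turbo")` yields `"medium"`, and `pick_model(english_only=True)` yields `"small.en"`.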

Output Formats

Whisper supports multiple output formats:

•txt: Plain text transcript
•vtt: WebVTT subtitles (web video)
•srt: SubRip subtitles (most video players)
•tsv: Tab-separated with timestamps (data analysis)
•json: Complete metadata (APIs, archival)

python

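import os
from whisper.utils import get_writer

result = model.transcribe("audio.mp3")

# get_writer does not create the output directory, so make sure it exists
os.makedirs("./output", exist_ok=True)

# Save in multiple formats
for fmt in ["txt", "srt", "json"]:
    writer = get_writer(fmt, "./output")
    writer(result, "audio.mp3")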

See references/output_formats.md for detailed format specifications.
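
To make the SRT format concrete, the sketch below renders segments by hand instead of going through `get_writer`. It assumes each segment has `start`, `end`, and `text` keys, as in the `result["segments"]` entries used elsewhere in this document; `srt_timestamp` and `segments_to_srt` are hypothetical helpers, and the built-in writer remains the simpler choice for normal use.

```python
from typing import List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as a SubRip timestamp: HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

A hand-rolled renderer like this is mainly useful when segments have been filtered or regrouped after transcription and no longer match the original result dictionary.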

Error Handling and Fallback

Use the fallback script for automatic retry with lighter models:

bash

# Tries large → turbo → medium → small → base → tiny
python scripts/fallback_transcribe.py audio.mp3 --model large

This automatically falls back to lighter models if memory errors occur.
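
The same idea can be sketched in Python. This is not the code of fallback_transcribe.py, only an illustration of the retry order described above; `transcribe_with_fallback` is a hypothetical helper, and the model-running callable is injected so the logic can be exercised without a GPU.

```python
from typing import Callable, Optional, Sequence, Tuple

# Order taken from the fallback script's description above.
FALLBACK_ORDER = ("large", "turbo", "medium", "small", "base", "tiny")


def transcribe_with_fallback(
    audio_path: str,
    run: Callable[[str, str], dict],
    start: str = "large",
    order: Sequence[str] = FALLBACK_ORDER,
) -> Tuple[str, dict]:
    """Call run(model_name, audio_path) from `start` downwards until one succeeds."""
    if start not in order:
        raise ValueError(f"unknown model: {start!r}")
    last_error: Optional[BaseException] = None
    for name in order[order.index(start):]:
        try:
            return name, run(name, audio_path)
        except (MemoryError, RuntimeError) as exc:
            # PyTorch reports CUDA out-of-memory as a RuntimeError subclass.
            last_error = exc
    raise RuntimeError(f"all models failed for {audio_path}") from last_error
```

With Whisper installed, `run` could be `lambda name, path: whisper.load_model(name).transcribe(path)`. Catching `RuntimeError` broadly also retries on failures unrelated to memory, so a stricter version would inspect the exception before falling back.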

Language Support

Whisper supports 99 languages including:

High-accuracy languages: English, Spanish, French, German, Italian, Portuguese, Russian, Polish, Dutch, Turkish, Japanese, Korean, Chinese

All supported languages: See references/languages.md for complete list and language codes.

Advanced Features

For advanced usage, see references/advanced_usage.md:

•Temperature fallback strategies
•Initial prompts and context
•Word-level timestamps
•Hallucination detection
•Quality metrics and filtering
•Chunk processing for long audio
•Device management (GPU/CPU)
•Beam search and sampling
•VAD (Voice Activity Detection)

API Reference

For complete API documentation, see references/api_reference.md:

•whisper.load_model() - Load Whisper models
•model.transcribe() - Transcribe or translate audio
•model.detect_language() - Detect audio language
•whisper.load_audio() - Load audio files
•whisper.utils.get_writer() - Write output formats
•All parameters and return values

Bundled Scripts

This skill includes production-ready scripts in scripts/:

•transcribe.py: Basic transcription with all options
•translate.py: Translation to English
•batch_transcribe.py: Process multiple files
•detect_language.py: Language detection
•fallback_transcribe.py: Auto-fallback on errors

All scripts support --help for full usage information:

bash

python scripts/transcribe.py --help

Installation

Install Whisper from PyPI; it also needs the ffmpeg command-line tool to decode audio:

bash

pip install -U openai-whisper

# Also requires ffmpeg
# Ubuntu/Debian:
sudo apt install ffmpeg

# macOS:
brew install ffmpeg

# Windows:
# Download from https://ffmpeg.org/download.html
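
A quick way to confirm both prerequisites before running anything is to check for the ffmpeg executable and the `whisper` module from Python. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether the ffmpeg executable and the whisper module can be found."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

The check only confirms that something named `ffmpeg` is on the PATH and that a module named `whisper` is importable; it does not verify versions.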

Best Practices

•Start with small - Good balance of speed and accuracy
•Specify language - Better accuracy than auto-detection, which examines only the first 30 seconds of audio
•Use appropriate model - Don't use large when small suffices
•Provide context - Use initial_prompt for domain-specific content
•Enable word timestamps - Only when needed (adds processing time)
•Batch processing - Reuse loaded model for multiple files
•Monitor quality - Check avg_logprob and compression_ratio in JSON output
•Translation limitation - Use medium or large (not turbo) for translation
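
The quality-monitoring advice above can be sketched as a small check over `result["segments"]`. The default thresholds mirror the defaults of `model.transcribe()` (`logprob_threshold=-1.0`, `compression_ratio_threshold=2.4`); `flag_suspect_segments` is a hypothetical helper, and suitable cut-offs depend on the audio.

```python
from typing import Dict, List


def flag_suspect_segments(
    segments: List[dict],
    min_avg_logprob: float = -1.0,
    max_compression_ratio: float = 2.4,
) -> List[Dict[str, object]]:
    """Return the id and reasons for segments that look unreliable."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["avg_logprob"] < min_avg_logprob:
            reasons.append("low avg_logprob")  # the model was unsure of its tokens
        if seg["compression_ratio"] > max_compression_ratio:
            reasons.append("high compression_ratio")  # text is highly repetitive
        if reasons:
            flagged.append({"id": seg.get("id"), "reasons": reasons})
    return flagged
```

Flagged segments are candidates for manual review or for a second pass with a larger model, not proof of an error.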

Example Workflows

Meeting Transcription

python

import os

import whisper
from whisper.utils import get_writer

# Load model
model = whisper.load_model("small")

# Transcribe with context
result = model.transcribe(
    "meeting.mp3",
    language="en",
    initial_prompt="Business meeting with Alice, Bob, and Charlie discussing Q4 goals."
)

# Save as text and subtitles (get_writer does not create the directory)
os.makedirs("./output", exist_ok=True)
for fmt in ["txt", "srt"]:
    writer = get_writer(fmt, "./output")
    writer(result, "meeting.mp3")

print(f"Transcription complete: {len(result['text'])} characters")

Video Subtitle Generation

bash

# Generate SRT subtitles with word-level timing
python scripts/transcribe.py video.mp4 \
    --model turbo \
    --output_format srt \
    --word_timestamps \
    --output_dir ./subtitles

Multilingual Batch Processing

bash

# Process all MP3 files with auto-detection
python scripts/batch_transcribe.py *.mp3 \
    --model small \
    --output_format all \
    --continue_on_error

High-Accuracy Professional Transcription

python

import whisper

model = whisper.load_model("large")

result = model.transcribe(
    "interview.mp3",
    language="en",
    temperature=(0.0, 0.2),  # Greedy decoding first; retries at 0.2 only if quality checks fail
    word_timestamps=True,
    initial_prompt="Professional interview discussing career and achievements."
)

# Filter high-confidence segments only
high_conf = [s for s in result["segments"] if s["avg_logprob"] > -0.3]