openai-whisper

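A comprehensive audio transcription and translation toolkit using OpenAI Whisper. Use it when working with audio files (.mp3, .wav, .m4a, .flac, etc.) to: (1) transcribe speech to text in 99 languages, (2) translate non-English audio to English, (3) detect the language of audio, (4) generate subtitles (SRT, VTT), (5) extract timestamps, (6) process long recordings, (7) batch-transcribe multiple files, or (8) build applications that require automatic speech recognition (ASR). Also applies when users mention "Whisper", "transcribe audio", "speech-to-text", "audio to text", "generate subtitles", or "meeting transcription".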

SKILL.md
--- frontmatter
name: openai-whisper
description: "Comprehensive audio transcription and translation toolkit using OpenAI Whisper. Use when working with audio files (.mp3, .wav, .m4a, .flac, etc.) for: (1) Transcribing speech to text in 99 languages, (2) Translating non-English audio to English, (3) Detecting language of audio, (4) Generating subtitles (SRT, VTT), (5) Extracting timestamps, (6) Processing long recordings, (7) Batch transcription of multiple files, or (8) Building applications that require automatic speech recognition (ASR). Also use when users mention 'Whisper', 'transcribe audio', 'speech-to-text', 'audio to text', 'generate subtitles', or 'meeting transcription'."

OpenAI Whisper

Whisper is a general-purpose speech recognition model supporting transcription, translation, and language detection for 99 languages.

Quick Start

Basic Transcription

python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])

Using Scripts

bash
# Basic transcription
python scripts/transcribe.py audio.mp3

# Translation to English
python scripts/translate.py japanese_audio.mp3 --model medium

# Batch processing
python scripts/batch_transcribe.py audio1.mp3 audio2.mp3 audio3.mp3

# Language detection
python scripts/detect_language.py audio.mp3

# Fallback to lighter models on error
python scripts/fallback_transcribe.py audio.mp3 --model large

Common Tasks

Transcribe Audio

Simple transcription:

python
import whisper

model = whisper.load_model("small")
result = model.transcribe("meeting.mp3")
print(result["text"])

With language specification:

python
result = model.transcribe("meeting.mp3", language="ja")  # Japanese

Generate subtitles:

python
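import os
from whisper.utils import get_writer

result = model.transcribe("video.mp4")

# Save as SRT; the writer does not create the output directory itself
os.makedirs("./output", exist_ok=True)
writer = get_writer("srt", "./output")
writer(result, "video.mp4")  # writes ./output/video.srt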

Script approach:

bash
python scripts/transcribe.py meeting.mp3 --model small --language ja --output_format srt
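
For cases where the subtitle text needs post-processing before it is written, the SRT layout can also be produced by hand from `result["segments"]`. The following is a minimal sketch, not Whisper's own writer: the helper names `srt_timestamp` and `segments_to_srt` are hypothetical, and it assumes each segment is a dict with `start`, `end` (seconds), and `text` keys, as in Whisper's result.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments as numbered SRT cues (hypothetical helper)."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line
    return "\n".join(cues)


# Usage with a real result (not executed here):
#   srt_text = segments_to_srt(result["segments"])
```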

Translate to English

Translate non-English audio to English text:

python
model = whisper.load_model("medium")  # turbo doesn't support translation
result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"])  # English translation

Script approach:

bash
python scripts/translate.py chinese_audio.mp3 --model medium --language zh

Detect Language

python
import whisper

model = whisper.load_model("turbo")
audio = whisper.load_audio("unknown.mp3")
audio = whisper.pad_or_trim(audio)
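# turbo and large-v3 expect 128 mel bins, so take the count from the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)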

_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print(f"Detected language: {detected}")

Script approach:

bash
python scripts/detect_language.py unknown.mp3 --top_k 5
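
The `probs` mapping returned by `model.detect_language()` can be ranked to show several candidates rather than only the best one. This is a small sketch of that idea; how the bundled script implements `--top_k` is not shown in this document, and `top_languages` is a hypothetical helper name.

```python
def top_languages(probs: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k most probable (language_code, probability) pairs, best first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Usage with the probabilities from model.detect_language (not executed here):
#   for code, p in top_languages(probs, k=5):
#       print(f"{code}: {p:.1%}")
```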

Batch Processing

Process multiple files efficiently:

python
import whisper

model = whisper.load_model("turbo")  # Load once
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

for audio_file in files:
    result = model.transcribe(audio_file)
    print(f"{audio_file}: {result['text']}")

Script approach:

bash
python scripts/batch_transcribe.py *.mp3 --model turbo --output_dir ./transcripts
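
When one corrupt file should not abort a whole batch, the loop can collect failures instead of raising. The sketch below assumes a `transcribe` callable that takes a path and returns a dict with a `"text"` key (for example `model.transcribe`); the function name `transcribe_many` and its `continue_on_error` parameter are illustrative and not taken from the bundled script.

```python
from typing import Callable, Iterable


def transcribe_many(
    files: Iterable[str],
    transcribe: Callable[[str], dict],
    continue_on_error: bool = True,
) -> tuple[dict[str, str], dict[str, str]]:
    """Transcribe each file with one shared callable.

    Returns (results, failures): path -> text, and path -> error message.
    """
    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    for path in files:
        try:
            results[path] = transcribe(path)["text"]
        except Exception as exc:
            if not continue_on_error:
                raise
            failures[path] = str(exc)
    return results, failures


# Usage with a loaded model (not executed here):
#   results, failures = transcribe_many(files, model.transcribe)
```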

Handle Long Audio

For recordings longer than 30 minutes, process in chunks or use the scripts directly (they handle long audio automatically):

bash
python scripts/transcribe.py long_meeting.mp3 --model small

For custom chunking in Python, see references/advanced_usage.md.
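
As a rough illustration of what custom chunking involves (the approach in references/advanced_usage.md is not reproduced here and may differ), the sketch below computes overlapping sample ranges over an audio array. It assumes 16 kHz audio, which is what `whisper.load_audio()` returns; `chunk_bounds` and its default chunk and overlap lengths are hypothetical choices.

```python
SAMPLE_RATE = 16_000  # whisper.load_audio() resamples to 16 kHz mono


def chunk_bounds(
    n_samples: int,
    chunk_s: float = 600.0,
    overlap_s: float = 5.0,
    sample_rate: int = SAMPLE_RATE,
) -> list[tuple[int, int]]:
    """Split n_samples into (start, end) sample ranges that overlap by overlap_s."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    size = int(chunk_s * sample_rate)
    step = size - int(overlap_s * sample_rate)
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + size, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return bounds


# Usage (not executed here): transcribe each slice and shift its timestamps.
#   audio = whisper.load_audio("long_meeting.mp3")
#   for start, end in chunk_bounds(len(audio)):
#       part = model.transcribe(audio[start:end])
#       offset = start / SAMPLE_RATE  # add to each segment's start and end
```

The overlap exists so that a word cut at a chunk boundary appears whole in at least one chunk; merging the duplicated text in the overlap region is left out of this sketch.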

Word-Level Timestamps

Extract precise timing for each word:

python
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
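
Word timings are often regrouped into short caption lines. A minimal sketch of one way to do that follows; `words_to_cues` and the 42-character limit are illustrative assumptions, and joining with spaces suits space-delimited languages only (not, for example, Japanese or Chinese).

```python
def words_to_cues(words: list[dict], max_chars: int = 42) -> list[dict]:
    """Group word dicts (word, start, end) into cues of at most max_chars characters."""
    cues = []
    current = None
    for w in words:
        token = w["word"].strip()  # Whisper words usually carry a leading space
        if not token:
            continue
        if current and len(current["text"]) + 1 + len(token) <= max_chars:
            # The word still fits on the current line: extend it
            current["text"] += " " + token
            current["end"] = w["end"]
        else:
            if current:
                cues.append(current)
            current = {"start": w["start"], "end": w["end"], "text": token}
    if current:
        cues.append(current)
    return cues


# Usage with a word_timestamps=True result (not executed here):
#   all_words = [w for seg in result["segments"] for w in seg.get("words", [])]
#   cues = words_to_cues(all_words)
```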

Improve Accuracy with Context

Provide domain-specific context:

python
result = model.transcribe(
    "technical_meeting.mp3",
    initial_prompt="This is a technical meeting discussing APIs, databases, and cloud infrastructure."
)
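
Prompts work best when short, since Whisper only conditions on a limited number of prompt tokens. The helper below is a hedged sketch for assembling a prompt from a glossary: `build_prompt` is a hypothetical name, and the word budget is only a crude stand-in for a token count, not a limit taken from Whisper.

```python
def build_prompt(topic: str, terms: list[str], max_words: int = 150) -> str:
    """Combine a topic sentence with de-duplicated glossary terms under a word budget."""
    budget = max_words - len(topic.split())
    seen = set()
    kept = []
    for term in terms:
        key = term.lower()
        if key in seen:
            continue  # skip case-insensitive duplicates
        cost = len(term.split())
        if cost > budget:
            break  # stop once the budget is used up
        seen.add(key)
        kept.append(term)
        budget -= cost
    if not kept:
        return topic
    return f"{topic} Key terms: {', '.join(kept)}."


# Usage (not executed here):
#   prompt = build_prompt("Technical meeting.", ["Kubernetes", "PostgreSQL"])
#   result = model.transcribe("technical_meeting.mp3", initial_prompt=prompt)
```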

Model Selection

Choose the right model for your use case:

| Model | Use Case | Speed | Accuracy |
|-------|----------|-------|----------|
| tiny | Real-time, resource-constrained | Fastest | Lowest |
| base | Quick drafts, low-resource | Very fast | Low |
| small | Recommended default | Fast | Good |
| medium | High accuracy needed | Moderate | High |
| large | Maximum accuracy | Slow | Highest |
| turbo | Speed-critical production | Very fast | High |

Notes:

  • Use .en variants for English-only audio; they exist only for tiny, base, small, and medium (e.g., small.en)
  • turbo model does NOT support translation
  • For translation, use medium or large

See references/models.md for complete model comparison and selection guide.
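
The selection rules above can be encoded in a small helper. This is a sketch under stated assumptions: the VRAM figures are the approximate requirements published in the Whisper README, the preference order (largest model that fits) is one possible policy rather than the skill's own, and `pick_model` is a hypothetical name.

```python
# Approximate VRAM needs in GB, as listed in the Whisper README
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10}
EN_VARIANTS = {"tiny", "base", "small", "medium"}  # sizes that have a .en model
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]


def pick_model(vram_gb: float, task: str = "transcribe", english_only: bool = False) -> str:
    """Pick the most capable model that fits in vram_gb for the given task."""
    for name in PREFERENCE:
        if task == "translate" and name == "turbo":
            continue  # turbo does not support translation
        if VRAM_GB[name] > vram_gb:
            continue
        # .en models are English-only, so they apply to transcription only
        if english_only and task == "transcribe" and name in EN_VARIANTS:
            return name + ".en"
        return name
    raise ValueError(f"no model fits in {vram_gb} GB")


# Usage (not executed here):
#   model = whisper.load_model(pick_model(vram_gb=6, task="translate"))
```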

Output Formats

Whisper supports multiple output formats:

  • txt: Plain text transcript
  • vtt: WebVTT subtitles (web video)
  • srt: SubRip subtitles (most video players)
  • tsv: Tab-separated with timestamps (data analysis)
  • json: Complete metadata (APIs, archival)

python
import os
from whisper.utils import get_writer

result = model.transcribe("audio.mp3")

# Save in multiple formats (the writer does not create the output directory)
os.makedirs("./output", exist_ok=True)
for fmt in ["txt", "srt", "json"]:
    writer = get_writer(fmt, "./output")
    writer(result, "audio.mp3")

See references/output_formats.md for detailed format specifications.

Error Handling and Fallback

Use the fallback script for automatic retry with lighter models:

bash
# Tries large → turbo → medium → small → base → tiny
python scripts/fallback_transcribe.py audio.mp3 --model large

This automatically falls back to lighter models if memory errors occur.
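
The same idea can be expressed in Python. The following sketch is not the bundled script's code: `transcribe_with_fallback` is a hypothetical function, and it assumes out-of-memory conditions surface as `MemoryError` or `RuntimeError` (PyTorch's CUDA out-of-memory error is a `RuntimeError` subclass). The loader is passed in so that `whisper.load_model` can be supplied in real use.

```python
from typing import Callable

FALLBACK_ORDER = ["large", "turbo", "medium", "small", "base", "tiny"]


def transcribe_with_fallback(
    path: str,
    load_model: Callable,
    start: str = "large",
    order: list[str] = FALLBACK_ORDER,
) -> tuple[str, dict]:
    """Try models from `start` onward; return (model_name, result) for the first success."""
    errors = []
    for name in order[order.index(start):]:
        try:
            model = load_model(name)
            return name, model.transcribe(path)
        except (MemoryError, RuntimeError) as exc:
            errors.append(f"{name}: {exc}")  # remember why this model failed
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Usage (not executed here):
#   import whisper
#   name, result = transcribe_with_fallback("audio.mp3", whisper.load_model)
```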

Language Support

Whisper supports 99 languages including:

High-accuracy languages: English, Spanish, French, German, Italian, Portuguese, Russian, Polish, Dutch, Turkish, Japanese, Korean, Chinese

All supported languages: See references/languages.md for complete list and language codes.

Advanced Features

For advanced usage, see references/advanced_usage.md:

  • Temperature fallback strategies
  • Initial prompts and context
  • Word-level timestamps
  • Hallucination detection
  • Quality metrics and filtering
  • Chunk processing for long audio
  • Device management (GPU/CPU)
  • Beam search and sampling
  • VAD (Voice Activity Detection)

API Reference

For complete API documentation, see references/api_reference.md:

  • whisper.load_model() - Load Whisper models
  • model.transcribe() - Transcribe or translate audio
  • model.detect_language() - Detect audio language
  • whisper.load_audio() - Load audio files
  • whisper.utils.get_writer() - Write output formats
  • All parameters and return values

Bundled Scripts

This skill includes production-ready scripts in scripts/:

  • transcribe.py: Basic transcription with all options
  • translate.py: Translation to English
  • batch_transcribe.py: Process multiple files
  • detect_language.py: Language detection
  • fallback_transcribe.py: Auto-fallback on errors

All scripts support --help for full usage information:

bash
python scripts/transcribe.py --help

Installation

Whisper requires installation:

bash
pip install -U openai-whisper

# Also requires ffmpeg
# Ubuntu/Debian:
sudo apt install ffmpeg

# macOS:
brew install ffmpeg

# Windows:
# Download from https://ffmpeg.org/download.html
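
A missing ffmpeg binary is a common cause of failures when loading audio, so it can help to check for it before loading a model. A minimal sketch using only the standard library (`ffmpeg_available` is a hypothetical helper name):

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


# Usage (not executed here):
#   if not ffmpeg_available():
#       raise SystemExit("ffmpeg not found on PATH; install it before transcribing")
```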

Best Practices

  1. Start with small - Good balance of speed and accuracy
  2. Specify language - Better accuracy than auto-detection
  3. Use appropriate model - Don't use large when small suffices
  4. Provide context - Use initial_prompt for domain-specific content
  5. Enable word timestamps - Only when needed (adds processing time)
  6. Batch processing - Reuse loaded model for multiple files
  7. Monitor quality - Check avg_logprob and compression_ratio in JSON output
  8. Translation limitation - Use medium or large (not turbo) for translation
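
For practice 7, the per-segment fields in Whisper's result can drive a simple filter. The sketch below uses thresholds that mirror the defaults of `transcribe()` (`logprob_threshold=-1.0`, `compression_ratio_threshold=2.4`, `no_speech_threshold=0.6`); the helper names are hypothetical, and suitable cutoffs depend on the audio.

```python
def keep_segment(
    seg: dict,
    min_logprob: float = -1.0,
    max_compression: float = 2.4,
    max_no_speech: float = 0.6,
) -> bool:
    """Return True if a segment passes all three quality checks."""
    return (
        seg["avg_logprob"] >= min_logprob  # low values suggest uncertain decoding
        and seg["compression_ratio"] <= max_compression  # high values suggest repetition
        and seg["no_speech_prob"] <= max_no_speech  # high values suggest silence or noise
    )


def filter_segments(segments: list[dict], **thresholds) -> list[dict]:
    """Keep only the segments that pass keep_segment with the given thresholds."""
    return [seg for seg in segments if keep_segment(seg, **thresholds)]


# Usage (not executed here):
#   good = filter_segments(result["segments"], min_logprob=-0.5)
```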

Example Workflows

Meeting Transcription

python
import os
import whisper
from whisper.utils import get_writer

# Load model
model = whisper.load_model("small")

# Transcribe with context
result = model.transcribe(
    "meeting.mp3",
    language="en",
    initial_prompt="Business meeting with Alice, Bob, and Charlie discussing Q4 goals."
)

# Save as text and subtitles (create the output directory first)
os.makedirs("./output", exist_ok=True)
for fmt in ["txt", "srt"]:
    writer = get_writer(fmt, "./output")
    writer(result, "meeting.mp3")

print(f"Transcription complete: {len(result['text'])} characters")

Video Subtitle Generation

bash
# Generate SRT subtitles with word-level timing
python scripts/transcribe.py video.mp4 \
    --model turbo \
    --output_format srt \
    --word_timestamps \
    --output_dir ./subtitles

Multilingual Batch Processing

bash
# Process all MP3 files with auto-detection
python scripts/batch_transcribe.py *.mp3 \
    --model small \
    --output_format all \
    --continue_on_error

High-Accuracy Professional Transcription

python
import whisper

model = whisper.load_model("large")

result = model.transcribe(
    "interview.mp3",
    language="en",
    # Greedy decoding first; 0.2 is tried only when a segment fails the quality thresholds
    temperature=(0.0, 0.2),
    word_timestamps=True,
    initial_prompt="Professional interview discussing career and achievements."
)

# Filter high-confidence segments only
high_conf = [s for s in result["segments"] if s["avg_logprob"] > -0.3]