AgentSkillsCN

whisperx

通过WhisperX实现语音转文字,支持词级时间戳、说话人区分以及强制对齐功能。基于fast-whisper构建,采用批处理推理,可达到70倍实时速度。

SKILL.md
--- frontmatter
name: whisperx
description: Speech-to-text with word-level timestamps, speaker diarization, and forced alignment using WhisperX. Built on faster-whisper with batched inference for 70x realtime speed.
version: 1.0.0
author: Sarah Mak
tags: ["audio", "transcription", "whisperx", "speech-to-text", "diarization", "alignment", "subtitles", "ml", "cuda", "gpu"]
homepage: https://github.com/ThePlasmak/whisperx
platforms: ["linux", "macos", "wsl2"]
metadata: {"openclaw":{"emoji":"🎙️","requires":{"bins":["ffmpeg","python3"]}}}

WhisperX

Speech-to-text with word-level timestamps, speaker diarization, and forced alignment — built on faster-whisper with batched inference for up to 70x realtime transcription speed.

WhisperX extends Whisper with three key capabilities that faster-whisper alone doesn't provide:

  1. Forced alignment — precise word-level timestamps via phoneme ASR models (wav2vec2)
  2. Speaker diarization — label who said what (via pyannote.audio)
  3. Batched inference — process audio in parallel chunks for massive speedup

When to Use

Use this skill when you need to:

  • Transcribe with word-level timing — subtitles, captions, karaoke-style highlighting
  • Identify speakers — meetings, interviews, podcasts, multi-speaker recordings
  • Generate subtitle files — SRT, VTT with accurate word timestamps
  • Translate speech to English — from any supported language
  • Batch transcribe — efficient processing of multiple files
  • Transcribe a section — extract just part of a long recording

Trigger phrases: "transcribe with speakers", "who said what", "diarize", "make subtitles", "word timestamps", "speaker identification", "meeting transcript", "karaoke subtitles"

When NOT to use:

  • Simple transcription without speaker/timing needs → use faster-whisper (lighter, faster)
  • Real-time/streaming transcription
  • Files <10 seconds where setup overhead isn't worth it

WhisperX vs faster-whisper:

Featurefaster-whisperWhisperX
Basic transcription
Word timestamps✅ (approximate)✅ (precise, aligned)
Speaker diarization
Forced alignment
Batched inference
Word-level subtitles✅ (karaoke-style)
Subtitle generationManualBuilt-in (SRT/VTT/TSV)
Time range trimming✅ (--start/--end)
Hotwords✅ (boost specific terms)
Initial prompt✅ (domain terms)
Speaker renaming✅ (--speaker-names)
Stdin pipe✅ (read from -)
Setup complexitySimpleRequires HF token for diarization

Quick Reference

All commands use ./scripts/transcribe (the skill wrapper), not the whisperx CLI directly. The wrapper applies a required PyTorch compatibility patch — see "PyTorch 2.6+ Compatibility" below.

TaskCommandNotes
Basic transcription./scripts/transcribe audio.mp3Word-aligned by default
With speakers./scripts/transcribe audio.mp3 --diarizeAuto-reads ~/.cache/huggingface/token
Clean speaker output./scripts/transcribe audio.mp3 --diarize --merge-speakersMerges consecutive same-speaker segments
Named speakers./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob"Replaces SPEAKER_00, SPEAKER_01
SRT subtitles./scripts/transcribe audio.mp3 --srt -o subs.srtReady for video players
Word-level SRT./scripts/transcribe audio.mp3 --srt --word-levelKaraoke-style, one word per cue
Wrapped subtitles./scripts/transcribe audio.mp3 --srt --max-line-width 42Standard TV subtitle width
VTT subtitles./scripts/transcribe audio.mp3 --vtt -o subs.vttWeb-compatible with <v> speaker tags
JSON output./scripts/transcribe audio.mp3 --jsonFull data with word timestamps
TSV output./scripts/transcribe audio.mp3 --tsvSpreadsheet-friendly
Translate to English./scripts/transcribe audio.mp3 --translateAny language → English
Fast, no alignment./scripts/transcribe audio.mp3 --no-alignSkip forced alignment
Specific language./scripts/transcribe audio.mp3 -l enFaster than auto-detect
Partial transcription./scripts/transcribe audio.mp3 --start 1:30 --end 5:00Only a section
Boost terms./scripts/transcribe audio.mp3 --hotwords "Kubernetes gRPC"Improve rare term recognition
Domain accuracy./scripts/transcribe audio.mp3 --initial-prompt "OpenAI, GPT-4"Condition the model
From stdincat audio.mp3 | ./scripts/transcribe -Pipe from other tools
Auto-detect format./scripts/transcribe audio.mp3 -o out.srtFormat from extension

⚠️ Do NOT run whisperx CLI directly — it will crash on PyTorch 2.6+ with pyannote models. Always use this skill's ./scripts/transcribe wrapper.

Model Selection

ModelSizeSpeedAccuracyUse Case
tiny39MFastestBasicQuick drafts, testing
base74MVery fastGoodGeneral use
small244MFastBetterDefault for whisperx CLI
medium769MModerateHighQuality transcription
large-v21.5GBSlowerExcellentBest diarization compat
large-v31.5GBSlowerBestMaximum accuracy
large-v3-turbo809MFastExcellentRecommended (default)

Note: WhisperX defaults to small but this skill defaults to large-v3-turbo for the best speed/accuracy balance with GPU.

Setup

First-Time Setup

Prerequisites: Python 3.10+, ffmpeg, NVIDIA GPU with CUDA (strongly recommended)

Step 1: Install whisperx

bash
# Option A: Run the setup script (auto-detects GPU, creates venv if needed)
./setup.sh

# Option B: Install globally (if you prefer)
pip install whisperx

Step 2 (optional, for diarization): Set up Hugging Face token

Speaker diarization requires a free Hugging Face account and access to gated models. Skip this if you only need transcription/alignment.

  1. Create account at huggingface.co (if you don't have one)
  2. Go to huggingface.co/settings/tokens and create a read access token
  3. Accept the model agreement(s) (click "Agree and access repository"):
  4. Save the token so the script auto-detects it:
    bash
    mkdir -p ~/.cache/huggingface && echo -n "hf_YOUR_TOKEN" > ~/.cache/huggingface/token && chmod 600 ~/.cache/huggingface/token
    
    Alternatively: set HF_TOKEN env var, or pass --hf-token per-command.

Note: The token and model access are completely free. The models are just gated behind a click-to-agree license. Without step 3, you'll get a 403 error even with a valid token.

Subsequent Runs

No setup needed — just run ./scripts/transcribe. The wrapper script:

  • Auto-detects GPU/CPU and picks optimal compute type
  • Auto-reads the HF token from ~/.cache/huggingface/token for diarization
  • Applies the PyTorch 2.6+ compatibility patch automatically (see below)
  • First run for a new model downloads it to ~/.cache/huggingface/ (one-time per model)

Checking If It's Working

bash
# Quick test — should print transcript to stdout
./scripts/transcribe some_audio.mp3

# Test diarization — should show [SPEAKER_00], [SPEAKER_01], etc.
./scripts/transcribe some_audio.mp3 --diarize

# Check version
./scripts/transcribe --version

# If diarization fails with 403: model agreements not accepted (see step 3 above)
# If it crashes with pickle/weights_only error: you're running `whisperx` CLI directly instead of the wrapper

Platform Support

PlatformAccelerationSpeed
Linux + NVIDIA GPUCUDA (batched)~70x realtime 🚀
WSL2 + NVIDIA GPUCUDA (batched)~70x realtime 🚀
macOS Apple SiliconCPU~3-5x realtime
macOS IntelCPU~1-2x realtime
Linux (no GPU)CPU~1x realtime

Usage

All commands use ./scripts/transcribe — resolve the path relative to this skill's directory.

bash
# Basic transcription (word-aligned)
./scripts/transcribe audio.mp3

# With speaker diarization (auto-reads ~/.cache/huggingface/token)
./scripts/transcribe audio.mp3 --diarize

# Diarize with merged same-speaker segments (cleaner output)
./scripts/transcribe audio.mp3 --diarize --merge-speakers

# Rename speakers to real names
./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob,Charlie"

# Generate SRT subtitles
./scripts/transcribe audio.mp3 --srt -o subtitles.srt

# SRT with line wrapping (standard TV width)
./scripts/transcribe audio.mp3 --srt --max-line-width 42 -o subtitles.srt

# Word-level karaoke subtitles (one word per cue with precise timing)
./scripts/transcribe audio.mp3 --srt --word-level -o karaoke.srt

# VTT with speaker voice tags (<v Speaker>text</v>)
./scripts/transcribe audio.mp3 --vtt --diarize -o subtitles.vtt

# Auto-detect format from output filename
./scripts/transcribe audio.mp3 -o transcript.json
./scripts/transcribe audio.mp3 -o subtitles.vtt

# TSV for spreadsheets/data analysis
./scripts/transcribe audio.mp3 --tsv -o transcript.tsv

# Transcribe only a section (useful for long recordings)
./scripts/transcribe podcast.mp3 --start 10:30 --end 15:00

# Boost recognition of specific terms (hotwords)
./scripts/transcribe meeting.mp3 --hotwords "Kubernetes gRPC OAuth2"

# Improve accuracy for domain context (initial prompt)
./scripts/transcribe meeting.mp3 --initial-prompt "Attendees: Alice, Bob. Topics: Kubernetes, gRPC, OAuth2"

# Maximum accuracy
./scripts/transcribe audio.mp3 --model large-v3

# Translate non-English audio to English
./scripts/transcribe audio.mp3 --translate -l ja

# Fast mode (skip alignment)
./scripts/transcribe audio.mp3 --no-align

# JSON with full metadata and word timestamps
./scripts/transcribe audio.mp3 --json -o transcript.json

# Specify known speaker count for better diarization
./scripts/transcribe audio.mp3 --diarize --min-speakers 2 --max-speakers 4

# Read audio from stdin (pipe from other tools)
cat audio.mp3 | ./scripts/transcribe -
ffmpeg -i video.mp4 -f wav - | ./scripts/transcribe - --diarize

Options

code
AUDIO_FILE               Path to audio/video file, or '-' to read from stdin

Model options:
  -m, --model NAME       Whisper model (default: large-v3-turbo)
  --batch-size N         Batch size for inference (default: 8, lower if OOM)
  --beam-size N          Beam search size (higher = slower but more accurate)
  --initial-prompt TEXT   Condition the model with domain terms, names, acronyms
  --hotwords TEXT         Space-separated hotwords to boost recognition of rare terms

Device options:
  --device               cpu, cuda, or auto (default: auto)
  --compute-type         int8, float16, float32, or auto (default: auto)
  --threads N            CPU threads for CTranslate2 inference (default: 4)

Language options:
  -l, --language CODE    Language code (auto-detects if omitted)
  --translate            Translate to English

Time range:
  --start TIME           Start time — seconds (90), MM:SS (1:30), HH:MM:SS
  --end TIME             End time — same formats as --start

Alignment options:
  --no-align             Skip forced alignment (no word timestamps)
  --align-model MODEL    Custom phoneme ASR model for alignment

Speaker diarization:
  --diarize              Enable speaker labels
  --hf-token TOKEN       Hugging Face access token (also reads ~/.cache/huggingface/token or HF_TOKEN env)
  --min-speakers N       Minimum speaker count hint
  --max-speakers N       Maximum speaker count hint
  --merge-speakers       Merge consecutive segments from same speaker (cleaner output)
  --speaker-names NAMES  Comma-separated names to replace SPEAKER_00, SPEAKER_01, etc.

Output options:
  -j, --json             JSON output with segments and word timestamps
  --srt                  SRT subtitle format
  --vtt                  WebVTT subtitle format (uses <v> voice tags for speakers)
  --tsv                  TSV (tab-separated values) for data analysis
  --word-level           Word-level subtitles (SRT/VTT only) — karaoke-style
  --max-line-width N     Maximum characters per subtitle line (wraps at word boundaries)
  --output-format FMT    Explicit format (srt, vtt, txt, json, tsv)
  -o, --output FILE      Save to file (format auto-detected from extension)

Miscellaneous:
  -V, --version          Show version
  -q, --quiet            Suppress progress messages

Output Formats

Plain Text (default)

code
Hello and welcome to the show.
Today we're talking about AI transcription.

Plain Text with Diarization (--diarize)

code
[SPEAKER_00] Hello and welcome to the show.
[SPEAKER_01] Thanks for having me.

Plain Text with Named Speakers (--diarize --speaker-names "Alice,Bob")

code
[Alice] Hello and welcome to the show.
[Bob] Thanks for having me.

SRT (--srt)

Standard subtitle format, compatible with VLC, YouTube, etc.

code
1
00:00:00,000 --> 00:00:03,500
Hello and welcome to the show.

2
00:00:03,500 --> 00:00:06,200
Today we're talking about AI transcription.

Word-Level SRT (--srt --word-level)

One word per cue — for karaoke-style highlighting or precise editing.

code
1
00:00:00,000 --> 00:00:00,320
Hello

2
00:00:00,320 --> 00:00:00,560
and

3
00:00:00,560 --> 00:00:01,100
welcome

WebVTT (--vtt)

Web-native subtitle format for HTML5 <video> and <track>. When used with --diarize, uses proper VTT <v> voice tags for speaker identification.

code
WEBVTT

00:00:00.000 --> 00:00:03.500
<v Alice>Hello and welcome to the show.</v>

00:00:03.500 --> 00:00:05.000
<v Bob>Thanks for having me.</v>

JSON (--json)

Structured output with word-level timestamps and confidence scores.

json
{
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello and welcome to the show.",
      "speaker": "SPEAKER_00",
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.98},
        {"word": "and", "start": 0.32, "end": 0.56, "confidence": 0.95}
      ]
    }
  ]
}

TSV (--tsv)

Tab-separated values for spreadsheets and data pipelines.

code
start	end	text
0.000	3.500	Hello and welcome to the show.
3.500	6.200	Today we're talking about AI transcription.

Examples

bash
# Transcribe a meeting recording with speakers (clean output)
./scripts/transcribe meeting.mp3 --diarize --merge-speakers \
  --speaker-names "Alice,Bob,Charlie" \
  --min-speakers 3 --max-speakers 5 --json -o meeting.json

# Generate subtitles for a video (wrapped to standard width)
./scripts/transcribe video.mp4 --srt --max-line-width 42 -o video.srt

# Karaoke-style word-level subtitles
./scripts/transcribe song.mp3 --vtt --word-level -o karaoke.vtt

# Transcribe just the interesting part of a podcast
./scripts/transcribe podcast.mp3 --start 45:00 --end 1:02:30 --diarize --merge-speakers

# Improve accuracy for technical content (hotwords + initial prompt)
./scripts/transcribe lecture.mp3 \
  --hotwords "PyTorch CTranslate2 whisperx" \
  --initial-prompt "A lecture on ML model optimization"

# Batch transcribe a folder
for file in recordings/*.mp3; do
  ./scripts/transcribe "$file" --json -o "${file%.mp3}.json"
done

# Transcribe YouTube audio (with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3 --diarize --merge-speakers

# Pipe directly from ffmpeg (extract audio on the fly)
ffmpeg -i video.mp4 -f wav -ac 1 -ar 16000 - 2>/dev/null | ./scripts/transcribe -

# Quick draft (fast, no alignment)
./scripts/transcribe audio.mp3 --model base --no-align

# German audio with TSV output for analysis
./scripts/transcribe audio.mp3 -l de --tsv -o transcript.tsv

# Auto-detect format from filename
./scripts/transcribe audio.mp3 -o transcript.srt   # → SRT
./scripts/transcribe audio.mp3 -o data.json         # → JSON
./scripts/transcribe audio.mp3 -o export.tsv        # → TSV

Common Mistakes

MistakeProblemSolution
Using CPU when GPU available10-70x slowerCheck nvidia-smi; verify CUDA
Missing HF token for diarizeDiarization failsGet token from huggingface.co/settings/tokens
Not accepting model agreements403 error on diarization modelwhisperx ≥3.8.0: accept community-1. Earlier: accept both pyannote/speaker-diarization-3.1 AND segmentation-3.0 (see Setup)
Running whisperx CLI directlyCrashes on PyTorch 2.6+Always use ./scripts/transcribe wrapper (applies torch.load patch)
batch_size too highCUDA OOMLower --batch-size (try 4 or 2)
Using large-v3 when turbo worksUnnecessary slowdownlarge-v3-turbo is faster with near-identical accuracy
Forgetting --languageWastes time auto-detectingSpecify -l en when you know the language
Using WhisperX for simple transcriptionHeavier setup for no benefitUse faster-whisper for basic transcription
--word-level without --srt/--vttFlag is ignoredWord-level only applies to subtitle formats
--merge-speakers without --diarizeFlag is ignoredMerge only works when speakers are identified

Performance Notes

  • First run: Downloads model + alignment model (one-time)
  • GPU batched inference: Up to 70x realtime with large-v2
  • Diarization: Adds ~30-60s overhead for model loading
  • Memory (GPU VRAM):
    • large-v3-turbo: ~2-3GB
    • large-v3 + diarization: ~4-5GB
    • Reduce --batch-size if OOM
  • Completion stats: The tool prints segment/word counts, speed ratio, and speaker breakdown (when diarizing) at the end

Supported Languages

WhisperX supports all languages that Whisper supports (99 languages). Forced alignment (word timestamps) is available for a subset — if alignment fails for a language, the tool falls back gracefully to segment-level timestamps.

Languages with alignment support (common subset): en English, zh Chinese, de German, es Spanish, fr French, it Italian, ja Japanese, ko Korean, pt Portuguese, ru Russian, nl Dutch, pl Polish, tr Turkish, ar Arabic, sv Swedish, da Danish, fi Finnish, hu Hungarian, uk Ukrainian, el Greek, cs Czech, ro Romanian, vi Vietnamese, th Thai, hi Hindi, he Hebrew, id Indonesian, ms Malay, no Norwegian, fa Persian, bg Bulgarian, ca Catalan, hr Croatian, sk Slovak, sl Slovenian, ta Tamil, te Telugu, ur Urdu

For the full list, see whisperx/alignment.py.

Hotwords vs Initial Prompt

Both improve accuracy for specific terms, but they work differently:

--hotwords--initial-prompt
How it worksBoosts probability of specific tokens during decodingConditions the model as if these words appeared earlier
Best forRare terms, proper nouns, technical jargonSetting domain context, style, formatting
Example--hotwords "Kubernetes gRPC OAuth2"--initial-prompt "A technical meeting about cloud infrastructure"
Can combine✅ Yes, use both together for best results
Requireswhisperx ≥3.7.5Any version

Tip: Use hotwords for the specific words you need recognized correctly, and initial prompt for broader context about the audio content.

PyTorch 2.6+ Compatibility (CRITICAL)

⚠️ PyTorch 2.6 changed torch.load() to default to weights_only=True. This breaks pyannote.audio's model loading (used by both VAD and diarization), because the model checkpoints contain globals like omegaconf.listconfig.ListConfig and torch.torch_version.TorchVersion that aren't allowlisted.

Symptoms:

  • _pickle.UnpicklingError: Weights only load failed when loading VAD or diarization models
  • 'NoneType' object has no attribute 'to' (pipeline silently returns None)

How this skill handles it:

The scripts/transcribe.py uses the whisperx Python API directly (not as a subprocess) so it can monkey-patch torch.load before any model loading happens:

python
import torch
_original_torch_load = torch.load
def _patched_torch_load(*args, **kwargs):
    kwargs['weights_only'] = False  # Must FORCE, not setdefault — lightning_fabric passes True explicitly
    return _original_torch_load(*args, **kwargs)
torch.load = _patched_torch_load

Key details:

  • The patch must use kwargs['weights_only'] = False (forced override), NOT kwargs.setdefault('weights_only', False) — because lightning_fabric explicitly passes weights_only=True, which setdefault won't override
  • The patch must be applied before importing whisperx, pyannote, or any model loading code
  • This is why the skill uses the Python API instead of shelling out to whisperx CLI — a subprocess can't inherit the monkey-patch
  • VAD defaults to silero (not pyannote) to avoid a second torch.load issue in the VAD pipeline. Silero loads fine without the patch, but diarization still needs it

Note: whisperx ≥3.8.0 migrated to pyannote-audio v4 with speaker-diarization-community-1, which may resolve some of these compatibility issues. The patch is kept for broad version support.

If whisperx CLI is updated to fix this upstream, the monkey-patch can be removed and the script could switch back to subprocess mode. Track: whisperX#972

Troubleshooting

_pickle.UnpicklingError: Weights only load failed: PyTorch 2.6+ compat issue. If running via CLI (whisperx command directly), this can't be fixed without patching the installed library. Use this skill's scripts/transcribe wrapper instead, which applies the patch automatically. See "PyTorch 2.6+ Compatibility" section above.

"CUDA not available": Install PyTorch with CUDA (pip install torch --index-url https://download.pytorch.org/whl/cu121)

"No module named whisperx": Run ./setup.sh or pip install whisperx

Diarization 403 error: You must accept the model agreement(s). For whisperx ≥3.8.0: accept community-1. For earlier versions: accept both speaker-diarization-3.1 and segmentation-3.0. See Setup above.

Diarization fails but transcription continues: v1.1.0+ gracefully handles diarization failures — it prints a diagnostic error and continues without speaker labels instead of crashing.

'NoneType' object has no attribute 'to': Either the HF token is invalid, the model agreements haven't been accepted, or the torch.load patch isn't applied. Check all three.

OOM on GPU: Lower --batch-size to 4 or 2

Alignment fails for language X: The language may not have a wav2vec2 alignment model. The tool will fall back to segment-level timestamps and print a warning. Check supported languages in whisperx alignment.py.

Slow on CPU: Expected — use GPU for practical transcription. Even tiny model on CPU is ~5-10x slower than large-v3-turbo on a mid-range GPU.

Empty output / no segments: Audio may be silence or too short. Check with ffprobe audio.mp3 to verify the file has actual audio content. v1.1.0+ prints a warning and produces valid empty output instead of crashing.

Timestamps wrong after trimming: If using --start, timestamps in the output reflect the original file's timeline (not relative to the trim point). This is by design — subtitle timecodes stay correct for the source video.

"No speech detected" warning: The audio file may contain only music, silence, or non-speech sounds. This is expected behavior, not an error.

Upstream Changes

whisperx 3.8.0 (Feb 2026): Migrated to pyannote-audio v4 with speaker-diarization-community-1. This model has lower diarization error rates across all benchmarks compared to the older speaker-diarization-3.1. Upgrade recommended: pip install --upgrade whisperx

whisperx 3.7.5: Added --hotwords support for boosting recognition of specific terms. This skill exposes it via the --hotwords flag.

References