AgentSkillsCN

faster-whisper

本地语音转文本,采用faster-whisper模型。在保持与OpenAI Whisper相同准确率的前提下,速度提升4–6倍;借助GPU加速,实时转录速度可达约20倍。支持标准模型与精简模型,并提供逐字时间戳功能。

SKILL.md
--- frontmatter
name: faster-whisper
description: Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
version: 1.0.7
author: ThePlasmak
homepage: https://github.com/ThePlasmak/faster-whisper
tags: ["audio", "transcription", "whisper", "speech-to-text", "ml", "cuda", "gpu"]
platforms: ["linux", "macos", "wsl2"]
metadata: {"openclaw":{"emoji":"🗣️","requires":{"bins":["ffmpeg","python3"]}}}

Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).

When to Use

Use this skill when you need to:

  • Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
  • Convert speech to text locally — no API costs, works offline (after model download)
  • Batch process multiple audio files — efficient for large collections
  • Generate subtitles/captions — word-level timestamps available
  • Multilingual transcription — supports 99+ languages with auto-detection

Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video"

When NOT to use:

  • Real-time/streaming transcription (use streaming-optimized tools instead)
  • Cloud-only environments without local compute
  • Files <10 seconds where API call latency doesn't matter

Quick Reference

TaskCommandNotes
Basic transcription./scripts/transcribe audio.mp3Uses default distil-large-v3
Faster English./scripts/transcribe audio.mp3 --model distil-medium.en --language enEnglish-only, 6.8x faster
Maximum accuracy./scripts/transcribe audio.mp3 --model large-v3-turbo --beam-size 10Slower but best quality
Word timestamps./scripts/transcribe audio.mp3 --word-timestampsFor subtitles/captions
JSON output./scripts/transcribe audio.mp3 --json -o output.jsonProgrammatic access
Multilingual./scripts/transcribe audio.mp3 --model large-v3-turboAuto-detects language
Remove silence./scripts/transcribe audio.mp3 --vadVoice activity detection

Model Selection

Choose the right model for your needs:

dot
digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="Start", shape=doublecircle];
    need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
    multilingual [label="Multilingual\ncontent?", shape=diamond];
    resource_constrained [label="Resource\nconstraints?", shape=diamond];

    large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3\n(default)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="yes"];
    need_accuracy -> multilingual [label="no"];
    multilingual -> large_turbo [label="yes"];
    multilingual -> resource_constrained [label="no (English)"];
    resource_constrained -> distil_small [label="mobile/edge"];
    resource_constrained -> distil_medium [label="some limits"];
    resource_constrained -> distil_large [label="no"];
}

Model Table

Standard Models (Full Whisper)

ModelSizeSpeedAccuracyUse Case
tiny / tiny.en39MFastestBasicQuick drafts
base / base.en74MVery fastGoodGeneral use
small / small.en244MFastBetterMost tasks
medium / medium.en769MModerateHighQuality transcription
large-v1/v2/v31.5GBSlowerBestMaximum accuracy
large-v3-turbo809MFastExcellentRecommended for accuracy

Distilled Models (~6x Faster, ~1% WER difference)

ModelSizeSpeed vs StandardAccuracyUse Case
distil-large-v3756M~6.3x faster9.7% WERDefault, best balance
distil-large-v2756M~5.8x faster10.1% WERFallback
distil-medium.en394M~6.8x faster11.1% WEREnglish-only, resource-constrained
distil-small.en166M~5.6x faster12.1% WERMobile/edge devices

.en models are English-only and slightly faster/better for English content.

Setup

Linux / macOS / WSL2

bash
# Run the setup script (creates venv, installs deps, auto-detects GPU)
./setup.sh

Requirements:

  • Python 3.10+, ffmpeg

Platform Support

PlatformAccelerationSpeed
Linux + NVIDIA GPUCUDA~20x realtime 🚀
WSL2 + NVIDIA GPUCUDA~20x realtime 🚀
macOS Apple SiliconCPU*~3-5x realtime
macOS IntelCPU~1-2x realtime
Linux (no GPU)CPU~1x realtime

*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

HardwareSpeed9-min video
RTX 3070 (GPU)~20x realtime~27 sec
CPU (int8)~0.3x realtime~30 min

If setup didn't detect your GPU, manually install PyTorch with CUDA:

bash
# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

Usage

bash
# Basic transcription
./scripts/transcribe audio.mp3

# With specific model
./scripts/transcribe audio.wav --model large-v3-turbo

# With word timestamps
./scripts/transcribe audio.mp3 --word-timestamps

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

# JSON output
./scripts/transcribe audio.mp3 --json

Options

code
--model, -m        Model name (default: distil-large-v3)
--language, -l     Language code (e.g., en, es, fr - auto-detect if omitted)
--word-timestamps  Include word-level timestamps
--beam-size        Beam search size (default: 5, higher = more accurate but slower)
--vad              Enable voice activity detection (removes silence)
--json, -j         Output as JSON
--output, -o       Save transcript to file
--device           cpu or cuda (auto-detected)
--compute-type     int8, float16, float32 (default: auto-optimized)
--quiet, -q        Suppress progress messages

Examples

bash
# Transcribe YouTube audio (after extraction with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3

# Batch transcription with JSON output
for file in *.mp3; do
  ./scripts/transcribe "$file" --json > "${file%.mp3}.json"
done

# High-accuracy transcription with larger beam size
./scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10 --word-timestamps

# Fast English-only transcription
./scripts/transcribe audio.mp3 \
  --model distil-medium.en --language en

# Transcribe with VAD (removes silence)
./scripts/transcribe audio.mp3 --vad

Common Mistakes

MistakeProblemSolution
Using CPU when GPU available10-20x slower transcriptionCheck nvidia-smi; verify CUDA installation
Not specifying languageWastes time auto-detecting on known contentUse --language en when you know the language
Using wrong modelUnnecessary slowness or poor accuracyDefault distil-large-v3 is excellent; only use large-v3 if accuracy issues
Ignoring distilled modelsMissing 6x speedup with <1% accuracy lossTry distil-large-v3 before reaching for standard models
Forgetting ffmpegSetup fails or audio can't be processedSetup script handles this; manual installs need ffmpeg separately
Out of memory errorsModel too large for available VRAM/RAMUse smaller model or --compute-type int8
Over-engineering beam sizeDiminishing returns past beam-size 5-7Default 5 is fine; try 10 for critical transcripts

Performance Notes

  • First run: Downloads model to ~/.cache/huggingface/ (one-time)
  • GPU: Automatically uses CUDA if available (~10-20x faster)
  • Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss
  • Memory:
    • distil-large-v3: ~2GB RAM / ~1GB VRAM
    • large-v3-turbo: ~4GB RAM / ~2GB VRAM
    • tiny/base: <1GB RAM

Why faster-whisper?

  • Speed: ~4-6x faster than OpenAI's original Whisper
  • Accuracy: Identical (uses same model weights)
  • Efficiency: Lower memory usage via quantization
  • Production-ready: Stable C++ backend (CTranslate2)
  • Distilled models: ~6x faster with <1% accuracy loss

Troubleshooting

"CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above) Setup fails: Make sure Python 3.10+ is installed Out of memory: Use smaller model or --compute-type int8 Slow on CPU: Expected — use GPU for practical transcription Model download fails: Check ~/.cache/huggingface/ permissions

References