--- frontmatter
name: transcription
description: Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.

plugin: video-editing
updated: 2026-01-20

Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

System Requirements

Installation Options

Option 1: OpenAI Whisper (Python)

bash
# macOS/Linux/Windows
pip install openai-whisper

# Verify
whisper --help

Option 2: whisper.cpp (C++ - faster)

bash
# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake

Option 3: Insanely Fast Whisper (GPU accelerated)

bash
pip install insanely-fast-whisper

Model Selection

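| Model | Size (params) | VRAM | Accuracy | Speed | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | Low | Fastest | Quick previews |
| base | 74M | ~1GB | Medium | Fast | Draft transcripts |
| small | 244M | ~2GB | Good | Medium | General use |
| medium | 769M | ~5GB | Better | Slow | Quality transcripts |
| large-v3 | 1550M | ~10GB | Best | Slowest | Final production |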

Recommendation: Start with small for speed/quality balance. Use large-v3 for final delivery.
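
For scripts that must choose a model automatically, the table can be encoded as a small lookup. This is a minimal sketch that assumes the approximate VRAM figures above; `pick_model` and its `headroom_gb` parameter are hypothetical names, not part of Whisper.

```python
# Approximate VRAM needs (GB) copied from the table above, smallest to largest.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb, headroom_gb=0.5):
    """Return the largest model whose approximate VRAM need fits, else None."""
    budget = available_vram_gb - headroom_gb
    best = None
    for name, need_gb in MODEL_VRAM_GB:
        if need_gb <= budget:
            best = name  # later entries are larger, so keep overwriting
    return best
```

The headroom leaves room for other processes on the GPU; set it to 0 to use the table values as-is.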

Basic Transcription

Using OpenAI Whisper

bash
# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True

Using whisper.cpp

bash
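# Download model first
./models/download-ggml-model.sh base.en

# Transcribe to SRT (input must be 16 kHz, 16-bit WAV)
# Note: recent whisper.cpp releases name the binary whisper-cli instead of main
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# Write segments with start/end timestamps as CSV
./main -m models/ggml-base.en.bin -f audio.wav -ocsv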

Output Formats

SRT (SubRip Subtitle)

code
1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.

VTT (WebVTT)

code
WEBVTT

00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.

00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.

JSON (with word-level timing)

json
{
  "text": "Hello and welcome to this video.",
  "segments": [
    {
      "id": 0,
      "start": 1.0,
      "end": 4.5,
      "text": " Hello and welcome to this video.",
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3},
        {"word": "and", "start": 1.4, "end": 1.5},
        {"word": "welcome", "start": 1.6, "end": 2.0},
        {"word": "to", "start": 2.1, "end": 2.2},
        {"word": "this", "start": 2.3, "end": 2.5},
        {"word": "video", "start": 2.6, "end": 3.0}
      ]
    }
  ]
}
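
The three formats carry the same segment data, so a saved JSON transcript can be converted to subtitles later without re-running the model. The sketch below assumes segments shaped like the JSON example above; `format_timestamp` and `segments_to_srt` are illustrative helpers, not Whisper functions (the `whisper` CLI already writes SRT and VTT directly via `--output_format`).

```python
def format_timestamp(seconds, separator=","):
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT, separator='.')."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def segments_to_srt(segments):
    """Render segments (dicts with start, end, text) as SRT cue blocks."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by one blank line
    return "\n".join(cues)
```

For VTT, call `format_timestamp(t, ".")`, drop the cue numbers, and start the file with a `WEBVTT` line.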

Audio Extraction for Transcription

Before transcribing video, extract the audio track. Whisper resamples its input to 16 kHz mono internally, so extracting in that format loses nothing and keeps files small:

bash
# Extract audio as WAV (16kHz, mono - optimal for Whisper)
ffmpeg -i video.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Extract as full-quality WAV for archival (keeps source sample rate and channels)
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio_archive.wav

# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3

Timing Synchronization

Convert Whisper JSON to FCP Timing

python
import json

def whisper_to_fcp_timing(whisper_json_path, fps=24):
    """Convert Whisper JSON output to FCP-compatible timing."""
    with open(whisper_json_path) as f:
        data = json.load(f)

    segments = []
    for seg in data.get("segments", []):
        segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            # int() truncates, so each value is the frame that contains the timestamp
            "start_frame": int(seg["start"] * fps),
            "end_frame": int(seg["end"] * fps),
            "text": seg["text"].strip(),
            "words": seg.get("words", [])
        })

    return segments
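
Frame numbers computed with a float `fps` can drift at fractional rates such as 23.976 (24000/1001). The sketch below shows one way to keep the arithmetic exact and to display frames as timecode. Both helper names are hypothetical, and the timecode is non-drop-frame, so it only matches wall-clock time at integer frame rates.

```python
from fractions import Fraction


def seconds_to_frame(seconds, fps):
    """Index of the frame containing the given time.

    fps may be an int or a Fraction such as Fraction(24000, 1001).
    Going through str() avoids binary float error in the timestamp.
    """
    return int(Fraction(str(seconds)) * Fraction(fps))


def frames_to_timecode(frame, fps=24):
    """Format a frame index as non-drop-frame HH:MM:SS:FF (integer fps only)."""
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"
```

For example, a segment starting at 4.5 s in a 24 fps project starts on frame 108, which displays as `00:00:04:12`.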

Frame-Accurate Timing

bash
# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames,duration,r_frame_rate \
  -of json video.mp4
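
ffprobe reports `r_frame_rate` as a ratio string such as `24000/1001`, and the other fields as strings too. A minimal sketch for turning that JSON into exact values follows; `parse_stream_info` and `probe_video` are hypothetical helper names, and the code assumes all three fields are present (some containers omit `duration` at the stream level).

```python
import json
import subprocess
from fractions import Fraction


def parse_stream_info(ffprobe_json):
    """Extract exact frame rate, frame count, and duration from ffprobe JSON text."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return {
        "fps": Fraction(stream["r_frame_rate"]),  # "24000/1001" parses exactly
        "frames": int(stream["nb_read_frames"]),
        "duration": float(stream["duration"]),
    }


def probe_video(path):
    """Run the ffprobe command shown above and parse its output."""
    cmd = [
        "ffprobe", "-v", "error", "-count_frames", "-select_streams", "v:0",
        "-show_entries", "stream=nb_read_frames,duration,r_frame_rate",
        "-of", "json", path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_stream_info(result.stdout)
```

Note that `-count_frames` decodes the whole stream, so probing long videos can take a while.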

Speaker Diarization

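For multi-speaker content, use pyannote.audio. The pretrained pipeline is gated on Hugging Face: accept the model's user conditions and create an access token first.

bash
pip install pyannote.audio

python
from pyannote.audio import Pipeline

# Gated model: pass a Hugging Face access token (placeholder below).
# pyannote.audio 2.x/3.x take use_auth_token; newer releases may name it differently.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("audio.wav")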

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
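
Diarization yields speaker turns, not text, so the turns still have to be matched against the Whisper segments. One simple approach, sketched here as an assumption and not as pyannote's or Whisper's own method, is to label each segment with the speaker whose turns overlap it the longest. `assign_speakers` is a hypothetical helper, and `turns` is a plain list of `(start, end, speaker)` tuples collected from the `itertracks` loop above.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker who overlaps it most.

    segments: dicts with "start", "end", "text" (Whisper-style).
    turns: (start, end, speaker) tuples in seconds.
    Segments with no overlapping turn get speaker None.
    """
    labeled = []
    for seg in segments:
        overlap_by_speaker = {}
        for start, end, speaker in turns:
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        if overlap_by_speaker:
            speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
        else:
            speaker = None
        labeled.append({**seg, "speaker": speaker})
    return labeled
```

Segments that span a speaker change get a single label this way; splitting on word-level timestamps would be a finer-grained alternative.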

Batch Processing

bash
#!/bin/bash
# Transcribe all videos in directory

MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"

for video in *.mp4 *.mov *.avi; do
  [[ -f "$video" ]] || continue

  base="${video%.*}"

  # Extract audio (-y overwrites stale temp files; skip this video if extraction fails)
  ffmpeg -y -i "$video" -vn -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav" || continue

  # Transcribe
  whisper "/tmp/${base}.wav" --model "$MODEL" \
    --output_format all \
    --output_dir "$OUTPUT_DIR"

  # Cleanup temp audio
  rm -f "/tmp/${base}.wav"

  echo "Transcribed: $video"
done

Quality Optimization

Improve Accuracy

  1. Reduce noise before transcription:
bash
ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav
  2. Give a language hint to skip auto-detection:
bash
whisper audio.mp3 --language en --model medium
  3. Provide an initial prompt for context (names, jargon, spelling):
bash
whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."

Performance Tips

  1. GPU acceleration (if available):
bash
whisper audio.mp3 --model large-v3 --device cuda
  2. Process long videos in chunks:
python
# Split audio into 10-minute chunks
# Transcribe each chunk
# Merge results with time offset adjustment
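
The merge step is where timing errors usually creep in, because each chunk's timestamps restart at zero. A minimal sketch of the offset adjustment follows, assuming fixed-length chunks transcribed in order and results shaped like Whisper's JSON output; `merge_chunk_segments` is a hypothetical helper name.

```python
def merge_chunk_segments(chunk_results, chunk_seconds=600):
    """Merge per-chunk transcription results into one timeline.

    chunk_results: result dicts in chunk order, each with a "segments" list
    whose times are relative to the start of that chunk.
    chunk_seconds: length of every chunk except possibly the last (600 = 10 min).
    """
    merged = []
    for chunk_index, result in enumerate(chunk_results):
        offset = chunk_index * chunk_seconds
        for seg in result.get("segments", []):
            # Copy so the per-chunk results are left untouched
            shifted = dict(seg, start=seg["start"] + offset, end=seg["end"] + offset)
            if "words" in seg:
                shifted["words"] = [
                    dict(w, start=w["start"] + offset, end=w["end"] + offset)
                    for w in seg["words"]
                ]
            shifted["id"] = len(merged)  # renumber across the whole file
            merged.append(shifted)
    return merged
```

Hard cuts can split a word at a chunk boundary, so cutting at silences or overlapping chunks slightly tends to give cleaner joins; that refinement is not shown here.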

Error Handling

bash
# Validate audio file before transcription
validate_audio() {
  local file="$1"
  if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
    return 0
  else
    echo "Error: No audio stream found in $file" >&2
    return 1
  fi
}

# Check Whisper installation
check_whisper() {
  if command -v whisper &> /dev/null; then
    echo "Whisper available"
    return 0
  else
    echo "Error: Whisper not installed. Run: pip install openai-whisper" >&2
    return 1
  fi
}
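
When the pipeline is driven from Python instead of a shell script, the same pre-flight check can be done with the standard library. This is a small sketch; `missing_tools` is a hypothetical helper, and the default tool list simply mirrors the commands used in this document.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe", "whisper")):
    """Return the names from tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

A caller can stop early with a clear message if the returned list is non-empty, before any audio is extracted.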

Related Skills

  • ffmpeg-core - Audio extraction and preprocessing
  • final-cut-pro - Import transcripts as titles/markers