Caption Alignment Skill

Name: Caption Alignment
Rating: 92
Author: Nuva-Lab

Align text to audio timestamps for karaoke-style rolling captions.

Usage

python

from skills.align_captions import CaptionAligner

aligner = CaptionAligner()

# Single audio file
captions = await aligner.align_audio(
    audio_path=Path("speech.wav"),
    text="Hello world!",
    language="English",
)
# Returns list of AlignedCaption with word-level timing

# Multi-file dialogue
captions = await aligner.align_dialogue_lines(
    audio_paths=[Path("line1.wav"), Path("line2.wav")],
    texts=["Hello!", "Hi there!"],
)

Output Format

python

@dataclass
class WordSegment:
    text: str
    start_ms: int
    end_ms: int

@dataclass
class AlignedCaption:
    text: str        # Phrase text
    start_ms: int    # Phrase start
    end_ms: int      # Phrase end
    words: list[WordSegment]  # Word-level timing for karaoke

Dependencies

•qwen-asr (Qwen3-ForcedAligner)
•jieba (for Chinese word segmentation)
•torch

Install

bash

pip install qwen-asr jieba torch