Talking-Head Video Processing
Process direct-to-camera videos with sentence-level precision, rolling captions, and multi-format output.
Full Pipeline Overview
Raw Video (47+ min)
↓ smart_chunk.py
Chunks (2.5-3.5 min)
↓ mlx_transcribe.py --word-timestamps
Transcript (word-level timestamps)
↓ sentence_split.py (precise cuts, no overlap)
Sentence Clips (~10s each, frame-accurate)
↓ analyze_script.py (RECALL-FIRST)
Topics with complete arcs (hook → elaborate → conclude)
↓ [Claude Code: Select topic]
↓ stitch_clips.py (precise cuts)
Raw Topic Video (~60-180s)
↓ precision_trim.py (remove fillers, stutters)
Polished Video (~30-120s)
↓ generate_captions.py (timestamp remapping)
Captions + Section Markers
↓ render_with_captions.py (Remotion)
Final Output: 16:9 + 9:16 with rolling captions
Key Insight: Recall-First Topic Analysis
Old approach (precision-first): Find 5-10 "highlight moments" → Often incomplete thoughts
New approach (recall-first): Find complete TOPICS with full arcs → Then trim for precision
For each topic, Gemini identifies:
- •Full clip range: ALL clips that might be relevant (optimize for recall)
- •Arc structure: Hook → Elaboration → Conclusion
- •Trimming guide: Essential vs supporting vs skippable clips
- •Reorder suggestions: If moving clips would improve flow
Why this works:
- •Complete narratives: Each topic has setup, body, and resolution
- •No missing context: Better to include extra and trim than miss key content
- •Flexible editing: Full version for review, trimmed version for final cut
Quick Start
# Full pipeline (pauses for clip selection) python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/ # Auto-stitch top clips python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/ --auto-stitch
Claude Code Integration
from process_video import run_pipeline, stitch_selected_clips
# Run pipeline (stops at clip review)
state = run_pipeline("raw_footage.mp4", "./output/")
# Present review to user
if state.get("awaiting_user_input"):
print(state["clip_review"]["display"])
# After user says "Use clips 7, 12, 3"
result = stitch_selected_clips("./output/", [7, 12, 3])
Skills Reference
sentence_split.py
Split at sentence boundaries, create sentence index.
python sentence_split.py video.mp4 transcript.json -o clips/
Output: clip_index.json with sentence-to-clip mapping.
analyze_script.py
Text-first highlight analysis.
python analyze_script.py clips/clip_index.json -o clip_scores.json
Gemini reads full transcript, returns highlights with:
- •
sentence_range: Which sentences (e.g., "15-18") - •
clip_ids: Mapped clip IDs - •
viral_score,hook_potential,standalone
lowres_convert.py (Optional)
Only needed if you want video verification of selected clips.
python lowres_convert.py clips/ -o lowres/ --resolution 480p
batch_analyze.py (Optional)
Video-based verification of specific clips. Use after script analysis to double-check selected clips visually.
python batch_analyze.py lowres/ -o verification.json --clips 7,12,3
stitch_clips.py
Concatenate approved clips.
python stitch_clips.py clip_scores.json --approved 7,12,3 -o final.mp4
Output: clip_scores.json
{
"analysis_type": "script_first",
"clips": [
{
"clip_id": 7,
"viral_score": 9,
"hook_potential": 10,
"key_quote": "I've raised tens of millions...",
"recommended_use": "opening"
}
],
"highlights": [
{
"sentence_range": "15-18",
"clip_ids": [7],
"quote": "I've raised tens of millions...",
"type": "hook",
"viral_score": 9,
"why": "Strong credibility opener"
}
]
}
Sentence Index
The key data structure that enables precise mapping:
{
"sentences": [
{"id": 1, "text": "First sentence.", "clip_id": 1},
{"id": 2, "text": "Second sentence.", "clip_id": 1},
{"id": 3, "text": "Third sentence.", "clip_id": 2}
],
"sentence_to_clip": {1: 1, 2: 1, 3: 2},
"clip_to_sentences": {1: [1, 2], 2: [3]}
}
When Gemini says "sentences 15-18 are great", we look up sentence_to_clip to get the exact clip IDs.
Precise Cutting (Zero Overlap)
By default, all clips are cut with frame-accurate precision using FFmpeg re-encoding and zero padding. This ensures:
- •Zero overlapping frames between clips
- •No repeated words at clip boundaries
- •Clean audio cuts (no pops or artifacts)
# Default: precise cut (re-encode) with zero padding python sentence_split.py video.mp4 transcript.json -o clips/ # Fast mode: stream copy (may have keyframe boundary issues) python sentence_split.py video.mp4 transcript.json -o clips/ --fast # Add padding only if you want overlap for crossfades python sentence_split.py video.mp4 transcript.json -o clips/ --padding 100
Why zero padding by default?
- •Qwen3-ForcedAligner provides precise word timestamps (~30ms accuracy)
- •Natural speech has gaps between sentences (pauses)
- •Adding padding (e.g., 100ms) causes clips to overlap:
- •Clip N extracts to
end + 100ms - •Clip N+1 extracts from
start - 100ms - •If gap < 200ms → overlap and repeated words!
- •Clip N extracts to
- •Zero padding = clips fit together perfectly with natural pauses
Why re-encode (not stream copy)?
- •Stream copy (
-c copy) cuts on keyframes, not exact timestamps - •This causes extra frames at start (previous keyframe) or missing frames at end
- •Re-encoding with
-crf 18maintains quality while ensuring exact cuts
Key Learnings
Why text-first? Earlier versions uploaded 60s+ video clips to Gemini:
- •Slow: ~30s upload per 20MB clip
- •Expensive: Video processing tokens
- •Limited context: Gemini only saw chunks, not full narrative
Why sentence index? Clips are cut at pauses, roughly 1:1 with sentences. The index handles edge cases where sentences span clips.
Why precise cuts? Stream copy creates overlapping frames:
- •FFmpeg seeks to nearest keyframe (not exact timestamp)
- •Adjacent clips may share the same frames at boundaries
- •Re-encoding fixes this by decoding and re-encoding at exact timestamps
When to use video analysis? Only for final verification of selected clips if you want to check:
- •Speaker visibility
- •Audio quality
- •Visual issues
Phase 2: Precision Trimming
After selecting a topic, remove filler words, stutters, and repetitions.
precision_trim.py
Two-stage approach: ASR for precise timestamps, Gemini for semantic decisions.
# Full pipeline python precision_trim.py run topic_video.mp4 -o output/ # Step by step python precision_trim.py transcribe topic_video.mp4 -o word_transcript.json python precision_trim.py identify topic_video.mp4 word_transcript.json -o cuts.json python precision_trim.py apply topic_video.mp4 cuts.json word_transcript.json -o trimmed.mp4
Key insight: Gemini identifies WHAT to cut (by word index), ASR provides WHERE to cut (~30ms precision).
Micro-segment fix: min_keep_duration=1.0s prevents jittery playback from tiny segments.
Phase 3: Rolling Captions + Multi-Format
Add karaoke-style captions and render both horizontal and vertical formats.
generate_captions.py
Maps word timestamps from original video to trimmed video, groups into phrases.
python generate_captions.py word_transcript.json kept_segments.json -o output/
Timestamp remapping logic:
# For each word in original video: adjusted_time = word_start - segment_start + cumulative_offset
Phrase grouping by aspect ratio:
| Aspect | Max Words | Max Chars | Max Duration |
|---|---|---|---|
| Horizontal (16:9) | 8 | 50 | 3.5s |
| Vertical (9:16) | 5 | 30 | 2.5s |
Output:
- •
captions_horizontal.json: Phrases for 16:9 - •
captions_vertical.json: Phrases for 9:16 (shorter)
generate_sections.py
Uses Gemini to create meaningful section titles from the transcript.
python generate_sections.py word_transcript.json kept_segments.json -o sections.json -n 4
Output example:
{
"sections": [
{"title": "VC HYPOCRISY EXPOSED", "startMs": 0, "durationMs": 3000},
{"title": "READING VC SIGNALS", "startMs": 47500, "durationMs": 3000}
]
}
render_with_captions.py
Renders both 16:9 and 9:16 formats using Remotion with cinematic overlays.
python render_with_captions.py trimmed.mp4 captions_dir/ \ --sections sections.json \ --speaker "Speaker Name" \ --speaker-title "Title/Role" \ --speaker-center 0.5 \ -o output/
Output:
| Format | Resolution | Use Case |
|---|---|---|
*_captioned.mp4 | 1920x1080 | YouTube, LinkedIn |
*_vertical.mp4 | 1080x1920 | TikTok, Reels, Shorts |
Vertical crop: Centers on speaker position (--speaker-center 0.5 = center).
Remotion Components
The pipeline uses these cinematic Remotion components:
| Component | Style |
|---|---|
TalkingHeadClip.tsx | Main composition with all overlays |
RollingCaption.tsx | Karaoke captions with gold word highlighting |
SectionTitle.tsx | Cinematic gradient text + animated lines |
SpeakerLabel.tsx | Gradient border card with slide-in animation |
Caption styling:
- •Background: Semi-transparent black with rounded corners
- •Text: White, gold highlight on current word
- •Word spacing:
marginRight: 0.3em(CSS inline-block fix)
Section title styling:
- •Background dim overlay
- •Gradient text (white → gold)
- •Animated top/bottom lines
- •3-second duration with bounce animation
Speaker label styling:
- •Gradient border (gold → orange)
- •Large name (52px) with slide-in
- •Gradient title text
- •8-second display at video start
Complete Example
# 1. Chunk and transcribe python skills/chunk-process/smart_chunk.py raw.mp4 -o chunks/ python skills/chunk-process/mlx_transcribe.py chunks/ --batch --word-timestamps # 2. Split into sentences and analyze python skills/talking-head/sentence_split.py raw.mp4 chunks/transcript.json -o clips/ python skills/talking-head/analyze_script.py clips/clip_index.json -o topics.json # 3. Select topic and stitch (Claude Code reviews topics, user picks one) python skills/talking-head/stitch_clips.py topics.json --topic 10 -o topic10_raw.mp4 # 4. Precision trim (removes fillers, stutters) python skills/talking-head/precision_trim.py run topic10_raw.mp4 -o trimmed/ # 5. Generate captions (aspect-ratio aware) python skills/talking-head/generate_captions.py \ trimmed/word_transcript.json trimmed/kept_segments.json \ --speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \ -o trimmed/ # 6. Generate section titles (Gemini) python skills/talking-head/generate_sections.py \ trimmed/word_transcript.json trimmed/kept_segments.json \ -o trimmed/sections.json -n 4 # 7. Render both formats python skills/talking-head/render_with_captions.py \ trimmed/trimmed.mp4 trimmed/ \ --sections trimmed/sections.json \ --speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \ -o final/
Result:
- •
final/trimmed_captioned.mp4(16:9, ~140MB) - •
final/trimmed_vertical.mp4(9:16, ~178MB)
Both with rolling captions, speaker label, and cinematic section titles!