Multi-Camera Music Video Generator
Generate professional multi-camera music videos from audio input using AI.
Pipeline Overview
Audio Input → [ElevenLabs Transcribe + Gemini Analysis] → Aligned Storyboard → 4K Collage (16:9) → Split 9 Frames → Accurate Audio Chunks → LTX Video → Merge → Burn Titles → Final (16:9)
DEFAULT: Titles/lyrics are burned onto the final video. Skip only if user explicitly says "no titles".
CRITICAL: Timing Alignment
Problem: Gemini's audio timing is often inaccurate.
Solution: Use ElevenLabs transcription (word-level timing) as ground truth.
Pipeline:
- •ElevenLabs Transcribe → Get word-level timing (ground truth)
- •Gemini Analysis → Get musical structure, shot suggestions, LYRICS per shot
- •Claude Aligns → Match Gemini's lyrics to ElevenLabs timing
- •Create Accurate Chunks → Based on aligned timing
Alignment Process:
Gemini says: [1:30-1:33] ANGLE_2 - Singer | LYRICS: "וגם אני חולם"
ElevenLabs says: "וגם" at 92.5s, "אני" at 93.1s, "חולם" at 93.8s
Aligned timing: [1:32.5-1:34.5] (based on actual word positions)
Trust order: ElevenLabs timing > Gemini timing
Default Format: 16:9
CRITICAL: All assets use 16:9 aspect ratio by default:
- •Collage: 16:9 (e.g., 3840x2160 for 4K)
- •Each frame: 16:9 (3x3 grid, no borders!)
- •Final video: 16:9
When generating the collage, ALWAYS use -a 16:9 flag with image-generation skill.
The 3x3 grid of 16:9 frames naturally creates a 16:9 overall collage:
- •3 columns × 16 = 48
- •3 rows × 9 = 27
- •48:27 = 16:9 ✓
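The grid arithmetic above can be checked in a few lines. This is a minimal sketch; `tile_size` is a hypothetical helper name, and the 3840x2160 size is the 4K example used earlier in this document.

```python
from fractions import Fraction


def tile_size(width: int, height: int, cols: int = 3, rows: int = 3) -> tuple[int, int]:
    """Return (tile_width, tile_height) for a borderless cols x rows grid."""
    if width % cols or height % rows:
        raise ValueError("collage size must divide evenly into the grid")
    return width // cols, height // rows


collage = (3840, 2160)
tile = tile_size(*collage)

# Fraction reduces to lowest terms, so both ratios can be compared directly.
print("tile:", tile, "ratio:", Fraction(*tile))
print("collage ratio:", Fraction(*collage))
```

A collage whose size does not divide evenly by 3 would leave a pixel-wide sliver when split, which is why the helper rejects it.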
Quick Start
Claude orchestrates the entire pipeline. Provide:
- •Audio file (MP3, WAV) OR prompt to generate music
- •Optional: Duration limit (default: finds best vocal section)
- •Optional: "cheap mode" for faster/cheaper generation
Modes
| Mode | Description |
|---|---|
| Default | Best quality - Gemini for images, full LTX quality |
| Cheap | Budget mode - uses --cheap flag in image-generation (fal.ai FLUX klein), lower video quality |
When user asks for "cheap mode" or budget/quick generation:
- •Use --cheap flag with image-generation skill
- •Use --quality low with audio-to-video
- •Shorter video segments (max 10 sec per clip)
Project Structure
IMPORTANT: Each project creates a subfolder in the LOCAL project root ./projects/ (NOT inside the skill folder). This keeps all assets easily accessible for review.
projects/
└── rock-video-20260203/
    ├── audio/
    │   ├── original.mp3
    │   └── chunks/
    │       ├── shot_01.mp3
    │       └── ...
    ├── images/
    │   ├── collage.jpg          (4K 3x3)
    │   └── angles/
    │       ├── angle_1.jpg      (wide stage)
    │       ├── angle_2.jpg      (singer closeup)
    │       ├── angle_3.jpg      (guitar)
    │       ├── angle_4.jpg      (drums)
    │       ├── angle_5.jpg      (bass)
    │       ├── angle_6.jpg      (crowd)
    │       ├── angle_7.jpg      (silhouette)
    │       ├── angle_8.jpg      (low angle)
    │       └── angle_9.jpg      (behind band)
    ├── videos/
    │   ├── clips/
    │   │   └── shot_XX.mp4
    │   └── final.mp4
    └── storyboard.md
Pipeline Steps
1a. Transcribe Audio (ElevenLabs) - GROUND TRUTH TIMING ⚠️ MANDATORY FIRST
CRITICAL: This step MUST be done BEFORE audio chunking. Word-level timing is essential for:
- •Aligning shot boundaries to actual lyrics
- •Generating accurate SRT for titles overlay
- •Matching Gemini's suggested lyrics to real timestamps
Use the global transcribe skill to get word-level timing:
cd skills/transcribe/scripts
npx tsx transcribe.ts -i <audio.mp3> -o <project>/subtitles --json
This outputs:
- •subtitles - JSON with word-level timing (the source of truth)
- •subtitles.srt - SRT file for burning titles
NEVER chunk audio based on Gemini timing alone. Always cross-reference with ElevenLabs word timing.
1b. Audio Analysis (Gemini)
cd .claude/skills/audio-to-video/scripts
npx ts-node analyze_audio.ts <audio.mp3> <duration_seconds> <output.md>
Gemini listens deeply and outputs readable markdown:
- •Finds sections with clear vocals/lyrics
- •Identifies instrument highlights (guitar solos, drum fills)
- •Maps what we HEAR to what we SHOW
- •Recommends best segment for partial videos
- •Creates shot list with AUDIO REASON and LYRICS for each cut
IMPORTANT: Gemini must output LYRICS for each shot so Claude can align with ElevenLabs.
1c. Claude Aligns Timing (Two-Step Refinement)
The Workflow:
- •Gemini suggests shots with LYRICS
- •Fuzzy search (threshold 0.8-0.9) in SRT for the sentence/words
- •Find the closest occurrence to Gemini's suggested timestamp
- •Targeted JSON search for precise word timing (don't read full JSON!)
- •Refine Gemini's timing with accurate values
- •Continue with refined shot list
Step 1: Fuzzy Search in SRT (Find Approximate Location)
SRT is compact - use it to find which subtitle entry contains the target lyrics:
# Find the sentence/phrase in SRT
grep -n "מחבר וזה טוב" subtitles.srt
# Returns: line number of the match; the timestamp is on the SRT line just above it

# For repeated lyrics (chorus), find ALL occurrences:
grep -n "טה טה טה" subtitles.srt | head -5
# Pick the one closest to Gemini's suggested time
Fuzzy matching: If exact phrase not found, search for key words with partial match (80-90% similarity). Songs have repeated lyrics - always pick the closest occurrence to Gemini's time.
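When grep finds no exact match, the fuzzy step can be sketched with Python's standard `difflib`. This is an illustration under assumptions, not the skill's own implementation: `parse_srt`, `similarity`, and `find_closest` are hypothetical helpers, the sample SRT is made up, and the 0.8 threshold is the lower bound of the range given above.

```python
import re
from difflib import SequenceMatcher

SAMPLE_SRT = """\
1
00:00:07,100 --> 00:00:09,000
it connects and it is good

2
00:00:41,500 --> 00:00:43,000
it connects and it is good

3
00:01:15,000 --> 00:01:17,000
something else entirely
"""


def parse_srt(text):
    """Return a list of (start_seconds, subtitle_text) for each SRT entry."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        m = re.match(r"(\d+):(\d+):(\d+)[,.](\d+)", lines[1])
        if not m:
            continue
        h, mi, s, ms = (int(g) for g in m.groups())
        entries.append((h * 3600 + mi * 60 + s + ms / 1000, " ".join(lines[2:])))
    return entries


def similarity(phrase, text):
    """Best ratio of the phrase against any same-length word window of the text."""
    p_words, t_words = phrase.split(), text.split()
    n = len(p_words)
    if len(t_words) <= n:
        return SequenceMatcher(None, phrase, text).ratio()
    return max(
        SequenceMatcher(None, phrase, " ".join(t_words[i:i + n])).ratio()
        for i in range(len(t_words) - n + 1)
    )


def find_closest(entries, phrase, target_time, threshold=0.8):
    """Among entries that match well enough, pick the one nearest target_time."""
    hits = [e for e in entries if similarity(phrase, e[1]) >= threshold]
    if not hits:
        return None
    return min(hits, key=lambda e: abs(e[0] - target_time))


entries = parse_srt(SAMPLE_SRT)
# The phrase contains a typo ("conects") and the lyric repeats twice.
print(find_closest(entries, "conects and it is good", target_time=40.0))
```

Comparing against word windows rather than the whole subtitle line keeps a short phrase from scoring low just because the line around it is longer.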
Step 2: Precise Timing from JSON (Targeted Search)
Once you know the approximate location, search only for those specific words:
# NEVER read the full JSON - it's too long!
# Search for specific words only
python3 << 'EOF'
import json

with open('subtitles', 'r', encoding='utf-8') as f:  # The word-level JSON
    data = json.load(f)

# Target words from Gemini's lyrics for this shot
keywords = ['מחבר', 'וזה', 'טוב']
target_time = 7.0  # Gemini's approximate time

matches = []
for w in data['words']:
    # Strip whitespace and simple punctuation before comparing
    word = w['word'].strip().replace('.', '').replace(',', '')
    if word in keywords:
        matches.append((w['start'], w['end'], w['word']))

# Find occurrence closest to target_time (min() fails on an empty list)
if matches:
    closest = min(matches, key=lambda x: abs(x[0] - target_time))
    print(f"Shot starts at: {closest[0]:.3f}s")
else:
    print("No matching words found; check the keywords against the SRT")
EOF
Refinement Rules
- •Gemini timing → approximate guide
- •SRT search → find correct occurrence (especially for repeated lyrics)
- •JSON search → exact millisecond timing
- •Use JSON timing for audio chunk boundaries
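Matching single keywords can land on the wrong word when a common word appears in several lines. A stricter variant, sketched below, looks for the whole phrase as consecutive words and returns its start and end. `normalize` and `find_phrase` are hypothetical names, the JSON shape (`word`, `start`, `end`) is assumed from the snippet above, and all timings except 7.179 are made-up sample values.

```python
import string


def normalize(token):
    """Strip surrounding whitespace and ASCII punctuation from a word."""
    return token.strip().strip(string.punctuation)


def find_phrase(words, phrase, target_time):
    """Return (start, end) of the phrase occurrence nearest target_time, or None."""
    targets = [normalize(t) for t in phrase.split()]
    # Drop entries that are empty after normalization (e.g. spacing tokens).
    tokens = [w for w in words if normalize(w["word"])]
    n = len(targets)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if [normalize(w["word"]) for w in window] == targets:
            hits.append((window[0]["start"], window[-1]["end"]))
    if not hits:
        return None
    return min(hits, key=lambda h: abs(h[0] - target_time))


# Sample data for illustration only: the same line sung twice.
sample = [
    {"word": "מחבר", "start": 7.179, "end": 7.600},
    {"word": "וזה", "start": 7.650, "end": 7.900},
    {"word": "טוב,", "start": 7.950, "end": 8.400},
    {"word": "מחבר", "start": 41.200, "end": 41.600},
    {"word": "וזה", "start": 41.650, "end": 41.900},
    {"word": "טוב", "start": 41.950, "end": 42.300},
]

print(find_phrase(sample, "מחבר וזה טוב", target_time=7.0))
```

The returned pair can be used directly as the chunk's start and end, so both boundaries fall on word edges.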
Key rule: Show what we hear. Vocals = singer. Guitar solo = guitarist. Drums = drummer.
2. Generate 4K Collage (16:9)
Use image-generation skill with -a 16:9 and detailed 3x3 grid prompt:
# ALWAYS use 16:9 aspect ratio!
npx ts-node generate_poster.ts -d collage.jpg -a 16:9 -q 2K \
  "A 3x3 grid of 9 camera angles, SEAMLESS with ZERO borders between frames..."
CRITICAL prompt rules:
- •Include "SEAMLESS with ZERO borders between frames"
- •Each frame must show MID-ACTION movement (not static poses)
- •Emphasize "LIVE PERFORMANCE" feel
3. Split Collage
bash scripts/split_collage.sh <collage.jpg> <output_dir>
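The contents of split_collage.sh are not shown here. As a rough illustration of the geometry involved, the sketch below computes one ImageMagick-style crop geometry (`WxH+X+Y`) per frame. The row-major numbering (angle_1 top-left, angle_9 bottom-right) is an assumption, and `crop_geometries` is a hypothetical helper.

```python
def crop_geometries(width, height, cols=3, rows=3):
    """Return WxH+X+Y crop strings for each grid cell, row by row."""
    tile_w, tile_h = width // cols, height // rows
    return [
        f"{tile_w}x{tile_h}+{c * tile_w}+{r * tile_h}"
        for r in range(rows)
        for c in range(cols)
    ]


geometries = crop_geometries(3840, 2160)
for n, geom in enumerate(geometries, start=1):
    print(f"angle_{n}.jpg  -crop {geom}")
```

If the collage comes back at a different resolution than requested, pass its real pixel size so the crops still line up with the frame edges.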
4. Trim Audio Chunks (Using Aligned Timing)
CRITICAL: Use word-level timing from Step 1a to refine Gemini's suggested boundaries.
Before chunking:
- •Look at Gemini's shot list with LYRICS
- •Find those exact words in ElevenLabs JSON
- •Adjust start/end times to word boundaries
# Use ALIGNED timing, not raw Gemini timing
ffmpeg -i audio.mp3 -ss <aligned_start> -t <duration> -y chunk_N.mp3
Example alignment:
- •Gemini says: Shot starts at 0:07 with lyrics "מחבר וזה"
- •ElevenLabs shows: "מחבר" starts at 7.179
- •Use 7.179 as the real start time
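Once every shot has aligned boundaries, the ffmpeg calls can be generated rather than typed by hand. A minimal sketch, assuming shots are (start, end) pairs in seconds; `chunk_command` is a hypothetical helper, and the shot times other than 7.179 are sample values.

```python
import shlex


def chunk_command(source, start, end, out_path):
    """Build the ffmpeg argument list for one aligned audio chunk."""
    if end <= start:
        raise ValueError("end must be after start")
    return [
        "ffmpeg", "-i", source,
        "-ss", f"{start:.3f}",       # aligned start from word-level timing
        "-t", f"{end - start:.3f}",  # duration, not end time
        "-y", out_path,
    ]


# Sample aligned shots; each end equals the next start so no audio is skipped.
shots = [(7.179, 10.420), (10.420, 13.900)]
for n, (start, end) in enumerate(shots, start=1):
    cmd = chunk_command("audio/original.mp3", start, end, f"audio/chunks/shot_{n:02d}.mp3")
    print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd, check=True)`; `shlex.join` is used here only to print a copy-pasteable command.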
5. Generate Video Clips
For each shot, use audio-to-video:
npx ts-node generate.ts --audio chunk.mp3 --image angle_X.jpg -d clip.mp4 "Description"
Limit: LTX max 481 frames (~19 sec at 25fps).
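A quick pre-flight check avoids sending a chunk that exceeds the limit. The sketch below assumes the frame count is simply duration times 25 fps, rounded up; `frames_for`, `fits_ltx`, and `split_shot` are hypothetical helpers, and splitting into equal parts is one possible policy, not something the pipeline prescribes.

```python
import math

MAX_FRAMES = 481  # LTX limit stated above
FPS = 25


def frames_for(duration_s, fps=FPS):
    """Frames needed to cover duration_s seconds."""
    return math.ceil(duration_s * fps)


def fits_ltx(duration_s):
    return frames_for(duration_s) <= MAX_FRAMES


def split_shot(start, end, max_frames=MAX_FRAMES, fps=FPS):
    """Split [start, end] into equal parts that each stay under the limit."""
    max_len = max_frames / fps
    duration = end - start
    parts = max(1, math.ceil(duration / max_len))
    step = duration / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


print(fits_ltx(4), fits_ltx(20))
print(split_shot(0, 40))
```

With the recommended 2-5 second shots the check always passes; it matters mainly for long instrumental passages.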
6. Merge Clips (Smooth Audio)
CRITICAL: Don't concatenate audio chunks - use continuous original audio to avoid choppy sound.
# Step 1: Extract continuous audio segment from original
ffmpeg -i original.mp3 -ss 72.5 -t 29.5 -y segment_audio.mp3
# Step 2: Create concat list for videos
cat > concat.txt << EOF
file 'clips/shot_01.mp4'
file 'clips/shot_02.mp4'
...
EOF
# Step 3: Concatenate videos WITHOUT audio
ffmpeg -f concat -safe 0 -i concat.txt -an -c:v copy video_only.mp4
# Step 4: Mux video with continuous audio + FADE OUT (smooth endings)
# IMPORTANT: Always add 2-second fade out for segment videos
DURATION=$(ffprobe -v error -show_entries format=duration -of csv=p=0 video_only.mp4)
FADE_START=$(echo "$DURATION - 2" | bc)
ffmpeg -i video_only.mp4 -i segment_audio.mp3 \
-vf "fade=t=out:st=${FADE_START}:d=2" \
-af "afade=t=out:st=${FADE_START}:d=2" \
-c:v libx264 -c:a aac -shortest final.mp4
FADE OUT is CRITICAL for segment videos - prevents abrupt endings.
This approach ensures:
- •Video clips sync to their individual audio during generation
- •Final merge uses ONE continuous audio track (no seams)
- •No choppy sound from audio chunk boundaries
- •Smooth 2-second fade out at the end
7. Add Lyrics Overlay (DEFAULT) - Use lyrics-overlay Skill
This step is ON by default. Skip only if user explicitly says "no titles".
Style Selection Guide
Choose style based on song genre, mood, and energy:
| Style | Component | Best For | When to Use |
|---|---|---|---|
| karaoke | LyricsOverlay | Pop, dance, singalong | Default - energetic, accessible |
| minimal | LyricsOverlay | Ballads, acoustic | Clean, don't distract from visuals |
| fade | LyricsOverlay | Narration, spoken word | Gentle, smooth |
| neon | LyricsOverlayNeon | Electronic, EDM, synthwave | Cyberpunk, futuristic, high-energy |
| cinematic | LyricsOverlayCinematic | Epic, rock, trailers | CENTER - dramatic, powerful, movie-like |
| bounce | LyricsOverlayBounce | Kids, fun, upbeat | Playful, colorful, joyful |
| typewriter | LyricsOverlayTypewriter | Indie, retro, storytelling | Nostalgic, intimate, artistic |
Decision Logic:
IF genre == electronic/EDM → neon
ELSE IF genre == rock/epic/powerful → cinematic (CENTER, large text)
ELSE IF genre == kids/fun → bounce
ELSE IF genre == indie/retro → typewriter
ELSE IF energy == low (ballad) → minimal
ELSE → karaoke (default)
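The same decision logic as a small function, for illustration. `choose_style` and its string arguments are hypothetical; the genre keywords are taken from the rules above and checked in the same order.

```python
def choose_style(genre: str, energy: str = "high") -> str:
    """Map genre and energy to a lyrics-overlay style name."""
    g = genre.strip().lower()
    if g in {"electronic", "edm"}:
        return "neon"
    if g in {"rock", "epic", "powerful"}:
        return "cinematic"
    if g in {"kids", "fun"}:
        return "bounce"
    if g in {"indie", "retro"}:
        return "typewriter"
    if energy.strip().lower() == "low":
        return "minimal"
    return "karaoke"  # default


for genre, energy in [("EDM", "high"), ("rock", "high"), ("folk", "low"), ("pop", "high")]:
    print(genre, energy, "->", choose_style(genre, energy))
```

Note that the fade style from the table is never reached by these rules; it has to be picked explicitly, for example for narration or spoken word.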
Quick Usage
- •Copy video + subtitles JSON to Remotion public folder:
cp videos/final.mp4 ~/remotion-assistant/public/videos/<project>.mp4
cp subtitles ~/remotion-assistant/public/lyrics/<project>.json
- •Create temporary composition in Remotion:
// ~/remotion-assistant/src/compositions/TempLyrics.tsx
import React from 'react'; // needed for the React.FC type
import { LyricsOverlayCinematic, parseElevenLabsTranscript } from './LyricsOverlayCinematic';
import { staticFile } from 'remotion';

const transcript = require('../../public/lyrics/<project>.json');

export const TempLyrics: React.FC = () => {
  // Group word-level timing into display lines
  const lyrics = parseElevenLabsTranscript(transcript, {
    maxWordsPerLine: 5,
    lineGapThreshold: 0.6,
  });

  return (
    <LyricsOverlayCinematic
      videoSrc={staticFile('videos/<project>.mp4')}
      lyrics={lyrics}
      fontSize={90}
      accentColor="#FF0000" // Red for rock
      useOffthreadVideo={true}
    />
  );
};
- •Register in Root.tsx, render, then cleanup:
cd ~/remotion-assistant
npx remotion render TempLyrics out/final_with_lyrics.mp4

# CLEANUP: Remove temp composition after render
rm src/compositions/TempLyrics.tsx
# Remove import and Composition from Root.tsx (manual or sed)
IMPORTANT: Keep Remotion clean - only template components stay permanently. After each project render, delete the project-specific composition file.
For Segment Videos: Offset Timing
If video is extracted from longer song:
import { shiftLyricsTiming } from '../utils/lyricsParser';
const offsetLyrics = shiftLyricsTiming(lyrics, -105); // Shift back by segment start
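The internals of shiftLyricsTiming are not shown here. The Python sketch below only illustrates what a segment offset amounts to on word-level timing: every timestamp moves by the segment start, and words that fall outside the segment are dropped. `shift_words` and its `duration` parameter are hypothetical, and the sample words are made up.

```python
def shift_words(words, offset, duration=None):
    """Shift word timings by offset seconds and keep only words inside the segment."""
    shifted = []
    for w in words:
        start, end = w["start"] + offset, w["end"] + offset
        if end <= 0:
            continue  # word finished before the segment begins
        if duration is not None and start >= duration:
            continue  # word begins after the segment ends
        shifted.append({**w, "start": max(0.0, start), "end": end})
    return shifted


# Segment starts at 105s in the full song and lasts 30s.
words = [
    {"word": "before", "start": 100.0, "end": 101.0},
    {"word": "inside", "start": 106.0, "end": 107.0},
    {"word": "after", "start": 140.0, "end": 141.0},
]
print(shift_words(words, offset=-105, duration=30))
```

The offset should be the same aligned start time that was used to cut the continuous audio segment, otherwise the titles drift against the vocals.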
See full documentation: .claude/skills/lyrics-overlay/SKILL.md
Storyboard Output Format
Gemini outputs readable markdown (not JSON) with VIDEO PROMPT for each shot:
# MUSIC VIDEO STORYBOARD

## Audio Events Timeline

**Vocals:**
- 0:32-0:48 - Chorus vocals, strong declaration of freedom

**Guitar highlights:**
- 1:07-1:24 - Guitar solo

## RECOMMENDED SEGMENT (for partial video)
- **Start at:** 0:32 (chorus begins)
- **Why:** Clear vocals, high energy

## Shot List
Format: [START-END] ANGLE_X - Description | AUDIO REASON | LYRICS: "lyrics" | PROMPT: "video prompt"

[0:00-0:04] ANGLE_2 - Singer close-up | Vocals start | LYRICS: "וגם אני" | PROMPT: "LIVE CONCERT: Singer closeup, SINGING with mouth moving, veins in neck, intense emotion"
[0:04-0:06] ANGLE_6 - Crowd shot | Energy peak | PROMPT: "LIVE CONCERT: Crowd jumping, hands in air, stage lights pulsing"
...
PROMPT field is CRITICAL - Used directly by audio-to-video for each clip generation.
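Because the shot list follows a fixed line format, it can be parsed into structured records before alignment and clip generation. A minimal sketch, assuming one shot per line, M:SS timestamps, and no `|` character inside the prompt text; `parse_shot` and `to_seconds` are hypothetical helpers.

```python
import re

SHOT_RE = re.compile(
    r"^\[(?P<start>[\d:.]+)-(?P<end>[\d:.]+)\]\s+ANGLE_(?P<angle>\d)\s+-\s+(?P<rest>.+)$"
)


def to_seconds(stamp):
    """Convert 'M:SS' or 'M:SS.s' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def parse_shot(line):
    """Parse one shot-list line into a dict, or return None for other lines."""
    m = SHOT_RE.match(line.strip())
    if not m:
        return None
    fields = [f.strip() for f in m["rest"].split("|")]
    shot = {
        "start": to_seconds(m["start"]),
        "end": to_seconds(m["end"]),
        "angle": int(m["angle"]),
        "description": fields[0],
        "audio_reason": fields[1] if len(fields) > 1 else "",
        "lyrics": None,  # optional: instrumental shots have no LYRICS field
        "prompt": None,
    }
    for field in fields[2:]:
        # Split on the first colon only, since prompts contain colons themselves.
        key, _, value = field.partition(":")
        if key.strip().upper() in ("LYRICS", "PROMPT"):
            shot[key.strip().lower()] = value.strip().strip('"')
    return shot


line = '[0:04-0:06] ANGLE_6 - Crowd shot | Energy peak | PROMPT: "LIVE CONCERT: Crowd jumping"'
print(parse_shot(line))
```

Shots where `lyrics` is None have nothing to align against the transcript, so their boundaries can only come from the neighbouring aligned shots or from Gemini's timing.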
Limits
- •LTX max frames: 481 (~19 sec at 25fps)
- •9 camera angles from single collage
- •Recommended shot length: 2-5 seconds
Dependencies
- •@google/genai - Gemini audio analysis
- •fal-ai - LTX video generation
- •ffmpeg - Audio/video processing
- •imagemagick - Image splitting
- •Global image-generation skill
- •Project-scoped audio-to-video skill