audio-to-video

Generates video from audio and an image using fal.ai LTX-2. Suitable for talking heads, lip sync, and audio-driven video production.

SKILL.md
--- frontmatter
name: audio-to-video
description: "Generate video from audio + image using fal.ai LTX-2. Use for: talking head, lip sync, audio-driven video."
allowed-tools: Bash, Read, Write

Audio to Video

Generate video from audio + optional image using fal.ai LTX-2 19B.

Usage

```bash
cd .claude/skills/audio-to-video/scripts
npx ts-node generate.ts \
  --audio "/path/to/speech.mp3" \
  --image "/path/to/face.png" \
  -d /tmp/output.mp4 \
  "A woman speaks to camera, natural lighting"
```

Required Flags

| Flag | Description |
| --- | --- |
| --audio, -a | Audio file (mp3, wav, ogg, m4a, aac) |
| -d, --destination | Output video path |

Optional Flags

| Flag | Default | Description |
| --- | --- | --- |
| --image, -i | - | Starting frame image |
| --end-image | - | Ending frame image |
| --size, -s | landscape_16_9 | Video size |
| --fps | 25 | Frames per second |
| --quality | high | low, medium, high, maximum |
| --camera | none | dolly_in, dolly_out, jib_up, jib_down, static |
| --no-match-length | - | Don't auto-match video to audio duration |

Video Sizes

landscape_16_9, landscape_4_3, portrait_16_9, portrait_4_3, square_hd, square, auto

Limits

  • Max frames: 481
  • Max duration at 25fps: ~19 seconds
  • Max duration at 24fps: ~20 seconds
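
The durations above follow from dividing the frame cap by the frame rate. A minimal sketch of that arithmetic is below; the helper names are hypothetical and not part of generate.ts, and it assumes the frame count is simply the audio length times fps, rounded up, which the script itself may compute differently.

```typescript
// Frame cap taken from the Limits list above.
const MAX_FRAMES = 481;

// Hypothetical helper: longest clip, in seconds, that fits the frame cap.
function maxDurationSeconds(fps: number, maxFrames: number = MAX_FRAMES): number {
  if (!Number.isFinite(fps) || fps <= 0) {
    throw new RangeError("fps must be a positive number");
  }
  return maxFrames / fps;
}

// Hypothetical helper: true when the audio fits in a single clip.
// Assumes frames = ceil(seconds * fps); the real script may round differently.
function fitsInOneClip(audioSeconds: number, fps: number): boolean {
  const framesNeeded = Math.ceil(audioSeconds * fps);
  return framesNeeded <= MAX_FRAMES;
}

console.log(maxDurationSeconds(25).toFixed(2)); // seconds available at 25 fps
console.log(fitsInOneClip(20, 25)); // 500 frames needed, over the cap
```

Lowering --fps buys a little extra duration per clip, since the cap is on frames rather than seconds.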

For longer videos, generate multiple clips and concatenate with ffmpeg.
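
One way to plan the multi-clip approach is sketched below. It splits a long audio track into fixed-length ranges and builds the list file that ffmpeg's concat demuxer reads. The function names and the /tmp paths are hypothetical, and cutting the audio into those ranges and running generate.ts once per range is left to the caller.

```typescript
// A time range of the source audio to render as one clip.
interface ClipRange {
  start: number;    // offset into the audio, in seconds
  duration: number; // length of this clip, in seconds
}

// Hypothetical helper: split totalSeconds into ranges no longer than maxClipSeconds.
function planClips(totalSeconds: number, maxClipSeconds: number): ClipRange[] {
  if (totalSeconds <= 0 || maxClipSeconds <= 0) {
    throw new RangeError("durations must be positive");
  }
  const count = Math.ceil(totalSeconds / maxClipSeconds);
  const ranges: ClipRange[] = [];
  for (let i = 0; i < count; i++) {
    const start = i * maxClipSeconds;
    ranges.push({ start, duration: Math.min(maxClipSeconds, totalSeconds - start) });
  }
  return ranges;
}

// Hypothetical helper: build the text of a concat demuxer list file.
// Single quotes inside a path are escaped as '\'' per the concat list syntax.
function concatListContent(clipPaths: string[]): string {
  return clipPaths
    .map((p) => `file '${p.replace(/'/g, "'\\''")}'`)
    .join("\n") + "\n";
}

// Example: a 45-second track with 19-second clips needs three ranges.
console.log(planClips(45, 19));
console.log(concatListContent(["/tmp/clip_0.mp4", "/tmp/clip_1.mp4", "/tmp/clip_2.mp4"]));
// Write the list to a file, then join the clips without re-encoding:
//   ffmpeg -f concat -safe 0 -i list.txt -c copy /tmp/full.mp4
```

Fixed-length cuts can land mid-word, so for speech it may be worth moving each boundary to a nearby pause before generating the clips.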

Pricing

~$0.001 per megapixel, counted across all frames (width x height x frames). Example: 1280x720 at 121 frames is about 111.5 megapixels, or ~$0.11.
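
The example figure can be reproduced with a short estimate. This is a sketch only: the function name is hypothetical, and the default rate is the approximate value quoted above, so actual billing may differ.

```typescript
// Approximate rate quoted above, in USD per megapixel (an estimate, not a billing guarantee).
const APPROX_USD_PER_MEGAPIXEL = 0.001;

// Hypothetical helper: estimate cost from resolution and frame count.
function estimateCostUsd(
  width: number,
  height: number,
  frames: number,
  usdPerMegapixel: number = APPROX_USD_PER_MEGAPIXEL,
): number {
  if (width <= 0 || height <= 0 || frames <= 0) {
    throw new RangeError("width, height and frames must be positive");
  }
  const megapixels = (width * height * frames) / 1_000_000;
  return megapixels * usdPerMegapixel;
}

// 1280x720 at 121 frames, the example from the Pricing section.
console.log(estimateCostUsd(1280, 720, 121).toFixed(2));
```

Because cost scales with the frame count, a clip at the 481-frame cap costs roughly four times the 121-frame example at the same resolution.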

API Key

Uses FAL_KEY from ~/.claude/skills/image-generation/scripts/.env