Factorized Video Generation
3-stage pipeline that decomposes text-to-video into Reasoning, Composition, and Temporal Synthesis for dramatically better results.
Pipeline Overview
Prompt --> [Stage 1: Reasoning] --> [Stage 2: Composition] --> [Stage 3: Synthesis] --> Video
LLM rewrite T2I anchor frame Anchor-conditioned I2V
Why factorize? Direct T2V models fail at spatial composition 40-50% of the time. By generating a high-quality anchor frame first, then animating it, quality improves 41-53% on compositional benchmarks (arxiv:2512.16371).
Quick Start
python3 ~/.claude/skills/factorized-video-gen/scripts/factorized_generate.py \ --prompt "A cat jumping from a bookshelf to the couch" \ --style photorealistic \ --output ./outputs/cat-jump.mp4
Or run stages individually (see Workflow below).
Requirements
- •AI_GATEWAY_API_KEY environment variable
- •generate-image skill at
~/.claude/skills/generate-image/ - •generate-video skill at
~/.claude/skills/generate-video/ - •Python 3, Node.js
Check dependencies:
bash ~/.claude/skills/factorized-video-gen/scripts/setup.sh
Workflow
Stage 1: Reasoning (LLM Prompt Rewriting)
Rewrites a video prompt into a first-frame caption describing only the initial scene state. Removes motion/temporal elements, focuses on spatial layout, lighting, and composition.
python3 ~/.claude/skills/factorized-video-gen/scripts/rewrite_prompt.py \ "A cat jumping from a bookshelf to the couch" \ --style photorealistic --aspect-ratio 16:9 \ --output first_frame.txt
Options: --style (photorealistic, cinematic, anime, illustration, 3d-render, oil-painting, watercolor), --aspect-ratio (16:9, 9:16, 1:1), --model
Stage 2: Composition (Anchor Frame Generation)
Generates a high-quality image from the first-frame caption.
python3 ~/.claude/skills/factorized-video-gen/scripts/generate_anchor.py \ "A tabby cat perched on a tall wooden bookshelf, muscles tensed..." \ --model google/gemini-3-pro-image-preview \ --style photorealistic --enhance \ --output anchor.png
Options: --model, --aspect-ratio, --style, --enhance (appends photography terms), --input-file
Stage 3: Temporal Synthesis (Video Generation)
Generates video conditioned on the anchor frame + original prompt.
python3 ~/.claude/skills/factorized-video-gen/scripts/generate_video_from_anchor.py \ --anchor anchor.png \ --prompt "A cat jumping from a bookshelf to the couch" \ --model google/veo-3.1-generate-preview \ --duration 6 --aspect-ratio 16:9 \ --output video.mp4
Options: --prompt-strategy (combined, original, motion-only), --duration, --timeout
Orchestrator Flags
The orchestrator supports skipping stages:
# Skip reasoning, use prompt as-is for image generation python3 factorized_generate.py --prompt "..." --skip-reasoning # Skip anchor generation, use existing image python3 factorized_generate.py --prompt "..." --skip-anchor --anchor-path my-image.png # Save intermediate files for inspection python3 factorized_generate.py --prompt "..." --save-intermediate --work-dir ./debug/
Model Selection
| Stage | Model | Best For |
|---|---|---|
| Stage 2 (T2I) | google/gemini-3-pro-image-preview | Highest quality (default) |
| Stage 2 (T2I) | google/gemini-2.5-flash-image | Faster generation |
| Stage 2 (T2I) | byteplus/seedream-4-5 | Artistic styles |
| Stage 3 (Video) | google/veo-3.1-generate-preview | Best quality (default) |
| Stage 3 (Video) | google/veo-3.1-fast-generate-preview | Faster video |
| Stage 3 (Video) | openai/sora-2-pro | Cinematic quality |
For detailed model configs and benchmarks, see references/model-configs.md.
Prompt Engineering
For best results:
- •Write video prompts describing actions and motion (the pipeline handles decomposition)
- •Use
--styleto set visual tone consistently across stages - •The
combinedprompt strategy (default) works best for most cases - •See
references/prompt-templates.mdfor the system prompt, examples, and tips
When to Use Factorized vs Direct T2V
| Scenario | Recommendation |
|---|---|
| Complex scenes with multiple objects | Factorized (41% better composition) |
| Specific spatial arrangements needed | Factorized (66% better controllability) |
| Speed is priority over quality | Direct T2V via generate-video skill |
| Simple single-subject motion | Either works, factorized slightly better |
| User provides their own reference image | Use --skip-anchor --anchor-path |