Factorized Video Generation

3-stage pipeline that decomposes text-to-video into Reasoning, Composition, and Temporal Synthesis for dramatically better results.

Pipeline Overview

code

Prompt --> [Stage 1: Reasoning] --> [Stage 2: Composition] --> [Stage 3: Synthesis] --> Video
              LLM rewrite             T2I anchor frame         Anchor-conditioned I2V

Why factorize? Direct T2V models fail at spatial composition 40-50% of the time. By generating a high-quality anchor frame first, then animating it, quality improves 41-53% on compositional benchmarks (arxiv:2512.16371).

Quick Start

bash

python3 ~/.claude/skills/factorized-video-gen/scripts/factorized_generate.py \
  --prompt "A cat jumping from a bookshelf to the couch" \
  --style photorealistic \
  --output ./outputs/cat-jump.mp4

Or run stages individually (see Workflow below).

Requirements

•AI_GATEWAY_API_KEY environment variable
•generate-image skill at ~/.claude/skills/generate-image/
•generate-video skill at ~/.claude/skills/generate-video/
•Python 3, Node.js

Check dependencies:

bash

bash ~/.claude/skills/factorized-video-gen/scripts/setup.sh

Workflow

Stage 1: Reasoning (LLM Prompt Rewriting)

Rewrites a video prompt into a first-frame caption describing only the initial scene state. Removes motion/temporal elements, focuses on spatial layout, lighting, and composition.

bash

python3 ~/.claude/skills/factorized-video-gen/scripts/rewrite_prompt.py \
  "A cat jumping from a bookshelf to the couch" \
  --style photorealistic --aspect-ratio 16:9 \
  --output first_frame.txt

Options: --style (photorealistic, cinematic, anime, illustration, 3d-render, oil-painting, watercolor), --aspect-ratio (16:9, 9:16, 1:1), --model

Stage 2: Composition (Anchor Frame Generation)

Generates a high-quality image from the first-frame caption.

bash

python3 ~/.claude/skills/factorized-video-gen/scripts/generate_anchor.py \
  "A tabby cat perched on a tall wooden bookshelf, muscles tensed..." \
  --model google/gemini-3-pro-image-preview \
  --style photorealistic --enhance \
  --output anchor.png

Options: --model, --aspect-ratio, --style, --enhance (appends photography terms), --input-file

Stage 3: Temporal Synthesis (Video Generation)

Generates video conditioned on the anchor frame + original prompt.

bash

python3 ~/.claude/skills/factorized-video-gen/scripts/generate_video_from_anchor.py \
  --anchor anchor.png \
  --prompt "A cat jumping from a bookshelf to the couch" \
  --model google/veo-3.1-generate-preview \
  --duration 6 --aspect-ratio 16:9 \
  --output video.mp4

Options: --prompt-strategy (combined, original, motion-only), --duration, --timeout

Orchestrator Flags

The orchestrator supports skipping stages:

bash

# Skip reasoning, use prompt as-is for image generation
python3 factorized_generate.py --prompt "..." --skip-reasoning

# Skip anchor generation, use existing image
python3 factorized_generate.py --prompt "..." --skip-anchor --anchor-path my-image.png

# Save intermediate files for inspection
python3 factorized_generate.py --prompt "..." --save-intermediate --work-dir ./debug/

Model Selection

Stage	Model	Best For
Stage 2 (T2I)	`google/gemini-3-pro-image-preview`	Highest quality (default)
Stage 2 (T2I)	`google/gemini-2.5-flash-image`	Faster generation
Stage 2 (T2I)	`byteplus/seedream-4-5`	Artistic styles
Stage 3 (Video)	`google/veo-3.1-generate-preview`	Best quality (default)
Stage 3 (Video)	`google/veo-3.1-fast-generate-preview`	Faster video
Stage 3 (Video)	`openai/sora-2-pro`	Cinematic quality

For detailed model configs and benchmarks, see references/model-configs.md.

Prompt Engineering

For best results:

•Write video prompts describing actions and motion (the pipeline handles decomposition)
•Use --style to set visual tone consistently across stages
•The combined prompt strategy (default) works best for most cases
•See references/prompt-templates.md for the system prompt, examples, and tips

When to Use Factorized vs Direct T2V

Scenario	Recommendation
Complex scenes with multiple objects	Factorized (41% better composition)
Specific spatial arrangements needed	Factorized (66% better controllability)
Speed is priority over quality	Direct T2V via generate-video skill
Simple single-subject motion	Either works, factorized slightly better
User provides their own reference image	Use `--skip-anchor --anchor-path`