AgentSkillsCN

factorized-video-gen

基于 arxiv:2512.16371 的因子化视频生成流水线。该方案将文本转视频的过程分解为三个阶段:(1) 使用大语言模型进行推理,将原始提示改写为第一帧的字幕;(2) 通过 T2I 模型生成高质量的锚定帧;(3) 依托锚定帧进行时序合成,完成视频的生成。相比直接的 T2V 方法,该方案在视频质量上可提升 41%–53%。当用户希望获得更高质量的人工智能视频生成效果、期待生成更具构图美感与空间精准度的视频、提出因子化或锚定式视频生成的需求、希望对生成视频的初始帧进行精细控制,或明确提及“因子化视频”“锚定帧视频”“更优的视频生成”等关键词时,此方案尤为适用。

SKILL.md
--- frontmatter
name: factorized-video-gen
description: "Factorized video generation pipeline based on arxiv:2512.16371. Decomposes text-to-video into three stages: (1) LLM reasoning to rewrite prompts into first-frame captions, (2) T2I composition to generate a high-quality anchor frame, (3) anchor-conditioned temporal synthesis for video generation. Achieves 41-53% quality improvement over direct T2V. Use when the user wants higher-quality AI video generation, wants to generate a video with better composition and spatial accuracy, asks for factorized or anchor-based video generation, wants to control the initial frame of a generated video, or mentions 'factorized video', 'anchor frame video', or 'better video generation'."
allowed-tools: Bash, Read, Write, Task, Glob, Grep, WebFetch

Factorized Video Generation

3-stage pipeline that decomposes text-to-video into Reasoning, Composition, and Temporal Synthesis for dramatically better results.

Pipeline Overview

code
Prompt --> [Stage 1: Reasoning] --> [Stage 2: Composition] --> [Stage 3: Synthesis] --> Video
              LLM rewrite             T2I anchor frame         Anchor-conditioned I2V

Why factorize? Direct T2V models fail at spatial composition 40-50% of the time. By generating a high-quality anchor frame first, then animating it, quality improves 41-53% on compositional benchmarks (arxiv:2512.16371).

Quick Start

bash
python3 ~/.claude/skills/factorized-video-gen/scripts/factorized_generate.py \
  --prompt "A cat jumping from a bookshelf to the couch" \
  --style photorealistic \
  --output ./outputs/cat-jump.mp4

Or run stages individually (see Workflow below).

Requirements

  • AI_GATEWAY_API_KEY environment variable
  • generate-image skill at ~/.claude/skills/generate-image/
  • generate-video skill at ~/.claude/skills/generate-video/
  • Python 3, Node.js

Check dependencies:

bash
bash ~/.claude/skills/factorized-video-gen/scripts/setup.sh

Workflow

Stage 1: Reasoning (LLM Prompt Rewriting)

Rewrites a video prompt into a first-frame caption describing only the initial scene state. Removes motion/temporal elements, focuses on spatial layout, lighting, and composition.

bash
python3 ~/.claude/skills/factorized-video-gen/scripts/rewrite_prompt.py \
  "A cat jumping from a bookshelf to the couch" \
  --style photorealistic --aspect-ratio 16:9 \
  --output first_frame.txt

Options: --style (photorealistic, cinematic, anime, illustration, 3d-render, oil-painting, watercolor), --aspect-ratio (16:9, 9:16, 1:1), --model

Stage 2: Composition (Anchor Frame Generation)

Generates a high-quality image from the first-frame caption.

bash
python3 ~/.claude/skills/factorized-video-gen/scripts/generate_anchor.py \
  "A tabby cat perched on a tall wooden bookshelf, muscles tensed..." \
  --model google/gemini-3-pro-image-preview \
  --style photorealistic --enhance \
  --output anchor.png

Options: --model, --aspect-ratio, --style, --enhance (appends photography terms), --input-file

Stage 3: Temporal Synthesis (Video Generation)

Generates video conditioned on the anchor frame + original prompt.

bash
python3 ~/.claude/skills/factorized-video-gen/scripts/generate_video_from_anchor.py \
  --anchor anchor.png \
  --prompt "A cat jumping from a bookshelf to the couch" \
  --model google/veo-3.1-generate-preview \
  --duration 6 --aspect-ratio 16:9 \
  --output video.mp4

Options: --prompt-strategy (combined, original, motion-only), --duration, --timeout

Orchestrator Flags

The orchestrator supports skipping stages:

bash
# Skip reasoning, use prompt as-is for image generation
python3 factorized_generate.py --prompt "..." --skip-reasoning

# Skip anchor generation, use existing image
python3 factorized_generate.py --prompt "..." --skip-anchor --anchor-path my-image.png

# Save intermediate files for inspection
python3 factorized_generate.py --prompt "..." --save-intermediate --work-dir ./debug/

Model Selection

StageModelBest For
Stage 2 (T2I)google/gemini-3-pro-image-previewHighest quality (default)
Stage 2 (T2I)google/gemini-2.5-flash-imageFaster generation
Stage 2 (T2I)byteplus/seedream-4-5Artistic styles
Stage 3 (Video)google/veo-3.1-generate-previewBest quality (default)
Stage 3 (Video)google/veo-3.1-fast-generate-previewFaster video
Stage 3 (Video)openai/sora-2-proCinematic quality

For detailed model configs and benchmarks, see references/model-configs.md.

Prompt Engineering

For best results:

  • Write video prompts describing actions and motion (the pipeline handles decomposition)
  • Use --style to set visual tone consistently across stages
  • The combined prompt strategy (default) works best for most cases
  • See references/prompt-templates.md for the system prompt, examples, and tips

When to Use Factorized vs Direct T2V

ScenarioRecommendation
Complex scenes with multiple objectsFactorized (41% better composition)
Specific spatial arrangements neededFactorized (66% better controllability)
Speed is priority over qualityDirect T2V via generate-video skill
Simple single-subject motionEither works, factorized slightly better
User provides their own reference imageUse --skip-anchor --anchor-path