AgentSkillsCN

spatial-intelligence-world-modeling

构建一套框架,用于设计能够推理物理3D世界的AI系统。当您希望超越基于文本的LLM,转而面向机器人技术、沉浸式3D环境、虚拟制作,或物理世界模拟时,可使用此框架。

SKILL.md
--- frontmatter
name: spatial-intelligence-world-modeling
description: A framework for designing AI systems that reason about the physical 3D world. Use this when moving beyond text-based LLMs to build for robotics, immersive 3D environments, virtual production, or physical-world simulations.

Spatial Intelligence and World Modeling

Spatial intelligence is the ability to create, reason, and interact within 3D and 4D worlds. While Large Language Models (LLMs) focus on conversational intelligence, World Models provide the foundation for "Embodied AI"—machines that can understand spatial relationships, plan movements, and interact with objects in a physical environment.

The Core Ingredients of a World Model

To build or leverage a world model effectively, you must align three foundational pillars:

  1. 3D Geometry (Structure): Unlike video models that generate flat 2D pixels, world models generate 3D structures (meshes or point clouds) that can be navigated from any angle.
  2. Interaction (Agency): The model must allow for "touch" or "navigation." Users or agents (robots) should be able to manipulate objects or move through the scene with consistent physics.
  3. Simulation to Reality: Use the model to create "synthetic data" for training physical systems where real-world data is too expensive or dangerous to collect.

Workflow: Applying Spatial Intelligence to Product Design

1. Identify the Dimensionality

Determine if your problem is a "Cave Allegory" problem (passive 2D observation) or a "Spatial" problem (active 3D reasoning).

  • Language/2D: Summarizing text, generating a single image, or identifying objects in a photo.
  • Spatial: Planning a robot's path through a kitchen, designing a movie set with camera trajectories, or creating a virtual therapy environment.

2. Define the Actionable Output

Determine what the "agent" needs to do within the world.

  • Navigation: Does it need to move from Point A to Point B without hitting obstacles?
  • Manipulation: Does it need to pick up, rotate, or change an object?
  • Creation: Does it need to generate a new, structurally sound environment from a text prompt (Prompt-to-World)?

3. Gap Analysis: Brain vs. Body

Recognize that for physical applications (like robotics), the "Brain" (the world model) is only half the battle.

  • Brain: Reasoning, spatial awareness, and planning.
  • Body: Physical hardware, supply chains, and sensors.
  • The Bridge: Use the world model to simulate thousands of scenarios to train the "Brain" before deploying it into the "Body."

Implementation Examples

Example 1: Virtual Production for VFX

  • Context: A director needs to shoot a scene in a specific, non-existent environment (e.g., a "dystopian Shire").
  • Input: Text prompt and reference images.
  • Application: Generate a 3D world model rather than a 2D video. Export the 3D mesh into a game engine.
  • Output: A navigable set where the director can move a virtual camera in 3D space to find the perfect angle, reducing production time by up to 40x.

Example 2: Robotics Simulation for Home Cleaning

  • Context: A robotics company needs to train a robot to tidy a messy room, but lacks thousands of real-world "messy room" videos with labeled actions.
  • Input: Parameters for "messy," "clean," and specific objects (e.g., toys, books).
  • Application: Use a world model to generate infinite synthetic variations of messy rooms.
  • Output: A training environment where the robot can practice millions of grasp and move attempts safely in simulation before moving to a physical prototype.

Common Pitfalls

  • Confusing Video Generation with World Modeling: Generating a 2D video of a cat is not the same as understanding the cat's 3D volume. If you cannot "walk around" the object or predict its hidden side, it is not a world model.
  • Ignoring the "Bitter Lesson": Attempting to build complex, hand-coded rules for spatial reasoning instead of using large-scale data and compute. Simple models with massive data usually win.
  • The Objective Function Mismatch: In language, tokens are the goal. In spatial intelligence, action is the goal. Do not assume a model that "talks" well about a room can actually "navigate" the room.
  • Removing the Human: Designing AI as a replacement rather than an augmentation. Always include "human agency" hooks (e.g., allowing a designer to tweak the 3D output manually) to ensure the result has dignity and utility.