Token-Efficient Delegation

⛔ CRITICAL: No Claude Subagents

HARD STOP: Never use model="haiku", model="sonnet", or model="opus" in Task tool calls.

•Always: omit the model parameter (routes to MiniMax/GLM)
•Or explicitly: use the general-purpose agent (uses the cheaper default)
•Enforced in: .claude/settings.local.json deny rules

Core Rule: Claude = reasoning/decisions, MiniMax/GLM = research/exploration

The Three Goals (Balance All Three)

Goal	Target	Strategy
Quality	Keep Opus-level output	Claude synthesizes, subagents gather
Cost	Minimize Claude tokens	Delegate exploration, batch analysis
Speed	Fast execution	Parallel subagents, right model for task

Quick Decision Tree

Before ANY task, ask:

•Research/exploration? → Delegate to MiniMax (fast) or GLM (creative)
•Web search needed? → Delegate to MiniMax
•Reading specific known file? → Claude direct (Read tool)
•Analyzing 1-3 images? → Claude direct
•Analyzing 10+ images? → Parallel subagents (GLM for quality, MiniMax for speed)
•Complex reasoning/code? → Claude direct
•Broad codebase search? → Delegate to general-purpose agent (cheaper default)
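
The checklist above can be sketched as a small routing function. This is only an illustration: `TaskProfile`, its fields, and the returned labels are hypothetical names, not part of any tool API. The checklist does not say what to do with 4-9 images, so the sketch assumes Claude handles them directly.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical task description; field names are illustrative only."""
    kind: str  # "research", "web_search", "read_file", "images", "reasoning", "codebase_search"
    image_count: int = 0
    needs_creativity: bool = False
    prefer_quality: bool = False


def route(task: TaskProfile) -> str:
    """Return a routing label that follows the decision tree."""
    if task.kind == "research":
        return "GLM" if task.needs_creativity else "MiniMax"
    if task.kind == "web_search":
        return "MiniMax"
    if task.kind == "images" and task.image_count >= 10:
        return "parallel GLM" if task.prefer_quality else "parallel MiniMax"
    if task.kind == "codebase_search":
        return "general-purpose agent"
    # Known-file reads, small image sets, complex reasoning/code, and
    # anything unclassified stay with Claude (assumption for the gaps).
    return "Claude direct"


print(route(TaskProfile("research", needs_creativity=True)))
print(route(TaskProfile("images", image_count=12)))
```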

Model Characteristics (Official Guidance)

Model	Cost	Speed	Best For	Source
Claude Opus	50x	Fast	Complex reasoning, architecture, final decisions	Anthropic Docs
Claude Sonnet	10x	Fast	Coding, medium reasoning, code review	Anthropic Docs
Claude Haiku	2x	Fastest	Exploration, classification, simple tasks	Anthropic Docs
MiniMax M2.1	1x	Fast	Instruction-following, agents, multi-lang coding	MiniMax Docs
GLM-4.7	1x	Medium	Creative problem-solving, math, tool use	Z.AI Docs
GLM-4V	1x	Medium	Vision, 66K multimodal context, GUI tasks	Z.AI Docs

When to Use Each Model

MiniMax M2.1 - Fast Instruction-Follower

Source: MiniMax Official Docs

•Quick web searches
•Structured data extraction
•File reading with specific questions
•Tasks with clear instructions
•8% of Claude's cost at 2x the speed
•Preserve reasoning chains: return the full response, including the thinking field

GLM-4.7 - Creative Problem-Solver

Source: Z.AI GLM-4.7 Docs

•Complex problem exploration
•Mathematical reasoning (95.7% on AIME 2025)
•Creative solutions needed
•Tool use orchestration (42.8%, ties GPT-5.1)
•Parallel batch tasks where latency doesn't matter
•Use thinking mode for complex tasks; disable it for simple queries

GLM-4V - Vision Tasks

Source: GLM-V GitHub

•66K-token multimodal context
•Screenshot/GUI analysis
•Document parsing (PDFs, charts)
•Video understanding with timestamps
•Multi-image analysis

Claude (Haiku/Sonnet/Opus) - Reasoning & Decisions

Source: Anthropic Subagents Docs

•Haiku: Fast exploration, read-only tasks, classification
•Sonnet: Coding, medium-complexity reasoning
•Opus: Architecture decisions, synthesis, complex reasoning
•Use subagents to isolate verbose output from main context

Quality Preservation Patterns

Pattern 1: Orchestrator-Worker Split

Claude (Orchestrator) → Spawns subagents → Subagents research → Claude synthesizes

Quality preserved because: Claude makes all decisions based on subagent research.

Pattern 2: Parallel Research + Single Synthesis

5 MiniMax agents research in parallel → Claude receives summaries → Claude decides

Quality preserved because: Multiple perspectives, Claude does final reasoning.

Pattern 3: Two-Stage Review

Subagent implements → Reviewer subagent checks → Claude integrates

Quality preserved because: Cross-validation catches errors before Claude acts.

Pattern 4: Specific Prompts

Source: MiniMax Best Practices

"State the 'why' behind your request - when models understand purpose, they provide more accurate answers."

BAD:  "Research authentication"
GOOD: "Find how Godot 4.5 handles input authentication. I need to implement player login. Return: method name, file location, example usage."
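
One way to make the GOOD shape hard to skip is a small prompt builder that refuses to produce a prompt without a purpose and a return format. `build_prompt` is a hypothetical helper written for this sketch, not part of any library.

```python
def build_prompt(question: str, why: str, return_fields: list[str]) -> str:
    """Compose a delegation prompt: the question, the purpose, the expected return shape."""
    if not why.strip():
        raise ValueError("state the purpose behind the request")
    if not return_fields:
        raise ValueError("list at least one field to return")
    fields = ", ".join(return_fields)
    return f"{question.rstrip('.')}. {why.rstrip('.')}. Return: {fields}."


prompt = build_prompt(
    "Find how Godot 4.5 handles input authentication",
    "I need to implement player login",
    ["method name", "file location", "example usage"],
)
print(prompt)
```

The printed prompt reproduces the GOOD example above; a bare "Research authentication" cannot be built because the purpose and return fields are required.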

Speed Optimization Patterns

Pattern 1: Parallel Execution

Launch independent tasks simultaneously:

Task(prompt="Research X")  ←─┐
Task(prompt="Research Y")  ←─┼─ Single message, parallel execution
Task(prompt="Research Z")  ←─┘
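
The effect of dispatching independent tasks together can be simulated locally. In this sketch `run_subagent` is a hypothetical stand-in for a Task call (it only sleeps and returns a string), and a thread pool plays the role of the parallel dispatch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_subagent(prompt: str) -> str:
    """Hypothetical stand-in for a Task call; sleeps to mimic subagent latency."""
    time.sleep(0.1)
    return f"summary for: {prompt}"


def research_in_parallel(prompts: list[str]) -> list[str]:
    """Dispatch all prompts at once and return the summaries in input order."""
    if not prompts:
        return []
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(run_subagent, prompts))


start = time.perf_counter()
summaries = research_in_parallel(["Research X", "Research Y", "Research Z"])
elapsed = time.perf_counter() - start
print(summaries)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three simulated calls overlap, the elapsed time should be close to one call's latency rather than the sum of all three.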

Pattern 2: Right Model for Latency

Need	Model	Latency
Instant response	Haiku	~100ms
Quick research	MiniMax	~200ms
Deep analysis	GLM-4.7	~500ms
Complex reasoning	Opus	~300ms

Pattern 3: Batch Over Sequential

BAD:  Analyze image 1 → wait → Analyze image 2 → wait → ...
GOOD: Spawn 10 parallel agents → each analyzes 1 image → aggregate

Cost Optimization Patterns

Pattern 1: Delegate Exploration

Source: Anthropic Advanced Tool Use

"Delegate verbose operations (tests, log processing, documentation fetching) to subagents so output stays isolated from main conversation."

Pattern 2: Context Isolation

Subagents prevent context bloat:

Main context: 5000 tokens (stays lean)
Subagent reads 10 files: 15000 tokens (isolated, discarded after)
Result passed back: 500 tokens (only what matters)
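
The arithmetic behind these numbers, as a minimal sketch. The function name is hypothetical; the token counts are the ones quoted above.

```python
def main_context_after(base_tokens: int, file_tokens: int, summary_tokens: int, delegated: bool) -> int:
    """Tokens held in the main context after the file reads, with or without a subagent."""
    return base_tokens + (summary_tokens if delegated else file_tokens)


inline = main_context_after(5000, 15000, 500, delegated=False)
isolated = main_context_after(5000, 15000, 500, delegated=True)
saved = inline - isolated
print(f"inline: {inline}, isolated: {isolated}, saved: {saved} ({saved / inline:.1%})")
```

The subagent still processes the 15000 file tokens; the saving is that they are billed at the subagent's rate and never enter the main context.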

Pattern 3: Cascade Routing

Source: Anthropic Model Selection

60-70% of queries → Haiku (cheapest)
20-30% of queries → Sonnet (medium)
10-15% of queries → Opus (only when needed)
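
A rough estimate of what the cascade saves, using the relative cost multipliers from the Model Characteristics table (Haiku 2x, Sonnet 10x, Opus 50x). The 65/25/10 split is an assumed point inside the ranges above, and the sketch assumes every query uses the same number of tokens.

```python
# Relative cost multipliers taken from the Model Characteristics table.
RELATIVE_COST = {"haiku": 2, "sonnet": 10, "opus": 50}


def blended_cost(shares: dict[str, float]) -> float:
    """Average relative cost per query for a given routing split."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(RELATIVE_COST[model] * share for model, share in shares.items())


# Assumed split within the quoted ranges.
shares = {"haiku": 0.65, "sonnet": 0.25, "opus": 0.10}
cost = blended_cost(shares)
reduction = 1 - cost / RELATIVE_COST["opus"]
print(f"blended: {cost:.1f}x vs all-Opus: {RELATIVE_COST['opus']}x ({reduction:.0%} lower)")
```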

Pattern 4: Token Suspension via Background Execution

The Problem: Claude tokens burn while waiting for subagent results.

The Solution: Use run_in_background=true and end the turn early.

┌─────────────────────────────────────────────────────────┐
│ BLOCKING (expensive)                                    │
├─────────────────────────────────────────────────────────┤
│ Claude spawns subagent → waits 60s → gets result        │
│ Cost: 60 seconds of Opus tokens BURNED                  │
├─────────────────────────────────────────────────────────┤
│ BACKGROUND + END TURN (cheap)                           │
├─────────────────────────────────────────────────────────┤
│ Claude spawns with run_in_background=true → ends turn   │
│ Cost: ~0 seconds of Opus tokens (turn ended)            │
│ Subagent runs: 60s of cheap tokens                      │
│ User says "continue" → Claude retrieves with TaskOutput │
└─────────────────────────────────────────────────────────┘

When to suspend (use background):

•Research/exploration tasks >30 seconds
•Batch image analysis (10+ images)
•Web searches with multiple queries
•Code review by subagent

When NOT to suspend:

•Quick lookups (<5 seconds)
•Claude needs result to continue current reasoning
•Interactive debugging requiring rapid iteration
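
The two lists above reduce to a short predicate. The function and parameter names are hypothetical. The lists leave tasks between 5 and 30 seconds unspecified, so this sketch assumes they stay in the foreground.

```python
def should_run_in_background(
    estimated_seconds: float,
    blocks_current_reasoning: bool = False,
    interactive: bool = False,
) -> bool:
    """Decide whether to dispatch with run_in_background=true and end the turn."""
    # Claude needs the result now, or the user is iterating rapidly: stay blocking.
    if blocks_current_reasoning or interactive:
        return False
    # Only long-running work is worth suspending for (threshold from the list above).
    return estimated_seconds > 30


print(should_run_in_background(60))
print(should_run_in_background(3))
print(should_run_in_background(60, blocks_current_reasoning=True))
```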

Fire-and-Retrieve Workflow:

1. Task(prompt="Research X", run_in_background=true)
   → Returns immediately with task_id and output_file
2. Claude responds: "Research dispatched. Say 'continue' when ready."
3. User prompts again
4. TaskOutput(task_id="...", block=true)
   → Returns full results
5. Claude synthesizes and delivers answer
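
The workflow can be simulated in plain Python to show the control flow. Everything here is a stand-in: `task` and `task_output` are hypothetical local functions that mimic the Task and TaskOutput calls described above, backed by a thread pool. They are not the real tools.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Optional

_pool = ThreadPoolExecutor(max_workers=4)
_tasks: Dict[str, Future] = {}


def _subagent(prompt: str) -> str:
    """Hypothetical worker; sleeps briefly to mimic a long research job."""
    time.sleep(0.2)
    return f"findings for: {prompt}"


def task(prompt: str, run_in_background: bool = True) -> str:
    """Simulated dispatch: returns a task id at once when run in the background."""
    task_id = uuid.uuid4().hex
    _tasks[task_id] = _pool.submit(_subagent, prompt)
    if not run_in_background:
        _tasks[task_id].result()  # blocking mode waits here
    return task_id


def task_output(task_id: str, block: bool = True) -> Optional[str]:
    """Simulated retrieval: waits for the result if block is set, else returns None when unfinished."""
    future = _tasks[task_id]
    if not block and not future.done():
        return None
    return future.result()


task_id = task("Research X", run_in_background=True)      # step 1
print("Research dispatched. Say 'continue' when ready.")  # step 2
# The turn ends here; the user prompts again (step 3).
result = task_output(task_id, block=True)                 # step 4
print(result)                                             # step 5
```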

Key phrase: For tasks longer than 30 seconds, use background execution and end the turn early.

Anti-Patterns to Avoid

Anti-Pattern	Problem	Fix
Claude reads 10+ files	Context bloat, expensive	Spawn Explore agent
Sequential subagent calls	Slow, wasted time	Parallel in single message
Re-reading subagent results	Duplicates tokens	Trust the summary, decide
Opus for simple classification	Overkill, expensive	Use Haiku
GLM for quick lookups	Too slow	Use MiniMax

Integration with Project Workflows

HPV (Playtesting)

•Use MCP for game state inspection (local, fast)
•Delegate screenshot analysis to GLM-4V for batch checks

Visual Development

•1-3 sprites: Claude direct
•10+ sprites: Parallel GLM-4V agents
•Quality gate: Claude reviews subagent findings

Code Implementation

•Research patterns: MiniMax subagents
•Write code: Claude direct
•Review: Haiku subagent for spec compliance

Official Documentation Sources

[Claude Opus 4.5 - 2026-01-29]