token-efficient-delegation

Auto-injects token-efficient delegation patterns. Claude tokens are 50x more expensive than MiniMax/GLM/Kimi tokens.

SKILL.md
--- frontmatter
name: token-efficient-delegation
description: Auto-injects token-efficient delegation patterns. Claude tokens are 50x more expensive than MiniMax/GLM/Kimi.

Token-Efficient Delegation

⛔ CRITICAL: No Claude Subagents

HARD STOP: Never use model="haiku", model="sonnet", or model="opus" in Task tool calls.

  • Always: omit the model parameter (the call then routes to MiniMax/GLM)
  • Or explicitly: use the general-purpose agent (which runs on the cheaper default)
  • Enforced in: .claude/settings.local.json deny rules
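The hard stop above can also be checked in code before a call is dispatched. The following is a minimal sketch, assuming Task parameters are available as a plain dictionary; `check_task_call` and `BLOCKED_MODELS` are hypothetical names for illustration and are not part of any real SDK or of the deny rules themselves.

```python
# Hypothetical guard: illustrative only, not a real Claude Code API.
BLOCKED_MODELS = {"haiku", "sonnet", "opus"}


def check_task_call(params: dict) -> dict:
    """Return params unchanged if they respect the no-Claude-subagent rule."""
    model = params.get("model")
    if model is not None and str(model).lower() in BLOCKED_MODELS:
        raise ValueError(f"model={model!r} is blocked; omit the model parameter")
    return params


# Allowed: no model parameter, so the cheaper default is used.
allowed = check_task_call({"prompt": "Research X"})
```

A call such as `check_task_call({"prompt": "Research X", "model": "opus"})` raises `ValueError` under this sketch.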

Core Rule: Claude = reasoning/decisions, MiniMax/GLM = research/exploration

The Three Goals (Balance All Three)

| Goal | Priority | Strategy |
|---|---|---|
| Quality | Keep Opus-level output | Claude synthesizes, subagents gather |
| Cost | Minimize Claude tokens | Delegate exploration, batch analysis |
| Speed | Fast execution | Parallel subagents, right model for task |

Quick Decision Tree

Before ANY task, ask:

  1. Research/exploration? → Delegate to MiniMax (fast) or GLM (creative)
  2. Web search needed? → Delegate to MiniMax
  3. Reading specific known file? → Claude direct (Read tool)
  4. Analyzing 1-3 images? → Claude direct
  5. Analyzing 10+ images? → Parallel subagents (GLM for quality, MiniMax for speed)
  6. Complex reasoning/code? → Claude direct
  7. Broad codebase search? → Delegate to general-purpose agent (cheaper default)
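The decision tree above could be sketched as a small routing function. This is an illustration only: `TaskProfile`, its fields, and the returned labels are hypothetical, and the handling of 4-9 images is an assumption, since the list only covers 1-3 and 10+.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    # kind is one of: "research", "web_search", "known_file",
    # "images", "reasoning", "codebase_search" (hypothetical labels)
    kind: str
    image_count: int = 0
    creative: bool = False
    prefer_quality: bool = False


def route(task: TaskProfile) -> str:
    """Map a task to a handler, following the seven questions in order."""
    if task.kind == "research":
        return "GLM" if task.creative else "MiniMax"
    if task.kind == "web_search":
        return "MiniMax"
    if task.kind == "known_file":
        return "Claude direct"
    if task.kind == "images":
        if task.image_count <= 3:
            return "Claude direct"
        # Assumption: 4-9 images are not covered by the list above;
        # this sketch treats them like the 10+ case.
        return "parallel GLM" if task.prefer_quality else "parallel MiniMax"
    if task.kind == "reasoning":
        return "Claude direct"
    if task.kind == "codebase_search":
        return "general-purpose agent"
    raise ValueError(f"unknown task kind: {task.kind!r}")
```

For example, `route(TaskProfile("images", image_count=12, prefer_quality=True))` returns `"parallel GLM"`.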

Model Characteristics (Official Guidance)

| Model | Cost | Speed | Best For | Source |
|---|---|---|---|---|
| Claude Opus | 50x | Fast | Complex reasoning, architecture, final decisions | Anthropic Docs |
| Claude Sonnet | 10x | Fast | Coding, medium reasoning, code review | Anthropic Docs |
| Claude Haiku | 2x | Fastest | Exploration, classification, simple tasks | Anthropic Docs |
| MiniMax M2.1 | 1x | Fast | Instruction-following, agents, multi-lang coding | MiniMax Docs |
| GLM-4.7 | 1x | Medium | Creative problem-solving, math, tool use | Z.AI Docs |
| GLM-4V | 1x | Medium | Vision, 66K multimodal context, GUI tasks | Z.AI Docs |
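To make the cost column concrete, here is a rough sketch that applies the table's relative multipliers to a token count. The multipliers come from the table above; the 15,000-token workload and the function name are assumptions for illustration, and the results are relative units, not prices.

```python
# Relative cost multipliers copied from the table above (1x = cheapest tier).
RELATIVE_COST = {
    "Claude Opus": 50,
    "Claude Sonnet": 10,
    "Claude Haiku": 2,
    "MiniMax M2.1": 1,
    "GLM-4.7": 1,
    "GLM-4V": 1,
}


def relative_cost(model: str, tokens: int) -> int:
    """Cost in relative units: multiplier times token count."""
    return RELATIVE_COST[model] * tokens


# Hypothetical exploration workload of 15,000 tokens.
opus_units = relative_cost("Claude Opus", 15_000)      # 750000
minimax_units = relative_cost("MiniMax M2.1", 15_000)  # 15000
savings_factor = opus_units // minimax_units           # 50
```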

When to Use Each Model

MiniMax M2.1 - Fast Instruction-Follower

Source: MiniMax Official Docs

  • Quick web searches
  • Structured data extraction
  • File reading with specific questions
  • Tasks with clear instructions
  • Roughly 8% of Claude Sonnet's cost at about 2x the speed (MiniMax's own comparison)
  • Preserve reasoning chains (return full response with thinking field)

GLM-4.7 - Creative Problem-Solver

Source: Z.AI GLM-4.7 Docs

  • Complex problem exploration
  • Mathematical reasoning (95.7% on AIME 2025)
  • Creative solutions needed
  • Tool use orchestration (42.8%, ties GPT-5.1)
  • Parallel batch tasks (latency doesn't matter)
  • Use thinking mode for complex tasks, disable for simple queries

GLM-4V - Vision Tasks

Source: GLM-V GitHub

  • 66K-token multimodal context
  • Screenshot/GUI analysis
  • Document parsing (PDFs, charts)
  • Video understanding with timestamps
  • Multi-image analysis

Claude (Haiku/Sonnet/Opus) - Reasoning & Decisions

Source: Anthropic Subagents Docs

  • Haiku: Fast exploration, read-only tasks, classification
  • Sonnet: Coding, medium-complexity reasoning
  • Opus: Architecture decisions, synthesis, complex reasoning
  • Use subagents to isolate verbose output from main context

Quality Preservation Patterns

Pattern 1: Orchestrator-Worker Split

code
Claude (Orchestrator) → Spawns subagents → Subagents research → Claude synthesizes

Quality preserved because: Claude makes all decisions based on subagent research.

Pattern 2: Parallel Research + Single Synthesis

code
5 MiniMax agents research in parallel → Claude receives summaries → Claude decides

Quality preserved because: Multiple perspectives, Claude does final reasoning.

Pattern 3: Two-Stage Review

code
Subagent implements → Reviewer subagent checks → Claude integrates

Quality preserved because: Cross-validation catches errors before Claude acts.

Pattern 4: Specific Prompts

Source: MiniMax Best Practices

"State the 'why' behind your request - when models understand purpose, they provide more accurate answers."

code
BAD:  "Research authentication"
GOOD: "Find how Godot 4.5 handles input authentication. I need to implement player login. Return: method name, file location, example usage."
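One way to make the "state the why" habit mechanical is a small prompt builder that refuses to run without a purpose and a return format. This is a sketch under assumptions: `build_prompt` and its parameters are hypothetical names, not part of any tool described here.

```python
def build_prompt(task: str, why: str, return_fields: list[str]) -> str:
    """Compose a delegation prompt from the task, its purpose, and the expected fields."""
    if not why.strip():
        raise ValueError("state the 'why' behind the request")
    if not return_fields:
        raise ValueError("list what the subagent should return")
    return f"{task} {why} Return: {', '.join(return_fields)}."


prompt = build_prompt(
    "Find how Godot 4.5 handles input authentication.",
    "I need to implement player login.",
    ["method name", "file location", "example usage"],
)
```

With these inputs, `prompt` reproduces the GOOD example above.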

Speed Optimization Patterns

Pattern 1: Parallel Execution

Launch independent tasks simultaneously:

code
Task(prompt="Research X")  ←─┐
Task(prompt="Research Y")  ←─┼─ Single message, parallel execution
Task(prompt="Research Z")  ←─┘
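Outside the Task tool, the same fan-out shape can be sketched with Python's standard thread pool. `run_subagent` below is a hypothetical stand-in that returns a stub string; a real integration would call whatever subagent mechanism is available.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(prompt: str) -> str:
    """Hypothetical stand-in for a real subagent call; returns a stub summary."""
    return f"summary for: {prompt}"


def research_in_parallel(prompts: list[str]) -> list[str]:
    """Run all prompts concurrently and return summaries in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        # Executor.map preserves the order of its inputs.
        return list(pool.map(run_subagent, prompts))


summaries = research_in_parallel(["Research X", "Research Y", "Research Z"])
```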

Pattern 2: Right Model for Latency

| Need | Model | Latency |
|---|---|---|
| Instant response | Haiku | ~100ms |
| Quick research | MiniMax | ~200ms |
| Deep analysis | GLM-4.7 | ~500ms |
| Complex reasoning | Opus | ~300ms |

Pattern 3: Batch Over Sequential

code
BAD:  Analyze image 1 → wait → Analyze image 2 → wait → ...
GOOD: Spawn 10 parallel agents → each analyzes 1 image → aggregate
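The latency difference between the two approaches is simple arithmetic. The sketch below assumes ideal parallelism (no rate limits, no aggregation overhead) and an invented per-image latency of 2 seconds; neither figure comes from the source.

```python
def sequential_latency(latencies: list[float]) -> float:
    """Each analysis waits for the previous one, so the times add up."""
    return sum(latencies)


def parallel_latency(latencies: list[float]) -> float:
    """With ideal parallelism, the slowest agent determines the total."""
    return max(latencies, default=0.0)


# Assumed: 10 images at 2.0 seconds each (illustrative numbers only).
per_image = [2.0] * 10
sequential_total = sequential_latency(per_image)  # 20.0
parallel_total = parallel_latency(per_image)      # 2.0
```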

Cost Optimization Patterns

Pattern 1: Delegate Exploration

Source: Anthropic Advanced Tool Use

"Delegate verbose operations (tests, log processing, documentation fetching) to subagents so output stays isolated from main conversation."

Pattern 2: Context Isolation

Subagents prevent context bloat:

code
Main context: 5000 tokens (stays lean)
Subagent reads 10 files: 15000 tokens (isolated, discarded after)
Result passed back: 500 tokens (only what matters)
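The token figures above work out as follows. This is a small sketch of the bookkeeping only; the function name is hypothetical, and it assumes that without a subagent the full file contents would land in the main context.

```python
def main_context_after(main_tokens: int, read_tokens: int,
                       summary_tokens: int, isolated: bool) -> int:
    """Size of the main context after the file reads, with or without a subagent."""
    if isolated:
        # Only the summary returns; the subagent's context is discarded.
        return main_tokens + summary_tokens
    # Without isolation, everything that was read stays in the main context.
    return main_tokens + read_tokens


with_subagent = main_context_after(5000, 15000, 500, isolated=True)      # 5500
without_subagent = main_context_after(5000, 15000, 500, isolated=False)  # 20000
```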

Pattern 3: Cascade Routing

Source: Anthropic Model Selection

code
60-70% of queries → Haiku (cheapest)
20-30% of queries → Sonnet (medium)
10-15% of queries → Opus (only when needed)
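As a rough check on what a cascade saves, the blended cost can be computed from the relative multipliers in the model table (Haiku 2x, Sonnet 10x, Opus 50x). The shares below (65% / 25% / 10%) are assumed values picked from within the ranges above so that they sum to 1; they are not figures from the source.

```python
import math

# Relative multipliers from the model table earlier in this document.
RELATIVE_COST = {"haiku": 2, "sonnet": 10, "opus": 50}


def blended_cost(shares: dict[str, float]) -> float:
    """Average relative cost per query for a given routing mix."""
    if not math.isclose(sum(shares.values()), 1.0):
        raise ValueError("shares must sum to 1")
    return sum(RELATIVE_COST[model] * share for model, share in shares.items())


# Assumed mix within the stated ranges.
cascade = blended_cost({"haiku": 0.65, "sonnet": 0.25, "opus": 0.10})  # about 8.8
all_opus = blended_cost({"opus": 1.0})                                 # 50.0
```

Under these assumptions the cascade averages about 8.8x per query, compared with 50x when every query goes to Opus.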

Pattern 4: Token Suspension via Background Execution

The Problem: Claude tokens burn while waiting for subagent results.

The Solution: Use run_in_background=true and end the turn early.

code
┌─────────────────────────────────────────────────────────┐
│ BLOCKING (expensive)                                    │
├─────────────────────────────────────────────────────────┤
│ Claude spawns subagent → waits 60s → gets result       │
│ Cost: 60 seconds of Opus tokens BURNED                  │
├─────────────────────────────────────────────────────────┤
│ BACKGROUND + END TURN (cheap)                           │
├─────────────────────────────────────────────────────────┤
│ Claude spawns with run_in_background=true → ends turn  │
│ Cost: ~0 seconds of Opus tokens (turn ended)           │
│ Subagent runs: 60s of cheap tokens                      │
│ User says "continue" → Claude retrieves with TaskOutput│
└─────────────────────────────────────────────────────────┘

When to suspend (use background):

  • Research/exploration tasks >30 seconds
  • Batch image analysis (10+ images)
  • Web searches with multiple queries
  • Code review by subagent

When NOT to suspend:

  • Quick lookups (<5 seconds)
  • Claude needs result to continue current reasoning
  • Interactive debugging requiring rapid iteration
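The two lists above can be condensed into a small predicate. This is a sketch with hypothetical parameter names; the source does not say what to do for tasks between 5 and 30 seconds, so this version assumes they stay in the foreground.

```python
def should_run_in_background(estimated_seconds: float,
                             blocks_current_reasoning: bool = False,
                             interactive: bool = False) -> bool:
    """Decide whether to dispatch a subagent in the background and end the turn."""
    if blocks_current_reasoning or interactive:
        # Claude needs the result now, or the loop requires rapid iteration.
        return False
    # Assumption: only the >30 second threshold triggers background execution;
    # the 5-30 second band is unspecified in the source and treated as foreground.
    return estimated_seconds > 30
```

For example, `should_run_in_background(60)` is `True`, while `should_run_in_background(60, blocks_current_reasoning=True)` is `False`.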

Fire-and-Retrieve Workflow:

code
1. Task(prompt="Research X", run_in_background=true)
   → Returns immediately with task_id and output_file
2. Claude responds: "Research dispatched. Say 'continue' when ready."
3. User prompts again
4. TaskOutput(task_id="...", block=true)
   → Returns full results
5. Claude synthesizes and delivers answer

Key phrase: For tasks >30 seconds, use background execution and end turn early.
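The fire-and-retrieve workflow can be imitated in-process with standard futures, which may help when reasoning about the control flow. `BackgroundTasks`, `dispatch`, and `output` are hypothetical names that loosely mirror `run_in_background` and `TaskOutput`; this is not the real tool interface.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor


class BackgroundTasks:
    """Hypothetical in-process stand-in for background dispatch and retrieval."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._futures: dict[str, Future] = {}

    def dispatch(self, fn, *args) -> str:
        """Start work and return immediately with a task id (step 1)."""
        task_id = uuid.uuid4().hex
        self._futures[task_id] = self._pool.submit(fn, *args)
        return task_id

    def output(self, task_id: str, block: bool = True):
        """Fetch the result (step 4); return None if not ready and block is False."""
        future = self._futures[task_id]
        if not block and not future.done():
            return None
        return future.result()


def research(topic: str) -> str:
    """Stub worker standing in for a cheap subagent."""
    return f"findings on {topic}"


tasks = BackgroundTasks()
task_id = tasks.dispatch(research, "X")     # turn would end here
result = tasks.output(task_id, block=True)  # later turn: retrieve and synthesize
```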


Anti-Patterns to Avoid

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Claude reads 10+ files | Context bloat, expensive | Spawn Explore agent |
| Sequential subagent calls | Slow, wasted time | Parallel in single message |
| Re-reading subagent results | Duplicates tokens | Trust the summary, decide |
| Opus for simple classification | Overkill, expensive | Use Haiku |
| GLM for quick lookups | Too slow | Use MiniMax |

Integration with Project Workflows

HPV (Playtesting)

  • Use MCP for game state inspection (local, fast)
  • Delegate screenshot analysis to GLM-4V for batch checks

Visual Development

  • 1-3 sprites: Claude direct
  • 10+ sprites: Parallel GLM-4V agents
  • Quality gate: Claude reviews subagent findings

Code Implementation

  • Research patterns: MiniMax subagents
  • Write code: Claude direct
  • Review: Haiku subagent for spec compliance

Official Documentation Sources


[Claude Opus 4.5 - 2026-01-29]