Token Usage Estimation and Optimization
Deliver practical token estimates and a prioritized optimization plan without reducing output quality unnecessarily.
Quick Start
- •Clarify the objective, target model(s), and constraints (budget, latency, max input/output tokens).
- •Inventory prompt components (system, developer, tool list, history, retrieved docs, user input).
- •Estimate tokens for each component and find the top 2-3 drivers.
- •Apply the smallest, highest-impact optimizations first.
- •Re-estimate and validate output quality.
Estimation Heuristics
Use a tokenizer or estimation library when available. If not, use fast heuristics and add a buffer.
- •English prose: tokens ≈ words * 1.3 or chars / 4
- •Code: tokens ≈ chars / 3
- •JSON/structured: tokens ≈ chars / 3.5
- •Buffer: add 10-20% for safety
Template:
text
input_tokens ≈ (system + tools + history + retrieval + user) output_tokens ≈ target_response_length total_tokens ≈ input_tokens + output_tokens
Optimization Playbook
Input Tokens
- •Trim tool list: only include relevant tools for the request.
- •Conditional tool instructions: include tool-specific guidance only when tool is present.
- •Prompt caching: place stable instructions at the top for caching benefits.
- •Summarize history: keep last N turns + compact summary + key decisions.
- •Reduce retrieval size: tighten query, limit top-k, dedupe, and remove boilerplate.
- •Shorten system prompt: remove redundant policies or examples.
Output Tokens
- •Set caps: use
max_tokensor explicit length limits. - •Constrain format: bullet lists, tables, or schemas reduce verbosity.
- •Ask for concise: specify maximum bullets, sentences, or words.
Model Strategy
- •Use smaller models for classification, routing, summarization, and labeling.
- •Escalate to larger models only for complex reasoning or high-stakes outputs.
Retry Reduction
- •Return structured error messages with next steps to prevent retry loops.
- •Make tool descriptions unambiguous to reduce misfires.
Decision Trees
Reduce Input Size
- •If prompt exceeds budget by <15% → trim tool list and boilerplate first.
- •If exceeded by 15-40% → summarize older history and remove low-value examples.
- •If exceeded by >40% → switch to retrieval + short summary + last N turns.
- •If still too large → move large static guidance to cached prefix or references.
Choose Model Tier
- •Classification/routing/summarization → smallest available model.
- •Simple edits or formatting → small/fast model.
- •Complex reasoning or high-stakes → larger model.
- •If output quality drops → increase model tier before expanding context.
NEVER List
- •NEVER trim tool schemas or argument constraints; it causes invalid calls.
- •NEVER drop safety or system constraints to save tokens.
- •NEVER include unused tools; every tool adds token cost.
- •NEVER summarize the last user request; keep it verbatim.
- •NEVER compress examples that are required for format compliance.
Agent-Specific Guidance
- •Keep tool names and descriptions clear and specific.
- •Filter tools before each call to avoid paying for unused tool metadata.
- •Avoid stuffing large static instructions into every turn; move to cached prefix.
Validation Checklist
- •Recalculate token estimates after changes.
- •Spot-check response quality vs baseline.
- •Verify you did not remove critical instructions or constraints.
- •Record changes and observed token deltas for future runs.
Scripts
- •
scripts/token_estimate.mjs: fast token estimation via tokenx (approximate)
token_estimate.mjs Usage
Requirements:
bash
node + npx available on PATH
Estimate tokens from a file or raw text:
bash
node scripts/token_estimate.mjs --file AGENTS.md
bash
node scripts/token_estimate.mjs --text "Hello"
Tune conservatism with a custom chars-per-token value:
bash
node scripts/token_estimate.mjs --file AGENTS.md --chars-per-token 4
References
- •Read
references/token-optimization-sources.mdonly when asked for citations or rationale.