Agentic Systems Design
Agent Architecture Selection
| Architecture | Use When | Complexity | Latency |
|---|---|---|---|
| Single-agent ReAct | 1-5 tools, linear reasoning | Low | Low |
| Plan-and-execute | Multi-step tasks needing upfront planning | Medium | Medium |
| Tree-of-Thought | Tasks with branching solutions, math/logic | Medium | High |
| LATS (Language Agent Tree Search) | Complex search + evaluation loops | High | Very High |
| Multi-agent supervisor | Specialized sub-tasks, delegation | High | Medium |
| Multi-agent debate | Tasks needing verification, fact-checking | High | High |
| Multi-agent chain | Sequential pipeline, each agent transforms output | Medium | Medium-High |
Decision rule: Start with single-agent ReAct. Escalate to plan-and-execute if the agent frequently fails mid-task. Use multi-agent only when a single model cannot hold all required expertise in context.
Planning Patterns
ReAct (Reasoning + Acting)
The default pattern. Model alternates between reasoning (think) and acting (tool call).
import anthropic

client = anthropic.Anthropic()

def react_agent(question: str, tools: list[dict], max_steps: int = 10) -> str:
    system = """You are a helpful agent. For each step:
1. Think about what you need to do next
2. Use a tool if needed
3. When you have enough information, provide the final answer"""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages,
        )
        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...) ends the loop;
        # otherwise an empty tool_results list would be sent back to the API.
        if response.stop_reason != "tool_use":
            text_blocks = [b.text for b in response.content if b.type == "text"]
            return "\n".join(text_blocks)
        # Process tool calls (execute_tool is defined under "Error Handling in Tool Results")
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result),
                })
        # Echo the assistant turn, then return all tool results in a single user turn
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    return "Max steps reached without resolution"
Plan-and-Execute
Separate planning from execution. Model generates a plan upfront, then executes steps sequentially.
def plan_and_execute(question: str, tools: list[dict]) -> str:
    # Phase 1: Generate plan
    plan_response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"""Create a step-by-step plan to answer this question.
Return a numbered list of steps. Each step should be a single action.
Question: {question}"""}],
    )
    plan = plan_response.content[0].text
    # Phase 2: Execute each step (parse_plan_steps is a helper not defined in this section)
    context = []
    for step in parse_plan_steps(plan):
        step_result = react_agent(
            f"Execute this step: {step}\n\nContext from previous steps:\n{chr(10).join(context)}",
            tools=tools,
            max_steps=3,
        )
        context.append(f"Step: {step}\nResult: {step_result}")
    # Phase 3: Synthesize
    synthesis = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"""Original question: {question}
Execution results:
{chr(10).join(context)}
Synthesize a final answer."""}],
    )
    return synthesis.content[0].text
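`plan_and_execute` calls `parse_plan_steps`, which this section does not define. A minimal sketch of such a helper follows, assuming the model returns steps as lines starting with `1.` or `1)`; the accepted formats are an assumption, and real plans may need a more forgiving parser.

```python
import re

def parse_plan_steps(plan: str) -> list[str]:
    """Extract step text from lines such as '1. Do X' or '2) Do Y'.

    Lines that do not start with a step number (headers, blank lines,
    closing remarks) are ignored.
    """
    steps = []
    for line in plan.splitlines():
        match = re.match(r"^\s*\d+[.)]\s+(.*\S)\s*$", line)
        if match:
            steps.append(match.group(1))
    return steps
```

If the parser returns an empty list, falling back to a single `react_agent` call on the original question is safer than synthesizing from no results.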
Tree-of-Thought
Generate multiple reasoning paths, evaluate each, expand the most promising.
def tree_of_thought(problem: str, breadth: int = 3, depth: int = 3) -> str:
    def generate_thoughts(state: str, n: int) -> list[str]:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=2048,
            messages=[{"role": "user", "content": f"""Problem: {problem}
Current reasoning: {state}
Generate {n} distinct next reasoning steps. Return each on a new line prefixed with [THOUGHT]."""}],
        )
        # parse_thoughts is a helper not defined in this section
        return parse_thoughts(response.content[0].text)

    def evaluate_thought(state: str) -> float:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=256,
            messages=[{"role": "user", "content": f"""Rate this reasoning path from 0.0 to 1.0 for correctness and progress toward solving: {problem}
Reasoning: {state}
Return only a number."""}],
        )
        try:
            return float(response.content[0].text.strip())
        except ValueError:
            # The model did not return a bare number; treat the path as unpromising
            return 0.0

    # BFS with pruning: keep only the top `breadth` states at each level
    current_states = [""]
    for _ in range(depth):
        candidates = []
        for state in current_states:
            thoughts = generate_thoughts(state, breadth)
            for thought in thoughts:
                new_state = f"{state}\n{thought}" if state else thought
                score = evaluate_thought(new_state)
                candidates.append((score, new_state))
        if not candidates:
            break  # No thoughts were parsed; keep the best states found so far
        candidates.sort(reverse=True, key=lambda x: x[0])
        current_states = [s for _, s in candidates[:breadth]]
    return current_states[0]
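`tree_of_thought` depends on a `parse_thoughts` helper that this section does not define, and score parsing is fragile when the model adds words around the number. The sketch below shows one possible shape for both; `parse_score` is a hypothetical name, and clamping scores to [0.0, 1.0] is an assumption.

```python
import re

def parse_thoughts(text: str) -> list[str]:
    """Return the text of every line prefixed with [THOUGHT]."""
    thoughts = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[THOUGHT]"):
            thought = line[len("[THOUGHT]"):].strip()
            if thought:
                thoughts.append(thought)
    return thoughts

def parse_score(text: str, default: float = 0.0) -> float:
    """Pull the first number out of a model reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))
```

Returning a default score instead of raising keeps one malformed reply from aborting the whole search.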
Tool Design Principles
Schema Design
# Good: specific description, constrained types, clear required fields
{
    "name": "search_orders",
    "description": "Search customer orders by order ID, customer email, or date range. Returns up to 10 matching orders with status and total.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Exact order ID (e.g., ORD-12345)"},
            "customer_email": {"type": "string", "format": "email"},
            "date_from": {"type": "string", "description": "ISO 8601 date (YYYY-MM-DD)"},
            "date_to": {"type": "string", "description": "ISO 8601 date (YYYY-MM-DD)"},
            "status": {"type": "string", "enum": ["pending", "shipped", "delivered", "cancelled"]},
        },
        "required": [],  # All optional -- at least one should be provided
    },
}
Tool description rules:
- Start with a verb: "Search", "Create", "Calculate", "Retrieve"
- Mention return format: "Returns a JSON list of...", "Returns a single..."
- Include example inputs in description when format is ambiguous
- Keep under 200 words; models parse long descriptions less reliably
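The rules above can be checked mechanically before a tool is registered. The following is a rough sketch only: `lint_tool_description` and the `ACTION_VERBS` set are hypothetical, and the string checks are heuristics that will miss some well-written descriptions.

```python
ACTION_VERBS = {"search", "create", "calculate", "retrieve", "update", "delete", "list", "send"}

def lint_tool_description(description: str, max_words: int = 200) -> list[str]:
    """Return a list of rule violations; an empty list means no problems found."""
    problems = []
    words = description.split()
    first = words[0].lower().strip(".,:") if words else ""
    # Accept both "Search ..." and "Searches ..." style openings
    if first not in ACTION_VERBS and first.rstrip("s") not in ACTION_VERBS:
        problems.append("does not start with a known action verb")
    # Matches "Returns ...", "returning ...", and similar phrasing
    if "return" not in description.lower():
        problems.append("does not mention the return format")
    if len(words) > max_words:
        problems.append(f"longer than {max_words} words")
    return problems
```

Running a check like this in CI over every registered tool catches descriptions such as "Gets data" before they reach the model.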
Error Handling in Tool Results
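import json

from pydantic import ValidationError  # Assumption: tools validate their inputs with pydantic

# TOOL_REGISTRY maps tool names to Python callables; it is assumed to be defined elsewhere.

def execute_tool(name: str, inputs: dict) -> str:
    # Look the tool up first so a KeyError raised inside a tool is not reported as "unknown tool"
    if name not in TOOL_REGISTRY:
        return json.dumps({"status": "error", "message": f"Unknown tool: {name}"})
    try:
        result = TOOL_REGISTRY[name](**inputs)
        # default=str keeps non-JSON types (dates, Decimals) from raising during serialization
        return json.dumps({"status": "success", "data": result}, default=str)
    except ValidationError as e:
        return json.dumps({"status": "error", "message": f"Invalid input: {e}"})
    except Exception as e:
        return json.dumps({"status": "error", "message": f"Tool execution failed: {e}"})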
Always return structured errors. Models recover better from {"status": "error", "message": "..."} than from raw exceptions or empty strings.
Multi-Agent Patterns
Supervisor Pattern
One orchestrator agent delegates to specialist agents.
from typing import Callable

def supervisor_agent(question: str, specialists: dict[str, Callable[[str], str]]) -> str:
    router_tools = [
        {
            "name": "delegate",
            "description": "Delegate a sub-task to a specialist agent.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "specialist": {
                        "type": "string",
                        "enum": list(specialists.keys()),
                        "description": "Which specialist to delegate to",
                    },
                    "task": {"type": "string", "description": "The sub-task description"},
                },
                "required": ["specialist", "task"],
            },
        }
    ]
    messages = [{"role": "user", "content": question}]
    for _ in range(10):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=4096,
            system="You are a supervisor. Break the task into sub-tasks and delegate to specialists. Synthesize results.",
            tools=router_tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # Join all text blocks; the first content block is not guaranteed to be text
            return "\n".join(b.text for b in response.content if b.type == "text")
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                specialist_fn = specialists[block.input["specialist"]]
                result = specialist_fn(block.input["task"])
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result),
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    # Without this, the function would silently return None after 10 rounds
    return "Max delegation rounds reached without resolution"
Debate Pattern
Two agents argue for/against, a judge decides.
def debate_agents(question: str, rounds: int = 2) -> str:
    # format_debate is a helper not defined in this section
    pro_history, con_history = [], []
    for r in range(rounds):
        pro = client.messages.create(
            model="claude-sonnet-4-5-20250929", max_tokens=1024,
            system="You argue FOR the proposition. Be specific and cite evidence.",
            messages=[{"role": "user", "content": f"Question: {question}\nRound {r+1}. Previous debate:\n{format_debate(pro_history, con_history)}"}],
        ).content[0].text
        pro_history.append(pro)
        con = client.messages.create(
            model="claude-sonnet-4-5-20250929", max_tokens=1024,
            system="You argue AGAINST the proposition. Counter the pro arguments specifically.",
            messages=[{"role": "user", "content": f"Question: {question}\nRound {r+1}. Previous debate:\n{format_debate(pro_history, con_history)}"}],
        ).content[0].text
        con_history.append(con)
    # Judge synthesizes
    verdict = client.messages.create(
        model="claude-sonnet-4-5-20250929", max_tokens=1024,
        system="You are an impartial judge. Evaluate both sides and give a final verdict with reasoning.",
        messages=[{"role": "user", "content": f"Question: {question}\n\nFull debate:\n{format_debate(pro_history, con_history)}"}],
    )
    return verdict.content[0].text
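`debate_agents` relies on a `format_debate` helper that this section does not define. A minimal sketch follows, assuming the transcript should interleave the two sides round by round; the `[Round N] PRO:` / `CON:` labels and the empty-transcript placeholder are illustrative choices.

```python
def format_debate(pro_history: list[str], con_history: list[str]) -> str:
    """Interleave pro and con arguments round by round into one transcript."""
    if not pro_history and not con_history:
        return "(no arguments yet)"
    lines = []
    for i in range(max(len(pro_history), len(con_history))):
        # The pro side speaks first each round, so it may be one argument ahead
        if i < len(pro_history):
            lines.append(f"[Round {i + 1}] PRO: {pro_history[i]}")
        if i < len(con_history):
            lines.append(f"[Round {i + 1}] CON: {con_history[i]}")
    return "\n".join(lines)
```

Handling uneven histories matters because the con agent is prompted after the pro agent has already added its argument for the current round.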
Agent Evaluation
| Metric | What It Measures | How to Compute |
|---|---|---|
| Task completion | Did the agent solve the problem? | Human eval or automated check against gold answer |
| Tool accuracy | Did it call the right tools with right args? | Compare tool call trace to expected trace |
| Step efficiency | How many steps to solve? | Count tool calls; compare to optimal path |
| Cost | Total tokens consumed | Sum input + output tokens across all turns |
| Hallucination rate | Did it fabricate tool results or facts? | Check claims against tool outputs |
def evaluate_agent(agent_fn, test_cases: list[dict]) -> dict:
    # agent_fn is assumed to accept a trace_callback that receives one dict per tool call,
    # each with at least a "name" key. check_answer and compute_tool_accuracy are helpers
    # not defined in this section.
    results = []
    for case in test_cases:
        trace = []
        result = agent_fn(case["question"], trace_callback=trace.append)
        results.append({
            "question": case["question"],
            "expected": case["expected_answer"],
            "actual": result,
            "correct": check_answer(result, case["expected_answer"]),
            "num_steps": len(trace),
            "tools_used": [t["name"] for t in trace],
            "expected_tools": case.get("expected_tools", []),
            "tool_accuracy": compute_tool_accuracy(trace, case.get("expected_tools", [])),
        })
    n = len(results) or 1  # Avoid ZeroDivisionError on an empty test set
    return {
        "task_completion": sum(r["correct"] for r in results) / n,
        "avg_steps": sum(r["num_steps"] for r in results) / n,
        "tool_accuracy": sum(r["tool_accuracy"] for r in results) / n,
        "results": results,
    }
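`evaluate_agent` calls `check_answer` and `compute_tool_accuracy`, neither of which is defined in this section. The sketch below shows one simple way to implement them. Both scoring rules are assumptions: substring matching is lenient and may need replacing with exact-match or model-graded checks, and tool accuracy here is the fraction of expected tools that were called at least once, ignoring order and arguments.

```python
def check_answer(actual: str, expected: str) -> bool:
    """Case-insensitive containment check after collapsing whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(expected) in normalize(actual)

def compute_tool_accuracy(trace: list[dict], expected_tools: list[str]) -> float:
    """Fraction of expected tools that appear in the trace (1.0 if none expected)."""
    expected = set(expected_tools)
    if not expected:
        return 1.0
    used = {call["name"] for call in trace}
    return len(expected & used) / len(expected)
```

A stricter variant would also compare tool arguments and call order against the expected trace.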
Guardrails
| Guardrail | Default | Why |
|---|---|---|
| Max iterations | 10-15 | Prevents infinite loops |
| Timeout | 60-120s total | Caps wall-clock time |
| Token budget | 50K-100K per task | Caps cost per execution |
| Human-in-the-loop | On destructive actions | Prevents irreversible damage |
| Tool allowlist | Explicit per agent | Limits blast radius |
| Output validation | Schema check on final output | Ensures usable result |
import time
from dataclasses import dataclass

class BudgetExceeded(Exception):
    """Raised when an agent run exceeds its step, token, or time limit."""

@dataclass
class AgentBudget:
    max_steps: int = 15
    max_tokens: int = 100_000
    timeout_seconds: float = 120.0
    require_approval_for: list[str] | None = None  # Tool names needing human approval

    def check(self, steps: int, tokens: int, start_time: float):
        # start_time is expected to come from time.time() at the start of the run
        if steps >= self.max_steps:
            raise BudgetExceeded(f"Max steps ({self.max_steps}) exceeded")
        if tokens >= self.max_tokens:
            raise BudgetExceeded(f"Token budget ({self.max_tokens}) exceeded")
        elapsed = time.time() - start_time
        if elapsed >= self.timeout_seconds:
            raise BudgetExceeded(f"Timeout ({self.timeout_seconds}s) exceeded")

    def needs_approval(self, tool_name: str) -> bool:
        if self.require_approval_for is None:
            return False
        return tool_name in self.require_approval_for
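A budget only helps if the agent loop calls `check` before every step. The sketch below shows one way to wire that in. `run_with_budget`, `StepLimit`, and `demo_step` are hypothetical names; `StepLimit` is a stand-in exposing the same `check(steps, tokens, start_time)` signature as `AgentBudget` so the block runs on its own, and each step is assumed to return an `(answer_or_None, tokens_used)` pair.

```python
import time

class BudgetExceeded(Exception):
    """Defined here so the block runs on its own."""

class StepLimit:
    """Stand-in budget that enforces only a step limit."""
    def __init__(self, max_steps: int):
        self.max_steps = max_steps

    def check(self, steps: int, tokens: int, start_time: float) -> None:
        if steps >= self.max_steps:
            raise BudgetExceeded(f"Max steps ({self.max_steps}) exceeded")

def run_with_budget(step_fn, budget) -> dict:
    """Call step_fn until it returns an answer or the budget is exhausted."""
    start = time.time()
    steps = tokens = 0
    try:
        while True:
            budget.check(steps, tokens, start)  # Check before doing any work
            answer, used = step_fn(steps)
            steps += 1
            tokens += used
            if answer is not None:
                return {"status": "done", "answer": answer, "steps": steps, "tokens": tokens}
    except BudgetExceeded as exc:
        return {"status": "stopped", "reason": str(exc), "steps": steps, "tokens": tokens}

def demo_step(step_index: int):
    """Fake agent step: costs 100 tokens and answers on the third call."""
    return ("42" if step_index == 2 else None, 100)
```

Returning a structured "stopped" result instead of letting the exception propagate lets the caller log partial progress or hand the task to a human.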
Gotchas
Tool Description Quality
Vague tool descriptions cause wrong tool selection. "Gets data" is bad. "Retrieves customer order history by email address, returning last 30 days of orders with status and totals" is good.
Infinite Loops
Agents can loop calling the same tool with the same args. Track call history and inject "You already called {tool} with these args. Try a different approach." after 2 duplicate calls.
Context Window Overflow
Long agent runs accumulate tokens fast. Summarize older tool results once context exceeds 50% of window. Keep the last 2-3 tool results verbatim.
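The compaction step could look roughly like the sketch below. Several parts are assumptions: tokens are estimated at about 4 characters each, `truncate` stands in for a real summarizer (typically another model call), and only tool results stored as plain dicts are compacted. All function names are hypothetical.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough heuristic: about 4 characters per token."""
    return sum(len(str(m["content"])) for m in messages) // 4

def truncate(text: str, max_chars: int = 200) -> str:
    """Stand-in for a real summarizer."""
    return text if len(text) <= max_chars else text[:max_chars] + " [truncated]"

def compact_tool_results(messages: list[dict], context_window: int,
                         keep_last: int = 3, summarize=truncate) -> list[dict]:
    """Shorten all but the last `keep_last` tool results once usage passes 50%."""
    if estimate_tokens(messages) <= context_window // 2:
        return messages
    # Locate every tool_result block as (message index, block index)
    positions = [
        (i, j)
        for i, message in enumerate(messages)
        if isinstance(message["content"], list)
        for j, block in enumerate(message["content"])
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    old = set(positions[: max(0, len(positions) - keep_last)])
    compacted = []
    for i, message in enumerate(messages):
        if not isinstance(message["content"], list):
            compacted.append(message)
            continue
        # Build new blocks rather than mutating the caller's message history
        blocks = [
            {**block, "content": summarize(str(block["content"]))} if (i, j) in old else block
            for j, block in enumerate(message["content"])
        ]
        compacted.append({**message, "content": blocks})
    return compacted
```

The `tool_use_id` of each compacted block is preserved, so the shortened results still pair with their original tool calls.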
Overly Eager Tool Use
Models sometimes call tools when they already have the answer in context. Add "Only use a tool if you cannot answer from information you already have" to the system prompt.
Multi-Agent Communication Overhead
Each handoff between agents adds latency and token cost. Minimize cross-agent calls. If two agents always work together, merge them into one with a richer tool set.
Evaluation Pitfalls
- Don't evaluate agents only on final answer; inspect the full tool call trace
- Agent behavior is non-deterministic; run evals 3-5 times and report variance
- Test adversarial inputs: ambiguous questions, impossible tasks, tasks requiring tools the agent doesn't have
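Reporting variance across repeated runs can be done with the standard library. The sketch below assumes each eval run produces a single score such as the task completion rate; `summarize_runs` and `repeat_eval` are hypothetical names.

```python
import statistics

def summarize_runs(scores: list[float]) -> dict:
    """Aggregate one metric across repeated eval runs."""
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

def repeat_eval(run_eval, n_runs: int = 5) -> dict:
    """Call run_eval() n_runs times and summarize the scores it returns."""
    return summarize_runs([run_eval() for _ in range(n_runs)])
```

A wide gap between `min` and `max` on the same test set usually says more about agent reliability than the mean does.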