12-Factor Agents Compliance Analysis
Reference: 12-Factor Agents
Input Parameters
| Parameter | Description | Required |
|---|---|---|
| docs_path | Path to documentation directory (for existing analyses) | Optional |
| codebase_path | Root path of the codebase to analyze | Required |
Analysis Framework
Factor 1: Natural Language to Tool Calls
Principle: Convert natural language inputs into structured, deterministic tool calls using schema-validated outputs.
Search Patterns:
# Look for Pydantic schemas
grep -r "class.*BaseModel" --include="*.py"
grep -r "TaskDAG\|TaskResponse\|ToolCall" --include="*.py"
# Look for JSON schema generation
grep -r "model_json_schema\|json_schema" --include="*.py"
# Look for structured output generation
grep -r "output_type\|response_model" --include="*.py"
File Patterns: **/agents/*.py, **/schemas/*.py, **/models/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | All LLM outputs use Pydantic/dataclass schemas with validators |
| Partial | Some outputs typed, but dict returns or unvalidated strings exist |
| Weak | LLM returns raw strings parsed manually or with regex |
Anti-patterns:
- `json.loads(llm_response)` without schema validation
- `output.split()` or regex parsing of LLM responses
- `dict[str, Any]` return types from agents
- No validation between LLM output and handler execution
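As an illustration of the validation step these anti-patterns skip, here is a minimal sketch that parses raw LLM text into a typed object before any handler sees it. It uses a standard-library dataclass rather than Pydantic to stay dependency-free; `ToolCall`, `ALLOWED_TOOLS`, `parse_tool_call`, and the tool names are hypothetical and not taken from any analyzed codebase.

```python
import json
from dataclasses import dataclass

ALLOWED_TOOLS = {"create_issue", "search_code"}  # hypothetical tool names


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical schema for one LLM-proposed tool call."""
    tool: str
    arguments: dict

    def __post_init__(self):
        # Validate on construction so an invalid call never reaches a handler.
        if not isinstance(self.tool, str) or self.tool not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")
        if not isinstance(self.arguments, dict):
            raise ValueError("arguments must be a JSON object")


def parse_tool_call(llm_response: str) -> ToolCall:
    """Parse raw LLM text into a validated ToolCall, or raise ValueError."""
    try:
        payload = json.loads(llm_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or set(payload) != {"tool", "arguments"}:
        raise ValueError("expected exactly the keys 'tool' and 'arguments'")
    return ToolCall(tool=payload["tool"], arguments=payload["arguments"])
```

A codebase rated Strong would typically do the equivalent with Pydantic models and validators; the point of the sketch is only that parsing and validation happen in one place, before execution.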
Factor 2: Own Your Prompts
Principle: Treat prompts as first-class code you control, version, and iterate on.
Search Patterns:
# Look for embedded prompts
grep -r "SYSTEM_PROMPT\|system_prompt" --include="*.py"
grep -r '""".*You are' --include="*.py"
# Look for template systems
grep -r "jinja\|Jinja\|render_template" --include="*.py"
find . -name "*.jinja2" -o -name "*.j2"
# Look for prompt directories
find . -type d -name "prompts"
File Patterns: **/prompts/**, **/templates/**, **/agents/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Prompts in separate files, templated (Jinja2), versioned |
| Partial | Prompts as module constants, some parameterization |
| Weak | Prompts hardcoded inline in functions, f-strings only |
Anti-patterns:
- `f"You are a {role}..."` inline in agent methods
- Prompts mixed with business logic
- No way to iterate on prompts without code changes
- No prompt versioning or A/B testing capability
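One way to move prompts out of agent code is to load them from versioned files and fill in parameters at call time. The sketch below assumes a hypothetical layout of `<name>.<version>.txt` files in a prompts directory and uses the standard library's `string.Template` as a stand-in for Jinja2; `render_prompt` and the file naming are illustrative only.

```python
from pathlib import Path
from string import Template


def render_prompt(prompts_dir: Path, name: str, version: str, **params: str) -> str:
    """Load prompts_dir/<name>.<version>.txt and fill its $placeholders."""
    template_path = prompts_dir / f"{name}.{version}.txt"
    template = Template(template_path.read_text(encoding="utf-8"))
    # substitute() raises KeyError on a missing parameter instead of silently
    # sending a half-filled prompt to the model.
    return template.substitute(**params)
```

With a file `architect.v1.txt` containing `You are a $role. Task: $task`, calling `render_prompt(prompts_dir, "architect", "v1", role="planner", task="triage")` returns the filled prompt, and a second version can be added as `architect.v2.txt` without touching agent code.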
Factor 3: Own Your Context Window
Principle: Control how history, state, and tool results are formatted for the LLM.
Search Patterns:
# Look for context/message management
grep -r "AgentMessage\|ChatMessage\|messages" --include="*.py"
grep -r "context_window\|context_compiler" --include="*.py"
# Look for custom serialization
grep -r "to_xml\|to_context\|serialize" --include="*.py"
# Look for token management
grep -r "token_count\|max_tokens\|truncate" --include="*.py"
File Patterns: **/context/*.py, **/state/*.py, **/core/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Custom context format, token optimization, typed events, compaction |
| Partial | Basic message history with some structure |
| Weak | Raw message accumulation, standard OpenAI format only |
Anti-patterns:
- Unbounded message accumulation
- Large artifacts embedded inline (diffs, files)
- No agent-specific context filtering
- Same context for all agent types
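A rough sketch of what owning the context window can look like: typed events are rendered into a compact tagged format, large artifacts are truncated, and only the newest events that fit a token budget are kept. The `Event` type, the tag format, and the four-characters-per-token estimate are all assumptions made for illustration; a real implementation would use the model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical typed event in an agent's history."""
    kind: str      # e.g. "user_message", "tool_result", "error"
    content: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_context(events: list[Event], max_tokens: int, max_event_chars: int = 200) -> str:
    """Render the newest events that fit the budget as compact tagged blocks."""
    rendered: list[str] = []
    used = 0
    for event in reversed(events):  # walk from newest to oldest
        content = event.content
        if len(content) > max_event_chars:
            # Truncate large artifacts instead of embedding them whole.
            content = content[:max_event_chars] + " [truncated]"
        block = f"<{event.kind}>\n{content}\n</{event.kind}>"
        cost = estimate_tokens(block)
        if used + cost > max_tokens:
            break  # budget exhausted: drop this event and everything older
        rendered.append(block)
        used += cost
    return "\n".join(reversed(rendered))  # restore chronological order
```

Agent-specific filtering could be layered on top by selecting which event kinds each agent type receives before calling `build_context`.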
Factor 4: Tools Are Structured Outputs
Principle: Tools produce schema-validated JSON that triggers deterministic code, not magic function calls.
Search Patterns:
# Look for tool/response schemas
grep -r "class.*Response.*BaseModel" --include="*.py"
grep -r "ToolResult\|ToolOutput" --include="*.py"
# Look for deterministic handlers
grep -r "def handle_\|def execute_" --include="*.py"
# Look for validation layer
grep -r "model_validate\|parse_obj" --include="*.py"
File Patterns: **/tools/*.py, **/handlers/*.py, **/agents/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | All tool outputs schema-validated, handlers type-safe |
| Partial | Most tools typed, some loose dict returns |
| Weak | Tools return arbitrary dicts, no validation layer |
Anti-patterns:
- Tool handlers that directly execute LLM output
- `eval()` or `exec()` on LLM-generated code
- No separation between decision (LLM) and execution (code)
- Magic method dispatch based on string matching
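The separation between decision and execution can be sketched as an explicit registry: the LLM only produces a data structure, and deterministic code checks it against a declared argument spec before calling a handler. The registry shape, tool names, and handlers below are hypothetical.

```python
from typing import Callable


def handle_add_label(issue_id: int, label: str) -> str:
    return f"labelled issue {issue_id} as {label}"


def handle_close_issue(issue_id: int) -> str:
    return f"closed issue {issue_id}"


# Hypothetical registry: tool name -> (expected argument types, handler).
REGISTRY: dict[str, tuple[dict[str, type], Callable[..., str]]] = {
    "add_label": ({"issue_id": int, "label": str}, handle_add_label),
    "close_issue": ({"issue_id": int}, handle_close_issue),
}


def execute_tool_call(tool_call: dict) -> str:
    """Validate an LLM-proposed tool call, then run its deterministic handler."""
    name = tool_call.get("tool")
    if not isinstance(name, str) or name not in REGISTRY:
        raise ValueError(f"unknown tool: {name!r}")
    spec, handler = REGISTRY[name]
    arguments = tool_call.get("arguments")
    if not isinstance(arguments, dict) or set(arguments) != set(spec):
        raise ValueError(f"{name} expects exactly the arguments {sorted(spec)}")
    for key, expected_type in spec.items():
        if not isinstance(arguments[key], expected_type):
            raise ValueError(f"{name}.{key} must be {expected_type.__name__}")
    return handler(**arguments)
```

Nothing the model emits is evaluated as code; an unknown tool name or a malformed argument fails validation instead of reaching a handler.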
Factor 5: Unify Execution State
Principle: Merge execution state (step, retries) with business state (messages, results).
Search Patterns:
# Look for state models
grep -r "ExecutionState\|WorkflowState\|Thread" --include="*.py"
# Look for dual state systems
grep -r "checkpoint\|MemorySaver" --include="*.py"
grep -r "sqlite\|database\|repository" --include="*.py"
# Look for state reconstruction
grep -r "load_state\|restore\|reconstruct" --include="*.py"
File Patterns: **/state/*.py, **/models/*.py, **/database/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Single serializable state object with all execution metadata |
| Partial | State exists but split across systems (memory + DB) |
| Weak | Execution state scattered, requires multiple queries to reconstruct |
Anti-patterns:
- Retry count stored separately from task state
- Error history in logs but not in state
- LangGraph checkpoints + separate database storage
- No unified event thread
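For contrast with the split-state anti-patterns, a unified state object might look like the following sketch, where step counters, retry counts, and the event history live in one serializable dataclass. The `Thread` name and its fields are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Thread:
    """Hypothetical single state object: business events plus execution metadata."""
    task_id: str
    step: int = 0
    retry_count: int = 0
    status: str = "running"
    events: list[dict] = field(default_factory=list)

    def to_json(self) -> str:
        # One serialized blob is enough to reconstruct the whole run.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Thread":
        return cls(**json.loads(raw))
```

Because errors and retries are recorded in the same object as messages and results, restoring a run requires one read rather than a join across logs, checkpoints, and a database.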
Factor 6: Launch/Pause/Resume
Principle: Agents support simple APIs for launching, pausing at any point, and resuming.
Search Patterns:
# Look for REST endpoints
grep -r "@router.post\|@app.post" --include="*.py"
grep -r "start_workflow\|pause\|resume" --include="*.py"
# Look for interrupt mechanisms
grep -r "interrupt_before\|interrupt_after" --include="*.py"
# Look for webhook handlers
grep -r "webhook\|callback" --include="*.py"
File Patterns: **/routes/*.py, **/api/*.py, **/orchestrator/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | REST API + webhook resume, pause at any point including mid-tool |
| Partial | Launch/pause/resume exists but only at coarse-grained points |
| Weak | CLI-only launch, no pause/resume capability |
Anti-patterns:
- Blocking `input()` or `confirm()` calls
- No way to resume after process restart
- Approval only at plan level, not per-tool
- No webhook-based resume from external systems
Factor 7: Contact Humans with Tools
Principle: Human contact is a tool call with question, options, and urgency.
Search Patterns:
# Look for human input mechanisms
grep -r "typer.confirm\|input(\|prompt(" --include="*.py"
grep -r "request_human_input\|human_contact" --include="*.py"
# Look for approval patterns
grep -r "approval\|approve\|reject" --include="*.py"
# Look for structured question formats
grep -r "question.*options\|HumanInputRequest" --include="*.py"
File Patterns: **/agents/*.py, **/tools/*.py, **/orchestrator/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | request_human_input tool with question/options/urgency/format |
| Partial | Approval gates exist but hardcoded in graph structure |
| Weak | Blocking CLI prompts, no tool-based human contact |
Anti-patterns:
- `typer.confirm()` in agent code
- Human contact hardcoded at specific graph nodes
- No way for agents to ask clarifying questions
- Single response format (yes/no only)
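The Strong level describes a `request_human_input` tool carrying question, options, urgency, and format. A possible shape for that tool is sketched below; the field names follow the criteria table, while the allowed urgency levels, the response formats, and the `outbox` delivery queue are assumptions.

```python
from dataclasses import asdict, dataclass

URGENCY_LEVELS = ("low", "normal", "high")                   # assumed levels
RESPONSE_FORMATS = ("free_text", "single_choice", "yes_no")  # assumed formats


@dataclass(frozen=True)
class HumanInputRequest:
    question: str
    options: tuple[str, ...] = ()
    urgency: str = "normal"
    response_format: str = "free_text"

    def __post_init__(self):
        if self.urgency not in URGENCY_LEVELS:
            raise ValueError(f"urgency must be one of {URGENCY_LEVELS}")
        if self.response_format not in RESPONSE_FORMATS:
            raise ValueError(f"response_format must be one of {RESPONSE_FORMATS}")
        if self.response_format == "single_choice" and not self.options:
            raise ValueError("single_choice requests need at least one option")


def request_human_input(request: HumanInputRequest, outbox: list[dict]) -> dict:
    """Tool handler: queue the question for delivery and return immediately."""
    outbox.append(asdict(request))
    # The agent loop sees a normal tool result and can pause, not block.
    return {"type": "awaiting_human", "question": request.question}
```

Because the request is just another tool call, the agent can ask a clarifying question at any step rather than only at graph nodes where an approval gate was hardcoded.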
Factor 8: Own Your Control Flow
Principle: Custom control flow, not framework defaults. Full control over routing, retries, compaction.
Search Patterns:
# Look for routing logic
grep -r "add_conditional_edges\|route_\|should_continue" --include="*.py"
# Look for custom loops
grep -r "while True\|for.*in.*range" --include="*.py" | grep -v test
# Look for execution mode control
grep -r "execution_mode\|agentic\|structured" --include="*.py"
File Patterns: **/orchestrator/*.py, **/graph/*.py, **/core/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Custom routing functions, conditional edges, execution mode control |
| Partial | Framework control flow with some customization |
| Weak | Default framework loop with no custom routing |
Anti-patterns:
- Single path through graph with no branching
- No distinction between tool types (all treated same)
- Framework-default error handling only
- No rate limiting or resource management
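To make "owning control flow" concrete, the sketch below shows a hand-written agent loop that routes each proposed step by intent: low-risk tools run inline, high-risk intents break out of the loop for approval, and a step limit bounds the run. The `decide` callable stands in for the LLM, and `HIGH_RISK_INTENTS`, the event shapes, and the intent names are hypothetical.

```python
from typing import Callable

HIGH_RISK_INTENTS = frozenset({"deploy"})  # hypothetical: intents needing approval


def run_agent(
    decide: Callable[[list[dict]], dict],
    tools: dict[str, Callable[[dict], str]],
    max_steps: int = 10,
) -> list[dict]:
    """Custom loop: route each step by intent instead of a framework default."""
    events: list[dict] = []
    for _ in range(max_steps):
        step = decide(events)
        intent = step.get("intent")
        if intent == "done":
            events.append({"type": "done", "result": step.get("result")})
            return events
        if intent in HIGH_RISK_INTENTS:
            # Break out of the loop; a later resume continues from these events.
            events.append({"type": "awaiting_approval", "step": step})
            return events
        handler = tools.get(intent)
        if handler is None:
            # Feed the problem back to the model rather than crashing.
            events.append({"type": "error", "message": f"unknown intent: {intent!r}"})
            continue
        events.append({"type": "tool_result", "tool": intent, "content": handler(step)})
    events.append({"type": "step_limit_reached", "max_steps": max_steps})
    return events
```

Rate limiting, compaction, or per-tool retry policies would slot into the same loop, which is the main argument for writing it by hand.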
Factor 9: Compact Errors into Context
Principle: Errors in context enable self-healing. Track consecutive errors, escalate after threshold.
Search Patterns:
# Look for error handling
grep -r "except.*Exception\|error_history\|consecutive_errors" --include="*.py"
# Look for retry logic
grep -r "retry\|backoff\|max_attempts" --include="*.py"
# Look for escalation
grep -r "escalate\|human_escalation" --include="*.py"
File Patterns: **/agents/*.py, **/orchestrator/*.py, **/core/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Errors in context, retry with threshold, automatic escalation |
| Partial | Errors logged and returned, no automatic retry loop |
| Weak | Errors logged only, not fed back to LLM, task fails immediately |
Anti-patterns:
- `logger.error()` without adding to context
- No retry mechanism (fail immediately)
- No consecutive error tracking
- No escalation to humans after repeated failures
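The Strong behaviour (errors in context, retry with a threshold, then escalation) can be sketched as follows. `propose` stands in for the LLM choosing an action from the current context and `execute` for the tool layer; both, along with the event shapes and the default threshold of three, are assumptions for illustration.

```python
from typing import Callable


def run_step_with_self_healing(
    propose: Callable[[list[dict]], dict],
    execute: Callable[[dict], str],
    context: list[dict],
    max_consecutive_errors: int = 3,
) -> dict:
    """Retry a step, feeding compact errors back into context, then escalate."""
    consecutive_errors = 0
    while consecutive_errors < max_consecutive_errors:
        action = propose(context)  # the model sees earlier errors in context
        try:
            result = execute(action)
        except Exception as exc:
            consecutive_errors += 1
            # Compact form: exception type plus a short message, no traceback.
            message = f"{type(exc).__name__}: {str(exc)[:200]}"
            context.append({"type": "error", "message": message})
            continue
        context.append({"type": "tool_result", "content": result})
        return {"status": "ok", "result": result}
    context.append({"type": "escalated", "reason": f"{consecutive_errors} consecutive errors"})
    return {"status": "escalated"}
```

The escalation event is where a human-contact tool would normally be invoked; here it is only recorded so the sketch stays self-contained.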
Factor 10: Small, Focused Agents
Principle: Each agent has narrow responsibility, 3-10 steps max.
Search Patterns:
# Look for agent classes
grep -r "class.*Agent\|class.*Architect\|class.*Developer" --include="*.py"
# Look for step definitions
grep -r "steps\|tasks" --include="*.py" | head -20
# Count methods per agent
grep -r "async def\|def " agents/*.py 2>/dev/null | wc -l
File Patterns: **/agents/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | 3+ specialized agents, each with single responsibility, step limits |
| Partial | Multiple agents but some have broad scope |
| Weak | Single "god" agent that handles everything |
Anti-patterns:
- Single agent with 20+ tools
- Agent with unbounded step count
- Mixed responsibilities (planning + execution + review)
- No step or time limits on agent execution
Factor 11: Trigger from Anywhere
Principle: Workflows triggerable from CLI, REST, WebSocket, Slack, webhooks, etc.
Search Patterns:
# Look for entry points
grep -r "@cli.command\|@router.post\|@app.post" --include="*.py"
# Look for WebSocket support
grep -r "WebSocket\|websocket" --include="*.py"
# Look for external integrations
grep -r "slack\|discord\|webhook" --include="*.py" -i
File Patterns: **/routes/*.py, **/cli/*.py, **/main.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | CLI + REST + WebSocket + webhooks + chat integrations |
| Partial | CLI + REST API available |
| Weak | CLI only, no programmatic access |
Anti-patterns:
- Only `if __name__ == "__main__"` entry point
- No REST API for external systems
- No event streaming for real-time updates
- Trigger logic tightly coupled to execution
Factor 12: Stateless Reducer
Principle: Agents as pure functions: (state, input) -> (state, output). No side effects in agent logic.
Search Patterns:
# Look for state mutation patterns
grep -r "\.status = \|\.field = " --include="*.py"
# Look for immutable updates
grep -r "model_copy\|\.copy(\|with_" --include="*.py"
# Look for side effects in agents
grep -r "write_file\|subprocess\|requests\." agents/*.py 2>/dev/null
File Patterns: **/agents/*.py, **/nodes/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Immutable state updates, side effects isolated to tools/handlers |
| Partial | Mostly immutable, some in-place mutations |
| Weak | State mutated in place, side effects mixed with agent logic |
Anti-patterns:
- `state.field = new_value` (mutation)
- File writes inside agent methods
- HTTP calls inside agent decision logic
- Shared mutable state between agents
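The `(state, input) -> (state, output)` shape can be illustrated with a frozen dataclass and `dataclasses.replace`. In this sketch the agent step returns a new state plus a list of side effects described as data, leaving it to a handler outside the agent to perform them; `AgentState`, `agent_step`, and the effect format are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentState:
    status: str = "pending"
    events: tuple[str, ...] = ()


def agent_step(state: AgentState, event: str) -> tuple[AgentState, list[dict]]:
    """Pure reducer: returns a new state and effects for the caller to run."""
    new_state = replace(state, events=state.events + (event,))
    effects: list[dict] = []
    if event == "plan_approved":
        new_state = replace(new_state, status="executing")
        # The file write is only described; a handler outside the agent does it.
        effects.append({"effect": "write_file", "path": "PLAN.md"})
    return new_state, effects
```

Because the input state is never modified, the same `(state, event)` pair always yields the same result, which makes agent steps straightforward to replay and unit test.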
Factor 13: Pre-fetch Context
Principle: Fetch likely-needed data upfront rather than mid-workflow.
Search Patterns:
# Look for context pre-fetching
grep -r "pre_fetch\|prefetch\|fetch_context" --include="*.py"
# Look for RAG/embedding systems
grep -r "embedding\|vector\|semantic_search" --include="*.py"
# Look for related file discovery
grep -r "related_tests\|similar_\|find_relevant" --include="*.py"
File Patterns: **/context/*.py, **/retrieval/*.py, **/rag/*.py
Compliance Criteria:
| Level | Criteria |
|---|---|
| Strong | Automatic pre-fetch of related tests, files, docs before planning |
| Partial | Manual context passing, design doc support |
| Weak | No pre-fetching, LLM must request all context via tools |
Anti-patterns:
- Architect starts with issue only, no codebase context
- No semantic search for similar past work
- Related tests/files discovered only during execution
- No RAG or document retrieval system
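A very small sketch of pre-fetching: before planning starts, the files an issue is expected to touch are bundled together with their related tests, so the model does not have to discover them through tool calls. The `test_<module>.py` naming convention, the function names, and the injected `read_file` callable are assumptions; semantic search or RAG would replace the name matching in a fuller system.

```python
from pathlib import PurePosixPath
from typing import Callable


def find_related_tests(target_files: list[str], all_files: list[str]) -> list[str]:
    """Match tests by the assumed naming convention test_<module>.py."""
    wanted = {f"test_{PurePosixPath(path).stem}.py" for path in target_files}
    return sorted(path for path in all_files if PurePosixPath(path).name in wanted)


def prefetch_context(
    issue: str,
    target_files: list[str],
    all_files: list[str],
    read_file: Callable[[str], str],
) -> dict:
    """Bundle the issue with likely-needed file contents before planning."""
    related = find_related_tests(target_files, all_files)
    paths = [*target_files, *related]
    return {"issue": issue, "files": {path: read_file(path) for path in paths}}
```

Passing `read_file` in as a parameter keeps the lookup logic free of I/O, so it can be tested without touching the filesystem.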
Output Format
Executive Summary Table
| Factor | Status | Notes |
|--------|--------|-------|
| 1. Natural Language -> Tool Calls | **Strong/Partial/Weak** | [Key finding] |
| 2. Own Your Prompts | **Strong/Partial/Weak** | [Key finding] |
| ... | ... | ... |
| 13. Pre-fetch Context | **Strong/Partial/Weak** | [Key finding] |

**Overall**: X Strong, Y Partial, Z Weak
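If per-factor results are collected programmatically, the summary table and the overall tally can be generated rather than written by hand. The sketch below assumes results arrive as `(factor name, level, key finding)` tuples in factor order; `render_summary` is a hypothetical helper, not part of any existing tool.

```python
from collections import Counter

LEVELS = ("Strong", "Partial", "Weak")


def render_summary(results: list[tuple[str, str, str]]) -> str:
    """Render (factor name, level, key finding) tuples as the summary table."""
    lines = ["| Factor | Status | Notes |", "|--------|--------|-------|"]
    counts: Counter = Counter()
    for number, (name, level, note) in enumerate(results, start=1):
        if level not in LEVELS:
            raise ValueError(f"unknown compliance level: {level!r}")
        counts[level] += 1
        lines.append(f"| {number}. {name} | **{level}** | {note} |")
    overall = ", ".join(f"{counts[level]} {level}" for level in LEVELS)
    lines.append("")
    lines.append(f"**Overall**: {overall}")
    return "\n".join(lines)
```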
Per-Factor Analysis
For each factor, provide:
- Current Implementation
  - Evidence with file:line references
  - Code snippets showing patterns
- Compliance Level
  - Strong/Partial/Weak with justification
- Gaps
  - What's missing vs. 12-Factor ideal
- Recommendations
  - Actionable improvements with code examples
Analysis Workflow
- Initial Scan
  - Run search patterns for all factors
  - Identify key files for each factor
  - Note any existing compliance documentation
- Deep Dive (per factor)
  - Read identified files
  - Evaluate against compliance criteria
  - Document evidence with file paths
- Gap Analysis
  - Compare current vs. 12-Factor ideal
  - Identify anti-patterns present
  - Prioritize by impact
- Recommendations
  - Provide actionable improvements
  - Include before/after code examples
  - Reference the roadmap if one exists
- Summary
  - Compile executive summary table
  - Highlight strengths and critical gaps
  - Suggest priority order for improvements
Quick Reference: Compliance Scoring
| Score | Meaning | Action |
|---|---|---|
| Strong | Fully implements principle | Maintain, minor optimizations |
| Partial | Some implementation, significant gaps | Planned improvements |
| Weak | Minimal or no implementation | High priority for roadmap |
When to Use This Skill
- Evaluating new LLM-powered systems
- Reviewing agent architecture decisions
- Auditing production agentic applications
- Planning improvements to existing agents
- Comparing frameworks or implementations