Architecture Documentation
Overview
Generates in-depth technical architecture documentation from codebases. Produces engineer-focused documentation with system diagrams, data flow analysis, component deep dives, and architectural decision rationale.
Core principle: Depth over breadth. Technical rigor over high-level summaries.
When to Use
- •User provides a codebase and asks for architecture documentation
- •User requests system design documentation
- •User needs technical documentation for handoff or onboarding
- •User asks to document architectural decisions
- •User needs diagrams showing system structure and data flow
Workflow Checklist
Copy this checklist and check off items as you complete them:
Architecture Documentation Progress: - [ ] Phase 1: Codebase exploration (structure, entry points, dependencies) - [ ] Phase 2: Components identified (services, modules, databases) - [ ] Phase 3: Data flow traced (request lifecycle, transformations) - [ ] Phase 4: Business context extracted (README, comments, code) - [ ] Phase 5: Documentation generated following structure below - [ ] Phase 6: Diagrams created with PlantUML syntax (rendered via Kroki) - [ ] Phase 7: Engineering analysis complete (all "why" questions answered) - [ ] Phase 8: Quality validation passed
Document Structure
Follow this structure (see example-output.pdf for full reference):
Required Sections
- •
Abstract
- •Formal research paper abstract after table of contents
- •Delineates system purpose, architecture approach, key technologies
- •Written in formal tone
- •
Context & Scope
- •Business goals, stakeholders
- •System context diagram (PlantUML via Kroki)
- •
Architecture Constraints & Principles
- •Why this approach? Immutable rules
- •
High-Level Architecture
- •Container diagram showing major components
- •Data flow walkthrough with transformations (Input → Output at each stage)
- •
Component Deep Dives
- •Component Responsibility Matrix: Table summarizing all components (see template below)
- •Individual Component Sections (repeat for each component):
- •Purpose: One sentence
- •Implementation Details: Stack, algorithms, dependencies (with WHY chosen)
- •Engineering Analysis: Trade-offs, configuration rationale, edge cases
- •Component diagram if complex
- •
Cross-Cutting Concerns
- •Observability (logging, metrics, tracing)
- •Failure modes & recovery
- •Deployment & infrastructure
- •
Decision Log (ADRs)
- •Major decisions with context and consequences
Optional Appendices
Appendix A: Technology Stack Summary
- •Table organized by category (Backend, AI/ML, Data Storage, Infrastructure, etc.)
- •Columns: Technology | Version | Purpose | Architectural Layer
- •Quick reference for all technologies used
Appendix B: API Endpoint Reference
- •Complete endpoint documentation
- •For each endpoint: Method, Path, Auth requirements, Request/Response schemas
- •Include streaming event types if applicable
- •Error response codes and formats
Template format:
## 4. Component Deep Dives ### Component Responsibility Matrix | Component | Primary Responsibility | Key Dependencies | Input/Output | Failure Modes | Recovery Strategies | |-----------|----------------------|------------------|--------------|---------------|--------------------| | [Name] | What it does (1 sentence) | Services/DBs it needs | What goes in → What comes out | How it breaks | How it recovers | | [Name] | ... | ... | ... | ... | ... | ### 4.1 [Component Name] [PlantUML diagram of internal logic] **Purpose:** One sentence summary. **Implementation Details (The "How"):** - **Stack:** Technologies used - **Key Algorithms:** How does it work? - **Dependencies:** Libraries/services with citation (why chosen over alternatives) **Engineering Analysis (The "Why"):** - **Trade-offs:** Why this approach? What was rejected and why? - **Configuration:** Why these specific settings? (timeouts, limits, buffer sizes) - **State Management:** Stateless or stateful? Where persisted? How consistent? - **Edge Cases:** What errors are handled? Retry logic? Failure modes?
Workflow
Phase 1: Codebase Exploration
Determine documentation type first:
- •Creating brand new documentation? → Follow complete workflow below
- •Updating existing documentation? → Read existing docs first, update changed sections only, validate updates
For new documentation:
- •
Understand project structure:
- •Read package.json, requirements.txt, go.mod, Cargo.toml (dependency files)
- •Identify main entry points (main.py, index.js, main.go, etc.)
- •Map out directory structure
- •
Identify components:
- •Find services, modules, packages
- •Identify databases, message queues, external APIs
- •Map dependencies between components
- •
Analyze data flow:
- •Trace request lifecycle from entry to response
- •Document transformations at each stage
- •Capture exact payload examples when possible
Phase 2: Documentation Generation
- •
Abstract:
- •Write formal research paper abstract
- •Delineate system purpose, architectural approach, key technologies
- •Example: "This document delineates the architectural design of [System Name], a cloud-native platform engineered to [purpose]. Leveraging [technologies], the system [key approach] to deliver [outcomes]. The architecture adheres to the C4 model, decomposing abstractions from high-level system context to granular component implementation."
- •
Business Context:
- •Extract from README, comments, or infer from code
- •Identify stakeholders (who uses this?)
- •
System Context Diagram:
- •Create PlantUML diagram using C4 model or cloud icon macros (see kroki-syntax.md)
- •Show: system as a box, external actors (users, services, databases), connections
- •
High-Level Architecture:
- •Create C4 Container diagram showing major components
- •Document data flow with concrete example ("hero scenario")
- •Show transformations: Input → Output at each stage
- •
Component Deep Dives:
- •Create Component Responsibility Matrix first:
- •Table with columns: Component | Primary Responsibility | Key Dependencies | Input/Output | Failure Modes | Recovery Strategies
- •One row per major component
- •Provides quick reference for all components before detailed sections
- •For each major component:
- •Purpose (one sentence)
- •Implementation details (stack, algorithms, dependencies)
- •Engineering analysis (WHY this way, trade-offs, configuration rationale)
- •Create component-level diagram if complex
- •Create Component Responsibility Matrix first:
- •
Cross-Cutting Concerns:
- •Document observability approach
- •Identify failure modes from code (error handling, retries)
- •Extract deployment configuration
- •
Decision Log:
- •Document WHY decisions were made
- •Include context and consequences
- •
Optional Appendices (if applicable):
- •Technology Stack Summary: Extract all technologies from dependency files and component details; organize by category
- •API Endpoint Reference: Document public/internal APIs with request/response schemas from code
Phase 3: Diagram Generation
For each diagram, generate PlantUML code using cloud icon macros (see kroki-syntax.md and icon-reference.md):
Wrap all diagrams in PlantUML code blocks:
```plantuml @startuml System Context !include <awslib/AWSCommon> !include <awslib/General/Users> !include <awslib/Compute/Lambda> !include <awslib/Database/Aurora> left to right direction Users(users, "End Users", "") Lambda(api, "API Service", "Node.js 20") Aurora(db, "Database", "PostgreSQL 15") users --> api : HTTPS/JSON api --> db : SQL Queries @enduml ```
Use the correct Kroki endpoint for the diagram type:
- •PlantUML with cloud icons:
POST https://kroki.io/plantuml/svg - •C4 model diagrams:
POST https://kroki.io/c4plantuml/svg
Reference materials for diagrams:
- •kroki-syntax.md - Complete PlantUML syntax reference, Kroki API usage, C4 model macros
- •icon-reference.md - Cloud icon catalog (AWS 900+, Azure 400+, GCP, K8s macros)
- •diagram-examples.md - Real-world examples with complete PlantUML code
Engineering Analysis Requirements
For each component, answer ALL these questions with specifics (not vague statements):
Trade-offs
- •What alternatives were considered?
- •Why was this approach chosen over alternatives?
- •What are the downsides of this choice?
Configuration
- •Why these specific values? (timeouts, limits, buffer sizes)
- •What happens if misconfigured?
- •How were these values determined? (SLAs, benchmarks, constraints)
State Management
- •Stateless or stateful?
- •Where is state persisted?
- •How is consistency maintained?
- •What happens on restart?
Edge Cases
- •What errors are explicitly handled?
- •What retry logic exists?
- •What are the failure modes?
- •How does the system recover?
Documentation Quality Validation
After generating documentation, validate immediately using this checklist:
1. Structural Validation
- • All required sections present (Context, Constraints, Architecture, Components, Cross-Cutting, Decisions)
- • Component Responsibility Matrix table present at start of Component Deep Dives section
- • Matrix includes all major components with all required columns (Component, Primary Responsibility, Key Dependencies, Input/Output, Failure Modes, Recovery Strategies)
- • Each component has Purpose, Implementation Details, Engineering Analysis
- • Data flow walkthrough includes Input → Output transformations
- • Diagrams use correct PlantUML syntax (verified against kroki-syntax.md)
- • All diagram code blocks wrapped in
@startuml/@endumlwithin ```plantuml fences
2. Depth Validation (Critical)
- • Every technical decision has "why" explanation (not just "what")
- • Trade-offs documented (what alternatives were rejected and why)
- • Configuration values justified (why these specific settings, how determined)
- • Failure modes documented (what breaks, how system recovers)
- • Concrete examples included (real payloads, actual code snippets, exact transformations)
- • Library choices explained (why this library over alternatives, with specifics)
3. Diagram Validation
- • Icons match actual technologies (checked against icon-reference.md)
- • Connections labeled with what flows (data format, protocol)
- • Groups show logical boundaries (VPCs, subnets, services)
- • Direction set appropriately (
left to right direction/top to bottom direction)
4. Example Quality Check
- • Payload examples show exact JSON/data structures
- • Transformations show before AND after
- • Code snippets include file paths (e.g., "in auth.py:42")
- • No vague terms ("handles requests", "processes data")
If validation fails:
- •Note each gap with specific section reference
- •Add missing content
- •Run validation checklist again
- •Only finalize when all checks pass
Good vs. Bad Examples
Bad Example - Vague and Shallow
### API Gateway **Purpose:** Handles API requests. **Implementation:** - Uses FastAPI - Routes requests to backend services **Why:** It's fast and easy to use.
Problems: No WHY for library choice, no trade-offs, no configuration rationale, no edge cases, generic statements.
Good Example - Component Responsibility Matrix
### Component Responsibility Matrix | Component | Primary Responsibility | Key Dependencies | Input/Output | Failure Modes | Recovery Strategies | |-----------|----------------------|------------------|--------------|---------------|--------------------| | **Query Router** | Routes user queries to specialized agents via LLM intent classification | Azure OpenAI (gpt-5-mini) | User query (string) → Agent selection + args (JSON) | LLM fails to select tool, API timeout, non-English input | Fall back to default agent; log exceptions to App Insights; reject non-English with fixed response | | **PostgreSQL DB** | Persistent storage for user profiles, query history, embeddings | PostGIS extension, pgvector | SQL queries → Result sets | Connection pool exhaustion, deadlocks, disk full | Auto-reconnect with exponential backoff; query timeout (30s); read replicas for queries | | **Redis Cache** | Session state, rate limiting, hot data caching | None (standalone) | Key-value GET/SET → Cached data or miss | Cache miss, eviction, connection failure | Graceful degradation to DB; 5-minute TTL; connection retry (3x with backoff) |
What makes this good: Concise summary of each component's role, dependencies, data flow, failure scenarios, and recovery—providing quick reference before detailed sections.
Good Example - Detailed Component Section
### Query Router (stream.py)
**Purpose:** Routes user queries to appropriate specialized agent based on query intent.
**Implementation Details:**
- **Stack:** Python 3.11, FastAPI, AsyncAzureOpenAI v1.x
- **Key Algorithm:** LLM-based intent classification via function calling
- **Dependencies:**
- `openai==1.x`: Official SDK with native async streaming; type-safe; actively maintained. Chosen over `langchain` for direct control and lower overhead.
- `httpx==0.x`: Required for custom timeout configuration; `trust_env=False` prevents proxy interference
**Engineering Analysis:**
- **Trade-offs:**
- LLM-based router vs keyword regex: Regex is 10x faster but brittle and fails on paraphrased queries. LLM handles ambiguity and phrasing variations with 95%+ accuracy vs 60% for regex in testing.
- Model Selection (gpt-5-mini): Classification is simpler than generation. Mini model offers 10x cost reduction ($0.15 vs $1.50 per 1M tokens) and lower latency (~200ms vs ~800ms) compared to gpt-4o, without sacrificing routing accuracy (both achieved 96% in our test set).
- **Configuration:**
- `timeout_keep_alive=300s`: LLM responses can take 30-60s for complex queries; 5-minute keep-alive prevents client disconnect during long-running requests. Determined from p95 latency metrics showing 45s max.
- `httpx.Timeout(read=30.0)`: 30s read timeout balances allowing slow responses and failing fast on stalled connections. Based on upstream SLA of 25s + 5s buffer.
- `max_retries=0`: Streaming responses cannot be retried mid-stream; client must handle retry. Retrying would cause duplicate partial responses.
- **State Management:** Stateless component; all routing decisions made per-request from query content alone. No session state persisted.
- **Edge Cases:**
- No Tool Selection: If LLM fails to select tool (< 0.1% of requests), system falls back to direct response with default agent.
- Execution Exceptions: All tool exceptions caught, logged with full traceback to Application Insights, and returned to user via friendly SSE error message (HTTP 200 with error type in stream).
- Non-English Input: Intercepted immediately via Router LLM system prompt ("Only process English queries"). Returns fixed fallback response without further LLM calls, preventing unnecessary inference costs for unsupported languages.
What makes this good: Specific numbers, alternatives considered, trade-off analysis, configuration rationale linked to requirements, concrete failure scenarios.
Depth Focus Areas
Prioritize technical depth in:
- •Data Transformations - Show exact Input → Output at each stage
- •Library Choices - Document WHY chosen (performance numbers, features, alternatives rejected)
- •Configuration Rationale - Explain WHY each value (link to SLAs, benchmarks, constraints)
- •Failure Handling - Document retry logic, fallbacks, circuit breakers with specific thresholds
- •Performance Decisions - Buffer sizes, connection pools, cache strategies with justification
- •Security Measures - Auth, encryption, validation, rate limiting with rationale
Writing Style: Research Paper Tone
Adopt formal language throughout:
Formal Vocabulary:
- •"sends" → "transmits"
- •"uses" → "employs/utilizes"
- •"shows" → "depicts/delineates/illustrates"
- •"allows" → "permits/enables"
- •"handles" → "accommodates/addresses"
- •"creates" → "instantiates"
- •"needs" → "requires"
- •"gets" → "retrieves"
- •Expand all contractions ("doesn't" → "does not")
Diagram Presentation:
- •Remove "What It Shows" bullet lists (diagrams are self-explanatory)
- •Figure labels: Use
**Figure X: Title**only (no descriptive subtitle) - •Remove explicit "How It Works:" headers - numbered explanations flow naturally after figure
- •Use numerals (1, 2, 3) instead of "Step 1", "Step 2", "Step 3"
Trade-offs Format:
- •Write trade-offs in prose format with detailed examples and specific numbers
- •Each trade-off: decision → alternatives considered → rationale with metrics
- •Example: "gpt-4o for code generation over gpt-5-mini. Code generation requires stronger reasoning capabilities. gpt-4o demonstrates significantly better performance on code-related tasks, justifying the higher token cost ($X vs $Y per 1M tokens)."
- •Keep Configuration Rationale and External Libraries as tables (not prose)
Header Usage:
- •Use headers sparingly
- •Prefer numbered lists for processes
- •Keep content concise and flowing
Common Mistakes to Avoid
Don't:
- •Write high-level summaries without technical details
- •Skip the "why" behind decisions
- •Generate generic diagrams without real component names
- •Document WHAT without explaining WHY
- •Use vague or informal language ("handles requests", "processes data", "improves performance")
- •Include time-sensitive information ("If doing this before August 2025...")
- •Mix terminology (don't alternate "API endpoint", "URL", "route", "path" - pick one)
- •Use "Step" terminology or "How It Works:" headers
- •Format trade-offs as tables
- •Add "What It Shows" sections to diagrams
Do:
- •Include concrete examples (exact payloads, actual code snippets with line numbers)
- •Write trade-offs in prose with specific numbers and alternatives considered
- •Use formal language throughout
- •Use real component names, library versions, specific technologies
- •Document failure scenarios and recovery mechanisms
- •Show data transformations with before/after examples
- •Justify every configuration value
- •Use consistent terminology throughout
- •Let diagram explanations flow naturally with numbered points
Reference Materials
This skill includes comprehensive reference materials:
- •kroki-syntax.md - Complete PlantUML syntax reference, Kroki API usage, C4 model macros, cloud icon includes
- •icon-reference.md - Cloud icon catalog (AWS 900+, Azure 400+, GCP, K8s with exact macro signatures)
- •diagram-examples.md - 10 real-world diagram examples with complete PlantUML code
- •example-output.pdf - Gold standard example of expected documentation quality
When generating diagrams, reference these files to:
- •Find correct cloud icon macros and include paths (use icon-reference.md)
- •Learn PlantUML syntax for groups, connections, C4 model, styling (use kroki-syntax.md)
- •See real-world patterns and examples (use diagram-examples.md)
Kroki Diagram Integration
After generating documentation:
- •Extract all PlantUML diagram code blocks (wrapped in
@startuml/@enduml) - •Use the
render-kroki-diagrams.jsscript to render via Kroki API to SVG/PNG - •Optionally replace code blocks with embedded images in final document
# Render all diagrams from a markdown file ./render-kroki-diagrams.js Architecture.md --format svg # Replace code blocks with image references ./render-kroki-diagrams.js Architecture.md --format svg --replace # Use self-hosted Kroki ./render-kroki-diagrams.js Architecture.md --base-url http://localhost:8000
The script auto-detects C4 diagrams and routes them to the /c4plantuml/ endpoint.
Optional Appendices Templates
Appendix A: Technology Stack Summary
## Appendix A: Technology Stack Summary | Category | Technology | Version | Purpose | Layer | |----------|-----------|---------|---------|-------| | **Backend** | FastAPI | 0.104.x | HTTP framework, async routing | Application | | **AI/ML** | Azure OpenAI | gpt-4o, gpt-5-mini | LLM inference, intent classification | AI Service | | **Data Storage** | PostgreSQL | 15.x | Persistent storage, user profiles | Data | | **Observability** | Application Insights | Latest | APM, distributed tracing | Infrastructure |
Appendix B: API Endpoint Reference
## Appendix B: API Endpoint Reference
### POST /api/chat
**Purpose:** Stream chat responses with agent routing
**Authentication:** Bearer JWT (HS256)
**Request:**
```json
{
"query": "What is the status of order #12345?",
"user_id": "user_abc123"
}
Response: Server-Sent Events (SSE)
Event Types:
- •
agent_selected:{"agent": "order_lookup", "args": {...}} - •
content_delta:{"delta": "The order status is..."} - •
done:{"finish_reason": "stop"}
Error Responses:
- •
401 Unauthorized: Invalid/missing JWT - •
429 Too Many Requests: Rate limit exceeded
## Output Format Generate a single markdown file with: - All sections from the template structure - PlantUML diagram code blocks (wrapped in `@startuml`/`@enduml` within ```plantuml fences) - Cloud icon macros matching actual technologies (AWS/Azure/GCP icons from icon-reference.md) - Inline code examples where relevant (with file paths and line numbers) - Tables for configuration rationale, trade-offs analysis - Concrete examples throughout - Exact payload transformations showing before/after - Optional appendices if system has APIs or uses multiple technologies ## Final Checklist Before finalizing documentation: - [ ] All 8 checklist phases completed - [ ] All validation checks passed - [ ] Every component has detailed engineering analysis - [ ] All diagrams use correct PlantUML syntax with proper cloud icon macros - [ ] No vague statements (all specifics with numbers, examples) - [ ] Consistent terminology used throughout - [ ] Trade-offs explained for major decisions - [ ] Configuration values justified - [ ] Failure modes documented - [ ] Real examples included (payloads, code with line numbers)