# Infrastructure Documenter Skill

## Overview

This skill helps you create clear, maintainable infrastructure documentation. It covers architecture diagrams, runbooks, system documentation, operational procedures, and documentation-as-code practices.
## Documentation Philosophy

### Principles

- Living documentation: Keep it in sync with reality
- Audience-aware: Different docs for different readers
- Actionable: Every doc should help someone do something
- Version-controlled: Documentation changes tracked with code

### Document Types
| Type | Audience | Purpose |
|---|---|---|
| Architecture | Engineers | Understand system design |
| Runbooks | Ops/SRE | Handle incidents |
| API Docs | Developers | Integrate with system |
| Onboarding | New hires | Get up to speed |
| Decision Records | Future you | Understand why |
## Architecture Documentation

### System Architecture Overview

````markdown
# System Architecture

## Overview

[Project Name] is a [type] application that [purpose].

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                            Users                            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         Vercel Edge                         │
│     ┌─────────────────┐             ┌─────────────────┐     │
│     │   Next.js App   │             │ Edge Functions  │     │
│     └─────────────────┘             └─────────────────┘     │
└─────────────────────────────────────────────────────────────┘
                               │
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
  │    Supabase     │ │      Redis      │ │     Stripe      │
  │  - PostgreSQL   │ │  - Session      │ │  - Payments     │
  │  - Auth         │ │  - Cache        │ │  - Webhooks     │
  │  - Realtime     │ │                 │ │                 │
  │  - Storage      │ │                 │ │                 │
  └─────────────────┘ └─────────────────┘ └─────────────────┘
```

## Components

### Frontend (Next.js App)

- **Location**: Vercel Edge Network
- **Framework**: Next.js 14 (App Router)
- **Styling**: Tailwind CSS + shadcn/ui
- **State**: Zustand + React Query

### Backend Services

| Service | Provider | Purpose |
|---------|----------|---------|
| Database | Supabase | PostgreSQL with RLS |
| Auth | Supabase Auth | User authentication |
| Storage | Supabase Storage | File uploads |
| Cache | Upstash Redis | Session & API cache |
| Payments | Stripe | Subscriptions |
| Email | Resend | Transactional emails |

### Data Flow

1. User request → Vercel Edge
2. SSR/API Route processes request
3. Database queries via Supabase client
4. Response cached at edge (when applicable)
5. Response returned to user

## Security

### Authentication Flow

1. User signs in via Supabase Auth
2. JWT token issued and stored in cookie
3. Server validates token on each request
4. RLS policies enforce data access

### Data Protection

- All data encrypted at rest (AES-256)
- TLS 1.3 for data in transit
- Secrets stored in Vercel environment
- PII fields encrypted in database
````

### Mermaid Diagrams

````markdown
## Request Flow

```mermaid
sequenceDiagram
    participant U as User
    participant V as Vercel
    participant N as Next.js
    participant S as Supabase
    participant R as Redis

    U->>V: HTTPS Request
    V->>N: Route to App

    alt Cached Response
        N->>R: Check Cache
        R-->>N: Cache Hit
        N-->>U: Return Cached
    else Cache Miss
        N->>S: Query Database
        S-->>N: Data
        N->>R: Store in Cache
        N-->>U: Return Response
    end
```

## Database Schema

```mermaid
erDiagram
    users ||--o{ projects : owns
    users {
        uuid id PK
        text email
        text name
        timestamp created_at
    }
    projects ||--o{ tasks : contains
    projects {
        uuid id PK
        uuid user_id FK
        text name
        text status
    }
    tasks {
        uuid id PK
        uuid project_id FK
        text title
        boolean completed
    }
```
````
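
The cache hit and cache miss branches in the Request Flow diagram follow the cache-aside pattern, and a short code sample next to the diagram can make that explicit for readers. The following is a minimal sketch only: an in-memory dict stands in for Redis, and `CacheAside`, `fake_query`, and the 60-second TTL are hypothetical names and values, not part of the stack described here.

```python
import time
from typing import Callable, Dict, List, Tuple


class CacheAside:
    """Cache-aside read path: check the cache, fall back to the source, then store."""

    def __init__(self, query_database: Callable[[str], str], ttl_seconds: float = 60.0):
        self._query_database = query_database  # hypothetical stand-in for a database query
        self._ttl = ttl_seconds                # assumed TTL, not taken from the diagram
        self._store: Dict[str, Tuple[float, str]] = {}  # in-memory stand-in for Redis
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            self.hits += 1  # "Cached Response" branch
            return entry[1]
        self.misses += 1  # "Cache Miss" branch
        value = self._query_database(key)
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value


calls: List[str] = []


def fake_query(key: str) -> str:
    """Records each call so the demo can show how often the source is queried."""
    calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(fake_query)
first = cache.get("project:1")   # miss: queries the source and stores the result
second = cache.get("project:1")  # hit: served from the cache
```

The second read never reaches `fake_query`, which is the behavior the `alt`/`else` block in the diagram describes.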
## Runbooks

### Runbook Template

````markdown
# Runbook: [Service Name] - [Issue Type]

## Overview

Brief description of the issue and when this runbook applies.

## Severity

- **P1 (Critical)**: Complete outage
- **P2 (High)**: Degraded service
- **P3 (Medium)**: Minor impact
- **P4 (Low)**: No user impact

## Detection

How this issue is typically detected:

- [ ] Alert from [monitoring system]
- [ ] User report
- [ ] Automated check failure

## Impact Assessment

- **Users affected**: All / Segment / None
- **Data at risk**: Yes / No
- **Revenue impact**: High / Medium / Low / None

## Prerequisites

- [ ] Access to [system/dashboard]
- [ ] Credentials for [service]
- [ ] Contact info for [team/person]

## Resolution Steps

### Step 1: Verify the Issue

```bash
# Check service status
curl -I https://api.example.com/health

# Check logs
vercel logs --follow
```

### Step 2: Identify Root Cause

Common causes:

- [ ] Database connection pool exhausted
- [ ] Memory limit reached
- [ ] External service down
- [ ] Bad deployment

### Step 3: Apply Fix

#### If Database Issue:

```sql
-- Check connection count
SELECT count(*) FROM pg_stat_activity;

-- Kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '1 hour';
```

#### If Bad Deployment:

```bash
# Rollback to previous deployment
vercel rollback
```

### Step 4: Verify Fix

```bash
# Check service health
curl https://api.example.com/health

# Monitor error rates for 15 minutes
```

## Escalation

If unable to resolve within 30 minutes:

- Page on-call engineer: [contact]
- Notify stakeholders in #incidents
- Update status page

## Post-Incident

- [ ] Create incident report
- [ ] Schedule post-mortem (P1/P2 only)
- [ ] Update this runbook if needed

## Related Links
````
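
Because the template fixes a set of section headings, a small check can flag runbooks that drift from it. This is a minimal sketch, assuming runbooks are Markdown files that use `##` for the template's top-level sections; `REQUIRED_SECTIONS`, `missing_sections`, and the sample runbook text are hypothetical and for illustration only.

```python
import re
from typing import List

# Section headings copied from the runbook template.
REQUIRED_SECTIONS = [
    "Overview",
    "Severity",
    "Detection",
    "Impact Assessment",
    "Prerequisites",
    "Resolution Steps",
    "Escalation",
    "Post-Incident",
]


def missing_sections(runbook_markdown: str) -> List[str]:
    """Return the required level-2 headings that the runbook does not contain."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", runbook_markdown, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical, deliberately incomplete runbook used only to exercise the check.
sample = """# Runbook: API - Elevated 5xx
## Overview
Applies when the 5xx rate alert fires.
## Severity
P2
## Resolution Steps
### Step 1: Verify the Issue
"""

gaps = missing_sections(sample)
```

Running a check like this in CI keeps new runbooks aligned with the template without anyone reviewing the structure by hand.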
### Database Runbooks

````markdown
# Runbook: Database Performance Issues

## Symptoms

- Slow API responses (>1s)
- Timeout errors in logs
- High database CPU in dashboard

## Quick Checks

### 1. Check Active Connections

```sql
SELECT state, count(*), max(now() - query_start) AS max_duration
FROM pg_stat_activity
GROUP BY state;
```

### 2. Find Long-Running Queries

```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
```

### 3. Check Table Sizes

```sql
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;
```

### 4. Check Missing Indexes

```sql
-- idx_scan is NULL for tables that have no indexes, so treat it as 0
SELECT relname, seq_scan, idx_scan,
       seq_scan - COALESCE(idx_scan, 0) AS difference
FROM pg_stat_user_tables
WHERE seq_scan > COALESCE(idx_scan, 0)
ORDER BY difference DESC;
```

## Resolution

### Kill Problematic Queries

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = [PID_FROM_ABOVE];
```

### Add Missing Index

```sql
CREATE INDEX CONCURRENTLY idx_table_column
ON table_name (column_name);
```
````
## Decision Records (ADRs)

### ADR Template

```markdown
# ADR-001: Choose Supabase for Database

## Status

Accepted

## Context

We need a database solution for [Project Name] that supports:

- PostgreSQL compatibility
- Real-time subscriptions
- Built-in authentication
- Easy local development
- Generous free tier

## Decision

We will use Supabase as our primary database and auth provider.

## Alternatives Considered

### PlanetScale

**Pros:**

- Excellent scaling
- Branching for schema changes
- MySQL compatible

**Cons:**

- No built-in auth
- No real-time subscriptions
- Additional services needed

### Firebase

**Pros:**

- Real-time built-in
- Mature platform
- Good mobile SDKs

**Cons:**

- NoSQL (not ideal for our use case)
- Vendor lock-in concerns
- Complex security rules

## Consequences

### Positive

- Single provider for DB + Auth + Storage
- Great developer experience
- Row Level Security for data protection
- Local development with supabase CLI

### Negative

- PostgreSQL-specific features tie us to provider
- Supabase still maturing (some rough edges)
- Limited to their managed offering

### Risks

- Supabase scaling limitations at high traffic
- Migration cost if we need to move

## References

- [Supabase Documentation](https://supabase.com/docs)
- [Comparison: Supabase vs Firebase](https://...)
```
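
ADRs are numbered sequentially (ADR-001 in the template), so picking the next number is easy to automate. This is a minimal sketch, assuming ADR files use a three-digit prefix followed by a slug, as in `001-database.md`; `next_adr_filename` is a hypothetical helper, not part of any existing tool.

```python
import re
from typing import Iterable


def next_adr_filename(existing: Iterable[str], title: str) -> str:
    """Build the next zero-padded ADR filename from the files already present."""
    numbers = [
        int(match.group(1))
        for name in existing
        if (match := re.match(r"^(\d{3})-", name))  # skip files without a numeric prefix
    ]
    next_number = max(numbers, default=0) + 1
    # Lowercase the title and collapse anything non-alphanumeric into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:03d}-{slug}.md"


filename = next_adr_filename(
    ["001-database.md", "002-hosting.md", "README.md"],
    "Use Redis for Caching",
)
```

Deriving the number from existing files avoids two ADRs claiming the same ID when several people draft decisions in parallel, although a merge conflict can still occur if both land at once.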
## API Documentation

### Endpoint Documentation

````markdown
# API Reference

## Base URL

```
Production: https://api.example.com/v1
Staging: https://staging-api.example.com/v1
```

## Authentication

All API requests require authentication via Bearer token.

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://api.example.com/v1/users/me
```

## Endpoints

### Users

#### Get Current User

```http
GET /users/me
```

Response:

```json
{
  "id": "usr_123",
  "email": "user@example.com",
  "name": "John Doe",
  "created_at": "2024-01-01T00:00:00Z"
}
```

#### Update User

```http
PATCH /users/me
```

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | No | Display name |
| avatar_url | string | No | Profile image URL |

Example:

```bash
curl -X PATCH \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Jane Doe"}' \
  https://api.example.com/v1/users/me
```

## Error Responses

| Status | Code | Description |
|---|---|---|
| 400 | BAD_REQUEST | Invalid request body |
| 401 | UNAUTHORIZED | Missing or invalid token |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |

Error Response Format:

```json
{
  "error": {
    "code": "NOT_FOUND",
    "message": "User not found"
  }
}
```
````
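
API docs are easier to adopt when they show how a client might consume the documented error format. The sketch below is one possible shape, assuming every non-2xx response carries the `error` object shown in the template; `ApiError`, `parse_response`, and the `UNKNOWN` fallback code are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Dict


class ApiError(Exception):
    """Carries the HTTP status plus the code and message from the error envelope."""

    def __init__(self, status: int, code: str, message: str):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def parse_response(status: int, body: str) -> Dict[str, Any]:
    """Return the decoded JSON body, or raise ApiError for non-2xx responses."""
    payload = json.loads(body) if body else {}
    if 200 <= status < 300:
        return payload
    error = payload.get("error", {}) if isinstance(payload, dict) else {}
    # "UNKNOWN" is an assumed fallback for bodies that lack the documented envelope.
    raise ApiError(status, error.get("code", "UNKNOWN"), error.get("message", ""))


user = parse_response(200, '{"id": "usr_123", "email": "user@example.com"}')

try:
    parse_response(404, '{"error": {"code": "NOT_FOUND", "message": "User not found"}}')
except ApiError as exc:
    caught = exc
```

Keeping transport out of the function means the same parsing logic can sit behind any HTTP client.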
## Environment Documentation

### Environment Matrix

````markdown
# Environments

## Overview

| Environment | URL | Purpose | Deploy |
|-------------|-----|---------|--------|
| Production | https://myapp.com | Live users | Manual (main) |
| Staging | https://staging.myapp.com | Pre-release testing | Auto (main) |
| Preview | https://pr-*.vercel.app | PR review | Auto (PR) |
| Development | http://localhost:3000 | Local dev | Manual |

## Configuration

### Production

```env
NODE_ENV=production
DATABASE_URL=[Supabase Production]
NEXT_PUBLIC_APP_URL=https://myapp.com
```

### Staging

```env
NODE_ENV=production
DATABASE_URL=[Supabase Staging Branch]
NEXT_PUBLIC_APP_URL=https://staging.myapp.com
```

### Development

```env
NODE_ENV=development
DATABASE_URL=[Local Supabase]
NEXT_PUBLIC_APP_URL=http://localhost:3000
```

## Access

### Production

- Vercel: Admin only
- Database: Read-only for devs, write for admin
- Logs: All engineers

### Staging

- Vercel: All engineers
- Database: All engineers
- Logs: All engineers

## Secrets Rotation

| Secret | Rotation | Last Rotated |
|---|---|---|
| Database password | 90 days | 2024-01-15 |
| API keys | 90 days | 2024-01-15 |
| JWT secret | Never | Initial setup |
````
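
A rotation table only helps if someone notices when a date has passed. The following is a rough sketch of how the table's data could drive a reminder, assuming the rows are available as structured data; `Secret`, `overdue`, and the placeholder date for the JWT secret (the table only says "Initial setup") are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import List, NamedTuple, Optional


class Secret(NamedTuple):
    name: str
    rotation_days: Optional[int]  # None represents "Never" in the table
    last_rotated: date


def overdue(secrets: List[Secret], today: date) -> List[str]:
    """Names of secrets whose rotation window has fully elapsed as of `today`."""
    return [
        secret.name
        for secret in secrets
        if secret.rotation_days is not None
        and today > secret.last_rotated + timedelta(days=secret.rotation_days)
    ]


secrets = [
    Secret("Database password", 90, date(2024, 1, 15)),
    Secret("API keys", 90, date(2024, 1, 15)),
    # Placeholder date: the real "Initial setup" date is not given in the table.
    Secret("JWT secret", None, date(2024, 1, 1)),
]

due_on_boundary = overdue(secrets, today=date(2024, 4, 14))  # exactly 90 days later
due_next_day = overdue(secrets, today=date(2024, 4, 15))
```

Passing `today` explicitly keeps the function deterministic, which makes it straightforward to test and to run from a scheduled job.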
## Documentation-as-Code

### Documentation Structure

```
docs/
├── README.md                 # Documentation index
├── architecture/
│   ├── overview.md           # System architecture
│   ├── data-flow.md          # Data flow diagrams
│   └── decisions/            # ADRs
│       ├── 001-database.md
│       └── 002-hosting.md
├── runbooks/
│   ├── README.md             # Runbook index
│   ├── database.md           # Database issues
│   ├── deployment.md         # Deployment issues
│   └── outage.md             # Service outage
├── api/
│   └── reference.md          # API documentation
└── onboarding/
    ├── setup.md              # Local setup
    └── contributing.md       # How to contribute
```

### Auto-Generated Documentation

```yaml
# .github/workflows/docs.yml
name: Generate Docs

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'docs/**'

# Assumption: the deploy step pushes to a publishing branch, which needs write access
# when the repository's default GITHUB_TOKEN permissions are read-only.
permissions:
  contents: write

jobs:
  generate-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Generate API docs from OpenAPI
        run: |
          npx @redocly/cli build-docs openapi.yaml \
            --output docs/api/index.html

      - name: Generate TypeDoc
        run: npx typedoc --out docs/api/typescript

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs
```
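
Alongside generated docs, a pipeline can also confirm that the hand-written files from the documentation structure still exist. This is a minimal sketch, assuming the layout shown in the structure listing; `EXPECTED_DOCS` and `missing_docs` are hypothetical names, and the demo runs against a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path
from typing import List

# Paths copied from the documentation structure listing, relative to docs/.
EXPECTED_DOCS = [
    "README.md",
    "architecture/overview.md",
    "architecture/data-flow.md",
    "runbooks/README.md",
    "runbooks/database.md",
    "runbooks/deployment.md",
    "runbooks/outage.md",
    "api/reference.md",
    "onboarding/setup.md",
    "onboarding/contributing.md",
]


def missing_docs(docs_root: Path) -> List[str]:
    """Return the expected files that are absent under docs_root."""
    return [rel for rel in EXPECTED_DOCS if not (docs_root / rel).is_file()]


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "runbooks").mkdir()
    (root / "README.md").write_text("# Docs\n")
    (root / "runbooks" / "README.md").write_text("# Runbooks\n")
    missing = missing_docs(root)
```

In a real workflow the script would point at the repository's `docs/` directory and exit non-zero when the list is not empty.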
## Documentation Checklist

### Architecture Docs

- [ ] System overview diagram
- [ ] Component descriptions
- [ ] Data flow documentation
- [ ] Security architecture
- [ ] Technology decisions (ADRs)

### Operational Docs

- [ ] Runbooks for common issues
- [ ] Deployment procedures
- [ ] Monitoring and alerting
- [ ] Incident response plan
- [ ] On-call procedures

### Developer Docs

- [ ] Local setup guide
- [ ] API reference
- [ ] Contributing guidelines
- [ ] Code conventions
- [ ] Testing guide

### Maintenance

- [ ] Documentation review schedule
- [ ] Ownership assigned
- [ ] Change process defined
- [ ] Versioning strategy
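
A review schedule is easier to keep when stale documents surface automatically. The sketch below shows one way to do that, assuming each document's last review date is tracked somewhere (front matter, a spreadsheet, or commit history); `stale_docs`, the 90-day default, and the sample dates are all hypothetical.

```python
from datetime import date
from typing import Dict, List


def stale_docs(last_reviewed: Dict[str, date], today: date, max_age_days: int = 90) -> List[str]:
    """Return doc paths whose last review is older than max_age_days, oldest first."""
    stale = [
        ((today - reviewed).days, path)
        for path, reviewed in last_reviewed.items()
        if (today - reviewed).days > max_age_days
    ]
    # Sort by age descending so the most overdue document is listed first.
    return [path for _, path in sorted(stale, reverse=True)]


# Hypothetical review dates used only to exercise the function.
reviews = {
    "docs/runbooks/database.md": date(2024, 1, 10),
    "docs/architecture/overview.md": date(2023, 9, 1),
    "docs/api/reference.md": date(2024, 5, 20),
}

needs_review = stale_docs(reviews, today=date(2024, 6, 1))
```

The output can feed a recurring ticket or a chat reminder addressed to each document's assigned owner.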
## When to Use This Skill

Invoke this skill when:

- Creating architecture documentation
- Writing runbooks for operations
- Documenting decision rationale (ADRs)
- Setting up documentation structure
- Creating onboarding materials
- Building automated documentation
- Planning incident response procedures