well-architected

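A six-pillar architecture excellence framework covering operational excellence, security, reliability, performance, cost optimization, and developer experience.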

SKILL.md
--- frontmatter
name: well-architected
description: Six-pillar well-architected framework for evaluating architecture quality — operational excellence, security, reliability, performance, cost optimization, and developer experience

Well-Architected Framework

A structured framework for evaluating architecture quality across six pillars. Inspired by the AWS Well-Architected Framework, adapted for startups and modern product development. Technology-agnostic — applies to any stack.

Use this skill when generating the Well-Architected Review deliverable in /architect:blueprint or when running /architect:well-architected as a standalone evaluation.


The Six Pillars

1. Operational Excellence

Question: Can you deploy, monitor, and improve the system with confidence?

| Criteria | What Good Looks Like | Score Guide |
|----------|----------------------|-------------|
| CI/CD pipeline | Automated lint → test → build → deploy on every push | 1 = manual deploys, 5 = full CI/CD with rollback |
| Infrastructure as Code | Deployment config in repo (Vercel config, Dockerfile, Terraform) | 1 = manual setup, 5 = fully reproducible |
| Observability | Structured logs + error tracking + health checks | 1 = console.log only, 5 = full observability stack |
| Incident response | Alerts → runbook → mitigation → post-mortem process | 1 = no plan, 5 = documented runbooks and on-call |
| Change management | Feature flags, staged rollouts, database migrations versioned | 1 = YOLO deploys, 5 = staged rollouts with flags |

Key questions to ask:

  • How do you deploy a change today? How long does it take?
  • When something breaks at 2am, how do you find out? How do you fix it?
  • Can a new developer deploy to staging on their first day?
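
As a rough illustration of the Observability criterion, the sketch below emits one JSON object per log line and aggregates dependency probes into a health report. It uses only the Python standard library; the `correlation_id` field, the `health_check` helper, and the probe callables are hypothetical names chosen for this example, not part of any particular stack.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one per line)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: callers attach it with extra={"correlation_id": ...}.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def health_check(probes: dict) -> dict:
    """Run each dependency probe; any failure or exception marks the system degraded."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            results[name] = "fail"
    status = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": status, "dependencies": results}


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("order created", extra={"correlation_id": "req-123"})
    print(health_check({"database": lambda: True}))
```

A setup like this would sit around the middle of the score guide: better than console output alone, but still short of a full observability stack with error tracking and alerting.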

2. Security

Question: Is user data protected, and are attack surfaces minimized?

| Criteria | What Good Looks Like | Score Guide |
|----------|----------------------|-------------|
| Authentication | Managed auth provider, MFA for admin, token rotation | 1 = DIY password hashing, 5 = managed auth + MFA |
| Authorization | Role-based or attribute-based access control on every endpoint | 1 = no authorization checks, 5 = RBAC/ABAC middleware |
| Data protection | Encryption at rest and in transit, PII identified and handled | 1 = plaintext everywhere, 5 = encrypted + PII policies |
| API security | Rate limiting, input validation, CORS, security headers | 1 = none, 5 = all OWASP Top 10 mitigated |
| Secrets management | No hardcoded secrets, env vars or secrets manager, rotation policy | 1 = secrets in code, 5 = secrets manager + rotation |
| Dependency security | Automated vulnerability scanning in CI | 1 = never audited, 5 = automated audit + auto-fix |

Key questions to ask:

  • What happens if an API key leaks? How fast can you rotate it?
  • Can a regular user access admin endpoints by guessing the URL?
  • Are you storing anything you shouldn't be? (passwords, full card numbers, unnecessary PII)
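
To make the rate-limiting part of the API security criterion concrete, here is a minimal fixed-window limiter. It is a sketch under simplifying assumptions: state lives in process memory (a real deployment would more likely use a shared store), and the class name, the `check` method, and the limit values are all illustrative.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per key in each `window_seconds` window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._windows: Dict[str, Tuple[float, int]] = {}  # key -> (window start, count)

    def check(self, key: str) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request from `key`."""
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            # The caller would answer HTTP 429 with this value in Retry-After.
            return False, self.window - (now - start)
        self._windows[key] = (start, count + 1)
        return True, 0.0


if __name__ == "__main__":
    limiter = FixedWindowRateLimiter(limit=3, window_seconds=60)
    for _ in range(4):
        print(limiter.check("203.0.113.7"))
```

Fixed windows are the simplest variant and allow short bursts at window boundaries; sliding-window or token-bucket limiters smooth that out at the cost of more bookkeeping.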

3. Reliability

Question: Does the system keep working when things go wrong?

| Criteria | What Good Looks Like | Score Guide |
|----------|----------------------|-------------|
| Fault isolation | One service failing doesn't cascade to others | 1 = monolith with no error boundaries, 5 = circuit breakers + fallbacks |
| Recovery | Automated restarts, health checks, self-healing | 1 = manual restart, 5 = auto-restart + health checks + failover |
| Data durability | Automated backups, point-in-time recovery, tested restores | 1 = no backups, 5 = automated backups + tested restores |
| Retry & timeout | Retries with exponential backoff, timeouts on all external calls | 1 = no retries/timeouts, 5 = retry policies on all external calls |
| Graceful degradation | System works (reduced functionality) when a dependency is down | 1 = hard crash, 5 = graceful fallback for each dependency |
| Scaling | Handles 10x current load without architecture changes | 1 = breaks at 2x, 5 = auto-scaling with no code changes |

Key questions to ask:

  • What happens when the database goes down for 5 minutes?
  • What happens when a third-party API (Stripe, SendGrid) is slow or down?
  • Can the system handle a sudden traffic spike (e.g., the Hacker News front page)?
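
The Retry & timeout criterion can be sketched as a small wrapper that retries a call with exponentially growing, jittered delays. The function name, parameter defaults, and the set of retryable exceptions are assumptions for illustration; the wrapped callable is expected to enforce its own timeout and raise `TimeoutError` when it is exceeded.

```python
import random
import time


def call_with_retries(func, *, attempts=4, base_delay=0.5, max_delay=8.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so many clients don't retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_dependency():
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("upstream too slow")
        return "ok"

    print(call_with_retries(flaky_dependency, base_delay=0.01))
```

Only retry operations that are safe to repeat; for non-idempotent calls such as payments, an idempotency key or a different recovery strategy is usually needed.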

4. Performance Efficiency

Question: Are resources used efficiently, and is the user experience fast?

| Criteria | What Good Looks Like | Score Guide |
|----------|----------------------|-------------|
| Response time | API p95 < 500ms, page load < 2s, AI response < 5s | 1 = > 3s average, 5 = p95 < 200ms |
| Right-sizing | Resources match actual load (not over/under-provisioned) | 1 = fixed large instances, 5 = auto-scaled to demand |
| Caching | Appropriate caching at each layer (CDN, API, database) | 1 = no caching, 5 = multi-layer caching strategy |
| Async processing | Heavy work offloaded to background jobs / queues | 1 = everything synchronous, 5 = async where appropriate |
| Database efficiency | Indexed queries, connection pooling, no N+1 problems | 1 = unoptimized queries, 5 = optimized + monitored |
| Frontend performance | Code splitting, lazy loading, optimized images, CDN | 1 = single bundle, no CDN, 5 = optimized + CDN + edge |

Key questions to ask:

  • What's the slowest user-facing operation? Can it be made async?
  • Are you paying for compute that's idle 90% of the time?
  • Is there a caching layer, or does every request hit the database?
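
The response-time targets are stated as percentiles, so it helps to be explicit about how a p95 is computed. The sketch below uses the nearest-rank definition, which is one of several common conventions; the function names and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_p95_target(latencies_ms, target_ms=500):
    """True when the p95 latency is below the target (500 ms is the API target in the table)."""
    return percentile(latencies_ms, 95) < target_ms


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds: mostly fast, one slow outlier.
    latencies = [120] * 19 + [900]
    print(percentile(latencies, 95), meets_p95_target(latencies))
```

Averages hide slow outliers, which is why the table scores p95 rather than mean latency: the single 900 ms request above barely moves the mean but would dominate a p99.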

5. Cost Optimization

Question: Are you spending the minimum needed for the current scale?

| Criteria | What Good Looks Like | Score Guide |
|----------|----------------------|-------------|
| Free tier usage | Maximizing free tiers before paying | 1 = paying for everything, 5 = optimized free tier usage |
| Right-sizing | Resources match actual usage, not theoretical maximum | 1 = over-provisioned, 5 = auto-scaled or right-sized |
| Cost awareness | Team knows monthly cost breakdown, alerts on overspend | 1 = no idea of costs, 5 = cost dashboards + budget alerts |
| Service selection | Chosen services fit the scale (not enterprise tools for MVP) | 1 = enterprise tools for 10 users, 5 = appropriate for scale |
| LLM cost control | Token optimization, caching, model selection, rate limits | 1 = GPT-4 for everything, 5 = tiered models + prompt caching |
| Scaling economics | Costs scale sub-linearly with users | 1 = linear cost scaling, 5 = strong economies of scale |

Key questions to ask:

  • What's your monthly cloud bill? Do you know what each line item is?
  • Are you using the cheapest option that meets your requirements?
  • When you 10x users, does cost 10x too, or less?
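
One way to put a number on the last question is to estimate a scaling exponent from two cost observations. This is a back-of-the-envelope sketch that assumes cost grows roughly as a power of user count; the function name and the dollar figures are hypothetical.

```python
import math


def scaling_exponent(users_before, cost_before, users_after, cost_after):
    """Exponent k in cost ~ users**k. k < 1 is sub-linear, k = 1 linear, k > 1 super-linear."""
    if users_after <= users_before or cost_before <= 0 or cost_after <= 0:
        raise ValueError("need growing user counts and positive costs")
    return math.log(cost_after / cost_before) / math.log(users_after / users_before)


if __name__ == "__main__":
    # Hypothetical: users grow 10x (1,000 to 10,000) while the monthly bill grows 4x.
    k = scaling_exponent(1_000, 200.0, 10_000, 800.0)
    print(f"k = {k:.2f}")  # below 1, so costs scale sub-linearly in this example
```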

6. Developer Experience

Question: Can developers build, test, and ship features quickly and confidently?

| Criteria | What Good Looks Like | Score Guide |
|----------|----------------------|-------------|
| Onboarding | New dev productive in < 1 day (README, setup scripts, seed data) | 1 = tribal knowledge, 5 = automated setup + docs |
| Local development | `docker compose up` or `npm run dev` and everything works | 1 = complex manual setup, 5 = one-command startup |
| Type safety | TypeScript/Python type hints throughout, shared types across services | 1 = untyped, 5 = strict types + shared type packages |
| Testing | Fast unit tests, meaningful integration tests, CI runs all tests | 1 = no tests, 5 = comprehensive test suite < 5 min |
| Code organization | Clear folder structure, consistent patterns, separation of concerns | 1 = spaghetti code, 5 = clear architecture + conventions |
| Documentation | API docs (OpenAPI), architecture docs, runbooks | 1 = no docs, 5 = auto-generated API docs + architecture docs |

Key questions to ask:

  • How long does it take a new developer to submit their first PR?
  • Can you run the entire system locally without cloud credentials?
  • Is there a style guide, or do code reviews have to catch inconsistencies?

Scoring Methodology

Per-Pillar Scoring

Each pillar is scored 1-5 by averaging its criteria scores and rounding the average to a whole number:

| Score | Label | Meaning |
|:-----:|-------|---------|
| 1 | Critical | Fundamental gaps. Address immediately before building further. |
| 2 | Needs Work | Significant gaps. Plan to address in next sprint/milestone. |
| 3 | Adequate | Meets minimum bar. Acceptable for MVP, improve for production. |
| 4 | Good | Solid implementation. Minor improvements possible. |
| 5 | Excellent | Best practices followed. Ready for production scale. |
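
The per-pillar calculation can be sketched as follows. The framework does not say how to round an average that falls exactly between two scores, so this example assumes round-half-up; the function name and the sample criterion scores are illustrative.

```python
LABELS = {1: "Critical", 2: "Needs Work", 3: "Adequate", 4: "Good", 5: "Excellent"}


def pillar_score(criteria_scores):
    """Return (average to one decimal, whole-number score, label) for one pillar."""
    scores = list(criteria_scores.values())
    if not scores or any(not 1 <= s <= 5 for s in scores):
        raise ValueError("each criterion needs a score from 1 to 5")
    average = sum(scores) / len(scores)
    rounded = int(average + 0.5)  # assumption: round half up (2.5 becomes 3)
    return round(average, 1), rounded, LABELS[rounded]


if __name__ == "__main__":
    # Hypothetical scores for the six Reliability criteria.
    reliability = {
        "Fault isolation": 3,
        "Recovery": 3,
        "Data durability": 2,
        "Retry & timeout": 2,
        "Graceful degradation": 3,
        "Scaling": 4,
    }
    print(pillar_score(reliability))
```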

Overall Architecture Rating

The overall rating is the average of all six pillar scores, rounded to one decimal place:

| Overall | Rating | Interpretation |
|---------|--------|----------------|
| 1.0 - 2.0 | Fragile | Architecture has critical gaps. Not production-ready. |
| 2.1 - 3.0 | Developing | Functional but risky. Acceptable for early MVP with a plan to improve. |
| 3.1 - 3.5 | Solid | Good foundation. Typical for well-planned MVP or early startup. |
| 3.6 - 4.0 | Strong | Production-quality. Ready for real users and growth. |
| 4.1 - 5.0 | Exemplary | Mature architecture. Enterprise-ready. Rare for early-stage products. |
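
A minimal sketch of the overall rating lookup, assuming the average is rounded to one decimal place before it is matched against the bands above. The function name is illustrative.

```python
# Upper bound of each band from the rating table, in ascending order.
RATING_BANDS = [
    (2.0, "Fragile"),
    (3.0, "Developing"),
    (3.5, "Solid"),
    (4.0, "Strong"),
    (5.0, "Exemplary"),
]


def overall_rating(pillar_scores):
    """Average the pillar scores, round to one decimal, and look up the rating band."""
    scores = list(pillar_scores.values())
    if not scores or any(not 1 <= s <= 5 for s in scores):
        raise ValueError("each pillar needs a score from 1 to 5")
    overall = round(sum(scores) / len(scores), 1)
    for upper_bound, rating in RATING_BANDS:
        if overall <= upper_bound:
            return overall, rating
    raise AssertionError("unreachable: the bands cover the whole 1.0-5.0 range")


if __name__ == "__main__":
    scores = {
        "Operational Excellence": 4,
        "Security": 3,
        "Reliability": 2,
        "Performance Efficiency": 4,
        "Cost Optimization": 5,
        "Developer Experience": 3,
    }
    print(overall_rating(scores))
```

Rounding before the lookup matters because the bands are listed to one decimal: an unrounded 3.55 would otherwise fall between "Solid" and "Strong".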

Stage-Appropriate Expectations

Not every project needs a 5/5 on every pillar. Set expectations by stage:

| Stage | Target Score | Acceptable Lows |
|-------|--------------|-----------------|
| Proof of concept | 2.0 overall | Security 2, Reliability 1, DevEx 2 |
| MVP | 2.5 - 3.0 | Reliability 2, Performance 2 |
| Early product (paying users) | 3.0 - 3.5 | Performance 3, Cost 2 |
| Growth stage | 3.5 - 4.0 | None below 3 |
| Production / enterprise | 4.0+ | None below 4 |
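
The stage table can be turned into a simple check. This sketch makes two assumptions the table does not spell out: a pillar named under "Acceptable Lows" may score as low as the listed value but no lower, and pillars that are not named have no floor of their own at that stage. The short pillar names are expanded to their full titles, and `assess_stage` is a hypothetical helper.

```python
# stage -> (target low, target high, floors for named pillars, floor for every pillar)
STAGES = {
    "Proof of concept": (2.0, 2.0, {"Security": 2, "Reliability": 1, "Developer Experience": 2}, None),
    "MVP": (2.5, 3.0, {"Reliability": 2, "Performance Efficiency": 2}, None),
    "Early product": (3.0, 3.5, {"Performance Efficiency": 3, "Cost Optimization": 2}, None),
    "Growth stage": (3.5, 4.0, {}, 3),
    "Production / enterprise": (4.0, 5.0, {}, 4),
}


def assess_stage(stage, overall, pillar_scores):
    """Return ("Above" | "At" | "Below", sorted list of pillars under their floor)."""
    low, high, floors, global_floor = STAGES[stage]
    if overall < low:
        verdict = "Below"
    elif overall > high:
        verdict = "Above"
    else:
        verdict = "At"
    below_floor = sorted(
        name for name, score in pillar_scores.items()
        if score < floors.get(name, global_floor or 0)
    )
    return verdict, below_floor


if __name__ == "__main__":
    scores = {"Security": 3, "Reliability": 2, "Performance Efficiency": 4}
    print(assess_stage("Growth stage", 3.5, scores))
```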

Output Format

IMPORTANT: Be comprehensive and thorough in all sections.

1. Visual Score Summary (REQUIRED)

Visualize pillar scores as a horizontal bar table:

code
Operational Excellence  ████░  4/5 — Good
Security               ███░░  3/5 — Adequate
Reliability            ██░░░  2/5 — Needs Work
Performance Efficiency ████░  4/5 — Good
Cost Optimization      █████  5/5 — Excellent
Developer Experience   ███░░  3/5 — Adequate

Overall: 3.5/5 — Solid
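
A bar like the ones above can be generated from the scores directly. This is a small sketch; the `render_bar` name and the 22-character name column are choices made for this example.

```python
EM_DASH = "\u2014"  # separator used between the score and the label in the sample output


def render_bar(name: str, score: int, label: str, width: int = 22) -> str:
    """Render one pillar as a five-cell bar with `score` filled cells."""
    if not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    bar = "█" * score + "░" * (5 - score)
    return f"{name:<{width}} {bar}  {score}/5 {EM_DASH} {label}"


if __name__ == "__main__":
    rows = [
        ("Operational Excellence", 4, "Good"),
        ("Security", 3, "Adequate"),
        ("Reliability", 2, "Needs Work"),
    ]
    for name, score, label in rows:
        print(render_bar(name, score, label))
```

Padding every name to the same width keeps the bars aligned in a monospaced block.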

Score table (also provide as markdown table):

| Pillar | Score | Label | Key Strength | Critical Gap |
|--------|:-----:|-------|--------------|--------------|
| Operational Excellence | X/5 | [Label] | [1 sentence] | [1 sentence or "None"] |
| Security | X/5 | [Label] | [1 sentence] | [1 sentence or "None"] |
| Reliability | X/5 | [Label] | [1 sentence] | [1 sentence or "None"] |
| Performance Efficiency | X/5 | [Label] | [1 sentence] | [1 sentence or "None"] |
| Cost Optimization | X/5 | [Label] | [1 sentence] | [1 sentence or "None"] |
| Developer Experience | X/5 | [Label] | [1 sentence] | [1 sentence or "None"] |
| Overall | X.X/5 | [Rating] | | |

2. Per-Pillar Detail (MUST be thorough for each pillar)

For each of the 6 pillars, provide comprehensive analysis:

Format:

markdown
## [Pillar Name]: X/5 — [Label]

### Score Breakdown

Evaluated against [number] criteria:

| Criteria | Score | Notes |
|----------|:-----:|-------|
| [Criterion 1] | X/5 | [1 sentence why this score] |
| [Criterion 2] | X/5 | [1 sentence why this score] |
| [Criterion 3] | X/5 | [1 sentence why this score] |
| ... | ... | ... |

**Average: X.X/5 → X/5 ([Label])**

### Strengths (minimum 2-4 bullets)

✅ **[Specific strength]**
   - What: [Describe what's implemented]
   - Why it matters: [Business/technical impact]
   - Reference: [Cite specific architecture decision from manifest]

✅ **[Specific strength]**
   - What: [Describe what's implemented]
   - Why it matters: [Business/technical impact]
   - Reference: [Cite specific architecture decision from manifest]

[Continue for all major strengths...]

### Gaps (minimum 2-4 bullets if score < 5)

⚠️ **[Specific gap]**
   - What's missing: [Describe the missing capability]
   - Risk if not addressed: [Specific risk with impact estimation]
   - Severity: [Critical/High/Medium/Low]

⚠️ **[Specific gap]**
   - What's missing: [Describe the missing capability]
   - Risk if not addressed: [Specific risk with impact estimation]
   - Severity: [Critical/High/Medium/Low]

[Continue for all significant gaps...]

### Recommendations (minimum 3-5 specific actions)

Each recommendation must be:
- **Actionable**: Specific enough to implement immediately
- **Measured**: Include effort estimate and impact level
- **Prioritized**: Show order of implementation

**Format for each recommendation:**

**Recommendation #X: [Short action-oriented title]** — [Impact: High/Medium/Low] — [Effort: X hours/days/weeks]

What to do:
[2-3 sentences describing specific implementation steps]

Why it matters:
[1-2 sentences on business/technical impact]

Implementation notes:
- Tool/service to use: [Specific recommendation]
- Code location: [Where to implement, if applicable]
- Dependencies: [What must be done first]
- Definition of done: [How to verify it's complete]

Cost impact: $[amount]/month or [one-time cost]
Timeline: [When to implement — Pre-launch / Month 1 / Quarter 1 / Future]

**Example full recommendation:**

**Recommendation #1: Implement rate limiting on all API endpoints** — Impact: High — Effort: 4-6 hours

What to do:
Add rate limiting middleware using the @upstash/ratelimit package. Set a default limit of 100 requests per minute per IP, with stricter limits (10 req/min) on authentication endpoints and looser limits (1000 req/min) for authenticated users. Return a 429 status with a Retry-After header when the limit is exceeded.

Why it matters:
Without rate limiting, a single malicious user or misconfigured client can overwhelm the API, causing downtime for all users. This is especially critical for authentication endpoints, which are common targets for brute-force and credential-stuffing attacks. Rate limiting is a pre-launch requirement for production deployment.

Implementation notes:
- Tool/service to use: @upstash/ratelimit with Vercel KV (free tier: 10K requests/day)
- Code location: src/middleware/rateLimit.ts, apply in src/app/api/*/route.ts
- Dependencies: Set up Vercel KV store (5 minute setup)
- Definition of done: Rate limiting active on all endpoints, returns 429 when exceeded, logged in monitoring

Cost impact: $0/month (Vercel KV free tier sufficient for 10K users)
Timeline: Pre-launch (P0 — blocks production deployment)

### Stage-Appropriate Assessment

This architecture is at: **[Stage name]** stage

Expected score range for this stage: **X.X - Y.Y**
Actual score: **Z.Z**
Assessment: **[Above/At/Below]** expectations for this stage

[If below expectations:]
Critical gaps for this stage:
- [Gap 1 that's unacceptable for current stage]
- [Gap 2 that's unacceptable for current stage]
Must address before [milestone/launch].

[If at/above expectations:]
Well-positioned for [next stage]. Consider improving [pillar names] before scaling to [user count/revenue level].

3. Improvement Roadmap (at least 8 items, typically 8-12)

Provide comprehensive prioritized roadmap with P0-P3 priority levels:

| Priority | Pillar | Action | Effort | Impact | Stage | Cost |
|----------|--------|--------|--------|--------|-------|------|
| P0 | Security | Add rate limiting to all API endpoints | 4-6 hours | High | Pre-launch | $0 |
| P0 | Reliability | Implement health checks with dependency verification | 3 hours | High | Pre-launch | $0 |
| P0 | Security | Set up automated dependency scanning in CI | 2 hours | High | Pre-launch | $0 |
| P1 | Operational Excellence | Add structured logging with correlation IDs | 1 day | Medium | Month 1 | $26/mo (Sentry) |
| P1 | Reliability | Set up database backups with tested restore process | 4 hours | High | Month 1 | $0 (included) |
| P1 | Security | Implement secrets rotation for API keys | 2 days | Medium | Month 1 | $0 |
| P2 | Performance Efficiency | Add Redis caching layer for hot queries | 2-3 days | Medium | Quarter 1 | $7/mo |
| P2 | Operational Excellence | Implement feature flags for gradual rollouts | 1 day | Low | Quarter 1 | $0 (self-hosted) |
| P2 | Developer Experience | Add OpenAPI schema generation for API docs | 1 day | Low | Quarter 1 | $0 |
| P3 | Performance Efficiency | Implement database query optimization and indexing review | 3-5 days | Medium | Future | $0 |
| P3 | Reliability | Add chaos engineering / failure injection testing | 1 week | Low | Future | $0 |
| P3 | Developer Experience | Set up comprehensive E2E test suite | 1-2 weeks | Medium | Future | $0 |

Priority Definitions:

  • P0 (Must-have before launch): Blocks production deployment. Critical security, reliability, or operational gaps. Complete in current sprint.
  • P1 (Should-have in first month): Important for stability and user trust. Complete within 30 days of launch.
  • P2 (Nice-to-have in first quarter): Improves experience and reduces operational burden. Complete within 90 days.
  • P3 (Future enhancement): Optimization or nice-to-have. Evaluate after product-market fit.

Cost Summary:

  • Pre-launch (P0): $X/month + Y hours labor
  • Month 1 (P0+P1): $X/month + Y hours labor
  • Quarter 1 (P0+P1+P2): $X/month + Y hours labor
  • Full roadmap (All): $X/month + Y hours labor
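
The cumulative totals in the cost summary can be derived from the roadmap rows. The sketch below assumes each row has been reduced to a priority, a monthly cost in dollars, and an effort in hours; converting ranges such as "4-6 hours" to a single number (here the midpoint) and days to hours (here 8 per day) is a judgment call, not something the framework prescribes.

```python
PRIORITY_ORDER = ["P0", "P1", "P2", "P3"]


def cumulative_summary(items):
    """items: iterable of (priority, monthly_cost_usd, effort_hours).

    Returns {priority: (cumulative monthly cost, cumulative hours)}, where each
    level includes every higher-priority level (P1 covers P0 + P1, and so on).
    """
    totals = {p: [0.0, 0.0] for p in PRIORITY_ORDER}
    for priority, cost, hours in items:
        totals[priority][0] += cost
        totals[priority][1] += hours
    summary = {}
    running_cost = running_hours = 0.0
    for p in PRIORITY_ORDER:
        running_cost += totals[p][0]
        running_hours += totals[p][1]
        summary[p] = (running_cost, running_hours)
    return summary


if __name__ == "__main__":
    # The P0 and first two P1 rows of the example roadmap.
    roadmap = [("P0", 0, 5), ("P0", 0, 3), ("P0", 0, 2), ("P1", 26, 8), ("P1", 0, 4)]
    for priority, (cost, hours) in cumulative_summary(roadmap).items():
        print(f"{priority}: ${cost:.0f}/month + {hours:.0f} hours labor")
```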

Timeline Visualization:

code
Pre-Launch (Week 0):
├─ P0 items (total: X hours)
└─ Must complete before production deployment

Month 1 (Weeks 1-4):
├─ P1 items (total: X hours)
└─ Critical for stability

Quarter 1 (Weeks 5-12):
├─ P2 items (total: X hours)
└─ Improves operational efficiency

Future (Month 4+):
├─ P3 items (total: X hours)
└─ Evaluate based on growth

4. Quick Wins (REQUIRED if any exist)

Identify 3-5 high-impact, low-effort improvements that can be done in <1 day each:

Format:

markdown
Quick Win #X: [Short title] — [Pillar Name]

What: [1-2 sentences on what to implement]
Effort: [X hours]
Impact: [High/Medium impact on pillar score]
How: [3-5 step implementation checklist]
Cost: $[amount or $0]

ROI: [Pillar score improvement: X/5 → Y/5, or specific metric improvement]

Example:

markdown
Quick Win #1: Add security headers to API responses — Security

What: Configure Next.js security headers (CSP, X-Frame-Options, HSTS, etc.) in next.config.js to protect against common web vulnerabilities.
Effort: 30 minutes
Impact: High (improves Security pillar from 3/5 to 3.5/5)
How:
1. Add headers configuration to next.config.js
2. Test with securityheaders.com
3. Verify CSP doesn't break any functionality
4. Deploy to staging and production

Cost: $0

ROI: Security pillar: 3/5 → 3.5/5, protects against XSS and clickjacking with minimal effort

5. Critical Blockers (REQUIRED if any exist)

If any pillar scores 1/5 or has critical gaps, call them out explicitly:

markdown
🚨 CRITICAL BLOCKER: [Issue name]

Pillar: [Name]
Current state: [What's broken or missing]
Risk: [What bad thing will happen]
Probability: [High/Medium/Low chance of occurrence]
Impact: [Severity if it occurs — data loss, security breach, downtime, etc.]

This blocks: [Production launch / Scaling / User trust / Compliance]

Required action: [Specific fix needed]
Effort: [Realistic time estimate]
Must complete by: [Deadline or stage gate]
Owner: [Who should do this — role/skill level]

Example:

markdown
🚨 CRITICAL BLOCKER: Database backups have no tested restore process

Pillar: Reliability
Current state: Supabase project has default backups (daily, 7-day retention) but no tested restore process, so there is no verified way to recover from accidental data deletion or corruption.
Risk: Single developer mistake (DROP TABLE, bad migration) or Supabase issue could cause permanent data loss for all users.
Probability: Medium (bad migrations and accidental deletions are a common failure mode for small teams)
Impact: Catastrophic — lose all user data, company trust, potential legal liability

This blocks: Production launch with real users

Required action:
1. Enable Supabase point-in-time recovery (PITR) — provides 7-day recovery window
2. Set up daily automated backup export to S3 (in addition to Supabase backups)
3. Document and TEST restore procedure (actually restore a backup to verify it works)
4. Add backup monitoring (alert if backup fails)

Effort: 4-6 hours (2 hours setup + 2 hours testing + 1 hour documentation)
Must complete by: Before production launch (P0)
Owner: Backend developer or DevOps engineer

6. Scoring Transparency (ALWAYS show)

Show how the overall score was calculated:

code
Well-Architected Score Calculation:

Operational Excellence: X/5
Security: X/5
Reliability: X/5
Performance Efficiency: X/5
Cost Optimization: X/5
Developer Experience: X/5
                      ─────
Total: XX/30
Average: XX ÷ 6 = X.XX → X.X/5

Overall Rating: X.X/5 — [Rating Label]

Interpretation based on stage:

code
Stage: [Proof of Concept / MVP / Early Product / Growth / Enterprise]
Expected score range: [X.X - Y.Y]
Actual score: [Z.Z]

Assessment: [Above/At/Below] expectations

[If below]: Priority focus areas: [list pillars scoring below stage expectations]
[If at/above]: Continue improving [lowest scoring pillars] while scaling

When Evaluating Existing Architectures

When used with /architect:well-architected on an existing architecture (not a new blueprint):

  1. Ask about current state: Don't assume. Ask what's already in place for each pillar.
  2. Be specific: Reference actual services, endpoints, and tools from the architecture.
  3. Score honestly: A 3 is not a failure. Most startups are in the 2.5-3.5 range.
  4. Prioritize ruthlessly: Don't give 20 improvements equal weight. Lead with the top 5 that give the most value.
  5. Match to stage: An MVP doesn't need the same score as an enterprise product.