AWS Well-Architected Framework

Purpose

This skill encodes the AWS Well-Architected Framework's six pillars as heuristics and checklists for planning and reviewing AWS interactions. Use it to ensure architectural decisions align with AWS best practices.

When to Use

•Planning new AWS resources or architectures
•Reviewing existing infrastructure
•Validating proposed changes
•Assessing compliance with best practices
•Identifying improvement opportunities

When NOT to Use

•Emergency fixes (address the immediate issue first, then review afterward)
•Cost-only analysis (use aws-cost-optimizer for detailed cost work)
•Compliance-only validation (use aws-governance-guardrails)

The Six Pillars

Pillar	Focus
Operational Excellence	Operations, automation, improvement
Security	Protection, detection, response
Reliability	Recovery, resilience, availability
Performance Efficiency	Right resources, optimization
Cost Optimization	Cost awareness, efficiency
Sustainability	Environmental impact, efficiency

Quick Assessment Checklist

For any AWS interaction, consider:

## Well-Architected Quick Check

### Operational Excellence
- [ ] Can this be automated/codified?
- [ ] Are operations documented?
- [ ] How will we monitor this?

### Security
- [ ] Least privilege applied?
- [ ] Data encrypted?
- [ ] Logging enabled?

### Reliability
- [ ] Multi-AZ/region considered?
- [ ] Backup/recovery defined?
- [ ] Failure modes understood?

### Performance Efficiency
- [ ] Right-sized for workload?
- [ ] Scaling approach defined?
- [ ] Appropriate service type?

### Cost Optimization
- [ ] Cost-aware sizing?
- [ ] Reserved capacity, Savings Plans, or Spot considered?
- [ ] Idle resource risk?

### Sustainability
- [ ] Efficient resource use?
- [ ] Right region for workload?
- [ ] Scaling matches demand?

Pillar 1: Operational Excellence

Principles

•Perform operations as code — Use IaC, automation
•Make frequent, small, reversible changes — Reduce blast radius
•Refine operations procedures frequently — Continuously improve
•Anticipate failure — Pre-mortems, game days
•Learn from all operational failures — Blameless post-mortems

Key Questions

Question	Good Answer
How do you deploy changes?	CI/CD pipeline with approvals
How do you monitor?	CloudWatch, alarms, dashboards
How do you respond to incidents?	Runbooks, on-call rotation
How do you improve?	Regular reviews, metrics tracking

Best Practices for AWS Coworker

## Operational Excellence Checklist

Infrastructure as Code:
- [ ] Changes defined in CDK/Terraform/CloudFormation
- [ ] Version controlled in Git
- [ ] Deployed via CI/CD pipeline

Monitoring:
- [ ] CloudWatch metrics enabled
- [ ] Alarms for critical metrics
- [ ] Dashboard for visibility

Documentation:
- [ ] Runbooks for common operations
- [ ] Architecture documented
- [ ] Change history maintained
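
The "Alarms for critical metrics" item can itself be kept as code. The following is a minimal sketch, assuming an EC2 instance and an existing SNS topic. The function name `cpu_alarm_params`, the alarm name pattern, and the threshold and period values are illustrative assumptions; the dictionary keys are parameters of the CloudWatch `PutMetricAlarm` operation.

```python
def cpu_alarm_params(instance_id: str, sns_topic_arn: str,
                     threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for a CloudWatch high-CPU alarm on one EC2 instance."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming pattern
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint (assumed value)
        "EvaluationPeriods": 3,   # three consecutive breaching periods (assumed value)
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
        "TreatMissingData": "missing",
    }


# Placeholder identifiers, for illustration only.
params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```

Keeping the parameters in a plain function means the alarm definition can be reviewed and version controlled like any other change.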

Pillar 2: Security

Principles

•Implement a strong identity foundation — Least privilege, centralized identity
•Enable traceability — Logging, monitoring, auditing
•Apply security at all layers — Network, compute, data
•Automate security best practices — Security as code
•Protect data in transit and at rest — Encryption everywhere
•Keep people away from data — Reduce direct access
•Prepare for security events — Incident response ready
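
The least-privilege principle above can be partly checked mechanically. The following is a minimal sketch, assuming an IAM policy document has already been loaded as a Python dict. The function name `find_wildcard_statements` and the example policy are hypothetical and are not part of any AWS API.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts a single value or a list for Statement/Action/Resource; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements with a "*" or "service:*" action, or a "*" resource."""
    findings = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        broad_actions = [a for a in _as_list(stmt.get("Action"))
                         if a == "*" or a.endswith(":*")]
        broad_resources = [r for r in _as_list(stmt.get("Resource")) if r == "*"]
        if broad_actions or broad_resources:
            findings.append({
                "sid": stmt.get("Sid"),
                "actions": broad_actions,
                "resources": broad_resources,
            })
    return findings


# Hypothetical policy used only to exercise the check.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadOneBucket", "Effect": "Allow",
         "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    ],
}

for finding in find_wildcard_statements(example_policy):
    print(finding)
```

A finding is a prompt for review rather than an automatic failure: some actions do not support resource-level permissions and legitimately require `"Resource": "*"`.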

Key Questions

Question	Good Answer
How do you manage identities?	SSO, IAM roles, no long-lived credentials
How do you detect threats?	GuardDuty, Security Hub, CloudTrail
How do you protect data?	KMS encryption, TLS 1.2+, access controls
How do you respond to incidents?	Documented IR plan, practiced

Best Practices for AWS Coworker

## Security Checklist

Identity and Access:
- [ ] IAM roles with least privilege
- [ ] No wildcard (*) actions or resources without a documented justification
- [ ] MFA for human access
- [ ] Service roles for automation

Detection:
- [ ] CloudTrail enabled (all regions)
- [ ] GuardDuty enabled
- [ ] VPC Flow Logs enabled
- [ ] Security Hub findings reviewed

Data Protection:
- [ ] Encryption at rest (KMS)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] S3 bucket policies restrictive
- [ ] No public access unless intentional

Network:
- [ ] Security groups least privilege
- [ ] No 0.0.0.0/0 (or ::/0) ingress to sensitive ports
- [ ] Private subnets for data tier
- [ ] NACLs for additional control
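
The network item about open ingress to sensitive ports lends itself to an automated check. The following is a minimal sketch, assuming a security group dict in the shape returned by EC2 `DescribeSecurityGroups` (`IpPermissions`, `IpRanges`, `Ipv6Ranges`). The function name `open_sensitive_ingress` and the port list are assumptions chosen for illustration.

```python
# Assumed set of sensitive ports: SSH, RDP, MySQL, PostgreSQL.
SENSITIVE_PORTS = frozenset({22, 3389, 3306, 5432})
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_sensitive_ingress(security_group: dict,
                           sensitive_ports=SENSITIVE_PORTS) -> list[int]:
    """Return the sensitive ports that one security group exposes to the whole internet."""
    exposed = set()
    for perm in security_group.get("IpPermissions", []):
        cidrs = {r.get("CidrIp") for r in perm.get("IpRanges", [])}
        cidrs |= {r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])}
        if not cidrs & OPEN_CIDRS:
            continue
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            exposed.update(sensitive_ports)
            continue
        if protocol not in ("tcp", "udp"):
            continue  # e.g. ICMP rules carry type/code values, not port numbers
        from_port, to_port = perm.get("FromPort"), perm.get("ToPort")
        if from_port is None or to_port is None:
            continue
        exposed.update(p for p in sensitive_ports if from_port <= p <= to_port)
    return sorted(exposed)


# Hypothetical security group used only to exercise the check.
example_group = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}], "Ipv6Ranges": []},
    ],
}

print(open_sensitive_ingress(example_group))  # [22]
```

The sketch only inspects CIDR-based rules; rules that reference prefix lists or other security groups would need separate handling.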

Pillar 3: Reliability

Principles

•Automatically recover from failure — Auto-healing, auto-scaling
•Test recovery procedures — Regular DR tests
•Scale horizontally — Distribute load
•Stop guessing capacity — Auto-scale based on demand
•Manage change in automation — Controlled deployments

Key Questions

Question	Good Answer
How do you handle failure?	Auto-scaling, health checks, failover
How do you back up data?	Automated backups, tested restores
What are your RPO and RTO?	Defined per workload and tested
How do you test resilience?	Chaos engineering, DR drills

Best Practices for AWS Coworker

## Reliability Checklist

Availability:
- [ ] Multi-AZ deployment
- [ ] Load balancer health checks
- [ ] Auto-scaling configured
- [ ] No single points of failure

Backup and Recovery:
- [ ] Automated backups enabled
- [ ] Backup retention appropriate
- [ ] Restore tested recently
- [ ] Cross-region backup (if required)

Change Management:
- [ ] Blue/green or rolling deployments
- [ ] Rollback procedure documented
- [ ] Deployment tested in staging
- [ ] Feature flags for gradual rollout

Resilience:
- [ ] Graceful degradation designed
- [ ] Circuit breakers implemented
- [ ] Timeout and retry logic
- [ ] Dependency failures handled
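
The "Timeout and retry logic" item is commonly implemented as capped exponential backoff with jitter. The following is a minimal sketch of that technique. The helper name `call_with_retries` and its default values are assumptions, and the sketch retries on any exception, which real code would narrow to errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying failures with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except Exception:
            # Assumption: every exception is retryable. Narrow this to
            # throttling and timeout errors in real code.
            if attempt >= max_attempts:
                raise
        # The delay ceiling doubles per attempt, up to max_delay; the actual
        # wait is a random value below that ceiling ("full jitter").
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
        attempt += 1


# Simulated dependency that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated dependency timeout")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")  # ok after 3 attempts
```

The jitter spreads retries from many clients over time, so a recovering dependency is not hit by synchronized retry waves.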

Pillar 4: Performance Efficiency

Principles

•Democratize advanced technologies — Use managed services
•Go global in minutes — Multi-region when needed
•Use serverless architectures — Where appropriate
•Experiment more often — A/B test, measure
•Consider mechanical sympathy — Understand how services work

Key Questions

Question	Good Answer
How do you select resources?	Based on workload requirements, benchmarked
How do you monitor performance?	Metrics, tracing, profiling
How do you optimize?	Regular review, right-sizing
How do you stay current?	Evaluate new services regularly

Best Practices for AWS Coworker

## Performance Efficiency Checklist

Resource Selection:
- [ ] Instance type matches workload
- [ ] Storage type appropriate (gp3, io2, etc.)
- [ ] Network bandwidth sufficient
- [ ] Managed service preferred when suitable

Monitoring:
- [ ] Response time metrics
- [ ] Resource utilization tracked
- [ ] Bottlenecks identified
- [ ] Baseline established

Optimization:
- [ ] Right-sized (not over-provisioned)
- [ ] Caching used appropriately
- [ ] CDN for static content
- [ ] Database queries optimized
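
The "Right-sized (not over-provisioned)" item depends on comparing observed utilization with a baseline. The following is a minimal sketch, assuming CPU utilization samples (in percent) have already been exported from monitoring. The function names and the 20% and 80% thresholds are illustrative assumptions, not AWS guidance.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rightsizing_hint(cpu_percent_samples: list[float],
                     low: float = 20.0, high: float = 80.0) -> str:
    """Classify a resource by its 95th-percentile CPU utilization."""
    p95 = percentile(cpu_percent_samples, 95)
    if p95 < low:
        return "over-provisioned"
    if p95 > high:
        return "under-provisioned"
    return "ok"


# Hypothetical utilization samples for one instance.
quiet_instance = [5, 8, 10, 12, 9, 7, 11, 6, 10, 15]
print(rightsizing_hint(quiet_instance))  # over-provisioned
```

CPU is only one dimension; memory, network, and I/O would need the same treatment before changing an instance type.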

Pillar 5: Cost Optimization

Principles

•Implement cloud financial management — Cost awareness culture
•Adopt a consumption model — Pay only for what you use
•Measure overall efficiency — Cost per business outcome
•Stop spending money on undifferentiated heavy lifting — Managed services
•Analyze and attribute expenditure — Tagging, cost allocation

Key Questions

Question	Good Answer
How do you track costs?	Cost Explorer, budgets, alerts
How do you right-size?	Regular utilization review
How do you use pricing models?	Reserved Instances, Savings Plans, Spot Instances
How do you manage demand?	Auto-scaling, scheduling

Best Practices for AWS Coworker

## Cost Optimization Checklist

Visibility:
- [ ] Cost allocation tags applied
- [ ] Budgets configured
- [ ] Cost anomaly alerts set
- [ ] Regular cost review scheduled

Right-Sizing:
- [ ] Utilization metrics reviewed
- [ ] Over-provisioned resources identified
- [ ] Instance type optimization considered
- [ ] Storage tier appropriate

Pricing Models:
- [ ] Reserved capacity for steady-state workloads
- [ ] Savings Plans evaluated
- [ ] Spot Instances for fault-tolerant workloads
- [ ] On-Demand only for variable or unpredictable workloads

Waste Elimination:
- [ ] Idle resources identified
- [ ] Unused resources terminated
- [ ] Dev/test scaled down off-hours
- [ ] Old snapshots cleaned up
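
The "Dev/test scaled down off-hours" item can be sized with simple arithmetic. The following is a minimal sketch that compares scheduled running hours with an always-on week of 168 hours. The function name `off_hours_saving_fraction` and the example schedule are assumptions.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def off_hours_saving_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on instance hours avoided by running only on a schedule."""
    if not (0 <= hours_per_day <= 24 and 0 <= days_per_week <= 7):
        raise ValueError("schedule must fit within a 24-hour day and 7-day week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 12 hours on 5 weekdays is 60 of 168 hours, so about 64% fewer instance hours.
print(f"{off_hours_saving_fraction(12, 5):.0%}")  # 64%
```

The figure covers compute hours only; attached storage such as EBS volumes continues to be billed while instances are stopped.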

Pillar 6: Sustainability

Principles

•Understand your impact — Measure carbon footprint
•Establish sustainability goals — Targets and metrics
•Maximize utilization — Reduce idle resources
•Anticipate and adopt new offerings — More efficient services
•Use managed services — Shared, optimized infrastructure
•Reduce downstream impact — Efficient data transfer

Key Questions

Question	Good Answer
How do you measure impact?	Carbon footprint tracking (e.g., AWS Customer Carbon Footprint Tool)
How do you maximize efficiency?	Right-sizing, auto-scaling
How do you select services?	Sustainability weighed alongside cost and performance
How do you optimize data?	Lifecycle policies, efficient formats

Best Practices for AWS Coworker

## Sustainability Checklist

Efficiency:
- [ ] Resources right-sized
- [ ] Auto-scaling matches demand
- [ ] Idle resources minimized
- [ ] Efficient instance types (Graviton)

Data:
- [ ] Data lifecycle policies
- [ ] Efficient storage classes
- [ ] Data transfer minimized
- [ ] Compression used

Services:
- [ ] Serverless where appropriate
- [ ] Managed services preferred
- [ ] Region selection considers sustainability
- [ ] Latest generation resources
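
The "Data lifecycle policies" and "Efficient storage classes" items map onto an S3 lifecycle configuration. The following is a minimal sketch that builds one as a Python dict in the shape accepted by the S3 `PutBucketLifecycleConfiguration` operation. The function name `lifecycle_configuration`, the rule ID pattern, and the day counts are assumptions to adapt per workload.

```python
def lifecycle_configuration(prefix: str, ia_after_days: int = 30,
                            glacier_after_days: int = 90,
                            expire_after_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration that tiers, then expires, objects under a prefix."""
    if ia_after_days < 30:
        # S3 requires objects to be at least 30 days old before moving to STANDARD_IA.
        raise ValueError("ia_after_days must be at least 30")
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("day counts must increase: IA, then Glacier, then expiry")
    return {
        "Rules": [
            {
                "ID": f"tiering-{prefix.strip('/') or 'all'}",  # hypothetical ID pattern
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = lifecycle_configuration("logs/")
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
print(config["Rules"][0]["ID"])  # tiering-logs
```

Expiration permanently deletes objects, so the retention period should be agreed with data owners before a rule like this is enabled.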

Using This Skill

For Planning

Before creating a plan, assess against all six pillars:

## Well-Architected Assessment: {Resource/Change}

| Pillar | Score | Notes |
|--------|-------|-------|
| Operational Excellence | ✅/⚠️/❌ | |
| Security | ✅/⚠️/❌ | |
| Reliability | ✅/⚠️/❌ | |
| Performance Efficiency | ✅/⚠️/❌ | |
| Cost Optimization | ✅/⚠️/❌ | |
| Sustainability | ✅/⚠️/❌ | |

### Key Findings
[Summary of findings]

### Recommendations
[Actions to improve alignment]
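
The assessment template above can also be filled in programmatically. The following is a minimal sketch, assuming each pillar has already been given a status and a note by a reviewer. The function name `render_assessment` and the status keys `pass`, `warn`, and `fail` are hypothetical.

```python
PILLARS = [
    "Operational Excellence",
    "Security",
    "Reliability",
    "Performance Efficiency",
    "Cost Optimization",
    "Sustainability",
]
SCORES = {"pass": "✅", "warn": "⚠️", "fail": "❌"}


def render_assessment(subject: str, results: dict[str, tuple[str, str]]) -> str:
    """Render the assessment table as markdown, requiring an entry for every pillar."""
    missing = [p for p in PILLARS if p not in results]
    if missing:
        raise ValueError(f"missing pillars: {', '.join(missing)}")
    lines = [
        f"## Well-Architected Assessment: {subject}",
        "",
        "| Pillar | Score | Notes |",
        "|--------|-------|-------|",
    ]
    for pillar in PILLARS:
        status, notes = results[pillar]
        lines.append(f"| {pillar} | {SCORES[status]} | {notes} |")
    return "\n".join(lines)


# Hypothetical review results for illustration.
example_results = {pillar: ("pass", "") for pillar in PILLARS}
example_results["Security"] = ("fail", "Security group allows SSH from 0.0.0.0/0")
example_results["Cost Optimization"] = ("warn", "No budget configured")

print(render_assessment("New RDS instance", example_results))
```

Rejecting incomplete input keeps a pillar from being skipped silently, which matches the instruction to assess against all six.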

For Reviews

Use the pillar checklists above to validate existing infrastructure, and record each unchecked item as a finding.

Related Files

Detailed pillar guidance in:

•pillars/operational-excellence.md
•pillars/security.md
•pillars/reliability.md
•pillars/performance-efficiency.md
•pillars/cost-optimization.md
•pillars/sustainability.md

Related Skills

•aws-cli-playbook — Implementation patterns
•aws-governance-guardrails — Policy compliance
•aws-cost-optimizer — Detailed cost analysis
•aws-observability-setup — Monitoring implementation