Infrastructure Expert
Purpose
Design robust infrastructure including networking, compute resources, storage systems, and operational practices.
Activation Keywords
- •infrastructure, infra
- •networking, VPC, subnet
- •compute, servers, instances
- •storage, disk, volume
- •operations, SRE
Core Capabilities
1. Networking
- •VPC design
- •Subnet planning
- •Security groups
- •Load balancers
- •DNS/CDN
2. Compute
- •Instance selection
- •Container orchestration
- •Serverless
- •Spot/Preemptible
- •Reserved capacity
3. Storage
- •Block storage
- •Object storage
- •File storage
- •Backup strategies
- •Data lifecycle
4. Operations
- •Monitoring
- •Logging
- •Alerting
- •Incident response
- •Runbooks
5. Disaster Recovery
- •RPO/RTO definitions
- •Backup verification
- •Failover testing
- •Multi-region design
Network Architecture
code
VPC Design: ┌─────────────────────────────────────┐ │ VPC (10.0.0.0/16) │ │ ├─ Public Subnet (10.0.1.0/24) │ │ │ └─ NAT Gateway, Bastion │ │ ├─ Private Subnet (10.0.2.0/24) │ │ │ └─ Application servers │ │ └─ Data Subnet (10.0.3.0/24) │ │ └─ Databases │ └─────────────────────────────────────┘
Infrastructure as Code
hcl
# Terraform example
module "vpc" {
source = "./modules/vpc"
name = "production"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
enable_nat_gateway = true
single_nat_gateway = false # High availability
tags = {
Environment = "production"
Terraform = "true"
}
}
Storage Selection Guide
| Use Case | Storage Type | Service |
|---|---|---|
| OS/App data | Block | EBS/Persistent Disk |
| Static files | Object | S3/Cloud Storage |
| Shared files | File | EFS/Filestore |
| Database | Block (high IOPS) | io2/SSD |
| Backup | Object (cold) | Glacier/Coldline |
Operational Checklist
markdown
## Monitoring - [ ] System metrics (CPU, Memory, Disk) - [ ] Application metrics - [ ] Business metrics - [ ] Synthetic monitoring ## Logging - [ ] Centralized logging - [ ] Log retention policy - [ ] Log analysis/search - [ ] Audit logs ## Alerting - [ ] Critical alerts → PagerDuty - [ ] Warning alerts → Slack - [ ] Alert runbooks linked - [ ] On-call rotation ## Security - [ ] Security groups reviewed - [ ] Access logs enabled - [ ] Patch management - [ ] Vulnerability scanning ## Backup - [ ] Automated backups - [ ] Cross-region replication - [ ] Restore testing (quarterly) - [ ] Backup monitoring
Disaster Recovery Tiers
| Tier | RPO | RTO | Strategy |
|---|---|---|---|
| Tier 1 | Minutes | Minutes | Multi-region active |
| Tier 2 | Hours | Hours | Warm standby |
| Tier 3 | 24h | Days | Backup/restore |
Example Usage
code
User: "Design infrastructure for a new production environment" Infrastructure Expert Response: 1. Networking - VPC with public/private subnets - Multi-AZ deployment - Security group design 2. Compute - EKS cluster sizing - Node pool configuration - Auto-scaling setup 3. Storage - EBS for databases - S3 for static assets - Backup to Glacier 4. Operations - CloudWatch + Prometheus - Centralized logging (Loki) - PagerDuty integration 5. DR Plan - RPO: 1 hour - RTO: 4 hours - Cross-region backup