Infrastructure as Code
Provision and manage cloud infrastructure using code-based automation tools. This skill covers tool selection, state management, module design, and operational patterns across Terraform/OpenTofu, Pulumi, and AWS CDK.
When to Use
Use this skill when:
- •Provisioning cloud infrastructure (compute, networking, databases, storage)
- •Migrating from manual infrastructure to code-based workflows
- •Designing reusable infrastructure modules
- •Implementing multi-cloud or hybrid-cloud deployments
- •Establishing state management and drift detection patterns
- •Integrating infrastructure provisioning into CI/CD pipelines
- •Evaluating IaC tools (Terraform vs Pulumi vs CDK)
Common requests:
- •"Create a Terraform module for VPC provisioning"
- •"Set up remote state with locking for team collaboration"
- •"Compare Pulumi vs Terraform for our use case"
- •"Design composable infrastructure modules"
- •"Implement drift detection for existing infrastructure"
Core Concepts
Infrastructure as Code Fundamentals
Key Principles:
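- •Declarative vs Imperative - Describe the desired state in a configuration language (Terraform) or build it with a general-purpose programming language (Pulumi)
- •Idempotency - Applying the same configuration repeatedly converges on the same infrastructure, so re-runs are safe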
- •Version Control - Infrastructure changes tracked in Git
- •State Management - Track actual infrastructure state
- •Module Composition - Reusable, versioned infrastructure components
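The declarative and idempotent principles above can be illustrated with a small sketch. This is not how any particular tool is implemented; `plan` and `apply` are hypothetical helpers that model state as plain dictionaries, purely to show that a second run after a successful apply produces an empty plan.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """Return the (action, name) steps needed to make `actual` match `desired`."""
    steps = []
    for name, spec in desired.items():
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return steps


def apply(desired: Resources, actual: Resources) -> Resources:
    """Execute the plan against a copy of `actual` and return the new state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = dict(desired[name])
    return new_state


# Hypothetical resources, for illustration only.
desired = {"vpc": {"cidr": "10.0.0.0/16"}, "db": {"size": "small"}}
actual = {"vpc": {"cidr": "10.0.0.0/8"}, "old_bucket": {}}

first_plan = plan(desired, actual)   # update vpc, create db, delete old_bucket
state = apply(desired, actual)
second_plan = plan(desired, state)   # empty: re-running changes nothing
```

The empty second plan is the property that makes re-running an apply safe.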
Benefits:
- •Reproducibility (same code = same infrastructure)
- •Auditability (Git history shows all changes)
- •Collaboration (code reviews for infrastructure changes)
- •Automation (CI/CD deploys infrastructure)
- •Disaster recovery (rebuild from code)
Tool Selection Framework
Choose IaC tools based on team composition and cloud strategy:
Terraform/OpenTofu - Declarative, HCL-based
- •Multi-cloud and hybrid-cloud deployments
- •Operations/SRE teams that prefer a declarative approach
- •Largest provider ecosystem (AWS, GCP, Azure, 3000+ providers)
- •Mature module registry and community
Pulumi - Programming language-based (imperative code that builds a declarative resource model)
- •Developer-centric teams familiar with TypeScript/Python/Go
- •Complex logic that requires programming constructs (loops, conditionals, functions)
- •Native unit testing using familiar test frameworks
- •Strong typing and IDE support
AWS CDK - AWS-native, programming language-based
- •AWS-only infrastructure
- •Tight integration with AWS services
- •L1/L2/L3 construct abstractions
- •CloudFormation under the hood
Decision Tree:
    Multi-cloud required?
    ├─ YES → Team composition?
    │   ├─ Ops/SRE focused → Terraform/OpenTofu
    │   └─ Developer focused → Pulumi
    └─ NO → AWS only?
        ├─ YES → Language preference?
        │   ├─ HCL/declarative → Terraform
        │   ├─ TypeScript/Python → AWS CDK
        │   └─ YAML/simple → CloudFormation
        └─ NO → GCP/Azure only?
            └─ Terraform or Pulumi
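The decision tree can also be written as a small function, which is sometimes handy for documenting a team's choice in a reviewable form. This is only a sketch: the function name, parameter names, and the string values for `team` and `preference` are assumptions made for illustration.

```python
def choose_iac_tool(
    multi_cloud: bool,
    team: str = "ops",          # "ops" or "dev" (hypothetical labels)
    aws_only: bool = False,
    preference: str = "hcl",    # "hcl", "typescript", "python", or "yaml"
) -> str:
    """Walk the decision tree and return the suggested tool."""
    if multi_cloud:
        if team == "ops":
            return "Terraform/OpenTofu"
        if team == "dev":
            return "Pulumi"
        raise ValueError(f"unknown team type: {team!r}")

    if aws_only:
        by_preference = {
            "hcl": "Terraform",
            "typescript": "AWS CDK",
            "python": "AWS CDK",
            "yaml": "CloudFormation",
        }
        if preference not in by_preference:
            raise ValueError(f"unknown preference: {preference!r}")
        return by_preference[preference]

    # Single non-AWS cloud (GCP or Azure)
    return "Terraform or Pulumi"


suggestion = choose_iac_tool(multi_cloud=False, aws_only=True, preference="typescript")
```

Real selection usually weighs more factors (existing skills, licensing, compliance), so treat the result as a starting point rather than a verdict.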
State Management Architecture
Remote state with locking enables team collaboration:
Backend Selection:
| Platform | Recommended Backend | Locking Mechanism |
|---|---|---|
| AWS | S3 + DynamoDB | DynamoDB table (recent Terraform/OpenTofu releases can also lock natively in S3 via use_lockfile) |
| GCP | Google Cloud Storage | Native |
| Azure | Azure Blob Storage | Lease-based |
| Multi-cloud | Terraform Cloud/Enterprise | Built-in |
| Pulumi | Pulumi Service | Built-in |
State Isolation Strategies:
- •Directory Separation (recommended for most teams)
  - •Separate directories per environment (prod/, staging/, dev/)
  - •Complete state file isolation
  - •No risk of cross-environment contamination
- •Workspaces
  - •Single codebase, multiple environments
  - •Shared state backend, environment namespacing
  - •Risk: accidental cross-environment operations
- •Layered Architecture
  - •Separate state files for networking, compute, data layers
  - •Blast radius reduction
  - •Cross-layer references via remote state data sources
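Directory separation and layering combine naturally: each environment and layer pair gets its own state file. The sketch below shows one possible key layout for a shared state bucket. The `project/environment/layer` naming scheme and the helper names are assumptions, not a convention required by any tool.

```python
from itertools import product
from typing import List


def state_key(project: str, environment: str, layer: str) -> str:
    """Build a state object key so no two environment/layer pairs share state."""
    return f"{project}/{environment}/{layer}/terraform.tfstate"


def state_keys(project: str, environments: List[str], layers: List[str]) -> List[str]:
    """Enumerate one state key per (environment, layer) combination."""
    return [state_key(project, env, layer) for env, layer in product(environments, layers)]


# Hypothetical project name; environments and layers follow the lists above.
keys = state_keys(
    "acme",
    environments=["prod", "staging", "dev"],
    layers=["networking", "compute", "data"],
)
```

With three environments and three layers this yields nine independent state files, so a failed apply in `dev/compute` cannot touch `prod/networking`.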
Critical State Management Rules:
- •Always use remote state for team environments
- •Enable state file encryption at rest
- •Enable versioning on state storage
- •Use state locking to prevent concurrent modifications
- •Never commit state files to Git
- •Mark sensitive outputs as sensitive = true
Module Design Patterns
Composable Module Structure:
    modules/
    ├── vpc/              # Network foundation
    ├── security-group/   # Reusable security group patterns
    ├── rds/              # Database with backups, encryption
    ├── ecs-cluster/      # Container orchestration base
    ├── ecs-service/      # Individual microservice
    └── alb/              # Application load balancer
Module Versioning:
- •Pin module versions in production (version = "5.1.0")
- •Use semantic versioning for internal modules
- •Test module updates in non-prod first
- •Maintain CHANGELOG for module releases
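Pinning pays off when upgrades are reviewed by how much they can break. The following sketch classifies a candidate module version against the pinned one under semantic versioning. It assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffixes, and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def classify_upgrade(pinned: str, candidate: str) -> str:
    """Return 'none', 'patch', 'minor', or 'major' for a proposed upgrade."""
    old, new = parse_version(pinned), parse_version(candidate)
    if new <= old:
        return "none"      # same version or a downgrade: nothing to upgrade
    if new[0] != old[0]:
        return "major"     # may contain breaking changes
    if new[1] != old[1]:
        return "minor"     # new features, intended to be backward compatible
    return "patch"         # bug fixes only


upgrade = classify_upgrade("5.1.0", "6.0.0")
```

A pipeline could, for example, auto-approve `patch` upgrades in non-prod while requiring a manual review for `major`.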
Module Design Principles:
- •Clear input contract (required vs optional variables)
- •Documented outputs (what consumers can reference)
- •Sane defaults where possible
- •Validation rules for inputs
- •Examples directory showing usage
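As an illustration of a clear input contract with defaults and validation, here is a sketch of the inputs for a VPC component expressed as a Python dataclass. The class and field names are hypothetical; the /16 to /28 prefix range reflects the CIDR block sizes AWS accepts for a VPC, and the two-AZ minimum is an assumed team policy.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class VpcInputs:
    # Required inputs
    name: str
    cidr_block: str
    availability_zones: List[str]
    # Optional inputs with sane defaults
    enable_nat_gateway: bool = True
    tags: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # ip_network raises ValueError for malformed CIDR strings
        network = ipaddress.ip_network(self.cidr_block)
        if not 16 <= network.prefixlen <= 28:
            raise ValueError("cidr_block prefix must be between /16 and /28")
        if len(self.availability_zones) < 2:
            raise ValueError("at least two availability zones are required")


inputs = VpcInputs(
    name="main",
    cidr_block="10.0.0.0/16",
    availability_zones=["us-east-1a", "us-east-1b"],
)
```

Failing at construction time mirrors what input validation rules do in a module: the consumer gets an error before any resource is planned.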
When to Create a Module:
- •Resource group is reused 3+ times
- •Clear boundaries and responsibilities
- •Stable interface contract
- •Team has module maintenance capacity
When to Keep Monolithic:
- •One-off infrastructure
- •Rapid prototyping phase
- •High coupling between resources
- •Small team, simple infrastructure
Quick Reference
Terraform/OpenTofu Commands
    # Initialize providers and backend
    terraform init

    # Plan changes (preview)
    terraform plan

    # Apply changes
    terraform apply

    # Destroy infrastructure
    terraform destroy

    # Format HCL files
    terraform fmt

    # Validate syntax
    terraform validate

    # Show state
    terraform state list
    terraform state show <resource>

    # Import existing resources
    terraform import <resource.name> <id>

    # Workspace management
    terraform workspace list
    terraform workspace new staging
    terraform workspace select prod
Pulumi Commands
    # Initialize new project
    pulumi new aws-typescript

    # Preview changes
    pulumi preview

    # Apply changes
    pulumi up

    # Destroy infrastructure
    pulumi destroy

    # Show stack outputs
    pulumi stack output

    # Manage stacks
    pulumi stack ls
    pulumi stack select prod

    # Import existing resources
    pulumi import <type> <name> <id>

    # Export/import state
    pulumi stack export > state.json
    pulumi stack import < state.json
AWS CDK Commands
    # Initialize new app
    cdk init app --language typescript

    # Synthesize CloudFormation
    cdk synth

    # Preview changes
    cdk diff

    # Deploy stack
    cdk deploy

    # Destroy stack
    cdk destroy

    # Bootstrap account/region
    cdk bootstrap

    # List stacks
    cdk list
Common Patterns Checklist
Infrastructure Provisioning:
- • Remote state configured with locking
- • State file encryption enabled
- • Provider versions pinned
- • Module versions pinned (production)
- • Variables have descriptions and types
- • Sensitive outputs marked as sensitive
- • Tagging strategy implemented
- • Cost allocation tags applied
Module Development:
- • Clear README with usage examples
- • Required vs optional variables documented
- • Outputs documented with descriptions
- • Validation rules for critical inputs
- • Examples directory with working code
- • Tests for module behavior (Terratest/CDK assertions)
- • CHANGELOG for version tracking
- • Semantic versioning followed
Operational Readiness:
- • Drift detection scheduled
- • CI/CD pipeline for plan/apply
- • State backup strategy
- • Disaster recovery documented
- • Team access controls configured (IAM/RBAC)
- • Cost estimation integrated (Infracost)
- • Security scanning integrated (Checkov/tfsec)
- • Documentation kept current
Detailed Documentation
For comprehensive patterns and implementation details:
Tool-Specific Patterns:
- •references/terraform-patterns.md - Terraform/OpenTofu best practices, HCL patterns
- •references/pulumi-patterns.md - Pulumi across TypeScript/Python/Go
Architecture and Design:
- •references/state-management.md - Remote state, locking, isolation strategies
- •references/module-design.md - Composable modules, versioning, registries
Operations:
- •references/drift-detection.md - Detecting and remediating infrastructure drift
Working Examples
Practical implementations demonstrating IaC patterns:
Terraform Examples:
- •examples/terraform/vpc-module/ - Multi-AZ VPC with public/private subnets
- •examples/terraform/ecs-service/ - ECS service with ALB, autoscaling
- •examples/terraform/rds-cluster/ - Aurora cluster with backups, encryption
- •examples/terraform/state-backend/ - S3 + DynamoDB backend setup
Pulumi Examples:
- •examples/pulumi/typescript/vpc/ - TypeScript VPC component
- •examples/pulumi/python/ecs-service/ - Python ECS service
- •examples/pulumi/go/rds-cluster/ - Go RDS cluster
- •examples/pulumi/testing/ - Unit tests for Pulumi programs
AWS CDK Examples:
- •examples/cdk/typescript/vpc-stack/ - VPC using L2 constructs
- •examples/cdk/typescript/ecs-fargate/ - Fargate service with ALB
- •examples/cdk/typescript/pipeline-stack/ - Self-mutating CDK pipeline
- •examples/cdk/testing/ - CDK assertions and snapshot tests
Utility Scripts
Automated validation and operational tools:
- •scripts/validate-terraform.sh - Terraform fmt, validate, tflint
- •scripts/cost-estimate.sh - Infracost wrapper for cost analysis
- •scripts/drift-check.sh - Scheduled drift detection
- •scripts/security-scan.sh - Checkov/tfsec security scanning
- •scripts/state-backup.sh - State file backup automation
- •scripts/module-release.sh - Module versioning and publishing
Integration with Other Skills
Deployment Pipeline:
- •building-ci-pipelines - Automate terraform plan/apply in CI/CD
- •gitops-workflows - GitOps-based infrastructure deployment
Platform Engineering:
- •kubernetes-operations - Provision EKS, GKE, AKS clusters
- •platform-engineering - Internal developer platform infrastructure
Security:
- •secret-management - Provision Vault, External Secrets Operator
- •security-hardening - Implement infrastructure security controls
- •compliance-frameworks - Policy-as-code for compliance
Operations:
- •observability - Provision monitoring infrastructure (Prometheus, Grafana)
- •disaster-recovery - Infrastructure rebuild procedures
- •cost-optimization - Implement cost controls via IaC
Data Platform:
- •data-architecture - Provision data lakes, warehouses
- •streaming-data - Provision Kafka, Kinesis infrastructure
Best Practices
Development Workflow:
- •Write infrastructure code in feature branches
- •Run terraform plan / pulumi preview locally
- •Submit pull request with plan output
- •Code review focuses on security, cost, blast radius
- •CI runs automated tests and security scans
- •Apply only after approval and CI passes
- •Monitor for drift post-deployment
State Management:
- •Use remote state from day one (never local state for teams)
- •Separate state files per environment
- •Enable state locking to prevent concurrent modifications
- •Version state storage for rollback capability
- •Encrypt state at rest (contains sensitive data)
- •Regular state backups to separate location
Module Development:
- •Start with monolithic code, extract modules when patterns emerge
- •Design for reusability but avoid premature abstraction
- •Document all inputs and outputs
- •Provide working examples in examples/ directory
- •Pin provider versions in modules
- •Test modules before publishing
- •Use semantic versioning for releases
Security:
- •Scan IaC for security issues before apply (Checkov, tfsec)
- •Never commit secrets to code (use secret references)
- •Mark sensitive outputs as sensitive = true
- •Implement least-privilege IAM policies
- •Enable resource encryption by default
- •Use private module registries for internal modules
Cost Management:
- •Estimate costs before applying changes (Infracost)
- •Tag all resources for cost allocation
- •Review cost impact in pull requests
- •Set up cost alerts for unexpected spend, including spend caused by drift
- •Rightsize resources based on usage
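Tagging for cost allocation is easy to check mechanically. The sketch below reports resources that lack required tags. The tag keys and the resource dictionary shape are assumptions for illustration; in practice the input would come from plan output or a cloud inventory export.

```python
from typing import Dict, List

# Hypothetical tag policy; adjust to the organization's cost allocation scheme.
REQUIRED_TAGS = {"Environment", "Owner", "CostCenter"}


def missing_tags(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource name to its sorted list of missing tag keys."""
    report = {}
    for name, tags in resources.items():
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            report[name] = sorted(missing)
    return report


# Hypothetical resources and tags.
resources = {
    "aws_vpc.main": {"Environment": "prod", "Owner": "platform", "CostCenter": "1234"},
    "aws_db_instance.orders": {"Environment": "prod"},
}
report = missing_tags(resources)
```

Failing a pull request when the report is non-empty keeps untagged resources from reaching the bill.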
Operational Excellence:
- •Schedule regular drift detection
- •Document disaster recovery procedures
- •Maintain runbooks for common operations
- •Monitor state file access logs
- •Practice infrastructure rebuilds periodically
- •Keep provider versions current with testing
Common Pitfalls
State File Issues:
- •Manual state editing - Use terraform state commands, not direct edits
- •No state locking - Race conditions corrupt state
- •Local state for teams - State divergence across team members
- •Large state files - Break into multiple state files by layer
Module Design:
- •Over-abstraction - Too generic, hard to understand
- •Under-abstraction - Copy-paste code everywhere
- •No version pinning - Unexpected breaking changes
- •No examples - Users don't know how to consume module
Operations:
- •No drift detection - Manual changes go unnoticed
- •Direct resource modification - Bypassing IaC creates drift
- •No rollback plan - Can't recover from failed apply
- •Ignoring plan output - Surprises during apply
Security:
- •Secrets in code - Hard-coded credentials
- •No security scanning - Vulnerabilities in production
- •Overly permissive IAM - Excessive privileges
- •No state encryption - Sensitive data exposed
Troubleshooting Guide
State Lock Issues:
    # Use only when certain that no other process holds the lock
    terraform force-unlock <lock-id>
Import Existing Resources:
    terraform import aws_vpc.main vpc-12345678
    pulumi import aws:ec2/vpc:Vpc main vpc-12345678
Drift Detection:
    # Exit code 0 = no changes, 1 = error, 2 = changes pending (including drift)
    terraform plan -detailed-exitcode
    pulumi preview --diff
For detailed drift remediation, see references/drift-detection.md.
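A scheduled drift check mostly comes down to interpreting the exit code of `terraform plan -detailed-exitcode`. The sketch below wraps that call. The function names and status labels are hypothetical, and `check_drift` assumes `terraform` is on the PATH and the working directory is already initialized.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {
    0: "in_sync",          # succeeded, empty diff
    1: "error",            # plan failed
    2: "changes_pending",  # succeeded, non-empty diff (drift or unapplied code)
}


def interpret_plan_exit_code(code: int) -> str:
    """Translate a plan exit code into a status label; unknown codes count as errors."""
    return EXIT_MEANINGS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether changes are pending."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Note that exit code 2 does not distinguish drift from code changes that have not been applied yet, so the check is most meaningful when run against the last applied commit.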
State Recovery:
    # Terraform: restore a backup copy of the state from S3
    aws s3 cp s3://bucket/backup/terraform.tfstate terraform.tfstate

    # Pulumi: re-import a previous stack version
    pulumi stack export --version <version> | pulumi stack import
Related Skills
For cloud-specific implementations:
- •aws-patterns - AWS-specific resource patterns
- •gcp-patterns - GCP-specific resource patterns
- •azure-patterns - Azure-specific resource patterns
For infrastructure operations:
- •kubernetes-operations - Manage Kubernetes clusters provisioned via IaC
- •gitops-workflows - GitOps-based infrastructure deployment
- •platform-engineering - Internal developer platforms
For security and compliance:
- •security-hardening - Infrastructure security controls
- •secret-management - Secret injection and rotation
- •compliance-frameworks - Policy-as-code for compliance
For deployment automation:
- •building-ci-pipelines - CI/CD for infrastructure code
- •deploying-applications - Application deployment to provisioned infrastructure
For cost and observability:
- •cost-optimization - FinOps practices for infrastructure
- •observability - Monitoring infrastructure health