Cost Optimization

Purpose

Cloud cost optimization transforms uncontrolled spending into strategic resource allocation through the FinOps lifecycle: Inform, Optimize, and Operate. This skill provides decision frameworks for commitment-based discounts (Reserved Instances, Savings Plans), right-sizing strategies, Kubernetes cost management, and automated cost governance across multi-cloud environments.

When to Use This Skill

Invoke cost-optimization when:

•Reducing cloud spend by 15-40% through systematic optimization
•Implementing cost visibility dashboards and allocation tracking
•Establishing budget alerts and anomaly detection
•Optimizing Kubernetes resource requests and cluster efficiency
•Managing Reserved Instances, Savings Plans, or Committed Use Discounts
•Automating idle resource cleanup and right-sizing recommendations
•Setting up showback/chargeback models for internal teams
•Preventing cost overruns through CI/CD cost estimation (Infracost)
•Responding to finance team requests for cloud cost reduction

FinOps Principles

The FinOps Lifecycle

code

┌─────────────────────────────────────────────────────┐
│  INFORM → OPTIMIZE → OPERATE (continuous loop)      │
│    ↓         ↓           ↓                          │
│ Visibility  Action   Automation                     │
└─────────────────────────────────────────────────────┘

Inform Phase: Establish cost visibility

•Enable cost allocation tags (Owner, Project, Environment)
•Deploy real-time cost dashboards for engineering teams
•Integrate cloud billing data (AWS CUR, Azure Consumption API, GCP BigQuery)
•Set up Kubernetes cost monitoring (Kubecost, OpenCost)

Optimize Phase: Take action on cost drivers

•Purchase commitment-based discounts (40-72% savings)
•Right-size over-provisioned resources (target 60-80% utilization)
•Implement spot/preemptible instances for fault-tolerant workloads
•Clean up idle resources (unattached volumes, old snapshots)

Operate Phase: Automate and govern

•Budget alerts with cascading notifications (50%, 75%, 90%, 100%)
•Automated cleanup scripts for idle resources
•CI/CD cost estimation to prevent surprise increases
•Continuous monitoring with anomaly detection

Core FinOps Principles

•Collaboration: Cross-functional teams (finance, engineering, operations, product)
•Accountability: Teams own the cost of their services
•Transparency: All costs visible and understandable to stakeholders
•Optimization: Continuous improvement of cost efficiency

For detailed FinOps maturity models and organizational structures, see references/finops-foundations.md.

Cost Optimization Strategies

1. Commitment-Based Discounts

Reserved Instances (RIs): 40-72% discount for 1-3 year commitments

•Standard RI: Instance type locked, highest discount (60% for 3-year)
•Convertible RI: Flexible instance types, moderate discount (54% for 3-year)
•Use for: Databases (RDS, ElastiCache), stable production EC2 workloads

Savings Plans: Flexible compute commitments

•Compute Savings Plans: Applies to EC2, Fargate, Lambda (54% discount for 3-year)
•EC2 Instance Savings Plans: Tied to instance family (66% discount for 3-year)
•Use for: Workloads that change instance types or regions

GCP Committed Use Discounts (CUDs): 25-70% discount

•Resource-based CUDs: Commit to vCPU, memory, GPUs
•Spend-based CUDs: Commit to dollar amount (flexible)
•Sustained Use Discounts: Automatic 20-30% discount for sustained usage (no commitment)

Decision Framework:

code

Reserve when:
├─ Workload is production-critical (24/7 uptime required)
├─ Usage is predictable (stable baseline over 6+ months)
├─ Architecture is stable (unlikely to change instance types)
└─ Financial commitment acceptable (1-3 year lock-in)

Use On-Demand when:
├─ Development/testing environments
├─ Unpredictable spiky workloads
├─ Short-term projects (<6 months)
└─ Evaluating new instance types

For detailed commitment strategies and RI coverage analysis, see references/commitment-strategies.md.

2. Spot and Preemptible Instances

Discount: 70-90% off on-demand pricing (interruptible with 2-minute warning)

Use Spot For: CI/CD workers, batch jobs, ML training (with checkpointing), Kubernetes workers, data analytics Avoid Spot For: Stateful databases, real-time services, long-running jobs without checkpointing

Best Practices:

•Diversify instance types and spread across Availability Zones
•Implement graceful shutdown handlers
•Auto-fallback to on-demand when capacity unavailable
•Kubernetes: Mix 70% spot + 30% on-demand nodes with taints/tolerations

3. Right-Sizing Strategies

Target Utilization: 60-80% average (leave headroom for spikes)

Compute Right-Sizing:

•Analyze actual CPU/memory utilization over 30+ days
•Downsize instances with <40% average utilization
•Consolidate underutilized workloads
•Switch instance families (compute-optimized vs. memory-optimized)

Database Right-Sizing:

•Analyze connection pool usage (max connections vs. allocated)
•Downgrade storage IOPS if utilization <50%
•Evaluate read replica necessity (can caching replace it?)
•Consider serverless options (Aurora Serverless, Azure SQL Serverless)

Kubernetes Right-Sizing:

•Set requests = average usage (not peak)
•Set limits = 2-3x requests (allow bursting)
•Use Vertical Pod Autoscaler (VPA) for automated recommendations
•Identify pods with 0% CPU usage (candidates for consolidation)

Storage Right-Sizing:

•Delete unattached volumes (EBS, Azure Disks, GCP Persistent Disks)
•Delete old snapshots (>90 days, retention policy not required)
•Implement lifecycle policies (S3 Intelligent-Tiering, Azure Blob Lifecycle)
•Compress/deduplicate data

Right-Sizing Tools:

•AWS Compute Optimizer: ML-based EC2, Lambda, EBS recommendations
•Azure Advisor: VM rightsizing, reserved instance advice
•GCP Recommender: VM, disk, commitment recommendations
•VPA (Vertical Pod Autoscaler): Automated container resource requests

4. Kubernetes Cost Management

Resource Requests and Limits:

yaml

# Set requests = average usage (enables efficient bin-packing)
resources:
  requests:
    cpu: 500m        # 0.5 CPU cores (average usage)
    memory: 1Gi      # 1 GiB memory (average usage)
  limits:
    cpu: 1500m       # 1.5 CPU cores (3x requests, allows bursting)
    memory: 3Gi      # 3 GiB memory (3x requests)

Namespace Quotas: Prevent runaway resource consumption

•ResourceQuota: Limit total CPU/memory per namespace
•LimitRange: Default/max requests per pod
•PriorityClass: Ensure critical pods get resources

Cluster Autoscaling:

•Scale down idle nodes to reduce costs
•Scale-to-zero for dev clusters during off-hours
•Use multiple node pools (spot + on-demand mix)
•Set max node limits to prevent overspend

Cost Visibility:

•Deploy Kubecost or OpenCost for namespace-level cost tracking
•Allocate costs by labels (team, project, environment)
•Track idle cost (cluster capacity not allocated to workloads)
•Generate showback/chargeback reports

For detailed Kubernetes cost optimization patterns, see references/kubernetes-cost-optimization.md.

Cost Visibility and Monitoring

Tagging for Cost Allocation

Required Tags:

•Owner or Team - Responsible team/department
•Project or Application - Business unit or application name
•Environment - prod, staging, dev, test
•CostCenter - Finance cost center code

Enable Cost Allocation Tags:

•AWS: Activate tags in Cost Allocation Tags console
•Azure: Apply tags via Azure Policy enforcement
•GCP: Use labels on all resources, export to BigQuery

For comprehensive tagging strategies, see references/tagging-for-cost-allocation.md.

Monitoring and Dashboards

Native Cloud Tools:

•AWS Cost Explorer: Analyze spending patterns, forecast costs
•Azure Cost Management + Billing: Budget tracking, cost analysis
•GCP Cloud Billing: BigQuery export for custom analysis

Third-Party Platforms:

•Kubecost: Kubernetes cost visibility and optimization
•CloudZero: Unit cost economics, anomaly detection
•CloudHealth: Multi-cloud cost management
•Infracost: Terraform cost estimation in CI/CD

Key Metrics to Track:

•Total monthly cloud spend (trend over time)
•Cost per service/team/project (allocation accuracy)
•Unit cost metrics (cost per customer, cost per transaction)
•Reserved Instance/Savings Plan utilization (target >95%)
•Idle resource waste (target <5% of total spend)
•Budget variance (forecasted vs. actual)

Budget Alerts and Anomaly Detection

Cascading Budget Alerts:

code

50% of budget  → Email to team lead (informational)
75% of budget  → Email + Slack to team (warning)
90% of budget  → Email + Slack + PagerDuty (urgent)
100% of budget → Automated shutdown (non-prod only) or escalation

Anomaly Detection: Alert on unexpected cost spikes

•

20% cost increase week-over-week
•

$500 unexpected daily cost spike
•New resource types (unusual spend patterns)

Budget Granularity:

•Organization-level (total cloud spend)
•Department-level (engineering, data, marketing)
•Project-level (per application/service)
•Environment-level (prod vs. dev/staging)

Decision Frameworks

Framework 1: Commitment Discount Decision Tree

code

Should we purchase Reserved Instances / Savings Plans?

STEP 1: Analyze Historical Usage (6-12 months)
├─ Identify steady-state baseline (minimum usage)
├─ Exclude spiky/seasonal workloads
└─ Calculate: (baseline usage) / (total usage) = commitment %

STEP 2: Choose Commitment Type
├─ RESERVED INSTANCES
│   ├─ Pros: Highest discount (up to 72%)
│   ├─ Cons: Instance type locked (unless convertible)
│   └─ Use for: Databases, stable production workloads
│
├─ SAVINGS PLANS
│   ├─ Pros: Flexible (across instance types, regions)
│   ├─ Cons: Slightly lower discount than RI
│   └─ Use for: Compute workloads, Lambda, Fargate
│
└─ COMMITTED USE DISCOUNTS (GCP)
    ├─ Resource-based: vCPU/memory commitments
    └─ Spend-based: Dollar amount commitments

STEP 3: Determine Commitment Period
├─ 1-year commitment
│   ├─ Lower discount (40-50%)
│   └─ Less risk if architecture changes
│
└─ 3-year commitment
    ├─ Higher discount (60-72%)
    └─ Only for mature, stable workloads

STEP 4: Monitor and Optimize
├─ Target >95% RI/Savings Plan utilization
├─ Sell unused RIs on AWS Reserved Instance Marketplace
└─ Adjust commitments quarterly based on usage trends

Framework 2: Right-Sizing Priority Matrix

Cost Impact vs. Effort:

High Impact, Low Effort (DO FIRST):

•Idle resources (100% waste): Stopped instances, unattached volumes, old snapshots
•Unused NAT Gateways ($32/month each)
•Over-provisioned databases (<20% CPU for 30 days)
•Kubernetes pods with no resource requests set

High Impact, Medium Effort (DO SECOND):

•Over-provisioned compute (<40% CPU/memory for 30 days)
•Lambda functions with max memory >2x used memory
•Storage optimization (S3 Intelligent-Tiering, gp3 vs. gp2)

Low Impact, High Effort (DO LAST):

•Application code optimization (requires profiling, refactoring)
•Architecture redesign (serverless migration, multi-region optimization)

Weekly Optimization Routine:

•Delete idle resources (automated script)
•Review top 10 cost drivers (manual analysis)
•Right-size 3-5 instances/week (incremental approach)
•Monitor impact (cost trend over 4 weeks)

Framework 3: Spot vs. On-Demand Decision

code

Should this workload use Spot/Preemptible instances?

├─ Is the workload fault-tolerant?
│   ├─ NO → Use On-Demand
│   └─ YES → Continue
│
├─ Is the workload stateless (or has checkpointing)?
│   ├─ NO → Use On-Demand (data loss risk)
│   └─ YES → Continue
│
├─ Can the workload handle interruptions gracefully?
│   ├─ NO → Use On-Demand
│   └─ YES → Continue
│
└─ Workload Type Assessment:
    ├─ Batch Jobs / CI/CD → ✅ Use Spot (70-90% savings)
    ├─ ML Training → ✅ Use Spot (with checkpointing)
    ├─ Kubernetes Workers → ✅ Use Spot (mixed with on-demand)
    ├─ Production API Servers → ⚠️ Mixed fleet (70% spot, 30% on-demand)
    ├─ Databases → ❌ Use On-Demand (or Reserved)
    └─ Real-time Services → ❌ Use On-Demand (or Reserved)

Tool Selection Guide

By Platform

Platform	Cost Visibility	Right-Sizing	Automation
AWS	Cost Explorer, CUR	Compute Optimizer	AWS Budgets, Lambda cleanup
Azure	Cost Management	Azure Advisor	Azure Policy, Automation
GCP	Cloud Billing	Recommender	Budget Alerts, Cloud Functions
Kubernetes	Kubecost, OpenCost	VPA	Cluster Autoscaler
Multi-Cloud	CloudZero, CloudHealth	Densify	ParkMyCloud

By Use Case

Use Case	Recommended Tool	Key Feature
K8s cost visibility	Kubecost	Real-time namespace cost allocation
Terraform cost estimation	Infracost	PR comments with cost diffs
Multi-cloud aggregation	CloudHealth	Unified cost view across AWS/Azure/GCP
Automated optimization	nOps (AWS), CAST AI (K8s)	ML-based automation
Unit cost economics	CloudZero	Cost per customer/transaction tracking
Spot instance management	Spot.io	Automated spot orchestration

For detailed tool comparisons and selection criteria, see references/tools-comparison.md.

Cloud-Specific Tactics

AWS Optimization Tactics

•Enable Cost & Usage Reports (CUR): Export detailed billing to S3
•Use AWS Compute Optimizer: ML-based EC2 rightsizing recommendations
•Implement Savings Plans: More flexible than Reserved Instances
•S3 Intelligent-Tiering: Automatic storage class optimization
•Lambda Right-Sizing: Adjust memory allocation (CPU scales proportionally)
•EBS gp3 Migration: 20% cheaper than gp2 with same performance

Azure Optimization Tactics

•Enable Azure Advisor: VM rightsizing and reserved instance recommendations
•Azure Hybrid Benefit: Bring Windows Server licenses for discounts
•Dev/Test Pricing: Reduced rates for non-production workloads
•Azure Spot VMs: Up to 90% discount for interruptible workloads
•Storage Lifecycle Management: Auto-tier blobs to cool/archive tiers

GCP Optimization Tactics

•Export Billing to BigQuery: Custom cost analysis with SQL
•Sustained Use Discounts: Automatic 20-30% discount (no commitment)
•Committed Use Discounts: 52-70% savings for 3-year commitments
•Preemptible VMs: Up to 91% discount for batch workloads
•GCP Recommender: Idle VM detection and rightsizing advice

For cloud-specific deep dives, see references/cloud-specific-tactics.md.

Implementation Checklist

Phase 1: Establish Visibility (Week 1-2)

• Enable cost allocation tags (Owner, Project, Environment)
• Activate cost allocation tags in cloud billing console
• Deploy Kubecost for Kubernetes cost visibility (if using K8s)
• Create cost dashboards (Grafana, CloudWatch, Azure Monitor, GCP)
• Set up weekly cost reports (emailed to team leads)

Phase 2: Set Up Governance (Week 2-3)

• Create budget alerts (50%, 75%, 90%, 100% thresholds)
• Enable anomaly detection (>20% WoW increase)
• Implement tagging policy enforcement (Azure Policy, AWS Config, GCP Org Policy)
• Establish showback reports (cost by team/project)
• Document cost ownership (who owns which services)

Phase 3: Quick Wins (Week 3-4)

• Delete idle resources (unattached volumes, old snapshots)
• Stop/terminate unused development instances
• Right-size top 10 over-provisioned instances (<40% utilization)
• Implement S3 Intelligent-Tiering or lifecycle policies
• Evaluate Reserved Instance/Savings Plan coverage

Phase 4: Commitment Discounts (Month 2)

• Analyze 6-12 months usage history
• Calculate baseline usage for commitment sizing
• Purchase Reserved Instances for databases
• Purchase Savings Plans for compute workloads
• Monitor RI/SP utilization (target >95%)

Phase 5: Automation (Month 2-3)

• Deploy automated cleanup scripts (weekly schedule)
• Integrate Infracost into CI/CD pipelines
• Implement auto-shutdown for dev/test environments (off-hours)
• Enable Vertical Pod Autoscaler (VPA) for K8s rightsizing
• Set up Spot instance automation (Spot.io, CAST AI, or native)

Phase 6: Continuous Optimization (Ongoing)

• Weekly cost reviews with engineering teams
• Monthly optimization sprints (top cost drivers)
• Quarterly commitment adjustments (RI/SP coverage)
• Annual FinOps maturity assessment

Common Pitfalls

Pitfall 1: No Cost Visibility

❌ Problem: Finance team sees cloud bill at end of month, surprises everywhere ✅ Solution: Deploy real-time cost dashboards, daily Slack reports to engineering teams

Pitfall 2: Reserved Instance Underutilization

❌ Problem: Purchased 100 RIs, only using 60 (40% wasted commitment) ✅ Solution: Monitor RI utilization weekly (target >95%), sell unused RIs on marketplace

Pitfall 3: Missing Kubernetes Resource Requests

❌ Problem: Pods with no requests set → inefficient bin-packing → wasted nodes ✅ Solution: Use VPA to auto-generate recommendations, enforce via admission control

Pitfall 4: Idle Resources Not Cleaned Up

❌ Problem: 50 stopped EC2 instances (still paying for EBS), 200 unattached volumes ✅ Solution: Weekly automated cleanup of idle resources >7 days old

Pitfall 5: No Budget Alerts

❌ Problem: Accidentally left test cluster running, $10K bill surprise ✅ Solution: Budget alerts at 50%, 75%, 90%, 100% with Slack/PagerDuty notifications

Related Skills

•resource-tagging: Cost allocation tags enable showback/chargeback models
•kubernetes-operations: K8s rightsizing, VPA, cluster autoscaling for cost optimization
•infrastructure-as-code: Infracost for Terraform cost estimation and policy-as-code
•aws-patterns: AWS-specific cost optimization tactics (EC2, RDS, S3, Lambda)
•gcp-patterns: GCP-specific optimizations (Compute Engine, BigQuery, Cloud Storage)
•azure-patterns: Azure-specific optimizations (VMs, Storage, App Service, Functions)
•platform-engineering: Internal FinOps platforms and self-service cost dashboards
•disaster-recovery: Balance cost vs. RTO/RPO (warm standby vs. cold standby)

Examples

See examples/ directory for:

•terraform/: AWS, Azure, GCP cost optimization infrastructure (budgets, alerts)
•kubernetes/: Kubecost deployment, resource quotas, VPA configurations
•ci-cd/: Infracost GitHub Actions, cost approval workflows
•dashboards/: Grafana cost dashboards, CloudWatch alarms

Scripts

See scripts/ directory for:

•cleanup_idle_resources.py: Automated AWS/Azure/GCP idle resource cleanup
•ri_coverage_report.py: Reserved Instance coverage analysis
•cost_allocation_report.py: Generate showback/chargeback reports
•spot_savings_calculator.py: Estimate savings from spot instances
•k8s_rightsizing_audit.py: Find K8s pods with missing resource requests

Key Takeaways

•FinOps is a Culture: Collaboration between finance, engineering, and operations
•Visibility First: Can't optimize what can't measure (tags + dashboards mandatory)
•Commitment = Savings: Reserved Instances/Savings Plans provide 40-72% discounts
•Right-Size Continuously: Target 60-80% utilization (leave headroom for spikes)
•Automate Cleanup: Idle resources are 100% waste (weekly automated deletion)
•Kubernetes Costs Hidden: Use Kubecost/OpenCost for namespace-level visibility
•Shift-Left Cost Awareness: Infracost in CI/CD prevents surprise cost increases
•Budget Alerts Prevent Overspend: Cascading notifications at 50%, 75%, 90%, 100%
•Spot for Fault-Tolerant Workloads: 70-90% discount (CI/CD, batch jobs, ML training)
•Unit Cost Metrics Drive Value: Track cost per customer, cost per transaction