Cost Anomaly Detection

Name: cost-anomaly-detection
Rating: 62
Author: Cloudzero

Purpose

This skill proactively identifies unusual cost patterns, unexpected spikes, irregular spending behaviors, and anomalies that may indicate problems, inefficiencies, or opportunities for optimization.

When to Use

•"Are there any cost anomalies?"
•"Check for unusual spending"
•"Find cost issues"
•"What looks wrong with my costs?"
•"Detect abnormal costs"
•Proactive cost monitoring
•Weekly/monthly cost reviews
•Security incident detection
•Waste identification
•Before presenting cost reports
•Keywords: anomaly, unusual, abnormal, irregular, unexpected, odd, suspicious, detect issues

Prerequisites

This skill builds on the understand-cloudzero-organization skill.

Before applying this procedure:

•If you haven't already in this session, load the understand-cloudzero-organization skill and follow its instructions
•Reference the cached organization context (don't reload unnecessarily)
•Organization context is critical for distinguishing legitimate changes from true anomalies

How This Skill Works

Step 1: Establish Baseline

Query historical data to establish normal patterns:

code

# Recent period
get_cost_data(
    granularity="daily",
    date_range="last 30 days",
    cost_type="real_cost"
)

# Compare to baseline period
get_cost_data(
    granularity="daily",
    date_range="30 to 60 days ago",
    cost_type="real_cost"
)

Calculate baseline statistics:

•Mean daily cost
•Standard deviation
•Normal range (e.g., mean ± 2 standard deviations)
•Typical day-of-week patterns
•Expected growth rate
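The first four statistics above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: it assumes the baseline query result has already been reduced to a plain {date: cost} mapping (the real response shape of get_cost_data is not shown here), and the function name baseline_stats and the sample figures are hypothetical. Expected growth rate is left out for brevity.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def baseline_stats(daily_costs, sigmas=2.0):
    """Summarize a baseline period given a {date: cost} mapping (assumed shape)."""
    values = list(daily_costs.values())
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation; needs at least 2 points
    by_weekday = {}
    for day, cost in daily_costs.items():
        by_weekday.setdefault(day.weekday(), []).append(cost)
    return {
        "mean": avg,
        "stddev": sd,
        "normal_range": (avg - sigmas * sd, avg + sigmas * sd),
        # Keys follow date.weekday(): 0 = Monday ... 6 = Sunday
        "weekday_mean": {wd: mean(c) for wd, c in sorted(by_weekday.items())},
    }


# Hypothetical data: 14 days, $100 on weekdays and $60 on weekends.
start = date(2024, 1, 1)  # a Monday
costs = {}
for offset in range(14):
    day = start + timedelta(days=offset)
    costs[day] = 60.0 if day.weekday() >= 5 else 100.0

stats = baseline_stats(costs)
print(stats["normal_range"], stats["weekday_mean"])
```

Keeping the per-weekday means next to the overall mean makes it possible to compare a Saturday against other Saturdays rather than against the whole-week average.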

Step 2: Total Cost Anomaly Detection

Identify days with unusual total spending:

Detect Outliers:

code

For each day in recent period:
  If cost > (baseline_mean + 2 × baseline_stddev):
    Flag as high anomaly
  If cost < (baseline_mean - 2 × baseline_stddev):
    Flag as low anomaly (potential data issue or optimization)

Look for:

•Single-day spikes (unusual one-time events)
•Sustained increases (new baseline)
•Gradual drift away from normal
•Weekend vs. weekday anomalies
•Unexpected patterns
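As a rough sketch of the outlier rule above, the following Python flags recent days outside the baseline band and measures how long a high run lasts, which helps separate a single-day spike from a sustained increase. It assumes the costs are already available as plain lists; flag_outliers, longest_high_run, and the sample numbers are hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(recent, baseline, sigmas=2.0):
    """Return (day, cost, label) for recent days outside the baseline band."""
    avg = mean(baseline)
    sd = stdev(baseline)
    high, low = avg + sigmas * sd, avg - sigmas * sd
    flags = []
    for day, cost in recent:
        if cost > high:
            flags.append((day, cost, "high"))
        elif cost < low:
            flags.append((day, cost, "low"))
    return flags


def longest_high_run(recent, flags):
    """Length of the longest run of consecutive high-flagged days."""
    high_days = {day for day, _, label in flags if label == "high"}
    best = current = 0
    for day, _ in recent:
        current = current + 1 if day in high_days else 0
        best = max(best, current)
    return best


# Hypothetical daily costs for illustration.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]
recent = [("d1", 101), ("d2", 150), ("d3", 99), ("d4", 80)]

flags = flag_outliers(recent, baseline)
run = longest_high_run(recent, flags)
print(flags, run)
```

A run length of 1 suggests a one-time event; a run that extends to the end of the window is more likely a new baseline.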

Step 3: Service-Level Anomaly Detection

Check each service for unusual behavior:

code

# Get services with daily breakdown
get_cost_data(
    group_by=["CZ:Service"],
    granularity="daily",
    limit=20
)

# Compare recent pattern to baseline for each service

For each major service:

•Calculate its typical daily cost
•Identify days with unusual spending
•Detect new services that appeared
•Detect services that disappeared
•Calculate variance from expected

Anomaly Types:

•Spike: Sudden increase then return to normal
•Step Change: Sudden increase that persists
•Gradual Drift: Slow increase over time
•Drop: Unexpected decrease
•New Appearance: Service that didn't exist before
•Disappearance: Service that stopped
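One possible way to turn these anomaly types into labels is shown below. The rules are illustrative assumptions rather than definitions from this skill: the 50% tolerance, the "largest single-day jump covers at least half the rise" test for a step change, and the function name classify_pattern are all hypothetical, and a disappearance is only reported when the whole recent window is zero.

```python
def classify_pattern(baseline, recent, tolerance=0.5):
    """Label a recent daily cost series relative to its baseline mean."""
    base = sum(baseline) / len(baseline)
    if base == 0:
        return "new appearance" if any(c > 0 for c in recent) else "normal"
    if all(c == 0 for c in recent):
        return "disappearance"
    high = [c > base * (1 + tolerance) for c in recent]
    low = [c < base * (1 - tolerance) for c in recent]
    if any(low) and not any(high):
        return "drop"
    if not any(high):
        return "normal"
    if not high[-1]:
        return "spike"  # cost went up and came back down
    # Still elevated at the end of the window: step change or drift.
    total_rise = recent[-1] - base
    biggest_jump = max(b - a for a, b in zip([base] + list(recent), recent))
    return "step change" if biggest_jump >= 0.5 * total_rise else "gradual drift"


week = [100] * 7  # hypothetical baseline
print(classify_pattern(week, [100, 100, 300, 100, 100]))
print(classify_pattern(week, [100, 100, 200, 200, 200]))
print(classify_pattern(week, [110, 120, 130, 140, 150, 160]))
```

The labels are only a starting point; organization context should decide whether a step change is a planned launch or a problem.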

Step 4: Account-Level Anomaly Detection

Identify accounts with unusual spending:

code

get_cost_data(
    group_by=["CZ:Account"],
    granularity="daily",
    limit=20
)

For each account:

•Compare to its historical pattern
•Flag accounts with >50% increase from baseline
•Identify new accounts with unexpectedly high costs
•Detect accounts with no activity (potential issue)
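The account checks above might look like this once per-account totals for two equal-length periods are in hand. This is a sketch under assumptions: the {account: total_cost} input shape, the review_accounts name, and the $100 floor for "high cost" in a new account are hypothetical.

```python
def review_accounts(baseline, recent, increase_threshold=0.5, new_account_min=100.0):
    """Compare per-account totals for two equal-length periods."""
    findings = {}
    for account in sorted(set(baseline) | set(recent)):
        before = baseline.get(account, 0.0)
        after = recent.get(account, 0.0)
        if before == 0 and after >= new_account_min:
            findings[account] = "new account with high cost"
        elif before > 0 and after == 0:
            findings[account] = "no activity"
        elif before > 0 and (after - before) / before > increase_threshold:
            findings[account] = "increase over threshold"
    return findings


# Hypothetical totals for illustration.
baseline = {"a": 1000.0, "b": 500.0, "c": 200.0}
recent = {"a": 1100.0, "b": 900.0, "c": 0.0, "d": 400.0}
findings = review_accounts(baseline, recent)
print(findings)
```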

Step 5: Resource-Level Anomaly Detection

Identify specific resources with unusual costs:

code

# Get top resources
get_cost_data(
    group_by=["CZ:Resource"],
    limit=50
)

# Compare to previous period
get_cost_data(
    group_by=["CZ:Resource"],
    date_range="previous period",
    limit=50
)

Look for:

•New high-cost resources
•Resources with sudden cost increases
•Resources that appeared recently
•Expensive resources without proper tags
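A minimal sketch of the period comparison follows. It assumes the two query results have been reshaped into dictionaries keyed by resource ID; the review_resources name, the cost floor, the 100% jump threshold, and the required "owner" tag are hypothetical. Because the queries above cap results at limit=50, a resource missing from the previous result may simply have ranked below the cutoff, so "new" here means "not in the previous top results".

```python
def review_resources(previous, current, min_cost=500.0, jump=1.0, required_tags=("owner",)):
    """previous: {resource_id: cost}; current: {resource_id: {"cost": float, "tags": dict}}."""
    findings = []
    for resource_id, info in current.items():
        cost = info["cost"]
        old_cost = previous.get(resource_id)
        if old_cost is None:
            if cost >= min_cost:
                findings.append((resource_id, "new high-cost resource"))
        elif old_cost > 0 and (cost - old_cost) / old_cost >= jump:
            findings.append((resource_id, "sudden cost increase"))
        missing = [tag for tag in required_tags if tag not in info.get("tags", {})]
        if cost >= min_cost and missing:
            findings.append((resource_id, "expensive without required tags"))
    return findings


# Hypothetical resources for illustration.
previous = {"i-1": 600.0, "vol-1": 100.0}
current = {
    "i-1": {"cost": 1500.0, "tags": {"owner": "web"}},
    "i-2": {"cost": 900.0, "tags": {}},
    "vol-1": {"cost": 110.0, "tags": {}},
}
findings = review_resources(previous, current)
print(findings)
```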

Step 6: Regional Anomaly Detection

Check for unusual regional spending patterns:

code

get_cost_data(
    group_by=["CZ:Region"],
    granularity="daily",
    limit=20
)

Anomalies might indicate:

•Unauthorized resource creation in unexpected regions
•Data transfer anomalies
•Failover events
•Misconfigured deployments

Step 7: Usage Pattern Anomalies

Detect unusual usage patterns:

Hourly Pattern Analysis (if examining recent days):

code

get_cost_data(
    granularity="hourly",
    date_range="last 7 days"
)

Look for:

•24/7 costs where only business-hours usage is expected
•Weekend activity where none is expected
•Off-hours spikes (potential security issue)
•Missing expected peaks (potential outage)

Day-of-Week Patterns:

•Calculate average cost per day of week
•Compare recent weeks to baseline weeks
•Flag unusual weekday/weekend ratios
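The weekday/weekend comparison could be sketched as below, assuming daily costs are available as a {date: cost} mapping. The helper names, the 0.25 tolerance on the ratio, and the sample data are hypothetical, and the ratio is only meaningful when weekday spend is non-zero.

```python
from datetime import date, timedelta
from statistics import mean


def weekend_to_weekday_ratio(daily_costs):
    """Average weekend cost divided by average weekday cost."""
    weekday = [c for d, c in daily_costs.items() if d.weekday() < 5]
    weekend = [c for d, c in daily_costs.items() if d.weekday() >= 5]
    return mean(weekend) / mean(weekday)


def ratio_shift(baseline, recent, tolerance=0.25):
    """Compare the recent ratio to the baseline ratio and flag large shifts."""
    base_ratio = weekend_to_weekday_ratio(baseline)
    recent_ratio = weekend_to_weekday_ratio(recent)
    return {
        "baseline_ratio": base_ratio,
        "recent_ratio": recent_ratio,
        "flagged": abs(recent_ratio - base_ratio) > tolerance,
    }


def make_period(start, days, weekday_cost, weekend_cost):
    """Build hypothetical sample data for the example."""
    period = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        period[day] = weekend_cost if day.weekday() >= 5 else weekday_cost
    return period


baseline = make_period(date(2024, 1, 1), 14, 100.0, 40.0)
recent = make_period(date(2024, 1, 15), 14, 100.0, 95.0)
result = ratio_shift(baseline, recent)
print(result)
```

In this sample the weekend share rises from 0.4 to 0.95 of weekday spend, which is the kind of shift that suggests resources left running over the weekend.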

Step 8: Multi-Dimensional Anomaly Detection

Cross-reference anomalies across dimensions:

code

get_cost_data(
    group_by=["CZ:Account", "CZ:Service", "CZ:Region"],
    limit=100
)

Find:

•Specific service in specific account with anomaly
•Regional anomalies for specific services
•Account+Service combinations that are unusual
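One simple way to cross-reference dimensions is to rank each (account, service, region) combination by how much its cost moved between two periods. The sketch below assumes both periods have been reduced to {(account, service, region): cost} mappings; top_movers and the sample values are hypothetical.

```python
def top_movers(baseline, recent, top_n=3):
    """Rank (account, service, region) combinations by absolute cost increase."""
    keys = set(baseline) | set(recent)
    deltas = {key: recent.get(key, 0.0) - baseline.get(key, 0.0) for key in keys}
    # Largest increase first; ties broken by key for a stable order.
    ranked = sorted(deltas.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical costs for illustration.
baseline = {
    ("prod", "EC2", "us-east-1"): 1000.0,
    ("prod", "S3", "us-east-1"): 200.0,
    ("dev", "EC2", "us-west-2"): 100.0,
}
recent = {
    ("prod", "EC2", "us-east-1"): 1050.0,
    ("prod", "S3", "us-east-1"): 210.0,
    ("dev", "EC2", "us-west-2"): 100.0,
    ("dev", "EC2", "ap-south-1"): 700.0,
}
movers = top_movers(baseline, recent, top_n=2)
print(movers)
```

Here the account-level and service-level views would each show only a moderate change, while the combined view isolates EC2 in a region the dev account had not used before.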

Step 9: Rate-of-Change Anomalies

Detect unusual growth rates:

code

Calculate for each dimension value:
  recent_rate = (cost_this_week - cost_last_week) / cost_last_week
  typical_rate = historical average growth rate

  If recent_rate > (typical_rate + threshold):
    Flag as accelerating growth anomaly
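The rule above can be sketched in Python as follows, assuming an oldest-first list of weekly totals for one dimension value. The function names and the 10-percentage-point threshold are hypothetical. A zero prior week makes the rate undefined, so the sketch raises an error rather than guessing; such cases are better handled as new appearances.

```python
def weekly_growth_rates(weekly_costs):
    """Week-over-week growth rates for an oldest-first list of weekly totals."""
    rates = []
    for previous, current in zip(weekly_costs, weekly_costs[1:]):
        if previous == 0:
            raise ValueError("growth rate is undefined when the prior week cost is zero")
        rates.append((current - previous) / previous)
    return rates


def is_accelerating(weekly_costs, threshold=0.10):
    """True when the latest rate exceeds the historical average rate plus threshold."""
    *history, recent_rate = weekly_growth_rates(weekly_costs)
    if not history:
        raise ValueError("need at least three weeks of data")
    typical_rate = sum(history) / len(history)
    return recent_rate > typical_rate + threshold


# Hypothetical weekly totals: steady ~2% growth, then a 25% jump.
print(is_accelerating([1000, 1020, 1040, 1300]))
print(is_accelerating([1000, 1020, 1040, 1060]))
```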

Step 10: Security and Waste Indicators

Look for specific patterns indicating issues:

Potential Security Issues:

•New EC2 instances in unusual regions
•Sudden spike in compute or network costs
•Resources created in accounts with no recent activity
•Large data transfer spikes
•Cryptocurrency mining patterns (sustained high compute)

Potential Waste:

•EBS volumes without attached instances
•Old snapshots accumulating
•Unused Reserved Instances
•Idle RDS databases (consistent low cost)
•Over-provisioned resources

Potential Misconfigurations:

•Public S3 buckets with high request costs
•NAT Gateway traffic spikes
•Logging to expensive destinations
•Unoptimized data transfer routes

Step 11: Tag-Based Anomaly Detection

Check for anomalies in tagged resources:

code

get_cost_data(
    group_by=["CZ:Tag:Environment", "CZ:Service"],
    granularity="daily",
    limit=50
)

Anomalies might be:

•Non-prod environments at prod scale
•Test environments with sustained high costs
•Development resources left running 24/7

Output Format

Provide a comprehensive anomaly report:

1. Executive Summary

•Anomaly Count: X anomalies detected
•Severity: [High: X, Medium: Y, Low: Z]
•Potential Cost Impact: $X,XXX/month if unaddressed
•Most Critical: [Brief description of #1 issue]
•Action Required: [Yes/No and urgency]

2. Anomaly Severity Classification

HIGH SEVERITY (Immediate Action Required):

•[Anomaly description]
  •Detected: [Date/time]
  •Impact: $X,XXX
  •Potential cause: [Analysis]
  •Recommended action: [Specific steps]

MEDIUM SEVERITY (Review Within 24-48 Hours):

•[Anomaly description]
  •[Details]

LOW SEVERITY (Monitor or Investigate When Convenient):

•[Anomaly description]
  •[Details]

3. Detailed Anomaly Analysis

For each significant anomaly:

Anomaly #1: [Descriptive Title]

Type: [Spike / Step Change / Drift / New Resource / etc.]
Severity: [High / Medium / Low]
Detected: [Date/time first observed]
Impact: $X,XXX (XX% above normal)

Details:

•What: [Specific description of the anomaly]
•Where: [Account / Service / Region / Resource]
•When: [Time period]
•Baseline: Normal cost is $X, observed cost is $Y
•Deviation: XX% above/below normal (Z standard deviations)

Pattern Analysis:

•First observed: [Date]
•Duration: [Ongoing / X days]
•Trend: [Growing / Stable / Declining]
•Time pattern: [Constant / Hourly / Daily pattern]

Potential Causes:

•[Most likely cause with reasoning]
•[Alternative explanation]
•[Other possibilities]

Related Anomalies:

•[Other anomalies that might be connected]

Recommendations:

•Immediate: [Action to take now]
•Investigation: [What to check]
•Remediation: [How to fix]
•Prevention: [How to avoid future occurrences]

Estimated Impact If Not Addressed:

•Daily: $XXX
•Monthly: $X,XXX
•Annual: $XX,XXX

4. Anomaly Dashboard

Cost Anomalies by Category:

Category	Count	Total Impact	Avg Impact
Compute Spikes	X	$X,XXX	$XXX
Storage Growth	X	$X,XXX	$XXX
Data Transfer	X	$X,XXX	$XXX
New Resources	X	$X,XXX	$XXX
Security Concerns	X	$X,XXX	$XXX
Waste/Idle	X	$X,XXX	$XXX

Anomalies by Dimension:

Dimension	Anomaly Count	Most Affected Value	Impact
Service	X	[Service name]	$X,XXX
Account	X	[Account ID]	$X,XXX
Region	X	[Region]	$X,XXX

5. Time-Series Anomaly Visualization

Cost Over Time with Anomalies Highlighted:

code

[Describe the pattern, indicating where anomalies occurred]

Days with anomalies:
- [Date]: $X,XXX (XX% above baseline) - [Service/Account]
- [Date]: $X,XXX (XX% above baseline) - [Service/Account]
- [Date]: $X,XXX (XX% above baseline) - [Service/Account]

Baseline range: $X,XXX - $X,XXX
Normal mean: $X,XXX
Current level: $X,XXX (within/outside normal range)

6. New or Changed Resources

New High-Cost Resources Detected:

Resource	Service	Account	First Seen	Current Cost	Status
[Resource ID]	EC2	[Account]	[Date]	$X,XXX/mo	⚠️ Review
[Resource ID]	RDS	[Account]	[Date]	$X,XXX/mo	⚠️ Review

Recently Changed Resources:

Resource	Service	Change Type	Date	Impact
[Resource ID]	EC2	Size increase	[Date]	+$XXX/mo
[Resource ID]	RDS	Multi-AZ enabled	[Date]	+$XXX/mo

7. Security and Compliance Concerns

Potential Security Issues:

•[Issue description]
  •Indicators: [What suggests this is a security issue]
  •Affected resources: [Details]
  •Recommended action: [Contact security team, isolate resource, etc.]

Potential Compliance Issues:

•[Issue description]
  •Compliance requirement: [Which policy/standard]
  •Violation: [What's non-compliant]
  •Remediation: [Steps to fix]

8. Waste and Optimization Opportunities

Identified Waste:

•[Type of waste] - $X,XXX/month
  •Description: [Details]
  •How to fix: [Steps]
  •Savings potential: $X,XXX/month

Optimization Opportunities:

•[Opportunity] - Potential savings: $X,XXX/month
  •Current state: [Details]
  •Recommended change: [Action]
  •Implementation effort: [Low/Medium/High]

9. Baseline Comparison

Current vs. Baseline:

Metric	Baseline	Current	Variance	Status
Daily Cost	$X,XXX	$X,XXX	+XX%	⚠️
Weekday Avg	$X,XXX	$X,XXX	+XX%	⚠️
Weekend Avg	$X,XXX	$X,XXX	+XX%	✅
Top Service	$X,XXX	$X,XXX	+XX%	⚠️
Top Account	$X,XXX	$X,XXX	+XX%	⚠️

Statistical Analysis:

•Mean: $X,XXX (baseline: $X,XXX)
•Std Dev: $XXX (baseline: $XXX)
•Current cost is X standard deviations from baseline
•Coefficient of variation: XX% (baseline: XX%)

10. Prioritized Action Plan

Immediate Actions (Within 24 Hours):

•[Action] - Prevents $X,XXX/month
  •Severity: High
  •Effort: Low
  •Owner: [Suggested owner]
•[Action] - Prevents $X,XXX/month
  •[Details]

Short-Term Actions (This Week):

•[Action] - Potential savings $X,XXX/month
  •[Details]

Monitoring and Prevention:

•Set up alerts for [specific anomaly type]
•Review [dimension] daily for next week
•Investigate [specific pattern] further
•Implement [preventive measure]

11. False Positive Assessment

Likely Legitimate (Not True Anomalies):

•[Item]
  •Reason: [Why this is expected based on org context]
  •Recommendation: Update baseline expectations

Requires Validation:

•[Item]
  •Could be legitimate or anomalous
  •Recommendation: Verify with [team/person]

Skill-Specific Best Practices

•Establish proper baselines - Need sufficient historical data
•Use statistical methods - Not just absolute thresholds
•Consider day-of-week patterns - Compare apples to apples
•Cross-reference dimensions - Anomalies often span multiple dimensions
•Prioritize by impact - Focus on highest-cost anomalies first
•Check for false positives - Validate against known changes
•Provide context - Explain why something is anomalous

For general cost analysis best practices, see ${CLAUDE_PLUGIN_ROOT}/references/best-practices.md

Anomaly Detection Techniques

Statistical Anomaly Detection

code

For each data point:
  z_score = (value - mean) / stddev
  If abs(z_score) > 2:
    Flag as anomaly

Percentage-Based Detection

code

If (current - baseline) / baseline > 0.5:
  Flag as 50%+ increase anomaly

Rate-of-Change Detection

code

day_over_day_change = (today - yesterday) / yesterday
If day_over_day_change > threshold:
  Flag as rapid change anomaly
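The three rules above (z-score, percentage, and day-over-day change) can be run side by side on the same series to see where they agree. The sketch below is an illustration only: the detect function, the default limits, and the sample data are hypothetical, and divisions are skipped when the denominator is zero.

```python
from statistics import mean, stdev


def detect(baseline, series, z_limit=2.0, pct_limit=0.5, dod_limit=0.3):
    """Return (index, value, reasons) for each point that trips at least one rule."""
    avg, sd = mean(baseline), stdev(baseline)
    results = []
    for i, value in enumerate(series):
        reasons = []
        # Statistical rule: distance from the baseline mean in standard deviations.
        if sd > 0 and abs((value - avg) / sd) > z_limit:
            reasons.append("z_score")
        # Percentage rule: relative increase over the baseline mean.
        if avg > 0 and (value - avg) / avg > pct_limit:
            reasons.append("percentage")
        # Rate-of-change rule: relative increase over the previous day.
        if i > 0 and series[i - 1] > 0 and (value - series[i - 1]) / series[i - 1] > dod_limit:
            reasons.append("rate_of_change")
        if reasons:
            results.append((i, value, reasons))
    return results


# Hypothetical daily costs for illustration.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]
series = [101, 110, 160, 158]
results = detect(baseline, series)
for row in results:
    print(row)
```

With a very stable baseline the z-score rule fires on small absolute changes (the 110 here), while the percentage and rate-of-change rules only fire on the larger jump. That difference is one reason to tune thresholds per organization.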

Pattern Matching

•Compare recent pattern to historical patterns
•Detect when current pattern doesn't match any known pattern
•Use day-of-week, time-of-day templates

Clustering

•Group similar cost patterns
•Identify outliers that don't fit any cluster
•Flag new clusters that emerge

Common Anomaly Types

Type 1: Compute Spikes

Indicators:

•Sudden EC2/Lambda/ECS cost increase
•Unusual instance types or sizes
•New regions with compute resources

Causes:

•Auto-scaling event
•New deployment
•Performance testing
•Crypto mining (security issue)

Type 2: Storage Growth

Indicators:

•Gradual or sudden storage cost increase
•S3 bucket growth
•EBS volume increases

Causes:

•Data accumulation (expected or unexpected)
•Backup retention issues
•Log accumulation
•Snapshot proliferation

Type 3: Data Transfer Spikes

Indicators:

•Network/data transfer cost spike
•Cross-region transfer increase
•Internet egress increase

Causes:

•Architecture change
•Data migration
•Security incident (data exfiltration)
•Misconfigured application

Type 4: New Resource Creation

Indicators:

•Resources that didn't exist in baseline
•Costs in new accounts or regions
•New service usage

Causes:

•New project launch (legitimate)
•Developer experimentation
•Unauthorized resource creation
•Security breach

Type 5: Idle or Waste Resources

Indicators:

•Resources with consistent low but non-zero cost
•Detached volumes
•Unused Reserved Instances

Causes:

•Forgotten test resources
•Improper cleanup after projects
•Manual provisioning without automation

Advanced Techniques

Machine Learning Anomaly Detection

If sufficient data:

•Build time-series models (ARIMA, Prophet)
•Predict expected costs
•Flag actual costs that deviate from prediction

Seasonal Adjustment

Account for known seasonal patterns:

•End-of-quarter increased activity
•Holiday seasons
•Business cycle patterns

Multi-Variate Analysis

Look for combinations of factors:

•High cost + new resource + unusual region = high priority
•Low cost + expected service + known account = low priority
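A simple way to combine factors is to count how many risk signals an anomaly carries. The signal names, the priority function, and the cut-offs (three or more signals is high, none is low) are assumptions chosen to match the two examples above, not a scoring scheme defined by this skill.

```python
RISK_SIGNALS = (
    "high_cost",
    "new_resource",
    "unusual_region",
    "unexpected_service",
    "unknown_account",
)


def priority(signals):
    """Map a dict of boolean risk signals to a priority label."""
    count = sum(1 for name in RISK_SIGNALS if signals.get(name, False))
    if count >= 3:
        return "high"
    if count == 0:
        return "low"
    return "medium"


print(priority({"high_cost": True, "new_resource": True, "unusual_region": True}))
print(priority({"high_cost": False, "unexpected_service": False, "unknown_account": False}))
print(priority({"high_cost": True}))
```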

Anomaly Correlation

Find related anomalies:

•EC2 spike + data transfer spike might be same event
•Multiple services in same account might share root cause
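Related anomalies can be surfaced by grouping findings that share an account and a date. This sketch assumes each anomaly is a dict with "account", "date", and "service" keys; the correlate function and the sample records are hypothetical.

```python
from collections import defaultdict


def correlate(anomalies):
    """Group anomalies by (account, date); keep groups with more than one service."""
    groups = defaultdict(list)
    for anomaly in anomalies:
        groups[(anomaly["account"], anomaly["date"])].append(anomaly["service"])
    return {key: sorted(services) for key, services in groups.items() if len(services) > 1}


# Hypothetical anomaly records for illustration.
anomalies = [
    {"account": "prod", "date": "2024-03-04", "service": "EC2"},
    {"account": "prod", "date": "2024-03-04", "service": "DataTransfer"},
    {"account": "dev", "date": "2024-03-05", "service": "S3"},
]
related = correlate(anomalies)
print(related)
```

Grouped anomalies are candidates for a shared root cause and can be reported together rather than as separate findings.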

Tips for Effective Anomaly Detection

•Run regularly - Daily or weekly, not just when problems are noticed
•Know your baselines - Understand normal patterns first
•Tune thresholds - Adjust based on the organization's tolerance
•Follow up - Track which anomalies were real issues vs. false positives
•Automate - Set up alerts for high-severity anomalies
•Document patterns - Build knowledge base of anomaly types
•Close the loop - Report back on resolution to improve detection
•Balance sensitivity - Too sensitive = alert fatigue, too loose = missed issues
