Cost Spike Investigation

Name: cost-spike-investigation
Rating: 92
Author: Cloudzero

Purpose

This skill helps identify and explain sudden increases in cloud costs by comparing recent spending patterns to historical baselines and pinpointing the specific resources, services, or dimensions responsible for the spike.

When to Use

•User reports unexpected cost increases
•Monthly bills show significant jumps
•Alerts indicate spending anomalies
•Someone needs an answer to "why did costs go up?"
•A budget overrun needs investigating
•Keywords: spike, increase, jump, surge, anomaly, unexpected, sudden change

Prerequisites

This skill builds on the understand-cloudzero-organization skill.

Before applying this procedure:

•If you haven't already in this session, load the understand-cloudzero-organization skill and follow its instructions
•Reference the cached organization context (don't reload unnecessarily)

How This Skill Works

Step 1: Understand the Spike Period

•Clarify when the spike occurred (specific date, week, month)
•Define a baseline period of the same length for comparison (e.g., previous month, same period last year)
•If no period is specified, default to comparing the last 7 days with the 7 days before that
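
The default comparison window can be computed rather than typed by hand. The sketch below is one possible way to do it, not part of the skill itself: the helper name default_periods is hypothetical, it assumes "last 7 days" means the seven complete days ending yesterday, and it assumes both ends of a date range are inclusive. The output string follows the "YYYY-MM-DD to YYYY-MM-DD" form used in the queries in Step 2.

```python
from datetime import date, timedelta


def format_range(start: date, end: date) -> str:
    """Format a range the way the example queries write it."""
    return f"{start.isoformat()} to {end.isoformat()}"


def default_periods(today: date, days: int = 7) -> tuple[str, str]:
    """Return (spike_range, baseline_range) for two adjacent windows of equal length.

    Hypothetical helper. Assumes the spike window ends yesterday
    and that range endpoints are inclusive.
    """
    spike_end = today - timedelta(days=1)
    spike_start = spike_end - timedelta(days=days - 1)
    baseline_end = spike_start - timedelta(days=1)
    baseline_start = baseline_end - timedelta(days=days - 1)
    return (
        format_range(spike_start, spike_end),
        format_range(baseline_start, baseline_end),
    )


spike_range, baseline_range = default_periods(date(2024, 1, 22))
print(spike_range)     # 2024-01-15 to 2024-01-21
print(baseline_range)  # 2024-01-08 to 2024-01-14
```

Keeping both windows the same length makes the totals directly comparable.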

Step 2: Identify Top-Level Changes

Query total costs for both periods:

code

# Recent period (where spike occurred)
get_cost_data(
    date_range="2024-01-15 to 2024-01-21",
    granularity=None,
    cost_type="real_cost"
)

# Baseline period (for comparison)
get_cost_data(
    date_range="2024-01-08 to 2024-01-14",
    granularity=None,
    cost_type="real_cost"
)

Calculate:

•Absolute dollar change
•Percentage increase
•Cost per day in each period
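
The three calculations are simple arithmetic once both totals are known. The following is a minimal sketch; the function name spike_metrics and the dollar figures are made up for illustration, and it assumes each query has already been reduced to a single total.

```python
def spike_metrics(baseline_cost: float, spike_cost: float,
                  baseline_days: int, spike_days: int) -> dict:
    """Compute the headline numbers for a spike (hypothetical helper)."""
    change = spike_cost - baseline_cost
    # Percentage change is undefined when the baseline is zero.
    percent = change * 100 / baseline_cost if baseline_cost else None
    return {
        "absolute_change": change,
        "percent_change": percent,
        "baseline_per_day": baseline_cost / baseline_days,
        "spike_per_day": spike_cost / spike_days,
    }


# Illustrative totals only, not real data.
metrics = spike_metrics(baseline_cost=7000.0, spike_cost=9100.0,
                        baseline_days=7, spike_days=7)
print(metrics)
```

Cost per day matters most when the two periods differ in length, for example a 31-day month against a 28-day month.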

Step 3: Drill Down by Key Dimensions

Systematically check common cost drivers to identify where the spike originated:

Check by Cloud Provider:

code

get_cost_data(
    group_by=["CZ:CloudProvider"],
    date_range="<spike_period>",
    limit=10
)

Check by Service:

code

get_cost_data(
    group_by=["CZ:Service"],
    date_range="<spike_period>",
    limit=20
)

Check by Account:

code

get_cost_data(
    group_by=["CZ:Account"],
    date_range="<spike_period>",
    limit=20
)

Run each query for both the spike period and the baseline period, then compare the results to identify:

•Providers, services, or accounts with the largest absolute increases
•Providers, services, or accounts with the largest percentage increases
•New services that appeared during the spike period
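
The period-over-period comparison can be sketched as a small diff between two result sets. This is an illustration under stated assumptions: the response shape of get_cost_data is not specified here, so the sketch assumes each result has already been reduced to a mapping from dimension value to cost. The function name compare_periods and all figures are hypothetical.

```python
def compare_periods(baseline: dict[str, float], spike: dict[str, float]) -> list[dict]:
    """Diff two {dimension_value: cost} mappings, largest increase first."""
    rows = []
    for value in set(baseline) | set(spike):
        before = baseline.get(value, 0.0)
        after = spike.get(value, 0.0)
        change = after - before
        rows.append({
            "value": value,
            "baseline": before,
            "spike": after,
            "change": change,
            # No meaningful percentage when the value had no baseline cost.
            "percent": change * 100 / before if before else None,
            "is_new": value not in baseline,
        })
    return sorted(rows, key=lambda row: row["change"], reverse=True)


# Illustrative numbers only, not real data.
baseline = {"EC2": 4000.0, "S3": 2000.0, "RDS": 1000.0}
spike = {"EC2": 5500.0, "S3": 2000.0, "RDS": 1100.0, "SageMaker": 500.0}

rows = compare_periods(baseline, spike)
total_change = sum(row["change"] for row in rows)
new_values = sorted(row["value"] for row in rows if row["is_new"])

for row in rows:
    print(row["value"], row["change"], row["percent"], row["is_new"])
```

Sorting by absolute change surfaces the biggest dollar drivers first. Re-sorting the same rows by percent highlights small services that grew quickly, and total_change gives the denominator for "responsible for X% of increase" in the report.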

Step 4: Multi-Dimensional Analysis

Once you identify the primary dimension responsible, drill deeper:

code

# Example: If EC2 showed the spike, break down by account and region
get_cost_data(
    group_by=["CZ:Account", "CZ:Region"],
    filters={"CZ:Service": ["Amazon Elastic Compute Cloud - Compute"]},
    date_range="<spike_period>",
    limit=50
)

Step 5: Time-Series Analysis

Show how costs evolved during the spike period:

code

get_cost_data(
    group_by=["CZ:Service"],
    granularity="daily",
    date_range="<extended_period_including_spike>",
    limit=5
)

This reveals:

•When exactly the spike started
•Whether it's sustained or temporary
•If it's still ongoing
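
One rough way to read these three answers off a daily series is a threshold against the leading baseline days. This is a heuristic sketch, not the skill's prescribed method: the function names, the 25% threshold, and the sample data are all assumptions, and the daily results are assumed to be available as ordered (date, cost) pairs.

```python
from statistics import mean


def baseline_level(daily: list[tuple[str, float]], baseline_days: int = 7) -> float:
    """Mean daily cost over the leading baseline days."""
    return mean(cost for _, cost in daily[:baseline_days])


def find_spike_start(daily, baseline_days: int = 7, threshold: float = 1.25):
    """Return the first date after the baseline whose cost exceeds threshold x baseline."""
    limit = baseline_level(daily, baseline_days) * threshold
    for day, cost in daily[baseline_days:]:
        if cost > limit:
            return day
    return None


def is_still_elevated(daily, baseline_days: int = 7, threshold: float = 1.25) -> bool:
    """True if the most recent day is still above the threshold."""
    return daily[-1][1] > baseline_level(daily, baseline_days) * threshold


# Illustrative series only: a flat week, then a jump on 2024-01-16.
daily = [(f"2024-01-{day:02d}", 1000.0) for day in range(8, 15)]
daily += [("2024-01-15", 1010.0), ("2024-01-16", 1400.0),
          ("2024-01-17", 1450.0), ("2024-01-18", 1420.0)]

spike_start = find_spike_start(daily)
still_elevated = is_still_elevated(daily)
print(spike_start, still_elevated)
```

A fixed threshold is crude. Workloads with strong weekday and weekend cycles may need a comparison against the same weekday in earlier weeks instead.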

Step 6: Check for New Resources

Identify whether new resources were provisioned by listing the resource values seen in each period and comparing the two lists:

code

# Compare dimension values between periods
get_dimension_values(
    dimension="CZ:Resource",
    date_range="<spike_period>"
)

get_dimension_values(
    dimension="CZ:Resource",
    date_range="<baseline_period>"
)
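
The comparison itself is a set difference. The sketch below assumes each get_dimension_values call has been reduced to a plain list of resource identifiers. The helper name and the identifiers are hypothetical.

```python
def diff_resources(baseline_values, spike_values):
    """Return (added, removed) resource identifiers between two periods."""
    baseline_set = set(baseline_values)
    spike_set = set(spike_values)
    added = sorted(spike_set - baseline_set)    # present only in the spike period
    removed = sorted(baseline_set - spike_set)  # present only in the baseline period
    return added, removed


# Hypothetical identifiers for illustration.
baseline_resources = ["resource-a", "resource-b"]
spike_resources = ["resource-b", "resource-c"]

added, removed = diff_resources(baseline_resources, spike_resources)
print("new in spike period:", added)
print("gone since baseline:", removed)
```

Resources that appear only in the spike period are the first candidates to cost out individually.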

Step 7: Investigate Tags and Custom Dimensions

Use organization-specific dimensions to attribute costs:

code

# Discover available custom dimensions
get_available_dimensions(filter="User:Defined")

# Query by relevant custom dimensions
get_cost_data(
    group_by=["User:Defined:Team", "CZ:Service"],
    date_range="<spike_period>",
    limit=30
)

Check tag-based attribution:

code

get_available_dimensions(filter="Tag")

get_cost_data(
    group_by=["CZ:Tag:Environment", "CZ:Service"],
    date_range="<spike_period>",
    limit=30
)

Output Format

Provide a clear, structured investigation report:

1. Executive Summary

•Total cost increase ($ and %)
•Time period analyzed
•Primary root cause in one sentence

2. Spike Metrics

•Baseline period cost: $X
•Spike period cost: $Y
•Absolute increase: $Z
•Percentage increase: W%
•Cost per day comparison

3. Root Cause Analysis

•Primary driver: [Service/Account/Resource] responsible for X% of increase
•Contributing factors: Secondary drivers ranked by impact
•New resources: Any newly provisioned resources

4. Detailed Breakdown

•Top 5-10 cost changes by relevant dimensions
•Time-series visualization (describe the pattern)
•Multi-dimensional attribution (e.g., which team, which account)

5. Recommendations

•Is this spike expected/legitimate or potentially wasteful?
•Immediate actions to investigate further
•Potential optimization opportunities
•Monitoring recommendations to prevent future surprises

Skill-Specific Best Practices

•Compare apples to apples - Use the same date range length for the baseline and spike periods
•Check multiple dimensions - Don't stop at just service-level analysis
•Look for new resources - Spikes often come from new provisioning
•Consider seasonality - Compare to same period last year if relevant
•Show your work - Explain which queries you ran and why
•Be specific - Provide exact numbers, dimension values, and time periods

For general cost analysis best practices, see ${CLAUDE_PLUGIN_ROOT}/references/best-practices.md

Common Spike Patterns

•New resource provisioning: Large instances, databases, or storage volumes
•Scaling events: Auto-scaling responding to traffic
•Data transfer spikes: Unusual egress or cross-region transfer
•Reserved Instance expiration: Reverting to on-demand pricing
•Development/testing: Teams spinning up test environments
•Crypto mining/security incidents: Unauthorized resource usage
