Prompt Caching

Cache LLM prompt prefixes for 90% token savings.

Supported Models (2026)

Provider	Models
Claude	Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3
OpenAI	gpt-5.2, gpt-5.2-mini, o3, o3-mini (automatic caching)

Claude Prompt Caching

python

def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
    1. System prompt (cached)
    2. Few-shot examples (cached)
    ─────── CACHE BREAKPOINT ───────
    3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"

    content_parts = []

    # Breakpoint 1: System prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl}
    })

    # Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl}
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content
    })

    return [{"role": "user", "content": content_parts}]

Cache Pricing (2026)

code

┌─────────────────────────────────────────────────────────────┐
│  Cache Cost Multipliers (relative to base input price)      │
├─────────────────────────────────────────────────────────────┤
│  5-minute cache write:  1.25x base input price              │
│  1-hour cache write:    2.00x base input price              │
│  Cache read:            0.10x base input price (90% off!)   │
└─────────────────────────────────────────────────────────────┘

Example: Claude Sonnet 4 @ $3/MTok input

Without Prompt Caching:
System prompt:     2,000 tokens @ $3/MTok  = $0.006
Few-shot examples: 5,000 tokens @ $3/MTok  = $0.015
User content:     10,000 tokens @ $3/MTok  = $0.030
───────────────────────────────────────────────────
Total:            17,000 tokens            = $0.051

With 5m Caching (first request = cache write):
Cached prefix:     7,000 tokens @ $3.75/MTok = $0.02625 (1.25x)
User content:     10,000 tokens @ $3/MTok    = $0.03000
Total first req:                             = $0.05625

With 5m Caching (subsequent = cache read):
Cached prefix:     7,000 tokens @ $0.30/MTok = $0.0021 (0.1x)
User content:     10,000 tokens @ $3/MTok    = $0.0300
Total cached req:                            = $0.0321

Savings: 37% per cached request, break-even after 2 requests

Extended Cache (1-hour TTL)

Use 1-hour cache when:

•Prompt reused > 10 times per hour
•System prompts are highly stable
•Token count > 10k (maximize savings)

python

# Extended cache: 2x write cost but persists 12x longer
"cache_control": {"type": "ephemeral", "ttl": "1h"}

# Break-even: 1h cache pays off after ~8 reads
# (2x write cost ÷ 0.9 savings per read ≈ 8 reads)

OpenAI Automatic Caching

python

# OpenAI caches prefixes automatically
# No cache_control markers needed

response = await openai.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "system", "content": system_prompt},  # Cached
        {"role": "user", "content": user_content}      # Not cached
    ]
)

# Check cache usage in response
cache_tokens = response.usage.prompt_tokens_cached

Cache Processing Order

code

Cache references entire prompt in order:
1. Tools (cached first)
2. System messages (cached second)
3. User messages (cached last)

⚠️ Extended thinking changes invalidate message caches
   (but NOT system/tools caches)

Monitoring Cache Effectiveness

python

# Track these fields in API response
response = await client.messages.create(...)

cache_created = response.usage.cache_creation_input_tokens  # New cache
cache_read = response.usage.cache_read_input_tokens         # Cache hit
regular = response.usage.input_tokens                        # Not cached

# Calculate cache hit rate
if cache_created + cache_read > 0:
    hit_rate = cache_read / (cache_created + cache_read)
    print(f"Cache hit rate: {hit_rate:.1%}")

Best Practices

python

# ✅ Good: Long, stable prefix first
messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
    {"role": "user", "content": FEW_SHOT_EXAMPLES},
    {"role": "user", "content": user_input}  # Variable
]

# ❌ Bad: Variable content early (breaks cache)
messages = [
    {"role": "user", "content": user_input},  # Breaks cache!
    {"role": "system", "content": LONG_SYSTEM_PROMPT}
]

Key Decisions

Decision	Recommendation
Min prefix size	1,024 tokens (Claude)
Breakpoint count	2-4 per request
Content order	Stable prefix first
Default TTL	5m for most cases
Extended TTL	1h if >10 reads/hour

Common Mistakes

•Variable content before cached prefix
•Too many breakpoints (overhead)
•Prefix too short (min 1024 tokens)
•Not checking cache_read_input_tokens
•Using 1h TTL for infrequent calls (wastes 2x write)

Related Skills

•semantic-caching - Redis similarity caching
•cache-cost-tracking - Cost monitoring
•llm-streaming - Streaming with caching

Capability Details

anthropic-caching

Keywords: anthropic, claude, cache_control, ephemeral Solves:

•Use Anthropic prompt caching
•Set cache breakpoints
•Reduce API costs

openai-caching

Keywords: openai, gpt, cached_tokens, automatic Solves:

•Use OpenAI prompt caching
•Structure prompts for cache hits
•Monitor cache effectiveness

wrapper-template

Keywords: wrapper, template, implementation, python Solves:

•Prompt cache wrapper template
•Python implementation
•Drop-in caching layer