Prompt Caching
Cache LLM prompt prefixes for 90% token savings.
Supported Models (2026)
| Provider | Models |
|---|---|
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-5.2, gpt-5.2-mini, o3, o3-mini (automatic caching) |
Claude Prompt Caching
python
def build_cached_messages(
system_prompt: str,
few_shot_examples: str | None,
user_content: str,
use_extended_cache: bool = False
) -> list[dict]:
"""Build messages with cache breakpoints.
Cache structure (processing order: tools → system → messages):
1. System prompt (cached)
2. Few-shot examples (cached)
─────── CACHE BREAKPOINT ───────
3. User content (NOT cached)
"""
# TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
ttl = "1h" if use_extended_cache else "5m"
content_parts = []
# Breakpoint 1: System prompt
content_parts.append({
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral", "ttl": ttl}
})
# Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
if few_shot_examples:
content_parts.append({
"type": "text",
"text": few_shot_examples,
"cache_control": {"type": "ephemeral", "ttl": ttl}
})
# Dynamic content (NOT cached)
content_parts.append({
"type": "text",
"text": user_content
})
return [{"role": "user", "content": content_parts}]
Cache Pricing (2026)
code
┌─────────────────────────────────────────────────────────────┐ │ Cache Cost Multipliers (relative to base input price) │ ├─────────────────────────────────────────────────────────────┤ │ 5-minute cache write: 1.25x base input price │ │ 1-hour cache write: 2.00x base input price │ │ Cache read: 0.10x base input price (90% off!) │ └─────────────────────────────────────────────────────────────┘ Example: Claude Sonnet 4 @ $3/MTok input Without Prompt Caching: System prompt: 2,000 tokens @ $3/MTok = $0.006 Few-shot examples: 5,000 tokens @ $3/MTok = $0.015 User content: 10,000 tokens @ $3/MTok = $0.030 ─────────────────────────────────────────────────── Total: 17,000 tokens = $0.051 With 5m Caching (first request = cache write): Cached prefix: 7,000 tokens @ $3.75/MTok = $0.02625 (1.25x) User content: 10,000 tokens @ $3/MTok = $0.03000 Total first req: = $0.05625 With 5m Caching (subsequent = cache read): Cached prefix: 7,000 tokens @ $0.30/MTok = $0.0021 (0.1x) User content: 10,000 tokens @ $3/MTok = $0.0300 Total cached req: = $0.0321 Savings: 37% per cached request, break-even after 2 requests
Extended Cache (1-hour TTL)
Use 1-hour cache when:
- •Prompt reused > 10 times per hour
- •System prompts are highly stable
- •Token count > 10k (maximize savings)
python
# Extended cache: 2x write cost but persists 12x longer
"cache_control": {"type": "ephemeral", "ttl": "1h"}
# Break-even: 1h cache pays off after ~8 reads
# (2x write cost ÷ 0.9 savings per read ≈ 8 reads)
OpenAI Automatic Caching
python
# OpenAI caches prefixes automatically
# No cache_control markers needed
response = await openai.chat.completions.create(
model="gpt-5.2",
messages=[
{"role": "system", "content": system_prompt}, # Cached
{"role": "user", "content": user_content} # Not cached
]
)
# Check cache usage in response
cache_tokens = response.usage.prompt_tokens_cached
Cache Processing Order
code
Cache references entire prompt in order: 1. Tools (cached first) 2. System messages (cached second) 3. User messages (cached last) ⚠️ Extended thinking changes invalidate message caches (but NOT system/tools caches)
Monitoring Cache Effectiveness
python
# Track these fields in API response
response = await client.messages.create(...)
cache_created = response.usage.cache_creation_input_tokens # New cache
cache_read = response.usage.cache_read_input_tokens # Cache hit
regular = response.usage.input_tokens # Not cached
# Calculate cache hit rate
if cache_created + cache_read > 0:
hit_rate = cache_read / (cache_created + cache_read)
print(f"Cache hit rate: {hit_rate:.1%}")
Best Practices
python
# ✅ Good: Long, stable prefix first
messages = [
{"role": "system", "content": LONG_SYSTEM_PROMPT},
{"role": "user", "content": FEW_SHOT_EXAMPLES},
{"role": "user", "content": user_input} # Variable
]
# ❌ Bad: Variable content early (breaks cache)
messages = [
{"role": "user", "content": user_input}, # Breaks cache!
{"role": "system", "content": LONG_SYSTEM_PROMPT}
]
Key Decisions
| Decision | Recommendation |
|---|---|
| Min prefix size | 1,024 tokens (Claude) |
| Breakpoint count | 2-4 per request |
| Content order | Stable prefix first |
| Default TTL | 5m for most cases |
| Extended TTL | 1h if >10 reads/hour |
Common Mistakes
- •Variable content before cached prefix
- •Too many breakpoints (overhead)
- •Prefix too short (min 1024 tokens)
- •Not checking
cache_read_input_tokens - •Using 1h TTL for infrequent calls (wastes 2x write)
Related Skills
- •
semantic-caching- Redis similarity caching - •
cache-cost-tracking- Cost monitoring - •
llm-streaming- Streaming with caching
Capability Details
anthropic-caching
Keywords: anthropic, claude, cache_control, ephemeral Solves:
- •Use Anthropic prompt caching
- •Set cache breakpoints
- •Reduce API costs
openai-caching
Keywords: openai, gpt, cached_tokens, automatic Solves:
- •Use OpenAI prompt caching
- •Structure prompts for cache hits
- •Monitor cache effectiveness
wrapper-template
Keywords: wrapper, template, implementation, python Solves:
- •Prompt cache wrapper template
- •Python implementation
- •Drop-in caching layer