AgentSkillsCN

prompt-caching

为Claude API启用提示词缓存功能,可将延迟降低高达85%,同时将成本削减高达90%。支持cache_control缓存控制、临时缓存、缓存断点以及性能优化等特性。

SKILL.md
--- frontmatter
name: prompt-caching
description: Prompt caching for Claude API to reduce latency by up to 85% and costs by up to 90%. Activate for cache_control, ephemeral caching, cache breakpoints, and performance optimization.
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
  - Glob
  - Grep
  - Task
triggers:
  - cache
  - prompt-caching
  - cache_control
  - ephemeral
  - cache breakpoint
  - latency
  - performance
  - cost optimization
dependencies:
  - llm-integration
related-skills:
  - extended-thinking
  - batch-processing

Prompt Caching Skill

Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.

When to Use This Skill

  • RAG systems with large static documents
  • Multi-turn conversations with long instructions
  • Code analysis with large codebase context
  • Batch processing with shared prefixes
  • Document analysis and summarization

Core Concepts

Cache Control Placement

python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)

Cache Hierarchy

Cache breakpoints are checked in this order:

  1. Tools - Tool definitions cached first
  2. System - System prompts cached second
  3. Messages - Conversation history cached last

TTL Options

TTLWrite CostRead CostUse Case
5 minutes (default)1.25x base0.1x baseInteractive sessions
1 hour2.0x base0.1x baseBatch processing, stable docs

Cache Requirements

  • Minimum tokens: 1024-4096 (varies by model)
  • Maximum breakpoints: 4 per request
  • Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5

Implementation Patterns

Pattern 1: Single Breakpoint (Recommended)

python
# Best for: Document analysis, Q&A with static context
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
    }
]

Pattern 2: Multi-Turn Conversation

python
# Cache grows with conversation
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": "Follow-up question",
        "cache_control": {"type": "ephemeral"}  # Cache entire conversation
    }
]

Pattern 3: RAG with Multiple Breakpoints

python
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
    }
]

Pattern 4: Batch Processing with 1-Hour TTL

python
# Warm the cache before batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )

Performance Monitoring

Check Cache Usage

python
response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens       # Cache hit!
uncached = response.usage.input_tokens                    # After breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")

Cost Calculation

python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost

Cache Invalidation

Changes that invalidate cache:

ChangeImpact
Tool definitionsEntire cache invalidated
System promptSystem + messages invalidated
Any content before breakpointThat breakpoint + later invalidated

Best Practices

DO:

  • Place breakpoint at END of static content
  • Keep tools/instructions stable across requests
  • Use 1-hour TTL for batch processing
  • Monitor cache_read_input_tokens for savings

DON'T:

  • Place breakpoint in middle of dynamic content
  • Change tool definitions frequently
  • Expect cache to work with <1024 tokens
  • Ignore the 20-block lookback limit

Integration with Extended Thinking

python
# Cache + Extended Thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)

See Also

  • [[llm-integration]] - Claude API basics
  • [[extended-thinking]] - Deep reasoning
  • [[batch-processing]] - Bulk processing