Context Engineering: Optimizing AI Context Windows
Master the art of context engineering for AI applications - optimizing prompts, managing tokens, and designing effective context strategies.
Triggers
Use this skill when:
- •Optimizing LLM prompts for better results
- •Managing context window limits
- •Implementing RAG (Retrieval Augmented Generation)
- •Designing AI application architectures
- •Reducing token costs while maintaining quality
- •Keywords: context, prompt, tokens, RAG, context window, prompt engineering, token budget, retrieval, embedding
Core Concepts
Context Window Anatomy
code
┌─────────────────────────────────────────────────────┐ │ CONTEXT WINDOW │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ SYSTEM PROMPT (Fixed) │ │ │ │ - Identity & role │ │ │ │ - Behavioral rules │ │ │ │ - Output format │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ RETRIEVED CONTEXT (Dynamic) │ │ │ │ - Relevant documents │ │ │ │ - Code snippets │ │ │ │ - Reference data │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ CONVERSATION HISTORY (Growing) │ │ │ │ - Previous messages │ │ │ │ - Tool results │ │ │ │ - Intermediate outputs │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ CURRENT INPUT (Variable) │ │ │ │ - User query │ │ │ │ - Inline context │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ OUTPUT SPACE (Reserved) │ │ │ │ - max_tokens allocation │ │ │ └──────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────┘
Token Budget Planning
| Component | Typical Allocation | Notes |
|---|---|---|
| System Prompt | 500-2000 tokens | Keep stable |
| Retrieved Context | 2000-10000 tokens | Scale with need |
| History | 1000-5000 tokens | Compress over time |
| Current Input | 100-1000 tokens | User controlled |
| Output Reserve | 1000-4096 tokens | Task dependent |
Pattern 1: System Prompt Design
Layered System Prompt
markdown
# System Prompt Structure ## Layer 1: Identity (Always First) You are [role description]. Your purpose is [primary function]. ## Layer 2: Capabilities You have access to: - [Capability 1] - [Capability 2] ## Layer 3: Behavioral Rules ALWAYS: - [Rule 1] - [Rule 2] NEVER: - [Constraint 1] - [Constraint 2] ## Layer 4: Output Format When responding: - [Format guideline 1] - [Format guideline 2] ## Layer 5: Context Hints (Dynamic) Current context: [injected at runtime]
Compression Techniques
Before (verbose):
code
You are a helpful AI assistant that specializes in helping users with coding tasks. When a user asks you to write code, you should first understand what they're trying to accomplish, then write clean and well-documented code that follows best practices.
After (compressed):
code
Role: Coding assistant Process: Understand task -> Write clean, documented, best-practice code
Pattern 2: Dynamic Context Injection
Context Template System
python
def build_context(task: str, retrieved_docs: list, history: list) -> str:
template = """# Task
{task}
# Relevant Context
{context}
# Conversation History
{history}
# Instructions
Respond based on the context provided. If information is missing, say so.
"""
return template.format(
task=task,
context=format_docs(retrieved_docs),
history=format_history(history)
)
def format_docs(docs: list, max_tokens: int = 5000) -> str:
formatted = []
current_tokens = 0
for doc in sorted(docs, key=lambda d: d.relevance, reverse=True):
doc_tokens = count_tokens(doc.content)
if current_tokens + doc_tokens > max_tokens:
break
formatted.append(f"## {doc.title}\n{doc.content}")
current_tokens += doc_tokens
return "\n\n".join(formatted)
Priority-Based Inclusion
python
class ContextPriority:
CRITICAL = 1 # Always include
HIGH = 2 # Include if space
MEDIUM = 3 # Include if plenty of space
LOW = 4 # Include only if necessary
def select_context(items: list, budget: int) -> list:
selected = []
remaining = budget
# Sort by priority, then relevance
sorted_items = sorted(items, key=lambda x: (x.priority, -x.relevance))
for item in sorted_items:
tokens = count_tokens(item.content)
if tokens <= remaining:
selected.append(item)
remaining -= tokens
elif item.priority == ContextPriority.CRITICAL:
# Summarize critical items if they don't fit
summary = summarize(item.content, remaining)
selected.append(item._replace(content=summary))
break
return selected
Pattern 3: Conversation Summarization
Rolling Summary
python
class ConversationManager:
def __init__(self, max_history_tokens: int = 3000):
self.messages = []
self.summary = ""
self.max_tokens = max_history_tokens
def add_message(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
self._maybe_summarize()
def _maybe_summarize(self):
total_tokens = sum(count_tokens(m["content"]) for m in self.messages)
if total_tokens > self.max_tokens:
# Keep last N messages
keep_recent = 4
to_summarize = self.messages[:-keep_recent]
recent = self.messages[-keep_recent:]
# Summarize older messages
new_summary = self._create_summary(to_summarize)
self.summary = f"{self.summary}\n{new_summary}".strip()
self.messages = recent
def get_context(self) -> str:
parts = []
if self.summary:
parts.append(f"[Previous conversation summary: {self.summary}]")
parts.extend([f"{m['role']}: {m['content']}" for m in self.messages])
return "\n\n".join(parts)
Hierarchical Memory
code
MEMORY LEVELS Level 1: Working Memory (Current Context) - Last few exchanges - Current task details - Active tool results Level 2: Session Memory (Summarized) - Earlier conversation summary - Key decisions made - Important context established Level 3: Long-term Memory (Retrieved) - Past session summaries - User preferences - Project knowledge
Pattern 4: RAG (Retrieval Augmented Generation)
Basic RAG Pipeline
python
class RAGPipeline:
def __init__(self, embedder, vector_store, llm):
self.embedder = embedder
self.store = vector_store
self.llm = llm
def query(self, question: str, k: int = 5) -> str:
# 1. Embed query
query_embedding = self.embedder.embed(question)
# 2. Retrieve relevant docs
docs = self.store.similarity_search(query_embedding, k=k)
# 3. Build context
context = self._build_context(question, docs)
# 4. Generate response
return self.llm.generate(context)
def _build_context(self, question: str, docs: list) -> str:
doc_text = "\n\n".join([
f"Source: {d.metadata.get('source', 'unknown')}\n{d.content}"
for d in docs
])
return f"""Based on the following context, answer the question.
Context:
{doc_text}
Question: {question}
Answer:"""
Advanced RAG Techniques
Hybrid Search:
python
def hybrid_search(query: str, k: int = 5):
# Semantic search
semantic_results = vector_search(query, k=k*2)
# Keyword search
keyword_results = bm25_search(query, k=k*2)
# Combine and dedupe
combined = merge_results(semantic_results, keyword_results)
# Rerank
return rerank(query, combined, k=k)
Query Expansion:
python
def expand_query(original_query: str) -> list:
expansion_prompt = f"""Generate 3 alternative phrasings for this query:
{original_query}
Return as JSON list."""
alternatives = llm.generate(expansion_prompt)
return [original_query] + json.loads(alternatives)
Pattern 5: Token Optimization
Techniques
| Technique | Savings | Trade-off |
|---|---|---|
| Abbreviations | 10-20% | Readability |
| Remove examples | 20-40% | Clarity |
| Bullet points | 15-25% | Formatting |
| Summarization | 50-80% | Detail loss |
| Selective inclusion | Variable | Coverage |
Implementation
python
def optimize_context(content: str, target_tokens: int) -> str:
current_tokens = count_tokens(content)
if current_tokens <= target_tokens:
return content
# Try progressive compression
strategies = [
remove_redundant_whitespace,
abbreviate_common_terms,
remove_examples,
extract_key_points,
aggressive_summarize
]
for strategy in strategies:
content = strategy(content)
if count_tokens(content) <= target_tokens:
return content
# Last resort: truncate
return truncate_to_tokens(content, target_tokens)
Pattern 6: Context Window Monitoring
Token Tracking
python
class TokenTracker:
def __init__(self, model: str):
self.model = model
self.limit = get_context_limit(model)
self.usage = {
"system": 0,
"context": 0,
"history": 0,
"input": 0,
"reserved": 4096 # For output
}
def update(self, component: str, content: str):
self.usage[component] = count_tokens(content)
@property
def available(self) -> int:
used = sum(self.usage.values())
return self.limit - used
@property
def utilization(self) -> float:
return sum(self.usage.values()) / self.limit
def can_add(self, content: str) -> bool:
return count_tokens(content) <= self.available
def report(self) -> str:
return f"""Token Usage:
- System: {self.usage['system']}
- Context: {self.usage['context']}
- History: {self.usage['history']}
- Input: {self.usage['input']}
- Reserved: {self.usage['reserved']}
- Available: {self.available}
- Utilization: {self.utilization:.1%}"""
Best Practices
System Prompts
- •Front-load important instructions: Models attend more to beginning
- •Use clear structure: Headers, bullets, consistent formatting
- •Be specific: Vague instructions get vague results
- •Test variations: Small changes can have big impacts
- •Version control: Track what works
Context Selection
- •Relevance over recency: Most relevant, not most recent
- •Diversity: Include different perspectives
- •Source attribution: Help model cite correctly
- •Chunking strategy: Match chunk size to use case
- •Metadata inclusion: Add context about context
Token Management
- •Reserve output space: Don't fill entire context
- •Monitor utilization: Track across sessions
- •Compress proactively: Before hitting limits
- •Cache summaries: Don't re-summarize repeatedly
- •Profile costs: Know your token spend
Quick Reference
Token Counts (Approximate)
| Content Type | Tokens/Item |
|---|---|
| English word | 1.3 |
| Code line | 10-15 |
| Paragraph | 50-100 |
| Page of text | 500-750 |
| JSON object | 20-50 |
Model Context Limits
| Model | Context Limit |
|---|---|
| Claude Opus 4.5 | 200K |
| Claude Sonnet 4 | 200K |
| Claude Haiku 3.5 | 200K |
| GPT-4 Turbo | 128K |
| GPT-4o | 128K |
Notes
- •Context engineering is iterative - test and refine
- •Different tasks need different context strategies
- •Monitor both quality and cost
- •Cache aggressively where possible
- •Document your context architecture