Context Engineering Basics

Overview

Master context engineering fundamentals for AI agents and LLM applications. Learn strategies for dynamic context discovery, compression techniques, prompt caching optimization, memory consolidation patterns, and just-in-time context loading to maximize token efficiency and response quality.

Use when:

•Building AI agents with limited context windows
•Optimizing token usage in LLM applications
•Implementing memory systems for conversational AI
•Managing large knowledge bases with selective retrieval
•Designing multi-turn conversations with persistent context

Core Techniques:

•Dynamic Context Discovery (relevance scoring, semantic search)
•Context Compression (summarization, entity extraction, pruning)
•Prompt Caching (KV-cache optimization, prefix reuse)
•Memory Consolidation (episodic/semantic memory patterns)
•JIT Context Loading (lazy loading, incremental context expansion)

Skill Metadata

yaml

---
name: context-engineering-basics
description: |
  Context engineering fundamentals for AI systems: dynamic discovery,
  compression, caching, memory consolidation, and JIT loading patterns.
  Use when optimizing token efficiency or managing large knowledge bases.
license: MIT
metadata: ./metadata.yaml
---

Core Concepts

1. Context Window Constraints

Modern LLMs have finite context windows (4K-200K tokens). Effective context engineering maximizes information density within these limits.

Key Constraints:

•Token Limits: GPT-4 (8K-128K), Claude (200K), Gemini (1M-2M)
•Cost: Tokens cost money; larger contexts = higher costs
•Latency: Longer contexts increase processing time
•Attention Decay: Information at context boundaries may be less attended

Strategy: Progressive disclosure - load essential context first, expand as needed.

2. Dynamic Context Discovery

Identify and retrieve relevant information from large knowledge bases in real-time.

Techniques:

Semantic Search

python

# Vector similarity search
query_embedding = embed(user_query)
relevant_chunks = vector_db.search(
    query_embedding,
    top_k=5,
    threshold=0.75
)

Relevance Scoring

python

def score_relevance(chunk, query, conversation_history):
    """Multi-factor relevance scoring"""
    scores = {
        'semantic': cosine_similarity(chunk, query),
        'recency': time_decay(chunk.timestamp),
        'frequency': access_count(chunk.id),
        'context_fit': overlap_with_history(chunk, conversation_history)
    }
    return weighted_average(scores, weights=[0.4, 0.2, 0.1, 0.3])

Hybrid Search

Combine keyword matching (BM25) with semantic search for better precision:

python

# Hybrid retrieval
keyword_results = bm25_search(query, top_k=10)
semantic_results = vector_search(query, top_k=10)
combined = rerank(keyword_results + semantic_results, query)

See: references/dynamic-context-discovery.md for advanced patterns.

3. Context Compression

Reduce token count while preserving essential information.

Compression Strategies:

Summarization

python

def compress_conversation(history, target_ratio=0.3):
    """Compress conversation history to target ratio"""
    if len(history) < 5:
        return history  # Don't compress short conversations

    # Summarize older messages, keep recent ones verbatim
    cutoff = int(len(history) * 0.7)
    old_messages = history[:cutoff]
    recent_messages = history[cutoff:]

    summary = llm.summarize(old_messages, style="bullet_points")
    return [summary] + recent_messages

Entity Extraction

python

# Extract key entities to reduce noise
entities = extract_entities(text, types=['PERSON', 'ORG', 'DATE', 'LOCATION'])
compressed = f"Entities: {entities}\nSummary: {summarize(text, max_length=100)}"

Pruning Strategies

•Recency: Keep recent messages, summarize old ones
•Importance: Score each message, keep high-scoring ones
•Redundancy: Remove duplicate or highly similar information

See: references/context-compression.md for complete techniques.

4. Prompt Caching

Optimize repeated context reuse through caching mechanisms.

KV-Cache Optimization:

python

# Prefix caching pattern
system_context = """
You are an expert assistant...
[Large static context: 5000 tokens]
"""

# Cache this prefix across requests
@cache_prefix(system_context)
def handle_request(user_message):
    return llm.generate(
        prefix=system_context,  # Cached, not re-processed
        message=user_message
    )

Claude Prompt Caching:

python

# Anthropic's prompt caching
messages = [
    {
        "role": "system",
        "content": large_knowledge_base,
        "cache_control": {"type": "ephemeral"}  # Cache this
    },
    {
        "role": "user",
        "content": user_query
    }
]

Cache Strategies:

•Static Prefixes: System prompts, guidelines, knowledge bases
•Session Context: User profile, conversation metadata
•Incremental Updates: Append to cached context without reprocessing

See: references/prompt-caching.md for provider-specific implementations.

5. Memory Consolidation

Organize information into structured memory systems.

Memory Hierarchies:

python

class AgentMemory:
    def __init__(self):
        self.working_memory = []      # Current conversation (short-term)
        self.episodic_memory = []     # Past conversations (events)
        self.semantic_memory = {}     # Facts and knowledge
        self.procedural_memory = {}   # Learned skills/patterns

    def consolidate(self):
        """Move working memory to long-term storage"""
        # Extract important facts
        facts = extract_facts(self.working_memory)
        self.semantic_memory.update(facts)

        # Store conversation as episode
        episode = {
            'timestamp': now(),
            'summary': summarize(self.working_memory),
            'key_entities': extract_entities(self.working_memory)
        }
        self.episodic_memory.append(episode)

        # Clear working memory
        self.working_memory = []

CoALA Framework Integration:

python

# Cognitive Architecture for Language Agents
memory = CoALAMemory(
    working_memory_size=10,
    episodic_retention_days=30,
    semantic_index_type='faiss'
)

# Automatic consolidation triggers
memory.set_consolidation_triggers(
    max_working_size=15,
    time_interval=3600,  # Every hour
    importance_threshold=0.7
)

See: references/memory-consolidation.md for advanced patterns.

6. Just-In-Time (JIT) Context Loading

Load context incrementally based on need, not upfront.

Lazy Loading Pattern:

python

class JITContextLoader:
    def __init__(self, knowledge_base):
        self.kb = knowledge_base
        self.loaded_context = set()

    def load_on_demand(self, query, current_context):
        """Load only what's needed for this query"""
        # Identify required context
        required_topics = identify_topics(query)

        # Load only new required context
        new_context = []
        for topic in required_topics:
            if topic not in self.loaded_context:
                chunk = self.kb.get_topic(topic)
                new_context.append(chunk)
                self.loaded_context.add(topic)

        return current_context + new_context

Incremental Expansion:

python

# Start with minimal context
context = {
    'system': base_system_prompt,
    'user_profile': essential_user_data
}

# Expand based on query complexity
if is_complex(query):
    context['examples'] = load_few_shot_examples(query)
    context['documentation'] = load_relevant_docs(query)

# Expand based on errors/uncertainty
if model_uncertain(response):
    context['additional_knowledge'] = search_knowledge_base(query)

See: references/jit-context-loading.md for implementation patterns.

Best Practices

1. Token Budget Management

python

class TokenBudget:
    def __init__(self, max_tokens=8000, reserve=500):
        self.max_tokens = max_tokens
        self.reserve = reserve  # Reserve for response
        self.allocated = {
            'system': 0,
            'context': 0,
            'history': 0,
            'query': 0
        }

    def can_add(self, component, tokens):
        """Check if we can add more context"""
        total = sum(self.allocated.values()) + tokens
        return total + self.reserve <= self.max_tokens

    def add(self, component, content):
        """Add content with token tracking"""
        tokens = count_tokens(content)
        if self.can_add(component, tokens):
            self.allocated[component] += tokens
            return True
        return False

2. Context Quality over Quantity

Prioritize:

•High-relevance over completeness
•Recent over historical (unless historical is critical)
•Unique information over redundant
•Structured over unstructured

3. Monitoring and Metrics

python

@monitor_context
def generate_response(query, context):
    metrics = {
        'total_tokens': count_tokens(context),
        'compression_ratio': original_size / compressed_size,
        'cache_hit_rate': cache_hits / total_requests,
        'retrieval_latency': time_to_retrieve_context,
        'relevance_score': avg_relevance(context, query)
    }
    log_metrics(metrics)
    return llm.generate(query, context)

4. Graceful Degradation

python

def build_context(query, max_tokens=8000):
    """Build context with fallback levels"""
    context_levels = [
        ('essential', load_essential_context),
        ('recommended', load_recommended_context),
        ('optional', load_optional_context)
    ]

    context = ""
    for level, loader in context_levels:
        candidate = loader(query)
        if count_tokens(context + candidate) < max_tokens:
            context += candidate
        else:
            logger.warning(f"Skipped {level} context due to token limits")
            break

    return context

Common Patterns

Pattern 1: Sliding Window with Summarization

python

class SlidingWindowContext:
    def __init__(self, window_size=10, compression_ratio=0.3):
        self.window_size = window_size
        self.compression_ratio = compression_ratio
        self.messages = []
        self.summary = None

    def add_message(self, message):
        self.messages.append(message)

        if len(self.messages) > self.window_size:
            # Compress oldest messages
            to_compress = self.messages[:5]
            self.summary = update_summary(self.summary, to_compress)
            self.messages = self.messages[5:]

    def get_context(self):
        context = []
        if self.summary:
            context.append(f"[Previous conversation summary: {self.summary}]")
        context.extend(self.messages)
        return context

Pattern 2: Hierarchical Context

python

class HierarchicalContext:
    """Context organized by abstraction levels"""

    def __init__(self):
        self.layers = {
            'meta': "",        # High-level goals, constraints
            'domain': "",      # Domain knowledge, ontologies
            'task': "",        # Current task context
            'dialogue': []     # Conversation history
        }

    def render(self, detail_level='full'):
        """Render context at specified detail level"""
        if detail_level == 'minimal':
            return self.layers['meta'] + self.layers['task']
        elif detail_level == 'medium':
            return self.layers['meta'] + self.layers['task'] + \
                   summarize(self.layers['dialogue'])
        else:  # full
            return "\n".join([
                self.layers['meta'],
                self.layers['domain'],
                self.layers['task'],
                format_dialogue(self.layers['dialogue'])
            ])

Pattern 3: Context Routing

python

class ContextRouter:
    """Route queries to appropriate context subsets"""

    def __init__(self):
        self.context_stores = {
            'product_info': ProductKnowledgeBase(),
            'user_history': UserHistoryDB(),
            'company_policy': PolicyDocuments(),
            'technical_docs': TechnicalDocs()
        }

    def route(self, query):
        """Determine which context stores to query"""
        query_type = classify_query(query)

        routing_rules = {
            'product_question': ['product_info', 'technical_docs'],
            'account_issue': ['user_history', 'company_policy'],
            'technical_support': ['technical_docs', 'user_history']
        }

        stores_to_query = routing_rules.get(query_type, ['product_info'])

        # Retrieve from relevant stores only
        context = []
        for store_name in stores_to_query:
            store = self.context_stores[store_name]
            context.extend(store.search(query, top_k=3))

        return context

Anti-Patterns to Avoid

❌ Anti-Pattern 1: Context Dumping

python

# DON'T: Load entire knowledge base
context = load_entire_database()  # 100K tokens!
response = llm.generate(query, context)

Fix: Use dynamic retrieval

python

# DO: Load only relevant chunks
context = semantic_search(query, top_k=5)  # ~2K tokens
response = llm.generate(query, context)

❌ Anti-Pattern 2: Ignoring Recency

python

# DON'T: Treat all history equally
context = all_messages  # Includes irrelevant old messages

Fix: Apply time decay

python

# DO: Prioritize recent context
recent = messages[-10:]
summary_of_old = summarize(messages[:-10])
context = [summary_of_old] + recent

❌ Anti-Pattern 3: Static Context

python

# DON'T: Use same context for all queries
context = fixed_knowledge_base
for query in queries:
    response = llm.generate(query, context)

Fix: Dynamic context per query

python

# DO: Adapt context to each query
for query in queries:
    context = retrieve_relevant(query, knowledge_base)
    response = llm.generate(query, context)

❌ Anti-Pattern 4: No Token Tracking

python

# DON'T: Add context blindly
context = system + history + knowledge + examples  # Hope it fits!

Fix: Track and limit tokens

python

# DO: Manage token budget
budget = TokenBudget(max_tokens=8000)
if budget.can_add('system', system_prompt):
    budget.add('system', system_prompt)
# ... add components with budget checks

Testing Context Strategies

Test 1: Relevance Accuracy

python

def test_retrieval_relevance():
    """Test that retrieved context is actually relevant"""
    test_cases = [
        {
            'query': "How do I reset my password?",
            'expected_topics': ['authentication', 'password_reset'],
            'unexpected_topics': ['billing', 'shipping']
        }
    ]

    for case in test_cases:
        retrieved = retrieve_context(case['query'])
        topics = extract_topics(retrieved)

        assert all(t in topics for t in case['expected_topics'])
        assert not any(t in topics for t in case['unexpected_topics'])

Test 2: Compression Quality

python

def test_compression_preserves_key_info():
    """Ensure compression doesn't lose critical information"""
    original = long_conversation_history
    compressed = compress_context(original, ratio=0.3)

    # Extract key facts from both
    original_facts = extract_facts(original)
    compressed_facts = extract_facts(compressed)

    # Should preserve 80%+ of key facts
    preservation_rate = len(compressed_facts) / len(original_facts)
    assert preservation_rate >= 0.8

Test 3: Token Budget Compliance

python

def test_token_limits():
    """Verify context stays within token limits"""
    max_tokens = 8000
    for query in test_queries:
        context = build_context(query, max_tokens=max_tokens)
        actual_tokens = count_tokens(context)

        assert actual_tokens <= max_tokens
        assert actual_tokens >= max_tokens * 0.7  # Use at least 70% of budget

Platform-Specific Considerations

OpenAI (GPT-4, GPT-3.5)

•Context window: 8K-128K depending on model
•No native prompt caching (use application-level caching)
•Function calling can reduce context needs
•Use max_tokens to reserve space for responses

Anthropic (Claude)

•Context window: 200K tokens (Claude 3.5)
•Native prompt caching with cache_control parameter
•Extended context enables document analysis
•System prompts can be cached separately

Google (Gemini)

•Context window: 1M-2M tokens (Gemini 1.5/2.0)
•Massive context enables different strategies
•Can include entire codebases or documents
•Still benefits from relevance filtering

Local Models (Llama, Mistral)

•Smaller context windows (4K-32K typically)
•Context compression critical
•Optimize for inference speed
•Consider quantization impact on context handling

Integration Examples

Example 1: RAG with Context Engineering

python

class ContextEngineeredRAG:
    def __init__(self, vector_db, llm):
        self.vector_db = vector_db
        self.llm = llm
        self.budget = TokenBudget(max_tokens=8000)

    def query(self, question, conversation_history=None):
        # 1. Dynamic retrieval
        candidates = self.vector_db.search(question, top_k=20)

        # 2. Rerank by relevance
        relevant = self.rerank(candidates, question, conversation_history)

        # 3. Compress if needed
        context = self.build_context(relevant, conversation_history)

        # 4. Generate with budget compliance
        if count_tokens(context) > self.budget.max_tokens:
            context = self.compress(context)

        return self.llm.generate(question, context)

    def rerank(self, candidates, question, history):
        """Rerank by multiple factors"""
        scored = [
            (doc, score_relevance(doc, question, history))
            for doc in candidates
        ]
        return [doc for doc, score in sorted(scored, key=lambda x: x[1], reverse=True)]

    def build_context(self, docs, history):
        """Build context with token budget"""
        context_parts = []

        # Add compressed history
        if history:
            compressed_history = compress_conversation(history)
            if self.budget.can_add('history', compressed_history):
                self.budget.add('history', compressed_history)
                context_parts.append(compressed_history)

        # Add documents until budget exhausted
        for doc in docs:
            if self.budget.can_add('context', doc):
                self.budget.add('context', doc)
                context_parts.append(doc)
            else:
                break

        return "\n\n".join(context_parts)

Example 2: Multi-Agent Context Sharing

python

class SharedContextManager:
    """Manage context across multiple agents"""

    def __init__(self):
        self.global_context = {}
        self.agent_contexts = defaultdict(list)

    def update_global(self, key, value):
        """Update shared context available to all agents"""
        self.global_context[key] = value

    def get_agent_context(self, agent_id, query):
        """Get personalized context for specific agent"""
        # Combine global + agent-specific + query-relevant
        context = {
            'global': self.global_context,
            'agent_history': self.agent_contexts[agent_id][-10:],
            'relevant': self.retrieve_relevant(query)
        }
        return self.render_context(context)

    def consolidate_agent_output(self, agent_id, output):
        """Extract learnings from agent output to shared context"""
        facts = extract_facts(output)
        for fact in facts:
            if is_globally_relevant(fact):
                self.update_global(fact['key'], fact['value'])

        self.agent_contexts[agent_id].append(output)

Resources

References

•
Dynamic Context Discovery (references/dynamic-context-discovery.md)
- •Advanced retrieval strategies
- •Hybrid search implementations
- •Relevance scoring algorithms
•
Context Compression (references/context-compression.md)
- •Summarization techniques
- •Entity extraction methods
- •Redundancy elimination
•
Prompt Caching (references/prompt-caching.md)
- •Provider-specific implementations
- •Cache invalidation strategies
- •Performance optimization
•
Memory Consolidation (references/memory-consolidation.md)
- •Memory hierarchy patterns
- •CoALA framework integration
- •Long-term storage strategies
•
JIT Context Loading (references/jit-context-loading.md)
- •Lazy loading patterns
- •Incremental expansion
- •Performance considerations

External Resources

•LangChain Context Management: https://python.langchain.com/docs/modules/memory/
•Anthropic Prompt Caching: https://docs.anthropic.com/claude/docs/prompt-caching
•OpenAI Best Practices: https://platform.openai.com/docs/guides/prompt-engineering
•CoALA Framework Paper: https://arxiv.org/abs/2309.02427
•RAG Evaluation: https://arxiv.org/abs/2401.15884

License

MIT License - See LICENSE file for details

Version: 1.0.0 Last Updated: 2026-02-05 Maintainer: Hefesto Skill Generator