Recursive LLM Pattern
Process documents that exceed context limits using a REPL environment with sub-LLM calls. Based on the Recursive Language Models (RLM) paper.
Key Insight
The prompt is loaded as a variable in the REPL environment, not fed directly to the LLM. The LLM then writes code to symbolically interact with, decompose, and recursively process the content. This lets it handle inputs orders of magnitude beyond its context window.
When to Use
- •Processing very large files (multi-MB text, code repos, >100K tokens)
- •Aggregating information from many documents
- •Tasks requiring dense access to long content (every line matters)
- •When direct context would cause quality degradation
When NOT to use: For small inputs (<50K tokens), direct LLM calls are faster and slightly more accurate. The RLM pattern adds overhead that only pays off at scale.
Prerequisites
- •
mcp_handley_labmust be installed in the Python environment - •API keys configured for desired providers (GEMINI_API_KEY, OPENAI_API_KEY, etc.)
Pattern
- •
Create Python REPL session
codemcp__repl__session(action="create", backend="python")
- •
Import llm module and survey the context
pythonfrom mcp_handley_lab import llm from pathlib import Path context_path = Path("/path/to/large/document.txt") context_size = context_path.stat().st_size print(f"Context: {context_size:,} bytes (~{context_size // 4:,} tokens)") # Peek at the structure before choosing a chunking strategy with open(context_path) as f: head = f.read(2000) print(head) - •
Process in chunks with stateless sub-LLM calls
pythonresults = [] chunk_size = 100_000 # bytes (~25K tokens); adjust based on measured token usage with open(context_path, 'rb') as f: offset = 0 while chunk := f.read(chunk_size): text = chunk.decode('utf-8', errors='replace') # branch="" makes call stateless (no memory accumulation) result = llm.chat(f'''Extract key information from this document chunk. <document_chunk offset="{offset}"> {text} </document_chunk> Summarize the key facts found in this chunk only.''', branch="") results.append({"offset": offset, "summary": result.content}) offset += len(chunk) # Aggregate results summaries = "\n".join(f"[{r['offset']}]: {r['summary']}" for r in results) final = llm.chat(f"Synthesize these chunk summaries:\n{summaries}", branch="") print(final.content)
Advanced Patterns
Filter Before Calling LLMs
Use code to narrow the search space before spending tokens on sub-calls:
import re
# Use regex/string ops to find relevant sections first
with open(context_path) as f:
content = f.read()
matches = [m.start() for m in re.finditer(r'keyword|pattern', content)]
print(f"Found {len(matches)} relevant sections")
# Only send relevant sections to sub-LLMs
for start in matches:
excerpt = content[max(0, start-500):start+2000]
result = llm.chat(f"Analyze this excerpt:\n{excerpt}", branch="")
results.append(result.content)
Recursive Sub-Calls (True RLM Depth)
Sub-LLM calls can themselves spawn further sub-calls. Use this when a single level of chunking isn't enough — e.g., each chunk still exceeds context or requires its own decomposition:
def process_section(text, depth=0, max_depth=2):
"""Recursively process text, spawning sub-calls if too large."""
if len(text) < 100_000 or depth >= max_depth:
result = llm.chat(f"Analyze:\n{text}", branch="")
return result.content
# Chunk and recurse
mid = len(text) // 2
left = process_section(text[:mid], depth + 1, max_depth)
right = process_section(text[mid:], depth + 1, max_depth)
merged = llm.chat(f"Merge these analyses:\n1: {left}\n2: {right}", branch="")
return merged.content
Build Output Through REPL Variables
For tasks producing long output (e.g., transforming every line), accumulate results in REPL variables rather than trying to generate everything in one LLM call:
output_parts = []
for i, section in enumerate(sections):
transformed = llm.chat(f"Transform this section:\n{section}", branch="")
output_parts.append(transformed.content)
print(f"Processed {i+1}/{len(sections)}")
# Stitch together — the final output can exceed any single LLM's output limit
final_output = "\n\n".join(output_parts)
Path("/tmp/result.txt").write_text(final_output)
print(f"Written {len(final_output):,} chars to /tmp/result.txt")
Answer Verification
Use a separate sub-call to verify answers, avoiding context rot in the main reasoning chain:
# After computing an answer from chunks
answer = llm.chat(f"Based on these summaries, answer: {query}\n{summaries}", branch="")
# Verify with targeted evidence
verification = llm.chat(
f"Verify this answer: '{answer.content}'\nUsing this evidence:\n{relevant_excerpt}",
branch=""
)
print(f"Answer: {answer.content}\nVerification: {verification.content}")
Batching Guidance
Batch aggressively — each llm.chat() call has fixed overhead (API latency, token preamble). Aim for ~100-200K characters per sub-call rather than many small calls.
# BAD: 1000 individual calls (slow, expensive)
for line in lines:
result = llm.chat(f"Classify: {line}", branch="")
# GOOD: batch into ~10 calls of 100 lines each
batch_size = 100
for i in range(0, len(lines), batch_size):
batch = "\n".join(lines[i:i+batch_size])
result = llm.chat(f"Classify each line:\n{batch}", branch="")
API Reference
from mcp_handley_lab import llm
result = llm.chat(
prompt: str, # The question/task
branch: str = "session", # Conversation branch ("" for stateless)
model: str = "gemini", # Provider or model name
system_prompt: str = "", # System instructions
temperature: float = 1.0,
options: dict = None, # Provider-specific (grounding, reasoning_effort, etc.)
)
# Result fields (LLMResult):
result.content # Response text
result.usage.input_tokens # Tokens used for input
result.usage.output_tokens # Tokens used for output
result.usage.cost # Cost in USD
result.usage.model_used # Canonical model ID
result.branch # Branch name (empty if stateless)
result.commit_sha # Git commit (None if stateless)
Available if configured in models.yaml with API keys set: gemini, openai, claude, mistral, grok, groq
Tips
- •Use
branch=""for stateless sub-calls (no memory accumulation) - •Use
branch="session"(default) for multi-turn conversations - •Use
model="gemini"for large context windows in sub-calls - •Use delimiters (
<document>...</document>) to prevent prompt injection - •Include byte offsets in prompts to preserve provenance
- •Cache repeated queries in Python variables
- •Use regex/string ops for pattern matching (faster than LLM)
- •Monitor token usage via
result.usageto tune chunk sizes - •Survey the context structure (peek at head, count lines/sections) before choosing a chunking strategy — structure-aware chunking (by headers, paragraphs, functions) beats fixed-size byte chunking