LLM Application Patterns
Architecture Pattern Selection
| Pattern | Use When | Complexity |
|---|---|---|
| Single prompt | Classification, extraction, simple Q&A | Low |
| Chain/pipeline | Multi-step transformations, routing | Medium |
| RAG | Knowledge retrieval from docs | Medium |
| Agent with tools | External actions, multi-step reasoning | High |
| Multi-agent | Complex workflows, specialized sub-tasks | Very High |
Decision rule: Use the simplest pattern that solves the problem. A single well-structured prompt beats a complex chain 80% of the time.
Prompting Strategies
Strategy Selection
| Task Type | Strategy | Avoid |
|---|---|---|
| Classification | Few-shot with labels | CoT (overthinks simple tasks) |
| Reasoning / Math | CoT with verification | Zero-shot (unreliable) |
| Multi-step tasks | ReAct / tool-use | Single-shot (misses steps) |
| Extraction | Structured output + schema | Free-form (inconsistent) |
| Creative | System prompt + constraints | Over-constraining |
Few-Shot Prompting
python
SENTIMENT_PROMPT = """Classify the sentiment as positive, negative, or neutral.
Review: "The food was amazing and the service was quick."
Sentiment: positive
Review: "Waited 45 minutes and the order was wrong."
Sentiment: negative
Review: "It was okay, nothing special."
Sentiment: neutral
Review: "{review}"
Sentiment:"""
- •3-5 examples is the sweet spot (diminishing returns after)
- •Cover all label classes in examples
- •Vary example order across runs to check for position bias
Chain-of-Thought (CoT)
python
COT_PROMPT = """Solve step by step. Show reasoning, then give final answer as "Answer: <value>".
Question: {question}
Let me think step by step:"""
"Let's think step by step" works for large models (70B+). Smaller models often produce plausible-sounding but wrong reasoning. Verify CoT actually helps on your task before committing.
Structured Output
python
# Anthropic -- tool use for structured output
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
tools=[{
"name": "extract_info",
"description": "Extract structured information from text",
"input_schema": {
"type": "object",
"properties": {
"company_name": {"type": "string"},
"revenue_millions": {"type": "number", "description": "Revenue in millions USD"},
"fiscal_year": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
},
"required": ["company_name", "sentiment"],
},
}],
tool_choice={"type": "tool", "name": "extract_info"},
messages=[{"role": "user", "content": f"Extract info from: {text}"}],
)
result = response.content[0].input # Parsed dict
python
# OpenAI -- structured outputs with Pydantic
from openai import OpenAI
from pydantic import BaseModel
class CompanyInfo(BaseModel):
company_name: str
revenue_millions: float | None
fiscal_year: str | None
sentiment: str
client = OpenAI()
completion = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[{"role": "user", "content": f"Extract info from: {text}"}],
response_format=CompanyInfo,
)
result = completion.choices[0].message.parsed # CompanyInfo instance
ReAct / Tool Use
python
# Anthropic tool use
import anthropic
client = anthropic.Anthropic()
tools = [
{
"name": "search_database",
"description": "Search internal knowledge base. Returns relevant documents.",
"input_schema": {
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
},
},
{
"name": "calculate",
"description": "Evaluate a math expression.",
"input_schema": {
"type": "object",
"properties": {"expression": {"type": "string"}},
"required": ["expression"],
},
},
]
def agent_loop(question: str, max_steps: int = 5) -> str:
messages = [{"role": "user", "content": question}]
for _ in range(max_steps):
response = client.messages.create(
model="claude-sonnet-4-5-20250929", max_tokens=1024,
tools=tools, messages=messages,
)
if response.stop_reason == "end_turn":
return response.content[0].text
# Execute tool calls
tool_results = []
for block in response.content:
if block.type == "tool_use":
result = execute_tool(block.name, block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": str(result),
})
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": tool_results})
return "Max steps reached"
Memory / Context Management
| Conversation Length | Strategy | Implementation |
|---|---|---|
| < 10 messages | Full history | Pass all messages directly |
| 10-50 messages | Sliding window | Keep last K messages + system prompt |
| 50+ messages | Summarize + recent | Summarize old turns, keep recent 5-10 |
| Entity tracking | Structured state | Extract entities into dict, inject as context |
| Large corpus | Semantic retrieval | Embed messages, retrieve relevant history |
python
def manage_context(messages: list, max_tokens: int = 3000) -> list:
"""Sliding window with summarization fallback."""
if count_tokens(messages) <= max_tokens:
return messages
# Keep system prompt + last 5 turns
system = [m for m in messages if m["role"] == "system"]
recent = messages[-10:] # Last 5 turns (user + assistant)
if count_tokens(system + recent) <= max_tokens:
return system + recent
# Summarize if still too long
old = messages[len(system):-10]
summary = summarize_messages(old)
return system + [{"role": "user", "content": f"Previous context summary: {summary}"}] + recent
RAG Integration
Chunking Strategy
| Document Type | Chunk Size | Overlap |
|---|---|---|
| Technical docs | 500-1000 tokens | 10-20% |
| Code | 300-500 tokens | 50 tokens |
| Chat logs | 200-300 tokens | 50 tokens |
Retrieval Pipeline
- •Multi-query: generate 3-5 query variations for ambiguous questions
- •Hybrid search: dense (vector) + sparse (BM25) with RRF fusion
- •Rerank: cross-encoder on top 20-50 candidates → return top 3-5
- •Cite: include source markers
[1],[2]in generation prompt
Prompt Versioning
python
from dataclasses import dataclass, field
import hashlib
@dataclass
class PromptVersion:
name: str
template: str
model: str
temperature: float = 0.0
version: str = field(default="")
def __post_init__(self):
if not self.version:
content = f"{self.template}{self.model}{self.temperature}"
self.version = hashlib.sha256(content.encode()).hexdigest()[:8]
def render(self, **kwargs) -> str:
return self.template.format(**kwargs)
Evaluation Harness
python
def evaluate_prompt(client, prompt_version, test_cases, parse_fn):
results = []
for case in test_cases:
rendered = prompt_version.render(**case["inputs"])
response = client.messages.create(
model=prompt_version.model, max_tokens=1024,
messages=[{"role": "user", "content": rendered}],
)
prediction = parse_fn(response.content[0].text)
results.append({
"expected": case["expected"],
"predicted": prediction,
"correct": prediction == case["expected"],
})
accuracy = sum(r["correct"] for r in results) / len(results)
return {"version": prompt_version.version, "accuracy": accuracy, "results": results}
Production Guardrails
Cost Control
- •Cache identical queries (hash prompt + model + temperature)
- •Route simple tasks to cheaper/smaller models
- •Summarize history before exceeding context window
- •Monitor token usage by endpoint
Reliability
- •Set timeout limits on all LLM calls
- •Implement retry with exponential backoff for rate limits
- •Fallback to simpler model on primary model failure
- •Validate tool inputs before execution
Observability
- •Log: prompt version, model, tokens used, latency, response hash
- •Track agent tool selection accuracy
- •Monitor hallucination rate via groundedness checks
- •Alert on latency p95/p99 regressions
Gotchas
Position Bias
Models favor options at certain positions (often first/last). For MCQ eval, rotate answer positions and average.
Lost-in-the-Middle
Information in the middle of long contexts is retrieved less reliably. Put critical context at the beginning or end.
Common Anti-Patterns
- •Building complex chains when a single well-structured prompt suffices
- •Temperature=0 for creative tasks (deterministic != best quality)
- •Not testing adversarial/edge cases in prompt evaluation
- •Assuming a prompt that works on GPT-4 transfers to smaller models
- •Storing entire conversation history without windowing (context overflow + cost explosion)
- •Generic tool descriptions (confuses agent tool selection)
- •No fallback for LLM failures (always handle rate limits and timeouts)
Cross-References
- •ai-ml:rag-and-vector-search -- retrieval-augmented generation, chunking, embedding strategies
- •ai-ml:structured-output-patterns -- JSON mode, function calling, constrained decoding
- •ai-ml:agentic-systems-design -- tool use, multi-agent orchestration, planning loops