Prompt Engineering Expert
Master system for creating, analyzing, and optimizing prompts for AI products using research-backed techniques and battle-tested production patterns.
Core Capabilities
- Prompt Analysis & Improvement - Analyze existing prompts and provide specific optimization recommendations
- System Prompt Creation - Build production-ready system prompts using the 6-step framework
- Failure Mode Detection - Identify and fix common prompt engineering mistakes
- Cost Optimization - Balance performance with token efficiency
- Research-Backed Techniques - Apply proven prompting methods from academic studies
The 6-Step Optimization Framework
When improving any prompt, follow this systematic process:
Step 1: Start With Hard Constraints (Lock Down Failure Modes)
Begin with what the model CANNOT do, not what it should do.
Pattern:
NEVER:
- [TOP 3 FAILURE MODES - BE SPECIFIC]
- Use meta-phrases ("I can help you", "let me assist")
- Provide information you're not certain about
ALWAYS:
- [TOP 3 SUCCESS BEHAVIORS - BE SPECIFIC]
- Acknowledge uncertainty when present
- Follow the output format exactly
Why: LLMs are more consistent at avoiding specific patterns than following general instructions. "Never say X" is more reliable than "Always be helpful."
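A minimal sketch of the pattern above, assuming you keep constraints as plain Python lists. The helper names `build_constraints` and `violated_phrases` are hypothetical, not part of any library; the second one is a simple substring check you could run over model outputs during testing.

```python
# Hypothetical helpers for illustration; names are not from any library.
META_PHRASES = ("I can help you", "let me assist")


def build_constraints(never, always):
    """Render NEVER/ALWAYS lists in the plain-text pattern shown above."""
    lines = ["NEVER:"]
    lines += [f"- {item}" for item in never]
    lines.append("ALWAYS:")
    lines += [f"- {item}" for item in always]
    return "\n".join(lines)


def violated_phrases(output, banned=META_PHRASES):
    """Return the banned phrases found in a model output (case-insensitive)."""
    lowered = output.lower()
    return [phrase for phrase in banned if phrase.lower() in lowered]


if __name__ == "__main__":
    print(build_constraints(
        never=["Quote prices that are not in the catalog"],
        always=["Acknowledge uncertainty when present"],
    ))
```

A substring check only catches exact wording, so treat it as a first-pass filter rather than a full compliance test.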
Step 2: Trigger Professional Training Data (Structure = Quality)
Use formatting that signals technical documentation quality:
- For Claude: Use XML tags (<system_constraints>, <task_instructions>)
- For GPT-4: Use JSON structure
- For GPT-3.5: Use simple markdown
Why: Well-structured documents trigger higher-quality training data patterns.
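One way to apply the per-model formatting above is to keep prompt sections in a dictionary and render them differently per target. This is a sketch under assumptions: `format_sections` and the model-family labels are hypothetical names chosen for illustration.

```python
import json


def format_sections(sections, model_family):
    """Render named prompt sections in the format suggested for each model family.

    `model_family` values ("claude", "gpt-4", "gpt-3.5") are illustrative labels,
    not official model identifiers.
    """
    if model_family == "claude":
        # XML-style tags around each section
        return "\n".join(
            f"<{name}>\n{body}\n</{name}>" for name, body in sections.items()
        )
    if model_family == "gpt-4":
        # JSON structure
        return json.dumps(sections, indent=2)
    if model_family == "gpt-3.5":
        # Simple markdown headings
        return "\n\n".join(f"## {name}\n{body}" for name, body in sections.items())
    raise ValueError(f"unknown model family: {model_family}")


if __name__ == "__main__":
    demo = {"task_instructions": "Summarize the ticket in two sentences."}
    print(format_sections(demo, "claude"))
```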
Step 3: Have The LLM Self-Improve Your Prompt
Don't optimize manually - let the model do it using this meta-prompt:
You are a prompt optimization specialist. Your job is to improve prompts for production AI systems.
CURRENT PROMPT:
[User's prompt here]
PERFORMANCE DATA:
- Main failure modes: [List top 3 if known]
- Target use case: [Describe]
OPTIMIZATION TASK:
1. Identify the top 3 weaknesses in this prompt
2. Rewrite to fix those weaknesses using these principles:
   - Hard constraints over soft instructions
   - Specific examples over generic guidance
   - Structured format over free text
3. Predict the improvement percentage for each change
CONSTRAINTS:
- Must maintain core functionality
- Cannot exceed 150% of current token count
- Must include failure mode handling
OUTPUT: Optimized prompt + rationale for each change
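The 150% token constraint in the meta-prompt can be checked mechanically after the model returns its rewrite. The sketch below assumes a whitespace word count as a rough stand-in for tokens; `within_token_budget` is a hypothetical helper, and you would pass your model's real tokenizer as `count_tokens` for accurate numbers.

```python
def within_token_budget(current_prompt, optimized_prompt, max_ratio=1.5, count_tokens=None):
    """Check that the optimized prompt stays within max_ratio of the original size.

    Assumption: without a tokenizer, whitespace-separated words approximate tokens.
    """
    count = count_tokens or (lambda text: len(text.split()))
    return count(optimized_prompt) <= max_ratio * count(current_prompt)


if __name__ == "__main__":
    original = "Answer billing questions politely and briefly."
    rewrite = "NEVER guess at refund amounts. ALWAYS cite the invoice number."
    print(within_token_budget(original, rewrite))
```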
Step 4: Trace Edge Cases and Analyze Failures
Test the prompt systematically:
- 20% happy path - Standard use cases
- 60% edge cases - Unusual inputs, malformed data, ambiguous requests
- 20% adversarial - Attempts to break the prompt or extract system instructions
Identify the top 3 failure patterns and address them explicitly in the prompt.
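To turn the 20/60/20 split into concrete test counts, a small helper is enough. This is a sketch; `suite_sizes` is a hypothetical name, and rounding leftovers are assigned to the edge-case bucket so the counts always sum to the requested total.

```python
def suite_sizes(total):
    """Split a test budget into 20% happy path, 60% edge cases, 20% adversarial."""
    if total < 0:
        raise ValueError("total must be non-negative")
    happy = round(total * 0.20)
    adversarial = round(total * 0.20)
    edge = total - happy - adversarial  # remainder goes to edge cases
    return {"happy_path": happy, "edge_cases": edge, "adversarial": adversarial}


if __name__ == "__main__":
    print(suite_sizes(50))
```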
Step 5: Build Evaluation Criteria
Define clear success metrics:
- Accuracy - Does it get the right answer?
- Format compliance - Does it follow output requirements?
- Safety - Does it handle adversarial inputs correctly?
- Cost efficiency - Appropriate token usage?
- Latency - Response speed acceptable?
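The five metrics above can be tracked with a small record per test run and one summary function. This is a sketch under assumptions: `RunResult` and `summarize` are hypothetical names, and how each boolean is judged (exact match, a grader, human review) is left to your eval setup.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one test case; field names are illustrative."""
    correct: bool      # accuracy
    format_ok: bool    # format compliance
    safe: bool         # adversarial input handled correctly
    tokens: int        # total tokens used (cost proxy)
    latency_s: float   # response time in seconds


def summarize(results):
    """Aggregate per-run results into the five metrics listed above."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "format_compliance": sum(r.format_ok for r in results) / n,
        "safety": sum(r.safe for r in results) / n,
        "avg_tokens": sum(r.tokens for r in results) / n,
        "max_latency_s": max(r.latency_s for r in results),
    }
```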
Step 6: Hill Climb - Quality First, Cost Second
Phase 1: Climb Up for Quality
- Use longer, detailed prompts
- Include extensive examples
- Focus on hitting quality targets
- Ignore token costs temporarily
Phase 2: Descend for Cost
- Compress without losing performance
- Remove redundant examples
- Use structured output to reduce variance
- Test each compression against metrics
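Phase 2 can be read as a greedy loop: try progressively shorter candidates and keep one only if it still meets the quality bar. The sketch below assumes you supply the compressed candidates and a `score` function backed by your eval suite; `descend_for_cost` is a hypothetical name, and character length stands in for token count.

```python
def descend_for_cost(prompt, candidates, score, quality_floor):
    """Accept a shorter candidate only if its eval score stays at or above the floor.

    Assumptions: `score` maps a prompt to a quality number from your own evals,
    and len() is a rough proxy for token count.
    """
    best = prompt
    # Walk from the longest candidate to the shortest
    for candidate in sorted(candidates, key=len, reverse=True):
        if len(candidate) < len(best) and score(candidate) >= quality_floor:
            best = candidate
    return best
```

Because every candidate is re-scored, a compression that drops quality is rejected even if a longer one before it passed.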
Production Prompt Template
Use this battle-tested template structure:
<system_role>
You are [SPECIFIC ROLE], not a general AI assistant.
You [CORE FUNCTION] for [TARGET USER].
</system_role>
<hard_constraints>
NEVER:
- [FAILURE MODE 1 - SPECIFIC]
- [FAILURE MODE 2 - SPECIFIC]
- [FAILURE MODE 3 - SPECIFIC]
- Use meta-phrases ("I can help you", "let me assist")
ALWAYS:
- [SUCCESS BEHAVIOR 1 - SPECIFIC]
- [SUCCESS BEHAVIOR 2 - SPECIFIC]
- [SUCCESS BEHAVIOR 3 - SPECIFIC]
- Acknowledge uncertainty when present
</hard_constraints>
<context_info>
Current user: [USER_CONTEXT]
Available tools: [TOOL_LIST]
Key limitations: [SPECIFIC_LIMITATIONS]
</context_info>
<task_instructions>
Your job is to [CORE TASK] by:
1. [STEP 1 - SPECIFIC ACTION]
2. [STEP 2 - SPECIFIC ACTION]
3. [STEP 3 - SPECIFIC ACTION]
If [EDGE_CASE_1], then [SPECIFIC_RESPONSE].
If [EDGE_CASE_2], then [SPECIFIC_RESPONSE].
If [EDGE_CASE_3], then [SPECIFIC_RESPONSE].
</task_instructions>
<output_format>
Respond using this exact structure:
[SECTION_1]: [DESCRIPTION]
[SECTION_2]: [DESCRIPTION]
Requirements:
- [FORMAT_REQUIREMENT_1]
- [FORMAT_REQUIREMENT_2]
</output_format>
<examples>
Example 1 - Happy Path:
Input: [TYPICAL_INPUT]
Output: [IDEAL_RESPONSE]
Example 2 - Edge Case:
Input: [EDGE_CASE_INPUT]
Output: [EDGE_CASE_RESPONSE]
Example 3 - Complex:
Input: [COMPLEX_SCENARIO]
Output: [COMPLEX_RESPONSE]
</examples>
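Before shipping a prompt built from this template, it helps to confirm that no bracketed placeholder was left unfilled. The check below is a sketch: `unfilled_placeholders` is a hypothetical helper, and the regular expression assumes placeholders are written in upper case inside square brackets, as in the template above.

```python
import re

# Matches bracketed upper-case placeholders such as [SPECIFIC ROLE] or [EDGE_CASE_1]
PLACEHOLDER = re.compile(r"\[[A-Z][A-Z0-9_ -]*\]")


def unfilled_placeholders(prompt):
    """Return every template placeholder still present in the prompt, in order."""
    return PLACEHOLDER.findall(prompt)


if __name__ == "__main__":
    draft = "You are [SPECIFIC ROLE], not a general AI assistant."
    print(unfilled_placeholders(draft))
```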
Research-Backed Techniques
Chain-of-Table (For Structured Data)
Best for: Financial dashboards, data analysis, table processing
Performance: 8.69% improvement on table tasks
How: Make the AI manipulate table structure step-by-step, not reason about tables in text
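A rough sketch of the step-by-step idea: the table is transformed by one small operation at a time, and every intermediate table is kept so it can be shown back to the model. In a real setup the model would choose each operation; here the chain is hard-coded, and all function names are hypothetical.

```python
def filter_rows(table, column, value):
    """Keep only rows where `column` equals `value`."""
    return [row for row in table if row[column] == value]


def select_columns(table, columns):
    """Keep only the named columns."""
    return [{c: row[c] for c in columns} for row in table]


OPERATIONS = {"filter_rows": filter_rows, "select_columns": select_columns}


def apply_chain(table, chain):
    """Apply each (operation, kwargs) step in order and record every intermediate table."""
    history = [table]
    for name, kwargs in chain:
        table = OPERATIONS[name](table, **kwargs)
        history.append(table)
    return history


if __name__ == "__main__":
    sales = [
        {"region": "EU", "quarter": "Q1", "revenue": 10},
        {"region": "US", "quarter": "Q1", "revenue": 20},
    ]
    steps = [("filter_rows", {"column": "region", "value": "EU"})]
    for snapshot in apply_chain(sales, steps):
        print(snapshot)
```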
Chain-of-Thought (For Math/Logic)
Best for: Arithmetic reasoning, logic puzzles, formal reasoning
Limitations: Gains were reported mainly for very large models (roughly 100B+ parameters); minimal benefit for content generation
When NOT to use: Classification, content generation, most business tasks
Few-Shot Learning (Use Carefully)
When it helps: Task requires specific style, format examples improve output
When it hurts: Advanced reasoning tasks (o1, DeepSeek R1 models)
Best practice: Test systematically - few-shot has highest variability of any technique
Multi-Shot Prompting (For Conversations)
Best for: Customer support, sales conversations, multi-turn interactions
How: Show entire conversation flows, not isolated examples
Benefit: Teaches conversation patterns, not just individual responses
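One way to show entire conversation flows is to render each example conversation as a transcript inside an examples block. This is a sketch; `build_examples_block` and the speaker labels are hypothetical, and the `<examples>` wrapper simply mirrors the production template.

```python
def render_conversation(turns):
    """Render (speaker, text) pairs as one transcript, one turn per line."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def build_examples_block(conversations):
    """Wrap full example conversations in an <examples> block for a system prompt."""
    parts = []
    for index, turns in enumerate(conversations, start=1):
        parts.append(f"Example conversation {index}:\n{render_conversation(turns)}")
    return "<examples>\n" + "\n\n".join(parts) + "\n</examples>"


if __name__ == "__main__":
    refund_flow = [
        ("User", "I was charged twice."),
        ("Agent", "I see two charges on your account. Which one should I refund?"),
        ("User", "The second one."),
        ("Agent", "Done. The refund will appear on your statement."),
    ]
    print(build_examples_block([refund_flow]))
```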
The 3 Fatal Mistakes
Mistake #1: The "Kitchen Sink" Prompt
Problem: One massive prompt trying to do sentiment analysis, routing, response generation, and task management simultaneously.
Fix: Break into specialized prompts:
- Prompt 1: Sentiment classification
- Prompt 2: Response generation
- Prompt 3: Task routing
Each prompt does ONE thing exceptionally well.
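The split above can be wired together as three separate calls. This is a sketch under assumptions: `call_model(prompt, text)` is a hypothetical callable you would back with your own API client, and the three prompt strings are placeholders rather than tuned prompts.

```python
# Placeholder prompts; each one covers a single job.
SENTIMENT_PROMPT = "Classify the sentiment of the message as positive, neutral, or negative."
RESPONSE_PROMPT = "Write a short support reply to the message, given its sentiment."
ROUTING_PROMPT = "Name the single team queue that should handle the message."


def handle_message(message, call_model):
    """Run three specialized prompts instead of one kitchen-sink prompt.

    `call_model(prompt, text)` is a hypothetical function returning the model's
    text output for one system prompt and one input.
    """
    sentiment = call_model(SENTIMENT_PROMPT, message)
    reply = call_model(RESPONSE_PROMPT, f"Sentiment: {sentiment}\nMessage: {message}")
    queue = call_model(ROUTING_PROMPT, message)
    return {"sentiment": sentiment, "reply": reply, "queue": queue}
```

Keeping the prompts separate also lets you evaluate and iterate on each one independently.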
Mistake #2: The "Demo Magic" Trap
Problem: Prompt works perfectly on clean, polite, well-formatted demo data but fails on 40% of real production inputs.
Fix: Build eval suite from real chaos:
- 20% happy path
- 60% edge cases (broken formatting, angry users, multiple languages)
- 20% adversarial scenarios
Mistake #3: The "Set and Forget" Fallacy
Problem: Shipping a prompt and never updating it as business evolves, user needs change, and new edge cases emerge.
Fix: Build continuous optimization:
- Weekly reviews - Monitor eval metrics
- Monthly iterations - Analyze user feedback
- Quarterly overhauls - Reassess approach
- Real-time learning - A/B test variations
Cost Economics
Shorter, structured prompts have major advantages:
Example comparison:
- Detailed approach: 2,500 token prompt → $3,000/day at 100k calls
- Simpler approach: 212 token prompt → $706/day at 100k calls
- 76% cost reduction
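The 76% figure follows directly from the two daily costs in the comparison. The arithmetic is sketched below; the constant and function names are hypothetical, and the dollar amounts are the ones quoted above.

```python
CALLS_PER_DAY = 100_000
DETAILED_DAILY_COST = 3000.0  # 2,500-token prompt, from the comparison above
SIMPLE_DAILY_COST = 706.0     # 212-token prompt, from the comparison above


def cost_reduction(before, after):
    """Fraction of cost saved when moving from `before` to `after`."""
    return (before - after) / before


def cost_per_call(daily_cost, calls_per_day):
    """Average cost of a single call."""
    return daily_cost / calls_per_day


if __name__ == "__main__":
    reduction = cost_reduction(DETAILED_DAILY_COST, SIMPLE_DAILY_COST)
    print(f"Cost reduction: {reduction:.0%}")  # prints "Cost reduction: 76%"
```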
Benefits of compression:
- Less variance in outputs
- Faster latency
- Lower costs
When to use longer prompts: Complex tasks requiring extensive context or edge case handling, or when the added cost (more than 4x per day in the example above) delivers proportional value.
Prompt Analysis Workflow
When user provides a prompt to improve:
- Identify Current State
  - What's the core function?
  - What failure modes exist?
  - Is structure optimized?
- Analyze Against Framework
  - Are hard constraints defined?
  - Is formatting optimal for the model?
  - Are examples effective?
  - Are edge cases handled?
- Provide Specific Recommendations
  - List top 3-5 improvements
  - Explain WHY each change matters
  - Show before/after for key sections
  - Predict performance impact
- Offer Complete Rewrite
  - Apply the Production Template
  - Incorporate all recommendations
  - Add edge case handling
  - Optimize structure for target model
- Suggest Testing Strategy
  - Recommend specific test cases
  - Define success metrics
  - Provide evaluation approach
Key Principles
- Conciseness Matters - Context window is shared. Only include what the model doesn't already know.
- Structure = Quality - XML for Claude, JSON for GPT-4, simple Markdown for GPT-3.5. Format signals quality.
- Hard Constraints Over Soft - "Never do X" is more reliable than "Be helpful."
- Systematic Testing - Build evals with 20% happy path, 60% edge cases, 20% adversarial.
- Continuous Optimization - Prompts decay as business evolves. Build iteration into workflow.
- Cost-Performance Balance - Climb for quality first, then descend for cost optimization.
Quick Reference: When to Use What
Use Chain-of-Table when:
- Processing structured data
- Working with tables
- Financial/data analysis tasks
Use Chain-of-Thought when:
- Math problems
- Logic puzzles
- Formal reasoning
- NOT for content generation
Use Few-Shot when:
- Specific style/format needed
- Examples improve understanding
- NOT with o1/R1 reasoning models
Use Multi-Shot when:
- Multi-turn conversations
- Customer support flows
- Sales interactions
Use Nested Prompting when:
- Complex multi-step workflows
- Enterprise processes
- Need specialized handling per step
Response Pattern
When providing prompt improvements, always:
- Start with assessment - "This prompt does X well, but has Y weaknesses"
- Provide specific fixes - Not "add examples" but "add examples like [concrete example]"
- Explain the why - Reference research findings or production patterns
- Show the rewrite - Give complete improved version
- Suggest testing - Recommend specific test cases