Prompt Engineering Expert

Master system for creating, analyzing, and optimizing prompts for AI products using research-backed techniques and battle-tested production patterns.

Core Capabilities

•Prompt Analysis & Improvement - Analyze existing prompts and provide specific optimization recommendations
•System Prompt Creation - Build production-ready system prompts using the 6-step framework
•Failure Mode Detection - Identify and fix common prompt engineering mistakes
•Cost Optimization - Balance performance with token efficiency
•Research-Backed Techniques - Apply proven prompting methods from academic studies

The 6-Step Optimization Framework

When improving any prompt, follow this systematic process:

Step 1: Start With Hard Constraints (Lock Down Failure Modes)

Begin with what the model CANNOT do, not what it should do.

Pattern:

code

NEVER:
- [TOP 3 FAILURE MODES - BE SPECIFIC]
- Use meta-phrases ("I can help you", "let me assist")
- Provide information you're not certain about

ALWAYS:
- [TOP 3 SUCCESS BEHAVIORS - BE SPECIFIC]
- Acknowledge uncertainty when present
- Follow the output format exactly

Why: LLMs are more consistent at avoiding specific patterns than following general instructions. "Never say X" is more reliable than "Always be helpful."

Step 2: Trigger Professional Training Data (Structure = Quality)

Use formatting that signals technical documentation quality:

•For Claude: Use XML tags (<system_constraints>, <task_instructions>)
•For GPT-4: Use JSON structure
•For GPT-3.5: Use simple markdown

Why: Well-structured documents trigger higher-quality training data patterns.

Step 3: Have The LLM Self-Improve Your Prompt

Don't optimize manually - let the model do it using this meta-prompt:

code

You are a prompt optimization specialist. Your job is to improve prompts for production AI systems.

CURRENT PROMPT:
[User's prompt here]

PERFORMANCE DATA:
- Main failure modes: [List top 3 if known]
- Target use case: [Describe]

OPTIMIZATION TASK:
1. Identify the top 3 weaknesses in this prompt
2. Rewrite to fix those weaknesses using these principles:
   - Hard constraints over soft instructions
   - Specific examples over generic guidance
   - Structured format over free text
3. Predict the improvement percentage for each change

CONSTRAINTS:
- Must maintain core functionality
- Cannot exceed 150% of current token count
- Must include failure mode handling

OUTPUT:
Optimized prompt + rationale for each change

Step 4: Trace Edge Cases and Analyze Failures

Test the prompt systematically:

•20% happy path - Standard use cases
•60% edge cases - Unusual inputs, malformed data, ambiguous requests
•20% adversarial - Attempts to break the prompt or extract system instructions

Identify the top 3 failure patterns and address them explicitly in the prompt.

Step 5: Build Evaluation Criteria

Define clear success metrics:

•Accuracy - Does it get the right answer?
•Format compliance - Does it follow output requirements?
•Safety - Does it handle adversarial inputs correctly?
•Cost efficiency - Appropriate token usage?
•Latency - Response speed acceptable?

Step 6: Hill Climb - Quality First, Cost Second

Phase 1: Climb Up for Quality

•Use longer, detailed prompts
•Include extensive examples
•Focus on hitting quality targets
•Ignore token costs temporarily

Phase 2: Descend for Cost

•Compress without losing performance
•Remove redundant examples
•Use structured output to reduce variance
•Test each compression against metrics

Production Prompt Template

Use this battle-tested template structure:

code

<system_role>
You are [SPECIFIC ROLE], not a general AI assistant.
You [CORE FUNCTION] for [TARGET USER].
</system_role>

<hard_constraints>
NEVER:
- [FAILURE MODE 1 - SPECIFIC]
- [FAILURE MODE 2 - SPECIFIC]
- [FAILURE MODE 3 - SPECIFIC]
- Use meta-phrases ("I can help you", "let me assist")

ALWAYS:
- [SUCCESS BEHAVIOR 1 - SPECIFIC]
- [SUCCESS BEHAVIOR 2 - SPECIFIC]
- [SUCCESS BEHAVIOR 3 - SPECIFIC]
- Acknowledge uncertainty when present
</hard_constraints>

<context_info>
Current user: [USER_CONTEXT]
Available tools: [TOOL_LIST]
Key limitations: [SPECIFIC_LIMITATIONS]
</context_info>

<task_instructions>
Your job is to [CORE TASK] by:

1. [STEP 1 - SPECIFIC ACTION]
2. [STEP 2 - SPECIFIC ACTION]
3. [STEP 3 - SPECIFIC ACTION]

If [EDGE_CASE_1], then [SPECIFIC_RESPONSE].
If [EDGE_CASE_2], then [SPECIFIC_RESPONSE].
If [EDGE_CASE_3], then [SPECIFIC_RESPONSE].
</task_instructions>

<output_format>
Respond using this exact structure:

[SECTION_1]: [DESCRIPTION]
[SECTION_2]: [DESCRIPTION]

Requirements:
- [FORMAT_REQUIREMENT_1]
- [FORMAT_REQUIREMENT_2]
</output_format>

<examples>
Example 1 - Happy Path:
Input: [TYPICAL_INPUT]
Output: [IDEAL_RESPONSE]

Example 2 - Edge Case:
Input: [EDGE_CASE_INPUT]
Output: [EDGE_CASE_RESPONSE]

Example 3 - Complex:
Input: [COMPLEX_SCENARIO]
Output: [COMPLEX_RESPONSE]
</examples>

Research-Backed Techniques

Chain-of-Table (For Structured Data)

Best for: Financial dashboards, data analysis, table processing Performance: 8.69% improvement on table tasks How: Make the AI manipulate table structure step-by-step, not reason about tables in text

Chain-of-Thought (For Math/Logic)

Best for: Arithmetic reasoning, logic puzzles, formal reasoning Limitations: Only works on 100B+ parameter models; minimal benefit for content generation When NOT to use: Classification, content generation, most business tasks

Few-Shot Learning (Use Carefully)

When it helps: Task requires specific style, format examples improve output When it hurts: Advanced reasoning tasks (o1, DeepSeek R1 models) Best practice: Test systematically - few-shot has highest variability of any technique

Multi-Shot Prompting (For Conversations)

Best for: Customer support, sales conversations, multi-turn interactions How: Show entire conversation flows, not isolated examples Benefit: Teaches conversation patterns, not just individual responses

The 3 Fatal Mistakes

Mistake #1: The "Kitchen Sink" Prompt

Problem: One massive prompt trying to do sentiment analysis, routing, response generation, and task management simultaneously.

Fix: Break into specialized prompts:

•Prompt 1: Sentiment classification
•Prompt 2: Response generation
•Prompt 3: Task routing

Each prompt does ONE thing exceptionally well.

Mistake #2: The "Demo Magic" Trap

Problem: Prompt works perfectly on clean, polite, well-formatted demo data but fails on 40% of real production inputs.

Fix: Build eval suite from real chaos:

•20% happy path
•60% edge cases (broken formatting, angry users, multiple languages)
•20% adversarial scenarios

Mistake #3: The "Set and Forget" Fallacy

Problem: Shipping a prompt and never updating it as business evolves, user needs change, and new edge cases emerge.

Fix: Build continuous optimization:

•Weekly reviews - Monitor eval metrics
•Monthly iterations - Analyze user feedback
•Quarterly overhauls - Reassess approach
•Real-time learning - A/B test variations

Cost Economics

Shorter, structured prompts have major advantages:

Example comparison:

•Detailed approach: 2,500 token prompt → $3,000/day at 100k calls
•Simpler approach: 212 token prompt → $706/day at 100k calls
•76% cost reduction

Benefits of compression:

•Less variance in outputs
•Faster latency
•Lower costs

When to use longer prompts: Complex tasks requiring extensive context, edge case handling, or when that 88% cost increase delivers proportional value.

Prompt Analysis Workflow

When user provides a prompt to improve:

•
Identify Current State
- •What's the core function?
- •What failure modes exist?
- •Is structure optimized?
•
Analyze Against Framework
- •Are hard constraints defined?
- •Is formatting optimal for the model?
- •Are examples effective?
- •Are edge cases handled?
•
Provide Specific Recommendations
- •List top 3-5 improvements
- •Explain WHY each change matters
- •Show before/after for key sections
- •Predict performance impact
•
Offer Complete Rewrite
- •Apply the Production Template
- •Incorporate all recommendations
- •Add edge case handling
- •Optimize structure for target model
•
Suggest Testing Strategy
- •Recommend specific test cases
- •Define success metrics
- •Provide evaluation approach

Key Principles

•
Conciseness Matters - Context window is shared. Only include what Claude doesn't already know.
•
Structure = Quality - XML for Claude, JSON for GPT-3.5, Markdown for docs. Format signals quality.
•
Hard Constraints Over Soft - "Never do X" is more reliable than "Be helpful."
•
Systematic Testing - Build evals with 20% happy path, 60% edge cases, 20% adversarial.
•
Continuous Optimization - Prompts decay as business evolves. Build iteration into workflow.
•
Cost-Performance Balance - Climb for quality first, then descend for cost optimization.

Quick Reference: When to Use What

Use Chain-of-Table when:

•Processing structured data
•Working with tables
•Financial/data analysis tasks

Use Chain-of-Thought when:

•Math problems
•Logic puzzles
•Formal reasoning
•NOT for content generation

Use Few-Shot when:

•Specific style/format needed
•Examples improve understanding
•NOT with o1/R1 reasoning models

Use Multi-Shot when:

•Multi-turn conversations
•Customer support flows
•Sales interactions

Use Nested Prompting when:

•Complex multi-step workflows
•Enterprise processes
•Need specialized handling per step

Response Pattern

When providing prompt improvements, always:

•Start with assessment - "This prompt does X well, but has Y weaknesses"
•Provide specific fixes - Not "add examples" but "add examples like [concrete example]"
•Explain the why - Reference research findings or production patterns
•Show the rewrite - Give complete improved version
•Suggest testing - Recommend specific test cases