Agent Performance Optimization Workflow
Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.
[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]
Use this skill when
- •Improving an existing agent's performance or reliability
- •Analyzing failure modes, prompt quality, or tool usage
- •Running structured A/B tests or evaluation suites
- •Designing iterative optimization workflows for agents
Do not use this skill when
- •You are building a brand-new agent from scratch
- •There are no metrics, feedback, or test cases available
- •The task is unrelated to agent performance or prompt quality
Instructions
- •Establish baseline metrics and collect representative examples.
- •Identify failure modes and prioritize high-impact fixes.
- •Apply prompt and workflow improvements with measurable goals.
- •Validate with tests and roll out changes in controlled stages.
Safety
- •Avoid deploying prompt changes without regression testing.
- •Roll back quickly if quality or safety metrics regress.
Phase 1: Performance Analysis and Baseline Metrics
Comprehensive analysis of agent performance using context-manager for historical data collection.
1.1 Gather Performance Data
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
Collect metrics including:
- •Task completion rate (successful vs failed tasks)
- •Response accuracy and factual correctness
- •Tool usage efficiency (correct tools, call frequency)
- •Average response time and token consumption
- •User satisfaction indicators (corrections, retries)
- •Hallucination incidents and error patterns
1.2 User Feedback Pattern Analysis
Identify recurring patterns in user interactions:
- •Correction patterns: Where users consistently modify outputs
- •Clarification requests: Common areas of ambiguity
- •Task abandonment: Points where users give up
- •Follow-up questions: Indicators of incomplete responses
- •Positive feedback: Successful patterns to preserve
1.3 Failure Mode Classification
Categorize failures by root cause:
- •Instruction misunderstanding: Role or task confusion
- •Output format errors: Structure or formatting issues
- •Context loss: Long conversation degradation
- •Tool misuse: Incorrect or inefficient tool selection
- •Constraint violations: Safety or business rule breaches
- •Edge case handling: Unusual input scenarios
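Once failures are labeled with these categories, ranking them by frequency is one way to prioritize fixes. The following is a minimal sketch, assuming failures have already been labeled by hand or by a reviewer; the `FailureMode` enum and `rank_failure_modes` function are hypothetical names, not part of any existing tool.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Root-cause categories from the list above (names are illustrative)."""
    INSTRUCTION_MISUNDERSTANDING = "instruction misunderstanding"
    OUTPUT_FORMAT_ERROR = "output format error"
    CONTEXT_LOSS = "context loss"
    TOOL_MISUSE = "tool misuse"
    CONSTRAINT_VIOLATION = "constraint violation"
    EDGE_CASE = "edge case handling"


def rank_failure_modes(labels: list[FailureMode]) -> list[tuple[FailureMode, float]]:
    """Return (mode, share of all failures), most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(mode, n / total) for mode, n in counts.most_common()]
```

Frequency is only one signal for prioritization. A rare constraint violation may still outrank a common formatting error, so the ranking is a starting point rather than a final ordering.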
1.4 Baseline Performance Report
Generate quantitative baseline metrics:
Performance Baseline:
  - Task Success Rate: [X%]
  - Average Corrections per Task: [Y]
  - Tool Call Efficiency: [Z%]
  - User Satisfaction Score: [1-10]
  - Average Response Latency: [X ms]
  - Token Efficiency Ratio: [X:Y]
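Most of the baseline figures can be derived from logged task records. The sketch below assumes a hypothetical `TaskRecord` log schema. The definitions of tool call efficiency and token efficiency are also assumptions, since the template does not define them. User satisfaction is left out because it usually comes from a separate survey or rating source.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One logged agent task; all field names are illustrative assumptions."""
    succeeded: bool
    corrections: int        # user corrections made during the task
    tool_calls: int         # total tool calls issued
    useful_tool_calls: int  # calls judged necessary for the result
    latency_ms: float
    input_tokens: int
    output_tokens: int


def baseline_report(records: list[TaskRecord]) -> dict[str, float]:
    """Aggregate logged tasks into baseline figures."""
    if not records:
        raise ValueError("need at least one record to compute a baseline")
    n = len(records)
    total_calls = sum(r.tool_calls for r in records)
    useful_calls = sum(r.useful_tool_calls for r in records)
    total_in = sum(r.input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    return {
        "task_success_rate_pct": 100 * sum(r.succeeded for r in records) / n,
        "avg_corrections_per_task": sum(r.corrections for r in records) / n,
        # Assumed definition: share of tool calls that were actually needed.
        "tool_call_efficiency_pct": (
            100 * useful_calls / total_calls if total_calls else 100.0
        ),
        "avg_latency_ms": sum(r.latency_ms for r in records) / n,
        # Assumed definition: output tokens produced per input token consumed.
        "output_tokens_per_input_token": total_out / total_in if total_in else 0.0,
    }
```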
Phase 2: Prompt Engineering Improvements
Apply advanced prompt optimization techniques using the prompt-engineer agent.
2.1 Chain-of-Thought Enhancement
Implement structured reasoning patterns:
Use: prompt-engineer
Technique: chain-of-thought-optimization
- •Add explicit reasoning steps: "Let's approach this step-by-step..."
- •Include self-verification checkpoints: "Before proceeding, verify that..."
- •Implement recursive decomposition for complex tasks
- •Add reasoning trace visibility for debugging
2.2 Few-Shot Example Optimization
Curate high-quality examples from successful interactions:
- •Select diverse examples covering common use cases
- •Include edge cases that previously failed
- •Show both positive and negative examples with explanations
- •Order examples from simple to complex
- •Annotate examples with key decision points
Example structure:
Good Example:
  Input: [User request]
  Reasoning: [Step-by-step thought process]
  Output: [Successful response]
  Why this works: [Key success factors]

Bad Example:
  Input: [Similar request]
  Output: [Failed response]
  Why this fails: [Specific issues]
  Correct approach: [Fixed version]
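The example structure can be rendered into a prompt section programmatically, which makes it easier to keep the examples ordered from simple to complex. This is a rough sketch: the `FewShotExample` class, its `complexity` field, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    user_input: str
    output: str
    explanation: str            # why the example works or fails
    is_good: bool
    complexity: int             # assumed manual rating, 1 = simplest
    reasoning: str = ""         # used by good examples
    correct_approach: str = ""  # used by bad examples


def render_example(ex: FewShotExample) -> str:
    """Format one example using the Good/Bad Example layout."""
    if ex.is_good:
        lines = [
            "Good Example:",
            f"Input: {ex.user_input}",
            f"Reasoning: {ex.reasoning}",
            f"Output: {ex.output}",
            f"Why this works: {ex.explanation}",
        ]
    else:
        lines = [
            "Bad Example:",
            f"Input: {ex.user_input}",
            f"Output: {ex.output}",
            f"Why this fails: {ex.explanation}",
            f"Correct approach: {ex.correct_approach}",
        ]
    return "\n".join(lines)


def build_few_shot_block(examples: list[FewShotExample]) -> str:
    """Order examples from simple to complex and join them for the prompt."""
    ordered = sorted(examples, key=lambda ex: ex.complexity)
    return "\n\n".join(render_example(ex) for ex in ordered)
```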
2.3 Role Definition Refinement
Strengthen agent identity and capabilities:
- •Core purpose: Clear, single-sentence mission
- •Expertise domains: Specific knowledge areas
- •Behavioral traits: Personality and interaction style
- •Tool proficiency: Available tools and when to use them
- •Constraints: What the agent should NOT do
- •Success criteria: How to measure task completion
2.4 Constitutional AI Integration
Implement self-correction mechanisms:
Constitutional Principles:
  1. Verify factual accuracy before responding
  2. Self-check for potential biases or harmful content
  3. Validate output format matches requirements
  4. Ensure response completeness
  5. Maintain consistency with previous responses
Add critique-and-revise loops:
- •Initial response generation
- •Self-critique against principles
- •Automatic revision if issues detected
- •Final validation before output
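The four steps above can be sketched as a small control loop. The `generate`, `critique`, and `revise` callables are placeholders for whatever model calls a given agent uses (assumed signatures, not a real API), and the `max_rounds` cap is an added assumption that keeps the loop from running indefinitely.

```python
from typing import Callable


def critique_and_revise(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], list[str]],
    revise: Callable[[str, str, list[str]], str],
    max_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Return the final draft plus any issues still open after revision."""
    draft = generate(task)                     # initial response generation
    for _ in range(max_rounds):
        issues = critique(task, draft)         # self-critique against principles
        if not issues:
            return draft, []                   # nothing left to fix
        draft = revise(task, draft, issues)    # automatic revision
    return draft, critique(task, draft)        # final validation before output
```

Returning the unresolved issues alongside the draft lets the caller decide whether to send the response, fall back to a safer answer, or escalate.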
2.5 Output Format Tuning
Optimize response structure:
- •Structured templates for common tasks
- •Dynamic formatting based on complexity
- •Progressive disclosure for detailed information
- •Markdown optimization for readability
- •Code block formatting with syntax highlighting
- •Table and list generation for data presentation
Phase 3: Testing and Validation
Comprehensive testing framework with A/B comparison.
3.1 Test Suite Development
Create representative test scenarios:
Test Categories:
  1. Golden path scenarios (common successful cases)
  2. Previously failed tasks (regression testing)
  3. Edge cases and corner scenarios
  4. Stress tests (complex, multi-step tasks)
  5. Adversarial inputs (potential breaking points)
  6. Cross-domain tasks (combining capabilities)
3.2 A/B Testing Framework
Compare original vs improved agent:
Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring
Statistical significance testing:
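- •Minimum sample size: 100 tasks per variant (more when the expected improvement is small)
- •Significance level: α = 0.05 (95% confidence)
- •Effect size calculation (Cohen's d for continuous scores; Cohen's h for success rates)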
- •Power analysis for future tests
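For a binary success/failure metric, one common way to run the significance check is a two-proportion z-test. The sketch below uses only the standard library and is one possible approach, not a prescribed one. Note that it reports Cohen's h, the proportion-based counterpart of Cohen's d, because d is defined for continuous scores. Function names are illustrative.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both variants all-pass or all-fail: no evidence
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value


def cohens_h(p_a: float, p_b: float) -> float:
    """Effect size for two proportions (positive when B is better)."""
    return 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))
```

As a usage example, `two_proportion_z_test(70, 100, 85, 100)` compares a 70% baseline against an 85% variant at 100 tasks each. A p-value below 0.05 would meet the threshold stated above.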
3.3 Evaluation Metrics
Comprehensive scoring framework:
Task-Level Metrics:
- •Completion rate (binary success/failure)
- •Correctness score (0-100% accuracy)
- •Efficiency score (steps taken vs optimal)
- •Tool usage appropriateness
- •Response relevance and completeness
Quality Metrics:
- •Hallucination rate (factual errors per response)
- •Consistency score (alignment with previous responses)
- •Format compliance (matches specified structure)
- •Safety score (constraint adherence)
- •User satisfaction prediction
Performance Metrics:
- •Response latency (time to first token)
- •Total generation time
- •Token consumption (input + output)
- •Cost per task (API usage fees)
- •Memory/context efficiency
3.4 Human Evaluation Protocol
Structured human review process:
- •Blind evaluation (evaluators don't know version)
- •Standardized rubric with clear criteria
- •Multiple evaluators per sample (inter-rater reliability)
- •Qualitative feedback collection
- •Preference ranking (A vs B comparison)
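Inter-rater reliability can be quantified with an agreement statistic. The sketch below computes Cohen's kappa for two evaluators labeling the same samples. Picking kappa is an assumption here, since the protocol does not name a statistic, and panels with more than two raters would need a different measure such as Fleiss' kappa.

```python
from collections import Counter


def cohens_kappa(rater_1: list[str], rater_2: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("raters must label the same non-empty set of samples")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```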
Phase 4: Version Control and Deployment
Safe rollout with monitoring and rollback capabilities.
4.1 Version Management
Systematic versioning strategy:
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
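Version strings in this format can be parsed and bumped mechanically. The following is a small sketch: the regular expression assumes lowercase, hyphen-separated agent names like the example above, and the `AgentVersion`, `parse_version`, and `bump` names are hypothetical helpers.

```python
import re
from typing import NamedTuple

# Assumes names such as "customer-support": lowercase words joined by hyphens.
VERSION_RE = re.compile(
    r"^(?P<name>[a-z0-9]+(?:-[a-z0-9]+)*)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


class AgentVersion(NamedTuple):
    name: str
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.name}-v{self.major}.{self.minor}.{self.patch}"


def parse_version(text: str) -> AgentVersion:
    match = VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid agent version: {text!r}")
    return AgentVersion(
        match["name"], int(match["major"]), int(match["minor"]), int(match["patch"])
    )


def bump(version: AgentVersion, part: str) -> AgentVersion:
    """Increment one component and reset the lower-order components to zero."""
    if part == "major":
        return AgentVersion(version.name, version.major + 1, 0, 0)
    if part == "minor":
        return AgentVersion(version.name, version.major, version.minor + 1, 0)
    if part == "patch":
        return AgentVersion(version.name, version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown version part: {part!r}")
```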
Maintain version history:
- •Git-based prompt storage
- •Changelog with improvement details
- •Performance metrics per version
- •Rollback procedures documented
4.2 Staged Rollout
Progressive deployment strategy:
- •Alpha testing: Internal team validation (5% traffic)
- •Beta testing: Selected users (20% traffic)
- •Canary release: Gradual increase (20% → 50%)
- •Full deployment: 100% of traffic, after success criteria are met
- •Monitoring period: 7-day observation window
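One way to implement percentage-based stages is deterministic bucketing, so a given user stays on the same variant as the traffic share grows. This is a minimal sketch under that assumption. The `in_rollout` function and its `salt` parameter are hypothetical, and a real deployment may route traffic at the gateway or feature-flag layer instead.

```python
import hashlib


def in_rollout(user_id: str, rollout_pct: int, salt: str = "agent-rollout") -> bool:
    """Return True if this user should receive the new agent version.

    Each user hashes to a stable bucket in [0, 100). Raising rollout_pct
    only adds users; nobody who already had the new version loses it.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_pct
```

Changing the salt reshuffles the buckets, which can be useful when a later experiment should not reuse the same early-adopter group.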
4.3 Rollback Procedures
Quick recovery mechanism:
Rollback Triggers:
  - Success rate drops >10% from baseline
  - Critical errors increase >5%
  - User complaints spike
  - Cost per task increases >20%
  - Safety violations detected

Rollback Process:
  1. Detect issue via monitoring
  2. Alert team immediately
  3. Switch to previous stable version
  4. Analyze root cause
  5. Fix and re-test before retry
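The numeric triggers can be evaluated automatically against a baseline snapshot. The sketch below assumes the thresholds are relative changes from baseline (the text does not say whether they are relative or absolute percentage points) and uses a hypothetical `Snapshot` structure. The complaint-spike trigger is omitted because no threshold is given for it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Aggregated metrics for one version over one window (illustrative fields)."""
    success_rate: float         # fraction of tasks that succeeded, 0..1
    critical_error_rate: float  # fraction of tasks with a critical error, 0..1
    cost_per_task: float
    safety_violations: int


def _relative_change(new: float, old: float) -> float:
    if old == 0:
        return float("inf") if new > 0 else 0.0
    return (new - old) / old


def rollback_reasons(baseline: Snapshot, current: Snapshot) -> list[str]:
    """Return every trigger that fired; an empty list means no rollback."""
    reasons = []
    if _relative_change(current.success_rate, baseline.success_rate) < -0.10:
        reasons.append("success rate dropped >10% from baseline")
    if _relative_change(current.critical_error_rate, baseline.critical_error_rate) > 0.05:
        reasons.append("critical errors increased >5%")
    if _relative_change(current.cost_per_task, baseline.cost_per_task) > 0.20:
        reasons.append("cost per task increased >20%")
    if current.safety_violations > 0:
        reasons.append("safety violations detected")
    return reasons
```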
4.4 Continuous Monitoring
Real-time performance tracking:
- •Dashboard with key metrics
- •Anomaly detection alerts
- •User feedback collection
- •Automated regression testing
- •Weekly performance reports
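Anomaly alerts can start from something as simple as a z-score check of the latest metric value against recent history. This is only one possible detector, shown as an assumption rather than a recommendation. The function name and the default threshold of 3 standard deviations are illustrative.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than z_threshold sample std devs from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # history was perfectly flat
    return abs(latest - mu) / sigma > z_threshold
```

For example, with daily success rates hovering around 0.90, a day at 0.70 would be flagged while a day at 0.905 would not.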
Success Criteria
Agent improvement is successful when:
- •Task success rate improves by ≥15%
- •User corrections decrease by ≥25%
- •No increase in safety violations
- •Response time remains within 10% of baseline
- •Cost per task increases by no more than 5%
- •Positive user feedback increases
Post-Deployment Review
After 30 days of production use:
- •Analyze accumulated performance data
- •Compare against baseline and targets
- •Identify new improvement opportunities
- •Document lessons learned
- •Plan next optimization cycle
Continuous Improvement Cycle
Establish regular improvement cadence:
- •Weekly: Monitor metrics and collect feedback
- •Monthly: Analyze patterns and plan improvements
- •Quarterly: Major version updates with new capabilities
- •Annually: Strategic review and architecture updates
Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.