DSPy Optimization Workflow for Skills-Fleet
When to Use
Load this skill when you need to:
- •Optimize DSPy programs in skills-fleet for better quality
- •Implement production patterns (monitoring, error handling, ensemble methods)
- •Use the optimization API endpoints (/optimization/start, /optimization/status)
- •Design effective DSPy signatures with Literal types and constraints
- •Create or expand training datasets for robust optimization
- •Troubleshoot optimization issues (low scores, API failures, type errors)
- •Implement advanced patterns (versioning, A/B testing, caching)
This skill documents the complete 3-phase DSPy quality improvement workflow successfully implemented in January 2026.
Quick Start
Run Optimization (Simplest)
# Using the quick optimization script
uv run python scripts/run_optimization.py
# Expected: Runs GEPA optimization with trainset_v4.json (50 examples)
# Saves to: config/optimized/skill_program_gepa_v1.pkl
Run Optimization via API
# 1. Start server
uv run skill-fleet serve
# 2. Trigger optimization
curl -X POST http://localhost:8000/api/v1/optimization/start \
-H "Content-Type: application/json" \
-d '{
"optimizer": "miprov2",
"trainset_file": "config/training/trainset_v4.json",
"auto": "medium"
}'
# 3. Check status (use job_id from response)
curl http://localhost:8000/api/v1/optimization/status/{job_id}
Test Your Implementation
# Run comprehensive validation
uv run python scripts/test_phase_implementation.py
# Expected: 10/10 tests pass
3-Phase Implementation Guide
Phase 1: Foundation (Week 1)
Goal: Enhance signatures, expand training data, add monitoring
Tasks:
- •Enhance Signatures → See references/phase1-implementation.md
  - •Add Literal types for constrained outputs
  - •Specific OutputField constraints with quality indicators
  - •Concise, actionable docstrings
- •Expand Training Data → See references/phase1-implementation.md
  - •Target: 50-100 examples (DSPy recommendation)
  - •Extract from existing skills
  - •Generate synthetic examples for diversity
  - •Use scripts/expand_training_data.py and scripts/generate_synthetic_examples.py
- •Add Monitoring → See references/phase1-implementation.md
  - •ModuleMonitor: Wrap modules for tracking
  - •ExecutionTracer: Collect detailed traces
  - •MLflowLogger: Optional experiment tracking
Example: examples/example_signature.py
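The ModuleMonitor item in the task list above describes wrapping modules for tracking. A minimal sketch of that idea, using only the standard library, might look like the following. The names `CallMonitor` and `CallStats` are hypothetical and are not the skills-fleet implementation; the wrapper works on any callable and records call count, failures, and cumulative latency.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CallStats:
    """Aggregated counters for one wrapped callable."""
    calls: int = 0
    failures: int = 0
    total_seconds: float = 0.0


class CallMonitor:
    """Hypothetical wrapper: tracks calls, failures, and latency of a callable."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self.fn = fn
        self.stats = CallStats()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        self.stats.calls += 1
        try:
            return self.fn(*args, **kwargs)
        except Exception:
            self.stats.failures += 1
            raise  # re-raise so callers still see the original error
        finally:
            # Runs on success and on failure, so latency is always recorded
            self.stats.total_seconds += time.perf_counter() - start


def fake_module(text: str) -> str:
    # Stand-in for a DSPy module call
    if not text:
        raise ValueError("empty input")
    return text.upper()


monitored = CallMonitor(fake_module)
monitored("hello")
try:
    monitored("")
except ValueError:
    pass
print(monitored.stats.calls, monitored.stats.failures)
```

Keeping the counters in a separate dataclass makes it easy to export them to a logger or tracing backend later without touching the wrapper itself.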
Phase 2: Optimization (Week 2)
Goal: Run optimization, implement custom metrics, add error handling
Tasks:
- •Run Optimization → See references/phase2-optimization.md
  - •Use MIPROv2 with auto="medium" for balanced cost/quality
  - •Or GEPA for faster, reflection-based optimization
  - •Configure via API or CLI
- •Enhanced Metrics → See references/phase2-optimization.md
  - •taxonomy_accuracy_metric
  - •metadata_quality_metric
  - •skill_style_alignment_metric
  - •comprehensive_metric (weighted combination)
- •Error Handling → See references/phase2-optimization.md
  - •RobustModule: Retry with exponential backoff
  - •ValidatedModule: Output validation
  - •Phase-specific fallbacks
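The error-handling item above names retry with exponential backoff. A minimal sketch of that pattern, using only the standard library, could look like this. The function `retry_with_backoff` and its parameters are hypothetical names chosen for illustration, not the RobustModule API; the `sleep` argument is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


# Usage: a flaky callable that succeeds on its third call
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays: list = []
result = retry_with_backoff(flaky, max_attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In practice the `except` clause would usually be narrowed to the transient errors worth retrying (timeouts, rate limits) so that programming errors fail fast.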
Example: examples/example_metric.py
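To illustrate what a weighted combination such as comprehensive_metric might look like, here is a small standard-library sketch. Everything in it is an assumption made for illustration: the component metrics are toy exact-match checks on plain dicts, the weights are invented, and the helper names `exact_match` and `weighted_metric` are hypothetical. A real DSPy metric would take `(example, prediction, trace=None)` and read fields from DSPy objects instead.

```python
from typing import Callable, Mapping, Tuple

Metric = Callable[[dict, dict], float]


def exact_match(field: str) -> Metric:
    """Build a toy metric: 1.0 if the field matches the gold value, else 0.0."""
    def metric(example: dict, prediction: dict) -> float:
        return 1.0 if example.get(field) == prediction.get(field) else 0.0
    return metric


def weighted_metric(parts: Mapping[str, Tuple[Metric, float]]) -> Metric:
    """Combine metrics into one score in [0, 1], normalising by total weight."""
    total = sum(weight for _, weight in parts.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def metric(example: dict, prediction: dict) -> float:
        score = sum(fn(example, prediction) * w for fn, w in parts.values())
        return score / total
    return metric


# Hypothetical weights; the real comprehensive_metric may weight differently.
comprehensive = weighted_metric({
    "taxonomy": (exact_match("category"), 0.5),
    "style": (exact_match("style"), 0.3),
    "name": (exact_match("name"), 0.2),
})

gold = {"category": "python", "style": "minimal", "name": "async-io"}
pred = {"category": "python", "style": "comprehensive", "name": "async-io"}
print(round(comprehensive(gold, pred), 2))
```

Normalising by the total weight keeps the combined score on the same 0 to 1 scale as its components, which makes thresholds such as 0.70 comparable across metric versions.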
Phase 3: Advanced Patterns (Week 3)
Goal: Implement ensemble, versioning, caching for production
Tasks:
- •Ensemble Methods → See references/phase3-advanced.md
  - •EnsembleModule: Multiple models, best selection
  - •BestOfN: Generate N, pick highest quality
  - •MajorityVote: Classification consensus
- •Versioning → See references/phase3-advanced.md
  - •ProgramRegistry: Manage multiple versions
  - •ABTestRouter: Gradual rollout, A/B testing
- •Caching → See references/phase3-advanced.md
  - •CachedModule: Multi-level caching (memory + disk)
  - •Significant performance gains (30-50% faster)
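The multi-level caching item above (memory plus disk) can be sketched roughly as follows. This is not the CachedModule implementation: `TwoLevelCache`, its key scheme, and the JSON file layout are all assumptions for illustration, and the sketch only handles JSON-serialisable keyword arguments and results.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


class TwoLevelCache:
    """Hypothetical memory + disk cache keyed by a hash of the keyword arguments."""

    def __init__(self, fn: Callable[..., Any], cache_dir: Path) -> None:
        self.fn = fn
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.memory: dict = {}
        self.misses = 0

    def _key(self, kwargs: dict) -> str:
        payload = json.dumps(kwargs, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def __call__(self, **kwargs: Any) -> Any:
        key = self._key(kwargs)
        if key in self.memory:              # level 1: in-process dict
            return self.memory[key]
        path = self.cache_dir / f"{key}.json"
        if path.exists():                   # level 2: JSON file on disk
            value = json.loads(path.read_text(encoding="utf-8"))
        else:                               # miss: compute and persist
            self.misses += 1
            value = self.fn(**kwargs)
            path.write_text(json.dumps(value), encoding="utf-8")
        self.memory[key] = value
        return value


def slow_call(task: str) -> dict:
    # Stand-in for an expensive LLM-backed module call
    return {"task": task, "length": len(task)}


cache_dir = Path(tempfile.mkdtemp())
cached = TwoLevelCache(slow_call, cache_dir)
first = cached(task="summarise")
second = cached(task="summarise")            # served from memory
fresh = TwoLevelCache(slow_call, cache_dir)  # simulates a restarted process
third = fresh(task="summarise")              # served from disk
print(cached.misses, fresh.misses, first == second == third)
```

The memory level avoids repeated disk reads within one process, while the disk level lets results survive restarts. A production version would also need an invalidation rule, for example including the program version in the key.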
Example: examples/example_ensemble.py
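The selection logic behind the BestOfN and MajorityVote items can be sketched independently of any model calls. The two helpers below are hypothetical illustrations, not the skills-fleet classes: they assume the candidates have already been generated and only show how one would be chosen.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most common label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][0]


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Usage with toy data: three classifier votes and three drafts scored by length
votes = ["python", "devops", "python"]
drafts = ["short", "a much longer draft", "medium draft"]
print(majority_vote(votes))
print(best_of_n(drafts, score=len))
```

Majority voting suits categorical outputs such as taxonomy labels, while best-of-N needs a scoring function, which in this workflow would presumably be one of the quality metrics.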
API Usage Patterns
Complete reference: references/api-reference.md
Key Endpoints
POST /api/v1/optimization/start
- •Trigger background optimization job
- •Supports MIPROv2, GEPA, BootstrapFewShot
- •Uses trainset JSON files or skill paths
GET /api/v1/optimization/status/{job_id}
- •Check optimization progress
- •Returns: status, progress (0-1), result, error
GET /api/v1/optimization/optimizers
- •List available optimizers with parameters
- •Useful for discovering configuration options
Integration Example
# Start optimization programmatically
import asyncio

import httpx

BASE_URL = "http://localhost:8000/api/v1/optimization"


async def main() -> None:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{BASE_URL}/start",
            json={
                "optimizer": "miprov2",
                "trainset_file": "config/training/trainset_v4.json",
                "auto": "medium",
            },
        )
        response.raise_for_status()
        job_id = response.json()["job_id"]

        # Poll for completion
        while True:
            status = await client.get(f"{BASE_URL}/status/{job_id}")
            data = status.json()
            if data["status"] == "completed":
                print(f"Quality score: {data['result']['quality_score']}")
                break
            if data.get("error"):
                # Stop polling if the job reports an error
                print(f"Optimization failed: {data['error']}")
                break
            await asyncio.sleep(5)


asyncio.run(main())
Best Practices & Patterns
Complete guide: references/best-practices.md
DSPy Signature Design
✅ DO:
- •Use Literal types for enums/categories
- •Add specific constraints to OutputField descriptions
- •Include quality indicators ("quality >0.80", "3-5 examples")
- •Keep docstrings concise and actionable
❌ DON'T:
- •Use generic str types when Literal would work
- •Write verbose explanations in docstrings
- •Skip OutputField descriptions
- •Use underscores or spaces in field names
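To show why a Literal annotation is tighter than a plain str, the sketch below checks a value against a Literal type using only the standard library. It does not use DSPy; the alias `SkillStyle` and the helper `validate_style` are hypothetical, and the three allowed values are the skill styles named in this document.

```python
from typing import Literal, get_args

# The set of allowed values lives in the type itself
SkillStyle = Literal["comprehensive", "navigation_hub", "minimal"]


def validate_style(value: str) -> str:
    """Raise ValueError if value is not one of the SkillStyle options."""
    allowed = get_args(SkillStyle)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value


print(validate_style("minimal"))
try:
    validate_style("verbose")
except ValueError as exc:
    print("rejected:", exc)
```

With a plain str field, "verbose" would pass through unnoticed; with a Literal, the allowed set is explicit both to the type checker and to any runtime check built on `get_args`.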
Training Data
✅ DO:
- •Aim for 50-100 diverse examples
- •Include all skill styles (comprehensive, navigation_hub, minimal)
- •Cover all major categories
- •Use both golden and synthetic examples
❌ DON'T:
- •Rely on <20 examples (insufficient for robust optimization)
- •Duplicate examples (reduces effective dataset size)
- •Skip validation of JSON structure
- •Ignore category distribution
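The training-data checks above (structure, duplicates, category distribution) can be automated with a short audit function. The sketch below is a hedged illustration: the required keys are a hypothetical schema, since the actual layout of trainset_v4.json is not shown here, and `audit_trainset` is not an existing script in the repository.

```python
import json
from collections import Counter

# Hypothetical schema: each example is an object with these keys.
REQUIRED_KEYS = ("task_description", "category", "style")


def audit_trainset(examples: list) -> dict:
    """Report unique count, malformed entries, exact duplicates, and category spread."""
    seen = set()
    duplicates = 0
    malformed = 0
    categories = Counter()
    for example in examples:
        if not all(key in example for key in REQUIRED_KEYS):
            malformed += 1
            continue
        # Canonical JSON string so key order does not hide duplicates
        fingerprint = json.dumps(example, sort_keys=True)
        if fingerprint in seen:
            duplicates += 1
            continue
        seen.add(fingerprint)
        categories[example["category"]] += 1
    return {
        "unique": len(seen),
        "duplicates": duplicates,
        "malformed": malformed,
        "categories": dict(categories),
    }


sample = json.loads("""[
  {"task_description": "Parse logs", "category": "python", "style": "minimal"},
  {"task_description": "Parse logs", "category": "python", "style": "minimal"},
  {"task_description": "Deploy app", "category": "devops", "style": "comprehensive"},
  {"task_description": "No category"}
]""")
report = audit_trainset(sample)
print(report)
```

Running an audit like this before optimization makes it obvious when the effective dataset is smaller than the file suggests or when one category dominates.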
Optimization Strategy
✅ DO:
- •Start with GEPA for quick iteration (fast, cheap)
- •Use MIPROv2 auto="medium" for production (balanced)
- •Monitor costs and quality during optimization
- •Evaluate on separate test set
❌ DON'T:
- •Jump straight to auto="heavy" (expensive, often unnecessary)
- •Optimize on entire dataset without train/test split
- •Ignore baseline evaluation
- •Skip monitoring/logging during long runs
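The advice to evaluate on a separate test set implies a reproducible train/test split. One possible sketch, using only the standard library, is shown below; `split_dataset`, the 20% hold-out, and the seed are illustrative assumptions and not values taken from the project.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    examples: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy with a fixed seed, then hold out test_fraction for evaluation."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG: no global state touched
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [f"example-{i}" for i in range(50)]
train, test = split_dataset(examples, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed means the baseline and the optimized program are scored on the same held-out examples, so score differences reflect the optimization and not a different split.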
Troubleshooting
Complete guide: references/troubleshooting.md
Common Issues
Low Quality Scores (<0.70)
- •✓ Check training data diversity (need 50+ examples)
- •✓ Verify signature constraints are specific
- •✓ Review metric function (might be too strict)
- •✓ Try MIPROv2 instead of BootstrapFewShot
API Optimization Job Fails
- •✓ Check trainset JSON structure
- •✓ Verify GOOGLE_API_KEY is set
- •✓ Check server logs for specific error
- •✓ Ensure enough memory (optimization is CPU/memory intensive)
Type Errors in Signatures
- •✓ Add from __future__ import annotations
- •✓ Import types from typing (Literal, etc.)
- •✓ Run uv run ty check src/ to validate
- •✓ Check for unresolved references
Slow Optimization
- •✓ Use GEPA instead of MIPROv2 for faster iteration
- •✓ Reduce num_candidate_programs (default 16 → 8)
- •✓ Lower max_bootstrapped_demos (4 → 2)
- •✓ Use auto="light" instead of "medium"
Utilities & Scripts
Quick Optimization Runner
# Run optimization with sensible defaults
.skills/dspy-optimization-workflow/scripts/quick_optimize.py \
  --trainset config/training/trainset_v4.json \
  --optimizer gepa
Test Custom Metrics
# Test metric against examples
.skills/dspy-optimization-workflow/scripts/test_metrics.py \
  --metric comprehensive_metric \
  --examples 10
Compare Program Versions
# Compare two optimized versions
.skills/dspy-optimization-workflow/scripts/compare_versions.py \
  --v1 config/optimized/program_v1.pkl \
  --v2 config/optimized/program_v2.pkl
Export Monitoring Traces
# Export traces for analysis
.skills/dspy-optimization-workflow/scripts/export_traces.py \
  --output traces_analysis.json
Key Files Reference
Configuration
- •config/training/trainset_v4.json - 50 training examples (ready to use)
- •config/config.yaml - LLM configuration (roles, models)
Core Implementation
- •src/skill_fleet/core/dspy/signatures/ - Enhanced signature definitions
- •src/skill_fleet/core/dspy/metrics/enhanced_metrics.py - Evaluation metrics
- •src/skill_fleet/core/dspy/monitoring/ - Monitoring infrastructure
- •src/skill_fleet/core/dspy/modules/error_handling.py - Error handling wrappers
- •src/skill_fleet/core/dspy/modules/ensemble.py - Ensemble methods
- •src/skill_fleet/core/dspy/versioning.py - Version management
- •src/skill_fleet/core/dspy/caching.py - Caching strategies
API
- •src/skill_fleet/api/routes/optimization.py - Optimization endpoints
Scripts
- •scripts/run_optimization.py - Main optimization runner
- •scripts/test_phase_implementation.py - Comprehensive tests
- •scripts/expand_training_data.py - Training data extraction
- •scripts/generate_synthetic_examples.py - Synthetic example generation
Expected Results
With complete Phase 1-3 implementation:
- •Quality Score: 0.70-0.75 → 0.85-0.90 (+0.15-0.20 absolute)
- •Obra Compliance: ~60% → ~85% (+25 percentage points)
- •Consistency: Much improved with Literal type constraints
- •Performance: 30-50% faster with strategic caching
- •Reliability: Improved with retry logic and fallbacks
- •Observability: Full monitoring and tracing in production
Next Steps
After loading this skill:
- •For new optimization: Start with Phase 1 (signatures + training data)
- •For existing setup: Jump to Quick Start and run optimization
- •For troubleshooting: Check references/troubleshooting.md
- •For API integration: See references/api-reference.md
- •For advanced patterns: Review references/phase3-advanced.md
Implementation Status: ✅ All phases complete (Jan 19, 2026)
Test Results: 10/10 tests passing
Type Checks: ✅ Passing (11 expected MLflow warnings)
Ready for Production: Yes