Model Evaluation Expert

Name: evaluating-models
Rating: 78
Author: Alanlee0323

When to use this skill

•When a training run completes.
•When the user asks "How is the model performing?".
•When comparing multiple experimental runs.
•When generating reports for stakeholders.

Core Capabilities

•Deep Metric Analysis:
- Confusion Matrix: Identify which classes are confused with each other (e.g., "Model consistently mistakes 'background' for 'person'").
- PR Curve (Precision-Recall): Analyze the trade-off. Is the model aggressive (high recall, more false positives) or conservative (high precision, more missed detections)?
- F1-Score: Check the harmonic mean of precision and recall at different confidence thresholds.
•Historical Benchmarking (MLflow):
- Rule: NEVER evaluate a run in isolation.
- Action: Query MLflow for the best_run (highest mAP50 or mAP50-95).
- Comparison: Calculate the delta against that run (e.g., "mAP50 improved by +2.5 percentage points vs baseline").
•Objective Recommendations:
- Based on the data, suggest next steps (e.g., "Increase background images", "Tune IoU threshold", "Add more samples for class X").
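
As a rough illustration of the metric analysis described above, the sketch below ranks the off-diagonal cells of a confusion matrix and picks the confidence threshold with the highest F1. The function names and the sample inputs are hypothetical, and the row = true class, column = predicted class layout is an assumption; check which orientation your training tool uses before interpreting the pairs.

```python
from typing import List, Sequence, Tuple


def top_confusions(
    matrix: Sequence[Sequence[int]], class_names: Sequence[str], k: int = 3
) -> List[Tuple[int, str, str]]:
    """Return the k largest off-diagonal cells as (count, true_class, predicted_class).

    Assumes rows are true classes and columns are predicted classes.
    """
    pairs = [
        (matrix[i][j], class_names[i], class_names[j])
        for i in range(len(class_names))
        for j in range(len(class_names))
        if i != j and matrix[i][j] > 0
    ]
    pairs.sort(reverse=True)  # largest counts first
    return pairs[:k]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(
    points: Sequence[Tuple[float, float, float]]
) -> Tuple[float, float]:
    """Given (confidence, precision, recall) points, return (confidence, f1) with the highest F1."""
    confidence, precision, recall = max(points, key=lambda p: f1_score(p[1], p[2]))
    return confidence, f1_score(precision, recall)
```

A pair such as `(9, "dog", "cat")` would read as "9 true `dog` instances predicted as `cat`" under the assumed layout.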

Workflow

1. Fetch Artifacts

Locate the evaluation artifacts (usually in runs/detect/trainX/ or MLflow):

•confusion_matrix.png
•PR_curve.png
•results.csv
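
One way to make this step explicit is to check for the three artifacts before doing anything else, then read the final row of results.csv. This is a minimal sketch: the helper names are hypothetical, and the assumption that the last CSV row holds the final epoch's metrics (and that every column is numeric) should be verified against your own run output.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional

REQUIRED_ARTIFACTS = ("confusion_matrix.png", "PR_curve.png", "results.csv")


def missing_artifacts(run_dir: Path) -> List[str]:
    """List the required artifact files that are not present in run_dir."""
    return [name for name in REQUIRED_ARTIFACTS if not (Path(run_dir) / name).is_file()]


def evaluation_blocker(run_dir: Path) -> Optional[str]:
    """Return the refusal message if artifacts are missing, otherwise None."""
    missing = missing_artifacts(run_dir)
    if missing:
        return "Cannot evaluate without " + ", ".join(missing)
    return None


def read_final_metrics(csv_path: Path) -> Dict[str, float]:
    """Read the last data row of results.csv as {column_name: value}.

    Header names are stripped because some exports pad them with spaces.
    Assumes every column in the row is numeric.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle, skipinitialspace=True))
    if not rows:
        raise ValueError(f"{csv_path} has no data rows")
    return {key.strip(): float(value) for key, value in rows[-1].items()}
```

Calling `evaluation_blocker` first keeps the "Cannot evaluate without [Artifact Name]" rule in one place, so no metric is read from a partial run directory.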

2. Analyze & Compare

Run the comparison logic:

•Current mAP@50 vs Best Historical mAP@50
•Precision vs Recall balance check
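
The two checks above could be sketched as follows. The function names are hypothetical, the delta is expressed in percentage points (the absolute difference between two mAP values, times 100), and the 0.10 tolerance for calling a model balanced is an arbitrary assumption to tune per project.

```python
def map_delta_points(current: float, best_historical: float) -> float:
    """Difference between two mAP values in percentage points (0.95 vs 0.938 gives 1.2)."""
    return round((current - best_historical) * 100, 2)


def pr_balance(precision: float, recall: float, tolerance: float = 0.10) -> str:
    """Classify the precision/recall trade-off.

    "conservative": precision exceeds recall by more than the tolerance
                    (few false positives, more missed detections).
    "aggressive":   recall exceeds precision by more than the tolerance
                    (few misses, more false positives).
    "balanced":     the gap is within the tolerance.
    """
    gap = precision - recall
    if gap > tolerance:
        return "conservative"
    if gap < -tolerance:
        return "aggressive"
    return "balanced"
```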

3. Generate Report

Output a markdown report in the following format:

### 📊 Evaluation Report: [Run Name]

**🏆 Performance vs Baseline**
- **mAP@50**: 0.95 (+1.2% 🟢)
- **mAP@50-95**: 0.72 (-0.5% 🔻)
- **Best Run**: [Run ID] (0.938 mAP)

**🔍 Key Insights**
1. **Confusion**: Significant confusion between `dog` and `cat` (15% misclassification).
2. **Recall Risk**: High precision but low recall on `bicycle` class.

**💡 Recommendations**
- Collect more hard-negative samples for `cat`.
- Lower confidence threshold for `bicycle` deployment.
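
A small renderer can fill this template from measured values so that no number is typed by hand. The sketch below is illustrative only: `format_metric` and `render_report` are hypothetical names, and treating the parenthesized delta as percentage points against the best run is an assumption based on the sample figures in the template.

```python
from typing import Dict, List, Tuple


def format_metric(name: str, value: float, baseline: float) -> str:
    """Render one metric line with its delta (in percentage points) and a direction marker."""
    delta = (value - baseline) * 100
    marker = "🟢" if delta >= 0 else "🔻"
    return f"- **{name}**: {value:.2f} ({delta:+.1f}% {marker})"


def render_report(
    run_name: str,
    metrics: Dict[str, Tuple[float, float]],  # name -> (current value, baseline value)
    best_run_id: str,
    best_map: float,
    insights: List[str],
    recommendations: List[str],
) -> str:
    """Assemble the markdown report in the section order used by the template."""
    lines = [f"### 📊 Evaluation Report: {run_name}", "", "**🏆 Performance vs Baseline**"]
    lines += [format_metric(name, cur, base) for name, (cur, base) in metrics.items()]
    lines.append(f"- **Best Run**: {best_run_id} ({best_map:.3f} mAP)")
    lines += ["", "**🔍 Key Insights**"]
    lines += [f"{i}. {text}" for i, text in enumerate(insights, start=1)]
    lines += ["", "**💡 Recommendations**"]
    lines += [f"- {text}" for text in recommendations]
    return "\n".join(lines)
```

Every value passed in should come from the fetched artifacts or the baseline lookup; the renderer only formats.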

Anti-Hallucination Rules

•If artifacts are missing, say "Cannot evaluate without [Artifact Name]".
•Do not invent numbers.
•If MLflow is unreachable, compare against the "Manual Baseline" provided by the user.
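
The fallback rule could look roughly like the sketch below. `best_map50_from_mlflow` uses `mlflow.search_runs`, but the experiment name and the metric key `mAP50` are assumptions; the key actually logged by your training integration may differ. `resolve_baseline` is a hypothetical helper that labels where the baseline came from, so the report never presents a manual number as an MLflow result.

```python
import math
from typing import Callable, Optional, Tuple


def best_map50_from_mlflow(experiment_name: str, metric_key: str = "mAP50") -> float:
    """Return the highest logged value of metric_key in an MLflow experiment."""
    import mlflow  # imported lazily so the fallback path works without MLflow installed

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[f"metrics.{metric_key} DESC"],
        max_results=1,
    )
    if runs.empty:
        raise LookupError(f"no runs found in experiment {experiment_name!r}")
    value = float(runs.iloc[0][f"metrics.{metric_key}"])
    if math.isnan(value):
        raise LookupError(f"top run has no value for metric {metric_key!r}")
    return value


def resolve_baseline(
    fetch_best: Callable[[], float], manual_baseline: Optional[float]
) -> Tuple[float, str]:
    """Return (baseline value, source label), preferring MLflow over the manual baseline."""
    try:
        return fetch_best(), "MLflow"
    except Exception as exc:  # broad on purpose: any lookup failure triggers the fallback
        if manual_baseline is None:
            # No baseline at all: refuse instead of inventing a number.
            raise RuntimeError("Cannot evaluate without MLflow or a Manual Baseline") from exc
        return manual_baseline, "Manual Baseline"
```

For example, `resolve_baseline(lambda: best_map50_from_mlflow("my-experiment"), manual_baseline=0.91)` would return the manual value with the label "Manual Baseline" whenever the MLflow query fails; "my-experiment" is a placeholder name.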