Test Run Analysis
This skill helps you review Freeplay test run results, view evaluation metrics, and identify patterns across test cases.
When to use this skill
- •"Summarize the results of test run X"
- •"Summarize the major differences between two test runs in the same comparison"
- •"What were the metrics for my last test run?"
- •"Analyze test run [ID]"
- •"Which test cases failed in run X?"
- •"What are the evaluation scores from this test?"
- •User mentions a specific test run ID and wants insights
- •Compare two test runs and summarize differences between latency, cost, and evaluations
- •Select a winner from two test runs based on available data
Available API Endpoints
Get Test Run Summary
Endpoint: GET /api/v2/projects/{project_id}/test-runs/id/{test_run_id}
Returns:
{
"id": "test-run-uuid",
"name": "Test run name",
"description": "Description",
"created_at": 1234567890,
"prompt_name": "Prompt template name",
"prompt_version": "version-id",
"model_name": "gpt-4",
"sessions_count": 10,
"summary_statistics": {
"human_evaluation": {},
"auto_evaluation": {
"criteria_name": {
"pass": 8,
"fail": 2,
"average_score": 0.85
}
},
"client_evaluation": {}
}
}
List All Test Runs
Endpoint: GET /api/v2/projects/{project_id}/test-runs
Returns a list of all test runs with their IDs and associated test case IDs.
How to Analyze a Test Run
When a user provides a test run ID, follow these steps:
1. Fetch Test Run Summary
Use curl or Python to call the API:
curl -H "Authorization: Bearer $FREEPLAY_API_KEY" \
"$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/test-runs/id/{test_run_id}"
2. Extract Key Insights
From the response, identify:
- •Overall performance: sessions_count, average scores
- •Evaluation metrics: Parse the
summary_statisticsobject- •
auto_evaluation: LLM-as-judge metrics - •
human_evaluation: Human review scores - •
client_evaluation: Custom evaluation criteria
- •
- •Prompt details: prompt_name, prompt_version, model_name
- •Timestamp: created_at (Unix timestamp)
3. Analyze Metrics
For each evaluation criterion in summary_statistics:
- •Count pass/fail cases
- •Calculate pass rate percentage
- •Identify average scores
- •Flag criteria with low performance
Environment Variables Required
The following environment variables must be set:
- •
FREEPLAY_API_KEY: Your Freeplay API key - •
FREEPLAY_BASE_URL: Freeplay API base URL (default: https://app.freeplay.ai)
These are typically configured in the plugin's .mcp.json file or user's environment.
Project ID can come from:
- •User specification
- •MCP
list_projects()tool to discover available projects
Response Format
When presenting test run results to the user, include:
- •
Summary Header
- •Test run name
- •Prompt tested & version
- •Model used
- •Number of test cases
- •Date/time
- •
Evaluation Metrics
- •For each criterion:
- •Criterion name
- •Pass/fail count
- •Pass rate percentage
- •Average score (if applicable)
- •For each criterion:
- •
Key Insights
- •Overall pass rate
- •Best performing criteria
- •Areas needing improvement
- •Any notable patterns or failures
- •
Recommendations
- •Suggest next steps based on results
- •Highlight specific test cases to review if pass rate is low
- •When no evaluations are configured: The test run completed but has no quantitative metrics to analyze. Provide specific, actionable guidance:
- •Explain that without evaluations, test runs only capture raw outputs — evaluations are what score those outputs automatically on each run
- •Direct the user to set up evaluations: open the prompt in the Freeplay UI → go to the Evaluate tab → create evaluation criteria (e.g., model-graded criteria that check for accuracy, relevance, tone, or format compliance)
- •Link to the documentation: Evaluations Overview
- •Suggest running the health-check skill (
/health-check) to get a full assessment of their project setup, including evaluation coverage - •For immediate value, recommend reviewing a sample of test case outputs directly in the Freeplay UI to manually spot-check quality before investing in automated evaluations
Example Analysis Output
# Test Run: "New test" - Prompt: Chat Transcript to Issue - Model: gpt-4-turbo - Test Cases: 15 - Date: Jan 30, 2025 ## Evaluation Results ### Accuracy (auto_evaluation) - Pass: 13/15 (86.7%) - Average Score: 0.87 ### Relevance (auto_evaluation) - Pass: 14/15 (93.3%) - Average Score: 0.92 ### Format Compliance (auto_evaluation) - Pass: 15/15 (100%) - Average Score: 1.0 ## Overall Performance ✓ Strong performance across all criteria ✓ 93.3% overall pass rate ⚠ Review 2 failing test cases for Accuracy criterion ## Recommendations 1. Investigate the 2 accuracy failures to identify patterns 2. Consider adding more test cases for edge cases 3. Prompt is ready for staging deployment
Example Analysis Output (Multiple Test Runs)
# Test Run Comparison
## Test Run 1: "Bug Report Summarization"
- Prompt: Bug-to-Summary Converter
- Model: gpt-4-turbo
- Test Cases: 10
- Date: Jan 15, 2025
### Evaluation Results
#### Accuracy (auto_evaluation)
- Pass: 8/10 (80.0%)
- Average Score: 0.82
#### Clarity (manual_evaluation)
- Pass: 7/10 (70.0%)
- Average Score: 0.73
#### Format Compliance (auto_evaluation)
- Pass: 10/10 (100%)
- Average Score: 1.0
## Overall Performance
✓ Good accuracy and format compliance
⚠ Clarity underperforms at 70%
✓ 83.3% average pass rate
## Recommendations
1. Refine prompt instructions to improve clarity in summaries
2. Review the 3 test cases with clarity failures
3. Maintain strong format adherence
---
## Test Run 2: "Support Ticket Categorization 234dv vs a24bd"
- Prompt: Ticket Classifier
- Versions compared:
- Version: 234dv
- Model: Opus 4.5
- Created by: Prompt optimizer
- Created: 10/11/25 11:12:32
- Version: a24bd
- Model: Sonnet 4.5
- Created by: Rob Rhyne
- Created: 10/10/25 11:12:32
- Test Cases: 12
- Date: Feb 2, 2025
### Evaluation Results
#### Accuracy (auto_evaluation)
- Pass: 9/12 (75.0%)
- Average Score: 0.79
#### Category Relevance (auto_evaluation)
- Pass: 8/12 (66.7%) ⚠
- Average Score: 0.68
#### Format Compliance (auto_evaluation)
- Pass: 11/12 (91.7%)
- Average Score: 0.95
## Overall Performance
⚠ Category relevance below 70% – needs attention
✓ Good format compliance, moderate accuracy
✗ 77.8% average pass rate
## Recommendations
1. Investigate category relevance failures—review misclassified tickets
2. Consider adding category examples to the prompt
3. Re-evaluate if accuracy improves after prompt update
Tips
- •Always format Unix timestamps into readable dates
- •Calculate percentages and round to 1 decimal place
- •Use visual indicators (✓ ⚠ ✗) for quick scanning
- •Highlight concerning metrics (< 70% pass rate)
- •Keep the analysis concise but actionable