Aviation Agent Test Improvement
Analyze planner behavioral test results and systematically enhance tool definitions, planner prompts, and test coverage.
Critical: Use Existing Infrastructure Only
DO NOT CREATE NEW SCRIPTS. All infrastructure exists:
- •Test Runner:
tests/aviation_agent/test_planner_behavior.py - •Test Cases:
tests/aviation_agent/fixtures/planner_test_cases.json - •CSV Results:
tests/aviation_agent/results/
Quick Start
- •Read existing test cases from
tests/aviation_agent/fixtures/planner_test_cases.json - •Generate new test cases and append to JSON file
- •Run tests:
bash
source ./venv/bin/activate export $(cat web/server/.env | grep -v '^#' | xargs) RUN_PLANNER_BEHAVIOR_TESTS=1 python -m pytest tests/aviation_agent/test_planner_behavior.py -v
- •Report CSV file path and summary metrics
- •If failures exist, analyze CSV and propose code changes
Three-Agent Workflow
For detailed instructions on each agent, see:
- •Ground Truth Generator - Generate expected tool selection for new questions
- •Failure Analysis - Analyze failures and propose fixes
- •Validation - Compare before/after results
Tool Selection Patterns
| Query Type | Tool | Key Arguments |
|---|---|---|
| Routes (from X to Y) | find_airports_near_route | from_location, to_location, filters |
| Near location | find_airports_near_location | location_query, filters |
| Airport details | get_airport_details | icao_code |
| Country search | search_airports | query, filters |
| Notification requirements | get_notification_for_airport | icao, day_of_week |
| Rules question (ONE country) | answer_rules_question | country_code, question, tags |
| Rules browsing (list all) | browse_rules | country_code, tags, offset, limit |
| Rules comparison (2+ countries) | compare_rules_between_countries | countries, tags, category |
Available Filters
| Filter | Type | Description |
|---|---|---|
fuel_type | 'avgas' | 'jet_a' | Preferred over legacy has_avgas/has_jet_a |
has_avgas | boolean | Legacy - still works |
has_jet_a | boolean | Legacy - still works |
has_hard_runway | boolean | Paved/hard surface runways |
has_procedures | boolean | IFR procedures available |
point_of_entry | boolean | Customs/border crossing |
country | string | ISO-2 country code |
min_runway_length_ft | number | Minimum runway length |
max_runway_length_ft | number | Maximum runway length |
max_landing_fee | number | Maximum landing fee |
max_hours_notice | number | Notification requirements |
hotel | boolean | On-site hotel |
restaurant | boolean | On-site restaurant |
Test Case Format
json
{
"question": "User question in natural language",
"expected_tool": "tool_name_from_manifest",
"expected_arguments": {
"arg1": "value1",
"filters": { "filter_key": true }
},
"description": "Why this tool/args combination is correct"
}
Critical Rules
- •NEVER create new test scripts - Use existing
test_planner_behavior.py - •NEVER create analysis scripts - Read CSV files directly
- •ALWAYS edit existing files - Append to
planner_test_cases.json - •ALWAYS use venv -
source ./venv/bin/activate - •ALWAYS load environment -
export $(cat web/server/.env | grep -v '^#' | xargs) - •ALWAYS run tests and report - Print CSV path and summary metrics
Output Format
After Running Tests
code
Tests completed Results saved to: tests/aviation_agent/results/planner_test_results_YYYYMMDD_HHMMSS.csv Summary: - Total tests: 21 - Passed: 21 (100%) - Failed: 0 (0%) - Tool match: 21/21 (100%) - Args match: 21/21 (100%)
Key Files
- •Test Cases:
tests/aviation_agent/fixtures/planner_test_cases.json - •Test Runner:
tests/aviation_agent/test_planner_behavior.py - •Planner Prompt:
shared/aviation_agent/planning.py - •Tool Definitions:
shared/aviation_agent/tools.py - •Formatter:
shared/aviation_agent/formatting.py