Add `TestCase` entries for a new or existing agent to `backend/evals/test_cases.py`.
Before You Start
- Read `backend/evals/test_cases.py` to understand the `TestCase` dataclass and the existing case lists.
- Read the target agent at `backend/agents/<name>.py` to understand its `id=`, tools, knowledge sources, and expected output style.
- Read `backend/evals/run_evals.py` lines 1-30 to understand which fields are used during evaluation.
TestCase Schema
```python
from dataclasses import dataclass


@dataclass
class TestCase:
    question: str                # The user query sent to the agent
    expected_strings: list[str]  # Strings that MUST appear in the response
    category: str                # "basic" | "aggregation" | "data_quality" | "complex" | "edge_case" | "navigation"
    agent_id: str                # Must match the agent's id= field exactly (default: "data-agent")
    golden_sql: str | None       # Ground-truth SQL for data-agent cases (optional)
    expected_result: str | None  # Exact expected value for golden SQL comparison (optional)
    golden_path: str | None      # Knowledge path label for knowledge-agent cases (optional)
```
Steps
- Add a named list for the agent below the last existing `*_CASES` list:

  ```python
  # ---------------------------------------------------------------------------
  # <AgentName> Test Cases
  # ---------------------------------------------------------------------------
  <AGENT_ID_UPPER>_CASES: list[TestCase] = [
      ...
  ]
  ```

- Write 4–6 test cases covering:
  - `basic`: a typical, direct query the agent is designed to handle
  - `basic`: a second typical query using a different tool or knowledge path
  - `complex`: a multi-step or compound query
  - `edge_case`: a query at the boundary of what the agent knows or can do
  - `edge_case`: a query the agent should gracefully decline or redirect
- Set `agent_id` to the exact string from the agent's `id=` constructor argument.
- Add the new list to `ALL_CASES` at the bottom of the file:

  ```python
  ALL_CASES = DATA_AGENT_CASES + KNOWLEDGE_AGENT_CASES + WEB_SEARCH_CASES + <AGENT_ID_UPPER>_CASES
  ```
Rules
- `expected_strings` must be strings the agent will reliably produce, not LLM paraphrases. Use proper nouns, IDs, or keywords from the data, not generic words like "the" or "result".
- For `data-agent` cases, always include `golden_sql` when the answer is deterministic from the database. The runner uses it for exact result comparison.
- For `knowledge-agent` cases, set `golden_path` to the document section name when the answer should come from a specific loaded document.
- Do not add test cases requiring live external services (web, stock prices, real-time data) without noting that they need a running backend with appropriate tools enabled.
- `edge_case` tests for "out of scope" queries should use `expected_strings=["no"]` or a short substring the agent uses when declining. Check the agent's system prompt for its exact refusal phrasing.
Running Evals
```bash
# Run all cases for the agent
mise run evals:run -- -c basic   # filter by category
mise run evals:run               # all cases
mise run evals:run -- -v         # verbose (show full responses)
mise run evals:run -- -g         # LLM grading mode
```
Requires: a running backend (`mise run docker:up`) and loaded data (`mise run load-sample-data` for `data-agent`).