Check Test Logical Response — Passed-Test Semantic Audit
Audits passed tests only: verifies that each model response is logically coherent with the prompt (topic, entities, intent). Assertions already passed; this skill adds a "semantic sanity" layer so tests that pass on keywords/tools but give off-topic or contradictory answers are surfaced.
When to Apply
- •User says
/check-test-logical-resp, "check logical responses", "check test responses make sense", "audit passed tests for logic" - •User @-mentions a
test-results_*.jsonormulti-model-results_*.jsonand asks whether responses were logical or made sense - •User wants to know if any passed tests had nonsensical or off-topic answers
File Resolution
Use the same file resolution as the analyze-test-json skill:
- •Base directory:
src/IntegrationTesterApp/test-results/ - •No path or "latest": Glob for
test-results_*.jsonandmulti-model-results_*.jsonunder that directory. Parse the embedded timestamp_yyyyMMdd_HHmmssfrom each filename. Sort by timestamp descending; take the first file as latest. - •Explicit path: Use that path. If the user gives a bare name (e.g.
test-results_20260127_054623), resolve tosrc/IntegrationTesterApp/test-results/<name>.json.
Format detection:
- •Root has
modelResults→ multi-model (eachmodelResults[i]hasmodel,testResults). - •Root has
testResults→ single-run (root-leveltestResults).
If the chosen path does not exist, report that and suggest running tests or checking the path.
Per-Test Shape (JSON)
Same as analyze-test-json. Each item in testResults (or modelResults[].testResults) has (camelCase):
- •testCase:
name,prompt,conversationHistory,expectedToolsToCall,expectedToolsNotToCall,responseMustContain,responseMustContainAny,tags, … - •passed: boolean
- •response: LLM response text
- •functionCallDetails:
{ functionName, pluginName, name, parameters, result, resultString, nestedCalls? }[] - •failures, error, durationMs, timedOut: as in analyze-test-json
Scope: Passed Tests Only
- •Input: All test results from the chosen file (single-run: root
testResults; multi-model: eachmodelResults[i].testResults). - •In scope for logical-response check: Only tests where
passed === true. Failed tests are already handled by assertions; this skill targets "passed but nonsensical."
For each passed test, use: testCase.prompt, testCase.conversationHistory (if any), response, and optionally functionCallDetails, then judge logical coherence against the criteria below.
Logical Sense Criteria
Evaluate each passed test against these. Flag when one or more are violated. (responseMustContain / responseMustContainAny already passed — the check is overall coherence and correctness, not keyword presence.)
| Criterion | Meaning | Flag when … |
|---|---|---|
| Topic match | Response should address the same subject as the prompt (e.g. stocks vs weather vs math). | User asked about weather and the reply is mainly about stocks. |
| Entity match | Named entities in the question (cities, tickers, numbers) should appear correctly in the response. | "Seattle" → response about another city; "MSFT" → reply about a different ticker. |
| Intent alignment | If the prompt asks for a specific kind of answer (e.g. "add 3 and 5", "get stock for AAPL"), the response should fulfill that intent. | User asked for a numeric result or stock info for AAPL; reply deflects or answers something else. |
| No clear contradiction | Response should not contradict the question. | User asks "weather in Seattle" and the model says "In New York the weather is …" without clearly addressing Seattle. |
Workflow
- •Resolve file: Same as analyze-test-json (glob + latest by timestamp, or explicit path).
- •Detect format: Root has
modelResults→ multi-model; root hastestResults→ single-run. - •Collect passed tests: From
testResults(or from eachmodelResults[i].testResultsfor multi-model), keep only items withpassed === true. - •For each passed test:
- •Read
testCase.prompt,testCase.conversationHistory,response, and optionallyfunctionCallDetails. - •Apply the logical-sense criteria above.
- •If the response is judged not logically coherent with the prompt, add it to a "Passed but illogical" list with: test name, model (if multi-model), prompt snippet, response snippet, and a short reason (e.g. "Response discusses stocks; user asked for Seattle weather").
- •Read
- •Summarize: Count how many passed tests were reviewed and how many were flagged as illogical.
- •Output: Use the report template below.
Output Report Template
Produce a markdown report in this form. When M = 0 (no logical issues), omit the "Passed tests with logical response issues" detail; keep the summary.
## Logical Response Check — Passed Tests ### 1. Run overview - **File**: <resolved path> - **Format**: Single-run | Multi-model - **Model(s)**: <from configuration or modelResults[].model> - **Passed tests reviewed**: <count> - **Passed tests with logical issues**: <count> ### 2. Passed tests with logical response issues For each passed test where the response did not make logical sense: #### <testCase.name> [Model: <model> if multi-model] - **Prompt** (snippet): "<prompt text>" [+ conversation context if present] - **Response** (snippet): "<response excerpt that shows the problem>" - **Issue**: <short explanation: topic mismatch, wrong entity, contradiction, or intent not fulfilled> ### 3. Summary - <N> passed tests reviewed for logical coherence with the user's question. - <M> showed logical response issues (listed above). - If M is 0: All reviewed passed tests had responses that made logical sense given the prompt.
Relation to Other Skills
| Skill | Relationship |
|---|---|
| analyze-test-json | Focuses on failing tests (flow, assertions, recommendations). check-test-logical-resp reuses its file resolution and JSON shape, but focuses on passed tests and semantic/logical coherence of the response vs the prompt. |
| smart-test-runner | Runs tests and prints console output; it does not consume JSON. Either skill can be used after a run: smart-test-runner to run, then check-test-logical-resp on the generated JSON to audit passed-test logic. |
| add-plugin-test | Can add new tests (e.g. stricter keywords or new cases) if logical-response issues suggest missing or weak assertions. |
Usage Examples
| Command / request | Action |
|---|---|
/check-test-logical-resp or "check logical responses" | Resolve latest file in test-results/, audit all passed tests for logical coherence. |
| "audit passed tests for logic" | Same — resolve latest, then run logical-response check on passed tests only. |
User @-mentions test-results_20260127_054623.json and asks "do these responses make sense?" | Resolve to that file (or under test-results/), run logical check on its passed tests. |
| "check test responses make sense" with no file | Resolve latest test-result JSON, run the audit. |
File Locations & References
- •Test result directory:
src/IntegrationTesterApp/test-results/ - •File resolution: Same as the analyze-test-json skill — glob
test-results_*.jsonandmulti-model-results_*.json, parse_yyyyMMdd_HHmmss, sort descending, take first for "latest".