Reclassify Tests
Analyze a checkpoint's spec and test file to identify tests that are NOT explicitly demonstrated in the spec examples, and add @pytest.mark.functionality markers to them.
Usage: /reclassify-tests file_backup checkpoint_1
Purpose
Core tests (unmarked) should only cover behaviors that are explicitly demonstrated in spec examples. Tests that verify implicit requirements, edge cases, or inferred behaviors should be marked as @pytest.mark.functionality.
This skill helps ensure fair evaluation by:
- •Keeping core tests limited to explicit spec examples
- •Moving implicit/inferred behaviors to functionality tests
- •Making test classification consistent and justifiable
Step 1: Gather Context
Read these files for the specified problem/checkpoint:
problems/{problem}/checkpoint_N.md # Spec with examples
problems/{problem}/tests/test_checkpoint_N.py # Test file to reclassify
problems/{problem}/tests/conftest.py # Fixtures context
problems/{problem}/tests/data/checkpoint_N/ # Test case data (if present)
Step 2: Extract Explicit Spec Behaviors
From the spec, identify behaviors that are explicitly demonstrated:
What Counts as "Explicit"
- •Direct Examples: Code blocks showing exact input → output
- •Explicit Statements: "MUST", "SHALL", specific values defined
- •Example Output: JSON/text showing exact expected format
- •Stated Requirements: Numbered/bulleted requirements with specific behavior
What Does NOT Count as "Explicit"
- •Implied Behavior: Reasonable inferences not shown in examples
- •Edge Cases: Boundary conditions not demonstrated
- •Error Handling: Unless specific errors are shown in examples
- •Format Details: Exact formatting beyond what examples show
- •Ordering: Unless explicitly stated or shown in all examples
Create a Checklist
For each spec example, note:
- •What input is shown?
- •What output is shown?
- •What behavior is demonstrated?
Step 3: Analyze Each Test
For each test function or parametrized test case:
A. Identify What the Test Verifies
- •What specific behavior does this test check?
- •What input does it use?
- •What output does it expect?
B. Find Spec Support
Ask: "Can I point to a spec example that explicitly demonstrates this exact behavior?"
| Spec Support | Classification |
|---|---|
| Example shows exact scenario | Keep as core (no marker) |
| Behavior stated but no example | Mark as functionality |
| Inferred from context | Mark as functionality |
| Edge case not shown | Mark as functionality |
| Error handling not in examples | Mark as error |
C. Document the Decision
For each test, note:
- •Spec support (quote or "NONE")
- •Classification decision
- •Reason
Step 4: Apply Markers
For Tests to Reclassify
Add @pytest.mark.functionality ABOVE the test function or parametrize decorator:
Before:
@pytest.mark.parametrize("case", EDGE_CASES, ids=[c["id"] for c in EDGE_CASES])
def test_edge_cases(entrypoint_argv, case, tmp_path):
...
After:
@pytest.mark.functionality
@pytest.mark.parametrize("case", EDGE_CASES, ids=[c["id"] for c in EDGE_CASES])
def test_edge_cases(entrypoint_argv, case, tmp_path):
...
For Individual Parametrized Cases
If some cases in a parametrized test are explicit and others aren't, consider:
- •Split the test: Separate core cases from functionality cases
- •Use pytest.param with marks:
@pytest.mark.parametrize("case", [
pytest.param(case1, id="explicit-case"),
pytest.param(case2, marks=pytest.mark.functionality, id="inferred-case"),
])
def test_cases(case):
...
Step 5: Report Changes
Generate a summary of changes made:
# Reclassification Report: {problem} {checkpoint}
## Summary
- Tests analyzed: N
- Tests reclassified: N
- Tests unchanged: N
## Reclassified Tests
### `test_function_name` → @pytest.mark.functionality
**Reason**: Test verifies [behavior] which is not explicitly demonstrated in spec.
**Spec says**: [Quote or "No example shown"]
**Test expects**: [What the test checks]
---
### `test_another_function` → @pytest.mark.functionality
...
## Unchanged Tests (Explicitly in Spec)
| Test | Spec Example Reference |
|------|------------------------|
| `test_core_case_1` | Example 1: shows exact input/output |
| `test_core_case_2` | Example 2: demonstrates this scenario |
Classification Guidelines
KEEP AS CORE (no marker) when:
- •Spec example shows exact input and expected output
- •Behavior is explicitly stated with specific values
- •Test directly validates a demonstrated scenario
- •Example output format matches test expectations exactly
MARK AS FUNCTIONALITY when:
- •Behavior is reasonable but not shown in examples
- •Test checks edge cases not demonstrated
- •Test verifies format details beyond examples
- •Test validates inferred requirements
- •Test checks ordering not explicitly stated
- •Test validates error messages/codes not shown
MARK AS ERROR when:
- •Test validates error handling for invalid input
- •Test checks failure modes
- •Test expects non-zero exit or exceptions
Common Reclassification Scenarios
1. Implicit Ordering
Spec: "Process all jobs" Test: Expects specific alphabetical order → Mark as functionality (unless spec explicitly shows ordering)
2. Edge Values
Spec: "Duration in hours" Test: Checks duration=0, duration=MAX → Mark as functionality (unless 0/MAX shown in examples)
3. Format Details
Spec: Shows JSON output example Test: Checks exact whitespace, key ordering → Mark as functionality (unless spec says "must match exactly")
4. Error Handling
Spec: "Return non-zero on error" Test: Checks specific error code or message → Mark as error (and consider relaxing the assertion)
5. Field Presence
Spec: Example shows fields A, B, C Test: Requires field D not in any example → Mark as functionality
Data-Driven Test Considerations
Many test files use data directories like:
tests/data/checkpoint_N/ ├── core/ # Core test cases ├── hidden/ # Functionality test cases ├── errors/ # Error test cases
When test data is already organized this way:
- •
Check that markers match data directories:
- •Tests loading from
core/should have no marker - •Tests loading from
hidden/should have@pytest.mark.functionality - •Tests loading from
errors/should have@pytest.mark.error
- •Tests loading from
- •
Consider moving cases between directories if:
- •A "core" case tests implicit behavior → move to
hidden/ - •A "hidden" case tests explicit behavior → move to
core/
- •A "core" case tests implicit behavior → move to
Example Analysis
Given spec excerpt:
## Example 1
Input: `--now 2025-09-10T03:30:00Z --duration 1`
Output:
{"event":"SCHEDULE_PARSED","timezone":"UTC","jobs_total":2}
{"event":"JOB_ELIGIBLE","job_id":"job-root","kind":"daily",...}
And tests:
def test_schedule_parsed_event():
"""Verify SCHEDULE_PARSED is emitted first."""
# ✓ KEEP AS CORE - Example 1 shows this event first
def test_schedule_parsed_fields():
"""Verify SCHEDULE_PARSED has timezone and jobs_total."""
# ✓ KEEP AS CORE - Example 1 shows these exact fields
def test_empty_schedule():
"""Verify empty schedule returns jobs_total=0."""
# → MARK AS FUNCTIONALITY - No example shows empty schedule
def test_invalid_timezone():
"""Verify invalid timezone returns error."""
# → MARK AS ERROR - Error handling case
After Reclassification
- •
Run tests to ensure markers are syntactically correct:
bashcd problems/{problem} && pytest --collect-only tests/test_checkpoint_N.py - •
Verify counts match expectations:
bashpytest --collect-only -m "not functionality and not error" tests/ # Core only pytest --collect-only -m functionality tests/ # Functionality only pytest --collect-only -m error tests/ # Error only
- •
Review the changes - did any truly explicit tests get marked?