Testing Expert
Expert knowledge for analyzing repository testing architecture, identifying gaps, and planning comprehensive testing improvements—with special focus on AI agent verification.
Philosophy
Core Principles
"Testing code is as important as production code."
"Don't trust the agent, verify the world."
Tests are not second-class citizens. They:
- •Document expected behavior
- •Enable safe refactoring
- •Catch regressions before production
- •Provide deployment confidence
For AI agents specifically, tests must verify reality, not just that the agent claimed success.
The 5 Verification Levels
This is the most important concept for testing AI agent systems.
The Problem: Tests That Lie
Traditional tests verify function returns. Agent tests often only verify:
- •The code didn't crash
- •The agent claimed success
This is insufficient. AI agents can claim success while failing. We found this actual code:
def test_github_app_works(self):
"""Verify GitHub App can create files."""
assert True # ← Tests NOTHING
The Verification Hierarchy
| Level | Name | Description | Confidence |
|---|---|---|---|
| 5 | Invariant | Property is ALWAYS true | ⭐⭐⭐⭐⭐ |
| 4 | Outcome | World state changed correctly | ⭐⭐⭐⭐ |
| 3 | Event | Correct events were emitted | ⭐⭐⭐ |
| 2 | Output | Function returned success | ⭐⭐ |
| 1 | Existence | Code didn't crash | ⭐ |
Minimum acceptable: Level 3 Critical paths: Level 4-5
Anti-Patterns to Find
# ❌ Level 1: USELESS - Tests nothing
def test_agent_works():
agent.run("do something")
assert True
# ❌ Level 2: WEAK - Trusts agent's claim
def test_agent_creates_file():
result = agent.run("create file.py")
assert result.status == "success" # Agent could be lying
# ✅ Level 4: GOOD - Verifies reality
def test_agent_creates_file():
result = agent.run("create file.py")
assert Path("file.py").exists() # Check filesystem
content = Path("file.py").read_text()
compile(content, "file.py", "exec") # Verify valid Python
Three-Source Verification
For critical claims, verify from THREE independent sources:
async def verify_pr_created(execution_id: str, pr_number: int) -> bool:
# Source 1: Our event store
events = await event_store.read(f"execution-{execution_id}")
pr_event = next((e for e in events if e.type == "PRCreated"), None)
# Source 2: External API
github_pr = await github.get_pr(pr_number)
# Source 3: Ground truth
branch_exists = await git_branch_exists(f"aef/{execution_id}")
# ALL THREE must agree
return pr_event and github_pr and branch_exists
Analysis Workflow
Step 1: Discover Test Structure
# Find test configuration cat pytest.ini pyproject.toml setup.cfg 2>/dev/null | grep -A20 "\[tool.pytest" cat jest.config.* vitest.config.* 2>/dev/null # Find test directories find . -type d -name "test*" -o -name "*tests*" | head -20 # Count test files find . -name "test_*.py" -o -name "*_test.py" | wc -l find . -name "*.test.ts" -o -name "*.spec.ts" | wc -l
Step 2: Assess Test Types
# Find test markers/categories grep -r "@pytest.mark\." --include="*.py" | cut -d: -f2 | sort | uniq -c # Find integration tests find . -path "*/integration/*" -name "test_*.py" # Find E2E tests find . -path "*/e2e/*" -name "*.py" -o -name "e2e_*.py"
Step 3: Find Anti-Patterns
# Level 1: assert True (useless) grep -rn "assert True" --include="*.py" # Level 2: Only checking status grep -rn "assert.*\.status\|assert.*\.success" --include="*.py" # Missing assertions grep -rn "def test_" --include="*.py" -A10 | grep -B5 "^--$"
Step 4: Assess Coverage
# Check coverage configuration grep -r "cov" pyproject.toml pytest.ini # Find coverage reports find . -name "*.coverage" -o -name "coverage.xml" -o -name "htmlcov" # Check CI for coverage gates cat .github/workflows/*.yml | grep -i "coverage\|cov-fail"
Step 5: Check CI Integration
# What runs in CI? cat .github/workflows/*.yml | grep -E "pytest|jest|vitest|test" # Coverage gates? cat .github/workflows/*.yml | grep "cov-fail-under" # Test parallelization? cat .github/workflows/*.yml | grep -i "parallel\|matrix"
Assessment Report Template
# Testing Architecture Assessment ## Summary | Metric | Current | Target | Gap | |--------|---------|--------|-----| | Test Files | X | - | - | | Test Functions | X | - | - | | Coverage | X% | 80% | X% | | Unit:Integration:E2E | X:X:X | 70:20:10 | - | ## Test Structure ### Strengths - ✅ [What's working well] ### Gaps - ❌ [What's missing or broken] ## Anti-Patterns Found ### Level 1 Violations (assert True) | File | Line | Issue | |------|------|-------| | path/file.py | 42 | `assert True` | ### Level 2 Violations (Trust without verify) | File | Line | Issue | |------|------|-------| | path/file.py | 88 | `assert result.success` without verification | ## Recommendations ### Priority 1: Critical (Do Now) 1. [Action item] ### Priority 2: Important (This Sprint) 1. [Action item] ### Priority 3: Nice to Have (Backlog) 1. [Action item] ## Implementation Plan ### Week 1: Foundation - [ ] Task 1 - [ ] Task 2 ### Week 2: Core Improvements - [ ] Task 3 - [ ] Task 4
Testing Patterns
Pattern 1: Arrange-Act-Assert (AAA)
def test_user_creation():
# Arrange
user_data = {"name": "Alice", "email": "alice@example.com"}
# Act
user = create_user(user_data)
# Assert
assert user.id is not None
assert user.name == "Alice"
Pattern 2: Given-When-Then (BDD)
def test_checkout_applies_discount():
# Given a cart with items over $100
cart = Cart()
cart.add(Item(price=150))
# When checkout is performed
order = checkout(cart, discount_code="SAVE10")
# Then 10% discount is applied
assert order.total == 135
Pattern 3: Fixture Factory
@pytest.fixture
def make_user():
"""Factory fixture for creating test users."""
created = []
def _make_user(**kwargs):
user = User(
name=kwargs.get("name", "Test User"),
email=kwargs.get("email", f"test-{uuid4()}@example.com"),
)
created.append(user)
return user
yield _make_user
# Cleanup
for user in created:
user.delete()
Pattern 4: Test Markers (Categorization)
Markers tag tests with metadata so pytest can run them selectively:
import pytest
@pytest.mark.unit
def test_fast_logic():
"""Runs in milliseconds, no I/O."""
assert 1 + 1 == 2
@pytest.mark.integration
async def test_with_database():
"""Needs real database connection."""
result = await db.query("SELECT 1")
assert result == 1
@pytest.mark.e2e
async def test_full_workflow():
"""Needs entire stack running."""
response = await api.post("/execute")
assert response.status_code == 200
Running by marker:
pytest -m unit # Fast, every commit (~30s) pytest -m integration # Slower, PR only (~2min) pytest -m "not e2e" # Skip expensive tests
Register in pyproject.toml:
[tool.pytest.ini_options]
markers = [
"unit: Fast tests with no external dependencies",
"integration: Tests requiring database or services",
"e2e: End-to-end tests requiring full stack",
]
Auto-detection rules:
| If test has... | Mark as |
|---|---|
MagicMock, AsyncMock, no DB imports | @pytest.mark.unit |
event_store, postgresql://, docker | @pytest.mark.integration |
e2e_ in filename | @pytest.mark.e2e |
Pattern 5: Test Doubles
| Double | Use Case |
|---|---|
| Dummy | Fill parameters, never used |
| Stub | Return predetermined values |
| Spy | Record calls for later verification |
| Mock | Pre-programmed expectations |
| Fake | Working implementation (in-memory DB) |
Pattern 5: Property-Based Testing
from hypothesis import given, strategies as st
@given(st.lists(st.integers()))
def test_sort_is_idempotent(items):
"""Sorting twice gives same result as sorting once."""
once = sorted(items)
twice = sorted(sorted(items))
assert once == twice
@given(st.lists(st.integers()))
def test_sort_preserves_length(items):
"""Sorting doesn't change list length."""
assert len(sorted(items)) == len(items)
Coverage Best Practices
Coverage Targets by Component Type
| Component | Target | Rationale |
|---|---|---|
| Business Logic | 90%+ | Core value, must be correct |
| API Handlers | 80%+ | User-facing, critical paths |
| Utilities | 85%+ | Widely used, high leverage |
| Infrastructure | 70%+ | Integration tested |
| Generated Code | 0% | Skip, test generator instead |
Coverage Exclusions
# pyproject.toml
[tool.coverage.report]
exclude_lines = [
"pragma: no cover",
"def __repr__",
"if TYPE_CHECKING:",
"raise NotImplementedError",
"if __name__ == .__main__.:",
]
Coverage vs Quality
High coverage ≠ Good tests
100% coverage with bad assertions is worse than 60% coverage with thorough verification.
Focus on:
- •Critical path coverage
- •Edge case coverage
- •Verification quality (Level 4+)
CI/CD Integration
Essential CI Gates
# Required for merge - name: Run tests run: pytest --cov --cov-fail-under=70 - name: Check types run: mypy src - name: Lint run: ruff check .
Test Parallelization
strategy:
matrix:
shard: [1, 2, 3, 4]
steps:
- run: pytest --shard-id=${{ matrix.shard }} --num-shards=4
Flaky Test Detection
- name: Run tests with retries run: pytest --reruns 2 --reruns-delay 1
Common Gaps & Fixes
Gap 1: No Integration Tests
Symptom: All tests mock everything Fix: Add tests with real database/services using testcontainers
Gap 2: Tests Without Assertions
Symptom: def test_x(): some_code() with no assert
Fix: Add explicit verification of expected outcomes
Gap 3: Flaky Tests
Symptom: Tests pass/fail randomly Fix:
- •Remove time-dependencies (use freezegun)
- •Add proper async handling
- •Use deterministic test data
Gap 4: Slow Test Suite
Symptom: Tests take >5 minutes Fix:
- •Parallelize with pytest-xdist
- •Use fixtures with proper scope
- •Mock expensive external calls
Gap 5: Missing Edge Cases
Symptom: Bugs in production that "tests should have caught" Fix:
- •Add property-based testing
- •Review with mutation testing
- •Add regression tests for each bug found
Related Skills
- •
testing/ai-agent-verification- Deep dive on AI agent testing - •
devops/ci-cd- CI/CD pipeline configuration - •
review/prioritize- Prioritize issues by severity