Red-Team — Scenario-Based Stress Testing
Map capabilities (framework features) against facets (cross-cutting perspectives), then spawn agents that build real-world test scenarios for each combination.
Unlike the existing capabilities matrix (combinatorial flag testing), this is scenario-based — "can I actually build a RAG pipeline with caching and async?" or "does sync→async require zero code changes?"
Triggers
- •/red-team: full matrix
- •/red-team <capability>: narrow to one capability across all facets
Workflow
Phase 0 — Setup
mkdir -p tests/red_team && touch tests/red_team/__init__.py
Phase 1 — Discovery (single Explore agent)
Spawn one read-only Explore agent that:
- •Reads source code (src/hypergraph/), existing tests, docs, and README
- •Reads the capabilities matrix (tests/capabilities/matrix.py) to understand existing coverage
- •Produces two lists:
Capabilities (framework features to probe):
- •Examples: Runner (sync/async parity), Routing (route/ifelse/END), Cycles, Nesting (GraphNode), InterruptNode, Caching, emit/wait_for, Streaming, bind/select/rename, map_over, TypeValidation, Events
Facets (cross-cutting concerns/perspectives):
- •Examples: zero-code-change (sync↔async without changes), error-propagation (exceptions across boundaries), empty-inputs (None, [], {}), real-world-pattern (RAG, ETL, multi-agent), composability (combine 3+ features), state-isolation (no leaks between runs), scale (many nodes, deep nesting), determinism (same input → same output)
- •Outputs a matrix as JSON:
{
  "capabilities": [{"id": "C1", "name": "Runner", "description": "..."}],
  "facets": [{"id": "F1", "name": "zero-code-change", "description": "..."}],
  "cells": [
    {
      "capability": "C1",
      "facet": "F1",
      "scenario": "Run same graph sync and async with identical results",
      "test_filename": "test_rt_runner_zero_change.py",
      "search_hint": "async sync parity DAG framework",
      "priority": "high"
    }
  ]
}
Only include cells where the combination is meaningful — not every capability × facet makes sense. Target ~15-25 cells.
If /red-team <capability> was used, filter to only that capability's cells.
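The matrix handling described above could be sketched as follows. This is a minimal illustration, not part of the skill itself: the helper names `load_matrix` and `filter_cells` are hypothetical, and matching the `<capability>` argument against the capability `name` (case-insensitively) is an assumption. Only the JSON field names come from the matrix format shown above.

```python
from __future__ import annotations

import json


def load_matrix(raw: str) -> dict:
    """Parse the discovery output and reject cells that point at unknown ids."""
    matrix = json.loads(raw)
    capability_ids = {c["id"] for c in matrix["capabilities"]}
    facet_ids = {f["id"] for f in matrix["facets"]}
    for cell in matrix["cells"]:
        if cell["capability"] not in capability_ids:
            raise ValueError(f"unknown capability id: {cell['capability']}")
        if cell["facet"] not in facet_ids:
            raise ValueError(f"unknown facet id: {cell['facet']}")
    return matrix


def filter_cells(matrix: dict, capability_name: str | None = None) -> list[dict]:
    """Return every cell, or only the cells of one capability.

    Assumption: the command argument is compared to the capability name,
    ignoring case.
    """
    if capability_name is None:
        return list(matrix["cells"])
    wanted = {
        c["id"]
        for c in matrix["capabilities"]
        if c["name"].lower() == capability_name.lower()
    }
    return [cell for cell in matrix["cells"] if cell["capability"] in wanted]
```

Validating ids before triage means a malformed matrix fails early, before any attack agent is spawned.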
Phase 2 — Triage (coordinator)
- •Parse the matrix from the discovery agent
- •Group cells into batches of ~3-4 by related capability (each agent gets a focused theme)
- •Create tasks via TaskCreate for each batch
- •Create a team and spawn up to 6 general-purpose attack agents in parallel, each handling one batch
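One way to picture the triage step is the sketch below. It simplifies "related capability" to "same capability id", so a capability with a single cell ends up in its own small batch; merging related capabilities is left to the coordinator's judgment. The function names are hypothetical, and running in successive waves when there are more batches than agents is an assumption, since the workflow only states the cap of 6 parallel agents.

```python
from __future__ import annotations

from collections import defaultdict

MAX_BATCH_SIZE = 4  # upper end of the "~3-4 cells per batch" guideline
MAX_AGENTS = 6      # cap on attack agents running in parallel


def batch_cells(cells: list[dict], max_batch_size: int = MAX_BATCH_SIZE) -> list[list[dict]]:
    """Group cells by capability id, then split each group into bounded batches."""
    by_capability: dict[str, list[dict]] = defaultdict(list)
    for cell in cells:
        by_capability[cell["capability"]].append(cell)

    batches: list[list[dict]] = []
    for capability in sorted(by_capability):
        group = by_capability[capability]
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


def schedule_waves(batches: list[list[dict]], max_agents: int = MAX_AGENTS) -> list[list[list[dict]]]:
    """Assumption: surplus batches wait for a later wave of at most max_agents."""
    return [batches[i:i + max_agents] for i in range(0, len(batches), max_agents)]
```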
Phase 3 — Attack (parallel agents, up to 6)
Each agent receives a batch of matrix cells and for each cell:
- •Research (for cells with search_hint): Use Perplexity/web search to find real-world workflow patterns that match the scenario: RAG pipelines, ETL flows, multi-agent orchestration, data processing DAGs, multi-turn chat loops
- •Design: Create realistic test scenarios inspired by the research + agent knowledge
- •Implement: Write tests in tests/red_team/<test_filename> following project conventions:
  - •Import from public API (from hypergraph import ...)
  - •Class-based organization, descriptive docstrings
  - •3-5 tests per cell (scenario escalation: basic → combined → stress)
- •Run: uv run pytest tests/red_team/<file> -v
- •Classify each test result:
  - •PASS: framework handles this correctly
  - •FAIL_BUG: framework bug (unexpected behavior/error)
  - •FAIL_MISSING: feature gap (reasonable expectation the framework doesn't meet)
  - •FAIL_INCONSISTENCY: works one way but not another (e.g., sync works, async doesn't)
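The four classification labels above could be recorded in a small structure like the one below, so that each agent hands the coordinator results in the same shape. This is only a sketch: `Outcome`, `ResultRecord`, and `tally` are hypothetical names, and the record fields are an assumption about what the consolidation phase needs.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"                              # framework handles this correctly
    FAIL_BUG = "FAIL_BUG"                      # unexpected behavior or error
    FAIL_MISSING = "FAIL_MISSING"              # reasonable expectation not met
    FAIL_INCONSISTENCY = "FAIL_INCONSISTENCY"  # works one way but not another


@dataclass(frozen=True)
class ResultRecord:
    capability: str   # capability id from the matrix, e.g. "C1"
    facet: str        # facet id from the matrix, e.g. "F1"
    test_id: str      # pytest node id of the test
    outcome: Outcome
    note: str = ""    # short explanation, useful for the final report


def tally(records: list[ResultRecord]) -> Counter:
    """Count how many tests fall under each classification."""
    return Counter(record.outcome for record in records)
```

The record type deliberately avoids a `Test` prefix so pytest does not try to collect it if it lives next to the test files.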
Phase 4 — Consolidation
After all attack agents complete:
- •Run uv run pytest tests/red_team/ -v --tb=short to confirm all results
- •Separate tests:
  - •Passing tests → keep as regression suite
  - •Failing tests (bugs) → mark with @pytest.mark.xfail(reason="RT: ...") so the suite passes, note in report
  - •Test errors → fix or remove
- •Generate consolidated report (markdown) with:
  - •Matrix overview (capability × facet grid showing pass/fail/skip)
  - •Bug list sorted by severity
  - •Inconsistencies found
  - •Feature gap suggestions
  - •Recommendations
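The matrix overview in the report could be rendered along these lines. The sketch assumes the coordinator has already reduced each cell to a single "pass", "fail", or "skip" status; the function name `render_grid` and the "-" placeholder for combinations that were never in the matrix are illustrative choices.

```python
from __future__ import annotations


def render_grid(
    capabilities: list[str],
    facets: list[str],
    status: dict[tuple[str, str], str],
) -> str:
    """Build a markdown table with one row per capability and one column per facet.

    status maps (capability, facet) to "pass", "fail", or "skip".
    Combinations absent from the matrix are shown as "-".
    """
    header = "| Capability | " + " | ".join(facets) + " |"
    divider = "|" + " --- |" * (len(facets) + 1)
    rows = [header, divider]
    for capability in capabilities:
        marks = [status.get((capability, facet), "-") for facet in facets]
        rows.append(f"| {capability} | " + " | ".join(marks) + " |")
    return "\n".join(rows)
```

Because only meaningful combinations become cells, most grids will be sparse, which is why a placeholder distinct from "skip" is worth keeping.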
Phase 5 — Cleanup
- •Commit: git add tests/red_team/ && git commit -m "test: red-team scenario tests"
- •Shut down team, present report
- •Ask user: "Found N bugs and M feature gaps. Want me to fix the bugs?"
Safety Rules
- •Tests in tests/red_team/ only: never modify source code or existing tests
- •Each agent writes separate files (no conflicts)
- •Always use uv run pytest for running tests
- •Read existing code and tests for context, but don't change them
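The first rule can be checked mechanically before committing. The sketch below is one possible guard, not something the skill prescribes: `outside_red_team` is a hypothetical helper that takes repository-relative paths with forward slashes and returns any that fall outside tests/red_team/.

```python
from __future__ import annotations

from pathlib import PurePosixPath

ALLOWED_ROOT = PurePosixPath("tests/red_team")


def outside_red_team(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that are not located under tests/red_team/."""
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Treat ".." segments as violations, since they can escape the allowed root.
        if ".." in path.parts or ALLOWED_ROOT not in path.parents:
            violations.append(raw)
    return violations
```

The input list could come from `git diff --name-only`, and a non-empty result would be a reason to stop before the Phase 5 commit.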