Sub-Agent Delegation
Permissions
- •NEVER spawn without explicit permission
- •ASK first: "I've identified [TASK] for sub-agent delegation. Should I spawn one?"
- •Explain WHY before requesting
When to Delegate
- •GPU kernel optimization with iterative benchmarking
- •Numerical correctness verification across test cases
- •Performance profiling and analysis
- •Parallel investigation of independent code paths
- •Long-running validation suites
Patterns
- •Parallel: Optimize independent kernels simultaneously (attention to A, MLP to B)
- •Correctness First: Make tests pass before performance
- •Incremental: Iterate until target speedup or report blockers
Kernel Optimization Template
code
Optimize [OPERATION] in [FILE]. Context: [current impl], [bottleneck source], [target HW: 3090/H100], [use case: train/inference] Requirements: 1. Implement with Triton/CUDA 2. Verify: torch.allclose(atol=1e-5, rtol=1e-5), gradients match autograd 3. Benchmark: warmup=10, bench=100, report min/max/mean/std us 4. Scales: (1,128), (8,512), (32,2048) Report: correctness status, perf table (scale, baseline_us, opt_us, speedup), memory
Workflow
Setup -> Develop -> Verify -> Benchmark -> Report
Requirements
- •Report measured numbers, never estimates
- •Include methodology (warmup, iterations, sync)
- •Flag regressions immediately