RLM Test Suite
Testing and evaluation for fleet-rlm. Validates dspy.RLM + ModalInterpreter pipeline from sandbox connectivity to recursive execution.
Running Tests
All tests run via pytest from the repo root:
bash
# All tests (unit + mocked integration, no Modal credentials needed) uv run pytest tests/ -v # Specific test files uv run pytest tests/test_rlm_integration.py -v # Integration (mocked Modal) uv run pytest tests/test_rlm_benchmarks.py -v # Benchmarks (chunking throughput) uv run pytest tests/test_rlm_regression.py -v # Regression edge cases uv run pytest tests/test_driver_protocol.py -v # SUBMIT/tool-call protocol uv run pytest tests/test_driver_helpers.py -v # Sandbox helpers (peek, grep, etc.) # Run by keyword uv run pytest tests/ -k "benchmark" -v uv run pytest tests/ -k "integration" -v
Test File Inventory
| File | Tests | What It Validates |
|---|---|---|
test_chunking.py | Chunking strategies (size, headers, timestamps, JSON) | Pure functions, no mocks needed |
test_cli_smoke.py | CLI help, command discovery, error handling | Typer interface |
test_config.py | Environment loading, quoted values, fallback keys | config.py |
test_context_manager.py | __enter__/__exit__ protocol | ModalInterpreter lifecycle |
test_driver_helpers.py | peek, grep, chunk, buffers, volume helpers | Sandbox-side functions |
test_driver_protocol.py | SUBMIT mapping, tool call round-trips | JSON protocol |
test_rlm_benchmarks.py | Chunking throughput performance | Performance baselines |
test_rlm_integration.py | End-to-end with mocked Modal sandbox | Full pipeline |
test_rlm_regression.py | Edge cases and error handling | Robustness |
test_tools.py | Regex extraction, groups, flags | tools.py |
test_volume_support.py | Volume mount/persistence config | Volume integration |
Validate Modal Environment
Before running live (non-mocked) tests:
bash
# Check Modal credentials uv run modal token set uv run modal secret list # Verify LITELLM secret exists # Check specific secret key uv run fleet-rlm check-secret uv run fleet-rlm check-secret-key --key DSPY_LLM_API_KEY # Verify sandbox connectivity uv run python scripts/test_modal_connection.py
Writing New Tests
Integration Test Pattern
python
def test_feature(monkeypatch):
"""Test with mocked Modal sandbox."""
# Mock Modal to avoid cloud dependency
monkeypatch.setattr("fleet_rlm.interpreter.modal", mock_modal)
interp = ModalInterpreter(timeout=60)
interp.start()
try:
result = interp.execute('x = 42\nSUBMIT(answer=x)')
assert result.answer == 42
finally:
interp.shutdown()
Benchmark Pattern
python
def test_benchmark_chunking(benchmark):
"""Benchmark chunking throughput."""
from fleet_rlm.chunking import chunk_by_size
text = "x" * 100_000
result = benchmark(chunk_by_size, text, 1000, 100)
assert len(result) > 0
Key points:
- •Access RLM results via
result.field_name(dot notation), notresult["field"] - •Always call
interp.shutdown()in afinallyblock - •Use
monkeypatchto mock Modal/DSPy for offline tests
Evaluation Metrics
| Metric | Target | Description |
|---|---|---|
| Iteration efficiency | < 2x optimal | Steps taken / optimal steps |
| Tool call success rate | > 95% | Successful invocations / attempts |
| Sandbox timeout rate | < 5% | Timeouts / total runs |
| SUBMIT usage | High | % of steps using SUBMIT vs print |
Baseline Expectations
| Task Type | Max Iterations | Typical Steps | Max Duration |
|---|---|---|---|
| Simple calculation | 5 | 1-2 | 10s |
| Text search | 10 | 2-4 | 30s |
| Code analysis | 20 | 5-10 | 60s |
| Multi-file exploration | 30 | 10-15 | 120s |
CI Integration
yaml
# .github/workflows/ci.yml
- name: Unit Tests
run: uv run pytest tests/ -v --ignore=tests/test_rlm_integration.py
- name: Integration Tests (main only)
if: github.ref == 'refs/heads/main'
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
run: uv run pytest tests/test_rlm_integration.py -v
Troubleshooting
See rlm-debug skill for comprehensive diagnostics.