rlm-test-suite

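Test and evaluate fleet-rlm RLM workflows. Enable this skill when running integration tests, benchmarks, or regression tests, evaluating RLM performance, or validating Modal sandbox connectivity.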

SKILL.md
---
name: rlm-test-suite
description: Test and evaluate fleet-rlm RLM workflows. Use when running integration tests, benchmarks, regression tests, evaluating RLM performance, or validating Modal sandbox connectivity.
---

RLM Test Suite

Testing and evaluation for fleet-rlm. The suite validates the dspy.RLM + ModalInterpreter pipeline, from sandbox connectivity through recursive execution.

Running Tests

All tests run via pytest from the repo root:

bash
# All tests (unit + mocked integration, no Modal credentials needed)
uv run pytest tests/ -v

# Specific test files
uv run pytest tests/test_rlm_integration.py -v    # Integration (mocked Modal)
uv run pytest tests/test_rlm_benchmarks.py -v      # Benchmarks (chunking throughput)
uv run pytest tests/test_rlm_regression.py -v       # Regression edge cases
uv run pytest tests/test_driver_protocol.py -v      # SUBMIT/tool-call protocol
uv run pytest tests/test_driver_helpers.py -v        # Sandbox helpers (peek, grep, etc.)

# Run by keyword
uv run pytest tests/ -k "benchmark" -v
uv run pytest tests/ -k "integration" -v

Test File Inventory

| File | Tests | What It Validates |
|------|-------|-------------------|
| test_chunking.py | Chunking strategies (size, headers, timestamps, JSON) | Pure functions, no mocks needed |
| test_cli_smoke.py | CLI help, command discovery, error handling | Typer interface |
| test_config.py | Environment loading, quoted values, fallback keys | config.py |
| test_context_manager.py | `__enter__`/`__exit__` protocol | ModalInterpreter lifecycle |
| test_driver_helpers.py | peek, grep, chunk, buffers, volume helpers | Sandbox-side functions |
| test_driver_protocol.py | SUBMIT mapping, tool call round-trips | JSON protocol |
| test_rlm_benchmarks.py | Chunking throughput performance | Performance baselines |
| test_rlm_integration.py | End-to-end with mocked Modal sandbox | Full pipeline |
| test_rlm_regression.py | Edge cases and error handling | Robustness |
| test_tools.py | Regex extraction, groups, flags | tools.py |
| test_volume_support.py | Volume mount/persistence config | Volume integration |

Validate Modal Environment

Before running live (non-mocked) tests:

bash
# Set Modal credentials (stores a token; use this if none is configured yet)
uv run modal token set
uv run modal secret list          # Verify LITELLM secret exists

# Check the secret, then a specific key within it
uv run fleet-rlm check-secret
uv run fleet-rlm check-secret-key --key DSPY_LLM_API_KEY

# Verify sandbox connectivity
uv run python scripts/test_modal_connection.py
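
Live tests can also guard themselves so they skip cleanly when no credentials are present. The sketch below is an assumption about how such a guard might look, not part of fleet-rlm: `has_modal_credentials` is a hypothetical helper that checks the `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables (the same ones the CI workflow sets) and falls back to looking for Modal's `~/.modal.toml` config file. It only detects that credentials are present, not that they are valid.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def has_modal_credentials(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Return True if Modal credentials appear to be configured.

    Hypothetical helper: it checks presence only, never validity.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # Token pair supplied through the environment (as in CI).
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return True

    # Token stored by the Modal CLI in the user's config file.
    return (home / ".modal.toml").is_file()
```

A live test module could then declare `pytestmark = pytest.mark.skipif(not has_modal_credentials(), reason="Modal credentials not configured")` so that offline runs report a skip instead of a failure.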

Writing New Tests

Integration Test Pattern

python
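from unittest.mock import MagicMock

# Import path assumed from the patch target below; adjust if it differs.
from fleet_rlm.interpreter import ModalInterpreter


def test_feature(monkeypatch):
    """Test with mocked Modal sandbox."""
    # Placeholder mock: a real test must configure it so the fake sandbox
    # returns the SUBMIT payload asserted on below.
    mock_modal = MagicMock()
    # Mock Modal to avoid cloud dependency
    monkeypatch.setattr("fleet_rlm.interpreter.modal", mock_modal)

    interp = ModalInterpreter(timeout=60)
    interp.start()
    try:
        result = interp.execute('x = 42\nSUBMIT(answer=x)')
        assert result.answer == 42
    finally:
        interp.shutdown()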

Benchmark Pattern

python
# The `benchmark` fixture is provided by the pytest-benchmark plugin.
def test_benchmark_chunking(benchmark):
    """Benchmark chunking throughput."""
    from fleet_rlm.chunking import chunk_by_size
    text = "x" * 100_000
    result = benchmark(chunk_by_size, text, 1000, 100)
    assert len(result) > 0

Key points:

  • Access RLM results via result.field_name (dot notation), not result["field"]
  • Always call interp.shutdown() in a finally block
  • Use monkeypatch to mock Modal/DSPy for offline tests
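
The key points above can be combined into a reusable helper. The following is a minimal sketch under stated assumptions: `FakeInterpreter` is a hypothetical stand-in written for this example (it is not fleet-rlm's `ModalInterpreter` and does not execute any code), and `running` is a hypothetical context manager that mirrors the start / try / finally / shutdown pattern.

```python
from contextlib import contextmanager
from types import SimpleNamespace


class FakeInterpreter:
    """Hypothetical stand-in exposing start/execute/shutdown."""

    def __init__(self):
        self.started = False
        self.shut_down = False

    def start(self):
        self.started = True

    def execute(self, code):
        if not self.started:
            raise RuntimeError("interpreter not started")
        # Pretend the sandbox ran the code and called SUBMIT(answer=42).
        return SimpleNamespace(answer=42)

    def shutdown(self):
        self.shut_down = True


@contextmanager
def running(interp):
    """Start the interpreter and always shut it down afterwards."""
    interp.start()
    try:
        yield interp
    finally:
        interp.shutdown()


def run_example():
    interp = FakeInterpreter()
    with running(interp) as live:
        result = live.execute("x = 42\nSUBMIT(answer=x)")
    # Fields are read with dot notation, as in the key points above.
    return result.answer, interp.shut_down
```

Because the shutdown call sits in a `finally` clause, it runs even when the body of the `with` block raises. The test inventory mentions an `__enter__`/`__exit__` protocol for ModalInterpreter, so a direct `with` statement on the real class may make a wrapper like this unnecessary; that is not confirmed here.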

Evaluation Metrics

| Metric | Target | Description |
|--------|--------|-------------|
| Iteration efficiency | < 2x optimal | Steps taken / optimal steps |
| Tool call success rate | > 95% | Successful invocations / attempts |
| Sandbox timeout rate | < 5% | Timeouts / total runs |
| SUBMIT usage | High | % of steps using SUBMIT vs print |
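
The ratios in the table can be computed from per-run records. This is a sketch only: the `RunRecord` fields and the `summarize` function are hypothetical names chosen for this example, and fleet-rlm may record runs differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    """Hypothetical per-run record; field names are assumptions."""
    steps_taken: int
    optimal_steps: int
    tool_calls: int
    tool_calls_ok: int
    submit_steps: int
    timed_out: bool


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate the four metrics from the table over a list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    steps = sum(r.steps_taken for r in runs)
    optimal = sum(r.optimal_steps for r in runs)
    if steps == 0 or optimal == 0:
        raise ValueError("runs must contain at least one step")
    calls = sum(r.tool_calls for r in runs)
    calls_ok = sum(r.tool_calls_ok for r in runs)
    return {
        # Steps taken / optimal steps (target: below 2.0)
        "iteration_efficiency": steps / optimal,
        # Successful invocations / attempts (target: above 0.95)
        "tool_call_success_rate": calls_ok / calls if calls else 1.0,
        # Timeouts / total runs (target: below 0.05)
        "sandbox_timeout_rate": sum(r.timed_out for r in runs) / len(runs),
        # Share of steps that used SUBMIT
        "submit_usage": sum(r.submit_steps for r in runs) / steps,
    }
```

This version pools counts across runs before dividing, so long runs weigh more than short ones. Averaging per-run ratios instead would be an equally defensible choice; the source does not say which one is intended.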

Baseline Expectations

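| Task Type | Max Iterations | Typical Steps | Max Duration |
|-----------|----------------|---------------|--------------|
| Simple calculation | 5 | 1-2 | 10s |
| Text search | 10 | 2-4 | 30s |
| Code analysis | 20 | 5-10 | 60s |
| Multi-file exploration | 30 | 10-15 | 120s |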
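
One way to use these limits in a regression test is to encode them as data and compare each run against them. The sketch below assumes that approach; `BASELINES`, its dictionary keys, and `baseline_violations` are hypothetical names for this example, while the numeric limits are copied from the table.

```python
from typing import Dict, List, NamedTuple


class Baseline(NamedTuple):
    max_iterations: int
    max_duration_s: float


# Limits copied from the table; the keys are made up for this example.
BASELINES: Dict[str, Baseline] = {
    "simple_calculation": Baseline(5, 10),
    "text_search": Baseline(10, 30),
    "code_analysis": Baseline(20, 60),
    "multi_file_exploration": Baseline(30, 120),
}


def baseline_violations(task_type: str, iterations: int, duration_s: float) -> List[str]:
    """Return readable violations; an empty list means within baseline."""
    limit = BASELINES[task_type]
    problems = []
    if iterations > limit.max_iterations:
        problems.append(f"iterations {iterations} > {limit.max_iterations}")
    if duration_s > limit.max_duration_s:
        problems.append(f"duration {duration_s}s > {limit.max_duration_s}s")
    return problems
```

Returning a list of messages rather than a boolean lets a test assert `baseline_violations(...) == []` and still show which limit was exceeded when the assertion fails.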

CI Integration

yaml
# .github/workflows/ci.yml
- name: Unit Tests
  run: uv run pytest tests/ -v --ignore=tests/test_rlm_integration.py

- name: Integration Tests (main only)
  if: github.ref == 'refs/heads/main'
  env:
    MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
    MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
  run: uv run pytest tests/test_rlm_integration.py -v

Troubleshooting

See the rlm-debug skill for comprehensive diagnostics.