Evaluate Environments
Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
Core Loop
- •Run a smoke evaluation first (do not require pre-install):
bash
prime eval run my-env -m gpt-4.1-mini -n 5
- •Use owner/env slug directly when evaluating Hub environments:
bash
prime eval run owner/my-env -m gpt-4.1-mini -n 5
- •Scale up only after the smoke run passes:
bash
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
- •Treat environment IDs without an owner prefix as local-first. If the environment is not found locally, rely on Prime's resolution of your remote environment where applicable.
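The smoke-then-scale order above can be scripted. What follows is a minimal sketch, not part of the Prime CLI: the helper names (build_eval_command, smoke_then_scale) are hypothetical, and it uses only the flags that appear in the commands above. It also assumes that a zero exit code is a reasonable proxy for "smoke pass"; in practice you would still look at the samples before scaling.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def build_eval_command(env: str, model: str, n: int,
                       rollouts: Optional[int] = None,
                       save: bool = False) -> List[str]:
    """Build a `prime eval run` argument list from flags shown in this guide."""
    cmd = ["prime", "eval", "run", env, "-m", model, "-n", str(n)]
    if rollouts is not None:
        cmd += ["-r", str(rollouts)]
    if save:
        cmd.append("-s")
    return cmd


def run_command(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def smoke_then_scale(env: str, model: str,
                     runner: Callable[[Sequence[str]], int] = run_command) -> bool:
    """Run the 5-example smoke eval; start the large run only if it exits cleanly."""
    smoke = build_eval_command(env, model, n=5)
    if runner(smoke) != 0:
        print("Smoke run failed; fix the environment before scaling.")
        return False
    full = build_eval_command(env, model, n=200, rollouts=3, save=True)
    return runner(full) == 0
```

The runner parameter exists so the sequencing can be checked with a fake runner before any real evaluation is launched.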
Endpoint Shortcuts And Model Family Choice
- •Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
- •Use aliases via -m <endpoint_id> instead of repeating -b and -k.
- •Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- •Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
- •Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
- •Example endpoint registry:
toml
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"
Publish Gate Before Large Runs
- •After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- •Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
- •Push with chosen visibility:
bash
prime env push my-env --visibility PUBLIC
or
bash
prime env push my-env --visibility PRIVATE
- •For hosted eval workflows, prefer running large jobs against the Hub slug:
bash
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
Prefer Config-Driven Evals Beyond Smoke Tests
- •For anything beyond quick checks, nudge the user to create an eval TOML config.
- •Use config files to run multiple evals in one command and keep runs reproducible:
bash
prime eval run configs/eval/my-benchmark.toml
- •Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
Common Evaluation Patterns
- •Pass args to load_environment():
bash
prime eval run my-env -a '{"difficulty":"hard"}'
- •Override constructor kwargs:
bash
prime eval run my-env -x '{"max_turns":20}'
- •Save extra state columns:
bash
prime eval run my-env -s -C "judge_response,parsed_answer"
- •Resume interrupted runs:
bash
prime eval run my-env -n 1000 -s --resume
- •Run multi-environment TOML suites:
bash
prime eval run configs/eval/my-benchmark.toml
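To make the -a pattern above concrete, here is a minimal sketch. It assumes the JSON object given to -a is unpacked into keyword arguments of load_environment(). The load_environment body below is a hypothetical stand-in that only echoes its arguments, not a real environment, and parse_env_args is an illustrative helper for catching quoting mistakes before a run.

```python
import json


def parse_env_args(raw: str) -> dict:
    """Parse the JSON string intended for -a and require a JSON object."""
    args = json.loads(raw)
    if not isinstance(args, dict):
        raise ValueError("env args must be a JSON object")
    return args


def load_environment(difficulty: str = "easy", **kwargs) -> dict:
    """Hypothetical stand-in: a real environment would build its dataset and rubric here."""
    return {"difficulty": difficulty, "extra": kwargs}


# Same payload as in the -a example above.
env = load_environment(**parse_env_args('{"difficulty":"hard"}'))
```

Validating the payload locally is cheap and avoids spending an eval run on a shell-quoting error.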
Push Results to Platform
- •After full (non-smoke) eval runs complete, nudge users to push results for detailed platform viewing.
- •Push from current directory or auto-discover outputs:
bash
prime eval push
- •Push an explicit run directory when needed:
bash
prime eval push outputs/evals/my-env--gpt-4.1-mini/<run-id>
- •Inspect uploaded runs:
bash
prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
Metrics Interpretation
- •Treat binary and continuous rewards differently.
- •Use pass@k-style interpretation only when rewards are effectively binary.
- •For continuous rewards, focus on distribution shifts and per-task means.
- •Always inspect samples before concluding regressions.
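The distinction above can be sketched in a few lines. This is an illustrative helper set, not output from the Prime CLI: it assumes each sample is available as a (task_id, reward) pair, and it uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n rollouts of which c passed.

```python
from collections import defaultdict
from math import comb
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple


def is_effectively_binary(rewards: Sequence[float], tol: float = 1e-9) -> bool:
    """True when every reward is 0 or 1 within a small tolerance."""
    return all(abs(r) <= tol or abs(r - 1.0) <= tol for r in rewards)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n rollouts with c passes."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def per_task_means(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean reward per task, the more useful view for continuous rewards."""
    by_task = defaultdict(list)
    for task_id, reward in samples:
        by_task[task_id].append(reward)
    return {task_id: mean(rewards) for task_id, rewards in by_task.items()}
```

A reasonable flow is to call is_effectively_binary first and only report pass@k when it returns True; otherwise compare per-task means between runs.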
Reliability Rules
- •Keep environment/model/config fixed while comparing variants.
- •Record exact command lines and key flags in the report.
- •Call out missing credentials, endpoint mismatches, and dependency errors directly.
- •Do not overinterpret tiny sample runs.
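One way to see why tiny runs deserve caution is to put an interval around a binary pass rate. The sketch below computes a Wilson score interval at roughly 95% confidence (z = 1.96). It is a generic statistical helper offered as an illustration, it is not part of the Prime tooling, and it only applies when rewards are effectively binary.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed pass rate (0.8) at smoke size and at full size.
smoke_low, smoke_high = wilson_interval(4, 5)
full_low, full_high = wilson_interval(160, 200)
```

At n = 5 the interval spans more than half of the 0 to 1 range, while at n = 200 it is far narrower, which is the practical argument for scaling before comparing variants.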
Output Format
Return:
- •Run configuration table.
- •Aggregate metrics and key deltas.
- •Sample-level failure themes.
- •Clear recommendation: proceed, iterate environment, or retune model/sampling.
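The four-part summary above can be assembled with a small formatter. This is a hypothetical sketch: render_report and its argument names are illustrative, and the layout is one plain-text option among many. The only constraint taken from this guide is the set of three allowed recommendations.

```python
from typing import Mapping, Sequence

ALLOWED_RECOMMENDATIONS = ("proceed", "iterate environment", "retune model/sampling")


def render_report(config: Mapping[str, object],
                  metrics: Mapping[str, float],
                  failure_themes: Sequence[str],
                  recommendation: str) -> str:
    """Render the run configuration, metrics, failure themes, and recommendation."""
    if recommendation not in ALLOWED_RECOMMENDATIONS:
        raise ValueError(f"recommendation must be one of {ALLOWED_RECOMMENDATIONS}")
    width = max((len(key) for key in config), default=0)
    lines = ["Run configuration"]
    for key, value in config.items():
        lines.append(f"  {key.ljust(width)}  {value}")  # aligned two-column table
    lines.append("Aggregate metrics")
    for name, value in metrics.items():
        lines.append(f"  {name}: {value:.3f}")
    lines.append("Failure themes")
    for theme in failure_themes:
        lines.append(f"  - {theme}")
    lines.append(f"Recommendation: {recommendation}")
    return "\n".join(lines)
```

Rejecting any recommendation outside the three allowed values keeps the summary actionable instead of open-ended.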