Evaluate Environments

Name: evaluate-environments
Rating: 92
Author: gustofied

Goal

Run reliable environment evaluations and produce actionable summaries, not raw logs.

Core Loop

•Run a smoke evaluation first; do not require the environment to be installed beforehand:

bash

prime eval run my-env -m gpt-4.1-mini -n 5

•Use owner/env slug directly when evaluating Hub environments:

bash

prime eval run owner/my-env -m gpt-4.1-mini -n 5

•Scale only after smoke pass:

bash

prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s

•Treat env ids without an owner prefix as local-first: look for the environment locally, and if it is not found, rely on Prime resolution for your remote env where applicable.
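
The smoke-then-scale sequence above can be wrapped in a small shell function so the large run never starts after a failed smoke run. This is a minimal sketch, not part of the Prime CLI: `smoke_then_scale` and the `PRIME_BIN` override are hypothetical names, and it assumes `prime eval run` exits with a non-zero status when a run fails.

```bash
# Hypothetical override so the flow can be dry-run (for example PRIME_BIN=echo).
PRIME_BIN="${PRIME_BIN:-prime}"

smoke_then_scale() {
  local env_id="$1" model="$2"

  # Smoke run first, with the same flags as the smoke command above.
  if ! "$PRIME_BIN" eval run "$env_id" -m "$model" -n 5; then
    echo "smoke run failed for $env_id; not scaling" >&2
    return 1
  fi

  # Scale only after the smoke run succeeded (assumes a non-zero exit on failure).
  "$PRIME_BIN" eval run "$env_id" -m "$model" -n 200 -r 3 -s
}
```

Call it as `smoke_then_scale owner/my-env gpt-4.1-mini`. A zero exit status only shows that the run completed, so the smoke results still need a manual look before trusting the large run.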

Endpoint Shortcuts And Model Family Choice

•Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
•Use aliases via -m <endpoint_id> instead of repeating -b and -k.
•Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
•Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
•Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
•Example endpoint registry:

toml

[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"
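
Because each registry entry names a credential in `key`, a quick preflight can catch missing credentials before a run starts. The following is a minimal sketch, assuming that `key` holds the name of an environment variable (as the values above suggest), that each entry uses a single `key = "NAME"` line, and that keys are exported in the calling shell. `missing_endpoint_keys` is a hypothetical helper, not a Prime command.

```bash
# Print the name of every key variable in the registry that is not exported.
missing_endpoint_keys() {
  local registry="${1:-configs/endpoints.toml}"
  local var

  # Extract the quoted value from each `key = "NAME"` line, de-duplicated.
  sed -n 's/^[[:space:]]*key[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$registry" |
    sort -u |
    while read -r var; do
      # printenv exits non-zero when the variable is not in the environment.
      printenv "$var" >/dev/null || echo "$var"
    done
}
```

Running `missing_endpoint_keys configs/endpoints.toml` prints nothing when every referenced variable is exported, and one variable name per line otherwise.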

Publish Gate Before Large Runs

•After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
•Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
•Push with chosen visibility:

bash

prime env push my-env --visibility PUBLIC

bash

prime env push my-env --visibility PRIVATE

•For hosted eval workflows, prefer running large jobs against the Hub slug:

bash

prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
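
To make the visibility choice explicit in scripts, the push can be wrapped so that it refuses to run without a valid value. This is a sketch under assumptions: `push_env` and `PRIME_BIN` are hypothetical names, and the only CLI call used is the `prime env push ... --visibility` form shown above.

```bash
# Hypothetical override so the command can be printed instead of run.
PRIME_BIN="${PRIME_BIN:-prime}"

push_env() {
  local env_id="$1" visibility="${2:-}"

  # Never default the visibility; the user has to choose it.
  case "$visibility" in
    PUBLIC|PRIVATE) ;;
    *)
      echo "visibility must be PUBLIC or PRIVATE (got '$visibility')" >&2
      return 2
      ;;
  esac

  "$PRIME_BIN" env push "$env_id" --visibility "$visibility"
}
```

For example, `push_env my-env PRIVATE` pushes privately, while `push_env my-env` fails with a message instead of picking a default.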

Prefer Config-Driven Evals Beyond Smoke Tests

•For anything beyond quick checks, nudge the user to create an eval TOML config.
•Use config files to run multiple evals in one command and keep runs reproducible:

bash

prime eval run configs/eval/my-benchmark.toml

•Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.

Common Evaluation Patterns

•Pass args to load_environment():

bash

prime eval run my-env -a '{"difficulty":"hard"}'

•Override constructor kwargs:

bash

prime eval run my-env -x '{"max_turns":20}'

•Save extra state columns:

bash

prime eval run my-env -s -C "judge_response,parsed_answer"

•Resume interrupted runs:

bash

prime eval run my-env -n 1000 -s --resume

•Run multi-environment TOML suites:

bash

prime eval run configs/eval/my-benchmark.toml
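
For long runs, the resume pattern above can be turned into a bounded retry loop. This is a minimal sketch, assuming that `prime eval run` exits with a non-zero status when a run is interrupted and that appending `--resume` to the same arguments continues it. `run_with_resume` and `PRIME_BIN` are hypothetical names.

```bash
# Hypothetical override so the loop can be exercised with a stub command.
PRIME_BIN="${PRIME_BIN:-prime}"

run_with_resume() {
  local max_attempts="$1"
  shift
  local attempt=1

  # First attempt: run the command exactly as given.
  "$PRIME_BIN" eval run "$@" && return 0

  # Later attempts: same arguments plus --resume, up to max_attempts in total.
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    echo "attempt $attempt of $max_attempts: resuming" >&2
    "$PRIME_BIN" eval run "$@" --resume && return 0
  done
  return 1
}
```

A call such as `run_with_resume 3 my-env -n 1000 -s` tries once and then resumes at most twice. Keeping `-s` in the arguments matches the resume example above.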

Push Results to Platform

•After full (non-smoke) eval runs complete, nudge users to push results so they can be viewed in detail on the platform.
•Push from current directory or auto-discover outputs:

bash

prime eval push

•Push an explicit run directory when needed:

bash

prime eval push outputs/evals/my-env--gpt-4.1-mini/<run-id>

•Inspect uploaded runs:

bash

prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>

Metrics Interpretation

•Treat binary and continuous rewards differently.
•Use pass@k-style interpretation only when rewards are effectively binary.
•For continuous rewards, focus on distribution shifts and per-task means.
•Always inspect samples before concluding regressions.
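
The two reward types can be summarized with two small helpers. This is a sketch under stated assumptions: the input is a hypothetical plain-text export with one `task_id reward` pair per line and one line per rollout (not a format Prime is known to emit), and a reward of 1 or more counts as a pass. `pass_at_k` uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k) per task, where n is the number of rollouts and c the number of passes, and then averages over tasks.

```bash
# Mean pass@k over tasks, for effectively binary rewards.
pass_at_k() {
  local k="$1"
  awk -v k="$k" '
    { n[$1]++; if ($2 >= 1) c[$1]++ }
    END {
      for (t in n) {
        if (n[t] < k) continue          # pass@k needs at least k rollouts per task
        p = 1.0                         # fewer than k failures: a pass is guaranteed
        if (n[t] - c[t] >= k) {
          # C(n-c, k) / C(n, k) as a running product, avoiding large factorials
          fail = 1.0
          for (i = n[t] - c[t] + 1; i <= n[t]; i++) fail *= 1 - k / i
          p = 1 - fail
        }
        total += p; tasks++
      }
      if (tasks) printf "%.4f\n", total / tasks
    }'
}

# Per-task mean reward, for continuous rewards.
per_task_mean() {
  awk '{ sum[$1] += $2; n[$1]++ }
       END { for (t in n) printf "%s %.4f\n", t, sum[t] / n[t] }' | sort
}
```

Both helpers read from standard input, for example `pass_at_k 2 < rewards.txt`, where `rewards.txt` is a placeholder file name. Comparing `per_task_mean` output between two runs shows which tasks moved, which an aggregate mean can hide.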

Reliability Rules

•When comparing variants, change one factor at a time and keep the rest of the environment, model, and config fixed.
•Record exact command lines and key flags in the report.
•Call out missing credentials, endpoint mismatches, and dependency errors directly.
•Do not overinterpret tiny sample runs; a 5-example smoke run shows that the pipeline works, not how well the model performs.
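
One low-effort way to record exact command lines is to log each command before running it. This is a minimal sketch: `logged_run`, the `RUN_LOG` variable, and the default file name `eval-commands.log` are all hypothetical choices, not Prime features.

```bash
# Append a timestamped copy of the command to a log, then run it unchanged.
logged_run() {
  local log="${RUN_LOG:-eval-commands.log}"

  # UTC timestamp, a tab, then the command with arguments joined by spaces.
  # Arguments that contain spaces lose their quoting in the log.
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$log"

  # Run the command; its exit status becomes the status of logged_run.
  "$@"
}
```

Prefixing a run, as in `logged_run prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s`, leaves a line in the log that can be pasted into the report's run configuration table.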

Output Format

Return:

•Run configuration table.
•Aggregate metrics and key deltas.
•Sample-level failure themes.
•Clear recommendation: proceed, iterate environment, or retune model/sampling.