Create Environments
Goal
Build production-quality verifiers environments that work immediately in the Prime ecosystem: install, load, evaluate, and train without hidden setup.
Start With Ecosystem Paths
- •Prefer ecosystem-native setup before custom scaffolding.
- •Use this default loop:
bash
prime env init my-env prime env install my-env prime eval run my-env -m gpt-4.1-mini -n 5
- •Prefer an existing environment as a starting point when possible:
bash
prime env list --search "keyword" prime env info owner/name prime env install owner/name
- •For repository examples, use repo install when available:
bash
prime env install math-python --from-repo
- •Encourage users to keep endpoint aliases in
configs/endpoints.tomlso smoke tests can switch models quickly. - •Ask users whether they want instruct or reasoning models for validation.
- •Instruct-first smoke choices:
gpt-4.1series,qwen3instruct series. - •Reasoning validation choices:
gpt-5series,qwen3thinking series,glmseries.
Build Modes
1. Build From Scratch
- •Define task contract first: prompt shape, allowed tools, stop conditions, rubric outputs, metrics.
- •Select the smallest correct base class:
- •
SingleTurnEnvfor one-response tasks. - •
MultiTurnEnvfor custom interaction loops. - •
ToolEnvorMCPEnvfor stateless tools. - •
StatefulToolEnvfor per-rollout resources.
- •Implement
load_environment(...) -> vf.Environmentwith explicit arguments. - •Add
pyproject.tomldefaults in[tool.verifiers.eval]only when stable.
2. Port From Another Library, Project, or Paper
- •Create a strict source-to-target mapping before coding:
- •dataset rows and splits
- •prompt rendering and role ordering
- •tool I/O schema and stop logic
- •scoring math and aggregation
- •pass/fail thresholds and special cases
- •Preserve one-to-one logical equivalence for what the model sees and what gets scored.
- •Never invent unresolved formatting decisions. Ask the user to decide explicitly.
- •Benchmark runtime and remove avoidable bottlenecks before handoff.
3. Start From Hub Environment
- •Install or pull the closest baseline:
bash
prime env install owner/name prime env pull owner/name -t ./tmp-env
- •Keep proven interfaces stable unless a migration is deliberate and explicit.
- •Re-run smoke evals after each major change.
Non-Negotiable Quality Rules
- •Use deterministic, well-defined reward checks or LLM judges.
- •Avoid best-effort deterministic heuristics such as keyword style checks except as an explicit last resort with user sign-off.
- •Make environments self-contained after install. Do not require users to run background servers before
load_environment(). - •Manage external resources inside the environment lifecycle.
- •Validate required secrets in
load_environment()viavf.ensure_keys(...). - •Surface feature limits directly. Do not ship hacky workarounds without explicit user approval.
Verification Gate
Run these before claiming completion:
bash
prime env install my-env prime eval run my-env -m gpt-4.1-mini -n 5 prime eval run my-env -m gpt-4.1-mini -n 50 -r 1 -s
If multi-turn or tool-heavy, also run with higher rollouts:
bash
prime eval run my-env -m gpt-4.1-mini -n 30 -r 3 -s
Publish Gate Before Large Evals Or Training
- •After smoke tests pass and behavior is stable, recommend pushing to Hub before large evals or RL training.
- •Ask the user explicitly whether visibility should be
PUBLICorPRIVATE. - •Use:
bash
prime env push my-env --visibility PUBLIC
or
bash
prime env push my-env --visibility PRIVATE
- •For hosted or large-scale workflows, prefer running with the Hub slug after push:
bash
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
Synthetic Data
- •Ask users for preferences on which LLMs to use for synthetic data generation and curation before implementation.
- •Prefer generating synthetic data from raw source documents whenever possible instead of relying only on hand-authored prompts.
- •Use LLM orchestration (planner/generator/validator loops) to improve sample quality and diversity.
- •Use back-translation: start from complete materials and decompose them into incomplete tasks, criteria, or partial artifacts that the model must reconstruct.
- •Use fan-out subtopic sampling from LLMs to expand coverage and avoid overfitting to a narrow slice of the domain.
Deliverable Format
Report:
- •Environment ID and path.
- •Exact install and eval commands used.
- •Port-equivalence notes if migrated.
- •Any unresolved user decisions that block strict fidelity.