Create Environments

Goal

Build production-quality verifiers environments that work immediately in the Prime ecosystem: install, load, evaluate, and train without hidden setup.

Start With Ecosystem Paths

•Prefer ecosystem-native setup before custom scaffolding.
•Use this default loop:

bash

prime env init my-env
prime env install my-env
prime eval run my-env -m gpt-4.1-mini -n 5

•Prefer an existing environment as a starting point when possible:

bash

prime env list --search "keyword"
prime env info owner/name
prime env install owner/name

•For repository examples, use repo install when available:

bash

prime env install math-python --from-repo

•Encourage users to keep endpoint aliases in configs/endpoints.toml so smoke tests can switch models quickly.
•Ask users whether they want instruct or reasoning models for validation.
•Instruct-first smoke choices: gpt-4.1 series, qwen3 instruct series.
•Reasoning validation choices: gpt-5 series, qwen3 thinking series, glm series.

Build Modes

1. Build From Scratch

•Define task contract first: prompt shape, allowed tools, stop conditions, rubric outputs, metrics.
•Select the smallest correct base class:

•SingleTurnEnv for one-response tasks.
•MultiTurnEnv for custom interaction loops.
•ToolEnv or MCPEnv for stateless tools.
•StatefulToolEnv for per-rollout resources.

•Implement load_environment(...) -> vf.Environment with explicit arguments.
•Add pyproject.toml defaults in [tool.verifiers.eval] only when stable.
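
One way to pin down the task contract before touching the verifiers API is to write it as a small record. The sketch below is illustrative only: `TaskContract` and `suggested_base_class` are hypothetical names, not part of verifiers, and the class names are returned as plain strings taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical record of the decisions to settle before writing code."""

    prompt_shape: str                 # e.g. "system + single user question"
    allowed_tools: tuple[str, ...]    # empty tuple means the task uses no tools
    stop_conditions: tuple[str, ...]  # when a rollout ends
    rubric_outputs: tuple[str, ...]   # names of the reward functions
    metrics: tuple[str, ...] = ()     # extra logged values

    def suggested_base_class(self, per_rollout_state: bool = False,
                             custom_loop: bool = False) -> str:
        """Map the contract onto the base-class list above."""
        if self.allowed_tools:
            # Stateless tools fit ToolEnv (or MCPEnv); per-rollout
            # resources call for StatefulToolEnv.
            return "StatefulToolEnv" if per_rollout_state else "ToolEnv"
        return "MultiTurnEnv" if custom_loop else "SingleTurnEnv"
```

Filling in every field forces the open questions (tools, stop logic, rubric outputs) to surface before `load_environment(...)` is written.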

2. Port From Another Library, Project, or Paper

•Create a strict source-to-target mapping before coding:

•dataset rows and splits
•prompt rendering and role ordering
•tool I/O schema and stop logic
•scoring math and aggregation
•pass/fail thresholds and special cases

•Preserve one-to-one logical equivalence for what the model sees and what gets scored.
•Never resolve an ambiguous formatting decision by guessing. Ask the user to decide explicitly.
•Benchmark runtime and remove avoidable bottlenecks before handoff.
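
The equivalence requirement above can be spot-checked mechanically. The following is a minimal sketch, assuming the source project's prompt renderer and scorer can be called as plain Python functions; `find_port_mismatches`, the `reference_completion` key, and the callable signatures are all hypothetical and not taken from verifiers or any source library.

```python
from typing import Any, Callable, Iterable

Row = dict[str, Any]


def find_port_mismatches(
    rows: Iterable[Row],
    source_render: Callable[[Row], str],
    target_render: Callable[[Row], str],
    source_score: Callable[[Row, str], float],
    target_score: Callable[[Row, str], float],
    completion_key: str = "reference_completion",  # assumed field name
    tol: float = 1e-9,
) -> list[tuple[int, str]]:
    """Return (row_index, reason) for every row where the port diverges."""
    mismatches = []
    for i, row in enumerate(rows):
        # What the model sees must be identical, character for character.
        if source_render(row) != target_render(row):
            mismatches.append((i, "prompt"))
        # The same completion must receive the same score on both sides.
        completion = row[completion_key]
        if abs(source_score(row, completion) - target_score(row, completion)) > tol:
            mismatches.append((i, "score"))
    return mismatches
```

An empty result on a representative sample is evidence of equivalence, not proof. Any remaining mismatch is a candidate for the unresolved-decisions list in the final report.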

3. Start From Hub Environment

•Install or pull the closest baseline:

bash

prime env install owner/name
prime env pull owner/name -t ./tmp-env

•Keep proven interfaces stable unless a migration is deliberate and explicit.
•Re-run smoke evals after each major change.

Non-Negotiable Quality Rules

•Use deterministic, well-defined reward checks or LLM judges.
•Avoid best-effort heuristics, such as keyword-style checks, except as an explicit last resort with user sign-off.
•Make environments self-contained after install. Do not require users to run background servers before load_environment().
•Manage external resources inside the environment lifecycle.
•Validate required secrets in load_environment() via vf.ensure_keys(...).
•Surface feature limits directly. Do not ship hacky workarounds without explicit user approval.
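
As an illustration of a deterministic, well-defined reward check, here is a minimal sketch for a task whose answers are integers. The `Answer: <int>` output format and the function name are assumptions made for this example, and the function takes plain strings rather than whatever arguments a verifiers rubric passes to reward functions.

```python
import re

# The completion must end with "Answer: <integer>"; anything else is malformed.
_ANSWER_RE = re.compile(r"Answer:\s*(-?\d+)\s*$")


def exact_integer_reward(completion: str, answer: str) -> float:
    """Return 1.0 only when the final 'Answer:' line equals the reference."""
    match = _ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0  # malformed output is scored as wrong, never guessed at
    return 1.0 if int(match.group(1)) == int(answer) else 0.0
```

Unlike a keyword check, this either parses a well-formed final answer or returns zero, so the same completion always receives the same score.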

Verification Gate

Run these before claiming completion:

bash

prime env install my-env
prime eval run my-env -m gpt-4.1-mini -n 5
prime eval run my-env -m gpt-4.1-mini -n 50 -r 1 -s

If the environment is multi-turn or tool-heavy, also run with more rollouts:

bash

prime eval run my-env -m gpt-4.1-mini -n 30 -r 3 -s
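
For repeated use, the gate can be generated rather than retyped. This is a small sketch that only assembles the command strings shown in this section; `verification_gate` is a hypothetical helper, and it does not execute anything or add flags beyond those already listed.

```python
import shlex


def verification_gate(env_id: str, model: str = "gpt-4.1-mini",
                      tool_heavy: bool = False) -> list[str]:
    """Return the gate commands from this section as shell-ready strings."""
    base = ["prime", "eval", "run", env_id, "-m", model]
    commands = [
        ["prime", "env", "install", env_id],
        base + ["-n", "5"],
        base + ["-n", "50", "-r", "1", "-s"],
    ]
    if tool_heavy:
        # Extra pass for multi-turn or tool-heavy environments.
        commands.append(base + ["-n", "30", "-r", "3", "-s"])
    return [shlex.join(cmd) for cmd in commands]
```

Printing the returned list gives a checklist that can be pasted into the final report as the exact commands used.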

Publish Gate Before Large Evals Or Training

•After smoke tests pass and behavior is stable, recommend pushing to Hub before large evals or RL training.
•Ask the user explicitly whether visibility should be PUBLIC or PRIVATE.
•Use the command that matches their answer:

bash

prime env push my-env --visibility PUBLIC

bash

prime env push my-env --visibility PRIVATE

•For hosted or large-scale workflows, prefer running with the Hub slug after push:

bash

prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s

Synthetic Data

•Ask users for preferences on which LLMs to use for synthetic data generation and curation before implementation.
•Prefer generating synthetic data from raw source documents whenever possible instead of relying only on hand-authored prompts.
•Use LLM orchestration (planner/generator/validator loops) to improve sample quality and diversity.
•Use back-translation: start from complete materials and decompose them into incomplete tasks, criteria, or partial artifacts that the model must reconstruct.
•Use fan-out subtopic sampling from LLMs to expand coverage and avoid overfitting to a narrow slice of the domain.
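
The back-translation idea can be sketched without any LLM calls: take a complete document, hide one part, and keep the hidden part as the reference. The function name, the `[MISSING SECTION]` placeholder, and the prompt wording below are hypothetical choices for illustration.

```python
import random


def make_reconstruction_task(sections: list[str], rng: random.Random) -> dict:
    """Hide one section of a complete document and ask for it back."""
    if len(sections) < 2:
        raise ValueError("need at least two sections so some context remains")
    hidden = rng.randrange(len(sections))
    # Replace the chosen section with a placeholder; keep the rest verbatim.
    visible = [
        "[MISSING SECTION]" if i == hidden else text
        for i, text in enumerate(sections)
    ]
    return {
        "prompt": "Fill in the missing section:\n\n" + "\n\n".join(visible),
        "answer": sections[hidden],
        "hidden_index": hidden,
    }
```

Passing a seeded `random.Random` keeps dataset generation reproducible. In a fuller pipeline, a generator or validator model would likely rewrite the prompt and filter weak samples, which this sketch leaves out.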

Deliverable Format

Report:

•Environment ID and path.
•Exact install and eval commands used.
•Port-equivalence notes if migrated.
•Any unresolved user decisions that block strict fidelity.