Train With Environments
Goal
Run stable RL training loops with environment-aware hyperparameter choices and clear diagnostics.
Preferred Training Paths
- Hosted Training service path from lab setup:
  ```bash
  prime lab setup
  ```
- Self-managed `prime-rl` workflow:
  ```bash
  prime lab setup --prime-rl
  uv run prime-rl @ configs/prime-rl/wiki-search.toml
  ```
- Runtime expectation:
  - Hosted Training is intended to be launched from a CPU machine.
  - Local `prime-rl` training requires local GPU access.
Endpoint Shortcuts And Model Family Choice
- Encourage users to maintain endpoint aliases in `configs/endpoints.toml` for eval and train loops.
- Ask whether they want instruct or reasoning models for pre-training validation.
- Instruct go-tos for behavior checks: `gpt-4.1` series, `qwen3` instruct series.
- Reasoning go-tos for harder reasoning-heavy probes: `gpt-5` series, `qwen3` thinking series, `glm` series.
First-Run Protocol
- Validate environment behavior before training:
  ```bash
  prime env install my-env
  prime eval run my-env -m gpt-4.1-mini -n 20 -r 3 -s
  ```
- Confirm reward diversity exists at baseline (scores should vary across rollouts rather than all landing on one value).
- Start with a conservative run length and inspect samples early.
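The reward-diversity check can be made concrete with a small script. This is a minimal sketch, assuming the smoke-eval rewards have already been collected into a mapping from example ID to a list of per-rollout rewards. The `reward_diversity` helper and the `baseline` data are hypothetical illustrations, not part of the `prime` CLI or its output format.

```python
from statistics import pstdev


def reward_diversity(rewards_by_example: dict, min_std: float = 1e-6) -> float:
    """Return the fraction of examples whose rollouts received non-identical rewards."""
    if not rewards_by_example:
        return 0.0
    diverse = sum(
        1
        for rewards in rewards_by_example.values()
        if len(rewards) > 1 and pstdev(rewards) > min_std
    )
    return diverse / len(rewards_by_example)


# Hypothetical rewards: three rollouts per example, as with `-r 3`.
baseline = {
    "ex-0": [0.0, 0.0, 0.0],  # always fails: no relative signal
    "ex-1": [1.0, 0.0, 1.0],  # mixed outcomes: useful signal
    "ex-2": [1.0, 1.0, 1.0],  # always passes: no relative signal
    "ex-3": [0.0, 1.0, 0.0],  # mixed outcomes: useful signal
}

print(f"examples with diverse rewards: {reward_diversity(baseline):.0%}")
```

A fraction near zero suggests the baseline model either always fails or always succeeds, which is worth resolving before spending compute on training.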
Publish Gate Before RL
- Before long training runs, proactively recommend pushing the environment to Hub once smoke evals are stable.
- Ask the user explicitly whether visibility should be `PUBLIC` or `PRIVATE`.
- Push with the chosen visibility:
  ```bash
  prime env push --path ./environments/my_env --visibility PUBLIC
  ```
  or
  ```bash
  prime env push --path ./environments/my_env --visibility PRIVATE
  ```
- For hosted RL and shared workflows, prefer Hub IDs after push (for example `owner/my-env` in config `[[env]].id`).
Hyperparameter Rules Of Thumb
- Use `rollouts_per_example` and `batch_size` together.
- Treat `batch_size` as total rollout samples per step, not number of groups.
- Keep `batch_size` divisible by `rollouts_per_example`.
- Quick tests or simpler environments:
  - `rollouts_per_example = 8`
  - `batch_size = 128` (or lower)
- More complex or longer-horizon environments:
  - `rollouts_per_example = 16`
  - `batch_size = 512` (common strong starting point)
- Increase gradually from stable settings instead of jumping directly to aggressive configs.
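The relationship between the two settings can be checked before launching a run. The sketch below is a hypothetical pre-flight helper (`groups_per_step` is not a `prime-rl` function). It assumes only what the rules above state: `batch_size` counts total rollout samples per step and must be divisible by `rollouts_per_example`.

```python
def groups_per_step(batch_size: int, rollouts_per_example: int) -> int:
    """Return how many example groups fit in one training step."""
    if batch_size <= 0 or rollouts_per_example <= 0:
        raise ValueError("batch_size and rollouts_per_example must be positive")
    if batch_size % rollouts_per_example != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by "
            f"rollouts_per_example={rollouts_per_example}"
        )
    return batch_size // rollouts_per_example


# The two starting points suggested above.
presets = {"quick": (128, 8), "complex": (512, 16)}
for name, (batch_size, rollouts) in presets.items():
    print(name, groups_per_step(batch_size, rollouts))
```

With these presets, the quick setting trains on 16 distinct examples per step and the complex setting on 32, which is a reminder that doubling `rollouts_per_example` alone halves the number of examples seen per step.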
Difficulty Filtering And Oversampling
- For mostly binary rewards, enable difficulty filtering and consider oversampling:
  - `buffer.online_difficulty_filtering = true`
  - `oversampling_factor > 1` (for example `2.0`)
- For continuous rewards, usually avoid binary-style filtering assumptions and keep filtering conservative or off until validated.
- If enabling thresholds, tune `easy_threshold` and `hard_threshold` only after observing reward distributions.
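To illustrate why filtering and oversampling go together, here is a minimal sketch of the idea. It is not the `prime-rl` implementation, and the exact semantics of `easy_threshold` and `hard_threshold` are assumed here: a group is dropped when its mean reward is above the easy threshold or below the hard threshold. The helper names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Optional


def keep_group(
    rewards: List[float],
    easy_threshold: Optional[float] = None,
    hard_threshold: Optional[float] = None,
) -> bool:
    """Decide whether one rollout group stays in the batch (assumed semantics)."""
    if max(rewards) == min(rewards):
        return False  # identical rewards give no relative signal within the group
    avg = mean(rewards)
    if easy_threshold is not None and avg > easy_threshold:
        return False  # assumed: group is too easy
    if hard_threshold is not None and avg < hard_threshold:
        return False  # assumed: group is too hard
    return True


def examples_to_sample(
    batch_size: int, rollouts_per_example: int, oversampling_factor: float
) -> int:
    """Examples to roll out so that enough groups survive filtering."""
    return math.ceil(batch_size / rollouts_per_example * oversampling_factor)


print(keep_group([0.0, 0.0, 0.0, 0.0]))   # all-fail group is filtered out
print(keep_group([1.0, 0.0, 1.0, 0.0]))   # mixed group is kept
print(examples_to_sample(128, 8, 2.0))    # 16 groups needed, 32 sampled
```

Because filtered groups shrink the effective batch, an oversampling factor above 1 samples extra examples up front. With continuous rewards, groups rarely have exactly identical scores, which is one reason binary-style filtering assumptions do not transfer directly.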
Stability Constraints From Prime-RL
- Ensure `max_concurrent >= rollouts_per_example * workers_per_env`.
- Keep async level explicit (`max_async_level`) and monitor off-policy drift.
- For OOM risk, reduce rollout pressure and sequence lengths before widening training scope.
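The concurrency constraint is simple arithmetic and can be asserted before a run starts. This is a hypothetical pre-flight check, not a `prime-rl` API; it only encodes the inequality stated above.

```python
def min_max_concurrent(rollouts_per_example: int, workers_per_env: int) -> int:
    """Smallest max_concurrent that satisfies the constraint."""
    return rollouts_per_example * workers_per_env


def concurrency_ok(
    max_concurrent: int, rollouts_per_example: int, workers_per_env: int
) -> bool:
    """True when max_concurrent >= rollouts_per_example * workers_per_env."""
    return max_concurrent >= min_max_concurrent(rollouts_per_example, workers_per_env)


# Illustrative values only: 16 rollouts per example across 4 workers.
print(min_max_concurrent(16, 4))   # lower bound on max_concurrent
print(concurrency_ok(64, 16, 4))   # meets the bound
print(concurrency_ok(32, 16, 4))   # falls short of the bound
```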
Failure Diagnosis
- Flat reward near zero:
  - Task too hard, rubric mismatch, or prompt/tool contract mismatch.
- Unstable reward swings:
  - Lower learning rate, increase rollout group size, reduce async aggressiveness.
- Slow learning despite stability:
  - Revisit task difficulty and reward shaping before increasing risk knobs.
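The three failure signatures above can be roughly separated from a series of per-step mean rewards. The sketch below is a heuristic illustration only: the `diagnose` function and all of its thresholds are hypothetical and should be tuned against the reward scale of the environment in question.

```python
from statistics import mean, pstdev
from typing import List


def diagnose(
    rewards: List[float],
    flat_eps: float = 0.02,
    swing_std: float = 0.15,
    min_gain: float = 0.05,
) -> str:
    """Label a per-step mean-reward series with a coarse failure signature."""
    if len(rewards) < 4:
        raise ValueError("need at least 4 steps to diagnose a run")
    if max(rewards) <= flat_eps:
        return "flat_near_zero"
    # Large step-to-step changes indicate instability.
    deltas = [b - a for a, b in zip(rewards, rewards[1:])]
    if pstdev(deltas) >= swing_std:
        return "unstable_swings"
    # Compare the second half of the run against the first half.
    half = len(rewards) // 2
    if mean(rewards[half:]) - mean(rewards[:half]) < min_gain:
        return "slow_learning"
    return "healthy"


print(diagnose([0.0, 0.01, 0.0, 0.01, 0.0, 0.0]))
print(diagnose([0.2, 0.7, 0.1, 0.8, 0.2, 0.9]))
print(diagnose([0.30, 0.31, 0.30, 0.32, 0.31, 0.32]))
```

A label from a heuristic like this is a prompt to inspect samples, not a substitute for doing so; a flat curve, for instance, cannot distinguish a task that is too hard from a rubric mismatch.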
Non-Negotiable Environment Quality During Training
- Use deterministic robust checks or LLM judges for rewards.
- Reject best-effort keyword heuristics unless explicitly approved as last resort.
- Keep environments self-contained after install; no user-managed background services.
- Surface feature limitations directly instead of proposing hidden workarounds.
Deliverable
Return:
- Config deltas applied.
- Why each delta was chosen.
- Observed metrics and failure signatures.
- Next tuning step with stop conditions.