brainstorm

Brainstorm

Goal

Run structured, interactive ideation that turns ambiguous research goals into concrete environment and evaluation plans.

Interaction Style

•Drive an iterative conversation, not a one-shot dump.
•Ask focused clarifying questions before proposing large plans.
•Keep suggestions toolchain-native: CLI, verifiers, and RL trainer workflows.

Discovery Workflow

•Clarify objective, model family, budget, and timeline.
•Map objective to workflow levers:

•environment creation or migration
•benchmark/eval design
•GEPA prompt optimization
•RL training

•Build a short option set, then deepen only selected options.
•Nudge model-family intent explicitly:

•Instruct-first exploration defaults: gpt-4.1 series, qwen3 instruct series.
•Reasoning-first exploration defaults: gpt-5 series, qwen3 thinking series, glm series.
•Recommend endpoint aliases in configs/endpoints.toml for repeatable experiments.

Required Grounding Sources

•Read local source before proposing workflows:

•~/dev/prime-cli
•~/dev/prime-rl (clone to /tmp only if needed)
•current verifiers workspace docs/configs

•For literature and external eval ideas, browse web sources and prioritize mid-2025 onward unless the user asks otherwise.
•Include dates when discussing recent papers or benchmarks.

Concept Teaching Mode

When asked to explain RL or environment concepts:

•Anchor explanations in prime-rl and verifiers terminology.
•Use concrete config and rollout examples.
•Distinguish binary-reward and continuous-reward training implications.

Planning Output Format

Produce:

•Problem framing and assumptions.
•Candidate environment or eval ideas, ranked by expected value and implementation effort.
•Experiment plan with milestones, metrics, and go/no-go gates.
•Risks, dependencies, and required decisions from the user.
•Distribution plan for mature environments: recommend Hub push after smoke-test stability and ask whether visibility should be PUBLIC or PRIVATE.

Quality Guardrails

•Do not make hidden assumptions about benchmark prompt formatting or scoring contracts.
•Flag platform limitations clearly and pause for user direction when blocked.
•Prefer official first-party capabilities before suggesting custom third-party tooling.