MDP Designer
Use this skill when a task requires changing what the policy sees (observations), optimizes (rewards), how goals are produced (commands/targets), how episodes end (terminations), or how physics/assets are perturbed (randomization).
This skill is framework-agnostic: it applies to Gymnasium-style envs, simulator-backed RL envs, and custom training loops as long as there is a well-defined observation/reward/termination interface.
What This Skill Produces
- •A concrete list of files/functions to edit and the expected dataflow (config → env construction → MDP terms → rollout buffers/logging).
- •Safe, minimal code changes that follow the target codebase’s existing patterns and abstractions.
- •A sanity-check plan (smoke run / unit test / invariant checks) to catch silent failures.
Operating Procedure
- •
Locate the term boundary
- •Find where the env computes and returns the core MDP outputs (commonly
obs,reward,terminated/truncated,info). - •Identify whether the change belongs to:
- •Observations (feature construction, normalization, history/stacking, sensors)
- •Rewards (component definition, scaling, aggregation, per-component stats)
- •Terminations (done logic, timeouts, safety constraints, reset conditions)
- •Goals/commands (how targets are generated, curricula, reference trajectories)
- •Randomization (domain randomization, noise injection, parameter perturbations)
- •Prefer changing the smallest “term module” rather than scattering logic across the env step loop.
- •Find where the env computes and returns the core MDP outputs (commonly
- •
Follow naming + logging conventions
- •Ensure new reward components and diagnostics use stable, collision-free keys.
- •Prefer emitting scalar or low-dimensional summaries (means, mins/maxes, rates) rather than raw tensors.
- •If the codebase has a logger convention (e.g.,
info["stats"], “episode/*”, TensorBoard keys), follow it exactly so metrics surface without extra glue.
- •
Make the change “config-driven”
- •Add config knobs for weights, toggles, thresholds, and schedules (curriculum, annealing).
- •Preserve backwards-compatible defaults unless the task explicitly wants behavior changes.
- •Keep config names and units explicit (meters vs centimeters, radians vs degrees).
- •
Guardrails
- •Validate shapes and dtypes at module boundaries (single env vs vectorized env; CPU vs GPU tensors).
- •Ensure rewards are finite and well-scaled; clamp or safe-divide where division is possible.
- •Keep termination conditions mutually consistent (avoid contradictory “never terminate” or “always terminate” behavior).
- •Avoid creating hidden state unless you reset it correctly on episode reset.
- •
Minimal verification
- •Add a fast smoke test (instantiate env, step N times, reset a few times) or reuse an existing test harness.
- •Confirm
obsshapes match policy expectations, rewards are finite, and terminations occur at plausible rates. - •Sanity-check key metrics move in the expected direction (e.g., increasing survival reward should not decrease episode length).
Common Pitfalls Checklist
- •Observation mismatch between env output and policy input wiring (missing keys, wrong order, wrong normalization).
- •Reward component is computed but never added to the scalar reward, or its weight has the wrong sign.
- •Reward magnitudes are off by orders of magnitude, causing value loss instability or saturated policies.
- •Diagnostics are emitted under unexpected keys so they never appear in logs.
- •Termination triggers too frequently (learning collapses) or never (episodes never reset).
- •Randomization is applied after values are cached/compiled, resulting in no effect.
How To Adapt To Any Codebase
- •Start from the rollout boundary: where a batch of transitions is produced for training.
- •Trace backward to the source of each term: observation construction, reward aggregation, termination checks, goal generation.
- •Prefer “one responsibility per module” so term logic is testable without running long rollouts.
- •Keep a short mapping doc: config knob → code location → logged metric.
Example Prompts (Copy/Paste)
- •“Add a new reward term that penalizes lateral slip and log its component stats.”
- •“Change termination to end the episode when the agent tilts beyond a threshold, configurable via config.”
- •“Add an optional exteroceptive observation (e.g., height samples) behind a config flag.”
- •“Refactor reward computation into components with clear scaling and unit tests for each component.”