MDP Designer

Use this skill when a task requires changing what the policy sees (observations), optimizes (rewards), how goals are produced (commands/targets), how episodes end (terminations), or how physics/assets are perturbed (randomization).

This skill is framework-agnostic: it applies to Gymnasium-style envs, simulator-backed RL envs, and custom training loops as long as there is a well-defined observation/reward/termination interface.

What This Skill Produces

•A concrete list of files/functions to edit and the expected dataflow (config → env construction → MDP terms → rollout buffers/logging).
•Safe, minimal code changes that follow the target codebase’s existing patterns and abstractions.
•A sanity-check plan (smoke run / unit test / invariant checks) to catch silent failures.

Operating Procedure

•
Locate the term boundary
- •Find where the env computes and returns the core MDP outputs (commonly obs, reward, terminated/truncated, info).
- •
  Identify whether the change belongs to:
  - •Observations (feature construction, normalization, history/stacking, sensors)
  - •Rewards (component definition, scaling, aggregation, per-component stats)
  - •Terminations (done logic, timeouts, safety constraints, reset conditions)
  - •Goals/commands (how targets are generated, curricula, reference trajectories)
  - •Randomization (domain randomization, noise injection, parameter perturbations)
- •Prefer changing the smallest “term module” rather than scattering logic across the env step loop.
•
Follow naming + logging conventions
- •Ensure new reward components and diagnostics use stable, collision-free keys.
- •Prefer emitting scalar or low-dimensional summaries (means, mins/maxes, rates) rather than raw tensors.
- •If the codebase has a logger convention (e.g., info["stats"], “episode/*”, TensorBoard keys), follow it exactly so metrics surface without extra glue.
•
Make the change “config-driven”
- •Add config knobs for weights, toggles, thresholds, and schedules (curriculum, annealing).
- •Preserve backwards-compatible defaults unless the task explicitly wants behavior changes.
- •Keep config names and units explicit (meters vs centimeters, radians vs degrees).
•
Guardrails
- •Validate shapes and dtypes at module boundaries (single env vs vectorized env; CPU vs GPU tensors).
- •Ensure rewards are finite and well-scaled; clamp or safe-divide where division is possible.
- •Keep termination conditions mutually consistent (avoid contradictory “never terminate” or “always terminate” behavior).
- •Avoid creating hidden state unless you reset it correctly on episode reset.
•
Minimal verification
- •Add a fast smoke test (instantiate env, step N times, reset a few times) or reuse an existing test harness.
- •Confirm obs shapes match policy expectations, rewards are finite, and terminations occur at plausible rates.
- •Sanity-check key metrics move in the expected direction (e.g., increasing survival reward should not decrease episode length).

Common Pitfalls Checklist

•Observation mismatch between env output and policy input wiring (missing keys, wrong order, wrong normalization).
•Reward component is computed but never added to the scalar reward, or its weight has the wrong sign.
•Reward magnitudes are off by orders of magnitude, causing value loss instability or saturated policies.
•Diagnostics are emitted under unexpected keys so they never appear in logs.
•Termination triggers too frequently (learning collapses) or never (episodes never reset).
•Randomization is applied after values are cached/compiled, resulting in no effect.

How To Adapt To Any Codebase

•Start from the rollout boundary: where a batch of transitions is produced for training.
•Trace backward to the source of each term: observation construction, reward aggregation, termination checks, goal generation.
•Prefer “one responsibility per module” so term logic is testable without running long rollouts.
•Keep a short mapping doc: config knob → code location → logged metric.

Example Prompts (Copy/Paste)

•“Add a new reward term that penalizes lateral slip and log its component stats.”
•“Change termination to end the episode when the agent tilts beyond a threshold, configurable via config.”
•“Add an optional exteroceptive observation (e.g., height samples) behind a config flag.”
•“Refactor reward computation into components with clear scaling and unit tests for each component.”