AgentSkillsCN

mdp-designer

设计并编辑 MDP 术语(观察值、奖励、终止条件、目标/命令、随机化),并将这些要素串联至配置与日志记录中。适用于通过优化强化学习环境的 MDP 定义,从而提升学习效果时使用。

SKILL.md
--- frontmatter
name: mdp-designer
description: Designs/edits MDP terms (observations, rewards, terminations, goals/commands, randomization) and wires them into configs and logging. Use when improving an RL environment’s MDP definition for better learning.

MDP Designer

Use this skill when a task requires changing what the policy sees (observations), optimizes (rewards), how goals are produced (commands/targets), how episodes end (terminations), or how physics/assets are perturbed (randomization).

This skill is framework-agnostic: it applies to Gymnasium-style envs, simulator-backed RL envs, and custom training loops as long as there is a well-defined observation/reward/termination interface.

What This Skill Produces

  • A concrete list of files/functions to edit and the expected dataflow (config → env construction → MDP terms → rollout buffers/logging).
  • Safe, minimal code changes that follow the target codebase’s existing patterns and abstractions.
  • A sanity-check plan (smoke run / unit test / invariant checks) to catch silent failures.

Operating Procedure

  1. Locate the term boundary

    • Find where the env computes and returns the core MDP outputs (commonly obs, reward, terminated/truncated, info).
    • Identify whether the change belongs to:
      • Observations (feature construction, normalization, history/stacking, sensors)
      • Rewards (component definition, scaling, aggregation, per-component stats)
      • Terminations (done logic, timeouts, safety constraints, reset conditions)
      • Goals/commands (how targets are generated, curricula, reference trajectories)
      • Randomization (domain randomization, noise injection, parameter perturbations)
    • Prefer changing the smallest “term module” rather than scattering logic across the env step loop.
  2. Follow naming + logging conventions

    • Ensure new reward components and diagnostics use stable, collision-free keys.
    • Prefer emitting scalar or low-dimensional summaries (means, mins/maxes, rates) rather than raw tensors.
    • If the codebase has a logger convention (e.g., info["stats"], “episode/*”, TensorBoard keys), follow it exactly so metrics surface without extra glue.
  3. Make the change “config-driven”

    • Add config knobs for weights, toggles, thresholds, and schedules (curriculum, annealing).
    • Preserve backwards-compatible defaults unless the task explicitly wants behavior changes.
    • Keep config names and units explicit (meters vs centimeters, radians vs degrees).
  4. Guardrails

    • Validate shapes and dtypes at module boundaries (single env vs vectorized env; CPU vs GPU tensors).
    • Ensure rewards are finite and well-scaled; clamp or safe-divide where division is possible.
    • Keep termination conditions mutually consistent (avoid contradictory “never terminate” or “always terminate” behavior).
    • Avoid creating hidden state unless you reset it correctly on episode reset.
  5. Minimal verification

    • Add a fast smoke test (instantiate env, step N times, reset a few times) or reuse an existing test harness.
    • Confirm obs shapes match policy expectations, rewards are finite, and terminations occur at plausible rates.
    • Sanity-check key metrics move in the expected direction (e.g., increasing survival reward should not decrease episode length).

Common Pitfalls Checklist

  • Observation mismatch between env output and policy input wiring (missing keys, wrong order, wrong normalization).
  • Reward component is computed but never added to the scalar reward, or its weight has the wrong sign.
  • Reward magnitudes are off by orders of magnitude, causing value loss instability or saturated policies.
  • Diagnostics are emitted under unexpected keys so they never appear in logs.
  • Termination triggers too frequently (learning collapses) or never (episodes never reset).
  • Randomization is applied after values are cached/compiled, resulting in no effect.

How To Adapt To Any Codebase

  • Start from the rollout boundary: where a batch of transitions is produced for training.
  • Trace backward to the source of each term: observation construction, reward aggregation, termination checks, goal generation.
  • Prefer “one responsibility per module” so term logic is testable without running long rollouts.
  • Keep a short mapping doc: config knob → code location → logged metric.

Example Prompts (Copy/Paste)

  • “Add a new reward term that penalizes lateral slip and log its component stats.”
  • “Change termination to end the episode when the agent tilts beyond a threshold, configurable via config.”
  • “Add an optional exteroceptive observation (e.g., height samples) behind a config flag.”
  • “Refactor reward computation into components with clear scaling and unit tests for each component.”