Using Deep RL Meta-Skill

When to Use This Skill

Invoke this meta-skill when you encounter:

•RL Implementation: Implementing reinforcement learning algorithms (Q-learning, DQN, PPO, SAC, etc.)
•Agent Training: Training agents in environments (games, robotics, control systems)
•Sequential Decision-Making: Problems requiring learning from trial and error
•Policy Optimization: Learning policies that maximize cumulative rewards
•Game Playing: Building agents for Atari, board games, video games
•Robotics Control: Robot manipulation, locomotion, continuous control
•Reward-Based Learning: Learning from rewards, penalties, or feedback signals
•RL Debugging: Debugging training issues, agents not learning, reward problems
•Environment Setup: Creating custom RL environments, wrappers
•RL Evaluation: Evaluating agent performance, sample efficiency, generalization

This is the entry point for the deep-rl pack. It routes to 12 specialized skills based on problem characteristics.

Core Principle

Problem type determines algorithm family.

Reinforcement learning is not one algorithm. The correct approach depends on:

•Action Space: Discrete (button presses) vs Continuous (joint angles)
•Data Regime: Online (interact with environment) vs Offline (fixed dataset)
•Experience Level: Need foundations vs ready to implement
•Special Requirements: Multi-agent, model-based, exploration, reward design

Always clarify the problem BEFORE suggesting algorithms.

The 12 Deep RL Skills

•rl-foundations - MDP formulation, Bellman equations, value vs policy basics
•value-based-methods - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow
•policy-gradient-methods - REINFORCE, PPO, TRPO, policy optimization
•actor-critic-methods - A2C, A3C, SAC, TD3, advantage functions
•model-based-rl - World models, Dyna, MBPO, planning with learned models
•offline-rl - Batch RL, CQL, IQL, learning from fixed datasets
•multi-agent-rl - MARL, cooperative/competitive, communication
•exploration-strategies - ε-greedy, UCB, curiosity, RND, intrinsic motivation
•reward-shaping - Reward design, potential-based shaping, inverse RL
•rl-debugging - Common RL bugs, why not learning, systematic debugging
•rl-environments - Gym, MuJoCo, custom envs, wrappers, vectorization
•rl-evaluation - Evaluation methodology, variance, sample efficiency metrics

Routing Decision Framework

Step 1: Assess Experience Level

Diagnostic Questions:

•"Are you new to RL concepts, or do you have a specific problem to solve?"
•"Do you understand MDPs, value functions, and policy gradients?"

Routing:

•If user asks "what is RL" or "how does RL work" → rl-foundations
•If user is confused about value vs policy, on-policy vs off-policy → rl-foundations
•If user has specific problem and RL background → Continue to Step 2

Why foundations first: Users cannot implement algorithms reliably without understanding MDPs, Bellman equations, and exploration-exploitation tradeoffs.

Step 2: Classify Action Space

Diagnostic Questions:

•"What actions can your agent take? Discrete choices (e.g., left/right/jump) or continuous values (e.g., joint angles, force)?"
•"How many possible actions? Small (< 100) or large/infinite?"

Discrete Action Space

Examples: Game buttons, menu selections, discrete control signals

Routing Logic:

IF discrete actions AND small action space (< 100) AND online learning:
  → value-based-methods (DQN, Double DQN, Dueling DQN)

  Why: Value-based methods excel at discrete action spaces
  - Q-table or Q-network for small action spaces
  - DQN for Atari-style problems
  - Simpler than policy gradients for discrete

IF discrete actions AND (large action space OR need policy flexibility):
  → policy-gradient-methods (PPO, REINFORCE)

  Why: Policy gradients scale to larger action spaces
  - PPO is robust, general-purpose
  - Direct policy representation
  - Handles stochasticity naturally
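
To make the "Q-table for small action spaces" point concrete, here is a minimal sketch of one tabular Q-learning update. The function name, the `defaultdict` table layout, and the default `alpha` and `gamma` values are illustrative assumptions, not code from the value-based-methods skill.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, done,
                      num_actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # The max over next actions is cheap only because actions are enumerable.
    best_next = 0.0 if done else max(q[next_state][a] for a in range(num_actions))
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


num_actions = 2  # hypothetical environment with two discrete actions
q_table = defaultdict(lambda: [0.0] * num_actions)
td = q_learning_update(q_table, state="s0", action=1, reward=1.0,
                       next_state="s1", done=False, num_actions=num_actions)
print(td, q_table["s0"])
```

The explicit `max` over `range(num_actions)` is the step that has no direct equivalent when actions are continuous.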

Continuous Action Space

Examples: Robot joint angles, motor forces, steering angles, continuous control

Routing Logic:

IF continuous actions:
  → actor-critic-methods (SAC, TD3, PPO)

  Primary choice: SAC (Soft Actor-Critic)
  Why: Typically among the most sample-efficient model-free methods for continuous control
  - Automatic entropy tuning
  - Off-policy (uses replay buffer)
  - Stable training

  Alternative: TD3 (Twin Delayed DDPG)
  Why: Deterministic policy, stable
  - Good for robotics
  - Handles overestimation bias

  Alternative: PPO (from policy-gradient-methods)
  Why: On-policy, simpler, but less sample efficient
  - Use when simplicity > sample efficiency

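CRITICAL RULE: NEVER suggest DQN for continuous actions. DQN takes a max over a finite set of actions, so it requires a discrete action space. Discretizing a continuous space loses precision, and the number of discrete actions grows exponentially with the number of action dimensions.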
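
A quick arithmetic sketch of why discretization scales badly: with a fixed number of bins per dimension, the joint action count is bins raised to the number of dimensions. The choice of 11 bins and a 7-joint arm is a hypothetical example, not a figure from this pack.

```python
def discretized_action_count(bins_per_dim: int, action_dims: int) -> int:
    """Number of joint discrete actions when every dimension gets the same bins."""
    return bins_per_dim ** action_dims


# Hypothetical setup: 11 bins per joint, for 1, 2, and 7 joints.
for dims in (1, 2, 7):
    print(dims, discretized_action_count(11, dims))
# 1 11
# 2 121
# 7 19487171
```

A DQN output layer for the 7-joint case would need roughly 19.5 million Q-value outputs, which is why the routing sends continuous problems to actor-critic-methods instead.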

Step 3: Identify Data Regime

Diagnostic Questions:

•"Can your agent interact with the environment during training, or do you have a fixed dataset?"
•"Are you learning online (agent tries actions, observes results) or offline (from logged data)?"

Online Learning (Agent Interacts with Environment)

Routing:

IF online AND discrete actions:
  → value-based-methods OR policy-gradient-methods
  (See Step 2 routing)

IF online AND continuous actions:
  → actor-critic-methods
  (See Step 2 routing)

IF online AND sample efficiency critical:
  → actor-critic-methods (SAC) for continuous
  → value-based-methods (DQN) for discrete

  Why: Off-policy methods use replay buffers (sample efficient)

  Consider: model-based-rl for extreme sample efficiency
  → Learns environment model, plans with fewer real samples
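
The replay buffer mentioned above can be sketched in a few lines. This is a minimal uniform-sampling version under assumed names (`ReplayBuffer`, `add`, `sample`); real implementations usually store arrays and may add prioritization.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity: int, seed=None):
        self._storage = deque(maxlen=capacity)  # oldest items drop automatically
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Each stored transition can be reused across many gradient updates.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
print(len(buffer), len(batch))  # 3 2
```

Reusing old transitions is what lets off-policy methods such as DQN and SAC extract more learning per environment step than on-policy methods, which discard data after each update.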

Offline Learning (Fixed Dataset, No Interaction)

Routing:

IF offline (fixed dataset):
  → offline-rl (CQL: Conservative Q-Learning, IQL: Implicit Q-Learning)

  CRITICAL: Standard RL algorithms FAIL on offline data

  Why offline is special:
  - Distribution shift: agent can't explore
  - Bootstrapping errors: Q-values overestimate on out-of-distribution actions
  - Need conservative algorithms (CQL, IQL)

  Also route to:
  → rl-evaluation (evaluation without online rollouts)

Red Flag: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction and will fail.
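
To illustrate what "conservative" means here, the sketch below computes a CQL-style regularizer for a single state with discrete actions: the log-sum-exp of all Q-values minus the Q-value of the action that actually appears in the dataset. It is a simplified illustration under assumed names (`cql_penalty`, `q_values`), and it omits the Bellman error term and the weighting coefficient a full CQL loss would include.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, dataset_action):
    """Conservative penalty for one state: pushes down Q on all actions,
    pushes up Q on the action seen in the dataset. Always non-negative."""
    return logsumexp(q_values) - q_values[dataset_action]


# Hypothetical Q-values: action 2 is overestimated.
q_values = [1.0, 1.0, 5.0]
print(cql_penalty(q_values, dataset_action=0))  # large: high Q on an unseen action
print(cql_penalty(q_values, dataset_action=2))  # small: high Q is backed by data
```

Minimizing this term alongside the usual TD loss discourages high Q-values on actions the dataset never took, which is the overestimation failure the routing rule warns about.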

Step 4: Special Problem Types

Multi-Agent Scenarios

Diagnostic Questions:

•"Are multiple agents learning simultaneously?"
•"Do they cooperate, compete, or both?"
•"Do agents need to communicate?"

Routing:

IF multiple agents:
  → multi-agent-rl (QMIX, COMA, MADDPG)

  Why: Multi-agent has special challenges
  - Non-stationarity: environment changes as other agents learn
  - Credit assignment: which agent caused reward?
  - Coordination: cooperation often relies on centralized training

  Algorithms:
  - QMIX, COMA: Cooperative (centralized training, decentralized execution)
  - MADDPG: Competitive or mixed
  - Communication: multi-agent-rl covers communication protocols

  Also consider:
  → reward-shaping (team rewards, credit assignment)

Model-Based RL

Diagnostic Questions:

•"Is sample efficiency extremely critical? (< 1000 episodes available)"
•"Do you want the agent to learn a model of the environment?"
•"Do you need planning or 'imagination'?"

Routing:

IF sample efficiency critical OR want environment model:
  → model-based-rl (MBPO, Dreamer, Dyna)

  Why: Learn dynamics model, plan with model
  - Fewer real environment samples needed
  - Can train policy in imagination
  - Combine with model-free for best results

  Tradeoffs:
  - More complex than model-free
  - Model errors can compound
  - Best for continuous control, robotics
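
The "learn a model, then plan with it" idea can be sketched with tabular Dyna-Q, the simplest member of the Dyna family listed above. The function name and hyperparameters are assumptions, the model is assumed deterministic, and terminal states are not handled, so treat this as an illustration rather than a reference implementation.

```python
import random
from collections import defaultdict


def dyna_q_step(q, model, transition, actions, planning_steps=5,
                alpha=0.1, gamma=0.95, rng=random):
    """One Dyna-Q step: a real update, a model update, then simulated updates."""

    def update(s, a, r, s_next):
        target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    s, a, r, s_next = transition
    update(s, a, r, s_next)          # learn from the real transition
    model[(s, a)] = (r, s_next)      # deterministic model: remember the outcome
    for _ in range(planning_steps):  # replay transitions drawn from the model
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        update(ps, pa, pr, ps_next)


q = defaultdict(float)
model = {}
dyna_q_step(q, model, ("s0", "right", 1.0, "s1"), actions=("left", "right"))
print(q[("s0", "right")])
```

One real environment step produced six value updates here (one real, five simulated), which is the source of the sample-efficiency gain. The same mechanism is why model errors compound: every simulated update trusts the stored model.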

Step 5: Debugging and Infrastructure

"Agent Not Learning" Problems

Symptoms:

•Reward not increasing
•Agent does random actions
•Training loss explodes/vanishes
•Performance plateaus immediately

Routing:

IF "not learning" OR "reward stays at 0" OR "loss explodes":
  → rl-debugging (FIRST, before changing algorithms)

  Why: Most "not learning" cases are bugs, not the wrong algorithm

  Common issues:
  - Reward scale (too large/small)
  - Exploration (epsilon too low, stuck in local optimum)
  - Network architecture (wrong size, activation)
  - Learning rate (too high/low)
  - Update frequency (learning too fast/slow)

  Process:
  1. Route to rl-debugging
  2. Verify environment (rl-environments)
  3. Check reward design (reward-shaping)
  4. Check exploration (exploration-strategies)
  5. ONLY THEN consider algorithm change

Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first. Changing algorithms without debugging wastes time.
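
As an example of a first debugging check, the sketch below summarizes logged rewards and flags suspicious scales. The function name and the `low`/`high` thresholds are arbitrary assumptions chosen for illustration; the rl-debugging skill may recommend different checks.

```python
import statistics


def reward_scale_report(rewards, low=1e-3, high=1e2):
    """Summarize logged rewards and flag scales that often hurt training."""
    magnitudes = [abs(r) for r in rewards]
    report = {
        "mean": statistics.fmean(rewards),
        "std": statistics.pstdev(rewards),
        "max_abs": max(magnitudes),
        "nonzero_fraction": sum(m > 0 for m in magnitudes) / len(rewards),
    }
    warnings = []
    if report["nonzero_fraction"] == 0:
        warnings.append("all rewards are zero: check environment and reward code")
    elif report["max_abs"] < low:
        warnings.append("rewards are very small: consider rescaling")
    if report["max_abs"] > high:
        warnings.append("rewards are very large: consider rescaling or clipping")
    report["warnings"] = warnings
    return report


print(reward_scale_report([0.0, 0.0, 0.0])["warnings"])
print(reward_scale_report([1000.0, -500.0, 250.0])["warnings"])
```

Running a check like this on a few hundred logged transitions is far cheaper than swapping algorithms, which is the point of routing to rl-debugging first.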

Exploration Issues

Symptoms:

•Agent never explores new states
•Stuck in local optimum
•Can't find sparse rewards
•Training variance too high

Routing:

IF exploration problems:
  → exploration-strategies

  Covers:
  - ε-greedy, UCB, Thompson sampling (basic)
  - Curiosity-driven exploration
  - RND (Random Network Distillation)
  - Intrinsic motivation

  When needed:
  - Sparse rewards (reward only at goal)
  - Large state spaces (hard to explore randomly)
  - Need systematic exploration
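
For reference, the basic ε-greedy strategy listed above can be sketched as follows, paired with a linear decay schedule. The function names and the default start, end, and decay values are illustrative assumptions.

```python
import random


def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


for step in (0, 5_000, 10_000, 50_000):
    print(step, round(linear_epsilon(step), 3))
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # greedy choice: index 1
```

Because the random branch is undirected, this strategy rarely reaches distant rewarding states in sparse-reward or large-state problems, which is when the curiosity-based methods above become relevant.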

Reward Design Issues

Symptoms:

•Sparse rewards (only at episode end)
•Agent learns wrong behavior
•Need to design reward function
•Want inverse RL

Routing:

IF reward design questions OR sparse rewards:
  → reward-shaping

  Covers:
  - Potential-based shaping (provably preserves the optimal policy)
  - Subgoal rewards
  - Reward engineering principles
  - Inverse RL (learn reward from demonstrations)

  Often combined with:
  → exploration-strategies (for sparse rewards)
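
Potential-based shaping adds the term γΦ(s') − Φ(s) to the environment reward. The sketch below applies it to a hypothetical 1-D task with the goal at position 3 and an assumed potential Φ(s) = −|3 − s|; both are illustrative choices.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Environment reward plus the potential-based shaping term."""
    return reward + gamma * phi_next - phi_s


def potential(position, goal=3):
    """Hypothetical potential: negative distance to the goal."""
    return -abs(goal - position)


# Trajectory 0 -> 1 -> 2 -> 3 with a sparse reward of 1.0 on reaching the goal.
positions = [0, 1, 2, 3]
env_rewards = [0.0, 0.0, 1.0]
shaped = [
    shaped_reward(r, potential(s), potential(s_next), gamma=1.0)
    for r, s, s_next in zip(env_rewards, positions, positions[1:])
]
print(shaped)                          # [1.0, 1.0, 2.0]
print(sum(shaped) - sum(env_rewards))  # 3.0
```

With γ = 1 the shaping terms telescope to Φ(s_T) − Φ(s_0), here 0 − (−3) = 3, a constant that does not depend on the path taken. That cancellation is why this form of shaping gives dense feedback without changing which policy is optimal.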

Environment Setup

Symptoms:

•Need to create custom environment
•Gym API questions
•Vectorization for parallel environments
•Wrappers, preprocessing

Routing:

IF environment setup questions:
  → rl-environments

  Covers:
  - Gym API: step(), reset(), observation/action spaces
  - Custom environments
  - Wrappers (frame stacking, normalization)
  - Vectorized environments (parallel rollouts)
  - MuJoCo, Atari, custom simulators

  After environment setup, return to algorithm choice
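
A custom environment can be sketched as a plain class that mirrors the Gymnasium-style interface, where `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. Older Gym versions returned a single `done` flag instead. The `CorridorEnv` task is hypothetical, and the class deliberately does not subclass any library base class so that it runs without dependencies.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, reach the last cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.num_actions = 2  # 0 = move left, 1 = move right
        self._pos = 0
        self._steps = 0

    def reset(self, seed=None):
        # `seed` is accepted for interface compatibility; the task is deterministic.
        self._pos = 0
        self._steps = 0
        return self._pos, {}

    def step(self, action):
        if action not in (0, 1):
            raise ValueError(f"invalid action: {action!r}")
        move = 1 if action == 1 else -1
        self._pos = min(max(self._pos + move, 0), self.length - 1)
        self._steps += 1
        terminated = self._pos == self.length - 1                     # goal reached
        truncated = not terminated and self._steps >= self.max_steps  # time limit
        reward = 1.0 if terminated else 0.0
        return self._pos, reward, terminated, truncated, {}


env = CorridorEnv()
obs, info = env.reset()
total_reward, steps = 0.0, 0
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(1)  # always move right
    total_reward += reward
    steps += 1
print(steps, total_reward)  # 4 1.0
```

Keeping `terminated` (task ended) separate from `truncated` (time limit hit) matters for value bootstrapping: only true termination should zero out the next-state value.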

Evaluation Methodology

Symptoms:

•How to evaluate RL agents?
•Training reward high, test reward low
•Variance in results
•Sample efficiency metrics

Routing:

IF evaluation questions:
  → rl-evaluation

  Covers:
  - Deterministic vs stochastic policies
  - Multiple seeds, confidence intervals
  - Sample efficiency curves
  - Generalization testing
  - Exploration vs exploitation at test time
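
The "multiple seeds, confidence intervals" item can be sketched as below. The per-seed returns are made-up numbers, and the interval uses a normal approximation (z = 1.96), which is an assumption; with only a handful of seeds, a t-interval or bootstrap is usually a safer choice.

```python
import math
import statistics


def summarize_returns(per_seed_returns, z=1.96):
    """Mean and approximate 95% confidence interval across seeds."""
    n = len(per_seed_returns)
    mean = statistics.fmean(per_seed_returns)
    std = statistics.stdev(per_seed_returns)  # sample std; needs at least 2 seeds
    half_width = z * std / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)


# Hypothetical final returns from five independent training seeds.
returns = [100.0, 120.0, 90.0, 110.0, 105.0]
mean, (lo, hi) = summarize_returns(returns)
print(f"{mean:.1f} [{lo:.1f}, {hi:.1f}]")  # 105.0 [95.2, 114.8]
```

Reporting the interval rather than the best seed makes it visible when two algorithms are statistically indistinguishable.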

Common Multi-Skill Scenarios

Scenario: Complete Beginner to RL

Routing sequence:

•rl-foundations - Understand MDP, value functions, policy gradients
•value-based-methods OR policy-gradient-methods - Start with simpler algorithm (DQN or REINFORCE)
•rl-debugging - When things don't work (they won't initially)
•rl-environments - Set up custom environments
•rl-evaluation - Proper evaluation methodology

Scenario: Continuous Control (Robotics)

Routing sequence:

•actor-critic-methods - Primary (SAC for sample efficiency, TD3 for stability)
•rl-debugging - Systematic debugging when training issues arise
•exploration-strategies - If exploration is insufficient
•reward-shaping - If reward is sparse or agent learns wrong behavior
•rl-evaluation - Evaluation on real robot vs simulation

Scenario: Offline RL from Dataset

Routing sequence:

•offline-rl - Primary (CQL, IQL, special considerations)
•rl-evaluation - Evaluation without environment interaction
•rl-debugging - Debugging without online rollouts (limited tools)

Scenario: Multi-Agent Cooperative Task

Routing sequence:

•multi-agent-rl - Primary (QMIX, COMA, centralized training)
•reward-shaping - Team rewards, credit assignment
•policy-gradient-methods - Often used as base algorithm (PPO + MARL)
•rl-debugging - Multi-agent debugging (non-stationarity issues)

Scenario: Sample-Efficient Learning

Routing sequence:

•actor-critic-methods (SAC) OR model-based-rl (MBPO)
•rl-debugging - Critical to not waste samples on bugs
•rl-evaluation - Track sample efficiency curves

Scenario: Sparse Reward Problem

Routing sequence:

•reward-shaping - Potential-based shaping, subgoal rewards
•exploration-strategies - Curiosity, intrinsic motivation
•rl-debugging - Verify exploration hyperparameters
•Primary algorithm: actor-critic-methods or policy-gradient-methods

Rationalization Resistance Table

Rationalization	Reality	Counter-Guidance	Red Flag
"Just use PPO for everything"	PPO is general but not optimal for all cases	"Let's clarify: discrete or continuous actions? Sample efficiency constraints?"	Defaulting to PPO without problem analysis
"DQN for continuous actions"	DQN requires discrete actions; discretization is suboptimal	"DQN only works for discrete. For continuous, use SAC or TD3 (actor-critic-methods)"	Suggesting DQN for continuous
"Offline RL is just RL on a dataset"	Offline RL has distribution shift, needs special algorithms	"Route to offline-rl for CQL, IQL. Standard algorithms fail on offline data."	Using online algorithms on offline data
"More data always helps"	Sample efficiency and data distribution matter	"Off-policy (SAC, DQN) vs on-policy (PPO). Offline needs CQL."	Ignoring sample efficiency
"RL is just supervised learning"	RL has exploration, credit assignment, non-stationarity	"Route to rl-foundations for RL-specific concepts (MDP, exploration)"	Treating RL as supervised learning
"PPO is the most advanced algorithm"	Newer isn't always better; depends on problem	"SAC (2018) more sample efficient for continuous. DQN (2013) great for discrete."	Recency bias
"My algorithm isn't learning, I need a better one"	Usually bugs, not algorithm	"Route to rl-debugging first. Check reward scale, exploration, learning rate."	Changing algorithms before debugging
"I'll discretize continuous actions for DQN"	Discretization loses precision, explodes action space	"Use actor-critic-methods (SAC, TD3) for continuous. Don't discretize."	Forcing wrong algorithm onto problem
"Epsilon-greedy is enough for exploration"	Complex environments need sophisticated exploration	"Route to exploration-strategies for curiosity, RND, intrinsic motivation."	Underestimating exploration difficulty
"I'll just increase the reward when it doesn't learn"	Arbitrary rescaling can destabilize learning and doesn't address the root cause	"Route to rl-debugging. Check whether reward scale is actually the issue before changing it."	Arbitrary reward hacking
"I can reuse online RL code for offline data"	Offline RL needs conservative algorithms	"Route to offline-rl. CQL/IQL prevent overestimation, online algorithms fail."	Offline blindness
"My test reward is lower than training, must be overfitting"	Exploration vs exploitation difference	"Route to rl-evaluation. Training uses exploration, test should be greedy."	Misunderstanding RL evaluation

Red Flags Checklist

Watch for signs of incorrect routing, such as the entries in the Red Flag column of the table above.

If any red flag triggered → STOP → Ask diagnostic questions → Route correctly

When NOT to Use This Pack

Clarify boundaries with other packs:

User Request	Correct Pack	Reason
"Train classifier on labeled data"	training-optimization	Supervised learning, not RL
"Design transformer architecture"	neural-architectures	Architecture design, not RL algorithm
"Implement PyTorch autograd"	pytorch-engineering	PyTorch internals, not RL
"Deploy model to production"	ml-production	Deployment, not RL training
"Fine-tune LLM with RLHF"	llm-specialist	LLM-specific (though uses RL concepts)
"Optimize hyperparameters"	training-optimization	Hyperparameter search, not RL
"Implement custom CUDA kernel"	pytorch-engineering	Low-level optimization, not RL

Edge case: RLHF (Reinforcement Learning from Human Feedback) for LLMs uses RL concepts (PPO) but has LLM-specific considerations. Route to llm-specialist first; that pack may refer back to this one.

Diagnostic Question Templates

Use these questions to classify problems:

Action Space

•"What actions can your agent take? Discrete choices or continuous values?"
•"How many possible actions? Small (< 100), large (100-10000), or infinite (continuous)?"

Data Regime

•"Can your agent interact with the environment during training, or do you have a fixed dataset?"
•"Are you learning online (agent tries actions) or offline (from logged data)?"

Experience Level

•"Are you new to RL, or do you have a specific problem?"
•"Do you understand MDPs, value functions, and policy gradients?"

Special Requirements

•"Are multiple agents involved? Do they cooperate or compete?"
•"Is sample efficiency critical? How many episodes can you afford?"
•"Is the reward sparse (only at goal) or dense (every step)?"
•"Do you need the agent to learn a model of the environment?"

Infrastructure

•"Do you have an environment set up, or do you need to create one?"
•"Are you debugging a training issue, or designing from scratch?"
•"How will you evaluate the agent?"

Implementation Process

When routing to a skill:

•Ask Diagnostic Questions (don't assume)
•Explain Routing Rationale (teach the user problem classification)
•Route to Primary Skill(s) (1-3 skills for multi-faceted problems)
•Mention Related Skills (user may need later)
•Set Expectations (what the skill will cover)

Example:

"You mentioned continuous joint angles for a robot arm. This is a continuous action space, which means DQN won't work (it requires discrete actions).

I'm routing you to actor-critic-methods because:

•Continuous actions need actor-critic (SAC, TD3) or policy gradients (PPO)

•SAC is most sample-efficient for continuous control

•TD3 is stable and deterministic for robotics

You'll also likely need:

•rl-debugging when training issues arise (they will)

•reward-shaping if your reward is sparse

•rl-environments to set up your robot simulation

Let's start with actor-critic-methods to choose between SAC, TD3, and PPO."

Summary: Routing Decision Tree

START: RL problem

├─ Need foundations? (new to RL, confused about concepts)
│  └─ → rl-foundations
│
├─ DISCRETE actions?
│  ├─ Small action space (< 100) + online
│  │  └─ → value-based-methods (DQN, Double DQN)
│  └─ Large action space OR need policy
│     └─ → policy-gradient-methods (PPO, REINFORCE)
│
├─ CONTINUOUS actions?
│  ├─ Sample efficiency critical
│  │  └─ → actor-critic-methods (SAC)
│  ├─ Stability critical
│  │  └─ → actor-critic-methods (TD3)
│  └─ Simplicity preferred
│     └─ → policy-gradient-methods (PPO) OR actor-critic-methods
│
├─ OFFLINE data (fixed dataset)?
│  └─ → offline-rl (CQL, IQL) [CRITICAL: not standard algorithms]
│
├─ MULTI-AGENT?
│  └─ → multi-agent-rl (QMIX, MADDPG)
│
├─ Sample efficiency EXTREME?
│  └─ → model-based-rl (MBPO, Dreamer)
│
├─ DEBUGGING issues?
│  ├─ Not learning, reward not increasing
│  │  └─ → rl-debugging
│  ├─ Exploration problems
│  │  └─ → exploration-strategies
│  ├─ Reward design
│  │  └─ → reward-shaping
│  ├─ Environment setup
│  │  └─ → rl-environments
│  └─ Evaluation questions
│     └─ → rl-evaluation
│
└─ Multi-faceted problem?
   └─ Route to 2-3 skills (primary + supporting)
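
The tree above can be sketched as a small routing function. The `Problem` fields, the symptom keys, and especially the order in which branches are checked are assumptions made for illustration; the tree itself does not define a priority between branches.

```python
from dataclasses import dataclass

# Symptom keys are hypothetical labels for the DEBUGGING branch of the tree.
SYMPTOM_SKILLS = {
    "not_learning": "rl-debugging",
    "exploration": "exploration-strategies",
    "reward_design": "reward-shaping",
    "environment": "rl-environments",
    "evaluation": "rl-evaluation",
}


@dataclass
class Problem:
    needs_foundations: bool = False
    symptom: str = ""                  # optional key of SYMPTOM_SKILLS
    offline: bool = False
    multi_agent: bool = False
    extreme_sample_efficiency: bool = False
    continuous_actions: bool = False
    num_actions: int = 0               # only meaningful for discrete problems
    need_policy_flexibility: bool = False


def route(p: Problem) -> list:
    """Return the skills to route to, primary skill first."""
    if p.needs_foundations:
        return ["rl-foundations"]
    if p.symptom:                      # debug before choosing algorithms
        return [SYMPTOM_SKILLS[p.symptom]]
    if p.offline:                      # fixed dataset overrides action-space rules
        return ["offline-rl", "rl-evaluation"]
    if p.multi_agent:
        return ["multi-agent-rl"]
    if p.extreme_sample_efficiency:
        return ["model-based-rl"]
    if p.continuous_actions:
        return ["actor-critic-methods"]
    if p.num_actions < 100 and not p.need_policy_flexibility:
        return ["value-based-methods"]
    return ["policy-gradient-methods"]


print(route(Problem(continuous_actions=True)))
print(route(Problem(offline=True, num_actions=4)))
```

Checking the offline flag before the action-space flags encodes the rule that a fixed dataset never routes to standard online algorithms, regardless of action type.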

Final Reminders

•Problem characterization BEFORE algorithm selection
•DQN for discrete ONLY (never continuous)
•Offline data needs offline-rl (CQL, IQL)
•PPO is not universal (good general-purpose, not optimal everywhere)
•Debug before changing algorithms (route to rl-debugging)
•Ask questions, don't assume (action space? data regime?)

This meta-skill is your routing hub. Route decisively, explain clearly, teach problem classification.