Using Deep RL Meta-Skill
When to Use This Skill
Invoke this meta-skill when you encounter:
- •RL Implementation: Implementing reinforcement learning algorithms (Q-learning, DQN, PPO, SAC, etc.)
- •Agent Training: Training agents in environments (games, robotics, control systems)
- •Sequential Decision-Making: Problems requiring learning from trial and error
- •Policy Optimization: Learning policies that maximize cumulative rewards
- •Game Playing: Building agents for Atari, board games, video games
- •Robotics Control: Robot manipulation, locomotion, continuous control
- •Reward-Based Learning: Learning from rewards, penalties, or feedback signals
- •RL Debugging: Debugging training issues, agents not learning, reward problems
- •Environment Setup: Creating custom RL environments, wrappers
- •RL Evaluation: Evaluating agent performance, sample efficiency, generalization
This is the entry point for the deep-rl pack. It routes to 12 specialized skills based on problem characteristics.
Core Principle
Problem type determines algorithm family.
Reinforcement learning is not one algorithm. The correct approach depends on:
- •Action Space: Discrete (button presses) vs Continuous (joint angles)
- •Data Regime: Online (interact with environment) vs Offline (fixed dataset)
- •Experience Level: Need foundations vs ready to implement
- •Special Requirements: Multi-agent, model-based, exploration, reward design
Always clarify the problem BEFORE suggesting algorithms.
The 12 Deep RL Skills
- •rl-foundations - MDP formulation, Bellman equations, value vs policy basics
- •value-based-methods - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow
- •policy-gradient-methods - REINFORCE, PPO, TRPO, policy optimization
- •actor-critic-methods - A2C, A3C, SAC, TD3, advantage functions
- •model-based-rl - World models, Dyna, MBPO, planning with learned models
- •offline-rl - Batch RL, CQL, IQL, learning from fixed datasets
- •multi-agent-rl - MARL, cooperative/competitive, communication
- •exploration-strategies - ε-greedy, UCB, curiosity, RND, intrinsic motivation
- •reward-shaping - Reward design, potential-based shaping, inverse RL
- •rl-debugging - Common RL bugs, why not learning, systematic debugging
- •rl-environments - Gym, MuJoCo, custom envs, wrappers, vectorization
- •rl-evaluation - Evaluation methodology, variance, sample efficiency metrics
Routing Decision Framework
Step 1: Assess Experience Level
Diagnostic Questions:
- •"Are you new to RL concepts, or do you have a specific problem to solve?"
- •"Do you understand MDPs, value functions, and policy gradients?"
Routing:
- •If user asks "what is RL" or "how does RL work" → rl-foundations
- •If user is confused about value vs policy, on-policy vs off-policy → rl-foundations
- •If user has specific problem and RL background → Continue to Step 2
Why foundations first: Cannot implement algorithms without understanding MDPs, Bellman equations, and exploration-exploitation tradeoffs.
Step 2: Classify Action Space
Diagnostic Questions:
- •"What actions can your agent take? Discrete choices (e.g., left/right/jump) or continuous values (e.g., joint angles, force)?"
- •"How many possible actions? Small (< 100) or large/infinite?"
Discrete Action Space
Examples: Game buttons, menu selections, discrete control signals
Routing Logic:
IF discrete actions AND small action space (< 100) AND online learning:
  → value-based-methods (DQN, Double DQN, Dueling DQN)
  Why: Value-based methods excel at discrete action spaces
  - Q-table or Q-network for small action spaces
  - DQN for Atari-style problems
  - Simpler than policy gradients for discrete

IF discrete actions AND (large action space OR need policy flexibility):
  → policy-gradient-methods (PPO, REINFORCE)
  Why: Policy gradients scale to larger action spaces
  - PPO is robust, general-purpose
  - Direct policy representation
  - Handles stochasticity naturally
Continuous Action Space
Examples: Robot joint angles, motor forces, steering angles, continuous control
Routing Logic:
IF continuous actions:
  → actor-critic-methods (SAC, TD3, PPO)

  Primary choice: SAC (Soft Actor-Critic)
  Why: Among the most sample-efficient model-free methods for continuous control
  - Automatic entropy tuning
  - Off-policy (uses replay buffer)
  - Stable training

  Alternative: TD3 (Twin Delayed DDPG)
  Why: Deterministic policy, stable
  - Good for robotics
  - Reduces overestimation bias (clipped double Q-learning)

  Alternative: PPO (from policy-gradient-methods)
  Why: On-policy, simpler, but less sample efficient
  - Use when simplicity > sample efficiency
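CRITICAL RULE: NEVER suggest DQN for continuous actions. DQN requires discrete actions because it selects actions by taking a max over a finite set of Q-values. Discretizing continuous spaces is usually suboptimal: precision is lost, and the number of joint actions grows exponentially with the number of action dimensions.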
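To make the cost of discretization concrete, here is a minimal sketch of the arithmetic. It assumes every action dimension is split into the same number of bins; the function name and the 7-joint arm with 11 torque levels are illustrative, not taken from any particular setup.

```python
def discretized_action_count(bins_per_dim: int, action_dims: int) -> int:
    """Joint discrete actions when each dimension is split into the same number of bins."""
    if bins_per_dim < 1 or action_dims < 1:
        raise ValueError("bins_per_dim and action_dims must be positive")
    return bins_per_dim ** action_dims


# Hypothetical robot arm: 11 torque levels per joint.
counts = {dims: discretized_action_count(11, dims) for dims in (1, 2, 4, 7)}
# counts[2] is 121, already past the "< 100" threshold used for value-based routing.
```

Even a coarse grid leaves the "small action space" regime after two dimensions, which is why the routing above prefers actor-critic methods for continuous control.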
Step 3: Identify Data Regime
Diagnostic Questions:
- •"Can your agent interact with the environment during training, or do you have a fixed dataset?"
- •"Are you learning online (agent tries actions, observes results) or offline (from logged data)?"
Online Learning (Agent Interacts with Environment)
Routing:
IF online AND discrete actions:
  → value-based-methods OR policy-gradient-methods (See Step 2 routing)

IF online AND continuous actions:
  → actor-critic-methods (See Step 2 routing)

IF online AND sample efficiency critical:
  → actor-critic-methods (SAC) for continuous
  → value-based-methods (DQN) for discrete
  Why: Off-policy methods use replay buffers (sample efficient)

  Consider: model-based-rl for extreme sample efficiency
  → Learns environment model, plans with fewer real samples
Offline Learning (Fixed Dataset, No Interaction)
Routing:
IF offline (fixed dataset):
  → offline-rl (CQL, IQL)
  CRITICAL: Standard online RL algorithms typically FAIL on offline data

  Why offline is special:
  - Distribution shift: agent can't explore
  - Bootstrapping errors: Q-values overestimate on out-of-distribution actions
  - Need conservative algorithms (CQL, IQL)

  Also route to:
  → rl-evaluation (evaluation without online rollouts)
Red Flag: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction and will fail.
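The red-flag check above can be sketched as a small helper. This is an illustration only: the function name, the return shape, and the set of "online" algorithm names are assumptions made for the example, and the set is not exhaustive.

```python
from typing import Any, Dict, Optional

# Illustrative, non-exhaustive list of algorithms that assume online interaction.
ONLINE_ALGORITHMS = {"dqn", "ppo", "sac", "td3", "a2c", "reinforce"}


def check_data_regime(can_interact: bool,
                      proposed_algorithm: Optional[str] = None) -> Dict[str, Any]:
    """Return the data-regime routing and whether the red flag is triggered."""
    if can_interact:
        # Online: nothing to override, continue with the action-space routing.
        return {"route": [], "red_flag": False,
                "note": "online: route by action space (Step 2)"}
    proposed = (proposed_algorithm or "").strip().lower()
    return {
        "route": ["offline-rl", "rl-evaluation"],
        "red_flag": proposed in ONLINE_ALGORITHMS,
        "note": "fixed dataset: use conservative methods such as CQL or IQL",
    }
```

For example, `check_data_regime(False, "PPO")` routes to offline-rl and raises the red flag, while the same proposal with `can_interact=True` does not.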
Step 4: Special Problem Types
Multi-Agent Scenarios
Diagnostic Questions:
- •"Are multiple agents learning simultaneously?"
- •"Do they cooperate, compete, or both?"
- •"Do agents need to communicate?"
Routing:
IF multiple agents:
  → multi-agent-rl (QMIX, COMA, MADDPG)

  Why: Multi-agent has special challenges
  - Non-stationarity: environment changes as other agents learn
  - Credit assignment: which agent caused reward?
  - Coordination: cooperation often benefits from centralized training

  Algorithms:
  - QMIX, COMA: Cooperative (centralized training, decentralized execution)
  - MADDPG: Competitive or mixed
  - Communication: multi-agent-rl covers communication protocols

  Also consider:
  → reward-shaping (team rewards, credit assignment)
Model-Based RL
Diagnostic Questions:
- •"Is sample efficiency extremely critical? (< 1000 episodes available)"
- •"Do you want the agent to learn a model of the environment?"
- •"Do you need planning or 'imagination'?"
Routing:
IF sample efficiency critical OR want environment model:
  → model-based-rl (MBPO, Dreamer, Dyna)

  Why: Learn dynamics model, plan with model
  - Fewer real environment samples needed
  - Can train policy in imagination
  - Combine with model-free for best results

  Tradeoffs:
  - More complex than model-free
  - Model errors can compound
  - Best for continuous control, robotics
Step 5: Debugging and Infrastructure
"Agent Not Learning" Problems
Symptoms:
- •Reward not increasing
- •Agent does random actions
- •Training loss explodes/vanishes
- •Performance plateaus immediately
Routing:
IF "not learning" OR "reward stays at 0" OR "loss explodes":
  → rl-debugging (FIRST, before changing algorithms)

  Why: Most "not learning" cases (roughly 80% as a rule of thumb) are bugs, not the wrong algorithm

  Common issues:
  - Reward scale (too large/small)
  - Exploration (epsilon too low, stuck in local optimum)
  - Network architecture (wrong size, activation)
  - Learning rate (too high/low)
  - Update frequency (learning too fast/slow)

  Process:
  1. Route to rl-debugging
  2. Verify environment (rl-environments)
  3. Check reward design (reward-shaping)
  4. Check exploration (exploration-strategies)
  5. ONLY THEN consider algorithm change
Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first. Changing algorithms without debugging wastes time.
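Reward scale is the first of the common issues listed above, and it is cheap to check before touching the algorithm. The following is a rough sketch of such a check; the function name and the `low`/`high` thresholds are arbitrary illustration values, not recommendations from rl-debugging.

```python
import statistics
from typing import Any, Dict, List


def summarize_rewards(episode_returns: List[float],
                      low: float = 1e-3, high: float = 1e3) -> Dict[str, Any]:
    """Summarize episode returns and flag scales that often indicate a bug.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    if not episode_returns:
        raise ValueError("need at least one episode return")
    mean_abs = statistics.fmean([abs(r) for r in episode_returns])
    spread = statistics.pstdev(episode_returns)
    flags = []
    if spread == 0:
        flags.append("constant returns: check reward wiring and exploration")
    if mean_abs > high:
        flags.append("very large returns: consider scaling or clipping")
    elif 0 < mean_abs < low:
        flags.append("very small returns: signal may be lost in noise")
    return {"mean_abs": mean_abs, "std": spread, "flags": flags}
```

A run whose returns are all zero, for instance, gets the "constant returns" flag, which points at the environment or exploration rather than at the choice of algorithm.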
Exploration Issues
Symptoms:
- •Agent never explores new states
- •Stuck in local optimum
- •Can't find sparse rewards
- •Training variance too high
Routing:
IF exploration problems:
  → exploration-strategies

  Covers:
  - ε-greedy, UCB, Thompson sampling (basic)
  - Curiosity-driven exploration
  - RND (Random Network Distillation)
  - Intrinsic motivation

  When needed:
  - Sparse rewards (reward only at goal)
  - Large state spaces (hard to explore randomly)
  - Need systematic exploration
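For reference, the most basic strategy in that list, ε-greedy, can be sketched in a few lines. The linear decay schedule and its default values (`start`, `end`, `decay_steps`) are assumptions for illustration; the exploration-strategies skill covers when this baseline is not enough.

```python
import random
from typing import Sequence


def epsilon_at(step: int, start: float = 1.0, end: float = 0.05,
               decay_steps: int = 10_000) -> float:
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon_at(step=2_000), rng)
```

If epsilon decays too quickly or starts too low, the agent commits to early Q-value estimates, which matches the "stuck in local optimum" symptom above.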
Reward Design Issues
Symptoms:
- •Sparse rewards (only at episode end)
- •Agent learns wrong behavior
- •Need to design reward function
- •Want inverse RL
Routing:
IF reward design questions OR sparse rewards:
  → reward-shaping

  Covers:
  - Potential-based shaping (provably preserves the optimal policy)
  - Subgoal rewards
  - Reward engineering principles
  - Inverse RL (learn reward from demonstrations)

  Often combined with:
  → exploration-strategies (for sparse rewards)
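Potential-based shaping adds a term of the form F(s, s') = γ·Φ(s') − Φ(s) to the environment reward, where Φ is a potential function over states. A minimal sketch follows; the function name is illustrative, and treating the potential of a terminal state as zero is a common convention assumed here rather than something this document specifies.

```python
def shaped_reward(reward: float, phi_s: float, phi_next: float,
                  gamma: float, done: bool = False) -> float:
    """Return r + gamma * Phi(s') - Phi(s).

    Assumption: the potential of a terminal state is treated as 0.
    """
    next_potential = 0.0 if done else phi_next
    return reward + gamma * next_potential - phi_s


# Hypothetical 1-D task: potential is the negative distance to the goal.
# Moving from distance 3 to distance 2 earns a positive shaping bonus.
bonus = shaped_reward(0.0, phi_s=-3.0, phi_next=-2.0, gamma=1.0)
```

Because the shaping terms telescope along a trajectory, they change how quickly the agent gets feedback without changing which policy is optimal.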
Environment Setup
Symptoms:
- •Need to create custom environment
- •Gym API questions
- •Vectorization for parallel environments
- •Wrappers, preprocessing
Routing:
IF environment setup questions:
  → rl-environments

  Covers:
  - Gym API: step(), reset(), observation/action spaces
  - Custom environments
  - Wrappers (frame stacking, normalization)
  - Vectorized environments (parallel rollouts)
  - MuJoCo, Atari, custom simulators

  After environment setup, return to algorithm choice
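As a rough picture of what a custom environment looks like, here is a dependency-free sketch that mirrors the shape of the reset()/step() interface. It deliberately does not import or subclass any Gym class; the `LineWorld` task is hypothetical, `n_actions` is a stand-in for a real action-space object, and the five-value return from step() follows the newer Gymnasium convention (older Gym versions returned four values).

```python
from typing import Any, Dict, Optional, Tuple


class LineWorld:
    """1-D corridor: start at position 0, reach `goal` to finish."""

    def __init__(self, goal: int = 5, max_steps: int = 50) -> None:
        self.goal = goal
        self.max_steps = max_steps
        self.n_actions = 2  # 0 = left, 1 = right (stand-in for an action space)
        self._pos = 0
        self._t = 0

    def reset(self, seed: Optional[int] = None) -> Tuple[int, Dict[str, Any]]:
        # `seed` is accepted for API shape only; this environment is deterministic.
        self._pos = 0
        self._t = 0
        return self._pos, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict[str, Any]]:
        if action not in (0, 1):
            raise ValueError("action must be 0 (left) or 1 (right)")
        self._pos = max(0, self._pos + (1 if action == 1 else -1))
        self._t += 1
        terminated = self._pos == self.goal  # task finished
        truncated = (not terminated) and self._t >= self.max_steps  # time limit
        reward = 1.0 if terminated else 0.0  # sparse: reward only at the goal
        return self._pos, reward, terminated, truncated, {}
```

Keeping `terminated` (goal reached) separate from `truncated` (time limit hit) matters for algorithms that bootstrap from the next state's value.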
Evaluation Methodology
Symptoms:
- •How to evaluate RL agents?
- •Training reward high, test reward low
- •Variance in results
- •Sample efficiency metrics
Routing:
IF evaluation questions:
  → rl-evaluation

  Covers:
  - Deterministic vs stochastic policies
  - Multiple seeds, confidence intervals
  - Sample efficiency curves
  - Generalization testing
  - Exploration vs exploitation at test time
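The "multiple seeds, confidence intervals" item can be sketched as follows. This assumes one final evaluation return per seed and uses a normal approximation (z = 1.96), which is only rough with a handful of seeds; a t-interval or bootstrap is usually more appropriate there. The function name is illustrative.

```python
import math
import statistics
from typing import List, Tuple


def summarize_seeds(returns_per_seed: List[float],
                    z: float = 1.96) -> Tuple[float, Tuple[float, float]]:
    """Mean return and a normal-approximation confidence interval across seeds."""
    n = len(returns_per_seed)
    if n < 2:
        raise ValueError("need at least two seeds to estimate variance")
    mean = statistics.fmean(returns_per_seed)
    sem = statistics.stdev(returns_per_seed) / math.sqrt(n)  # standard error
    return mean, (mean - z * sem, mean + z * sem)
```

Reporting the interval alongside the mean makes it harder to mistake seed-to-seed variance for a real difference between two algorithms.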
Common Multi-Skill Scenarios
Scenario: Complete Beginner to RL
Routing sequence:
- •rl-foundations - Understand MDP, value functions, policy gradients
- •value-based-methods OR policy-gradient-methods - Start with simpler algorithm (DQN or REINFORCE)
- •rl-debugging - When things don't work (they won't initially)
- •rl-environments - Set up custom environments
- •rl-evaluation - Proper evaluation methodology
Scenario: Continuous Control (Robotics)
Routing sequence:
- •actor-critic-methods - Primary (SAC for sample efficiency, TD3 for stability)
- •rl-debugging - Systematic debugging when training issues arise
- •exploration-strategies - If exploration is insufficient
- •reward-shaping - If reward is sparse or agent learns wrong behavior
- •rl-evaluation - Evaluation on real robot vs simulation
Scenario: Offline RL from Dataset
Routing sequence:
- •offline-rl - Primary (CQL, IQL, special considerations)
- •rl-evaluation - Evaluation without environment interaction
- •rl-debugging - Debugging without online rollouts (limited tools)
Scenario: Multi-Agent Cooperative Task
Routing sequence:
- •multi-agent-rl - Primary (QMIX, COMA, centralized training)
- •reward-shaping - Team rewards, credit assignment
- •policy-gradient-methods - Often used as base algorithm (PPO + MARL)
- •rl-debugging - Multi-agent debugging (non-stationarity issues)
Scenario: Sample-Efficient Learning
Routing sequence:
- •actor-critic-methods (SAC) OR model-based-rl (MBPO)
- •rl-debugging - Critical to not waste samples on bugs
- •rl-evaluation - Track sample efficiency curves
Scenario: Sparse Reward Problem
Routing sequence:
- •reward-shaping - Potential-based shaping, subgoal rewards
- •exploration-strategies - Curiosity, intrinsic motivation
- •rl-debugging - Verify exploration hyperparameters
- •Primary algorithm: actor-critic-methods or policy-gradient-methods
Rationalization Resistance Table
| Rationalization | Reality | Counter-Guidance | Red Flag |
|---|---|---|---|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | "Let's clarify: discrete or continuous actions? Sample efficiency constraints?" | Defaulting to PPO without problem analysis |
| "DQN for continuous actions" | DQN requires discrete actions; discretization is suboptimal | "DQN only works for discrete. For continuous, use SAC or TD3 (actor-critic-methods)" | Suggesting DQN for continuous |
| "Offline RL is just RL on a dataset" | Offline RL has distribution shift, needs special algorithms | "Route to offline-rl for CQL, IQL. Standard algorithms fail on offline data." | Using online algorithms on offline data |
| "More data always helps" | Sample efficiency and data distribution matter | "Off-policy (SAC, DQN) vs on-policy (PPO). Offline needs CQL." | Ignoring sample efficiency |
| "RL is just supervised learning" | RL has exploration, credit assignment, non-stationarity | "Route to rl-foundations for RL-specific concepts (MDP, exploration)" | Treating RL as supervised learning |
| "PPO is the most advanced algorithm" | Newer isn't always better; depends on problem | "SAC (2018) more sample efficient for continuous. DQN (2013) great for discrete." | Recency bias |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | "Route to rl-debugging first. Check reward scale, exploration, learning rate." | Changing algorithms before debugging |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | "Use actor-critic-methods (SAC, TD3) for continuous. Don't discretize." | Forcing wrong algorithm onto problem |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | "Route to exploration-strategies for curiosity, RND, intrinsic motivation." | Underestimating exploration difficulty |
| "I'll just increase the reward when it doesn't learn" | Arbitrary reward scaling can destabilize learning and doesn't address the root cause | "Route to rl-debugging. Diagnose why the agent isn't learning before changing reward magnitude." | Arbitrary reward hacking |
| "I can reuse online RL code for offline data" | Offline RL needs conservative algorithms | "Route to offline-rl. CQL/IQL prevent overestimation, online algorithms fail." | Offline blindness |
| "My test reward is lower than training, must be overfitting" | Often a difference in action selection (exploratory or stochastic in training vs greedy at test), not necessarily overfitting | "Route to rl-evaluation. Training uses exploration, test should be greedy." | Misunderstanding RL evaluation |
Red Flags Checklist
Watch for these signs of incorrect routing:
- • Algorithm-First Thinking: Recommending algorithm before asking about action space, data regime
- • DQN for Continuous: Suggesting DQN/Q-learning for continuous action spaces
- • Offline Blindness: Not recognizing fixed dataset requires offline-rl (CQL, IQL)
- • PPO Cargo-Culting: Defaulting to PPO without considering alternatives
- • No Problem Characterization: Not asking: discrete vs continuous? online vs offline?
- • Skipping Foundations: Implementing algorithms when user doesn't understand RL basics
- • Debug-Last: Suggesting algorithm changes before systematic debugging
- • Sample Efficiency Ignorance: Not asking about sample constraints (simulator cost, real robot limits)
- • Exploration Assumptions: Assuming epsilon-greedy is sufficient for all problems
- • Infrastructure Confusion: Trying to explain Gym API instead of routing to rl-environments
- • Evaluation Naivety: Not routing to rl-evaluation for proper methodology
If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
When NOT to Use This Pack
Clarify boundaries with other packs:
| User Request | Correct Pack | Reason |
|---|---|---|
| "Train classifier on labeled data" | training-optimization | Supervised learning, not RL |
| "Design transformer architecture" | neural-architectures | Architecture design, not RL algorithm |
| "Implement PyTorch autograd" | pytorch-engineering | PyTorch internals, not RL |
| "Deploy model to production" | ml-production | Deployment, not RL training |
| "Fine-tune LLM with RLHF" | llm-specialist | LLM-specific (though uses RL concepts) |
| "Optimize hyperparameters" | training-optimization | Hyperparameter search, not RL |
| "Implement custom CUDA kernel" | pytorch-engineering | Low-level optimization, not RL |
Edge case: RLHF (Reinforcement Learning from Human Feedback) for LLMs uses RL concepts (PPO) but has LLM-specific considerations. Route to llm-specialist first; they may reference this pack.
Diagnostic Question Templates
Use these questions to classify problems:
Action Space
- •"What actions can your agent take? Discrete choices or continuous values?"
- •"How many possible actions? Small (< 100), large (100-10000), or infinite (continuous)?"
Data Regime
- •"Can your agent interact with the environment during training, or do you have a fixed dataset?"
- •"Are you learning online (agent tries actions) or offline (from logged data)?"
Experience Level
- •"Are you new to RL, or do you have a specific problem?"
- •"Do you understand MDPs, value functions, and policy gradients?"
Special Requirements
- •"Are multiple agents involved? Do they cooperate or compete?"
- •"Is sample efficiency critical? How many episodes can you afford?"
- •"Is the reward sparse (only at goal) or dense (every step)?"
- •"Do you need the agent to learn a model of the environment?"
Infrastructure
- •"Do you have an environment set up, or do you need to create one?"
- •"Are you debugging a training issue, or designing from scratch?"
- •"How will you evaluate the agent?"
Implementation Process
When routing to a skill:
- •Ask Diagnostic Questions (don't assume)
- •Explain Routing Rationale (teach the user problem classification)
- •Route to Primary Skill(s) (1-3 skills for multi-faceted problems)
- •Mention Related Skills (user may need later)
- •Set Expectations (what the skill will cover)
Example:
"You mentioned continuous joint angles for a robot arm. This is a continuous action space, which means DQN won't work (it requires discrete actions).
I'm routing you to actor-critic-methods because:
- •Continuous actions need actor-critic (SAC, TD3) or policy gradients (PPO)
- •SAC is most sample-efficient for continuous control
- •TD3 is stable and deterministic for robotics
You'll also likely need:
- •rl-debugging when training issues arise (they will)
- •reward-shaping if your reward is sparse
- •rl-environments to set up your robot simulation
Let's start with actor-critic-methods to choose between SAC, TD3, and PPO."
Summary: Routing Decision Tree
START: RL problem
├─ Need foundations? (new to RL, confused about concepts)
│  └─ → rl-foundations
│
├─ DISCRETE actions?
│  ├─ Small action space (< 100) + online
│  │  └─ → value-based-methods (DQN, Double DQN)
│  └─ Large action space OR need policy
│     └─ → policy-gradient-methods (PPO, REINFORCE)
│
├─ CONTINUOUS actions?
│  ├─ Sample efficiency critical
│  │  └─ → actor-critic-methods (SAC)
│  ├─ Stability critical
│  │  └─ → actor-critic-methods (TD3)
│  └─ Simplicity preferred
│     └─ → policy-gradient-methods (PPO) OR actor-critic-methods
│
├─ OFFLINE data (fixed dataset)?
│  └─ → offline-rl (CQL, IQL) [CRITICAL: not standard algorithms]
│
├─ MULTI-AGENT?
│  └─ → multi-agent-rl (QMIX, MADDPG)
│
├─ Sample efficiency EXTREME?
│  └─ → model-based-rl (MBPO, Dreamer)
│
├─ DEBUGGING issues?
│  ├─ Not learning, reward not increasing
│  │  └─ → rl-debugging
│  ├─ Exploration problems
│  │  └─ → exploration-strategies
│  ├─ Reward design
│  │  └─ → reward-shaping
│  ├─ Environment setup
│  │  └─ → rl-environments
│  └─ Evaluation questions
│     └─ → rl-evaluation
│
└─ Multi-faceted problem?
   └─ Route to 2-3 skills (primary + supporting)
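The main branches of the tree can also be read as a small function. This is a sketch, not the pack's implementation: the `ProblemProfile` fields and the precedence between branches (foundations first, then debugging, then offline before action space) are assumptions chosen for the example, and only a subset of the branches is covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemProfile:
    """Hypothetical answers to the diagnostic questions."""
    needs_foundations: bool = False
    not_learning: bool = False
    offline: bool = False
    continuous_actions: bool = False
    num_actions: int = 0  # only meaningful for discrete actions
    multi_agent: bool = False
    extreme_sample_efficiency: bool = False


def route(p: ProblemProfile) -> List[str]:
    """Return the primary skill followed by supporting skills."""
    if p.needs_foundations:
        return ["rl-foundations"]
    if p.not_learning:
        return ["rl-debugging"]  # debug before changing algorithms
    if p.offline:
        skills = ["offline-rl"]  # overrides the action-space choice
    elif p.continuous_actions:
        skills = ["actor-critic-methods"]
    elif p.num_actions < 100:
        skills = ["value-based-methods"]
    else:
        skills = ["policy-gradient-methods"]
    if p.multi_agent:
        skills.append("multi-agent-rl")
    if p.extreme_sample_efficiency and not p.offline:
        skills.append("model-based-rl")
    return skills
```

For example, `route(ProblemProfile(continuous_actions=True))` yields `["actor-critic-methods"]`, while a fixed dataset yields `["offline-rl"]` regardless of action space.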
Final Reminders
- •Problem characterization BEFORE algorithm selection
- •DQN for discrete ONLY (never continuous)
- •Offline data needs offline-rl (CQL, IQL)
- •PPO is not universal (good general-purpose, not optimal everywhere)
- •Debug before changing algorithms (route to rl-debugging)
- •Ask questions, don't assume (action space? data regime?)
This meta-skill is your routing hub. Route decisively, explain clearly, teach problem classification.