Value-Based Methods
When to Use This Skill
Invoke this skill when you encounter:
- •Algorithm Selection: "Should I use DQN or policy gradient for my problem?"
- •DQN Implementation: User implementing DQN and needs guidance on architecture
- •Training Issues: "DQN is diverging", "Q-values too high", "slow to learn"
- •Variant Questions: "What's Double DQN?", "Should I use Dueling?", "Is Rainbow worth it?"
- •Discrete Action RL: User has discrete action space and implementing value method
- •Hyperparameter Tuning: Debugging learning rates, replay buffer size, network architecture
- •Implementation Bugs: Target network missing, frame stacking wrong, reward scaling issues
- •Custom Environments: Designing states, rewards, action spaces for DQN
This skill provides practical implementation guidance for discrete action RL.
Do NOT use this skill for:
- •Continuous action spaces (route to actor-critic-methods)
- •Policy gradients (route to policy-gradient-methods)
- •Model-based RL (route to model-based-rl)
- •Offline RL (route to offline-rl-methods)
- •Theory foundations (route to rl-foundations)
Core Principle
Value-based methods solve discrete action RL by learning Q(s,a) = expected return from taking action a in state s, then acting greedily. They're powerful for discrete spaces but require careful implementation to avoid instability.
Key insight: Value methods assume you can enumerate and compare all action values. This breaks down with continuous actions (infinite actions to compare). Use them for:
- •Games (Atari, Chess)
- •Discrete control (robot navigation, discrete movement)
- •Dialog systems (discrete utterances)
- •Combinatorial optimization
Do not use for:
- •Continuous control (robot arm angles, vehicle acceleration)
- •Stochastic policies required (multi-agent, exploration in deterministic policy)
- •Exploration of large action space (too slow to learn all actions)
Part 1: Q-Learning Foundation
From TD Learning to Q-Learning
You understand TD learning from rl-foundations. Q-learning extends it to action-values.
TD(0) for V(s):
V[s] ← V[s] + α(r + γV[s'] - V[s])
Q-Learning for Q(s,a):
Q[s,a] ← Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])
Key difference: Q-learning has max over next actions (off-policy).
Off-Policy Learning
Q-learning learns the optimal policy π(a|s) = argmax_a Q(s,a)* regardless of exploration policy.
Example: Cliff Walking
Agent follows epsilon-greedy (explores 10% random) But Q-learning learns: "Take safe path away from cliff" (optimal) NOT: "Walk along cliff edge" (what exploring policy does sometimes) Q-learning separates: - Behavior policy: ε-greedy (for exploration) - Target policy: greedy (what we're learning toward)
Why This Matters: Off-policy learning is sample-efficient (can learn from any exploration strategy). On-policy methods like SARSA would learn the exploration noise into policy.
Convergence Guarantee
Theorem: Q-learning converges to Q*(s,a) if:
- •All state-action pairs visited infinitely often
- •Learning rate α(t) → 0 (e.g., α = 1/N(s,a))
- •Sufficiently small ε (exploration not zero)
Practical: Use ε-decay schedule that ensures eventual convergence.
epsilon = max(epsilon_min, epsilon * decay_rate) # Start: ε=1.0, decay to ε=0.01 # Ensures: all actions eventually tried, then exploitation takes over
Q-Learning Pitfall #1: Small State Spaces Only
Scenario: User implements tabular Q-learning for Atari.
Problem:
Atari image: 210×160 RGB = 20,160 pixels Possible states: 256^20160 (astronomical) Tabular Q-learning: impossible
Solution: Use function approximation (neural networks) → Deep Q-Networks
Red Flag: Tabular Q-learning works only for small state spaces (<10,000 unique states).
Part 2: Deep Q-Networks (DQN)
What DQN Adds to Q-Learning
DQN = Q-learning + neural network + two critical stability mechanisms:
- •Experience Replay: Break temporal correlation
- •Target Network: Prevent moving target problem
Mechanism 1: Experience Replay
Problem without replay:
# Naive approach (WRONG)
state = env.reset()
for t in range(1000):
action = epsilon_greedy(state)
next_state, reward = env.step(action)
# Update Q from this single transition
Q(state, action) += α(reward + γ max Q(next_state) - Q(state, action))
state = next_state
Why this fails:
- •Consecutive transitions are highly correlated (state_t and state_{t+1} very similar)
- •Neural network gradient updates are unstable with correlated data
- •Network overfits to recent trajectory
Experience Replay Solution:
# Collect experiences in buffer
replay_buffer = []
for episode in range(num_episodes):
state = env.reset()
for t in range(max_steps):
action = epsilon_greedy(state)
next_state, reward = env.step(action)
# Store experience (not learn yet)
replay_buffer.append((state, action, reward, next_state, done))
# Sample random batch and learn
if len(replay_buffer) > batch_size:
batch = random.sample(replay_buffer, batch_size)
for (s, a, r, s_next, done) in batch:
if done:
target = r
else:
target = r + gamma * max(Q(s_next))
loss = (Q(s,a) - target)^2
# Update network weights
optimizer.step(loss)
state = next_state
Why this works:
- •Breaks correlation: Random sampling decorrelates gradient updates
- •Sample efficiency: Reuse old experiences (learn more from same env interactions)
- •Stability: Averaged gradients are smoother
Mechanism 2: Target Network
Problem without target network:
# Moving target problem (WRONG)
loss = (Q(s,a) - [r + γ max Q(s_next, a_next)])^2
# ^^^^ ^^^^
# Same network computing both target and prediction
Issue: Network updates move both the prediction AND the target, creating instability.
Analogy: Trying to hit a moving target that moves whenever you aim.
Target Network Solution:
# Separate networks
main_network = create_network() # Learning network
target_network = create_network() # Stable target (frozen)
# Training loop
loss = (main_network(s,a) - [r + γ max target_network(s_next)])^2
^^^^^^^^
Target network doesn't update every step
# Periodically synchronize
if t % update_frequency == 0:
target_network = copy(main_network) # Freeze for N steps
Why this works:
- •Stability: Target doesn't move as much (frozen for many steps)
- •Bellman consistency: Gives network time to learn, then adjusts target
- •Convergence: Bootstrapping no longer destabilized by moving target
DQN Architecture Pattern
import torch
import torch.nn as nn
class DQN(nn.Module):
def __init__(self, input_size, num_actions):
super().__init__()
# For Atari: CNN backbone
self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4) # Frame stack: 4 frames
self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
# Flatten and FC layers
self.fc1 = nn.Linear(64*7*7, 512) # After convolutions
self.fc_value = nn.Linear(512, 1) # For dueling: value stream
self.fc_actions = nn.Linear(512, num_actions) # For dueling: advantage stream
def forward(self, x):
# x shape: (batch, 4, 84, 84) for Atari
x = torch.relu(self.conv1(x))
x = torch.relu(self.conv2(x))
x = torch.relu(self.conv3(x))
x = x.flatten(start_dim=1)
x = torch.relu(self.fc1(x))
# For basic DQN: just action values
q_values = self.fc_actions(x)
return q_values
Hyperparameter Guidance
| Parameter | Value Range | Effect | Guidance |
|---|---|---|---|
| Replay buffer size | 10k-1M | Memory, sample diversity | Start 100k, increase for slow learning |
| Batch size | 32-256 | Stability vs memory | 32-64 common; larger = more stable |
| Learning rate α | 0.0001-0.001 | Convergence speed | Start 0.0001, increase if too slow |
| Target update freq | 1k-10k steps | Stability | Update every 1000-5000 steps |
| ε initial | 0.5-1.0 | Exploration | Start 1.0 (random) |
| ε final | 0.01-0.05 | Late exploitation | 0.01-0.05 typical |
| ε decay | 10k-1M steps | Exploration → Exploitation | Tune to problem (larger env → longer decay) |
DQN Pitfall #1: Missing Target Network
Symptom: "DQN loss explodes immediately, Q-values diverge to ±infinity"
Root cause: No target network (or target updates too frequently)
# WRONG - target network updates every step loss = (Q(s,a) - [r + γ max Q(s_next)])^2 # Both from same network # CORRECT - target network frozen for steps loss = (Q_main(s,a) - [r + γ max Q_target(s_next)])^2 # Update target: if step % 1000 == 0: Q_target = copy(Q_main)
Fix: Verify target network update frequency (1000-5000 steps typical).
DQN Pitfall #2: Replay Buffer Too Small
Symptom: "Sample efficiency very poor, agent takes millions of steps to learn"
Root cause: Small replay buffer = replay many recent correlated experiences
# WRONG replay_buffer_size = 10_000 # After 10k steps, only seeing recent experience (no diversity) # CORRECT replay_buffer_size = 100_000 or 1_000_000 # See diverse experiences from long history
Rule of Thumb: Replay buffer ≥ 10 × episode length (more is usually better)
Memory vs Sample Efficiency Tradeoff:
- •10k buffer: Low memory, high correlation (bad)
- •100k buffer: Moderate memory, good diversity (usually sufficient)
- •1M buffer: High memory, excellent diversity (overkill unless long episodes)
DQN Pitfall #3: No Frame Stacking
Symptom: "Learning very slow or doesn't converge"
Root cause: Single frame doesn't show velocity (violates Markov property)
# WRONG - single frame
state = current_frame # No velocity information
# Network cannot infer: is ball moving left or right?
# CORRECT - stack frames
state = np.stack([frame_t, frame_{t-1}, frame_{t-2}, frame_{t-3}])
# Velocity: difference between consecutive frames
Implementation:
from collections import deque
class FrameBuffer:
def __init__(self, num_frames=4):
self.buffer = deque(maxlen=num_frames)
def add_frame(self, frame):
self.buffer.append(frame)
def get_state(self):
return np.stack(list(self.buffer)) # (4, 84, 84)
DQN Pitfall #4: Reward Clipping Wrong
Symptom: "Training unstable" or "Learned policy much worse than Q-values suggest"
Context: Atari papers clip rewards to {-1, 0, +1} for stability.
Misunderstanding: Clipping destroys reward information.
# WRONG - unthinking clip reward = np.clip(reward, -1, 1) # All rewards become -1,0,+1 # In custom env with rewards in [-100, 1000], loses critical information # CORRECT - Normalize instead reward = (reward - reward_mean) / reward_std # Preserves differences, stabilizes scale
When to clip: Only if rewards are naturally in {-1, 0, +1} (like Atari).
When to normalize: Custom environments with arbitrary scales.
Part 3: Double DQN
The Overestimation Bias Problem
Max operator bias: In stochastic environments, max over noisy estimates is biased upward.
Example:
True Q*(s,a) values: [10.0, 5.0, 8.0]
Due to noise, estimates: [11.0, 4.0, 9.0]
↑
True Q = 10, estimate = 11
Standard DQN takes max: max(Q_estimates) = 11
But true Q*(s,best_action) = 10
Systematic overestimation! Agent thinks actions better than they are.
Consequence:
- •Inflated Q-values during training
- •Learned policy (greedy) performs worse than Q-values suggest
- •Especially bad early in training when estimates very noisy
Double DQN Solution
Insight: Use one network to select best action, another to evaluate it.
# Standard DQN (overestimates)
target = r + γ max_a Q_target(s_next, a)
# ^^^^
# Both selecting and evaluating with same network
# Double DQN (unbiased)
best_action = argmax_a Q_main(s_next, a) # Select with main network
target = r + γ Q_target(s_next, best_action) # Evaluate with target network
Why it works:
- •Decouples selection and evaluation
- •Removes systematic bias
- •Unbiased estimator of true Q*
Implementation
class DoubleDQN(DQNAgent):
def compute_loss(self, batch):
states, actions, rewards, next_states, dones = batch
# Main network Q-values for current state
q_values = self.main_network(states)
q_values_current = q_values.gather(1, actions)
# Double DQN: select action with main network
next_q_main = self.main_network(next_states)
best_actions = next_q_main.argmax(1, keepdim=True)
# Evaluate with target network
next_q_target = self.target_network(next_states)
max_next_q = next_q_target.gather(1, best_actions).detach()
# TD target (handles done flag)
targets = rewards + (1 - dones) * self.gamma * max_next_q
loss = F.smooth_l1_loss(q_values_current, targets)
return loss
When to Use Double DQN
Use Double DQN if:
- •Training a medium-complexity task (Atari)
- •Suspicious that Q-values are too optimistic
- •Want slightly better sample efficiency
Standard DQN is OK if:
- •Small action space (less overestimation)
- •Training is otherwise stable
- •Sample efficiency not critical
Takeaway: Double DQN is strictly better, minimal cost, use it.
Part 4: Dueling DQN
Dueling Architecture: Separating Value and Advantage
Insight: Q(s,a) = V(s) + A(s,a) where:
- •V(s): How good is this state? (independent of action)
- •A(s,a): How much better is action a than average? (action-specific advantage)
Why separate:
- •Better feature learning: Network learns state features independently from action value
- •Stabilization: Value stream sees many states (more gradient signal)
- •Generalization: Advantage stream learns which actions matter
Example:
Atari Breakout: V(s) = "Ball in good position, paddle ready" (state value) A(s,LEFT) = -2 (moving left here hurts) A(s,RIGHT) = +3 (moving right here helps) A(s,NOOP) = 0 (staying still is neutral) Q(s,LEFT) = V + A = 5 + (-2) = 3 Q(s,RIGHT) = V + A = 5 + 3 = 8 ← Best action Q(s,NOOP) = V + A = 5 + 0 = 5
Architecture
class DuelingDQN(nn.Module):
def __init__(self, input_size, num_actions):
super().__init__()
# Shared feature backbone
self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
self.fc = nn.Linear(64*7*7, 512)
# Value stream (single output)
self.value_fc = nn.Linear(512, 256)
self.value = nn.Linear(256, 1)
# Advantage stream (num_actions outputs)
self.advantage_fc = nn.Linear(512, 256)
self.advantage = nn.Linear(256, num_actions)
def forward(self, x):
# Shared backbone
x = torch.relu(self.conv1(x))
x = torch.relu(self.conv2(x))
x = torch.relu(self.conv3(x))
x = x.flatten(start_dim=1)
x = torch.relu(self.fc(x))
# Value stream
v = torch.relu(self.value_fc(x))
v = self.value(v)
# Advantage stream
a = torch.relu(self.advantage_fc(x))
a = self.advantage(a)
# Combine: Q = V + (A - mean(A))
# Subtract mean(A) for normalization (prevents instability)
q = v + (a - a.mean(dim=1, keepdim=True))
return q
Why Subtract Mean of Advantages?
# Without mean subtraction q = v + a # Problem: V and A not separately identifiable # V could be 100 + A = -90 or V = 50 + A = -40 (same Q) # With mean subtraction q = v + (a - mean(a)) # Mean advantage = 0 on average # Forces: V learns state value, A learns relative advantage # More stable training
When to Use Dueling DQN
Use Dueling if:
- •Training complex environments (Atari)
- •Want better feature learning
- •Training is unstable (helps stabilization)
Standard DQN is OK if:
- •Simple environments
- •Computational budget tight
Takeaway: Dueling is strictly better for neural network learning, minimal cost, use it.
Part 5: Prioritized Experience Replay
Problem with Uniform Sampling
Issue: All transitions equally likely to be sampled.
# Uniform sampling batch = random.sample(replay_buffer, batch_size) # Includes: boring transitions, important transitions, rare transitions # All mixed together with equal weight
Problem:
- •Wasted learning on transitions already understood
- •Rare important transitions sampled rarely
- •Sample inefficiency
Example:
Atari agent learns mostly: "Move paddle left-right in routine positions" Rarely: "What happens when ball is in corner?" (rare, important) Uniform replay: 95% learning about paddle, 5% about corners Should be: More focus on corners (rarer, more surprising)
Prioritized Experience Replay Solution
Insight: Sample transitions proportional to TD error (surprise).
# Compute TD error (surprise) td_error = |r + γ max Q(s_next) - Q(s,a)| # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # How wrong was our prediction? # Probability ∝ TD error^α # High error transitions sampled more batch = sample_proportional_to_priority(replay_buffer, priorities)
Implementation
import numpy as np
class PrioritizedReplayBuffer:
def __init__(self, size, alpha=0.6):
self.buffer = []
self.priorities = []
self.size = size
self.alpha = alpha # How much to prioritize (0=uniform, 1=full priority)
self.epsilon = 1e-6 # Small value to avoid zero priority
def add(self, experience):
# New experiences get max priority (important!)
max_priority = np.max(self.priorities) if self.priorities else 1.0
if len(self.buffer) < self.size:
self.buffer.append(experience)
self.priorities.append(max_priority)
else:
# Replace oldest if full
self.buffer[len(self.buffer) % self.size] = experience
self.priorities[len(self.priorities) % self.size] = max_priority
def sample(self, batch_size):
# Compute sampling probabilities
priorities = np.array(self.priorities) ** self.alpha
priorities = priorities / np.sum(priorities)
# Sample indices
indices = np.random.choice(len(self.buffer), batch_size, p=priorities)
batch = [self.buffer[i] for i in indices]
# Importance sampling weights (correct for bias from prioritized sampling)
weights = (1 / (len(self.buffer) * priorities[indices])) ** (1/3) # β=1/3
weights = weights / np.max(weights) # Normalize
return batch, indices, weights
def update_priorities(self, indices, td_errors):
# Update priorities based on new TD errors
for idx, td_error in zip(indices, td_errors):
self.priorities[idx] = (np.abs(td_error) + self.epsilon) ** self.alpha
Importance Sampling Weights
Problem: Prioritized sampling introduces bias (samples important transitions more).
Solution: Reweight gradients by inverse probability.
# Uniform sampling: each transition contributes equally loss = mean((r + γ max Q(s_next) - Q(s,a))^2) # Prioritized sampling: bias toward high TD error # Correct with importance weight (large TD error → small weight) loss = mean(weights * (r + γ max Q(s_next) - Q(s,a))^2) # ^^^^^^^ # Importance sampling correction # weights ∝ 1/priority (inverse)
When to Use Prioritized Replay
Use if:
- •Training large environments (Atari)
- •Sample efficiency critical
- •Have computational budget for priority updates
Use standard uniform if:
- •Small environments
- •Computational budget tight
- •Standard training is working fine
Note: Adds complexity (priority updates), minimal empirical gain in many cases.
Part 6: Rainbow DQN
Combining All Improvements
Rainbow = Double DQN + Dueling DQN + Prioritized Replay + 3 more innovations:
- •Double DQN: Reduce overestimation bias
- •Dueling DQN: Separate value and advantage
- •Prioritized Replay: Sample important transitions
- •Noisy Networks: Exploration through network parameters
- •Distributional RL: Learn Q distribution not just mean
- •Multi-step Returns: n-step TD learning instead of 1-step
When to Use Rainbow
Use Rainbow if:
- •Need state-of-the-art Atari performance
- •Have weeks of compute for tuning
- •Paper requires it
Use Double + Dueling DQN if:
- •Standard DQN training unstable
- •Want good performance with less tuning
- •Typical development
Use Basic DQN if:
- •Learning the method
- •Sample efficiency not critical
- •Simple environments
Lesson: Understand components separately before combining.
Learning progression: 1. Q-learning (understand basics) 2. Basic DQN (add neural networks) 3. Double DQN (fix overestimation) 4. Dueling DQN (improve architecture) 5. Add prioritized replay (sample efficiency) 6. Rainbow (combine all)
Part 7: Common Bugs and Debugging
Bug #1: Training Divergence (Q-values explode)
Diagnosis Tree:
- •
Check target network:
python# WRONG - updating every step loss = (Q_main(s,a) - [r + γ max Q_main(s_next)])^2 # FIX - use separate target network loss = (Q_main(s,a) - [r + γ max Q_target(s_next)])^2
- •
Check learning rate:
python# WRONG - too high optimizer = torch.optim.Adam(network.parameters(), lr=0.1) # FIX - reduce learning rate optimizer = torch.optim.Adam(network.parameters(), lr=0.0001)
- •
Check reward scale:
python# WRONG - rewards too large reward = 1000 * indicator # Values explode # FIX - normalize reward = 10 * indicator # Or: reward = (reward - reward_mean) / reward_std
- •
Check replay buffer:
python# WRONG - too small replay_buffer_size = 1000 # FIX - increase size replay_buffer_size = 100_000
Bug #2: Poor Sample Efficiency (Slow Learning)
Diagnosis Tree:
- •
Check replay buffer size:
python# Too small → high correlation if len(replay_buffer) < 100_000: print("WARNING: Replay buffer too small for Atari") - •
Check target network update frequency:
python# Too frequent → moving target # Too infrequent → slow target adjustment # Good: every 1000-5000 steps if update_frequency > 10_000: print("Target updates too infrequent") - •
Check batch size:
python# Too small → noisy gradients # Too large → slow training # Good: 32-64 if batch_size < 16 or batch_size > 256: print("Consider adjusting batch size") - •
Check epsilon decay:
python# Decaying too fast → premature exploitation # Decaying too slow → wastes steps exploring # Typical: decay over 10% of total steps if decay_steps < total_steps * 0.05: print("Epsilon decays too quickly")
Bug #3: Q-Values Too Optimistic (Learned Policy << Training Q)
Diagnosis:
Red Flag: Policy performance much worse than max Q-value during training.
# Symptom max_q_value = 100.0 actual_episode_return = 5.0 # 20x gap suggests overestimation # Solutions (try in order) 1. Use Double DQN (reduces overestimation) 2. Reduce learning rate (slower updates → less optimistic) 3. Increase target network update frequency (more stable target) 4. Check reward function (might be wrong)
Bug #4: Frame Stacking Wrong
Symptoms:
- •Very slow learning despite "correct" implementation
- •Network can't learn velocity-dependent behaviors
Diagnosis:
# WRONG - single frame
state_shape = (84, 84, 3)
# Network sees only position, not velocity
# CORRECT - stack 4 frames
state_shape = (84, 84, 4)
# Last 4 frames show motion
# Check frame stacking implementation
frame_stack = deque(maxlen=4)
for frame in frames:
frame_stack.append(frame)
state = np.stack(list(frame_stack)) # (4, 84, 84)
Bug #5: Network Architecture Mismatch
Symptoms:
- •CNN on non-image input (or vice versa)
- •Output layer wrong number of actions
- •Input preprocessing wrong
Diagnosis:
# Image input → use CNN
if input_type == 'image':
network = CNN(num_actions)
# Vector input → use FC
elif input_type == 'vector':
network = FullyConnected(input_size, num_actions)
# Output layer MUST have num_actions outputs
assert network.output_size == num_actions
Part 8: Hyperparameter Tuning
Learning Rate
Too high (α > 0.001):
- •Divergence, unstable training
- •Q-values explode
Too low (α < 0.00001):
- •Very slow learning
- •May not converge in reasonable time
Start: α = 0.0001, adjust if needed
# Adaptive strategy
if max_q_value > 1000:
print("Reduce learning rate")
alpha = alpha / 2
if learning_curve_flat:
print("Increase learning rate")
alpha = alpha * 1.1
Replay Buffer Size
Too small (< 10k for Atari):
- •High correlation in gradients
- •Slow learning, poor sample efficiency
Too large (> 10M):
- •Excessive memory
- •Stale experiences dominate
- •Diminishing returns
Rule of thumb: 10 × episode length
episode_length = 1000 # typical
ideal_buffer = 100_000 # 10 × typical Atari episode
# Can increase if GPU memory available and learning slow
if learning_slow:
buffer_size = 500_000 # More diversity
Epsilon Decay
Too fast (decay in 10k steps):
- •Agent exploits before learning
- •Suboptimal policy
Too slow (decay in 1M steps):
- •Wasted exploration time
- •Slow performance improvement
Rule: Decay over ~10% of total training steps
total_steps = 1_000_000 epsilon_decay_steps = total_steps * 0.1 # 100k steps epsilon = max(epsilon_min, epsilon * (epsilon_decay_steps / current_step))
Target Network Update Frequency
Too frequent (every 100 steps):
- •Target still moves rapidly
- •Less stabilization benefit
Too infrequent (every 100k steps):
- •Network drifts far from target
- •Large jumps in learning
Sweet spot: Every 1k-5k steps (1000 typical)
update_frequency = 1000 # steps between target updates
if update_frequency < 500:
print("Target updates might be too frequent")
if update_frequency > 10_000:
print("Target updates might be too infrequent")
Reward Scaling
No scaling (raw rewards vary wildly):
- •Learning rate effects vary by task
- •Convergence issues
Clipping (clip to {-1, 0, +1}):
- •Good for Atari, loses information in custom envs
Normalization (zero-mean, unit variance):
- •General solution
- •Preserves reward differences
# Track running statistics
running_mean = 0.0
running_var = 1.0
def normalize_reward(reward):
global running_mean, running_var
running_mean = 0.99 * running_mean + 0.01 * reward
running_var = 0.99 * running_var + 0.01 * (reward - running_mean)**2
return (reward - running_mean) / np.sqrt(running_var + 1e-8)
Part 9: When to Use Each Method
DQN Selection Matrix
| Situation | Method | Why |
|---|---|---|
| Learning method | Basic DQN | Understand target network, replay buffer |
| Medium task | Double DQN | Fix overestimation, minimal overhead |
| Complex task | Double + Dueling | Better architecture + bias reduction |
| Sample critical | Add Prioritized | Focus on important transitions |
| State-of-art | Rainbow | Best Atari performance |
| Simple Atari | DQN | Sufficient, faster to debug |
| Non-Atari discrete | DQN/Double | Adapt architecture to input type |
Action Space Check
Before implementing DQN, ask:
if action_space == 'continuous':
print("ERROR: Use actor-critic or policy gradient")
print("Value methods only for discrete actions")
redirect_to_actor_critic_methods()
elif action_space == 'discrete' and len(actions) <= 100:
print("✓ DQN appropriate")
elif action_space == 'discrete' and len(actions) > 1000:
print("⚠ Large action space, consider policy gradient")
print("Or: hierarchical RL, action abstraction")
Part 10: Red Flags Checklist
When you see these, suspect bugs:
- • Single frame input: No velocity info, add frame stacking
- • No target network: Divergence expected, add it
- • Small replay buffer (< 10k): Poor efficiency, increase
- • High learning rate (> 0.001): Instability likely, decrease
- • No frame preprocessing: Raw image pixels, normalize to [0,1]
- • Updating target every step: Moving target problem, freeze it
- • No exploration decay: Explores forever, add epsilon decay
- • Continuous actions: Wrong method, use actor-critic
- • Very large rewards (> 100): Scaling issues, normalize
- • Only one environment: Bias high, use frame skipping or multiple envs
- • Immediate best performance: Overfitting to initial conditions, likely divergence later
- • Q-values >> rewards: Overestimation, try Double DQN
- • All Q-values zero: Network not learning, check learning rate
- • Training loss increasing: Learning rate too high, divergence
Part 11: Pitfall Rationalization
| Rationalization | Reality | Counter-Guidance | Red Flag |
|---|---|---|---|
| "I'll skip target network, save memory" | Causes instability/divergence | Target network critical, minimal memory cost | "Target network optional" |
| "DQN works for continuous actions" | Breaks fundamental assumption (enumerate all actions) | Value methods discrete-only, use SAC/TD3 for continuous | Continuous action DQN attempt |
| "Uniform replay is fine" | Wastes learning on boring transitions | Prioritized replay better, but uniform adequate for many tasks | Always recommending prioritized |
| "I'll use tiny replay buffer, it's faster" | High correlation, poor learning | 100k+ buffer typical, speed tradeoff acceptable | Buffer < 10k for Atari |
| "Frame stacking unnecessary, CNN sees motion" | Single frame Markov-violating | Frame stacking required for velocity from pixels | Single frame policy |
| "Rainbow is just DQN + tricks" | Missing that components solve specific problems | Each component fixes identified issue (overestimation, architecture, sampling) | Jumping to Rainbow without understanding |
| "Clip rewards, I saw it in a paper" | Clips away important reward information | Only clip for {-1,0,+1} Atari-style, normalize otherwise | Blind reward clipping |
| "Larger network will learn faster" | Overfitting, slower gradients, memory issues | Standard architecture (32-64-64 CNN) works, don't over-engineer | Unreasonably large networks |
| "Policy gradient would be simpler here" | Value methods discrete-only right choice | Know when each applies (discrete → value, continuous → policy) | Wrong method choice for action space |
| "Epsilon decay is a hyperparameter like any other" | decay schedule should match task complexity | Tune decay to problem (game length), not arbitrary | Epsilon decay without reasoning |
Part 12: Pressure Test Scenarios
Scenario 1: Continuous Action Space
User: "I have a robot with continuous action space (joint angles in ℝ^7). Can I use DQN?"
Wrong Response: "Sure, discretize the actions" (Combinatorial explosion, inefficient)
Correct Response: "No, value methods are discrete-only. Use actor-critic (SAC) or policy gradient (PPO). They handle continuous actions naturally. Discretization would create 7-dimensional action space explosion (e.g., 10 values per joint = 10^7 actions)."
Scenario 2: Training Unstable
User: "My DQN is diverging immediately, loss explodes. Implementation looks right. What's wrong?"
Systematic Debug:
1. Check target network - Print: "Is target_network separate from main_network?" - Likely cause: updating together 2. Check learning rate - Print: "Learning rate = ?" - If > 0.001, reduce 3. Check reward scale - Print: "max(rewards) = ?" - If > 100, normalize 4. Check initial Q-values - Print: "mean(Q-values) = ?" - Should start near zero
Answer: Target network most likely culprit. Verify separate networks with proper update frequency.
Scenario 3: Rainbow vs Double DQN
User: "Should I implement Rainbow or just Double DQN? Is Rainbow worth the complexity?"
Guidance:
Double DQN: + Fixes overestimation bias + Simple to implement + 90% of Rainbow benefits in many cases - Missing other optimizations Rainbow: + Best Atari performance + State-of-the-art - Complex (6 components) - Harder to debug - More hyperparameters Recommendation: Start: Double DQN If unstable: Add Dueling If slow: Add Prioritized Only go to Rainbow: If need SotA and have time
Scenario 4: Frame Stacking Issue
User: "My agent trains on Atari but learning is slow. How many frames should I stack?"
Diagnosis:
# Check if frame stacking implemented
if state.shape != (4, 84, 84):
print("ERROR: Not using frame stacking")
print("Single frame (1, 84, 84) violates Markov property")
print("Add frame stacking: stack last 4 frames")
# Frame count
4 frames: Standard (shows ~80ms at 50fps = ~4 frames)
3 frames: OK, slightly less velocity info
2 frames: Minimum, just barely Markovian
1 frame: WRONG, not Markovian
8+ frames: Too many, outdated states in stack
Scenario 5: Hyperparameter Tuning
User: "I've tuned learning rate, buffer size, epsilon. What else affects performance?"
Guidance:
Priority 1 (Critical): - Target network update frequency (1000-5000 steps) - Replay buffer size (100k+ typical) - Frame stacking (4 frames) Priority 2 (Important): - Learning rate (0.0001-0.0005) - Epsilon decay schedule (over ~10% of steps) - Batch size (32-64) Priority 3 (Nice to have): - Network architecture (32-64-64 CNN standard) - Reward normalization (helps but not required) - Double/Dueling DQN (improvements, not essentials) Start with Priority 1, only adjust Priority 2-3 if unstable.
Part 13: When to Route Elsewhere
Route to rl-foundations if
- •User confused about Bellman equations
- •Unclear on value function definition
- •Needs theory behind Q-learning convergence
Route to actor-critic-methods if
- •Continuous action space
- •Need deterministic policy gradients
- •Stochastic policy required
Route to policy-gradient-methods if
- •Large discrete action space (> 1000 actions)
- •Need policy regularization
- •Exploration by stochasticity useful
Route to offline-rl-methods if
- •No environment access (batch learning)
- •Learning from logged data only
Route to rl-debugging if
- •General training issues
- •Need systematic debugging methodology
- •Credit assignment problems
Route to reward-shaping if
- •Sparse rewards
- •Reward design affecting learning
- •Potential-based shaping questions
Summary
You now understand:
- •Q-Learning: TD learning for action values, off-policy convergence guarantee
- •DQN: Add neural networks + experience replay + target network for stability
- •Stability Mechanisms:
- •Replay buffer: Break correlation
- •Target network: Prevent moving target problem
- •Common Variants:
- •Double DQN: Fix overestimation bias
- •Dueling DQN: Separate value and advantage
- •Prioritized Replay: Focus on important transitions
- •Rainbow: Combine improvements
- •When to Use: Discrete action spaces only, not continuous
- •Common Bugs: Divergence, poor efficiency, overoptimism, frame issues
- •Hyperparameter Tuning: Buffer size, learning rate, epsilon decay, target frequency
- •Debugging Strategy: Systematic diagnosis (target network → learning rate → reward scale)
Key Takeaways:
- •Value methods are for discrete actions ONLY
- •DQN requires target network and experience replay
- •Frame stacking needed for video inputs (Markov property)
- •Double DQN fixes overestimation, use it
- •Start simple, add Dueling/Prioritized only if needed
- •Systematic debugging beats random tuning
Next: Implement on simple environment first (CartPole or small custom task), then scale to Atari.