Value-Based Methods

When to Use This Skill

Invoke this skill when you encounter:

•Algorithm Selection: "Should I use DQN or policy gradient for my problem?"
•DQN Implementation: User implementing DQN and needs guidance on architecture
•Training Issues: "DQN is diverging", "Q-values too high", "slow to learn"
•Variant Questions: "What's Double DQN?", "Should I use Dueling?", "Is Rainbow worth it?"
•Discrete Action RL: User has discrete action space and implementing value method
•Hyperparameter Tuning: Debugging learning rates, replay buffer size, network architecture
•Implementation Bugs: Target network missing, frame stacking wrong, reward scaling issues
•Custom Environments: Designing states, rewards, action spaces for DQN

This skill provides practical implementation guidance for discrete action RL.

Do NOT use this skill for:

•Continuous action spaces (route to actor-critic-methods)
•Policy gradients (route to policy-gradient-methods)
•Model-based RL (route to model-based-rl)
•Offline RL (route to offline-rl-methods)
•Theory foundations (route to rl-foundations)

Core Principle

Value-based methods solve discrete-action RL by learning Q(s,a), the expected return from taking action a in state s and following the policy afterward, then acting greedily with respect to Q. They are powerful for discrete spaces but need careful implementation to avoid instability.

Key insight: Value methods assume you can enumerate and compare all action values. This breaks down with continuous actions (infinite actions to compare). Use them for:

•Games (Atari, Chess)
•Discrete control (robot navigation, discrete movement)
•Dialog systems (discrete utterances)
•Combinatorial optimization

Do not use for:

•Continuous control (robot arm angles, vehicle acceleration)
•Problems that require a stochastic policy (e.g., multi-agent games where a deterministic greedy policy is exploitable)
•Very large discrete action spaces (estimating and comparing a value for every action becomes too slow)

Part 1: Q-Learning Foundation

From TD Learning to Q-Learning

You understand TD learning from rl-foundations. Q-learning extends it to action-values.

TD(0) for V(s):

code

V[s] ← V[s] + α(r + γV[s'] - V[s])

Q-Learning for Q(s,a):

code

Q[s,a] ← Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])

Key difference: Q-learning has max over next actions (off-policy).
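
The update rule above can be sketched for the tabular case as follows. This is a minimal illustration, assuming states and actions are small integers and Q is a list of lists; `q_learning_update` and its parameter names are made up for this example and do not come from any library.

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the TD error."""
    # Terminal transitions have no future value to bootstrap from
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Two states, two actions; Q[state][action]
Q = [[0.0, 0.0], [0.0, 1.0]]
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, done=False, alpha=0.5, gamma=0.9)
```

The `max(Q[s_next])` term is the only place where this differs from a TD(0) update on V: the bootstrap value is the best next action, not the action the agent will actually take.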

Off-Policy Learning

Q-learning learns the optimal policy π*(s) = argmax_a Q*(s,a) regardless of which exploration policy generated the data.

Example: Cliff Walking

code

Agent follows epsilon-greedy (explores 10% random)
Q-learning learns: "Walk along the cliff edge" (shortest path, optimal for the greedy policy)
SARSA learns: "Take the longer, safer path" (it accounts for its own random exploratory steps)

Q-learning separates:
- Behavior policy: ε-greedy (for exploration)
- Target policy: greedy (what we're learning toward)

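Why This Matters: Off-policy learning can reuse data from any behavior policy, including old experience in a replay buffer, which makes it sample-efficient. On-policy methods like SARSA instead learn the value of the exploring policy, so exploration noise is baked into the learned values.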
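
To make the off-policy versus on-policy distinction concrete, the two bootstrap targets can be compared side by side. This is a sketch with made-up numbers in the spirit of the cliff example; the function names are illustrative only.

```python
def q_learning_target(r, gamma, q_next):
    """Off-policy target: bootstrap from the best next action."""
    return r + gamma * max(q_next)


def sarsa_target(r, gamma, q_next, a_next):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    return r + gamma * q_next[a_next]


# Next-state action values: action 0 keeps walking, action 1 steps off the cliff
q_next = [-1.0, -100.0]
off_policy = q_learning_target(-1.0, 1.0, q_next)       # ignores the exploratory slip
on_policy = sarsa_target(-1.0, 1.0, q_next, a_next=1)   # charged for the exploratory slip
```

When the exploring agent happens to pick the bad action, only the SARSA target is dragged down by it, which is why its learned values reflect the exploring policy.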

Convergence Guarantee

Theorem: Tabular Q-learning converges to Q*(s,a) if:

•All state-action pairs are visited infinitely often
•The learning rate satisfies Σα = ∞ and Σα² < ∞ (e.g., α = 1/N(s,a))
•Exploration never stops entirely (every action keeps being tried in every state)

Practical: Use an ε-decay schedule with a nonzero floor so the agent keeps exploring while mostly exploiting.

python

epsilon = max(epsilon_min, epsilon * decay_rate)
# Start: ε=1.0, decay to ε=0.01
# Ensures: all actions eventually tried, then exploitation takes over
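
A small sketch of how the decay schedule pairs with ε-greedy action selection. The helper names and the `decay_rate=0.995` default are assumptions chosen for illustration, not recommended values.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, epsilon_min=0.01, decay_rate=0.995):
    """Multiplicative decay with a floor, as in the schedule above."""
    return max(epsilon_min, epsilon * decay_rate)


epsilon = 1.0
for _ in range(2000):
    epsilon = decay_epsilon(epsilon)

greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

The floor `epsilon_min` is what keeps exploration from ever reaching zero.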

Q-Learning Pitfall #1: Small State Spaces Only

Scenario: User implements tabular Q-learning for Atari.

Problem:

code

Atari frame: 210×160 pixels = 33,600 pixels (100,800 byte values in RGB)
Possible states: up to 256^100800 (astronomical)
Tabular Q-learning: impossible

Solution: Use function approximation (neural networks) → Deep Q-Networks

Red Flag: Tabular Q-learning is practical only for small state spaces (roughly under 10,000 unique states).
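
The size of the raw-pixel state space for a 210×160 RGB frame can be checked with a few lines of arithmetic. The sketch below counts decimal digits via logarithms instead of building the huge integer; variable names are illustrative.

```python
import math

height, width, channels = 210, 160, 3
pixels = height * width              # pixels per frame
byte_values = pixels * channels      # one byte per color channel

# Number of decimal digits in 256 ** byte_values, computed via logarithms
digits = math.floor(byte_values * math.log10(256)) + 1
```

A table with one row per possible frame would need a row count with hundreds of thousands of digits, which is why function approximation is the only option here.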

Part 2: Deep Q-Networks (DQN)

What DQN Adds to Q-Learning

DQN = Q-learning + neural network + two critical stability mechanisms:

•Experience Replay: Break temporal correlation
•Target Network: Prevent moving target problem

Mechanism 1: Experience Replay

Problem without replay:

python

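# Naive approach (WRONG): learn from each transition as it arrives
# (env, epsilon_greedy, q_network, and update are placeholders)
state = env.reset()
for t in range(1000):
    action = epsilon_greedy(state)
    next_state, reward, done = env.step(action)

    # One gradient step on this single, just-seen transition
    target = reward + gamma * q_network(next_state).max()
    loss = (q_network(state)[action] - target) ** 2
    update(q_network, loss)  # consecutive updates see nearly identical states
    state = next_state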

Why this fails:

•Consecutive transitions are highly correlated (state_t and state_{t+1} very similar)
•Neural network gradient updates are unstable with correlated data
•Network overfits to recent trajectory

Experience Replay Solution:

python

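import random
from collections import deque

# Collect experiences in a bounded buffer (oldest entries are evicted)
# env, epsilon_greedy, Q, and optimizer are placeholders for your own code
replay_buffer = deque(maxlen=buffer_size)

for episode in range(num_episodes):
    state = env.reset()
    for t in range(max_steps):
        action = epsilon_greedy(state)
        next_state, reward, done = env.step(action)

        # Store the experience; learning happens from sampled batches
        replay_buffer.append((state, action, reward, next_state, done))

        # Sample a random batch and learn
        if len(replay_buffer) >= batch_size:
            batch = random.sample(list(replay_buffer), batch_size)
            loss = 0.0
            for (s, a, r, s_next, d) in batch:
                # Terminal transitions have no bootstrap term;
                # the target is treated as a constant (no gradient)
                target = r if d else r + gamma * Q(s_next).max().detach()
                loss = loss + (Q(s)[a] - target) ** 2
            loss = loss / batch_size

            # Update network weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        state = next_state
        if done:
            break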

Why this works:

•Breaks correlation: Random sampling decorrelates gradient updates
•Sample efficiency: Reuse old experiences (learn more from same env interactions)
•Stability: Averaged gradients are smoother
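
The buffer itself is small enough to sketch in full. This is one possible minimal version using only the standard library; the class and method names are illustrative and not taken from a specific framework.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transition is evicted when full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal ordering
        batch = random.sample(list(self.storage), batch_size)
        # Regroup into per-field tuples: states, actions, rewards, next_states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(2)
```

With capacity 3 and five insertions, only the transitions from steps 2, 3, and 4 remain, so a too-small capacity visibly limits the sampled batch to recent experience.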

Mechanism 2: Target Network

Problem without target network:

python

# Moving target problem (WRONG)
loss = (Q(s,a) - [r + γ max Q(s_next, a_next)])^2
       #     ^^^^             ^^^^
       # Same network computing both target and prediction

Issue: Network updates move both the prediction AND the target, creating instability.

Analogy: Trying to hit a moving target that moves whenever you aim.

Target Network Solution:

python

# Separate networks
main_network = create_network()      # Learning network
target_network = create_network()    # Stable target (frozen)
target_network.load_state_dict(main_network.state_dict())  # Start identical

# Training loop (schematic)
# loss = (main_network(s)[a] - [r + γ max_a' target_network(s_next)[a']])^2
#                                            ^^^^^^^^^^^^^^
#                       Target network doesn't update every step

# Periodically synchronize
if t % update_frequency == 0:
    target_network.load_state_dict(main_network.state_dict())  # Freeze for N steps

Why this works:

•Stability: Target doesn't move as much (frozen for many steps)
•Bellman consistency: Gives network time to learn, then adjusts target
•Convergence: Bootstrapping no longer destabilized by moving target
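
The effect of periodic synchronization can be traced with plain Python objects standing in for network weights. This is a toy sketch under that assumption; `sync_target` is a hypothetical helper.

```python
import copy


def sync_target(main_params, target_params, step, update_frequency):
    """Hard update: copy main weights into the target every `update_frequency` steps."""
    if step % update_frequency == 0:
        return copy.deepcopy(main_params)
    return target_params


main = {"w": [0.0]}
target = copy.deepcopy(main)
history = []
for step in range(1, 7):
    main["w"][0] += 1.0   # stand-in for a gradient step on the main network
    target = sync_target(main, target, step, update_frequency=3)
    history.append(target["w"][0])
```

The target weights stay fixed between syncs while the main weights change every step, so the regression target only moves at steps 3 and 6. The deep copy matters: sharing the same object would make the target follow every update.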

DQN Architecture Pattern

python

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, input_size, num_actions):
        super().__init__()
        # For Atari: CNN backbone
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)  # Frame stack: 4 frames
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        # Flatten and FC layers
        self.fc1 = nn.Linear(64*7*7, 512)  # 84x84 input -> 64 feature maps of 7x7 after the convolutions
        self.fc_actions = nn.Linear(512, num_actions)  # One Q-value per action

    def forward(self, x):
        # x shape: (batch, 4, 84, 84) for Atari
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))

        # For basic DQN: just action values
        q_values = self.fc_actions(x)
        return q_values

Hyperparameter Guidance

Parameter	Value Range	Effect	Guidance
Replay buffer size	10k-1M	Memory, sample diversity	Start 100k, increase for slow learning
Batch size	32-256	Stability vs memory	32-64 common; larger = more stable
Learning rate α	0.0001-0.001	Convergence speed	Start 0.0001, increase if too slow
Target update freq	1k-10k steps	Stability	Update every 1000-5000 steps
ε initial	0.5-1.0	Exploration	Start 1.0 (random)
ε final	0.01-0.05	Late exploitation	0.01-0.05 typical
ε decay	10k-1M steps	Exploration → Exploitation	Tune to problem (larger env → longer decay)

DQN Pitfall #1: Missing Target Network

Symptom: "DQN loss explodes immediately, Q-values diverge to ±infinity"

Root cause: No target network (or target updates too frequently)

python

# WRONG - target network updates every step
loss = (Q(s,a) - [r + γ max Q(s_next)])^2  # Both from same network

# CORRECT - target network frozen for steps
loss = (Q_main(s,a) - [r + γ max Q_target(s_next)])^2
# Update target: if step % 1000 == 0: Q_target = copy(Q_main)

Fix: Verify target network update frequency (1000-5000 steps typical).

DQN Pitfall #2: Replay Buffer Too Small

Symptom: "Sample efficiency very poor, agent takes millions of steps to learn"

Root cause: Small replay buffer = replay many recent correlated experiences

python

# WRONG
replay_buffer_size = 10_000
# After 10k steps, only seeing recent experience (no diversity)

# CORRECT
replay_buffer_size = 100_000  # or 1_000_000 if memory allows
# See diverse experiences from long history

Rule of Thumb: Replay buffer ≥ 10 × episode length (more is usually better)

Memory vs Sample Efficiency Tradeoff:

•10k buffer: Low memory, high correlation (bad)
•100k buffer: Moderate memory, good diversity (usually sufficient)
•1M buffer: High memory, excellent diversity (overkill unless long episodes)

DQN Pitfall #3: No Frame Stacking

Symptom: "Learning very slow or doesn't converge"

Root cause: Single frame doesn't show velocity (violates Markov property)

python

# WRONG - single frame
state = current_frame  # No velocity information
# Network cannot infer: is ball moving left or right?

# CORRECT - stack the last 4 frames (oldest first)
state = np.stack([frame_t_minus_3, frame_t_minus_2, frame_t_minus_1, frame_t])
# Velocity: difference between consecutive frames

Implementation:

python

from collections import deque

import numpy as np

class FrameBuffer:
    def __init__(self, num_frames=4):
        self.buffer = deque(maxlen=num_frames)

    def add_frame(self, frame):
        self.buffer.append(frame)

    def get_state(self):
        return np.stack(list(self.buffer))  # (4, 84, 84)
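
One detail the class above leaves open is the start of an episode, when fewer than `num_frames` frames exist. A common approach, assumed here, is to fill the stack with copies of the first frame. The sketch uses plain lists so it runs without NumPy; `PaddedFrameBuffer` is a hypothetical name.

```python
from collections import deque


class PaddedFrameBuffer:
    """Frame stack that is always full: reset() repeats the first frame."""

    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.buffer = deque(maxlen=num_frames)

    def reset(self, first_frame):
        self.buffer.clear()
        for _ in range(self.num_frames):
            self.buffer.append(first_frame)

    def add_frame(self, frame):
        self.buffer.append(frame)

    def get_state(self):
        # Oldest frame first; wrap in np.stack(...) when frames are arrays
        return list(self.buffer)


frames = PaddedFrameBuffer(num_frames=4)
frames.reset("f0")
frames.add_frame("f1")
state = frames.get_state()
```

Calling `reset` at every episode boundary also prevents frames from the previous episode leaking into the first states of the next one.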

DQN Pitfall #4: Reward Clipping Wrong

Symptom: "Training unstable" or "Learned policy much worse than Q-values suggest"

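Context: The original Atari DQN setup clips rewards to [-1, +1] so that one set of hyperparameters works across games with very different score scales.

Misunderstanding: Treating clipping as harmless everywhere. Clipping discards reward magnitude.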

python

# WRONG - unthinking clip
reward = np.clip(reward, -1, 1)  # Everything is squashed into [-1, 1]
# In custom env with rewards in [-100, 1000], loses critical information

# CORRECT - Normalize instead
reward = (reward - reward_mean) / reward_std
# Preserves differences, stabilizes scale

When to clip: Only when the sign of a reward matters more than its size (as in the Atari benchmark setup).

When to normalize: Custom environments with arbitrary scales.
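
The information loss from clipping is easy to see on a few sample rewards. The sketch below compares clipping with division by a fixed scale; the fixed scale is an assumption made to keep the example short and is a simpler alternative to the mean/std normalization shown above.

```python
def clip_reward(reward):
    """Atari-style clipping: magnitudes above 1 are cut off."""
    return max(-1.0, min(1.0, reward))


def scale_reward(reward, scale):
    """Divide by a fixed scale so relative magnitudes survive."""
    return reward / scale


raw = [-100.0, 5.0, 1000.0]
clipped = [clip_reward(r) for r in raw]
scaled = [scale_reward(r, scale=1000.0) for r in raw]
```

After clipping, a reward of 5 and a reward of 1000 are indistinguishable; after scaling they still differ by a factor of 200.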

Part 3: Double DQN

The Overestimation Bias Problem

Max operator bias: When Q estimates are noisy (from stochastic rewards or approximation error), the max over them is biased upward.

Example:

code

True Q*(s,a) values: [10.0, 5.0, 8.0]

Due to noise, estimates: [11.0, 4.0, 9.0]
                            ↑
                        True Q = 10, estimate = 11

Standard DQN takes max: max(Q_estimates) = 11
But true Q*(s,best_action) = 10

Systematic overestimation! Agent thinks actions better than they are.

Consequence:

•Inflated Q-values during training
•Learned policy (greedy) performs worse than Q-values suggest
•Especially bad early in training when estimates very noisy
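
The upward bias of the max can be checked with a short simulation. This sketch adds zero-mean Gaussian noise to the true values from the example; the noise level, trial count, and seed are arbitrary assumptions.

```python
import random


def mean_of_max(true_q, noise_std=2.0, trials=20000, seed=0):
    """Average of max over noisy, individually unbiased estimates of `true_q`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [q + rng.gauss(0.0, noise_std) for q in true_q]
        total += max(noisy)
    return total / trials


true_q = [10.0, 5.0, 8.0]
estimate = mean_of_max(true_q)
bias = estimate - max(true_q)   # positive: the max of noisy estimates overshoots
```

Each individual estimate is unbiased, yet the average of their maximum lands above the true best value of 10, because the max preferentially picks whichever estimate happened to be pushed upward.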

Double DQN Solution

Insight: Use one network to select best action, another to evaluate it.

python

# Standard DQN (overestimates)
target = r + γ max_a Q_target(s_next, a)
         #        ^^^^
         # Both selecting and evaluating with same network

# Double DQN (reduces overestimation)
best_action = argmax_a Q_main(s_next, a)      # Select with main network
target = r + γ Q_target(s_next, best_action)  # Evaluate with target network

Why it works:

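•Decouples action selection from action evaluation
•An action picked because of upward noise in one network is unlikely to carry the same noise in the other
•Substantially reduces overestimation (it does not make the estimate exactly unbiased)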
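
For a single transition, the two targets can be written as plain functions. This is a sketch with made-up Q-values, using lists in place of network outputs; the function names are illustrative.

```python
def dqn_target(r, done, gamma, q_target_next):
    """Standard DQN: the target network both selects and evaluates."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, done, gamma, q_main_next, q_target_next):
    """Double DQN: main network selects, target network evaluates."""
    if done:
        return r
    best = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    return r + gamma * q_target_next[best]


q_main_next = [1.0, 3.0, 2.0]     # main network prefers action 1
q_target_next = [5.0, 0.5, 4.0]   # target network's largest value is for action 0
standard = dqn_target(1.0, False, 0.9, q_target_next)
double = double_dqn_target(1.0, False, 0.9, q_main_next, q_target_next)
```

The Double DQN target can never exceed the standard one for the same inputs, since it evaluates one specific action instead of taking the maximum.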

Implementation

python

import torch
import torch.nn.functional as F

class DoubleDQN(DQNAgent):  # DQNAgent: your base agent with main_network, target_network, gamma
    def compute_loss(self, batch):
        # Expected shapes: actions (B, 1) int64; rewards and dones (B, 1) float
        states, actions, rewards, next_states, dones = batch

        # Main network Q-values for the actions actually taken
        q_values = self.main_network(states)
        q_values_current = q_values.gather(1, actions)

        with torch.no_grad():  # No gradient flows through the target
            # Double DQN: select action with main network
            next_q_main = self.main_network(next_states)
            best_actions = next_q_main.argmax(1, keepdim=True)

            # Evaluate the selected action with the target network
            next_q_target = self.target_network(next_states)
            next_q = next_q_target.gather(1, best_actions)

            # TD target (no bootstrap on terminal transitions)
            targets = rewards + (1 - dones) * self.gamma * next_q

        loss = F.smooth_l1_loss(q_values_current, targets)
        return loss

When to Use Double DQN

Use Double DQN if:

•Training a medium-complexity task (Atari)
•You suspect Q-values are too optimistic
•Want slightly better sample efficiency

Standard DQN is OK if:

•Small action space (less overestimation)
•Training is otherwise stable
•Sample efficiency not critical

Takeaway: Double DQN costs almost nothing extra and rarely hurts, so use it by default.

Part 4: Dueling DQN

Dueling Architecture: Separating Value and Advantage

Insight: Q(s,a) = V(s) + A(s,a) where:

•V(s): How good is this state? (independent of action)
•A(s,a): How much better is action a than average? (action-specific advantage)

Why separate:

•Better feature learning: Network learns state features independently from action value
•Stabilization: Value stream sees many states (more gradient signal)
•Generalization: Advantage stream learns which actions matter

Example:

code

Atari Breakout:
V(s) = 5 ("Ball in good position, paddle ready": state value)
A(s,LEFT) = -2 (moving left here hurts)
A(s,RIGHT) = +3 (moving right here helps)
A(s,NOOP) = 0 (staying still is neutral)

Q(s,LEFT) = V + A = 5 + (-2) = 3
Q(s,RIGHT) = V + A = 5 + 3 = 8  ← Best action
Q(s,NOOP) = V + A = 5 + 0 = 5

Architecture

python

class DuelingDQN(nn.Module):
    def __init__(self, input_size, num_actions):
        super().__init__()

        # Shared feature backbone
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc = nn.Linear(64*7*7, 512)

        # Value stream (single output)
        self.value_fc = nn.Linear(512, 256)
        self.value = nn.Linear(256, 1)

        # Advantage stream (num_actions outputs)
        self.advantage_fc = nn.Linear(512, 256)
        self.advantage = nn.Linear(256, num_actions)

    def forward(self, x):
        # Shared backbone
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.flatten(start_dim=1)
        x = torch.relu(self.fc(x))

        # Value stream
        v = torch.relu(self.value_fc(x))
        v = self.value(v)

        # Advantage stream
        a = torch.relu(self.advantage_fc(x))
        a = self.advantage(a)

        # Combine: Q = V + (A - mean(A))
        # Subtracting mean(A) makes V and A identifiable, which stabilizes training
        q = v + (a - a.mean(dim=1, keepdim=True))
        return q

Why Subtract Mean of Advantages?

python

# Without mean subtraction
q = v + a
# Problem: V and A not separately identifiable
# V = 100, A = -90 and V = 50, A = -40 give the same Q = 10

# With mean subtraction
q = v + (a - mean(a))
# Centered advantages average to zero across actions
# Forces: V learns state value, A learns relative advantage
# More stable training
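
The aggregation step can be tried on the Breakout numbers from the earlier example. This is a pure-Python sketch; `dueling_q` is an illustrative name, and the second (V, A) pair is invented so that its naive sum V + A matches the first.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) using mean-centered advantages."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]


# Two (V, A) pairs whose naive sums V + A are both [3, 8, 5]
q1 = dueling_q(5.0, [-2.0, 3.0, 0.0])
q2 = dueling_q(105.0, [-102.0, -97.0, -100.0])

best_action = max(range(len(q1)), key=lambda a: q1[a])
```

With mean-centering, the average of the output Q-values always equals the value stream's output, so the two pairs no longer produce the same Q. The ranking of actions is unchanged: RIGHT (index 1) is still best.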

When to Use Dueling DQN

Use Dueling if:

•Training complex environments (Atari)
•Want better feature learning
•Training is unstable (helps stabilization)

Standard DQN is OK if:

•Simple environments
•Computational budget tight

Takeaway: Dueling is usually an improvement for neural network learning at minimal cost; use it unless compute is very tight.

Part 5: Prioritized Experience Replay

Problem with Uniform Sampling

Issue: All transitions equally likely to be sampled.

python

# Uniform sampling
batch = random.sample(replay_buffer, batch_size)
# Includes: boring transitions, important transitions, rare transitions
# All mixed together with equal weight

Problem:

•Wasted learning on transitions already understood
•Rare important transitions sampled rarely
•Sample inefficiency

Example:

code

Atari agent learns mostly: "Move paddle left-right in routine positions"
Rarely: "What happens when ball is in corner?" (rare, important)

Uniform replay: 95% learning about paddle, 5% about corners
Should be: More focus on corners (rarer, more surprising)

Prioritized Experience Replay Solution

Insight: Sample transitions proportional to TD error (surprise).

python

# Compute TD error (surprise)
td_error = |r + γ max Q(s_next) - Q(s,a)|
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#           How wrong was our prediction?

# Probability ∝ TD error^α
# High error transitions sampled more
batch = sample_proportional_to_priority(replay_buffer, priorities)

Implementation

python

import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, size, alpha=0.6, beta=0.4):
        self.buffer = []
        self.priorities = []
        self.size = size
        self.alpha = alpha  # How much to prioritize (0=uniform, 1=full priority)
        self.beta = beta    # Importance-sampling correction (0=none, 1=full); often annealed toward 1
        self.epsilon = 1e-6  # Small value to avoid zero priority
        self.position = 0    # Next slot to overwrite once the buffer is full

    def add(self, experience):
        # New experiences get max priority (important!)
        max_priority = max(self.priorities) if self.priorities else 1.0

        if len(self.buffer) < self.size:
            self.buffer.append(experience)
            self.priorities.append(max_priority)
        else:
            # Replace oldest if full
            self.buffer[self.position] = experience
            self.priorities[self.position] = max_priority
        self.position = (self.position + 1) % self.size

    def sample(self, batch_size):
        # Compute sampling probabilities: P(i) = p_i^alpha / sum_k p_k^alpha
        scaled = np.array(self.priorities) ** self.alpha
        probs = scaled / np.sum(scaled)

        # Sample indices
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        batch = [self.buffer[i] for i in indices]

        # Importance sampling weights (correct for bias from prioritized sampling)
        weights = (len(self.buffer) * probs[indices]) ** (-self.beta)
        weights = weights / np.max(weights)  # Normalize so weights only scale updates down

        return batch, indices, weights

    def update_priorities(self, indices, td_errors):
        # Store raw priorities |TD error| + epsilon; alpha is applied once, in sample()
        for idx, td_error in zip(indices, td_errors):
            self.priorities[idx] = float(np.abs(td_error)) + self.epsilon

Importance Sampling Weights

Problem: Prioritized sampling introduces bias (samples important transitions more).

Solution: Reweight gradients by inverse probability.

python

# Uniform sampling: each transition contributes equally
loss = mean((r + γ max Q(s_next) - Q(s,a))^2)

# Prioritized sampling: bias toward high TD error
# Correct with importance weight (large TD error → small weight)
loss = mean(weights * (r + γ max Q(s_next) - Q(s,a))^2)
#            ^^^^^^^
#      Importance sampling correction

# weights = (1 / (N * P(i)))^β, normalized by the largest weight
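
The relationship between TD errors, sampling probabilities, and importance weights can be worked through on three transitions. This is a standard-library sketch; the function name and the `alpha`, `beta`, and `eps` defaults are assumptions for illustration.

```python
def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Return sampling probabilities and normalized importance weights."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    n = len(td_errors)
    # w_i = (N * P(i)) ** (-beta), then divide by the largest weight
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return probs, [w / max_w for w in weights]


probs, weights = per_probs_and_weights([0.1, 1.0, 5.0])
```

The transition with the largest TD error is sampled most often and receives the smallest weight, so its larger share of the batch is partly offset in the loss.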

When to Use Prioritized Replay

Use if:

•Training large environments (Atari)
•Sample efficiency critical
•Have computational budget for priority updates

Use standard uniform if:

•Small environments
•Computational budget tight
•Standard training is working fine

Note: Prioritized replay adds complexity (priority bookkeeping and updates), and in many smaller tasks the empirical gain is modest.

Part 6: Rainbow DQN

Combining All Improvements

Rainbow = Double DQN + Dueling DQN + Prioritized Replay + 3 more innovations:

•Double DQN: Reduce overestimation bias
•Dueling DQN: Separate value and advantage
•Prioritized Replay: Sample important transitions
•Noisy Networks: Exploration through network parameters
•Distributional RL: Learn the distribution of returns, not just its mean
•Multi-step Returns: n-step TD learning instead of 1-step
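
Of these components, multi-step returns are the simplest to show in isolation. The sketch below computes an n-step target from a list of rewards and a bootstrap value; `n_step_target` is an illustrative name, and the bootstrap value stands in for max_a Q(s_{t+n}, a).

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """Discounted sum of n rewards plus the discounted bootstrap value."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_value


one_step = n_step_target([1.0], gamma=0.5, bootstrap_value=4.0)
three_step = n_step_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
```

With n = 1 this reduces to the usual DQN target. Larger n relies more on observed rewards and less on the bootstrapped estimate, whose contribution shrinks by a factor of γ^n.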

When to Use Rainbow

Use Rainbow if:

•Need state-of-the-art Atari performance
•Have weeks of compute for tuning
•Paper requires it

Use Double + Dueling DQN if:

•Standard DQN training unstable
•Want good performance with less tuning
•Typical development

Use Basic DQN if:

•Learning the method
•Sample efficiency not critical
•Simple environments

Lesson: Understand components separately before combining.

code

Learning progression:
1. Q-learning (understand basics)
2. Basic DQN (add neural networks)
3. Double DQN (fix overestimation)
4. Dueling DQN (improve architecture)
5. Add prioritized replay (sample efficiency)
6. Rainbow (combine all)

Part 7: Common Bugs and Debugging

Bug #1: Training Divergence (Q-values explode)

Diagnosis Tree:

•Check target network:

python

# WRONG - bootstrap target comes from the network being trained
loss = (Q_main(s,a) - [r + γ max Q_main(s_next)])^2
# FIX - use separate target network
loss = (Q_main(s,a) - [r + γ max Q_target(s_next)])^2

•Check learning rate:

python

# WRONG - too high
optimizer = torch.optim.Adam(network.parameters(), lr=0.1)
# FIX - reduce learning rate
optimizer = torch.optim.Adam(network.parameters(), lr=0.0001)

•Check reward scale:

python

# WRONG - rewards too large
reward = 1000 * indicator  # Values explode
# FIX - normalize
reward = 10 * indicator
# Or: reward = (reward - reward_mean) / reward_std

•Check replay buffer:

python

# WRONG - too small
replay_buffer_size = 1000
# FIX - increase size
replay_buffer_size = 100_000

Bug #2: Poor Sample Efficiency (Slow Learning)

Diagnosis Tree:

•Check replay buffer size:

python

# Too small → high correlation
# Check the configured capacity, not the current fill level
if replay_buffer_size < 100_000:
    print("WARNING: Replay buffer capacity is small for Atari")

•Check target network update frequency:

python

# Too frequent → moving target
# Too infrequent → slow target adjustment
# Good: every 1000-5000 steps
if update_frequency > 10_000:
    print("Target updates too infrequent")

•Check batch size:

python

# Too small → noisy gradients
# Too large → slow training
# Good: 32-64
if batch_size < 16 or batch_size > 256:
    print("Consider adjusting batch size")

•Check epsilon decay:

python

# Decaying too fast → premature exploitation
# Decaying too slow → wastes steps exploring
# Typical: decay over 10% of total steps
if decay_steps < total_steps * 0.05:
    print("Epsilon decays too quickly")

Bug #3: Q-Values Too Optimistic (Learned Policy << Training Q)

Diagnosis:

Red Flag: Policy performance much worse than max Q-value during training.

python

# Symptom
max_q_value = 100.0
actual_episode_return = 5.0
# 20x gap suggests overestimation

# Solutions (try in order)
# 1. Use Double DQN (reduces overestimation)
# 2. Reduce learning rate (slower updates → less optimistic)
# 3. Update the target network less often (longer interval → more stable target)
# 4. Check reward function (might be wrong)

Bug #4: Frame Stacking Wrong

Symptoms:

•Very slow learning despite "correct" implementation
•Network can't learn velocity-dependent behaviors

Diagnosis:

python

# WRONG - single frame
state_shape = (84, 84, 3)  # One RGB frame
# Network sees only position, not velocity

# CORRECT - stack 4 grayscale frames (channels first, as nn.Conv2d expects)
state_shape = (4, 84, 84)
# Last 4 frames show motion

# Check frame stacking implementation
frame_stack = deque(maxlen=4)
for frame in frames:
    frame_stack.append(frame)
    state = np.stack(list(frame_stack))  # (4, 84, 84) once the deque is full

Bug #5: Network Architecture Mismatch

Symptoms:

•CNN on non-image input (or vice versa)
•Output layer wrong number of actions
•Input preprocessing wrong

Diagnosis:

python

# Image input → use CNN
if input_type == 'image':
    network = CNN(num_actions)

# Vector input → use FC
elif input_type == 'vector':
    network = FullyConnected(input_size, num_actions)

# Output layer MUST have num_actions outputs
assert network.output_size == num_actions

Part 8: Hyperparameter Tuning

Learning Rate

Too high (α > 0.001):

•Divergence, unstable training
•Q-values explode

Too low (α < 0.00001):

•Very slow learning
•May not converge in reasonable time

Start: α = 0.0001, adjust if needed

python

# Adaptive strategy
if max_q_value > 1000:
    print("Reduce learning rate")
    alpha = alpha / 2
if learning_curve_flat:
    print("Increase learning rate")
    alpha = alpha * 1.1

Replay Buffer Size

Too small (< 10k for Atari):

•High correlation in gradients
•Slow learning, poor sample efficiency

Too large (> 10M):

•Excessive memory
•Stale experiences dominate
•Diminishing returns

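Rule of thumb: at least 10 × episode length (more is usually better)

python

episode_length = 1000  # typical
min_buffer = 10 * episode_length  # 10_000: lower bound from the rule of thumb
buffer_size = 100_000  # common starting point, well above the lower bound

# Can increase if memory is available and learning is slow
if learning_slow:
    buffer_size = 500_000  # More diversity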

Epsilon Decay

Too fast (decay in 10k steps):

•Agent exploits before learning
•Suboptimal policy

Too slow (decay in 1M steps):

•Wasted exploration time
•Slow performance improvement

Rule: Decay over ~10% of total training steps

python

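total_steps = 1_000_000
epsilon_decay_steps = int(total_steps * 0.1)  # 100k steps
# Linear decay from epsilon_start (e.g., 1.0) to epsilon_min over epsilon_decay_steps
fraction = min(1.0, current_step / epsilon_decay_steps)
epsilon = epsilon_start + fraction * (epsilon_min - epsilon_start)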

Target Network Update Frequency

Too frequent (every 100 steps):

•Target still moves rapidly
•Less stabilization benefit

Too infrequent (every 100k steps):

•Network drifts far from target
•Large jumps in learning

Sweet spot: Every 1k-5k steps (1000 typical)

python

update_frequency = 1000  # steps between target updates
if update_frequency < 500:
    print("Target updates might be too frequent")
if update_frequency > 10_000:
    print("Target updates might be too infrequent")

Reward Scaling

No scaling (raw rewards vary wildly):

•Learning rate effects vary by task
•Convergence issues

Clipping (clip to {-1, 0, +1}):

•Good for Atari, loses information in custom envs

Normalization (zero-mean, unit variance):

•General solution
•Preserves reward differences

python

# Track running statistics
running_mean = 0.0
running_var = 1.0

def normalize_reward(reward):
    global running_mean, running_var
    running_mean = 0.99 * running_mean + 0.01 * reward
    running_var = 0.99 * running_var + 0.01 * (reward - running_mean)**2
    return (reward - running_mean) / np.sqrt(running_var + 1e-8)

Part 9: When to Use Each Method

DQN Selection Matrix

Situation	Method	Why
Learning method	Basic DQN	Understand target network, replay buffer
Medium task	Double DQN	Fix overestimation, minimal overhead
Complex task	Double + Dueling	Better architecture + bias reduction
Sample critical	Add Prioritized	Focus on important transitions
State-of-art	Rainbow	Best Atari performance
Simple Atari	DQN	Sufficient, faster to debug
Non-Atari discrete	DQN/Double	Adapt architecture to input type

Action Space Check

Before implementing DQN, ask:

python

if action_space == 'continuous':
    print("ERROR: Use actor-critic or policy gradient")
    print("Value methods only for discrete actions")
    redirect_to_actor_critic_methods()

elif action_space == 'discrete' and len(actions) <= 100:
    print("✓ DQN appropriate")

elif action_space == 'discrete' and len(actions) > 1000:
    print("⚠ Large action space, consider policy gradient")
    print("Or: hierarchical RL, action abstraction")

Part 10: Red Flags Checklist

When you see these, suspect bugs:

Part 11: Pitfall Rationalization

Rationalization	Reality	Counter-Guidance	Red Flag
"I'll skip target network, save memory"	Causes instability/divergence	Target network critical, minimal memory cost	"Target network optional"
"DQN works for continuous actions"	Breaks fundamental assumption (enumerate all actions)	Value methods discrete-only, use SAC/TD3 for continuous	Continuous action DQN attempt
"Uniform replay is fine"	Wastes learning on boring transitions	Prioritized replay better, but uniform adequate for many tasks	Always recommending prioritized
"I'll use tiny replay buffer, it's faster"	High correlation, poor learning	100k+ buffer typical, speed tradeoff acceptable	Buffer < 10k for Atari
"Frame stacking unnecessary, CNN sees motion"	A single frame violates the Markov property	Frame stacking required for velocity from pixels	Single frame policy
"Rainbow is just DQN + tricks"	Missing that components solve specific problems	Each component fixes identified issue (overestimation, architecture, sampling)	Jumping to Rainbow without understanding
"Clip rewards, I saw it in a paper"	Clips away important reward information	Only clip when reward sign matters more than size (Atari-style), normalize otherwise	Blind reward clipping
"Larger network will learn faster"	Overfitting, slower gradients, memory issues	Standard architecture (32-64-64 CNN) works, don't over-engineer	Unreasonably large networks
"Policy gradient would be simpler here"	For discrete actions, a value method is often the right choice	Know when each applies (discrete → value, continuous → policy)	Wrong method choice for action space
"Epsilon decay is a hyperparameter like any other"	Decay schedule should match task complexity	Tune decay to problem (game length), not arbitrary	Epsilon decay without reasoning

Part 12: Pressure Test Scenarios

Scenario 1: Continuous Action Space

User: "I have a robot with continuous action space (joint angles in ℝ^7). Can I use DQN?"

Wrong Response: "Sure, discretize the actions" (Combinatorial explosion, inefficient)

Correct Response: "No, value methods are discrete-only. Use actor-critic (SAC) or policy gradient (PPO); both handle continuous actions naturally. Discretizing each joint would blow up the action count (e.g., 10 values per joint = 10^7 joint actions)."
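
The combinatorial growth behind that answer is a one-line calculation. The helper below is a hypothetical illustration that assumes every joint is discretized independently into the same number of bins.

```python
def joint_action_count(bins_per_joint, num_joints):
    """Number of discrete actions when every joint is discretized independently."""
    return bins_per_joint ** num_joints


coarse = joint_action_count(bins_per_joint=10, num_joints=7)
fine = joint_action_count(bins_per_joint=100, num_joints=7)
```

Even the coarse grid needs ten million Q-value outputs per state, and refining the grid by 10× per joint multiplies that by another ten million.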

Scenario 2: Training Unstable

User: "My DQN is diverging immediately, loss explodes. Implementation looks right. What's wrong?"

Systematic Debug:

code

1. Check target network
   - Print: "Is target_network separate from main_network?"
   - Likely cause: updating together

2. Check learning rate
   - Print: "Learning rate = ?"
   - If > 0.001, reduce

3. Check reward scale
   - Print: "max(rewards) = ?"
   - If > 100, normalize

4. Check initial Q-values
   - Print: "mean(Q-values) = ?"
   - Should start near zero

Answer: Target network most likely culprit. Verify separate networks with proper update frequency.

Scenario 3: Rainbow vs Double DQN

User: "Should I implement Rainbow or just Double DQN? Is Rainbow worth the complexity?"

Guidance:

code

Double DQN:
+ Fixes overestimation bias
+ Simple to implement
+ Often sufficient for small and medium tasks
- Missing other optimizations

Rainbow:
+ Best Atari performance
+ State-of-the-art
- Complex (6 components)
- Harder to debug
- More hyperparameters

Recommendation:
Start: Double DQN
If unstable: Add Dueling
If slow: Add Prioritized
Only go to Rainbow: If need SotA and have time

Scenario 4: Frame Stacking Issue

User: "My agent trains on Atari but learning is slow. How many frames should I stack?"

Diagnosis:

python

# Check if frame stacking implemented
if state.shape != (4, 84, 84):
    print("ERROR: Not using frame stacking")
    print("Single frame (1, 84, 84) violates Markov property")
    print("Add frame stacking: stack last 4 frames")

# Frame count
# 4 frames: Standard (about 67 ms of raw 60 fps gameplay, longer with frame skipping)
# 3 frames: OK, slightly less velocity info
# 2 frames: Minimum, just barely Markovian
# 1 frame: WRONG, not Markovian
# 8+ frames: Too many, outdated states in stack

Scenario 5: Hyperparameter Tuning

User: "I've tuned learning rate, buffer size, epsilon. What else affects performance?"

Guidance:

code

Priority 1 (Critical):
- Target network update frequency (1000-5000 steps)
- Replay buffer size (100k+ typical)
- Frame stacking (4 frames)

Priority 2 (Important):
- Learning rate (0.0001-0.0005)
- Epsilon decay schedule (over ~10% of steps)
- Batch size (32-64)

Priority 3 (Nice to have):
- Network architecture (32-64-64 CNN standard)
- Reward normalization (helps but not required)
- Double/Dueling DQN (improvements, not essentials)

Start with Priority 1, only adjust Priority 2-3 if unstable.

Part 13: When to Route Elsewhere

Route to rl-foundations if

•User confused about Bellman equations
•Unclear on value function definition
•Needs theory behind Q-learning convergence

Route to actor-critic-methods if

•Continuous action space
•Need deterministic policy gradients
•Stochastic policy required

Route to policy-gradient-methods if

•Large discrete action space (> 1000 actions)
•Need policy regularization
•Exploration by stochasticity useful

Route to offline-rl-methods if

•No environment access (batch learning)
•Learning from logged data only

Route to rl-debugging if

•General training issues
•Need systematic debugging methodology
•Credit assignment problems

Route to reward-shaping if

•Sparse rewards
•Reward design affecting learning
•Potential-based shaping questions

Summary

You now understand:

•Q-Learning: TD learning for action values, off-policy convergence guarantee
•DQN: Add neural networks + experience replay + target network for stability
•Stability Mechanisms:
- Replay buffer: Break correlation
- Target network: Prevent moving target problem
•Common Variants:
- Double DQN: Fix overestimation bias
- Dueling DQN: Separate value and advantage
- Prioritized Replay: Focus on important transitions
- Rainbow: Combine improvements
•When to Use: Discrete action spaces only, not continuous
•Common Bugs: Divergence, poor efficiency, overoptimism, frame issues
•Hyperparameter Tuning: Buffer size, learning rate, epsilon decay, target frequency
•Debugging Strategy: Systematic diagnosis (target network → learning rate → reward scale)

Key Takeaways:

•Value methods are for discrete actions ONLY
•DQN requires target network and experience replay
•Frame stacking needed for video inputs (Markov property)
•Double DQN fixes overestimation, use it
•Start simple, add Dueling/Prioritized only if needed
•Systematic debugging beats random tuning

Next: Implement on simple environment first (CartPole or small custom task), then scale to Atari.