To avoid the need to constantly refer to the original notebooks, here is a cheat sheet summarizing the most important concepts and algorithms in reinforcement learning (RL).
- Core Concepts
- Basic / Tabular Methods
- Policy Gradient Methods
- Actor-Critic Methods
- Value-Based Deep Methods
- Multi-Agent RL (MARL)
- Hierarchical RL (HRL)
- Planning & Model-Based Methods
The fundamental interaction cycle in RL:
- Agent observes state $s_t$.
- Agent selects action $a_t$ based on policy $\pi(a_t|s_t)$.
- Environment transitions to next state $s_{t+1}$.
- Environment provides reward $r_t$.
- Agent updates policy/values based on $(s_t, a_t, r_t, s_{t+1})$.
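A minimal sketch of this loop, assuming a Gymnasium-style environment; the `RandomAgent` is a hypothetical placeholder standing in for any of the algorithms below:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

class RandomAgent:
    """Placeholder agent: random actions, no learning."""
    def select_action(self, state):
        return env.action_space.sample()
    def update(self, state, action, reward, next_state):
        pass  # a real agent would update its policy/values here

agent = RandomAgent()

for episode in range(5):
    state, _ = env.reset()
    done = False
    while not done:
        action = agent.select_action(state)                      # a_t ~ pi(.|s_t)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.update(state, action, reward, next_state)          # learn from (s_t, a_t, r_t, s_{t+1})
        state = next_state
env.close()
```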
Formal framework for RL problems, defined by:
- $S$: Set of states.
- $A$: Set of actions.
- $P(s'|s, a)$: Transition probability function.
- $R(s, a, s')$: Reward function.
- $\gamma$: Discount factor ($0 \le \gamma \le 1$).
- State-Value Function ($V^\pi(s)$): Expected return starting from state $s$ and following policy $\pi$.
  $$V^\pi(s) = \mathbb{E}_\pi [ G_t \mid S_t=s ] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid S_t=s \right]$$
- Action-Value Function ($Q^\pi(s, a)$): Expected return starting from state $s$, taking action $a$, and following policy $\pi$.
  $$Q^\pi(s, a) = \mathbb{E}_\pi [ G_t \mid S_t=s, A_t=a ] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid S_t=s, A_t=a \right]$$
- Bellman Expectation Equation for $V^\pi$:
  $$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} P(s', r \mid s, a) [r + \gamma V^\pi(s')]$$
- Bellman Optimality Equation for $Q^*$ (used by Q-Learning):
  $$Q^*(s, a) = \sum_{s', r} P(s', r \mid s, a) [r + \gamma \max_{a'} Q^*(s', a')]$$
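As a worked illustration of the Bellman optimality backup, here is a minimal value-iteration sketch over a tiny hypothetical MDP; the transition table `P` and its Gym-style `(prob, next_state, reward, done)` format are illustrative assumptions:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s][a] = [(probability, next_state, reward, done), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

V = np.zeros(n_states)
for _ in range(1000):  # value iteration: repeated Bellman optimality backups
    V_new = np.empty_like(V)
    for s in range(n_states):
        q_values = [
            sum(p * (r + gamma * (0.0 if done else V[s_next]))
                for p, s_next, r, done in P[s][a])
            for a in range(n_actions)
        ]
        V_new[s] = max(q_values)  # V*(s) = max_a Q*(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new
```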
- Exploration: Trying new actions to discover better rewards.
- Exploitation: Choosing the action currently known to yield the best expected reward.
- $\epsilon$-Greedy: Common strategy: with probability $\epsilon$, explore (random action); with probability $1-\epsilon$, exploit (greedy action). $\epsilon$ often decays over time (see the sketch below).
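A minimal $\epsilon$-greedy sketch over a tabular `q_table`; the table shape and decay schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))

def epsilon_greedy(state, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (greedy, ties broken randomly)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    q_values = q_table[state]
    return int(rng.choice(np.flatnonzero(q_values == q_values.max())))

# Typical multiplicative decay, applied once per episode
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995
epsilon = max(epsilon_min, epsilon * epsilon_decay)
```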
- Core Idea: Demonstrates the basic agent-environment loop. Agent remembers immediate rewards for state-action pairs and uses a simple epsilon-greedy policy based on average immediate rewards. Does not perform true RL value learning.
- Mathematical Formulation: No Bellman updates. Policy based on:
  $$\text{AvgR}(s, a) = \frac{\sum \text{rewards observed after } (s, a)}{\text{count of } (s, a)}$$
- Pseudocode:
  - Initialize memory `mem[s][a] -> [rewards]`.
  - For each episode:
    - Reset env to get `s`.
    - For each step:
      - Choose `a` using $\epsilon$-greedy on `AvgR(s, a)`.
      - Take action `a`, get `r`, `s'`.
      - Store `r` in `mem[s][a]`.
      - `s = s'`
- Code Snippet:

  ```python
  # Choosing action based on average immediate reward
  avg_rewards = []
  for a in range(n_actions):
      rewards = memory[state][a]
      avg_rewards.append(np.mean(rewards) if rewards else 0)
  best_action = np.random.choice(np.where(avg_rewards == np.max(avg_rewards))[0])

  # Updating memory
  memory[state][action].append(reward)
  ```
- Key Hyperparameters: `epsilon`, `epsilon_decay`.
- Pros: Simple illustration of interaction loop and memory.
- Cons: Does not learn long-term values, only immediate rewards. Not true RL. Inefficient memory.
- Use Cases: Educational demonstration of basic agent structure.
- Core Idea: Learns the optimal action-value function ($Q^*$) off-policy using Temporal Difference (TD) updates.
- Mathematical Formulation: Bellman Optimality update:
  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]$$
  - $\alpha$: Learning rate
  - $\gamma$: Discount factor
  - $\max_{a'} Q(s_{t+1}, a')$: Max Q-value in next state (greedy estimate of future value)
- Pseudocode:
  - Initialize Q-table `Q(s, a)` to zeros.
  - For each episode:
    - Initialize state `s`.
    - For each step:
      - Choose action `a` from `s` using policy derived from Q (e.g., $\epsilon$-greedy).
      - Take action `a`, observe `r`, `s'`.
      - Update `Q(s, a)` using the Q-learning rule.
      - `s = s'`
- Code Snippet:

  ```python
  # Q-Learning update
  current_q = q_table[state][action]
  max_next_q = max(q_table[next_state].values()) if next_state in q_table else 0.0
  td_target = reward + gamma * max_next_q
  td_error = td_target - current_q
  q_table[state][action] += alpha * td_error
  ```
- Key Hyperparameters: `alpha` (learning rate), `gamma` (discount factor), `epsilon` (exploration rate), `epsilon_decay`.
- Pros: Off-policy (can learn optimal policy while exploring), simple concept, guaranteed convergence under conditions.
- Cons: Tabular form doesn't scale to large state spaces, can suffer from maximization bias (addressed by Double Q-learning).
- Common Pitfalls: Tuning $\alpha$ and $\epsilon$. Ensuring sufficient exploration.
- Use Cases: Small, discrete state/action spaces, foundational understanding.
- Core Idea: Learns the action-value function ($Q^\pi$) for the policy currently being followed (on-policy) using TD updates.
- Mathematical Formulation: Update uses the next action chosen by the policy:
  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$$
  - $a_{t+1}$ is the action chosen in state $s_{t+1}$ by the current policy (e.g., $\epsilon$-greedy).
- Pseudocode:
  - Initialize Q-table `Q(s, a)`.
  - For each episode:
    - Initialize `s`.
    - Choose `a` from `s` using policy derived from Q (e.g., $\epsilon$-greedy).
    - For each step:
      - Take action `a`, observe `r`, `s'`.
      - Choose next action `a'` from `s'` using policy derived from Q.
      - Update `Q(s, a)` using `r`, `s'`, `a'`.
      - `s = s'`, `a = a'`
- Code Snippet:

  ```python
  # SARSA update
  current_q = q_table[state][action]
  next_q = q_table[next_state][next_action]  # Q-value of the *next* action taken
  td_target = reward + gamma * next_q
  td_error = td_target - current_q
  q_table[state][action] += alpha * td_error
  ```
- Key Hyperparameters: `alpha`, `gamma`, `epsilon`, `epsilon_decay`.
- Pros: On-policy (learns value of the exploration policy), often more stable/conservative in risky environments than Q-learning.
- Cons: Tabular, can be slower to converge to optimal if exploration persists, sensitive to policy changes.
- Common Pitfalls: Ensuring the next action `a'` is chosen correctly before the update.
- Use Cases: When evaluating the current policy is important, safer exploration needed.
- Core Idea: Like SARSA, but updates using the expected value over next actions, weighted by policy probabilities, reducing variance. Still on-policy.
- Mathematical Formulation:
  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \mathbb{E}_{\pi}[Q(s_{t+1}, A')] - Q(s_t, a_t)]$$
  $$\mathbb{E}_{\pi}[Q(s', A')] = \sum_{a'} \pi(a'|s') Q(s', a')$$
  For $\epsilon$-greedy: $\mathbb{E}_{\pi}[Q(s', A')] = (1 - \epsilon) \max_{a''} Q(s', a'') + \epsilon \frac{\sum_{a'} Q(s', a')}{|\mathcal{A}|}$
- Pseudocode:
  - Initialize Q-table `Q(s, a)`.
  - For each episode:
    - Initialize `s`.
    - For each step:
      - Choose `a` from `s` using policy derived from Q (e.g., $\epsilon$-greedy).
      - Take action `a`, observe `r`, `s'`.
      - Calculate expected Q-value $E[Q(s', A')]$ based on policy $\pi$ in state `s'`.
      - Update `Q(s, a)` using `r` and $E[Q(s', A')]$.
      - `s = s'`
- Code Snippet:

  ```python
  # Expected SARSA update (assuming epsilon-greedy)
  current_q = q_table[state][action]
  if next_state in q_table and q_table[next_state]:
      q_values_next = q_table[next_state]
      max_q_next = max(q_values_next.values())
      num_actions = len(action_space)
      expected_q_next = ((1.0 - epsilon) * max_q_next
                         + (epsilon / num_actions) * sum(q_values_next.values()))
  else:
      expected_q_next = 0.0
  td_target = reward + gamma * expected_q_next
  td_error = td_target - current_q
  q_table[state][action] += alpha * td_error
  ```
- Key Hyperparameters: `alpha`, `gamma`, `epsilon`, `epsilon_decay`.
- Pros: On-policy, lower variance than SARSA, often more stable, same computational cost as Q-learning per update.
- Cons: Tabular, slightly more complex update calculation than SARSA.
- Use Cases: Where SARSA is applicable but stability/variance is an issue.
- Core Idea: Integrates model-free learning (Q-learning) with model-based planning. Learns a model of the environment from real experience and uses it to perform extra "planning" updates on the Q-table using simulated experience.
- Mathematical Formulation:
  - Direct RL: Standard Q-learning update on real transition $(s, a, r, s')$.
  - Model Learning: $Model(s, a) \leftarrow (r, s')$ (for deterministic env).
  - Planning: For $k$ steps: sample $(s_p, a_p)$ from previously experienced pairs, get $(r_p, s'_p) = Model(s_p, a_p)$, and apply a Q-learning update to $Q(s_p, a_p)$ using $(s_p, a_p, r_p, s'_p)$.
- Pseudocode:
  - Initialize `Q(s, a)` and `Model(s, a)`.
  - For each episode:
    - Initialize `s`.
    - For each step:
      - Choose `a` using policy based on Q.
      - Take action `a`, observe `r`, `s'`.
      - Update `Q(s, a)` with $(s, a, r, s')$ (Direct RL).
      - Update `Model(s, a)` with $(r, s')$.
      - Repeat `k` times (Planning):
        - Sample previously seen $(s_p, a_p)$.
        - Get $(r_p, s'_p)$ from `Model(s_p, a_p)`.
        - Update `Q(s_p, a_p)` with $(s_p, a_p, r_p, s'_p)$.
      - `s = s'`
- Code Snippet: (the `q_learning_update` helper is sketched after this section)

  ```python
  # Direct RL Update (same as Q-Learning)
  # ... q_learning_update(q_table, state, action, reward, next_state, ...)

  # Model Update
  model[(state, action)] = (reward, next_state)
  if (state, action) not in observed_pairs:
      observed_pairs.append((state, action))

  # Planning Step
  for _ in range(planning_steps_k):
      if not observed_pairs:
          break
      s_p, a_p = random.choice(observed_pairs)
      r_p, s_prime_p = model[(s_p, a_p)]
      q_learning_update(q_table, s_p, a_p, r_p, s_prime_p, ...)
  ```
- Key Hyperparameters: `alpha`, `gamma`, `epsilon`, `k` (number of planning steps).
- Pros: Improves sample efficiency compared to pure Q-learning by reusing experience via the model. Simple integration of learning and planning.
- Cons: Tabular. Effectiveness depends heavily on model accuracy. Assumes deterministic model in simple form.
- Common Pitfalls: Poor model accuracy can lead to a suboptimal policy. Choosing `k`.
- Use Cases: Environments where interaction is costly but computation is cheap. Simple planning tasks.
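A minimal sketch of a `q_learning_update` helper as referenced in the Dyna-Q snippet above, assuming `q_table` is a nested dict of floats (`q_table[state][action]`); the exact signature in the notebook may differ:

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    current_q = q_table[state][action]
    next_qs = q_table[next_state]
    max_next_q = max(next_qs.values()) if next_qs else 0.0
    td_target = reward + gamma * max_next_q
    q_table[state][action] = current_q + alpha * (td_target - current_q)

# Example table where q_table[state][action] defaults to 0.0
q_table = defaultdict(lambda: defaultdict(float))
```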
- Core Idea: Directly learns a parameterized policy $\pi(a|s; \theta)$ by increasing the probability of actions that led to high cumulative episode returns ($G_t$). On-policy, Monte Carlo.
- Mathematical Formulation: Updates policy parameters $\theta$ via gradient ascent on $J(\theta) = \mathbb{E}[G_t]$.
  $$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \nabla_\theta \log \pi(a_t | s_t; \theta)$$
  Loss function (for minimization): $L(\theta) = -\sum_{t=0}^{T-1} G_t \log \pi(a_t | s_t; \theta)$
  - $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$: Discounted return from step $t$.
- Pseudocode:
  - Initialize policy network $\pi(a|s; \theta)$.
  - For each episode:
    - Generate trajectory $(s_0, a_0, r_1, ...)$ by sampling actions $a_t \sim \pi(a|s_t; \theta)$. Store log probs $\log \pi(a_t|s_t; \theta)$ and rewards $r_{t+1}$.
    - Calculate discounted returns $G_t$ for all steps $t$.
    - Compute loss $L$.
    - Update $\theta$ using gradient descent on $L$.
- Code Snippet: (a sketch of the policy network that produces `episode_log_probs` follows this section)

  ```python
  # Calculate returns (backward loop)
  returns = []
  G = 0.0
  for r in reversed(episode_rewards):
      G = r + gamma * G
      returns.insert(0, G)
  returns = torch.tensor(returns)

  # Standardize returns (optional but recommended)
  returns = (returns - returns.mean()) / (returns.std() + 1e-8)

  # Calculate loss
  log_probs_tensor = torch.stack(episode_log_probs)
  loss = -torch.sum(returns * log_probs_tensor)

  # Update policy
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  ```
- Key Hyperparameters: `learning_rate`, `gamma`. Network architecture.
- Pros: Simple policy gradient concept, works with discrete/continuous actions, learns stochastic policies.
- Cons: High variance due to Monte Carlo returns, episodic updates (waits until episode end), on-policy sample inefficiency.
- Common Pitfalls: High variance leading to unstable training, requires careful learning rate tuning.
- Use Cases: Simple benchmarks, conceptual understanding, basis for actor-critic.
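A minimal sketch of a categorical policy network and the sampling step that fills `episode_log_probs` in the snippet above; the architecture and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return Categorical(logits=self.net(obs))  # discrete action distribution

policy = PolicyNet(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# During an episode: sample an action and keep its log prob for the loss
obs = torch.zeros(4)                           # placeholder observation
dist = policy(obs)
action = dist.sample()
episode_log_probs = [dist.log_prob(action)]    # appended at every step
```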
- Core Idea: Improves policy gradient updates by constraining the change in the policy (measured by KL divergence) at each step, ensuring more stable and monotonic improvement. On-policy.
- Mathematical Formulation: Solves a constrained optimization problem (approximately):
  $$\max_{\theta} \quad \mathbb{E}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right]$$
  $$\text{s.t.} \quad \mathbb{E}_t \left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t) \,\|\, \pi_{\theta}(\cdot|s_t)\big) \right] \le \delta$$
  Solved using Conjugate Gradient (to find direction $\approx F^{-1}g$) and Line Search (to satisfy the constraint). $F$ is the Fisher Information Matrix.
- Pseudocode:
  - Initialize actor $\pi_\theta$, critic $V_\phi$.
  - For each iteration:
    - Collect trajectories using $\pi_{\theta_{old}}$. Store states, actions, rewards, log probs.
    - Compute advantages $\hat{A}_t$ (using GAE with $V_\phi$).
    - Compute policy gradient $g$.
    - Use Conjugate Gradient + Fisher-Vector Products to find step direction $s \approx F^{-1}g$.
    - Perform line search to find step size $\beta \alpha$ satisfying KL constraint $\delta$ and improving the surrogate objective.
    - Update actor: $\theta_{new} \leftarrow \theta_{old} + \beta \alpha s$.
    - Update critic $V_\phi$ using collected data.
- Code Snippet: (conceptual update call; a minimal conjugate-gradient sketch follows this section)

  ```python
  # Conceptual TRPO update call
  policy_gradient = calculate_policy_gradient(...)
  step_direction = conjugate_gradient(fisher_vector_product_func, policy_gradient, ...)
  initial_step_size = calculate_initial_step_size(step_direction, policy_gradient, max_kl, ...)
  final_update, success = backtracking_line_search(actor, ..., step_direction, initial_step_size, max_kl, ...)
  if success:
      apply_update(actor, final_update)
  update_critic(...)
  ```
- Key Hyperparameters: `delta` (KL constraint), `gamma`, `lambda` (GAE), CG iterations, CG damping, line search parameters.
- Pros: Provides theoretical monotonic improvement guarantee (under approximations), very stable updates.
- Cons: Complex implementation (FVP, CG, line search), computationally expensive per update, on-policy.
- Common Pitfalls: Implementing FVP and CG correctly, tuning the trust region $\delta$.
- Use Cases: Continuous control, situations requiring high stability, benchmark for simpler algorithms like PPO.
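A minimal conjugate-gradient solver of the kind `conjugate_gradient` in the sketch above might use: it approximately solves $Fx = g$ given only a Fisher-vector-product function `fvp(v)`; the function name and damping handling are assumptions, not the notebook's exact code:

```python
import torch

def conjugate_gradient(fvp, g, n_iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = torch.zeros_like(g)
    r = g.clone()          # residual r = g - F x (x starts at 0)
    p = g.clone()          # search direction
    rs_old = torch.dot(r, r)
    for _ in range(n_iters):
        Fp = fvp(p)
        alpha = rs_old / (torch.dot(p, Fp) + 1e-8)
        x += alpha * p
        r -= alpha * Fp
        rs_new = torch.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```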
- Core Idea: A synchronous, simpler version of A3C. Uses an actor (policy) and a critic (value function) trained on batches of experience collected by the actor. Reduces variance compared to REINFORCE by using advantage estimates. On-policy.
- Mathematical Formulation:
  - Actor Loss (minimize): $L_{actor} = - \mathbb{E}_t [ \log \pi(a_t | s_t; \theta) \hat{A}_t^{\text{detached}} + c_e H(\pi) ]$
  - Critic Loss (minimize): $L_{critic} = \mathbb{E}_t [ (R_t - V(s_t; \phi))^2 ]$
  - $\hat{A}_t = R_t - V(s_t; \phi)$: Advantage estimate (often using n-step returns or GAE for $R_t$).
- Pseudocode:
  - Initialize shared actor $\pi_\theta$ and critic $V_\phi$.
  - Loop for iterations:
    - Collect a batch of N steps of experience $(s_t, a_t, r_{t+1}, s_{t+1}, d_t)$ using $\pi_\theta$.
    - Compute n-step returns $R_t$ and advantages $\hat{A}_t$ using $V_\phi$.
    - Compute actor loss (policy gradient + entropy) and critic loss (MSE).
    - Compute gradients for actor and critic based on the batch.
    - Apply a synchronous gradient update to $\theta$ and $\phi$.
- Code Snippet: (a GAE sketch for `compute_gae_and_returns` follows this section)

  ```python
  # Calculate Advantage and Returns (e.g., using GAE)
  advantages, returns_to_go = compute_gae_and_returns(...)

  # Evaluate current policy and value
  policy_dist = actor(states)
  log_probs = policy_dist.log_prob(actions)
  entropy = policy_dist.entropy().mean()
  values_pred = critic(states).squeeze()

  # Losses
  policy_loss = -(log_probs * advantages.detach()).mean() - entropy_coeff * entropy
  value_loss = F.mse_loss(values_pred, returns_to_go.detach())

  # Optimize Actor
  actor_optimizer.zero_grad()
  policy_loss.backward()
  actor_optimizer.step()

  # Optimize Critic
  critic_optimizer.zero_grad()
  (value_loss_coeff * value_loss).backward()
  critic_optimizer.step()
  ```
- Key Hyperparameters: `learning_rates` (actor/critic), `gamma`, `lambda` (GAE), `n_steps` (rollout length), `value_loss_coeff`, `entropy_coeff`.
- Pros: More stable than REINFORCE, simpler than A3C/TRPO/PPO, good baseline, utilizes GPUs well.
- Cons: On-policy (sample inefficient), updates can still have variance, performance sometimes lower than PPO.
- Common Pitfalls: Balancing actor/critic learning rates, choosing `n_steps`.
- Use Cases: Discrete/continuous control benchmarks, simpler alternative to A3C/PPO.
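A minimal sketch of a `compute_gae_and_returns` helper as used in the A2C (and PPO) snippets; it implements standard GAE($\lambda$), but the exact signature is an assumption:

```python
import torch

def compute_gae_and_returns(rewards, values, dones, last_value,
                            gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones: 1-D tensors of length T (values = V(s_t));
    last_value: V(s_T) used to bootstrap the final step.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns_to_go = advantages + values
    return advantages, returns_to_go
```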
- Core Idea (see `9_a3c.ipynb` & `a3c_training.py`): Uses multiple parallel workers, each with a local copy of the actor-critic network and an environment instance. Workers compute gradients locally based on n-step returns and asynchronously update a shared global network. On-policy.
- Mathematical Formulation: Same loss function as A2C (per worker), but updates are applied asynchronously to global parameters $\theta_{global}, \phi_{global}$ using gradients computed from local parameters $\theta', \phi'$.
- Pseudocode (Worker):
  - Initialize local network, sync with global.
  - Loop:
    - Reset local gradients. Sync with global.
    - Collect n steps of experience using the local policy.
    - Calculate n-step returns $R_t$ and advantages $\hat{A}_t$.
    - Compute gradients for actor and critic losses based on the n-step experience.
    - Apply gradients asynchronously to the global network using a shared optimizer.
    - If episode done, reset environment.
- Code Snippet: (conceptual; see `a3c_training.py`)

  ```python
  # Inside worker loop
  local_model.load_state_dict(global_model.state_dict())
  # ... collect n-steps data ...
  returns, advantages = compute_n_step_returns_advantages(...)

  policy_loss = -(log_probs * advantages.detach()).mean() - entropy_coeff * entropy
  value_loss = F.mse_loss(values_pred, returns.detach())
  total_loss = policy_loss + value_loss_coeff * value_loss

  global_optimizer.zero_grad()
  total_loss.backward()  # Calculates grad on local model

  # Transfer gradients to global model
  for local_param, global_param in zip(local_model.parameters(), global_model.parameters()):
      if global_param.grad is not None:
          global_param.grad.data.zero_()  # Optional safety zero
      if local_param.grad is not None:
          global_param.grad = local_param.grad.clone()

  global_optimizer.step()  # Updates global model
  ```
- Key Hyperparameters: `num_workers`, `n_steps`, learning rates, `gamma`, coefficients $c_v, c_e$. Optimizer details (e.g., shared Adam).
- Pros: No replay buffer needed, decorrelates data via parallelism, efficient on multi-core CPUs.
- Cons: Complex implementation (multiprocessing, shared memory, async updates), potential for stale gradients, often less GPU-efficient than A2C.
- Common Pitfalls: Race conditions with shared optimizer/gradients, worker synchronization.
- Use Cases: Historically significant for Atari/continuous control, CPU-based parallel training.
- Core Idea: An off-policy actor-critic algorithm primarily for continuous action spaces. Learns a deterministic policy (actor) alongside a Q-function (critic). Uses ideas from DQN (replay buffer, target networks) for stability.
- Mathematical Formulation:
  - Critic Update (minimize loss): $L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} [ (y - Q(s, a; \phi))^2 ]$ where $y = r + \gamma Q'(s', \mu'(s'; \theta'); \phi')$.
  - Actor Update (maximize objective via gradient ascent, often minimize negative): $L(\theta) = - \mathbb{E}_{s \sim \mathcal{D}} [ Q(s, \mu(s; \theta); \phi) ]$.
  - $\mu, Q$: Main networks; $\mu', Q'$: Target networks.
- Pseudocode:
  - Initialize actor $\mu_\theta$, critic $Q_\phi$, target networks $\mu'_{\theta'}, Q'_{\phi'}$, replay buffer $\mathcal{D}$.
  - For each step:
    - Select action $a = \mu(s; \theta) + \text{Noise}$.
    - Execute $a$, get $r, s'$. Store $(s, a, r, s')$ in $\mathcal{D}$.
    - Sample a mini-batch from $\mathcal{D}$.
    - Update critic $Q_\phi$ using the TD error derived from the target networks.
    - Update actor $\mu_\theta$ using the gradient from the critic's output $Q(s, \mu(s))$.
    - Soft-update target networks: $\theta' \leftarrow \tau \theta + (1-\tau)\theta'$, $\phi' \leftarrow \tau \phi + (1-\tau)\phi'$.
- Code Snippet: (a `soft_update` sketch follows this section)

  ```python
  # Critic Update
  with torch.no_grad():
      next_actions = target_actor(next_state_batch)
      target_q = target_critic(next_state_batch, next_actions)
      y = reward_batch + gamma * (1 - done_batch) * target_q
  current_q = critic(state_batch, action_batch)
  critic_loss = F.mse_loss(current_q, y)
  critic_optimizer.zero_grad()
  critic_loss.backward()
  critic_optimizer.step()

  # Actor Update
  actor_actions = actor(state_batch)
  q_for_actor = critic(state_batch, actor_actions)  # No detach! Grad flows from critic
  actor_loss = -q_for_actor.mean()
  actor_optimizer.zero_grad()
  actor_loss.backward()
  actor_optimizer.step()

  # Soft Updates
  soft_update(target_critic, critic, tau)
  soft_update(target_actor, actor, tau)
  ```
- Key Hyperparameters: `buffer_size`, `batch_size`, `gamma`, `tau` (soft update rate), actor/critic learning rates, exploration noise parameters.
- Pros: Off-policy sample efficiency, handles continuous actions directly.
- Cons: Sensitive to hyperparameters, can suffer from Q-value overestimation, exploration can be tricky.
- Common Pitfalls: Learning rates, noise scale/decay, target update rate $\tau$.
- Use Cases: Continuous control (robotics, physics simulation).
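A minimal `soft_update` helper of the kind used in the DDPG (and SAC) snippets; it implements the Polyak update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, though the exact signature in the notebooks may differ:

```python
import torch

@torch.no_grad()
def soft_update(target_net, source_net, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    for target_param, source_param in zip(target_net.parameters(), source_net.parameters()):
        target_param.data.mul_(1.0 - tau)
        target_param.data.add_(tau * source_param.data)
```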
- Core Idea: An off-policy actor-critic algorithm for continuous actions based on the maximum entropy framework. Learns a stochastic policy that maximizes both expected return and policy entropy, leading to improved exploration and robustness.
- Mathematical Formulation: Objective includes an entropy term: $J(\pi) = \mathbb{E}_{\tau \sim \pi} [\sum_t \gamma^t (R_t + \alpha H(\pi(\cdot|s_t)))]$. Uses twin Q-critics, target critics, and often auto-tunes the entropy coefficient $\alpha$.
  - Critic Update (minimize loss for $Q_1, Q_2$): $L(\phi_i) = \mathbb{E} [ (Q_i(s,a) - y)^2 ]$ where $y = r + \gamma (1-d) [\min_{j=1,2} Q'_j(s', a') - \alpha \log \pi(a'|s')]$, $a' \sim \pi(\cdot|s')$.
  - Actor Update (minimize loss): $L(\theta) = \mathbb{E}_{s, a \sim \pi} [ \alpha \log \pi(a|s) - \min_{j=1,2} Q_j(s, a) ]$.
  - Alpha Update (minimize loss): $L(\log \alpha) = \mathbb{E}_{a \sim \pi} [ -\log \alpha (\log \pi(a|s) + \bar{H}) ]$ (where $\bar{H}$ is the target entropy).
- Pseudocode:
  - Initialize actor $\pi_\theta$, twin critics $Q_{\phi_1}, Q_{\phi_2}$, target critics $Q'_{\phi'_1}, Q'_{\phi'_2}$, replay buffer $\mathcal{D}$, $\log \alpha$.
  - For each step:
    - Select action $a \sim \pi(\cdot|s; \theta)$ (sampling).
    - Execute $a$, get $r, s'$. Store $(s, a, r, s')$ in $\mathcal{D}$.
    - Sample a mini-batch from $\mathcal{D}$.
    - Update critics $Q_{\phi_1}, Q_{\phi_2}$ using the soft TD target (min of target Q's minus scaled log prob).
    - Update actor $\pi_\theta$ using the gradient based on min Q and log prob.
    - Update $\alpha$ (if auto-tuning) based on policy entropy.
    - Soft-update target critics.
- Code Snippet: (a tanh-squashed Gaussian actor sketch, including the log-prob correction, follows this section)

  ```python
  # Critic Target Calculation
  with torch.no_grad():
      next_action, next_log_prob = actor(next_state_batch)
      q1_target_next, q2_target_next = target_critic(next_state_batch, next_action)
      q_target_next = torch.min(q1_target_next, q2_target_next)
      alpha = torch.exp(log_alpha).detach()
      soft_target = q_target_next - alpha * next_log_prob
      y = reward_batch + gamma * (1.0 - done_batch) * soft_target
  # ... Critic Update (MSE loss) ...

  # Actor Update
  pi_action, pi_log_prob = actor(state_batch)
  q1_pi, q2_pi = critic(state_batch, pi_action)  # Grads enabled for critic here
  min_q_pi = torch.min(q1_pi, q2_pi)
  actor_loss = (alpha * pi_log_prob - min_q_pi).mean()
  # ... Actor Optimizer Step ...

  # Alpha Update
  alpha_loss = -(log_alpha * (pi_log_prob.detach() + target_entropy)).mean()
  # ... Alpha Optimizer Step ...

  # ... Soft Updates ...
  ```
- Key Hyperparameters: `buffer_size`, `batch_size`, `gamma`, `tau`, learning rates (actor, critic, alpha), initial `alpha`, `target_entropy` (if auto-tuning).
- Pros: State-of-the-art sample efficiency and performance on continuous control, robust, good exploration.
- Cons: More complex than DDPG/PPO, requires careful implementation (especially squashing correction).
- Common Pitfalls: Correct log prob calculation (tanh squashing correction), alpha tuning stability, target entropy choice.
- Use Cases: Continuous control (robotics, benchmarks), situations needing robust exploration.
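A minimal sketch of the tanh-squashed Gaussian sampling mentioned in the pitfalls above, with the change-of-variables log-prob correction; the network layout and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SquashedGaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                      # reparameterized sample (pre-squash)
        action = torch.tanh(u)                  # squash into [-1, 1]
        # log pi(a|s) = log N(u) - sum log(1 - tanh(u)^2), with a numerical epsilon
        log_prob = dist.log_prob(u) - torch.log(1.0 - action.pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)
```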
- Core Idea: An on-policy actor-critic method that simplifies TRPO's constrained update using a clipped surrogate objective. Allows multiple epochs of updates on collected data for better sample efficiency.
- Mathematical Formulation:
  - Ratio: $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$
  - Clipped Objective (minimize negative): $L^{CLIP}(\theta) = -\mathbb{E}_t [ \min( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t ) ]$
  - Often includes value loss $L^{VF}$ and entropy bonus $S$: $L = L^{CLIP} + c_1 L^{VF} - c_2 S$.
- Pseudocode:
  - Initialize actor $\pi_\theta$, critic $V_\phi$.
  - For each iteration:
    - Collect a batch of trajectories using $\pi_{\theta_{old}}$. Store states, actions, rewards, dones, old log probs.
    - Compute advantages $\hat{A}_t$ (GAE) and returns $R_t$.
    - For K epochs:
      - For each mini-batch in the collected data:
        - Calculate policy ratio $r_t(\theta)$.
        - Compute clipped surrogate loss $L^{CLIP}$.
        - Compute value loss $L^{VF}$.
        - Compute entropy bonus $S$.
        - Compute combined loss $L$.
        - Update $\theta$ and $\phi$ using gradient descent on $L$.
- Code Snippet:

  ```python
  # Inside PPO update loop (for one epoch/minibatch)
  policy_dist = actor(states)
  log_probs_new = policy_dist.log_prob(actions)
  entropy = policy_dist.entropy().mean()
  values_pred = critic(states).squeeze()

  # Calculate ratio
  ratio = torch.exp(log_probs_new - log_probs_old)

  # Calculate policy loss
  surr1 = ratio * advantages
  surr2 = torch.clamp(ratio, 1.0 - ppo_clip_epsilon, 1.0 + ppo_clip_epsilon) * advantages
  policy_loss = -torch.min(surr1, surr2).mean() - entropy_coeff * entropy

  # Calculate value loss
  value_loss = F.mse_loss(values_pred, returns_to_go)

  # Update networks (typically combined loss or separate updates)
  # ... optimizer steps ...
  ```
- Key Hyperparameters: `clip_epsilon`, `gamma`, `lambda` (GAE), learning rates, `num_epochs`, `mini_batch_size`, `value_loss_coeff`, `entropy_coeff`.
- Pros: Simpler than TRPO, stable updates, good performance (often SOTA or near-SOTA), relatively sample efficient for an on-policy method.
- Cons: Still on-policy (less efficient than off-policy), performance sensitive to implementation details and hyperparameters.
- Common Pitfalls: Advantage/observation normalization, learning rate schedule, choice of $\epsilon$.
- Use Cases: Default choice for many discrete/continuous control tasks, RLHF for LLMs.
- Core Idea: Combines Q-learning with a deep neural network to approximate $Q(s, a; \theta)$. Uses Experience Replay and Target Networks for stability. Off-policy.
- Mathematical Formulation: Minimizes the TD error using target network $Q'$:
  $$L(\theta) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}} [ (y - Q(s, a; \theta))^2 ]$$
  $$y = r + \gamma (1-d) \max_{a'} Q'(s', a'; \theta^{-})$$
  - $\mathcal{D}$: Replay buffer. $\theta^{-}$: Target network parameters.
- Pseudocode:
  - Initialize Q-network $Q_\theta$, target network $Q'_{\theta^-}$, replay buffer $\mathcal{D}$.
  - For each episode:
    - For each step:
      - Choose action `a` using $\epsilon$-greedy on $Q_\theta$.
      - Execute `a`, get `r`, `s'`. Store $(s, a, r, s', done)$ in $\mathcal{D}$.
      - Sample a mini-batch from $\mathcal{D}$.
      - Compute target `y` using $Q'_{\theta^-}$.
      - Update $Q_\theta$ by minimizing loss $L(\theta)$.
      - Periodically update the target network: $\theta^- \leftarrow \theta$.
      - `s = s'`
- Code Snippet: (a minimal replay buffer sketch follows this section)

  ```python
  # DQN Optimization Step
  non_final_mask = torch.tensor(...)  # Mask for non-terminal next states
  non_final_next_states = torch.cat(...)
  state_batch, action_batch, reward_batch, done_batch = ...  # From replay buffer

  state_action_values = policy_net(state_batch).gather(1, action_batch)

  next_state_values = torch.zeros(batch_size, device=device)
  with torch.no_grad():
      next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]
  expected_state_action_values = (next_state_values * gamma) + reward_batch

  loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  ```
- Key Hyperparameters: `buffer_size`, `batch_size`, `gamma`, `tau` or `target_update_freq`, `learning_rate`, `epsilon` schedule.
- Pros: Handles high-dimensional states (e.g., pixels), off-policy sample efficiency, stable due to replay/target nets.
- Cons: Primarily for discrete actions, can overestimate Q-values, sensitive to hyperparameters.
- Common Pitfalls: Target network updates, buffer management, hyperparameter tuning.
- Use Cases: Atari games from pixels, tasks with discrete actions and large state spaces.
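A minimal experience replay buffer sketch of the kind DQN (and the other off-policy methods here) relies on; deque-based with uniform sampling, and all names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform sampling without replacement
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```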
- Core Idea: Extends DDPG to multi-agent settings using the "centralized training, decentralized execution" paradigm. Each agent has an actor and a centralized critic that observes joint states/observations and actions. Off-policy.
- Mathematical Formulation:
  - Centralized Critic $Q_i(x, a_1, ..., a_N; \phi_i)$ for agent $i$, where $x = (o_1, ..., o_N)$.
  - Critic Update: Minimize $L(\phi_i) = \mathbb{E} [ (y_i - Q_i(x, \mathbf{a}))^2 ]$ where $y_i = r_i + \gamma Q'_i(x', \mu'_1(o'_1), ..., \mu'_N(o'_N))$.
  - Actor Update: Minimize $L(\theta_i) = - \mathbb{E} [ Q_i(x, \mu_1(o_1), ..., \mu_N(o_N)) ]$ (using main critic $Q_i$).
- Pseudocode:
  - Initialize actors $\mu_i$, critics $Q_i$, targets $\mu'_i, Q'_i$, replay buffer $\mathcal{D}$.
  - For each step:
    - Each agent $i$ chooses $a_i = \mu_i(o_i) + \text{Noise}$.
    - Execute joint action $\mathbf{a}$, get $r=(r_1,...), o'=(o'_1,...)$. Store $(o, \mathbf{a}, r, o')$ in $\mathcal{D}$.
    - Sample a mini-batch from $\mathcal{D}$.
    - For each agent $i$: update critic $Q_i$ and actor $\mu_i$.
    - Soft-update all target networks.
- Code Snippet: (conceptual; the update involves joint observations and actions)

  ```python
  # Critic Update (Agent i)
  with torch.no_grad():
      target_actions_next = [target_actors[j](next_obs_batch[:, j]) for j in range(num_agents)]
      target_q_next = target_critics[i](joint_next_obs_batch, torch.cat(target_actions_next, dim=1))
      y_i = rewards_batch[:, i] + gamma * (1 - dones_batch[:, i]) * target_q_next
  current_q_i = critics[i](joint_obs_batch, joint_actions_batch)  # Actions from buffer
  critic_loss_i = F.mse_loss(current_q_i, y_i)
  # ... optimize critic i ...

  # Actor Update (Agent i)
  current_actions_policy = [actors[j](obs_batch[:, j]) for j in range(num_agents)]
  # Need grads only for actor i's action output
  current_actions_policy[i] = actors[i](obs_batch[:, i])  # Ensure grad enabled if needed
  q_actor_loss = critics[i](joint_obs_batch, torch.cat(current_actions_policy, dim=1))
  actor_loss_i = -q_actor_loss.mean()
  # ... optimize actor i ...
  ```
- Key Hyperparameters: Similar to DDPG, but potentially per-agent. Buffer size, batch size, $\gamma, \tau$, learning rates, noise.
- Pros: Addresses non-stationarity in MARL, decentralized execution, handles mixed cooperative/competitive settings.
- Cons: Centralized critic scales poorly with many agents, credit assignment can be hard in cooperative settings.
- Use Cases: Multi-robot coordination, predator-prey, cooperative navigation.
- Core Idea: A value-based MARL algorithm for cooperative tasks. Learns individual agent Q-functions $Q_i$ and mixes them monotonically using a mixing network conditioned on the global state to produce $Q_{tot}$. Off-policy, centralized training, decentralized execution.
- Mathematical Formulation: $Q_{tot}(x, \mathbf{a}) = f_{mix}(Q_1(o_1, a_1), ..., Q_N(o_N, a_N); x)$
  - Constraint: $\frac{\partial Q_{tot}}{\partial Q_i} \ge 0$ (enforced by non-negative mixer weights, often via hypernetworks).
  - Loss: Minimize TD error on $Q_{tot}$: $L = \mathbb{E} [ (y - Q_{tot}(x, \mathbf{a}))^2 ]$ where $y = r + \gamma Q'_{tot}(x', \mathbf{a}')$ with $a'_i = \arg\max_a Q'_i(o'_i, a)$.
- Pseudocode:
  - Initialize agent networks $Q_i$, target networks $Q'_i$, mixer $f_{mix}$, target mixer $f'_{mix}$, replay buffer $\mathcal{D}$.
  - For each step:
    - Each agent $i$ chooses $a_i$ using $\epsilon$-greedy on $Q_i(o_i)$.
    - Execute $\mathbf{a}$, get $r$, $o'$, $x'$. Store $(o, \mathbf{a}, r, o', x, x', done)$ in $\mathcal{D}$.
    - Sample a mini-batch.
    - Calculate target $y$ using target networks $Q'_i$ and target mixer $f'_{mix}$.
    - Calculate current $Q_{tot}$ using main networks $Q_i$ and main mixer $f_{mix}$.
    - Compute loss $L$.
    - Update all $Q_i$ and $f_{mix}$ parameters via gradient descent on $L$.
    - Soft-update target networks.
- Code Snippet: (a minimal monotonic mixer sketch follows this section)

  ```python
  # Calculate Target Q_tot'
  with torch.no_grad():
      # ... Get max Q'_i for each agent i in next state ...
      target_agent_qs = torch.cat(...)  # Shape (batch, num_agents)
      q_tot_target = target_mixer(target_agent_qs, next_global_state_batch)
      y = reward_batch + gamma * (1 - done_batch) * q_tot_target

  # Calculate Current Q_tot
  # ... Get Q_i for the action *taken* by each agent i in current state ...
  current_agent_qs = torch.cat(...)  # Shape (batch, num_agents)
  q_tot_current = mixer(current_agent_qs, global_state_batch)

  # Loss and Optimize
  loss = F.mse_loss(q_tot_current, y)
  optimizer.zero_grad()
  loss.backward()  # Gradients flow back to all agent nets and mixer
  optimizer.step()
  # ... Soft update targets ...
  ```
- Key Hyperparameters: Like DQN, plus mixing network architecture, hypernetwork details.
- Pros: Good for cooperative tasks, enforces IQL principle (local optimum -> global optimum), scales better in action space than joint Q-learning.
- Cons: Limited representational power due to monotonicity, requires global state for mixer.
- Use Cases: Cooperative MARL (e.g., SMAC benchmark), resource allocation.
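A minimal sketch of a QMIX-style monotonic mixing network: hypernetworks conditioned on the global state produce the mixer weights, and `abs()` keeps them non-negative so that $\partial Q_{tot}/\partial Q_i \ge 0$; layer sizes and counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    def __init__(self, num_agents, state_dim, embed_dim=32):
        super().__init__()
        self.num_agents, self.embed_dim = num_agents, embed_dim
        # Hypernetworks: map global state -> mixer weights/biases
        self.hyper_w1 = nn.Linear(state_dim, num_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, num_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        w1 = self.hyper_w1(state).abs().view(batch, self.num_agents, self.embed_dim)  # non-negative
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (batch, 1, embed_dim)
        w2 = self.hyper_w2(state).abs().view(batch, self.embed_dim, 1)                # non-negative
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(batch, 1)
```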
- Core Idea: Learns policies at multiple levels of abstraction. High levels set subgoals (as "actions") for lower levels, which execute primitive actions to achieve them within a time limit $H$. Uses intrinsic rewards and hindsight for learning subgoals. Off-policy.
- Mathematical Formulation (Conceptual):
  - Level $i$: Policy $\pi_i(a_i | s, g_i)$, Q-function $Q_i(s, a_i, g_i)$. $a_i$ is the subgoal for level $i-1$.
  - Low Level (0): Learns using intrinsic reward $r_{int}$ (success/fail to reach $g_0$).
  - High Level (1): Learns using environment reward $R = \sum r_{env}$.
  - Hindsight: Relabel transitions with achieved states as goals, granting artificial success.
- Pseudocode (2-Level):
  - Initialize networks $Q_0, Q'_0, Q_1, Q'_1$, buffers $D_0, D_1$.
  - For episode:
    - `s = env.reset()`
    - While overall goal `G` not reached:
      - High level chooses subgoal `g_0 = select_action(level=1, state=s, goal=G)`.
      - `transitions = []`, `total_env_reward = 0`, `s_start = s`
      - For `h` from 1 to `H`:
        - Low level chooses primitive action `a = select_action(level=0, state=s, goal=g_0)`.
        - Take `a`, get `r_env`, `s_next`, `env_done`.
        - `total_env_reward += r_env`
        - `r_int = get_intrinsic_reward(s_next, g_0)`
        - Store low-level tuple `(s, a, r_int, s_next, g_0, env_done or test_goal(s_next, g_0), achieved_goal=s_next)` in `transitions`.
        - `s = s_next`
        - If `test_goal(s, g_0)` or `env_done`: break inner loop.
      - `s_end = s`
      - Store high-level tuple `(s_start, g_0, total_env_reward, s_end, G, env_done)` in `transitions`.
      - Add `transitions` to buffers with hindsight relabeling.
      - Update $Q_0, Q_1$ from buffers. Update targets.
      - If `env_done`: break outer loop.
- Code Snippet: (focus on hindsight and intrinsic reward)

  ```python
  # Inside low-level execution loop
  # ... execute action a, get next_state_norm, env_done ...
  intrinsic_reward = -1.0  # Default failure
  subgoal_achieved = self._test_goal(next_state_norm, goal_norm)
  if subgoal_achieved:
      intrinsic_reward = 0.0  # Success reward

  # Store original transition
  buffer.push(..., reward=intrinsic_reward, goal=goal_norm,
              done=env_done or subgoal_achieved,
              achieved_goal=next_state_norm, level=0)

  # Hindsight is handled in buffer.sample() by replacing goal with achieved_goal
  # and setting reward/done accordingly for the hindsight sample.
  ```
- Key Hyperparameters: Number of levels, time limit `H`, learning rates, `gamma`, `tau`, `epsilon` schedule, buffer sizes, hindsight probability `p`.
- Pros: Can solve long-horizon/sparse reward tasks, structured exploration, potential for skill reuse.
- Cons: Very complex implementation, sensitive to goal definition, time limits, and hyperparameters, potential for suboptimal subgoal setting. The notebook implementation has known issues.
- Common Pitfalls: Hindsight logic, intrinsic reward definition, goal feasibility, tuning `H`.
- Use Cases: Complex robotics tasks, long-horizon planning, navigation.
- Core Idea: An online planning algorithm that builds a search tree using simulated trajectories (rollouts) from the current state. Uses statistics (visit counts, values) and UCT to balance exploration/exploitation within the search tree. Requires a simulator/model.
- Mathematical Formulation:
  - Tree nodes store state $s$, visit count $N(s)$, total value $W(s)$.
  - Edges store action $a$, count $N(s, a)$, value $W(s, a)$.
  - Selection policy (UCT): Choose action $a$ maximizing $Q(s, a) + C \sqrt{\frac{\ln N(s)}{N(s, a)}}$.
- Pseudocode (Single MCTS step for action selection):
  - Initialize tree with root node = current state $s_{root}$.
  - Repeat for `N` simulations:
    - `node = root_node`
    - Selection: While `node` is fully expanded and not terminal, `node = select_best_child_uct(node)`.
    - Expansion: If `node` is not fully expanded and not terminal, expand one child: `node = expand_node(node)`.
    - Simulation: Run a rollout from `node`'s state using the default policy, get reward `R`.
    - Backpropagation: Update `N` and `W` for nodes/edges from `node` back up to the root using `R`.
  - Choose the best action from the root based on visit counts (or values).
- Code Snippet: (UCT selection; rollout and backpropagation sketches follow this section)

  ```python
  # Inside select_best_child_uct
  best_score = -float('inf')
  best_child = None
  for action, child in node.children.items():
      if child.visit_count == 0:
          uct_score = float('inf')
      else:
          exploit = child.total_value / child.visit_count
          explore = exploration_constant * math.sqrt(math.log(node.visit_count) / child.visit_count)
          uct_score = exploit + explore
      # ... update best_child ...
  return best_child
  ```
- Key Hyperparameters: `num_simulations` (budget per step), `exploration_constant C`, `rollout_depth`, `gamma` (for rollouts).
- Pros: Anytime algorithm, handles large state/action spaces, no explicit value function needed for search, asymmetric tree growth.
- Cons: Requires a simulator/model, computationally intensive per step, rollout policy quality affects performance.
- Common Pitfalls: Tuning `C`, efficient implementation of the search steps, quality of rollouts.
- Use Cases: Game playing (Go, Chess), planning problems with simulators.
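Minimal sketches of the simulation (random rollout) and backpropagation steps referenced above, assuming a `node` object with `parent`, `visit_count`, and `total_value` attributes and a simulator exposing `legal_actions`/`step` methods; all names are illustrative:

```python
import random

def rollout(simulator, state, rollout_depth, gamma):
    """Simulate from `state` with a random default policy; return the discounted reward."""
    total, discount = 0.0, 1.0
    for _ in range(rollout_depth):
        action = random.choice(simulator.legal_actions(state))
        state, reward, done = simulator.step(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def backpropagate(node, value):
    """Propagate the rollout value from the expanded node back up to the root."""
    while node is not None:
        node.visit_count += 1
        node.total_value += value
        node = node.parent
```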
- Core Idea: A model-based RL agent that learns a latent dynamics model (often an RSSM) directly from experience (potentially high-dimensional observations). It then performs planning directly in the latent space using algorithms like CEM to select actions. Off-policy.
- Mathematical Formulation (Conceptual):
  - Learns models: $p(s_{t+1}|s_t, a_t)$ (transition), $p(r_t|s_t)$ (reward), possibly $p(o_t|s_t)$ (observation/reconstruction), $q(s_t|...)$ (encoder).
  - Model Loss: Maximize data likelihood (often via the ELBO, including reconstruction, reward prediction, and KL regularization terms). Simplified: MSE on next state & reward.
  - Planning (CEM): Optimize $\mathbb{E} \left[ \sum_{k=t}^{t+H-1} \gamma^{k-t} \hat{r}_k \right]$ over action sequences $a_t, \ldots, a_{t+H-1}$ using the learned latent model $(\hat{p}, \hat{r})$.
- Pseudocode:
  - Initialize latent dynamics model, replay buffer $\mathcal{D}$ (stores sequences).
  - Loop:
    - Interact: Observe $s_t$. Plan action $a_t$ using CEM in latent space with the current model. Execute $a_t$, get $r_t, s_{t+1}$. Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
    - Train Model: Sample sequences from $\mathcal{D}$. Update model parameters to minimize prediction/reconstruction losses.
- Code Snippet: (CEM planning call; a minimal CEM planner sketch follows this section)

  ```python
  # Inside main loop
  # state = current latent state representation
  action = cem_planner(
      dynamics_model, state,
      horizon=PLANNING_HORIZON,
      num_candidates=CEM_CANDIDATES,
      num_elites=CEM_ELITES,
      num_iterations=CEM_ITERATIONS,
      gamma=CEM_GAMMA)  # ... plus any other planner kwargs ...
  # Execute action in real env...
  # Train model...
  ```
- Key Hyperparameters: Model architecture (latent size, hidden dims), model learning rate, buffer size, sequence length, planning horizon `H`, CEM parameters (`J`, `M`, iterations).
- Pros: Very sample efficient (especially from images), learns a compact world representation, effective planning.
- Cons: Complex model training, planning can be computationally expensive, model inaccuracies can lead to poor plans (compounding errors).
- Common Pitfalls: Model convergence, tuning planning horizon vs. model accuracy, computational cost of planning.
- Use Cases: Control from pixels, sample-constrained robotics tasks, model-based benchmarks.
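A minimal cross-entropy method (CEM) planner sketch of the kind `cem_planner` above might implement: sample action sequences from a Gaussian, score them by rolling them out through the learned model, and refit the Gaussian to the elites; the `dynamics_model.predict(state, action) -> (next_state, reward)` interface and default sizes are assumptions:

```python
import torch

def cem_planner(dynamics_model, state, horizon=12, num_candidates=1000,
                num_elites=100, num_iterations=10, gamma=0.99, action_dim=1):
    """Plan in latent space with CEM; return the first action of the refined mean sequence."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(num_iterations):
        # Sample candidate action sequences: (num_candidates, horizon, action_dim)
        candidates = mean + std * torch.randn(num_candidates, horizon, action_dim)
        returns = torch.zeros(num_candidates)
        s = state.expand(num_candidates, -1)   # broadcast current latent state over candidates
        discount = 1.0
        for t in range(horizon):
            s, r = dynamics_model.predict(s, candidates[:, t])  # assumed model interface
            returns += discount * r.squeeze(-1)
            discount *= gamma
        elite_idx = returns.topk(num_elites).indices
        elites = candidates[elite_idx]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6  # refit distribution to elites
    return mean[0]  # execute only the first planned action (MPC style)
```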
Pro Tip: Standardizing advantages (in policy gradient / actor-critic methods) or returns (in REINFORCE) often significantly stabilizes training. This involves subtracting the mean and dividing by the standard deviation of the advantages/returns within a batch or episode.
Did You Know? Many state-of-the-art algorithms combine ideas. For example, PPO and A2C are actor-critic methods using policy gradients. SAC combines actor-critic with maximum entropy RL. PlaNet combines model-based learning with CEM planning.
This cheat sheet provides a high-level overview. For detailed implementation and nuances, refer to the specific notebooks in the repository and the original research papers. Good luck learning!