Reinforcement Learning Cheat Sheet ✨

To avoid the need to constantly refer to the original notebooks, here is a cheat sheet summarizing the most important concepts and algorithms in reinforcement learning (RL).


Table of Contents

  • Core Concepts
  • Basic / Tabular Methods
  • Policy Gradient Methods
  • Actor-Critic Methods
  • Value-Based Deep Methods
  • Multi-Agent RL (MARL)
  • Hierarchical RL (HRL)
  • Planning & Model-Based Methods

Core Concepts

Agent-Environment Loop

The fundamental interaction cycle in RL:

  1. Agent observes state $s_t$.
  2. Agent selects action $a_t$ based on policy $\pi(a_t|s_t)$.
  3. Environment transitions to next state $s_{t+1}$.
  4. Environment provides reward $r_t$.
  5. Agent updates policy/values based on $(s_t, a_t, r_t, s_{t+1})$.
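
A minimal sketch of this loop using the Gymnasium API, with a random policy standing in for $\pi$ (environment name and step count are illustrative):

    # Minimal agent-environment loop sketch (random policy as a stand-in for the agent's policy)
    import gymnasium as gym

    env = gym.make("CartPole-v1")
    state, info = env.reset(seed=0)

    for t in range(200):
        action = env.action_space.sample()   # agent selects a_t (here: random)
        next_state, reward, terminated, truncated, info = env.step(action)
        # a real agent would update its policy/values with (state, action, reward, next_state) here
        state = next_state
        if terminated or truncated:
            state, info = env.reset()

    env.close()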

Markov Decision Process (MDP)

Formal framework for RL problems, defined by $(S, A, P, R, \gamma)$:

  • $S$: Set of states.
  • $A$: Set of actions.
  • $P(s'|s, a)$: Transition probability function.
  • $R(s, a, s')$: Reward function.
  • $\gamma$: Discount factor ($0 \le \gamma \le 1$).

Value Functions & Bellman Equations

  • State-Value Function ($V^\pi(s)$): Expected return starting from state $s$ and following policy $\pi$. [ V^\pi(s) = \mathbb{E}_\pi [ G_t | S_t=s ] = \mathbb{E}_\pi [ \sum_{k=0}^\infty \gamma^k r_{t+k+1} | S_t=s ] ]
  • Action-Value Function ($Q^\pi(s, a)$): Expected return starting from state $s$, taking action $a$, and following policy $\pi$. [ Q^\pi(s, a) = \mathbb{E}_\pi [ G_t | S_t=s, A_t=a ] = \mathbb{E}_\pi [ \sum_{k=0}^\infty \gamma^k r_{t+k+1} | S_t=s, A_t=a ] ]
  • Bellman Expectation Equation for $V^\pi$: [ V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} P(s', r | s, a) [r + \gamma V^\pi(s')] ]
  • Bellman Optimality Equation for $Q^*$ (used by Q-Learning): [ Q^*(s, a) = \sum_{s', r} P(s', r | s, a) [r + \gamma \max_{a'} Q^*(s', a')] ]
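
As a concrete illustration of the Bellman expectation equation, here is a minimal iterative policy evaluation sketch on a tiny hypothetical MDP (the transition table and uniform policy are made up for the example):

    # Iterative policy evaluation: repeatedly apply the Bellman expectation backup for V^pi.
    # P[s][a] = [(prob, next_state, reward), ...] describes a small hypothetical MDP.
    import numpy as np

    P = {
        0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
        1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
        2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # state 2 is absorbing
    }
    policy = {s: {0: 0.5, 1: 0.5} for s in P}           # uniform random policy pi(a|s)
    gamma, V = 0.9, np.zeros(len(P))

    for _ in range(100):                                # sweeps of the Bellman expectation backup
        for s in P:
            v = 0.0
            for a, pi_a in policy[s].items():
                for prob, s_next, r in P[s][a]:
                    v += pi_a * prob * (r + gamma * V[s_next])
            V[s] = v
    print(V)                                            # approximate V^pi for the uniform policy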

Exploration vs. Exploitation

  • Exploration: Trying new actions to discover better rewards.
  • Exploitation: Choosing the action currently known to yield the best expected reward.
  • $\epsilon$-Greedy: A common strategy where, with probability $\epsilon$, the agent explores (random action) and, with probability $1-\epsilon$, exploits (greedy action). $\epsilon$ often decays over time.
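
A minimal $\epsilon$-greedy helper with decay might look like the following sketch (names and the decay schedule are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q_values, epsilon):
        if rng.random() < epsilon:                    # explore with probability epsilon
            return int(rng.integers(len(q_values)))   # random action
        return int(np.argmax(q_values))               # exploit: greedy action

    # Typical decay schedule, applied once per episode
    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995
    for episode in range(1000):
        # ... run the episode, selecting actions with epsilon_greedy(q_values_for_state, epsilon) ...
        epsilon = max(epsilon_min, epsilon * epsilon_decay)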

Basic / Tabular Methods

Simple Exploration Bot

(1_simple_rl.ipynb)

  • Core Idea: Demonstrates the basic agent-environment loop. Agent remembers immediate rewards for state-action pairs and uses a simple epsilon-greedy policy based on average immediate rewards. Does not perform true RL value learning.
  • Mathematical Formulation: No Bellman updates. Policy based on: [ \text{AvgR}(s, a) = \frac{\sum \text{rewards observed after } (s, a)}{\text{count of } (s, a)} ]
  • Pseudocode:
    1. Initialize memory mem[s][a] -> [rewards]
    2. For each episode:
      • Reset env to get s
      • For each step:
        • Choose a using $\epsilon$-greedy on AvgR(s, a)
        • Take action a, get r, s'
        • Store r in mem[s][a]
        • s = s'
  • Code Snippet:
    # Choosing action based on average immediate reward
    avg_rewards = []
    for a in range(n_actions):
        rewards = memory[state][a]
        avg_rewards.append(np.mean(rewards) if rewards else 0)
    avg_rewards = np.array(avg_rewards)
    # Break ties randomly among the best actions
    best_action = np.random.choice(np.flatnonzero(avg_rewards == avg_rewards.max()))
    
    # Updating memory
    memory[state][action].append(reward)
  • Key Hyperparameters: epsilon, epsilon_decay.
  • Pros: Simple illustration of interaction loop and memory.
  • Cons: Does not learn long-term values, only immediate rewards. Not true RL. Inefficient memory.
  • Use Cases: Educational demonstration of basic agent structure.

Q-Learning

(2_q_learning.ipynb)

  • Core Idea: Learns the optimal action-value function ($Q^*$) off-policy using Temporal Difference (TD) updates.
  • Mathematical Formulation: Bellman Optimality update: [ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)] ]
    • $\alpha$: Learning rate
    • $\gamma$: Discount factor
    • $\max_{a'} Q(s_{t+1}, a')$: Max Q-value in next state (greedy estimate of future value)
  • Pseudocode:
    1. Initialize Q-table Q(s, a) to zeros.
    2. For each episode:
      • Initialize state s.
      • For each step:
        • Choose action a from s using policy derived from Q (e.g., $\epsilon$-greedy).
        • Take action a, observe r, s'.
        • Update Q(s, a) using the Q-learning rule.
        • s = s'
  • Code Snippet:
    # Q-Learning update
    current_q = q_table[state][action]
    max_next_q = max(q_table[next_state].values()) if next_state in q_table else 0.0
    td_target = reward + gamma * max_next_q
    td_error = td_target - current_q
    q_table[state][action] += alpha * td_error
  • Key Hyperparameters: alpha (learning rate), gamma (discount factor), epsilon (exploration rate), epsilon_decay.
  • Pros: Off-policy (can learn optimal policy while exploring), simple concept, guaranteed convergence under conditions.
  • Cons: Tabular form doesn't scale to large state spaces, can suffer from maximization bias (addressed by Double Q-learning).
  • Common Pitfalls: Tuning $\alpha$ and $\epsilon$. Ensuring sufficient exploration.
  • Use Cases: Small, discrete state/action spaces, foundational understanding.

SARSA

(3_sarsa.ipynb)

  • Core Idea: Learns the action-value function ($Q^\pi$) for the policy currently being followed (on-policy) using TD updates.
  • Mathematical Formulation: Update uses the next action chosen by the policy: [ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)] ]
    • $a_{t+1}$ is the action chosen in state $s_{t+1}$ by the current policy (e.g., $\epsilon$-greedy).
  • Pseudocode:
    1. Initialize Q-table Q(s, a).
    2. For each episode:
      • Initialize s.
      • Choose a from s using policy derived from Q (e.g., $\epsilon$-greedy).
      • For each step:
        • Take action a, observe r, s'.
        • Choose next action a' from s' using policy derived from Q.
        • Update Q(s, a) using r, s', a'.
        • s = s', a = a'
  • Code Snippet:
    # SARSA update
    current_q = q_table[state][action]
    next_q = q_table[next_state][next_action] # Q-value of the *next* action taken
    td_target = reward + gamma * next_q
    td_error = td_target - current_q
    q_table[state][action] += alpha * td_error
  • Key Hyperparameters: alpha, gamma, epsilon, epsilon_decay.
  • Pros: On-policy (learns value of the exploration policy), often more stable/conservative in risky environments than Q-learning.
  • Cons: Tabular, can be slower to converge to optimal if exploration persists, sensitive to policy changes.
  • Common Pitfalls: Ensuring the next action a' is chosen correctly before the update.
  • Use Cases: When evaluating the current policy is important, safer exploration needed.

Expected SARSA

(4_expected_sarsa.ipynb)

  • Core Idea: Like SARSA, but updates using the expected value over next actions, weighted by policy probabilities, reducing variance. Still on-policy.
  • Mathematical Formulation: [ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \mathbb{E}_{\pi}[Q(s_{t+1}, A')] - Q(s_t, a_t)] ] [ \mathbb{E}_{\pi}[Q(s', A')] = \sum_{a'} \pi(a'|s') Q(s', a') ] For $\epsilon$-greedy: $\mathbb{E}_{\pi}[Q(s', A')] = (1 - \epsilon) \max_{a''} Q(s', a'') + \frac{\epsilon}{|\mathcal{A}|} \sum_{a'} Q(s', a')$
  • Pseudocode:
    1. Initialize Q-table Q(s, a).
    2. For each episode:
      • Initialize s.
      • For each step:
        • Choose a from s using policy derived from Q (e.g., $\epsilon$-greedy).
        • Take action a, observe r, s'.
        • Calculate expected Q-value $E[Q(s', A')]$ based on policy $\pi$ in state s'.
        • Update Q(s, a) using r and $E[Q(s', A')]$.
        • s = s'
  • Code Snippet:
    # Expected SARSA update (assuming epsilon-greedy)
    current_q = q_table[state][action]
    if next_state in q_table and q_table[next_state]:
        q_values_next = q_table[next_state]
        max_q_next = max(q_values_next.values())
        num_actions = len(action_space)
        expected_q_next = (1.0 - epsilon) * max_q_next + (epsilon / num_actions) * sum(q_values_next.values())
    else:
        expected_q_next = 0.0
    td_target = reward + gamma * expected_q_next
    td_error = td_target - current_q
    q_table[state][action] += alpha * td_error
  • Key Hyperparameters: alpha, gamma, epsilon, epsilon_decay.
  • Pros: On-policy, lower variance than SARSA, often more stable, same computational cost as Q-learning per update.
  • Cons: Tabular, slightly more complex update calculation than SARSA.
  • Use Cases: Where SARSA is applicable but stability/variance is an issue.

Dyna-Q

(5_dyna_q.ipynb)

  • Core Idea: Integrates model-free learning (Q-learning) with model-based planning. Learns a model of the environment from real experience and uses it to perform extra "planning" updates on the Q-table using simulated experience.
  • Mathematical Formulation:
    • Direct RL: Standard Q-learning update on real transition $(s, a, r, s')$.
    • Model Learning: $Model(s, a) \leftarrow (r, s')$ (for deterministic env).
    • Planning: For $k$ steps: Sample $(s_p, a_p)$ from previously experienced pairs. Get $(r_p, s'_p) = Model(s_p, a_p)$. Apply Q-learning update to $Q(s_p, a_p)$ using $(s_p, a_p, r_p, s'_p)$.
  • Pseudocode:
    1. Initialize Q(s, a) and Model(s, a).
    2. For each episode:
      • Initialize s.
      • For each step:
        • Choose a using policy based on Q.
        • Take action a, observe r, s'.
        • Update Q(s, a) with $(s, a, r, s')$ (Direct RL).
        • Update Model(s, a) with $(r, s')$.
        • Repeat k times (Planning):
          • Sample previously seen $(s_p, a_p)$.
          • Get $(r_p, s'_p)$ from Model(s_p, a_p).
          • Update Q(s_p, a_p) with $(s_p, a_p, r_p, s'_p)$.
        • s = s'
  • Code Snippet:
    # Direct RL Update (same as Q-Learning)
    # ... q_learning_update(q_table, state, action, reward, next_state, ...)
    
    # Model Update
    model[(state, action)] = (reward, next_state)
    if (state, action) not in observed_pairs:
        observed_pairs.append((state, action))
        
    # Planning Step
    for _ in range(planning_steps_k):
        if not observed_pairs: break
        s_p, a_p = random.choice(observed_pairs)
        r_p, s_prime_p = model[(s_p, a_p)]
        q_learning_update(q_table, s_p, a_p, r_p, s_prime_p, ...)
  • Key Hyperparameters: alpha, gamma, epsilon, k (number of planning steps).
  • Pros: Improves sample efficiency compared to pure Q-learning by reusing experience via the model. Simple integration of learning and planning.
  • Cons: Tabular. Effectiveness depends heavily on model accuracy. Assumes deterministic model in simple form.
  • Common Pitfalls: Poor model accuracy can lead to suboptimal policy. Choosing k.
  • Use Cases: Environments where interaction is costly but computation is cheap. Simple planning tasks.

Policy Gradient Methods

REINFORCE (Monte Carlo Policy Gradient)

(6_reinforce.ipynb)

  • Core Idea: Directly learns a parameterized policy $\pi(a|s; \theta)$ by increasing the probability of actions that led to high cumulative episode returns ($G_t$). On-policy, Monte Carlo.
  • Mathematical Formulation: Updates policy parameters $\theta$ via gradient ascent on $J(\theta) = \mathbb{E}[G_t]$. [ \nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \nabla_\theta \log \pi(a_t | s_t; \theta) ] Loss function (for minimization): $L(\theta) = -\sum_{t=0}^{T-1} G_t \log \pi(a_t | s_t; \theta)$
    • $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$: Discounted return from step $t$.
  • Pseudocode:
    1. Initialize policy network $\pi(a|s; \theta)$.
    2. For each episode:
      • Generate trajectory $(s_0, a_0, r_1, ...)$ by sampling actions $a_t \sim \pi(a|s_t; \theta)$. Store log probs $\log \pi(a_t|s_t; \theta)$ and rewards $r_{t+1}$.
      • Calculate discounted returns $G_t$ for all steps $t$.
      • Compute loss $L$.
      • Update $\theta$ using gradient descent on $L$.
  • Code Snippet:
    # Calculate returns (backward loop)
    returns = []
    G = 0.0
    for r in reversed(episode_rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Standardize returns (optional but recommended)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    
    # Calculate loss
    log_probs_tensor = torch.stack(episode_log_probs)
    loss = -torch.sum(returns * log_probs_tensor)
    
    # Update policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  • Key Hyperparameters: learning_rate, gamma. Network architecture.
  • Pros: Simple policy gradient concept, works with discrete/continuous actions, learns stochastic policies.
  • Cons: High variance due to Monte Carlo returns, episodic updates (waits until episode end), on-policy sample inefficiency.
  • Common Pitfalls: High variance leading to unstable training, requires careful learning rate tuning.
  • Use Cases: Simple benchmarks, conceptual understanding, basis for actor-critic.
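
The REINFORCE snippet above assumes a policy network that produces the stored log probabilities; a minimal sketch of such a network for discrete actions (architecture and names are illustrative, not taken from the notebook) could be:

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class PolicyNet(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))

        def forward(self, obs):
            return Categorical(logits=self.net(obs))   # pi(a|s; theta)

    policy = PolicyNet(obs_dim=4, n_actions=2)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # During the episode: sample an action and keep its log prob for the loss
    obs = torch.zeros(4)                               # placeholder observation
    dist = policy(obs)
    action = dist.sample()
    episode_log_probs = [dist.log_prob(action)]        # appended every step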

Trust Region Policy Optimization (TRPO)

(12_trpo.ipynb)

  • Core Idea: Improves policy gradient updates by constraining the change in the policy (measured by KL divergence) at each step, ensuring more stable and monotonic improvement. On-policy.
  • Mathematical Formulation: Solves a constrained optimization problem (approximately): [ \max_{\theta} \quad \mathbb{E}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right] ] [ \text{s.t.} \quad \mathbb{E}_t [D_{KL}(\pi_{\theta_{old}}(\cdot|s_t) || \pi_{\theta}(\cdot|s_t))] \le \delta ] Solved using Conjugate Gradient (to find the direction $\approx F^{-1}g$) and a line search (to satisfy the constraint). $F$ is the Fisher Information Matrix.
  • Pseudocode:
    1. Initialize actor $\pi_\theta$, critic $V_\phi$.
    2. For each iteration:
      • Collect trajectories using $\pi_{\theta_{old}}$. Store states, actions, rewards, log probs.
      • Compute advantages $\hat{A}_t$ (using GAE with $V_\phi$).
      • Compute policy gradient $g$.
      • Use Conjugate Gradient + Fisher-Vector Products to find step direction $s \approx F^{-1}g$.
      • Perform line search to find step size $\beta \alpha$ satisfying KL constraint $\delta$ and improving surrogate objective.
      • Update actor: $\theta_{new} \leftarrow \theta_{old} + \beta \alpha s$.
      • Update critic $V_\phi$ using collected data.
  • Code Snippet: (Focus on conceptual update call; a conjugate gradient sketch follows this section)
    # Conceptual TRPO update call
    policy_gradient = calculate_policy_gradient(...)
    step_direction = conjugate_gradient(fisher_vector_product_func, policy_gradient, ...)
    initial_step_size = calculate_initial_step_size(step_direction, policy_gradient, max_kl, ...)
    final_update, success = backtracking_line_search(actor, ..., step_direction, initial_step_size, max_kl, ...)
    if success:
        apply_update(actor, final_update)
    update_critic(...)
  • Key Hyperparameters: delta (KL constraint), gamma, lambda (GAE), CG iterations, CG damping, line search parameters.
  • Pros: Provides theoretical monotonic improvement guarantee (under approximations), very stable updates.
  • Cons: Complex implementation (FVP, CG, line search), computationally expensive per update, on-policy.
  • Common Pitfalls: Implementing FVP and CG correctly, tuning trust region $\delta$.
  • Use Cases: Continuous control, situations requiring high stability, benchmark for simpler algorithms like PPO.
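
The TRPO snippet above calls a conjugate gradient routine; here is a minimal sketch of conjugate gradient for solving $Fx = g$ without forming $F$, where the fvp callable (an assumed interface) returns the damped Fisher-vector product for a flat parameter-sized vector:

    import torch

    def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
        # g: flat 1-D gradient vector; returns an approximation of F^{-1} g
        x = torch.zeros_like(g)
        r = g.clone()               # residual g - F x (x = 0 initially)
        p = g.clone()               # search direction
        rs_old = torch.dot(r, r)
        for _ in range(iters):
            Fp = fvp(p)
            alpha = rs_old / torch.dot(p, Fp)
            x += alpha * p
            r -= alpha * Fp
            rs_new = torch.dot(r, r)
            if rs_new < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x                    # approximate step direction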

Actor-Critic Methods

Advantage Actor-Critic (A2C)

(8_a2c.ipynb)

  • Core Idea: A synchronous, simpler version of A3C. Uses an actor (policy) and a critic (value function) trained on batches of experience collected by the actor. Reduces variance compared to REINFORCE by using advantage estimates. On-policy.
  • Mathematical Formulation:
    • Actor Loss (minimize): $L_{actor} = - \mathbb{E}_t [ \log \pi(a_t | s_t; \theta) \hat{A}_t^{\text{detached}} + c_e H(\pi) ]$
    • Critic Loss (minimize): $L_{critic} = \mathbb{E}_t [ (R_t - V(s_t; \phi))^2 ]$
    • $\hat{A}_t = R_t - V(s_t; \phi)$: Advantage estimate (often using n-step returns or GAE for $R_t$).
  • Pseudocode:
    1. Initialize shared actor $\pi_\theta$ and critic $V_\phi$.
    2. Loop for iterations:
      • Collect batch of N steps of experience $(s_t, a_t, r_{t+1}, s_{t+1}, d_t)$ using $\pi_\theta$.
      • Compute n-step returns $R_t$ and advantages $\hat{A}_t$ using $V_\phi$.
      • Compute actor loss (policy gradient + entropy) and critic loss (MSE).
      • Compute gradients for actor and critic based on the batch.
      • Apply synchronous gradient update to $\theta$ and $\phi$.
  • Code Snippet:
    # Calculate Advantage and Returns (e.g., using GAE; see the GAE sketch after this section)
    advantages, returns_to_go = compute_gae_and_returns(...)
    
    # Evaluate current policy and value
    policy_dist = actor(states)
    log_probs = policy_dist.log_prob(actions)
    entropy = policy_dist.entropy().mean()
    values_pred = critic(states).squeeze()
    
    # Losses
    policy_loss = -(log_probs * advantages.detach()).mean() - entropy_coeff * entropy
    value_loss = F.mse_loss(values_pred, returns_to_go.detach())
    
    # Optimize Actor
    actor_optimizer.zero_grad()
    policy_loss.backward()
    actor_optimizer.step()
    
    # Optimize Critic
    critic_optimizer.zero_grad()
    (value_loss_coeff * value_loss).backward()
    critic_optimizer.step()
  • Key Hyperparameters: learning_rates (actor/critic), gamma, lambda (GAE), n_steps (rollout length), value_loss_coeff, entropy_coeff.
  • Pros: More stable than REINFORCE, simpler than A3C/TRPO/PPO, good baseline, utilizes GPUs well.
  • Cons: On-policy (sample inefficient), updates can still have variance, performance sometimes lower than PPO.
  • Common Pitfalls: Balancing actor/critic learning rates, choosing n_steps.
  • Use Cases: Discrete/continuous control benchmarks, simpler alternative to A3C/PPO.
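
The A2C and PPO snippets call compute_gae_and_returns conceptually; a minimal GAE sketch, assuming 1-D tensors of rollout rewards, value predictions, and done flags plus a bootstrap value for the final state, might look like:

    import torch

    def compute_gae_and_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
        T = rewards.shape[0]
        advantages = torch.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            next_value = last_value if t == T - 1 else values[t + 1]
            next_nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]   # TD error
            gae = delta + gamma * lam * next_nonterminal * gae
            advantages[t] = gae
        returns_to_go = advantages + values           # targets for the critic
        return advantages, returns_to_go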

Asynchronous Advantage Actor-Critic (A3C)

(9_a3c.ipynb & a3c_training.py)

  • Core Idea: Uses multiple parallel workers, each with a local copy of the actor-critic network and an environment instance. Workers compute gradients locally based on n-step returns and asynchronously update a shared global network. On-policy.
  • Mathematical Formulation: Same loss function as A2C (per worker), but updates are applied asynchronously to global parameters $\theta_{global}, \phi_{global}$ using gradients computed from local parameters $\theta', \phi'$.
  • Pseudocode (Worker):
    1. Initialize local network, sync with global.
    2. Loop:
      • Reset local gradients. Sync with global.
      • Collect n-steps of experience using local policy.
      • Calculate n-step returns $R_t$ and advantages $\hat{A}_t$.
      • Compute gradients for actor and critic losses based on the n-step experience.
      • Apply gradients asynchronously to the global network using a shared optimizer.
      • If episode done, reset environment.
  • Code Snippet: (Conceptual - see a3c_training.py)
    # Inside worker loop
    local_model.load_state_dict(global_model.state_dict())
    # ... collect n-steps data ...
    returns, advantages = compute_n_step_returns_advantages(...)
    policy_loss = -(log_probs * advantages.detach()).mean() - entropy_coeff * entropy
    value_loss = F.mse_loss(values_pred, returns.detach())
    total_loss = policy_loss + value_loss_coeff * value_loss
    
    global_optimizer.zero_grad()
    total_loss.backward() # Calculates grad on local model
    # Transfer gradients to global model
    for local_param, global_param in zip(local_model.parameters(), global_model.parameters()):
        if global_param.grad is not None: global_param.grad.data.zero_() # Optional safety zero
        if local_param.grad is not None:
             global_param.grad = local_param.grad.clone()
    global_optimizer.step() # Updates global model
  • Key Hyperparameters: num_workers, n_steps, learning rates, gamma, coefficients $c_v, c_e$. Optimizer details (e.g., shared Adam).
  • Pros: No replay buffer needed, decorrelates data via parallelism, efficient on multi-core CPUs.
  • Cons: Complex implementation (multiprocessing, shared memory, async updates), potential for stale gradients, often less GPU-efficient than A2C.
  • Common Pitfalls: Race conditions with shared optimizer/gradients, worker synchronization.
  • Use Cases: Historically significant for Atari/continuous control, CPU-based parallel training.

Deep Deterministic Policy Gradient (DDPG)

(10_ddpg.ipynb)

  • Core Idea: An off-policy actor-critic algorithm primarily for continuous action spaces. Learns a deterministic policy (actor) alongside a Q-function (critic). Uses ideas from DQN (replay buffer, target networks) for stability.
  • Mathematical Formulation:
    • Critic Update (minimize loss): $L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} [ (y - Q(s, a; \phi))^2 ]$ where $y = r + \gamma (1-d) Q'(s', \mu'(s'; \theta'); \phi')$.
    • Actor Update (maximize objective via gradient ascent, often minimize negative): $L(\theta) = - \mathbb{E}_{s \sim \mathcal{D}} [ Q(s, \mu(s; \theta); \phi) ]$.
    • $\mu, Q$: Main networks; $\mu', Q'$: Target networks.
  • Pseudocode:
    1. Initialize actor $\mu_\theta$, critic $Q_\phi$, target networks $\mu'_{\theta'}, Q'_{\phi'}$, replay buffer $\mathcal{D}$.
    2. For each step:
      • Select action $a = \mu(s; \theta) + \text{Noise}$.
      • Execute $a$, get $r, s'$. Store $(s, a, r, s')$ in $\mathcal{D}$.
      • Sample mini-batch from $\mathcal{D}$.
      • Update critic $Q_\phi$ using TD error derived from target networks.
      • Update actor $\mu_\theta$ using gradient from critic's output $Q(s, \mu(s))$.
      • Soft-update target networks: $\theta' \leftarrow \tau \theta + (1-\tau)\theta'$, $\phi' \leftarrow \tau \phi + (1-\tau)\phi'$.
  • Code Snippet:
    # Critic Update
    with torch.no_grad():
        next_actions = target_actor(next_state_batch)
        target_q = target_critic(next_state_batch, next_actions)
        y = reward_batch + gamma * (1 - done_batch) * target_q
    current_q = critic(state_batch, action_batch)
    critic_loss = F.mse_loss(current_q, y)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    
    # Actor Update
    actor_actions = actor(state_batch)
    q_for_actor = critic(state_batch, actor_actions) # No detach! Grad flows from critic
    actor_loss = -q_for_actor.mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    
    # Soft Updates (see the soft_update sketch after this section)
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)
  • Key Hyperparameters: buffer_size, batch_size, gamma, tau (soft update rate), actor/critic learning rates, exploration noise parameters.
  • Pros: Off-policy sample efficiency, handles continuous actions directly.
  • Cons: Sensitive to hyperparameters, can suffer from Q-value overestimation, exploration can be tricky.
  • Common Pitfalls: Learning rates, noise scale/decay, target update rate $\tau$.
  • Use Cases: Continuous control (robotics, physics simulation).
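
The DDPG and SAC snippets call soft_update; a minimal Polyak-averaging sketch is:

    import torch

    @torch.no_grad()
    def soft_update(target_net, main_net, tau):
        # Blend main parameters into the target network: theta' <- tau*theta + (1-tau)*theta'
        for target_param, param in zip(target_net.parameters(), main_net.parameters()):
            target_param.data.mul_(1.0 - tau).add_(tau * param.data)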

Soft Actor-Critic (SAC)

(11_sac.ipynb)

  • Core Idea: An off-policy actor-critic algorithm for continuous actions based on the maximum entropy framework. Learns a stochastic policy that maximizes both expected return and policy entropy, leading to improved exploration and robustness.
  • Mathematical Formulation: Objective includes entropy term: $J(\pi) = \mathbb{E}_{\tau \sim \pi} [\sum_t \gamma^t (R_t + \alpha H(\pi(\cdot|s_t)))]$. Uses twin Q-critics, target critics, and often auto-tunes the entropy coefficient $\alpha$.
    • Critic Update (minimize loss for $Q_1, Q_2$): $L(\phi_i) = \mathbb{E} [ (Q_i(s,a) - y)^2 ]$ where $y = r + \gamma (1-d) [\min_{j=1,2} Q'_j(s', a') - \alpha \log \pi(a'|s')]$, $a' \sim \pi(\cdot|s')$.
    • Actor Update (minimize loss): $L(\theta) = \mathbb{E}_{s, a \sim \pi} [ \alpha \log \pi(a|s) - \min_{j=1,2} Q_j(s, a) ]$.
    • Alpha Update (minimize loss): $L(\log \alpha) = \mathbb{E}_{a \sim \pi} [ -\log \alpha (\log \pi(a|s) + \bar{H}) ]$ (where $\bar{H}$ is target entropy).
  • Pseudocode:
    1. Initialize actor $\pi_\theta$, twin critics $Q_{\phi_1}, Q_{\phi_2}$, target critics $Q'_{\phi'_1}, Q'_{\phi'_2}$, replay buffer $\mathcal{D}$, $\log \alpha$.
    2. For each step:
      • Select action $a \sim \pi(\cdot|s; \theta)$ (sampling).
      • Execute $a$, get $r, s'$. Store $(s, a, r, s')$ in $\mathcal{D}$.
      • Sample mini-batch from $\mathcal{D}$.
      • Update critics $Q_{\phi_1}, Q_{\phi_2}$ using soft TD target (min of target Q's minus scaled log prob).
      • Update actor $\pi_\theta$ using gradient based on min Q and log prob.
      • Update $\alpha$ (if auto-tuning) based on policy entropy.
      • Soft-update target critics.
  • Code Snippet:
    # Critic Target Calculation
    with torch.no_grad():
        next_action, next_log_prob = actor(next_state_batch)
        q1_target_next, q2_target_next = target_critic(next_state_batch, next_action)
        q_target_next = torch.min(q1_target_next, q2_target_next)
        alpha = torch.exp(log_alpha).detach()
        soft_target = q_target_next - alpha * next_log_prob
        y = reward_batch + gamma * (1.0 - done_batch) * soft_target
    # ... Critic Update (MSE loss) ...
    
    # Actor Update
    pi_action, pi_log_prob = actor(state_batch)
    q1_pi, q2_pi = critic(state_batch, pi_action) # Grads enabled for critic here
    min_q_pi = torch.min(q1_pi, q2_pi)
    actor_loss = (alpha * pi_log_prob - min_q_pi).mean()
    # ... Actor Optimizer Step ...
    
    # Alpha Update
    alpha_loss = -(log_alpha * (pi_log_prob.detach() + target_entropy)).mean()
    # ... Alpha Optimizer Step ...
    
    # Soft Updates ...
  • Key Hyperparameters: buffer_size, batch_size, gamma, tau, learning rates (actor, critic, alpha), initial alpha, target_entropy (if auto-tuning).
  • Pros: State-of-the-art sample efficiency and performance on continuous control, robust, good exploration.
  • Cons: More complex than DDPG/PPO, requires careful implementation (especially squashing correction).
  • Common Pitfalls: Correct log prob calculation (tanh squashing correction), alpha tuning stability, target entropy choice.
  • Use Cases: Continuous control (robotics, benchmarks), situations needing robust exploration.
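
The tanh squashing correction mentioned above is a common stumbling block; here is a minimal sketch of sampling a squashed Gaussian action with the corrected log probability (assuming the actor outputs mean and log_std for a batch of states):

    import math
    import torch
    import torch.nn.functional as F
    from torch.distributions import Normal

    def sample_action(mean, log_std):
        std = log_std.clamp(-20, 2).exp()            # keep the std in a sane range
        normal = Normal(mean, std)
        u = normal.rsample()                         # reparameterized pre-squash sample
        action = torch.tanh(u)                       # squash into [-1, 1]
        # Change of variables: log pi(a|s) = log N(u) - sum_i log(1 - tanh(u_i)^2),
        # using the stable identity log(1 - tanh(u)^2) = 2*(log 2 - u - softplus(-2u)).
        log_prob = normal.log_prob(u) - 2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))
        return action, log_prob.sum(dim=-1, keepdim=True)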

Proximal Policy Optimization (PPO)

(7_ppo.ipynb)

  • Core Idea: An on-policy actor-critic method that simplifies TRPO's constrained update using a clipped surrogate objective. Allows multiple epochs of updates on collected data for better sample efficiency.
  • Mathematical Formulation:
    • Ratio: $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$
    • Clipped Objective (minimize negative): $L^{CLIP}(\theta) = -\mathbb{E}_t [ \min( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t ) ]$
    • Often includes value loss $L^{VF}$ and entropy bonus $S$: $L = L^{CLIP} + c_1 L^{VF} - c_2 S$.
  • Pseudocode:
    1. Initialize actor $\pi_\theta$, critic $V_\phi$.
    2. For each iteration:
      • Collect batch of trajectories using $\pi_{\theta_{old}}$. Store states, actions, rewards, dones, old log probs.
      • Compute advantages $\hat{A}_t$ (GAE) and returns $R_t$.
      • For K epochs:
        • For each mini-batch in collected data:
          • Calculate policy ratio $r_t(\theta)$.
          • Compute clipped surrogate loss $L^{CLIP}$.
          • Compute value loss $L^{VF}$.
          • Compute entropy bonus $S$.
          • Compute combined loss $L$.
          • Update $\theta$ and $\phi$ using gradient descent on $L$.
  • Code Snippet:
    # Inside PPO update loop (for one epoch/minibatch)
    policy_dist = actor(states)
    log_probs_new = policy_dist.log_prob(actions)
    entropy = policy_dist.entropy().mean()
    values_pred = critic(states).squeeze()
    
    # Calculate ratio
    ratio = torch.exp(log_probs_new - log_probs_old)
    
    # Calculate policy loss
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - ppo_clip_epsilon, 1.0 + ppo_clip_epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean() - entropy_coeff * entropy
    
    # Calculate value loss
    value_loss = F.mse_loss(values_pred, returns_to_go)
    
    # Update networks (typically combined loss or separate updates)
    # ... optimizer steps ...
  • Key Hyperparameters: clip_epsilon, gamma, lambda (GAE), learning rates, num_epochs, mini_batch_size, value_loss_coeff, entropy_coeff.
  • Pros: Simpler than TRPO, stable updates, good performance (often SOTA or near-SOTA), relatively sample efficient for an on-policy method.
  • Cons: Still on-policy (less efficient than off-policy), performance sensitive to implementation details and hyperparameters.
  • Common Pitfalls: Advantage/observation normalization, learning rate schedule, choice of $\epsilon$.
  • Use Cases: Default choice for many discrete/continuous control tasks, RLHF for LLMs.

Value-Based Deep Methods

Deep Q-Networks (DQN)

(13_dqn.ipynb)

  • Core Idea: Combines Q-learning with a deep neural network to approximate $Q(s, a; \theta)$. Uses Experience Replay and Target Networks for stability. Off-policy.
  • Mathematical Formulation: Minimizes TD error using target network $Q'$: [ L(\theta) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}} [ (y - Q(s, a; \theta))^2 ] ] [ y = r + \gamma (1-d) \max_{a'} Q'(s', a'; \theta^{-}) ]
    • $\mathcal{D}$: Replay buffer. $\theta^{-}$: Target network parameters.
  • Pseudocode:
    1. Initialize Q-network $Q_\theta$, target network $Q'_{\theta^-}$, replay buffer $\mathcal{D}$.
    2. For each episode:
      • For each step:
        • Choose action a using $\epsilon$-greedy on $Q_\theta$.
        • Execute a, get r, s'. Store $(s, a, r, s', done)$ in $\mathcal{D}$.
        • Sample mini-batch from $\mathcal{D}$.
        • Compute target y using $Q'_{\theta^-}$.
        • Update $Q_\theta$ by minimizing loss $L(\theta)$.
        • Periodically update target network: $\theta^- \leftarrow \theta$.
        • s = s'
  • Code Snippet:
    # DQN Optimization Step
    non_final_mask = torch.tensor(...) # Mask for non-terminal next states
    non_final_next_states = torch.cat(...)
    state_batch, action_batch, reward_batch, done_batch = ... # From replay buffer
    
    state_action_values = policy_net(state_batch).gather(1, action_batch)
    
    next_state_values = torch.zeros(batch_size, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]
        
    expected_state_action_values = (next_state_values * gamma) + reward_batch
    
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  • Key Hyperparameters: buffer_size, batch_size, gamma, tau or target_update_freq, learning_rate, epsilon schedule.
  • Pros: Handles high-dimensional states (e.g., pixels), off-policy sample efficiency, stable due to replay/target nets.
  • Cons: Primarily for discrete actions, can overestimate Q-values, sensitive to hyperparameters.
  • Common Pitfalls: Target network updates, buffer management, hyperparameter tuning.
  • Use Cases: Atari games from pixels, tasks with discrete actions and large state spaces.
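
The DQN, DDPG, and SAC snippets all sample mini-batches from a replay buffer; a minimal sketch is:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity):
            self.buffer = deque(maxlen=capacity)     # old transitions are dropped automatically

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.buffer)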

Multi-Agent RL (MARL)

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

(14_maddpg.ipynb)

  • Core Idea: Extends DDPG to multi-agent settings using the "centralized training, decentralized execution" paradigm. Each agent has an actor and a centralized critic that observes joint states/observations and actions. Off-policy.
  • Mathematical Formulation:
    • Centralized Critic $Q_i(x, a_1, ..., a_N; \phi_i)$ for agent $i$. $x = (o_1, ..., o_N)$.
    • Critic Update: Minimize $L(\phi_i) = \mathbb{E} [ (y_i - Q_i(x, \mathbf{a}))^2 ]$ where $y_i = r_i + \gamma Q'_i(x', \mu'_1(o'_1), ..., \mu'_N(o'_N))$.
    • Actor Update: Minimize $L(\theta_i) = - \mathbb{E} [ Q_i(x, \mu_1(o_1), ..., \mu_N(o_N)) ]$ (using main critic $Q_i$).
  • Pseudocode:
    1. Initialize actors $\mu_i$, critics $Q_i$, targets $\mu'_i, Q'_i$, replay buffer $\mathcal{D}$.
    2. For each step:
      • Each agent $i$ chooses $a_i = \mu_i(o_i) + \text{Noise}$.
      • Execute joint action $\mathbf{a}$, get $r=(r_1,...), o'=(o'_1,...)$. Store $(o, \mathbf{a}, r, o')$ in $\mathcal{D}$.
      • Sample mini-batch from $\mathcal{D}$.
      • For each agent $i$: Update critic $Q_i$ and actor $\mu_i$.
      • Soft-update all target networks.
  • Code Snippet: (Conceptual - Update involves joint info)
    # Critic Update (Agent i)
    with torch.no_grad():
        target_actions_next = [target_actors[j](next_obs_batch[:, j]) for j in range(num_agents)]
        target_q_next = target_critics[i](joint_next_obs_batch, torch.cat(target_actions_next, dim=1))
        y_i = rewards_batch[:, i] + gamma * (1 - dones_batch[:, i]) * target_q_next
    current_q_i = critics[i](joint_obs_batch, joint_actions_batch) # Actions from buffer
    critic_loss_i = F.mse_loss(current_q_i, y_i)
    # ... optimize critic i ...
    
    # Actor Update (Agent i)
    current_actions_policy = [actors[j](obs_batch[:, j]) for j in range(num_agents)]
    # Need grads only for actor i's action output
    current_actions_policy[i] = actors[i](obs_batch[:, i]) # Ensure grad enabled if needed
    q_actor_loss = critics[i](joint_obs_batch, torch.cat(current_actions_policy, dim=1))
    actor_loss_i = -q_actor_loss.mean()
    # ... optimize actor i ...
  • Key Hyperparameters: Similar to DDPG, but potentially per-agent. Buffer size, batch size, $\gamma, \tau$, learning rates, noise.
  • Pros: Addresses non-stationarity in MARL, decentralized execution, handles mixed cooperative/competitive settings.
  • Cons: Centralized critic scales poorly with many agents, credit assignment can be hard in cooperative settings.
  • Use Cases: Multi-robot coordination, predator-prey, cooperative navigation.

QMIX (Monotonic Value Function Factorization)

(15_qmix.ipynb)

  • Core Idea: A value-based MARL algorithm for cooperative tasks. Learns individual agent Q-functions $Q_i$ and mixes them monotonically using a mixing network conditioned on the global state to produce $Q_{tot}$. Off-policy, centralized training, decentralized execution.
  • Mathematical Formulation:
    • $Q_{tot}(x, \mathbf{a}) = f_{mix}(Q_1(o_1, a_1), ..., Q_N(o_N, a_N); x)$
    • Constraint: $\frac{\partial Q_{tot}}{\partial Q_i} \ge 0$ (enforced by non-negative mixer weights, often via hypernetworks).
    • Loss: Minimize TD error on $Q_{tot}$: $L = \mathbb{E} [ (y - Q_{tot}(x, \mathbf{a}))^2 ]$ where $y = r + \gamma Q'_{tot}(x', \mathbf{a}')$ with $a'_i = \arg\max_a Q'_i(o'_i, a)$.
  • Pseudocode:
    1. Initialize agent networks $Q_i$, target networks $Q'_i$, mixer $f_{mix}$, target mixer $f'_{mix}$, replay buffer $\mathcal{D}$.
    2. For each step:
      • Each agent $i$ chooses $a_i$ using $\epsilon$-greedy on $Q_i(o_i)$.
      • Execute $\mathbf{a}$, get $r$, $o'$, $x'$. Store $(o, \mathbf{a}, r, o', x, x', done)$ in $\mathcal{D}$.
      • Sample mini-batch.
      • Calculate target $y$ using target networks $Q'_i$ and target mixer $f'_{mix}$.
      • Calculate current $Q_{tot}$ using main networks $Q_i$ and main mixer $f_{mix}$.
      • Compute loss $L$.
      • Update all $Q_i$ and $f_{mix}$ parameters via gradient descent on $L$.
      • Soft-update target networks.
  • Code Snippet:
    # Calculate Target Q_tot'
    with torch.no_grad():
        # ... Get max Q'_i for each agent i in next state ...
        target_agent_qs = torch.cat(...) # Shape (batch, num_agents)
        q_tot_target = target_mixer(target_agent_qs, next_global_state_batch)
        y = reward_batch + gamma * (1 - done_batch) * q_tot_target
    
    # Calculate Current Q_tot
    # ... Get Q_i for the action *taken* by each agent i in current state ...
    current_agent_qs = torch.cat(...) # Shape (batch, num_agents)
    q_tot_current = mixer(current_agent_qs, global_state_batch)
    
    # Loss and Optimize
    loss = F.mse_loss(q_tot_current, y)
    optimizer.zero_grad()
    loss.backward() # Gradients flow back to all agent nets and mixer
    optimizer.step()
    # ... Soft update targets ...
  • Key Hyperparameters: Like DQN, plus mixing network architecture, hypernetwork details.
  • Pros: Good for cooperative tasks, enforces the Individual-Global-Max (IGM) consistency (per-agent greedy actions yield the jointly greedy action), scales better in action space than joint Q-learning.
  • Cons: Limited representational power due to monotonicity, requires global state for mixer.
  • Use Cases: Cooperative MARL (e.g., SMAC benchmark), resource allocation.
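
A minimal sketch of the monotonic mixing network described above, with hypernetworks generating the mixer weights from the global state and an absolute value enforcing non-negativity (layer sizes and names are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QMixer(nn.Module):
        def __init__(self, num_agents, state_dim, embed_dim=32):
            super().__init__()
            self.n, self.embed = num_agents, embed_dim
            self.hyper_w1 = nn.Linear(state_dim, num_agents * embed_dim)   # layer-1 weights
            self.hyper_b1 = nn.Linear(state_dim, embed_dim)                # layer-1 bias
            self.hyper_w2 = nn.Linear(state_dim, embed_dim)                # layer-2 weights
            self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, 1))         # layer-2 bias

        def forward(self, agent_qs, state):
            # agent_qs: (batch, num_agents), state: (batch, state_dim)
            b = agent_qs.size(0)
            w1 = torch.abs(self.hyper_w1(state)).view(b, self.n, self.embed)  # non-negative -> monotonic
            b1 = self.hyper_b1(state).view(b, 1, self.embed)
            hidden = F.elu(torch.bmm(agent_qs.view(b, 1, self.n), w1) + b1)
            w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed, 1)
            b2 = self.hyper_b2(state).view(b, 1, 1)
            return (torch.bmm(hidden, w2) + b2).view(b, 1)                    # Q_tot per sample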

Hierarchical RL (HRL)

Hierarchical Actor-Critic (HAC)

(16_hac.ipynb)

  • Core Idea: Learns policies at multiple levels of abstraction. High levels set subgoals (as "actions") for lower levels, which execute primitive actions to achieve them within a time limit $H$. Uses intrinsic rewards and hindsight for learning subgoals. Off-policy.
  • Mathematical Formulation (Conceptual):
    • Level $i$: Policy $\pi_i(a_i | s, g_i)$, Q-function $Q_i(s, a_i, g_i)$. $a_i$ is subgoal for level $i-1$.
    • Low Level (0): Learns using intrinsic reward $r_{int}$ (success/fail to reach $g_0$).
    • High Level (1): Learns using environment reward $R = \sum r_{env}$.
    • Hindsight: Relabel transitions with achieved states as goals, granting artificial success.
  • Pseudocode (2-Level):
    1. Initialize networks $Q_0, Q'_0, Q_1, Q'_1$, buffers $D_0, D_1$.
    2. For episode:
      • s = env.reset()
      • While overall goal G not reached:
        • High level chooses subgoal g_0 = select_action(level=1, state=s, goal=G).
        • transitions = [], total_env_reward = 0
        • s_start = s
        • For h from 1 to H:
          • Low level chooses primitive action a = select_action(level=0, state=s, goal=g_0).
          • Take a, get r_env, s_next, env_done.
          • total_env_reward += r_env
          • r_int = get_intrinsic_reward(s_next, g_0)
          • Store low-level tuple (s, a, r_int, s_next, g_0, env_done or test_goal(s_next,g_0), achieved_goal=s_next) in transitions.
          • s = s_next
          • If test_goal(s, g_0) or env_done: break inner loop.
        • s_end = s
        • Store high-level tuple (s_start, g_0, total_env_reward, s_end, G, env_done) in transitions.
        • Add transitions to buffers with hindsight relabeling.
        • Update $Q_0, Q_1$ from buffers. Update targets.
        • If env_done: break outer loop.
  • Code Snippet: (Focus on hindsight and intrinsic reward)
    # Inside low-level execution loop
    # ... execute action a, get next_state_norm, env_done ...
    intrinsic_reward = -1.0 # Default failure
    subgoal_achieved = self._test_goal(next_state_norm, goal_norm)
    if subgoal_achieved:
        intrinsic_reward = 0.0 # Success reward
        
    # Store original transition
    buffer.push(..., reward=intrinsic_reward, goal=goal_norm, done=env_done or subgoal_achieved, achieved_goal=next_state_norm, level=0)
    
    # Hindsight is handled in buffer.sample() by replacing goal with achieved_goal
    # and setting reward/done accordingly for the hindsight sample.
  • Key Hyperparameters: Number of levels, time limit H, learning rates, gamma, tau, epsilon schedule, buffer sizes, hindsight probability p.
  • Pros: Can solve long-horizon/sparse reward tasks, structured exploration, potential for skill reuse.
  • Cons: Very complex implementation, sensitive to goal definition, time limits, and hyperparameters, potential for suboptimal subgoal setting. The notebook implementation has known issues.
  • Common Pitfalls: Hindsight logic, intrinsic reward definition, goal feasibility, tuning H.
  • Use Cases: Complex robotics tasks, long-horizon planning, navigation.

Planning & Model-Based Methods

Monte Carlo Tree Search (MCTS)

(17_mcts.ipynb)

  • Core Idea: An online planning algorithm that builds a search tree using simulated trajectories (rollouts) from the current state. Uses statistics (visit counts, values) and UCT to balance exploration/exploitation within the search tree. Requires a simulator/model.
  • Mathematical Formulation:
    • Tree Nodes store state $s$, visit count $N(s)$, total value $W(s)$.
    • Edges store action $a$, count $N(s, a)$, value $W(s, a)$.
    • Selection policy (UCT): Choose action $a$ maximizing $Q(s, a) + C \sqrt{\frac{\ln N(s)}{N(s, a)}}$.
  • Pseudocode (Single MCTS step for action selection):
    1. Initialize tree with root node = current state $s_{root}$.
    2. Repeat for N simulations:
      • node = root_node
      • Selection: While node is fully expanded and not terminal, node = select_best_child_uct(node).
      • Expansion: If node not fully expanded and not terminal, expand one child node = expand_node(node).
      • Simulation: Run rollout from node's state using default policy, get reward R.
      • Backpropagation: Update N and W for nodes/edges from node back up to root using R.
    3. Choose best action from root based on visit counts (or values).
  • Code Snippet: (UCT Selection)
    # Inside select_best_child_uct
    best_score = -float('inf')
    best_child = None
    for action, child in node.children.items():
        if child.visit_count == 0:
            uct_score = float('inf')
        else:
            exploit = child.total_value / child.visit_count
            explore = exploration_constant * math.sqrt(math.log(node.visit_count) / child.visit_count)
            uct_score = exploit + explore
        # ... update best_child ...
    return best_child
  • Key Hyperparameters: num_simulations (budget per step), exploration_constant C, rollout_depth, gamma (for rollouts).
  • Pros: Anytime algorithm, handles large state/action spaces, no explicit value function needed for search, asymmetric tree growth.
  • Cons: Requires a simulator/model, computationally intensive per step, rollout policy quality affects performance.
  • Common Pitfalls: Tuning C, efficient implementation of steps, quality of rollouts.
  • Use Cases: Game playing (Go, Chess), planning problems with simulators.
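
To complement the UCT selection snippet above, here is a minimal sketch of a tree node and the backpropagation step (the bookkeeping is illustrative, not the notebook's exact structure):

    class Node:
        def __init__(self, state, parent=None, action=None):
            self.state, self.parent, self.action = state, parent, action
            self.children = {}          # action -> child Node
            self.visit_count = 0
            self.total_value = 0.0

        def is_fully_expanded(self, legal_actions):
            return all(a in self.children for a in legal_actions)

    def backpropagate(node, rollout_return):
        # Update visit counts and accumulated value from the expanded node back up to the root
        while node is not None:
            node.visit_count += 1
            node.total_value += rollout_return
            node = node.parent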

PlaNet (Deep Planning Network)

(18_planet.ipynb)

  • Core Idea: A model-based RL agent that learns a latent dynamics model (often an RSSM) directly from experience (potentially high-dimensional observations). It then performs planning directly in the latent space using algorithms like CEM to select actions. Off-policy.
  • Mathematical Formulation (Conceptual):
    • Learns models: $p(s_{t+1}|s_t, a_t)$ (transition), $p(r_t|s_t)$ (reward), possibly $p(o_t|s_t)$ (observation/reconstruction), $q(s_t|...)$ (encoder).
    • Model Loss: Maximize data likelihood (often via ELBO, including reconstruction, reward prediction, KL regularization terms). Simplified: MSE on next state & reward.
    • Planning (CEM): Optimize $\mathbb{E} [ \sum_{k=t}^{t+H-1} \gamma^{k-t} \hat{r}_k ]$ over action sequences $a_t, \dots, a_{t+H-1}$ using the learned latent model $(\hat{p}, \hat{r})$.
  • Pseudocode:
    1. Initialize latent dynamics model, replay buffer $\mathcal{D}$ (stores sequences).
    2. Loop:
      • Interact: Observe $s_t$. Plan action $a_t$ using CEM in latent space with current model. Execute $a_t$, get $r_t, s_{t+1}$. Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
      • Train Model: Sample sequences from $\mathcal{D}$. Update model parameters to minimize prediction/reconstruction losses.
  • Code Snippet: (CEM Planning Call; a CEM sketch follows this section)
    # Inside main loop
    # state = current latent state representation
    action = cem_planner(
        dynamics_model, state, 
        horizon=PLANNING_HORIZON, 
        num_candidates=CEM_CANDIDATES, 
        num_elites=CEM_ELITES, 
        num_iterations=CEM_ITERATIONS, 
        gamma=CEM_GAMMA, ...)
    # Execute action in real env...
    # Train model...
  • Key Hyperparameters: Model architecture (latent size, hidden dims), model learning rate, buffer size, sequence length, planning horizon H, CEM parameters (J, M, iterations).
  • Pros: Very sample efficient (especially from images), learns compact world representation, effective planning.
  • Cons: Complex model training, planning can be computationally expensive, model inaccuracies can lead to poor plans (compounding errors).
  • Common Pitfalls: Model convergence, tuning planning horizon vs. model accuracy, computational cost of planning.
  • Use Cases: Control from pixels, sample-constrained robotics tasks, model-based benchmarks.
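
The snippet above calls cem_planner; a minimal CEM sketch is shown below. The dynamics_model.step(state, action) -> (next_state, reward) interface is an assumption for illustration, not the notebook's actual API.

    import torch

    def cem_planner(dynamics_model, state, action_dim, horizon=12, num_candidates=1000,
                    num_elites=100, num_iterations=10, gamma=0.99):
        # state: 1-D latent state tensor; the model is assumed to accept batched inputs
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)
        for _ in range(num_iterations):
            # Sample candidate action sequences from the current Gaussian
            actions = mean + std * torch.randn(num_candidates, horizon, action_dim)
            returns = torch.zeros(num_candidates)
            s = state.unsqueeze(0).expand(num_candidates, -1)
            for t in range(horizon):
                s, r = dynamics_model.step(s, actions[:, t])   # predicted next latent state and reward
                returns += (gamma ** t) * r.squeeze(-1)
            # Refit the Gaussian to the elite action sequences
            elite_idx = returns.topk(num_elites).indices
            elites = actions[elite_idx]
            mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
        return mean[0]                                         # execute only the first planned action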

Pro Tip: Standardizing advantages (in policy gradient / actor-critic methods) or returns (in REINFORCE) often significantly stabilizes training. This involves subtracting the mean and dividing by the standard deviation of the advantages/returns within a batch or episode.

Did You Know? Many state-of-the-art algorithms combine ideas. For example, PPO and A2C are actor-critic methods using policy gradients. SAC combines actor-critic with maximum entropy RL. PlaNet combines model-based learning with CEM planning.


This cheat sheet provides a high-level overview. For detailed implementation and nuances, refer to the specific notebooks in the repository and the original research papers. Good luck learning!