lilianweng · Atcold · May 25, 2022 · May 27, 2022
diff --git a/posts/2021-07-11-diffusion-models/index.html b/posts/2021-07-11-diffusion-models/index.html
@@ -300,7 +300,7 @@ <h1 class="post-title">
 <h1 id="what-are-diffusion-models">What are Diffusion Models?<a hidden class="anchor" aria-hidden="true" href="#what-are-diffusion-models">#</a></h1>
 <p>Several diffusion-based generative models have been proposed with similar ideas underneath, including <em>diffusion probabilistic models</em> (<a href="https://arxiv.org/abs/1503.03585">Sohl-Dickstein et al., 2015</a>), <em>noise-conditioned score network</em> (<strong>NCSN</strong>; <a href="https://arxiv.org/abs/1907.05600">Yang &amp; Ermon, 2019</a>), and <em>denoising diffusion probabilistic models</em> (<strong>DDPM</strong>; <a href="https://arxiv.org/abs/2006.11239">Ho et al. 2020</a>).</p>
 <h2 id="forward-diffusion-process">Forward diffusion process<a hidden class="anchor" aria-hidden="true" href="#forward-diffusion-process">#</a></h2>
-<p>Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, let us define a <em>forward diffusion process</em> in which we add small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^t$.</p>
+<p>Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, let us define a <em>forward diffusion process</em> in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$.</p>
 <div>
 $$
 q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad
@@ -329,7 +329,7 @@ <h3 id="connection-with-stochastic-gradient-langevin-dynamics">Connection with s
 <p>Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, <em>stochastic gradient Langevin dynamics</em> (<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.363">Welling &amp; Teh 2011</a>) can produce samples from a probability density $p(\mathbf{x})$ using only the gradients $\nabla_\mathbf{x} \log p(\mathbf{x})$ in a Markov chain of updates:</p>
 <div>
 $$
-\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\epsilon}{2} \nabla_\mathbf{x} p(\mathbf{x}_{t-1}) + \sqrt{\epsilon} \mathbf{z}_t
+\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\epsilon}{2} \nabla_\mathbf{x} \log p(\mathbf{x}_{t-1}) + \sqrt{\epsilon} \mathbf{z}_t
 ,\quad\text{where }
 \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
 $$
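
As a quick sanity check on the second hunk, the corrected update (with the score $\nabla_\mathbf{x} \log p$ rather than $\nabla_\mathbf{x} p$) can be sketched numerically. This is only a minimal illustration, not code from the post: the names `langevin_sample` and `standard_normal_score` are hypothetical, and the target is assumed to be a standard normal, whose score is $-\mathbf{x}$. With a small fixed $\epsilon$ the chain should settle close to zero mean and unit variance.

```python
import numpy as np


def langevin_sample(score_fn, x0, step_size, n_steps, rng):
    """Iterate x_t = x_{t-1} + (eps / 2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)  # z_t ~ N(0, I)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x


def standard_normal_score(x):
    # Assumption for illustration: p(x) = N(0, I), so grad_x log p(x) = -x.
    return -x


rng = np.random.default_rng(0)
# 5000 independent one-dimensional chains, all started far from the mode at 3.0.
samples = langevin_sample(
    standard_normal_score,
    x0=np.full(5000, 3.0),
    step_size=0.01,
    n_steps=2000,
    rng=rng,
)
print(samples.mean(), samples.std())
```

Using $\nabla_\mathbf{x} p$ instead of the score would shrink the drift term wherever the density is small, so chains started in the tails would drift back much more slowly. This is why the `log` matters in the formula.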