8.3 Diffusion Models

One desideratum that has emerged from our investigation of directed generative models is for the distribution of latent variables to be essentially structureless. For one thing, this makes it easy to sample the latent variables, and therefore to generate data (since directed models can be sampled with a single “ancestral pass” from root to leaves). For another, it accords with our high-level conceptual goal of explaining observations in terms of a simpler set of independent causes.

Seen from the perspective of the recognition model, this desideratum presents a paradoxical character: it seems that the corresponding goal for the recognition model is to destroy structure. Clearly, it cannot do so in an irreversible way (e.g., by multiplying the observations by zero and adding noise), or there will be no mutual information between observed and latent variables. However, there exists a class of models, known in statistical physics as diffusion processes, which gradually delete structure from distributions but which are nevertheless reversible [46].

A diffusion process can be described by a (very long) Markov chain that (very mildly) corrupts the data at every step. Now, for sufficiently mild data corruption, the process is reversible [46]. Still, we cannot simply apply Bayes’ rule, since it requires the prior distribution over the original, uncorrupted data—i.e., the data distribution, precisely what we want to sample from. On the other hand, it turns out that each reverse-diffusion step must take the same distributional form as the forward-diffusion step [46]. We still don’t know how to convert noisy observations into the parameters of this distribution, but perhaps this mapping can be learned.

In particular, suppose we pair a recognition model describing such a diffusion process with a generative model that is a Markov chain of the same length, and with the same form for its conditional distributions, but pointing in the other direction. Then training the generative model to assign high probability to the observation—or more precisely, lower the joint relative entropy—while making inferences under the recognition model will effectively oblige the generative model to learn to denoise the data at every step. That is, it will become a model of the reverse-diffusion process.

Notice that for this process to be truly reversible, the dimensionality of the data must stay constant across all steps of the Markov chain (including the observations themselves). Also note that, as lately described, the recognition distribution is fixed and requires no learnable parameters. We revisit this idea below.

The generative and recognition models.

The model can be expressed most elegantly if we use $\bm{\check{X}}_0$ for the observed data $\bm{Y}$, and likewise $\bm{\hat{X}}_0$ for their counterparts in the generative model, $\bm{\hat{Y}}$—so in this section we do. Then the diffusion generative model can be written simply as

$$\hat{p}(\bm{\hat{x}}_1,\ldots,\bm{\hat{x}}_L,\bm{\hat{x}}_0;\bm{\theta}) = \hat{p}(\bm{\hat{x}}_L)\prod_{l=1}^{L}\hat{p}(\bm{\hat{x}}_{l-1}\mid\bm{\hat{x}}_l;\bm{\theta}); \qquad (8.30)$$

i.e., a Markov chain. Notice that the generative prior is not parameterized. This accords with the intuition lately discussed that the model should convert structureless noise into the highly structured distribution of interest. The recognition model likewise simplifies according to the independence statements for a Markov chain:

$$\check{p}(\bm{\check{x}}_1,\ldots,\bm{\check{x}}_L\mid\bm{\check{x}}_0) = \prod_{l=1}^{L}\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_{l-1}). \qquad (8.31)$$

We have omitted the usual $\bm{\phi}$ from this model because by assumption it has no learnable parameters.
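To make the two chains concrete before any distributional forms are chosen, here is a minimal Python sketch of ancestral sampling in each direction. The transition samplers (`corrupt_step`, `denoise_step`) and the toy stand-ins at the bottom are hypothetical placeholders, not part of the model specification:

```python
import numpy as np

# A minimal sketch of the two Markov chains in Eqs. 8.30-8.31, with the
# transition samplers left abstract (they are specified only in Section 8.3.1).
# `corrupt_step` and `denoise_step` are hypothetical callables standing in for
# samplers of p-check(x_l | x_{l-1}) and p-hat(x_{l-1} | x_l; theta).

def run_recognition_chain(x0, corrupt_step, L):
    """Sample (x_1, ..., x_L) ~ p-check( . | x_0) by ancestral sampling."""
    xs = [x0]
    for l in range(1, L + 1):
        xs.append(corrupt_step(xs[-1], l))
    return xs[1:]

def run_generative_chain(sample_prior, denoise_step, L):
    """Sample x_0 ~ p-hat by starting at x_L ~ p-hat(x_L) and denoising."""
    x = sample_prior()
    for l in range(L, 0, -1):
        x = denoise_step(x, l)          # one reverse-diffusion step
    return x

# Example with trivial stand-ins: Gaussian corruption, identity "denoiser".
rng = np.random.default_rng(0)
L, dim = 10, 2
xs = run_recognition_chain(
    np.ones(dim),
    corrupt_step=lambda x, l: 0.95 * x + 0.1 * rng.standard_normal(dim),
    L=L,
)
x_gen = run_generative_chain(
    sample_prior=lambda: rng.standard_normal(dim),
    denoise_step=lambda x, l: x,        # placeholder; a trained network in practice
    L=L,
)
```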

It remains, of course, to specify distributional forms for the factors in Eqs. 8.30 and 8.31. We explore one choice below, but first derive the loss for the more general case.

The joint relative entropy.

The joint relative entropy, we recall once again, is the difference between the joint cross entropy (between the hybrid distribution and the generative model) and the entropy of the hybrid distribution, $\check{p}(\bm{\check{x}}\mid\bm{\check{x}}_0)\,p(\bm{\check{x}}_0)$. However, in the diffusion model, the recognition model is parameterless, so the entire entropy term is a constant as far as our optimization goes. That is, our optimization is essentially “all M step” (improving the generative model), and therefore can be carried out on the joint cross entropy. This would be a mistake if the recognition model were (as usual) to be interpreted as merely an approximation to the posterior under the generative model. In the diffusion model, in contrast, the recognition model is interpreted as ground truth, for which the generative model provides the approximation. The overall loss can be reduced by improving the generative fit either to the data or to the recognition model, but in this case both are desirable per se.

Given the conditional independencies of a Markov chain, then, the joint relative entropy reduces to a sum of conditional cross entropies (plus constant terms):

$$\begin{aligned}
\mathcal{L}_{\text{JRE}}(\bm{\theta}) &:= \mathbb{E}_{\bm{\check{X}}_1,\ldots,\bm{\check{X}}_L,\bm{\check{X}}_0}\!\left[\log\check{p}\!\left(\bm{\check{X}}_1,\ldots,\bm{\check{X}}_L\mid\bm{\check{X}}_0\right) - \log\hat{p}\!\left(\bm{\check{X}}_1,\ldots,\bm{\check{X}}_L,\bm{\check{X}}_0;\bm{\theta}\right)\right] + c\\
&= \mathbb{E}_{\bm{\check{X}}_1,\ldots,\bm{\check{X}}_L,\bm{\check{X}}_0}\!\left[-\log\hat{p}\!\left(\bm{\check{X}}_L\right) - \sum_{l=1}^{L}\log\hat{p}\!\left(\bm{\check{X}}_{l-1}\mid\bm{\check{X}}_l;\bm{\theta}\right)\right] + c\\
&= -\sum_{l=1}^{L}\int_{\bm{\check{x}}_0}\int_{\bm{\check{x}}_{l-1}}\int_{\bm{\check{x}}_l}\check{p}\!\left(\bm{\check{x}}_l,\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_0\right)\log\hat{p}\!\left(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l;\bm{\theta}\right)\mathrm{d}\bm{\check{x}}_l\,\mathrm{d}\bm{\check{x}}_{l-1}\,p\!\left(\bm{\check{x}}_0\right)\mathrm{d}\bm{\check{x}}_0 + c.
\end{aligned} \qquad (8.32)$$

The $c$ denotes different constants on different lines. Now we need to choose generative and recognition models that make the integrals (or sums) in these cross entropies tractable.

8.3.1 Gaussian diffusion models

Probably the most intuitive diffusion process is based on Gaussian noise; for example,

$$\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_{l-1}) := \mathcal{N}\!\left(\beta_l\bm{\check{x}}_{l-1},\;\gamma_l^2\mathbf{I}\right)$$

for some parameters $\beta_l$ and $\gamma_l$. Essentially, the process scales (down) the data and adds isotropic noise. However, we defer specifying these parameters for the moment, and turn directly to the generative model. Suffice it to say, if $\beta_l$ is sufficiently close to 1 and $\gamma_l$ is sufficiently small, then the generative transitions are likewise Gaussian (see the discussion above). Furthermore, after many diffusion steps, the distribution of the state will be Gaussian and isotropic. With the appropriate selection of the recognition parameters, we can force this distribution to have zero mean and unit variance. Therefore we define the generative model to be

$$\hat{p}(\bm{\hat{x}}_L) := \mathcal{N}(\bm{0},\,\mathbf{I}), \qquad \hat{p}(\bm{\hat{x}}_{l-1}\mid\bm{\hat{x}}_l;\bm{\theta}) := \mathcal{N}\!\left(\bm{\mu}(\bm{\hat{x}}_l,l,\bm{\theta}),\;\sigma^2(\bm{\hat{x}}_l,l,\bm{\theta})\,\mathbf{I}\right). \qquad (8.33)$$

The mean and variance of this denoising distribution can depend on the corrupted sample ($\bm{\check{x}}_l$) in a complicated way, so in general we can let $\bm{\mu}$ and $\sigma^2$ be neural networks. Nevertheless, for simplicity in the derivation, and because learning variances is significantly more challenging than learning means, let us further replace the variance function $\sigma^2(\bm{\hat{x}}_l,l,\bm{\theta})$ with a set of $L$ fixed (i.e., not learned), data-independent parameters, $\sigma^2_l$. (To save space, we also write the mean function as $\bm{\mu}(\bm{\hat{x}}_l,l,\bm{\theta}) = \bm{\mu}_l$, but it certainly does depend on the data and the generative-model parameters.) Then the joint relative entropy (Eq. 8.32) can be expressed as

$$\mathcal{L}_{\text{JRE}}(\bm{\theta}) = \sum_{l=1}^{L}\frac{1}{2\sigma^2_l}\int_{\bm{\check{x}}_0}\int_{\bm{\check{x}}_{l-1}}\int_{\bm{\check{x}}_l}\check{p}\!\left(\bm{\check{x}}_l,\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_0\right)p\!\left(\bm{\check{x}}_0\right)\left\lVert\bm{\check{x}}_{l-1}-\bm{\mu}_l\right\rVert^2\mathrm{d}\bm{\check{x}}_l\,\mathrm{d}\bm{\check{x}}_{l-1}\,\mathrm{d}\bm{\check{x}}_0 + c. \qquad (8.34)$$

The interpretation is now clear. Minimizing the joint relative entropy obliges the generative model to learn how to “undo” one step of corruption with Gaussian noise, for all steps $l\in[1,L]$. At the model’s disposal is the arbitrarily powerful function (neural network) $\bm{\mu}(\bm{\hat{x}}_l,l,\bm{\theta})$. Since the amount of corruption can vary with $l$, optimizing Eq. 8.34 obliges $\bm{\mu}$ to be able to remove noise of possibly different sizes.
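A naive Monte Carlo implementation of this objective might look like the following sketch, in which the forward chain is run sequentially and the network $\bm{\mu}$ is replaced by a hypothetical placeholder (`mu_hat`); the schedule and toy data distribution are likewise illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Naive Monte Carlo estimate of the sum in Eq. 8.34, using the Gaussian
# forward kernel of this section: x_l = beta_l * x_{l-1} + gamma_l * noise.
# `mu_hat` is a hypothetical stand-in for the neural network mu(x_l, l, theta).

L, dim = 50, 2
beta = np.full(L + 1, 0.99)             # beta_1..beta_L (index 0 unused)
gamma = np.full(L + 1, 0.1)
sigma2 = gamma**2                       # a simple fixed choice for sigma_l^2

def mu_hat(x_l, l):
    # placeholder denoising mean; in practice a trained network
    return x_l / beta[l]

def sample_data(n):
    return rng.standard_normal((n, dim)) + 3.0   # toy data distribution

def mc_joint_cross_entropy(n_samples=200):
    total = 0.0
    for _ in range(n_samples):
        x_prev = sample_data(1)[0]               # x_0 ~ data
        for l in range(1, L + 1):                # run the forward chain
            x_l = beta[l] * x_prev + gamma[l] * rng.standard_normal(dim)
            total += np.sum((x_prev - mu_hat(x_l, l))**2) / (2 * sigma2[l])
            x_prev = x_l
    return total / n_samples

print(mc_joint_cross_entropy())
```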

As for the implementation, we see that the integrals in Eq. 8.34 can be approximated with samples: first a draw ($\bm{\check{x}}_0$) from the data distribution, followed by draws ($\bm{\check{x}}_1,\ldots,\bm{\check{x}}_L$) down the length of the recognition model. Notice, however, that under this scheme, each summand would be estimated with samples from three random variables, $\bm{\check{X}}_0,\bm{\check{X}}_{l-1},\bm{\check{X}}_l$. We can reduce this by one, and thereby reduce the variance of our Monte Carlo estimator, by exploiting some properties of Gaussian noise. In particular, we will reverse the order of expansion in applying the chain rule of probability to the recognition model:

$$\check{p}\!\left(\bm{\check{x}}_l,\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_0\right) = \check{p}\!\left(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0\right)\check{p}\!\left(\bm{\check{x}}_l\mid\bm{\check{x}}_0\right). \qquad (8.35)$$

Then we will carry out the expectation under $\check{p}(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0)$ in closed form. In preparation, we now turn to $\check{p}(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0)$ and $\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_0)$.

The recognition marginals.

A very useful upshot of defining the recognition model to consist only of scaling and the addition of Gaussian noise is that the distribution of any random variable under this model is Gaussian (conditioned, that is, on $\bm{\check{x}}_0$). That includes $\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_0)$, which we might call the “recognition marginal.” For reasons that will soon become clear, we make these marginals the starting point of our definition of the recognition model [26], and then work backwards to the transition probabilities:

$$\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_0) := \mathcal{N}\!\left(\upsilon_l\rho_l\bm{\check{x}}_0,\;\upsilon^2_l\mathbf{I}\right). \qquad (8.36)$$

Under this definition, $\rho_l^2$ is a kind of signal-to-noise ratio. We require it to decrease monotonically with $l$.
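In code, Eq. 8.36 is just a one-line sampler; the schedule below (unit marginal variances, linearly decaying SNR) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# One-shot corruption to any step l using the recognition marginal, Eq. 8.36:
# x_l ~ N(upsilon_l * rho_l * x_0, upsilon_l^2 * I).  The schedule below
# (unit marginal variance, linearly decaying SNR) is an illustrative assumption.

L = 1000
upsilon = np.ones(L + 1)                     # upsilon_l^2 = 1 for all l
rho2 = np.linspace(1.0, 1e-4, L + 1)         # rho_l^2 decreases monotonically
rho = np.sqrt(rho2)

def corrupt_to_step(x0, l):
    """Sample x_l | x_0 directly, without running the chain sequentially."""
    return upsilon[l] * (rho[l] * x0 + rng.standard_normal(x0.shape))

x0 = np.array([2.0, -1.0])
print(corrupt_to_step(x0, 10))     # mildly corrupted
print(corrupt_to_step(x0, L))      # nearly standard normal
```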

Consistency with these marginals requires linear-Gaussian transitions:

$$\bm{\check{X}}_m = \beta_{ml}\bm{\check{X}}_l + \gamma_{ml}\bm{\check{Z}}_m, \qquad \bm{\check{Z}}_m \sim \mathcal{N}(\bm{0},\,\mathbf{I}).$$

Note that $m$ and $l$ need not even be consecutive steps in the Markov chain, although we require $m>l$. Furthermore, by the law of total expectation,

$$\upsilon_m\rho_m\bm{\check{x}}_0 = \mathbb{E}_{\bm{\check{X}}_m|\bm{\check{X}}_0}\!\left[\bm{\check{X}}_m\mid\bm{\check{x}}_0\right] = \beta_{ml}\,\mathbb{E}_{\bm{\check{X}}_l|\bm{\check{X}}_0}\!\left[\bm{\check{X}}_l\mid\bm{\check{x}}_0\right] = \upsilon_l\rho_l\beta_{ml}\bm{\check{x}}_0 \implies \beta_{ml} = \frac{\upsilon_m}{\upsilon_l}\frac{\rho_m}{\rho_l}.$$

Likewise, by the law of total covariance,

$$\begin{aligned}
\upsilon^2_m\mathbf{I} &= \text{Cov}_{\bm{\check{X}}_m|\bm{\check{X}}_0}\!\left[\bm{\check{X}}_m\mid\bm{\check{x}}_0\right] = \beta_{ml}^2\,\text{Cov}_{\bm{\check{X}}_l|\bm{\check{X}}_0}\!\left[\bm{\check{X}}_l\mid\bm{\check{x}}_0\right] + \gamma_{ml}^2\mathbf{I} = \left(\upsilon^2_l\beta_{ml}^2 + \gamma_{ml}^2\right)\mathbf{I}\\
\implies \gamma_{ml}^2 &= \upsilon^2_m - \upsilon^2_l\left(\frac{\upsilon_m}{\upsilon_l}\frac{\rho_m}{\rho_l}\right)^2 = \upsilon^2_m\left(1 - \frac{\rho_m^2}{\rho_l^2}\right).
\end{aligned}$$

Notice (what our notation implied) that $\beta_{ml}$ and $\gamma_{ml}$ are necessarily scalars. In fine, the conditional recognition probabilities are given by

$$\check{p}(\bm{\check{x}}_m\mid\bm{\check{x}}_l) = \mathcal{N}\!\left(\frac{\upsilon_m}{\upsilon_l}\frac{\rho_m}{\rho_l}\bm{\check{x}}_l,\;\upsilon^2_m\left(1-\frac{\rho_m^2}{\rho_l^2}\right)\mathbf{I}\right). \qquad (8.37)$$
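The consistency requirement can be checked numerically: corrupting to step $m$ in one shot (Eq. 8.36) and corrupting to step $l$ and then transitioning from $l$ to $m$ (Eq. 8.37) should yield the same mean and variance. A sketch, under the same illustrative schedule as above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Numerical check of consistency: corrupting x_0 to step l via Eq. 8.36 and
# then transitioning l -> m via Eq. 8.37 should match the marginal at m.

L = 1000
upsilon = np.ones(L + 1)
rho = np.sqrt(np.linspace(1.0, 1e-4, L + 1))

def marginal(x0, l, n):
    return upsilon[l] * (rho[l] * x0 + rng.standard_normal((n, x0.size)))

def transition(x_l, l, m):
    beta_ml = (upsilon[m] / upsilon[l]) * (rho[m] / rho[l])
    gamma2_ml = upsilon[m]**2 * (1.0 - rho[m]**2 / rho[l]**2)
    return beta_ml * x_l + np.sqrt(gamma2_ml) * rng.standard_normal(x_l.shape)

x0, l, m, n = np.array([2.0, -1.0]), 100, 400, 200_000
direct = marginal(x0, m, n)                      # x_m | x_0 in one shot
composed = transition(marginal(x0, l, n), l, m)  # x_0 -> x_l -> x_m
print(direct.mean(0), composed.mean(0))          # both approx upsilon_m*rho_m*x0
print(direct.var(0), composed.var(0))            # both approx upsilon_m^2
```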

The recognition “posterior transitions.”

The other recognition distribution we require in order to use Eq. 8.35 is the “reverse transition” $\check{p}(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0)$. Here we again solve for the more generic case of $\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_m,\bm{\check{x}}_0)$, in which the state $l$ precedes $m$ but not necessarily directly. This distribution is again normal (all recognition distributions are), although this time the calculation of the cumulants is slightly more complicated, since it requires Bayes’ rule. Here the “prior” is $\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_0)$ (given by the definition of the recognition marginals, Eq. 8.36); and the “likelihood” (or emission) is $\check{p}(\bm{\check{x}}_m\mid\bm{\check{x}}_l,\bm{\check{x}}_0) = \check{p}(\bm{\check{x}}_m\mid\bm{\check{x}}_l)$ (the resulting conditional recognition probabilities, given by Eq. 8.37). We have worked out the general case of Bayes’ rule for jointly Gaussian random variables in Section 2.1.2. From Eq. 2.14, the posterior precision is the sum of the (unnormalized) prior and likelihood precisions (in the space of $\bm{\check{X}}_l$):

$$\begin{aligned}
\text{Cov}_{\bm{\check{X}}_l|\bm{\check{X}}_m,\bm{\check{X}}_0}\!\left[\bm{\check{X}}_l\mid\bm{\check{x}}_m,\bm{\check{x}}_0\right] &= \left(\frac{1}{\upsilon^2_l} + \frac{\rho_l^2}{\upsilon^2_m(\rho_l^2-\rho_m^2)}\left(\frac{\upsilon_m}{\upsilon_l}\frac{\rho_m}{\rho_l}\right)^2\right)^{-1}\mathbf{I}\\
&= \left(\frac{1}{\upsilon^2_l} + \frac{\rho_m^2}{\upsilon^2_l(\rho_l^2-\rho_m^2)}\right)^{-1}\mathbf{I} = \frac{\upsilon^2_l(\rho_l^2-\rho_m^2)}{\rho_l^2}\mathbf{I}.
\end{aligned}$$

From Eq. 2.13, the posterior mean is a convex combination of the information from the prior and likelihood:

$$\mathbb{E}_{\bm{\check{X}}_l|\bm{\check{X}}_m,\bm{\check{X}}_0}\!\left[\bm{\check{X}}_l\mid\bm{\check{x}}_m,\bm{\check{x}}_0\right] = \frac{\rho_l^2-\rho_m^2}{\rho_l^2}\upsilon_l\rho_l\bm{\check{x}}_0 + \frac{\upsilon^2_l}{\upsilon^2_m}\frac{\upsilon_m}{\upsilon_l}\frac{\rho_m}{\rho_l}\bm{\check{x}}_m = \frac{\upsilon_l}{\rho_l}\left((\rho_l^2-\rho_m^2)\bm{\check{x}}_0 + \frac{\rho_m}{\upsilon_m}\bm{\check{x}}_m\right).$$

Assembling the cumulants, we have

$$\check{p}(\bm{\check{x}}_l\mid\bm{\check{x}}_m,\bm{\check{x}}_0) = \mathcal{N}\!\left(\frac{\upsilon_l}{\rho_l}\left((\rho_l^2-\rho_m^2)\bm{\check{x}}_0 + \frac{\rho_m}{\upsilon_m}\bm{\check{x}}_m\right),\;\frac{\upsilon^2_l(\rho_l^2-\rho_m^2)}{\rho_l^2}\mathbf{I}\right). \qquad (8.38)$$

The reverse-transition cross entropies, revisited.

We noted above that one way to evaluate the joint relative entropy for the diffusion model is to form Monte Carlo estimates of each of the summands in Eq. 8.34. Naïvely, we could evaluate each summand with samples from three random variables ($\bm{\check{X}}_0,\bm{\check{X}}_{l-1},\bm{\check{X}}_l$), but as we also noted, one of the expectations can actually be taken in closed form. In particular, if we expand $\check{p}(\bm{\check{x}}_l,\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_0)$ according to the chain rule of probability given by Eq. 8.35—namely, in terms of the two distributions just derived, Eqs. 8.36 and 8.38 (letting $l$ be $l-1$ and $m$ be $l$)—then we can write

$$\begin{aligned}
\mathcal{L}_{\text{JRE}}(\bm{\theta}) &= \sum_{l=1}^{L}\frac{1}{2\sigma^2_l}\int_{\bm{\check{x}}_0}\int_{\bm{\check{x}}_{l-1}}\int_{\bm{\check{x}}_l}\check{p}\!\left(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0\right)\check{p}\!\left(\bm{\check{x}}_l\mid\bm{\check{x}}_0\right)p\!\left(\bm{\check{x}}_0\right)\left\lVert\bm{\check{x}}_{l-1}-\bm{\mu}_l\right\rVert^2\mathrm{d}\bm{\check{x}}_l\,\mathrm{d}\bm{\check{x}}_{l-1}\,\mathrm{d}\bm{\check{x}}_0 + c\\
&= \sum_{l=1}^{L}\frac{1}{2\sigma^2_l}\int_{\bm{\check{x}}_0}\int_{\bm{\check{x}}_l}\check{p}\!\left(\bm{\check{x}}_l\mid\bm{\check{x}}_0\right)p\!\left(\bm{\check{x}}_0\right)\left\lVert\frac{\upsilon_{l-1}}{\rho_{l-1}}\left((\rho_{l-1}^2-\rho_l^2)\bm{\check{x}}_0 + \frac{\rho_l}{\upsilon_l}\bm{\check{x}}_l\right) - \bm{\mu}_l\right\rVert^2\mathrm{d}\bm{\check{x}}_l\,\mathrm{d}\bm{\check{x}}_0 + c.
\end{aligned} \qquad (8.39)$$

In moving to the second line, the expectation of the quadratic form was taken under $\check{p}(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0)$ using identity B.13 from the appendix, except that the trace term, again a function of the fixed variance, was absorbed into the (now different) constant $c$. The remaining expectations can be approximated with Monte Carlo estimates, since we have in hand samples from the data distribution, $p(\bm{\check{x}}_0)$, and it is straightforward to generate samples of $\bm{\check{X}}_l$ from Eq. 8.36.

How are we to interpret Eq. 8.39? It would be convenient if this could also be expressed as the mean squared error between an uncorrupted sample and a predictor—call it $\bm{\hat{m}}(\bm{\check{x}}_l,l,\bm{\theta})$—that has access only to corrupted samples ($\bm{\check{x}}_l$), as in Eq. 8.34. Of course it can, if we simply reparameterize the generative mean function on analogy with the mean of $\check{p}(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0)$ (Eq. 8.38):

$$\bm{\mu}(\bm{\hat{x}}_l,l,\bm{\theta}) := \frac{\upsilon_{l-1}}{\rho_{l-1}}\left((\rho_{l-1}^2-\rho_l^2)\,\bm{\hat{m}}(\bm{\hat{x}}_l,l,\bm{\theta}) + \frac{\rho_l}{\upsilon_l}\bm{\hat{x}}_l\right).$$

Note that this reparameterization loses no generality. In terms of this predictor, the joint relative entropy then becomes

$$\mathcal{L}_{\text{JRE}}(\bm{\theta}) = \sum_{l=1}^{L}\frac{\upsilon_{l-1}^2}{2\sigma^2_l\rho_{l-1}^2}(\rho_{l-1}^2-\rho_l^2)^2\int_{\bm{\check{x}}_0}\int_{\bm{\check{x}}_l}\check{p}\!\left(\bm{\check{x}}_l\mid\bm{\check{x}}_0\right)p\!\left(\bm{\check{x}}_0\right)\left\lVert\bm{\check{x}}_0-\bm{\hat{m}}_l\right\rVert^2\mathrm{d}\bm{\check{x}}_l\,\mathrm{d}\bm{\check{x}}_0 + c. \qquad (8.40)$$

So as in Eq. 8.34, minimizing the loss amounts to optimizing a denoising function. But in this case, it is the completely uncorrupted data samples, $\bm{\check{x}}_0$, that are to be recovered, and accordingly a different (but related) denoising function/neural network ($\bm{\hat{m}}_l$) that is to be used. The integrals in Eq. 8.40 are again to be estimated with samples, but from only two rather than three random variables.
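The following sketch estimates Eq. 8.40 in exactly this way, drawing only $(\bm{\check{x}}_0, \bm{\check{x}}_l)$ pairs; the schedule, the fixed variances $\sigma^2_l$, and the placeholder `m_hat` are illustrative assumptions rather than recommended choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo estimate of Eq. 8.40: for each l, corrupt data in one shot via
# the recognition marginal (Eq. 8.36) and score the predictor m-hat against
# the *uncorrupted* x_0.

L, dim, n = 100, 2, 500
upsilon = np.ones(L + 1)
rho2 = np.linspace(1.0, 1e-3, L + 1)
rho = np.sqrt(rho2)
sigma2 = np.ones(L + 1)                       # generic fixed variances sigma_l^2

def m_hat(x_l, l):
    return rho[l] * x_l / upsilon[l]          # placeholder; a trained network in practice

def jre_estimate():
    x0 = rng.standard_normal((n, dim)) + 3.0  # toy data samples
    loss = 0.0
    for l in range(1, L + 1):
        x_l = upsilon[l] * (rho[l] * x0 + rng.standard_normal((n, dim)))
        weight = upsilon[l-1]**2 * (rho2[l-1] - rho2[l])**2 / (2 * sigma2[l] * rho2[l-1])
        loss += weight * np.mean(np.sum((x0 - m_hat(x_l, l))**2, axis=1))
    return loss

print(jre_estimate())
```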

Up till now we have refrained from specifying $\sigma^2_l$. However, we now note that the expectations carried out in Eq. 8.39 amount to computing the cross entropy between $\check{p}(\bm{\check{x}}_{l-1}\mid\bm{\check{x}}_l,\bm{\check{x}}_0)$ and $\hat{p}(\bm{\hat{x}}_{l-1}\mid\bm{\hat{x}}_l;\bm{\theta})$. Since cross entropy is minimized when the distributions are equal, it seems sensible simply to equate their variances. (This is not quite optimal: we will not in general be able to set the means of these distributions precisely equal, so the variance $\sigma^2_l$ really ought to soak up the difference.) Comparing Eqs. 8.38 and 8.33, we have

$$\sigma^2_l \stackrel{\text{set}}{=} \upsilon^2_{l-1}(\rho_{l-1}^2-\rho_l^2)/\rho_{l-1}^2,$$

in which case Eq. 8.40 simplifies to the even more elegant

$$\begin{aligned}
\mathcal{L}_{\text{JRE}}(\bm{\theta}) &= \sum_{l=1}^{L}\frac{\rho_{l-1}^2-\rho_l^2}{2}\int_{\bm{\check{x}}_0}\int_{\bm{\check{x}}_l}\check{p}\!\left(\bm{\check{x}}_l\mid\bm{\check{x}}_0\right)p\!\left(\bm{\check{x}}_0\right)\left\lVert\bm{\check{x}}_0-\bm{\hat{m}}_l\right\rVert^2\mathrm{d}\bm{\check{x}}_l\,\mathrm{d}\bm{\check{x}}_0 + c\\
&\approx \sum_{l=1}^{L}\frac{\rho_{l-1}^2-\rho_l^2}{2}\left\langle\left\lVert\bm{\check{X}}_0-\bm{\hat{m}}(\bm{\check{X}}_l,l,\bm{\theta})\right\rVert^2\right\rangle_{\bm{\check{X}}_l,\bm{\check{X}}_0} + c.
\end{aligned} \qquad (8.41)$$

In words, each summand in Eq. 8.41 computes the mean squared error between the uncorrupted data $\bm{\check{X}}_0$ and a denoised version of the corrupted data, $\bm{\check{X}}_l$. But before summing, the MSE at step $l$ is weighted by the amount of SNR lost in transitioning from step $l-1$ to $l$. In fact, Eq. 8.41 tells us that fitting a Gaussian reverse-diffusion model is equivalent to fitting a (conceptually) different generative model:

$$\hat{p}(\bm{\check{x}}_0\mid\bm{\hat{x}}_l;\bm{\theta}) = \mathcal{N}\!\left(\bm{\hat{m}}(\bm{\hat{x}}_l,l,\bm{\theta}),\;\frac{1}{\rho_{l-1}^2-\rho_l^2}\mathbf{I}\right), \quad l\in 1,\ldots,L. \qquad (8.42)$$

Implementation

The pathwise gradient.

In actually carrying out the sample average for Eq. 8.41, we would typically reparameterize $\bm{\check{X}}_l$ along the lines of Eq. 8.24, i.e. as a scaled and shifted standard normal variate ($\bm{\check{Z}}$), using Eq. 8.36, and then apply the LotUS:

$$\mathcal{L}_{\text{JRE}}(\bm{\theta}) = \sum_{l=1}^{L}\frac{\rho_{l-1}^2-\rho_l^2}{2}\left\langle\left\lVert\bm{\check{X}}_0-\bm{\hat{m}}\!\left(\upsilon_l(\rho_l\bm{\check{X}}_0+\bm{\check{Z}}),l,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{Z}},\bm{\check{X}}_0} + c. \qquad (8.43)$$
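A minimal training loop for Eq. 8.43 might look like the following sketch (in PyTorch); the SNR schedule, the toy data distribution, and the small MLP standing in for $\bm{\hat{m}}(\bm{\check{x}}_l,l,\bm{\theta})$ are all assumptions made for illustration:

```python
import torch

torch.manual_seed(0)

# A minimal training sketch for the pathwise objective of Eq. 8.43, assuming
# upsilon_l = 1 and a linearly decaying rho_l^2.

L, dim = 20, 2
rho2 = torch.linspace(1.0, 1e-3, L + 1)
rho = rho2.sqrt()
weights = 0.5 * (rho2[:-1] - rho2[1:])        # (rho_{l-1}^2 - rho_l^2) / 2

m_hat = torch.nn.Sequential(
    torch.nn.Linear(dim + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, dim)
)
opt = torch.optim.Adam(m_hat.parameters(), lr=1e-3)

def sample_data(n):
    return torch.randn(n, dim) + 3.0          # toy data distribution

for step in range(100):
    x0 = sample_data(128)
    loss = 0.0
    for l in range(1, L + 1):
        z = torch.randn_like(x0)              # the pathwise noise variable
        x_l = rho[l] * x0 + z                 # Eq. 8.36 with upsilon_l = 1
        inp = torch.cat([x_l, torch.full((x0.shape[0], 1), l / L)], dim=1)
        loss = loss + weights[l - 1] * ((x0 - m_hat(inp)) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```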

The continuous-time limit.

Eq. 8.36 tells us that the data can be corrupted to an arbitrary position in the Markov chain with a single computation. Consequently, it is not actually necessary to run the chain sequentially from 1 to $L$ during training, which is critical for parallelized implementations. Indeed, we need not even limit ourselves to an integer number of steps. Suppose we allow the SNR to be a monotonically decreasing function $h$ of a continuous variable $u$ that ranges from 0 to 1, such that $h(l/L) = \rho_l^2$. For consistency, we will define another function on $[0,1]$ for the marginal variance, such that $g(l/L) = \upsilon^2_l$, although $g$ need not be monotonic. Then if we scale the joint relative entropy in Eq. 8.41 by the “step size” $1/L$ and take the limit as $L\to\infty$, the loss becomes

$$\begin{aligned}
\lim_{L\to\infty}\frac{1}{L}\mathcal{L}_{\text{JRE}}(\bm{\theta}) &= \lim_{L\to\infty}\sum_{l=1}^{L}\frac{1}{2L}\left(h\!\left(\frac{l-1}{L}\right)-h\!\left(\frac{l}{L}\right)\right)\left\langle\left\lVert\bm{\check{X}}_0-\bm{\hat{m}}\!\left(\sqrt{g(l/L)h(l/L)}\,\bm{\check{X}}_0+\sqrt{g(l/L)}\,\bm{\check{Z}},\,l/L,\,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{Z}},\bm{\check{X}}_0} + c/L\\
&= \frac{1}{2}\int_0^1\frac{\mathrm{d}h}{\mathrm{d}u}(u)\left\langle\left\lVert\bm{\check{X}}_0-\bm{\hat{m}}\!\left(\sqrt{g(u)h(u)}\,\bm{\check{X}}_0+\sqrt{g(u)}\,\bm{\check{Z}},\,u,\,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{Z}},\bm{\check{X}}_0}\mathrm{d}u\\
&= \frac{1}{2}\left\langle\frac{\mathrm{d}h}{\mathrm{d}u}(U)\left\lVert\bm{\check{X}}_0-\bm{\hat{m}}\!\left(\sqrt{g(U)h(U)}\,\bm{\check{X}}_0+\sqrt{g(U)}\,\bm{\check{Z}},\,U,\,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{Z}},\bm{\check{X}}_0,U}
\end{aligned}$$

with $U\sim\mathcal{U}(0,1)$ a uniformly distributed random variable. This is the preferred implementation of diffusion models [25]. But the second line also suggests the change of variables $\lambda = h(u)$, under which the integral becomes

$$\lim_{L\to\infty}\frac{1}{L}\mathcal{L}_{\text{JRE}}(\bm{\theta}) = \frac{1}{2}\int_{\lambda}\left\langle\left\lVert\bm{\check{X}}_0-\bm{\hat{m}}\!\left(\sqrt{g(h^{-1}(\lambda))\lambda}\,\bm{\check{X}}_0+\sqrt{g(h^{-1}(\lambda))}\,\bm{\check{Z}},\,h^{-1}(\lambda),\,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{Z}},\bm{\check{X}}_0}\mathrm{d}\lambda.$$

Notice that $h$ can be safely removed from this equation, since we have not yet committed to any particular $g$, and $\bm{\hat{m}}$ is assumed to be arbitrarily flexible. This shows that in the continuous-time limit, any choice of SNR function yields the same joint relative entropy in expectation, as long as (1) it is monotonically decreasing and (2) it has well-chosen endpoints, $\lambda_{\text{min}}$ and $\lambda_{\text{max}}$. However, this choice does affect the variance of this sample average, and various SNR “schedules” have been experimented with in practice [25].
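In practice this means drawing $U$ uniformly for each training example rather than summing over steps. The sketch below does so under an assumed exponential SNR schedule $h(u)$ (with $g(u)=1$), weighting each example by the rate at which the SNR decays, $-\mathrm{d}h/\mathrm{d}u$, so that each per-example weight is positive:

```python
import torch

torch.manual_seed(1)

# Continuous-time training sketch: instead of summing over all L steps, draw
# U ~ Uniform(0, 1) per example.  h(u) = exp(-8u) is an assumed (decreasing)
# SNR schedule and g(u) = 1 an assumed marginal variance.

dim = 2
m_hat = torch.nn.Sequential(
    torch.nn.Linear(dim + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, dim)
)
opt = torch.optim.Adam(m_hat.parameters(), lr=1e-3)

def h(u):                 # SNR schedule rho^2 as a function of u in [0, 1]
    return torch.exp(-8.0 * u)

def neg_dh_du(u):         # magnitude of the SNR decay rate, -h'(u)
    return 8.0 * torch.exp(-8.0 * u)

def sample_data(n):
    return torch.randn(n, dim) + 3.0   # toy data distribution

for step in range(200):
    x0 = sample_data(256)
    u = torch.rand(x0.shape[0], 1)
    z = torch.randn_like(x0)
    x_u = torch.sqrt(h(u)) * x0 + z                  # sqrt(g*h)*x0 + sqrt(g)*z with g = 1
    pred = m_hat(torch.cat([x_u, u], dim=1))
    loss = (0.5 * neg_dh_du(u).squeeze(1)
            * ((x0 - pred) ** 2).sum(dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```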

A connection to denoising score matching

There is in fact another illuminating reparameterization, this time in terms of the negative energy gradient, $-\partial E/\partial\bm{x}$. Because this quantity can be written (generically) as $\partial[\log p(\bm{x})]/\partial\bm{x}$, it is sometimes called the score function for its resemblance to $\partial[\log p(\bm{x};\bm{\theta})]/\partial\bm{\theta}$. We will call it the force. Intuitively, the force points toward the modal $\bm{x}$. Furthermore, if our goal in fitting a generative model is merely to synthesize new data, then the force suffices, because the iteration

$$\bm{X}_{i+1} = \bm{X}_i - \epsilon\frac{\partial E}{\partial\bm{x}}(\bm{X}_i) + \sqrt{2\epsilon}\,\bm{Z}_i, \qquad (8.44)$$

(with $\bm{Z}_i\sim\mathcal{N}(\bm{0},\,\mathbf{I})$, and step size $\epsilon$) can be shown to generate samples approximately from the distribution $p(\bm{x})\propto\exp\{-E(\bm{x})\}$. This iteration is known as Langevin dynamics, and we return to it in Section 10.2.1. For now we simply ask how we might get or estimate the force.
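Eq. 8.44 is simple to implement when the energy is known. The sketch below samples from a two-dimensional Gaussian with mean $(3,3)$, chosen only so that the result is easy to check:

```python
import numpy as np

rng = np.random.default_rng(5)

# A bare-bones implementation of the Langevin iteration in Eq. 8.44, sampling
# from p(x) proportional to exp(-E(x)) for a known energy.  Here E is the
# energy of a 2-D Gaussian with mean (3, 3).

def grad_E(x):
    return x - 3.0                      # gradient of E(x) = ||x - 3||^2 / 2

def langevin(n_steps=5000, eps=1e-2, dim=2):
    x = np.zeros(dim)
    samples = []
    for _ in range(n_steps):
        x = x - eps * grad_E(x) + np.sqrt(2 * eps) * rng.standard_normal(dim)
        samples.append(x.copy())
    return np.array(samples)

samples = langevin()
print(samples[1000:].mean(axis=0))      # approximately (3, 3)
```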

Consider a random variable $\bm{Y}$ that was created by corrupting the random variable of interest, $\bm{X}$, with some kind of additive noise. The resulting marginal distribution of $\bm{Y}$,

$$p(\bm{y}) = \int_{\bm{x}} p(\bm{x})\,p_{\text{noise}}(\bm{y}-\bm{x})\,\mathrm{d}\bm{x} \approx \frac{1}{N}\sum_{n}^{N} p_{\text{noise}}(\bm{y}-\bm{x}_n)$$

can be thought of as a kernel-density estimate of the distribution of interest, $p(\bm{x})$, with $N$ data samples and kernel $p_{\text{noise}}$. So perhaps we can use the former in the place of the latter in our Langevin dynamics, Eq. 8.44. But then how are we to get the force of $p(\bm{y})$? Expanding it, we find that

$$\begin{aligned}
\frac{\partial}{\partial\bm{y}^{\text{T}}}\log p(\bm{y}) = \frac{1}{p(\bm{y})}\frac{\partial}{\partial\bm{y}^{\text{T}}}\int_{\bm{x}} p(\bm{y}\mid\bm{x})\,p(\bm{x})\,\mathrm{d}\bm{x} &= \frac{1}{p(\bm{y})}\int_{\bm{x}} p(\bm{y}\mid\bm{x})\,\frac{\partial}{\partial\bm{y}^{\text{T}}}\log\!\left(p(\bm{y}\mid\bm{x})\right)p(\bm{x})\,\mathrm{d}\bm{x}\\
&= \mathbb{E}_{\bm{X}|\bm{Y}}\!\left[\frac{\partial}{\partial\bm{y}^{\text{T}}}\log\!\left(p(\bm{y}\mid\bm{X})\right)\,\middle|\,\bm{y}\right].
\end{aligned} \qquad (8.45)$$

This equation says that the force of the marginal $p(\bm{y})$ equals the expected (under $p(\bm{x}\mid\bm{y})$) force of the conditional, $p(\bm{y}\mid\bm{x})$. The latter can often be computed. For example, if the data are corrupted by scaling and then adding zero-mean Gaussian noise, then the conditional energy and its expected negative gradient (force) are

$$E(\bm{y}\mid\bm{x}) = (\bm{y}-\alpha\bm{x})^{\text{T}}\Sigma^{-1}(\bm{y}-\alpha\bm{x})/2 \implies \mathbb{E}_{\bm{X}|\bm{Y}}\!\left[-\frac{\partial}{\partial\bm{y}^{\text{T}}}E(\bm{y}\mid\bm{X})\,\middle|\,\bm{y}\right] = \Sigma^{-1}\!\left(\alpha\,\mathbb{E}\!\left[\bm{X}\mid\bm{y}\right]-\bm{y}\right).$$

Putting this together with Eq. 8.45, we see that for additive Gaussian noise,

$$\frac{\partial}{\partial\bm{y}^{\text{T}}}\log p(\bm{y}) = \Sigma^{-1}\!\left(\alpha\,\mathbb{E}\!\left[\bm{X}\mid\bm{y}\right]-\bm{y}\right). \qquad (8.46)$$

This is sometimes called Tweedie’s formula. This looks helpful—if we had in hand the posterior mean!

Now it is a fact that, of all estimators for $\bm{X}$, the posterior mean has minimum mean squared error [42]. Putting all these pieces together [18, 52, 23] yields the following procedure to generate samples from the distribution of interest, $p(\bm{x})$: (1) find an estimator for $\bm{X}$ that minimizes mean squared error; (2) use this in place of the posterior mean in Eq. 8.46 to compute the expected conditional force and, consequently, the marginal force; (3) use the marginal force, $\partial[\log p(\bm{y})]/\partial\bm{y}$, as a proxy for the data force, $\partial[\log p(\bm{x})]/\partial\bm{x}$, in Langevin dynamics (Eq. 8.44). This method of density estimation is known as denoising score matching.

Now we examine the diffusion model in light of this procedure. Fitting the generative model to the data has turned out to be equivalent to minimizing the mean squared error between $\bm{\check{X}}_0$ and $\bm{\hat{m}}(\bm{\check{X}}_l,l,\bm{\theta})$ (Eq. 8.41). Therefore we can interpret $\bm{\hat{m}}_l$ as (an estimator for) the posterior mean, $\mathbb{E}[\bm{\check{X}}_0\mid\bm{\check{X}}_l]$. The samples $\bm{\check{X}}_l$ are generated by a recognition model that corrupts the data samples $\bm{\check{X}}_0$ with Gaussian noise (Eq. 8.36). Therefore we can use the posterior-mean estimator $\bm{\hat{m}}_l$ and Tweedie’s formula (Eq. 8.46) to construct a force estimator:

$$\frac{\upsilon_l\rho_l\,\bm{\hat{m}}(\bm{\check{x}}_l,l,\bm{\theta}) - \bm{\check{x}}_l}{\upsilon^2_l} =: \bm{\hat{f}}(\bm{\check{x}}_l,l,\bm{\theta}). \qquad (8.47)$$

The force estimator $\bm{\hat{f}}_l$ also provides a good proxy for the data force, $\partial[\log p(\bm{\check{x}}_0)]/\partial\bm{\check{x}}_0$, and consequently can be used to generate data with Langevin dynamics (Eq. 8.44).
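To illustrate the procedure end to end, the sketch below uses a toy Gaussian data distribution, for which the posterior mean is available in closed form and can stand in for a trained $\bm{\hat{m}}$; Eq. 8.47 then converts it into a force, which drives the Langevin iteration of Eq. 8.44:

```python
import numpy as np

rng = np.random.default_rng(6)

# Plugging a posterior-mean estimator into Tweedie's formula (Eq. 8.47) and
# running Langevin dynamics (Eq. 8.44) on the resulting force.  For the toy
# data X_0 ~ N(mu0, I) the posterior mean is available in closed form.

mu0, dim = 3.0, 2
upsilon, rho = 1.0, 0.99            # a mildly corrupting level l (high SNR)

def m_hat(x_l):
    # exact E[X_0 | x_l] for Gaussian data; a learned denoiser in practice
    return mu0 + (upsilon * rho) / (upsilon**2 * (rho**2 + 1)) * (x_l - upsilon * rho * mu0)

def f_hat(x_l):
    # force estimator, Eq. 8.47
    return (upsilon * rho * m_hat(x_l) - x_l) / upsilon**2

x = np.zeros(dim)
eps, samples = 1e-2, []
for _ in range(20000):
    x = x + eps * f_hat(x) + np.sqrt(2 * eps) * rng.standard_normal(dim)
    samples.append(x.copy())
print(np.mean(samples[5000:], axis=0))   # roughly mu0 = (3, 3)
```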

Alternatively, the force can be fit directly, rather than indirectly via the posterior mean. That is easily done here as well, simply by rearranging Eq. 8.47 to reparameterize $\bm{\hat{m}}$ (again without loss of generality):

$$\bm{\hat{m}}(\bm{\check{x}}_l,l,\bm{\theta}) := \frac{\upsilon_l}{\rho_l}\bm{\hat{f}}(\bm{\check{x}}_l,l,\bm{\theta}) + \frac{1}{\upsilon_l\rho_l}\bm{\check{x}}_l$$

for some arbitrary function (neural network) $\bm{\hat{f}}$. Under this reparameterization, Eq. 8.41 becomes

$$\begin{aligned}
\mathcal{L}_{\text{JRE}}(\bm{\theta}) &= \sum_{l=1}^{L}\frac{\rho_{l-1}^2-\rho_l^2}{2}\left\langle\left\lVert\bm{\check{X}}_0-\frac{\upsilon_l}{\rho_l}\bm{\hat{f}}\!\left(\bm{\check{X}}_l,l,\bm{\theta}\right)-\frac{1}{\upsilon_l\rho_l}\bm{\check{X}}_l\right\rVert^2\right\rangle_{\bm{\check{X}}_l,\bm{\check{X}}_0} + c\\
&= \sum_{l=1}^{L}\frac{\rho_{l-1}^2-\rho_l^2}{2}\frac{\upsilon^2_l}{\rho_l^2}\left\langle\left\lVert\frac{\upsilon_l\rho_l\bm{\check{X}}_0-\bm{\check{X}}_l}{\upsilon^2_l}-\bm{\hat{f}}\!\left(\bm{\check{X}}_l,l,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{X}}_l,\bm{\check{X}}_0} + c\\
&= \sum_{l=1}^{L}\frac{\rho_{l-1}^2-\rho_l^2}{2\rho_l^2}\left\langle\left\lVert\bm{\check{Z}}+\upsilon_l\bm{\hat{f}}\!\left(\upsilon_l(\rho_l\bm{\check{X}}_0+\bm{\check{Z}}),l,\bm{\theta}\right)\right\rVert^2\right\rangle_{\bm{\check{Z}},\bm{\check{X}}_0} + c.
\end{aligned} \qquad (8.48)$$

The last step follows from the reparameterization. Intuitively, $\bm{\hat{f}}$ learns, like $\bm{\hat{m}}$, how to uncorrupt data. But rather than transforming the corrupted sample ($\bm{\check{x}}_l$) into an estimate of the (scaled) uncorrupted sample itself ($\upsilon_l\rho_l\bm{\check{x}}_0$), a good $\bm{\hat{f}}_l$ produces (second line of Eq. 8.48) an estimate of the vector that points back to $\upsilon_l\rho_l\bm{\check{x}}_0$ from the corrupted sample $\bm{\check{x}}_l$. This is consistent with our conclusion that any $\bm{\hat{f}}$ that satisfies Eq. 8.47 provides an estimator for the force of the data distribution. Alternatively, the final line of Eq. 8.48 tells us that the force estimator $\bm{\hat{f}}_l$ must try to recover each realization of noise ($\bm{\check{z}}_l$) that corrupted each observed datum ($\bm{\check{x}}_0$). But notice that the negative force, i.e. the positive energy gradient, must point in the direction of $\bm{\check{Z}}$. This makes sense: we expect the noise to be “uphill.”
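Training under this noise-recovery form of the loss (the last line of Eq. 8.48) looks nearly identical to training $\bm{\hat{m}}$; in the sketch below, which reuses the earlier illustrative schedule and toy data, the network $\bm{\hat{f}}$ is scored on how well $\upsilon_l\bm{\hat{f}}$ matches the negative of the noise:

```python
import torch

torch.manual_seed(2)

# Training sketch for the noise-recovery form of the loss (last line of
# Eq. 8.48), again assuming upsilon_l = 1 and a linearly decaying rho_l^2.

L, dim = 20, 2
rho2 = torch.linspace(1.0, 1e-3, L + 1)
rho = rho2.sqrt()
weights = (rho2[:-1] - rho2[1:]) / (2 * rho2[1:])   # (rho_{l-1}^2 - rho_l^2)/(2 rho_l^2)

f_hat = torch.nn.Sequential(
    torch.nn.Linear(dim + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, dim)
)
opt = torch.optim.Adam(f_hat.parameters(), lr=1e-3)

for step in range(100):
    x0 = torch.randn(128, dim) + 3.0                # toy data
    loss = 0.0
    for l in range(1, L + 1):
        z = torch.randn_like(x0)
        x_l = rho[l] * x0 + z                       # upsilon_l = 1
        inp = torch.cat([x_l, torch.full((x0.shape[0], 1), l / L)], dim=1)
        loss = loss + weights[l - 1] * ((z + f_hat(inp)) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```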

In either case, each summand in Eq. 8.48 corresponds to an objective for denoising score matching; or, put the other way around, fitting a Gaussian (reverse-)diffusion model (Eqs. 8.30, 8.31, 8.37, and 8.33) amounts to running denoising score matching for many different kernel widths. And indeed, such a learning procedure has been proposed and justified independently under this description [48].