6.3 Expectation-Maximization

6.3.1 Derivation of EM

This isn’t quite true. Working with the joint did allow the parameters inside the expectation to decouple. The present problem is a consequence of the parameters in the expectation itself, in particular the model posterior, $\hat{p}(\bm{\hat{x}} \mid \bm{\hat{y}}; \bm{\theta})$. This does not preclude gradient descent, and we shall return to it and Eq. 6.3 in Section 10.2.1.

Approximating the posterior distribution.

But here we will not yet give up hope of a cleaner solution. Consider this seemingly artless proposal: simply take the expectation under a different distribution, $\check{p}(\bm{\check{x}} \mid \bm{y})$, one that doesn’t depend on these parameters.\footnote{Note the new latent variables $\bm{\check{X}}$, as opposed to $\bm{\hat{X}}$, that have necessarily been introduced with this “recognition” model. This distinction is not always noted in the literature.} Doing so will certainly ruin the equality in Eq. 6.3; that is,

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\mathbb{E}_{\bm{Y}}\!\left[-\log\hat{p}(\bm{Y};\bm{\theta})\right] &= \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\bm{\hat{X}},\bm{Y};\bm{\theta})\right] \\ &\neq \mathbb{E}_{\bm{\check{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right] = \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\mathbb{E}_{\bm{\check{X}},\bm{Y}}\!\left[-\log\hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right], \end{aligned}$$

where in the bottom line the expectations are under this recognition distribution margin: recognition distribution.\footnote{This term for—and the idea of—a separate model for inference originates in [4].} But, as indicated by the final equality, and in contrast to the correct gradient, this incorrect gradient is indeed the gradient of a joint (“complete”) cross entropy: the derivative can pass outside the expectation that no longer depends on $\bm{\theta}$.

The central intuition behind the algorithms that follow is to fit marginal densities by minimizing the joint or “complete” cross entropy, $\mathbb{E}_{\bm{\check{X}},\bm{Y}}\!\left[-\log\hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right]$, in lieu of the marginal or “incomplete” cross entropy, $\mathbb{E}_{\bm{Y}}\!\left[-\log\hat{p}(\bm{Y};\bm{\theta})\right]$. The discrepancy between these cross entropies can be finessed (we claim) by simultaneously minimizing the discrepancy between the recognition distribution and the posterior of the generative model.

Relative entropy, cross entropy, and free energy.

Let us prove this last claim. We begin by noting that the following objectives differ only by terms constant in the parameters $\bm{\theta}$:

$$\begin{aligned} \mathcal{L}_{\text{JRE}}(\bm{\theta},\check{p}) := \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y})\,p(\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right\} &= \mathbb{E}_{\bm{\check{X}},\bm{Y}}\!\left[\log\!\big(\check{p}(\bm{\check{X}}\mid\bm{Y})\,p(\bm{Y})\big) - \log\hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right] \\ &= \mathbb{E}_{\bm{\check{X}},\bm{Y}}\!\left[\log\check{p}(\bm{\check{X}}\mid\bm{Y}) - \log\hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right] + C_1 \\ &= \mathbb{E}_{\bm{\check{X}},\bm{Y}}\!\left[-\log\hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right] + C_2. \end{aligned} \qquad (6.5)$$

We can interpret these quantities:

  • The first is a relative entropy, in this case between a “hybrid” joint distribution—the product of the recognition distribution and the data marginal—and the generative-model joint. Thus our optimization can be expressed in terms of the fundamental loss function proposed in Section 4.2.

  • The second we call the free energy margin: free energy after its counterpart in statistical physics. It differs from the relative entropy only by the entropy of the data. We introduce this quantity primarily because, in the machine-learning literature, EM-like learning procedures are most frequently derived in terms of something like the free energy. More precisely, the proofs are written in terms of $\mathbb{E}_{\bm{\check{X}}}\!\left[\log\hat{p}(\bm{\check{X}},\bm{\hat{y}};\bm{\theta}) - \log\check{p}(\bm{\check{X}}\mid\bm{y})\right]$, the negative of the free energy before averaging under the data. This is known as the evidence lower bound (ELBo) margin: evidence lower bound (ELBo). The name will become clear below.

  • The quantity in the third line is the joint cross entropy that we lately proposed to optimize in lieu of the marginal cross entropy.

Clearly, minimizing any of these objectives is equivalent, but the proof is most elegant for the joint relative entropy, which can be decomposed as

$$\mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y})\,p(\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}},\bm{Y};\bm{\theta})\right\} = \mathrm{D}_{\mathrm{KL}}\!\left\{p(\bm{Y}) \,\middle\|\, \hat{p}(\bm{Y};\bm{\theta})\right\} + \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\}. \qquad (6.6)$$

(The reader should verify this. margin: Exercise LABEL:ex:jointKLdecomposition) On the right-hand side, the first (marginal) relative entropy is the quantity that we actually want to minimize. The second (posterior) relative entropy is perforce non-negative (see Section 4.2). So the left-hand side is an upper bound on the true objective. And the bound can be tightened by optimizing the joint relative entropy with respect to the recognition distribution, i.e., decreasing the second term on the right. If it can be driven to zero, the bound will be tight. (Precisely the same relationship holds for the free energy with respect to the marginal cross entropy. margin: Exercise LABEL:ex:freeEnergyUpperBound)
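
To make the decomposition in Eq. 6.6 concrete, the sketch below checks it numerically on a toy discrete model. Everything in it—the sizes, the randomly drawn distributions, the variable names—is our own placeholder, not anything defined in the text; entropies and divergences are computed in nats.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setup (all distributions are arbitrary placeholders):
# data marginal p(y), generative joint p_hat(x, y; theta), recognition q(x | y).
nx, ny = 3, 4
p_y = rng.dirichlet(np.ones(ny))                               # p(y)
p_hat_joint = rng.dirichlet(np.ones(nx * ny)).reshape(nx, ny)  # p_hat(x, y)
q_x_given_y = rng.dirichlet(np.ones(nx), size=ny).T            # q(x | y); columns sum to 1

def kl(p, q):
    """KL divergence between two arrays of matching shape (in nats)."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Left-hand side of Eq. 6.6: relative entropy between the "hybrid" joint
# q(x|y) p(y) and the generative joint p_hat(x, y).
hybrid_joint = q_x_given_y * p_y
lhs = kl(hybrid_joint, p_hat_joint)

# Right-hand side: marginal relative entropy plus the posterior relative
# entropy, the latter averaged under the data marginal p(y).
p_hat_y = p_hat_joint.sum(axis=0)
p_hat_x_given_y = p_hat_joint / p_hat_y
posterior_kl = np.sum(p_y * np.sum(
    q_x_given_y * (np.log(q_x_given_y) - np.log(p_hat_x_given_y)), axis=0))
rhs = kl(p_y, p_hat_y) + posterior_kl

assert np.isclose(lhs, rhs)  # Eq. 6.6 holds
\end{verbatim}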

Let us make the procedure explicit. We minimize the joint relative entropy with respect to its two arguments:

EM Algorithm

  • (E) Discriminative optimization: $\check{p}^{(i+1)}(\bm{\check{x}}\mid\bm{y}) \leftarrow \operatorname{argmin}_{\check{p}}\left\{\mathcal{L}_{\text{JRE}}(\bm{\theta}^{(i)},\check{p})\right\}$
  • (M) Generative optimization: $\bm{\theta}^{(i+1)} \leftarrow \operatorname{argmin}_{\bm{\theta}}\left\{\mathcal{L}_{\text{JRE}}(\bm{\theta},\check{p}^{(i+1)})\right\}$.

The parameters can be initialized randomly, $\bm{\theta}_0$, although there are better alternatives (see below). The two optimizations may be executed alternately or simultaneously—we shall see examples of both below—until convergence, at which point $\bm{\theta}_{\text{final}}$ is our solution.
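
For concreteness, here is a minimal sketch of this loop for a Gaussian mixture model, the special case in which the E step can set the recognition distribution equal to the exact generative posterior (the “responsibilities”) and the M step has a closed form. The function, its spherical-covariance parameterization, and the synthetic data are our own illustration under those assumptions, not a reference implementation.

\begin{verbatim}
import numpy as np

def em_gmm(Y, K, n_iters=50, seed=0):
    """Minimal EM for a Gaussian mixture with spherical covariances.

    Y : (N, D) array of observations; K : number of mixture components.
    Returns mixture weights, component means, and per-component variances.
    """
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    pi = np.full(K, 1.0 / K)                 # prior over classes, p_hat(x)
    mu = Y[rng.choice(N, K, replace=False)]  # initialize means at data points
    var = np.full(K, Y.var())                # spherical variances

    for _ in range(n_iters):
        # E step: recognition distribution = exact posterior responsibilities.
        sq_dist = ((Y[:, None, :] - mu[None]) ** 2).sum(-1)        # (N, K)
        log_joint = (np.log(pi)
                     - 0.5 * D * np.log(2 * np.pi * var)
                     - 0.5 * sq_dist / var)                        # log p_hat(x, y_n)
        log_joint -= log_joint.max(axis=1, keepdims=True)          # for stability
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)                    # q(x | y_n)

        # M step: minimize the joint cross entropy under the responsibilities.
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ Y) / Nk[:, None]
        var = (resp * ((Y[:, None, :] - mu[None]) ** 2).sum(-1)).sum(0) / (D * Nk)

    return pi, mu, var

# Toy usage on synthetic two-cluster data (purely illustrative):
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(+2, 1, (200, 2))])
pi, mu, var = em_gmm(Y, K=2)
\end{verbatim}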

What remains to be specified now is a form for the recognition distribution and (in turn) a method for optimizing it. These choices lead to different algorithms. In this text, we refer to all such algorithms as expectation-maximization margin: expectation-maximization (EM). In fact, only one special case (discussed next) corresponds to the original EM algorithm of Dempster, Laird, and Rubin [5]; it was later generalized, under the same name, to the procedure just described by Neal and Hinton [34].

Nomenclature.

The E and M in the “EM algorithm” originally stood for the two optimization steps that we have called “discriminative” and “generative.” Historically, the discriminative optimization was known as the “E step” because it referred to taking expectations under, rather than finding via optimization, the recognition distribution. This is the expectation in the final line of Eq. 6.5 (ignoring the additional average under the data distribution). Indeed, as we shall see shortly, the process of computing the expectations can be rather involved, since it requires running an inference algorithm (recall, for example, the smoothers of Section 2.2). Nevertheless, the “E step” has come to refer to the optimization, as this plays a larger role in modern variants of EM—and this allows us to see EM as a form of coordinate descent in the joint relative entropy.

Since EM was originally written in terms of expected log-likelihood rather than cross entropy, the “M step” originally referred to a maximization, but it will equally well refer to a minimization as in our setup.

6.3.2 Information-theoretic perspectives on EM

We now derive EM from a slightly different, more informal, perspective. This subsection (6.3.2) can be skipped without loss of continuity.

The relationship between the marginal and joint cross entropies.

We saw above (Eq. 6.3) that the marginal and joint cross entropies do not have the same gradient with respect to $\bm{\theta}$. We can see this more directly simply by decomposing the joint cross entropy:

$$\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right] = \mathrm{H}_{(p\hat{p})\hat{p}}\!\left[\bm{\hat{X}},\bm{Y};\bm{\theta}\right] - \mathrm{H}_{(p\hat{p})\hat{p}}\!\left[\bm{\hat{X}}\mid\bm{Y};\bm{\theta}\right]. \qquad (6.7)$$

Thus, in order to use the joint cross entropy as a proxy for the marginal cross entropy, we would have to “regularize” the optimization with the negative entropy of the (generative) posterior, averaged under the data,\footnote{This conditional entropy should not be confused with $\mathrm{H}_{\hat{p}}\!\left[\bm{\hat{X}}\mid\bm{\hat{Y}};\bm{\theta}\right]$, the average of the posterior entropy under the model marginal, $\hat{p}(\bm{\hat{y}};\bm{\theta})$.} $-\mathrm{H}_{(p\hat{p})\hat{p}}\!\left[\bm{\hat{X}}\mid\bm{Y};\bm{\theta}\right]$. Computationally, this gets us nowhere, since the problematic marginal cross entropy is (typically) just as much on the right- as on the left-hand side. But conceptually, it shows how exactly we need to control the extra degree of freedom introduced into the model with latent variables: The model joint can be improved simply by lowering the posterior uncertainty—that is, licensing more specific inferences to the latent variables, $\bm{\hat{X}}$—without actually improving the model marginal. For example, any setting of continuous latent variables could be assigned arbitrarily large posterior probability, driving the joint cross entropy to arbitrarily large negative numbers. To prevent this, our optimization must encourage a certain amount of vagueness in the model inferences. In short, we have to force the optimization to fit the observations ($\bm{Y}$) better, rather than just make the latent variables ($\bm{\hat{X}}$) more predictable.

The optimal inferential vagueness (posterior entropy) will seldom be minimal (0, for discrete latent variables), but nor will it be maximal. Rather, the posterior entropy should be allowed to take on whatever value provides the best fit to the observations, $\bm{Y}$. Indeed, an a priori reason to believe that either of these extreme cases obtains would motivate a different modeling approach. Maximum posterior entropy, $\mathrm{H}_{(p\hat{p})\hat{p}}\!\left[\bm{\hat{X}}\mid\bm{Y};\bm{\theta}\right] = \mathrm{H}_{\hat{p}}\!\left[\bm{\hat{X}};\bm{\theta}\right]$ (recalling that conditioning can never increase uncertainty), implies that the latent variables have no mutual information with the observations—they are useless, so we are essentially back to the fully observed models discussed in the introduction, Section 6.1. Minimal posterior entropy, $\mathrm{H}_{(p\hat{p})\hat{p}}\!\left[\bm{\hat{X}}\mid\bm{Y};\bm{\theta}\right] = 0$ (for discrete latent variables), implies a deterministic relationship between $\bm{Y}$ and $\bm{\hat{X}}$. In this case, EM is likewise unnecessary; we shall explore such cases in Chapter 9 below.

Averaging under a recognition distribution.

EM has two ingredients that distinguish it from direct minimization of $\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right]$: a joint (rather than marginal) model, but also the use of a separate recognition model that may differ from the posterior of the generative model. To introduce the latter, notice that Eq. 6.7 holds just as well if the expectation over the latent variables is taken under this recognition distribution:

$$\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right] = \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right] - \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}}\mid\bm{Y};\bm{\theta}\right]. \qquad \text{(6.8a)}$$
Note well that this important distinction is conveyed by some subtle notational differences: The generative-model joint surprisal $-\log\hat{p}(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta})$ is averaged under the generative-model posterior, $\hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})$ (and $p(\bm{y})$), in Eq. 6.7, making $\mathrm{H}_{(p\hat{p})\hat{p}}\!\left[\bm{\hat{X}},\bm{Y};\bm{\theta}\right]$ something halfway between an entropy and a cross entropy; but under the recognition posterior, $\check{p}(\bm{\check{x}}\mid\bm{y})$ (and $p(\bm{y})$), in Eq. 6.8a, making $\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]$ a full cross entropy. The change in interpretation is correspondingly subtle: If this joint cross entropy is to serve in the optimization as a proxy for the marginal cross entropy, it must be regularized by the negative posterior cross entropy, $-\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}}\mid\bm{Y};\bm{\theta}\right]$.

What are the implications of regularizing with a cross entropy? As in Eq. 6.7, this prevents false solutions that simply concentrate posterior probability at single points. (Otherwise, as long as these points have support under the recognition model, the posterior cross entropy could be driven to arbitrarily large negative values.) However, in contrast to Eq. 6.7, this regularizer can also be seen as discouraging the generative posterior, $\hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})$, from too closely resembling the recognition posterior, $\check{p}(\bm{\check{x}}\mid\bm{y})$. This is perhaps more obvious if the recognition entropy, $\mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right]$, is added to both sides of Eq. 6.8a,

$$\begin{aligned} \mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right] + \mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right] &= \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right] - \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}}\mid\bm{Y};\bm{\theta}\right] + \mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right] \\ &= \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right] - \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\}, \end{aligned} \qquad \text{(6.8b)}$$

in which case the regularizer becomes the relative entropy (KL divergence) of the recognition and generative posterior distributions, but the gradient is unchanged, since $\mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right]$ is independent of $\bm{\theta}$.

At first blush, this may be surprising: don’t we want the recognition and generative posterior distributions to resemble each other? Certainly they will match at the optimum (see again Eq. 6.6). But in the present thought experiment, the recognition model has not been specified and is therefore totally arbitrary! We do not want to improve the joint cross entropy merely by matching the model posterior to an arbitrary distribution. The relative-entropy “regularizer” in Eq. 6.8b prevents this.

An alternative view of the M step.

From a computational perspective, Eqs. 6.8a and 6.8b are no farther from our goal than Eq. 6.7, but neither are they closer, since the problematic marginal cross entropy ($\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right]$) is implicitly still on their right-hand sides. To remove it, let us explicitly change the objective in Eq. 6.8a by additionally penalizing the divergence between the recognition and generative posterior distributions:

$$\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right] + \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\} = \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right] - \mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right]. \qquad (6.9)$$

Equivalently, we have shifted the relative entropy from the right- to the left-hand side of Eq. 6.8b (and shifted the “constant” term in the other direction). The only $\bm{\theta}$-dependent term on the right-hand side of Eq. 6.9 is the complete cross entropy, $\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]$, to compute which we need only the generative model, $\hat{p}(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta}) = \hat{p}(\bm{\hat{x}};\bm{\theta})\,\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})$, and the “hybrid” joint distribution, $\check{p}(\bm{\check{x}}\mid\bm{y})\,p(\bm{y})$. This completely obviates the problematic marginal, $\hat{p}(\bm{\hat{y}};\bm{\theta})$. Both sides of Eq. 6.9 are the free energy (cf. the second line of Eq. 6.5), and therefore optimizing Eq. 6.9 with respect to the parameters $\bm{\theta}$ indeed constitutes the M step of EM derived in the previous section.

But this new objective is not the one we want! Conceptually, it amounts to “giving up on” the regularizer, allowing improvements in joint cross entropy to come by way either of reduced relative entropy between the posterior distributions or of reduced marginal cross entropy.

An alternative view of the E step.

One way to solve this problem is to find another mechanism for shrinking the relative entropy, $\mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\}$, so that optimization of $\bm{\theta}$ won’t be “wasted” on this task. The obvious instrument is the recognition distribution, $\check{p}(\bm{\check{x}}\mid\bm{y})$, which so far is unspecified. In particular, if we use $\check{p}$ to shrink the relative entropy—or, equivalently, the entire left- or right-hand side, since the marginal cross entropy $\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right]$ is independent of $\check{p}$—then $\bm{\theta}$ can be used essentially for optimizing the marginal cross entropy. Indeed, we will have killed two birds with one stone, because a distribution $\check{p}(\bm{\check{x}}\mid\bm{y})$ that resembles the frequently intractable generative-model posterior, $\hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})$, can also be used as a proxy for it. And indeed, optimizing Eq. 6.9 with respect to $\check{p}$ is the E step we derived in the previous section.

Relative vs. cross entropies.

In exchanging Eq. 6.8a for Eq. 6.9 as our objective, we “gave up on” penalizing overly sharp generative posteriors in the M step. Tellingly, the price for this exchange was an E step in which we have to penalize overly sharp recognition posteriors. That is, in the E step, our goal is to improve the fit of the recognition to the generative model. As in our development of the M-step objective, we would like to use the joint cross entropy as a proxy objective, since it is tractable. But the joint cross entropy can be reduced simply by reducing the entropy, rather than the misfit, of the recognition model. To prevent this, low recognition entropy must be penalized. Once again, the optimal amount of posterior uncertainty typically will be neither minimal nor maximal.

The recurrence of negative-entropy penalties is not a coincidence. It is closely connected with the fact that cross entropy is not invariant under reparameterization. For example, the joint cross entropy for continuous latent variables could be increased arbitrarily simply by discretizing at finer bin widths, or (say) by multiplying all values by a large number, $\bm{\hat{X}} \rightarrow \alpha\bm{\hat{X}}$. The negative-entropy terms are required to cancel out this degree of freedom. This cancellation is built into relative entropies, which is why they are invariant under reparameterization. This makes the joint relative entropy, rather than the cross entropy or even the free energy, the most felicitous candidate for reasoning about EM. Furthermore, both E and M steps can be interpreted as minimizing the joint relative entropy (with respect to their respective arguments), whereas only the M step can be interpreted as minimizing the joint cross entropy. (How these minimizations are implemented in practice we consider in the following chapters.)
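
A small numerical illustration of this invariance (our own toy example with arbitrary Gaussians, using SciPy): under the rescaling $\bm{\hat{X}} \rightarrow \alpha\bm{\hat{X}}$, differential entropies shift by $\log\alpha$, while the KL divergence between correspondingly transformed densities is unchanged.

\begin{verbatim}
import numpy as np
from scipy import stats

alpha = 10.0
p = stats.norm(0.0, 1.0)   # stand-in "recognition" density (arbitrary)
q = stats.norm(0.5, 2.0)   # stand-in "generative" density (arbitrary)

# Differential entropy is not reparameterization-invariant:
# the entropy of alpha*X exceeds that of X by log(alpha).
print(p.entropy() + np.log(alpha), stats.norm(0.0, alpha * 1.0).entropy())

# KL divergence is invariant: estimate KL(p || q) by Monte Carlo, before and
# after scaling both densities (and the samples) by alpha.
x = p.rvs(size=200_000, random_state=0)
kl = np.mean(p.logpdf(x) - q.logpdf(x))

p_scaled = stats.norm(0.0, alpha * 1.0)          # law of alpha*X under p
q_scaled = stats.norm(alpha * 0.5, alpha * 2.0)  # law of alpha*X under q
kl_scaled = np.mean(p_scaled.logpdf(alpha * x) - q_scaled.logpdf(alpha * x))

print(kl, kl_scaled)   # approximately equal (~0.35 nats)
\end{verbatim}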

Minimum description length and the “bits-back” argument

EM can also be understood through the lens of a thought experiment involving information transmission. In the original formulation [16], that information is reckoned in terms of the free energy, but the considerations of the previous section suggest using joint relative entropy instead. In any case, the thought experiment works best with (once again) the Gaussian mixture model (see Section 2.1.1 and again Section 7.1 below, and Fig. LABEL:fig:). margin: Put a figure of GMM data here!

The sender-receiver game.

Imagine that I (the “sender”) want to communicate an observed datum, $\bm{y}$, to someone else (the “receiver”). Ideally, I would encode this datum according to the data marginal, $p(\bm{y})$; that is, I would assign longer code words to observations that are less probable under this distribution, at a cost of $-\log p(\bm{y})$ bits (assuming base-2 logarithms throughout). To communicate many such data, I would pay on average $\mathrm{H}_{p}\!\left[\bm{Y}\right]$ bits.

Of course, I do not have direct access to this distribution, so generally the best I can do is assign code words according to the model marginal, $\hat{p}(\bm{\hat{y}};\bm{\theta})$. In this case I pay $-\log\hat{p}(\bm{\hat{y}};\bm{\theta})$ bits for a single datum and $\mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right]$ on average. The penalty for using the wrong distribution—the “surcharge”—is therefore

$$\mathcal{S}_{\text{marginal}} := \mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right] - \mathrm{H}_{p}\!\left[\bm{Y}\right]. \qquad (6.10)$$

But suppose that I cannot get an analytic expression even for the model marginal. (Recall from Chapter 2 that this is the position we are in for many models.) Still, I may be able to fit a latent-variable generative model to the data (as in this chapter). It seems intuitive that I should be able to use this to encode data for transmission to the receiver. For the GMM, for example, I could assign the observation $\bm{y}$ to a class, $\bm{\hat{x}}$, and then transmit this class $\bm{\hat{x}}$ along with the reconstruction error margin: reconstruction error —in this case, the difference between $\bm{y}$ and the class mean. The receiver can use the class assignment, $\bm{\hat{x}}$, along with this error, to recover $\bm{y}$. (Since the receiver must know my encoding scheme in order to recover these, I also pay a one-time cost to communicate the generative model to her. We will assume this is small compared with the data to be communicated and ignore it in what follows.) Assuming we design our encoding optimally, the cost of communicating the class $\bm{\hat{x}}$ is its surprisal under the model prior, $-\log\hat{p}(\bm{\hat{x}};\bm{\theta})$; and the cost of communicating the corresponding reconstruction error for $\bm{y}$ is its surprisal under the model emission (conditioned on $\bm{\hat{x}}$), $-\log\hat{p}(\bm{y}\mid\bm{\hat{x}};\bm{\theta})$.

To work out the total costs of such a scheme, we need to consider precisely how we assign the observation ($\bm{y}$) to a class ($\bm{\hat{x}}$). Prima facie, the optimal candidate class might seem to be the peak of the posterior distribution, conditioned on the observation: $\operatorname{argmax}_{\bm{\hat{x}}}\left\{\hat{p}(\bm{\hat{x}}\mid\bm{y};\bm{\theta})\right\}$—in a GMM, the cluster to which the observed datum is most likely to belong. In that case, $(\bm{\hat{x}},\bm{y})$ pairs occur with probability

$$\check{p}(\bm{\check{x}}\mid\bm{y})\,p(\bm{y}) = \delta\!\left(\bm{\check{x}} - \operatorname*{argmax}_{\bm{\hat{x}}}\left\{\hat{p}(\bm{\hat{x}}\mid\bm{y};\bm{\theta})\right\}\right)p(\bm{y}), \qquad (6.11)$$

and the average costs are the surprisals, $-\log\hat{p}(\bm{\hat{x}};\bm{\theta})$ and $-\log\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})$, averaged under this distribution. We call these, respectively, the code cost margin: code cost, $\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}};\bm{\theta}\right]$, and the reconstruction cost margin: reconstruction cost, $\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{Y}\mid\bm{\check{X}};\bm{\theta}\right]$. The “surcharge” under this scheme is therefore

$$\mathcal{S}_{\text{joint}} := \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}};\bm{\theta}\right] + \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{Y}\mid\bm{\check{X}};\bm{\theta}\right] - \mathrm{H}_{p}\!\left[\bm{Y}\right]. \qquad (6.12)$$

The surcharge for latent-variable models with hard cluster assignments.

The relative sizes of the surcharges are not (perhaps) immediately obvious from Eqs. 6.10 and 6.12. Let us try to rewrite Eq. 6.12 in terms of Eq. 6.10 by factoring the generative model the other way:

$$\begin{aligned} \mathcal{S}_{\text{joint}} &= \mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}}\mid\bm{Y};\bm{\theta}\right] + \mathrm{H}_{p\hat{p}}\!\left[\bm{Y};\bm{\theta}\right] - \mathrm{H}_{p}\!\left[\bm{Y}\right] \\ &= \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\} + \mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right] + \mathcal{S}_{\text{marginal}}. \end{aligned} \qquad (6.13)$$

Now when the averaging distribution $\check{p}$ is a Dirac delta, as in Eq. 6.11, the recognition entropy vanishes and the surcharge reduces to

$$\mathcal{S}_{\text{joint-hard}} = \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\} + \mathcal{S}_{\text{marginal}}. \qquad (6.14)$$

So the excess surcharge of hard cluster assignments is simply the relative entropy of the posteriors. We could try to reduce this cost by choosing a better recognition model, but it seems that this would reimpose a recognition-entropy cost, which is zero only for delta distributions.
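
To see Eq. 6.14 at the level of a single datum, here is a minimal sketch on a toy one-dimensional GMM (the parameters and the observation are arbitrary, and costs are computed in nats rather than bits): the hard-assignment cost exceeds the marginal coding cost by exactly the posterior surprisal of the chosen class, i.e., by the per-datum relative entropy between the Dirac delta and the generative posterior.

\begin{verbatim}
import numpy as np
from scipy import stats

# Toy 1-D Gaussian mixture (parameters chosen arbitrarily for illustration).
pi = np.array([0.3, 0.7])          # p_hat(x; theta)
mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.5])
y = 0.4                            # a single observation

log_prior = np.log(pi)                            # log p_hat(x)
log_emission = stats.norm(mu, sigma).logpdf(y)    # log p_hat(y | x), per class
log_joint = log_prior + log_emission
log_marginal = np.logaddexp.reduce(log_joint)     # log p_hat(y)
log_posterior = log_joint - log_marginal          # log p_hat(x | y)

# Hard assignment: transmit the MAP class plus the reconstruction error.
x_star = np.argmax(log_posterior)
hard_cost = -log_prior[x_star] - log_emission[x_star]  # code + reconstruction cost
marginal_cost = -log_marginal                          # ideal cost under p_hat(y)

# Excess of the hard scheme = posterior surprisal of the chosen class
# (the per-datum KL between the delta and the posterior; cf. Eq. 6.14).
assert np.isclose(hard_cost - marginal_cost, -log_posterior[x_star])
\end{verbatim}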

The surcharge for latent-variable models with soft cluster assignments.

Still, let us be optimistic and consider trying to reduce this relative-entropy cost. Mathematically, this would amount to using a recognition distribution $\check{p}$ in Eq. 6.13 that diverges less from the generative-model posterior, $\hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})$, than the mode-situated Dirac delta does. But what would this mathematical change correspond to in the sender-receiver game? We have seen that “recognizing” the latent states with Dirac deltas at the posterior modes corresponds to hard assignment of latent states, e.g. of cluster identities. Similarly, the use of a non-deterministic (non-Dirac) recognition distribution corresponds to soft assignment of data to latent states. Concretely, we can imagine, for each observation $\bm{y}$, sampling a cluster identity $\bm{\check{x}}$ from the recognition distribution, $\check{p}(\bm{\check{x}}\mid\bm{y})$. For example, we can imagine doing so according to the classic procedure: passing a sample from a uniform distribution through the inverse cumulative distribution function of the recognition model (sketched below). This might seem like a terrible idea, since sampling appears to be “adding noise”—but we shall see shortly how to recoup our loss.
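
The following sketch illustrates the inverse-CDF mechanism for a categorical recognition distribution, and hints at why these bits can later be recovered by the receiver: knowing the sampled class narrows the uniform number to an interval of width $\check{p}(\bm{\check{x}}\mid\bm{y})$. All numbers are arbitrary placeholders of our own.

\begin{verbatim}
import numpy as np

def sample_via_inverse_cdf(probs, u):
    """Map a uniform number u in [0, 1) to a class by inverting the CDF of probs."""
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, u, side="right"))

# A hypothetical recognition distribution over 3 clusters for one observation y.
q_x_given_y = np.array([0.1, 0.6, 0.3])
cdf = np.cumsum(q_x_given_y)

u = 0.42                                       # the "extra bits", uniform on [0, 1)
x = sample_via_inverse_cdf(q_x_given_y, u)     # soft assignment realized as a sample

# Knowing q(x|y) and x, the receiver learns that u lay in the interval
# [cdf[x-1], cdf[x]), whose width is q(x|y): that is -log2 q(x|y) bits about u.
lo = cdf[x - 1] if x > 0 else 0.0
hi = cdf[x]
print(x, (lo, hi), -np.log2(q_x_given_y[x]))
\end{verbatim}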

On average across all observations, communicating these “softly assigned” cluster identities will cost $\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{\check{X}};\bm{\theta}\right]$, and communicating the resulting reconstruction errors (under the generative model) will cost $\mathrm{H}_{(p\check{p})\hat{p}}\!\left[\bm{Y}\mid\bm{\check{X}};\bm{\theta}\right]$. So the surcharge is still given by Eqs. 6.12 and 6.13, just with a different averaging distribution $\check{p}$. Presumably with a good choice of $\check{p}(\bm{\check{x}}\mid\bm{y})$ we can drive the posterior relative entropy lower than with a Dirac delta, but at the cost of a non-zero recognition entropy—the price of randomly assigning observations to latent states. Is there a way to recoup these lost bits?

Getting bits back.

For each “message” I send, the receiver can recover $\bm{y}$ from the sample $\bm{\check{x}}$ and the reconstruction error. But she can also recover an estimate of the number used to sample $\bm{\check{x}}$—by (in our concrete example) computing the cumulative probability of $\bm{\check{x}}$ under the recognition distribution, $\check{p}(\bm{\check{x}}\mid\bm{y})$. The amount of information communicated per sample pair $(\bm{\check{x}},\bm{y})$ is precisely $-\log\check{p}(\bm{\check{x}}\mid\bm{y})$, so the average value of our “refund” is $\mathrm{H}_{(p\check{p})}\!\left[\bm{\check{X}}\mid\bm{Y}\right]$, the recognition entropy. And clearly, the uniformly distributed numbers need not have been random; they could be any data we also wish to send to the receiver. (We might have to transform them first to distribute them uniformly; this is an implementation detail.) Subtracting our refund from both sides of Eq. 6.13, we see that the surcharge under the sender-receiver game is

$$\begin{aligned} \mathcal{S}_{\text{bits-back}} &:= \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\} + \mathcal{S}_{\text{marginal}} \\ &= \mathrm{D}_{\mathrm{KL}}\!\left\{\check{p}(\bm{\check{X}}\mid\bm{Y}) \,\middle\|\, \hat{p}(\bm{\check{X}}\mid\bm{Y};\bm{\theta})\right\} + \mathrm{D}_{\mathrm{KL}}\!\left\{p(\bm{Y}) \,\middle\|\, \hat{p}(\bm{Y};\bm{\theta})\right\}; \end{aligned} \qquad (6.15)$$

in short, the joint relative entropy (cf. Eq. 6.6).

[Figure 6.1 appears here: a two-panel diagram of the sender and receiver pipelines, with nodes $\bm{Z}$, $\bm{\check{X}}$, $\bm{E}$, and $\bm{Y}$ connected by the recognition model $\check{p}(\bm{\check{x}}\mid\bm{y})$, the generative prior $\hat{p}(\bm{\hat{x}};\bm{\theta})$, and the emission $\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})$.]
Figure 6.1: The bits-back procedure. The sender (left) stochastically encodes observations $\bm{Y}$ into latent codes $\bm{\check{X}}$ with the recognition model $\check{p}(\bm{\check{x}}\mid\bm{y})$, with the randomness provided by some “extra bits” $\bm{Z}$. Under any generative model $\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})$, $\bm{Y}$ cannot be reconstructed from $\bm{\check{X}}$ without some error, $\bm{E}$. The sender encodes these errors (with $\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})$) and transmits them to the receiver. The latent codes are likewise approximately uniformized, in this case with the generative prior $\hat{p}(\bm{\hat{x}};\bm{\theta})$, and transmitted to the receiver. The receiver (right) in turn transforms these messages back into latent codes (again with the generative prior), and uses these along with the error terms to reconstruct $\bm{Y}$ (using $\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})$). She then proceeds to use the reconstructed $\bm{Y}$, along with the latent codes $\bm{\check{X}}$, to recover the “extra bits,” $\bm{Z}$.

Comparing Eqs. 6.14 and 6.15, we see that soft and hard assignments of observations ($\bm{y}$) to latent states ($\bm{\check{x}}$) incur the same additional surcharge, the relative entropy between the recognition model and the generative-model posterior—although in the case of soft assignments, I have to “apply for a refund” to reach this minimum, by making use of the extra channel of information. Furthermore, hard assignments seldom minimize the posterior relative entropy, since the posterior distribution under the generative model will only be deterministic in very special cases (we discuss these in Chapter 9). In contrast, under soft assignments, the posterior relative entropy can even be driven to zero, so long as the recognition distribution can be set equal to the generative-model posterior, $\check{p}(\bm{\check{x}}\mid\bm{y}) = \hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})$. In this case, encoding the observations with the model joint distribution, instead of the model marginal, entails no excess surcharge whatsoever: $\mathcal{S}_{\text{bits-back}} = \mathcal{S}_{\text{marginal}}$. (And if the model is good, $\hat{p}(\bm{y};\bm{\theta}) \approx p(\bm{y})$ for all $\bm{y}$, there is no surcharge at all.) Even in the (frequent) case that the posterior distribution cannot be derived in closed form, flexible recognition distributions will generally be cheaper than any delta distribution—as long as we remember to claim some bits back.
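
Extending the earlier hard-assignment sketch, the following checks this bits-back accounting on the same toy GMM: with the recognition distribution set equal to the exact posterior, the code cost plus the reconstruction cost, minus the entropy refund, equals the marginal coding cost (again in nats; all numbers arbitrary).

\begin{verbatim}
import numpy as np
from scipy import stats

# Same toy 1-D Gaussian mixture as before (arbitrary parameters).
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.5])
y = 0.4

log_prior = np.log(pi)
log_emission = stats.norm(mu, sigma).logpdf(y)
log_marginal = np.logaddexp.reduce(log_prior + log_emission)
q = np.exp(log_prior + log_emission - log_marginal)   # recognition = exact posterior

code_cost = -(q * log_prior).sum()       # E_q[-log p_hat(x)]
recon_cost = -(q * log_emission).sum()   # E_q[-log p_hat(y | x)]
refund = -(q * np.log(q)).sum()          # recognition entropy: the bits we get back

# No excess surcharge: soft assignment plus bits back matches marginal coding.
assert np.isclose(code_cost + recon_cost - refund, -log_marginal)
\end{verbatim}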

Practical applications of the bits-back argument to compression.

See [50, 28].