6.3 Expectation-Maximization
6.3.1 Derivation of EM
This isn’t quite true.
Working with the joint did allow the parameters inside the expectation to decouple.
The present problem is a consequence of the parameters in the expectation itself, in particular the model posterior,
Approximating the posterior distribution.
But here we will not yet give up hope of a cleaner solution.
Consider this seemingly artless proposal: simply take the expectation under a different distribution,
where in the bottom lines the expectation is under this recognition distribution††margin: recognition distribution .²
² This term for—and the idea of—a separate model for inference originates in [4].
But, as indicated by the final equality, and in contrast to the correct gradient, this incorrect gradient is indeed the gradient of a joint (“complete”) cross entropy: the derivative can pass outside the expectation, which no longer depends on the generative-model parameters.
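To make the exchange explicit, here is the relevant manipulation in generic notation (a sketch; the symbols $q$ for the recognition distribution and $\hat p(\cdot;\theta)$ for the generative model are our assumptions, not necessarily the text's):
\[
\frac{\partial}{\partial\theta}\,
\mathbb{E}_{q(z\mid y)}\!\left[-\log \hat p(z,y;\theta)\right]
= \mathbb{E}_{q(z\mid y)}\!\left[-\frac{\partial}{\partial\theta}\log \hat p(z,y;\theta)\right],
\]
because, unlike the model posterior, $q$ carries no dependence on $\theta$; no such exchange is possible when the expectation is taken under $\hat p(z\mid y;\theta)$.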
The central intuition behind the algorithms that follow is to fit marginal densities by minimizing the joint or complete cross entropy,
Relative entropy, cross entropy, and free energy.
Let us prove this last claim.
We begin by noting that the following objectives differ only by terms constant in the parameters
We can interpret these quantities:
• The first is a relative entropy, in this case between a “hybrid” joint distribution—the product of the recognition distribution and the data marginal—and the generative-model joint. Thus our optimization can be expressed in terms of the fundamental loss function proposed in Section 4.2.
• The second we call the free energy††margin: free energy after its counterpart in statistical physics. It differs from the relative entropy only by the entropy of the data. We introduce this quantity primarily because, in the machine-learning literature, EM-like learning procedures are most frequently derived in terms of something like the free energy. More precisely, the proofs are written in terms of the negative of the free energy before averaging under the data. This is known as the evidence lower bound (ELBo)††margin: evidence lower bound (ELBo). The name will become clear below.
• The quantity in the third line is the joint cross entropy that we lately proposed to optimize in lieu of the marginal cross entropy.
Clearly, minimizing any of these objectives is equivalent, but the proof is most elegant for the joint relative entropy, which can be decomposed as
(The reader should verify this.††margin: Exercise LABEL:ex:jointKLdecomposition ) On the right-hand side, the first (marginal) relative entropy is the quantity that we actually want to minimize. The second (posterior) relative entropy is perforce non-negative (see Section 4.2). So the left-hand side is an upper bound on the true objective. And the bound can be tightened by optimizing the joint relative entropy with respect to the recognition distribution, i.e., decreasing the second term on the right. If it can be driven to zero, the bound will be tight. (Precisely the same relationship holds for the free energy with respect to the marginal cross entropy.††margin: Exercise LABEL:ex:freeEnergyUpperBound )
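The verification is short in generic notation (our own, assumed for this sketch): write $p(y)$ for the data distribution, $q(z\mid y)$ for the recognition distribution, and $\hat p(z,y;\theta)$ for the generative model. Then
\[
\begin{aligned}
\mathrm{D}_{\mathrm{KL}}\!\left\{ q(z\mid y)\,p(y) \,\middle\|\, \hat p(z,y;\theta) \right\}
&= \mathbb{E}_{p(y)}\mathbb{E}_{q(z\mid y)}\!\left[ \log\frac{q(z\mid y)\,p(y)}{\hat p(z,y;\theta)} \right] \\
&= \mathbb{E}_{p(y)}\!\left[ \log\frac{p(y)}{\hat p(y;\theta)} \right]
 + \mathbb{E}_{p(y)}\mathbb{E}_{q(z\mid y)}\!\left[ \log\frac{q(z\mid y)}{\hat p(z\mid y;\theta)} \right] \\
&= \mathrm{D}_{\mathrm{KL}}\!\left\{ p(y) \,\middle\|\, \hat p(y;\theta) \right\}
 + \mathbb{E}_{p(y)}\!\left[ \mathrm{D}_{\mathrm{KL}}\!\left\{ q(z\mid y) \,\middle\|\, \hat p(z\mid y;\theta) \right\} \right],
\end{aligned}
\]
i.e., the marginal relative entropy plus the averaged posterior relative entropy, as claimed.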
Let us make the procedure explicit. We minimize the joint relative entropy with respect to its two arguments:
EM Algorithm
Repeat until convergence:
• Discriminative (“E”) step: minimize the joint relative entropy with respect to the recognition distribution, holding the generative-model parameters fixed.
• Generative (“M”) step: minimize the joint relative entropy with respect to the generative-model parameters, holding the recognition distribution fixed.
The parameters can be initialized randomly,
What remains to be specified now is a form for the recognition distribution and (in turn) a method for optimizing it. These choices lead to different algorithms. In this text, we refer to all such algorithms as expectation-maximization††margin: expectation-maximization (EM). In fact, only one special case (discussed next) corresponds to the original EM algorithm of Dempster, Laird, and Rubin [5], but it was later generalized to the procedure just described, under the same name, by Neal and Hinton [34].
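That special case can be previewed with a minimal sketch in Python/NumPy for a one-dimensional Gaussian mixture model, where the E step can set the recognition distribution to the exact model posterior and the M step has a closed form. All names here (em_gmm, the variable choices, the initialization) are illustrative assumptions rather than anything prescribed by this text.

```python
import numpy as np
from scipy.stats import norm

def em_gmm(y, K, n_iters=100, seed=0):
    """Minimal EM sketch for a 1-D Gaussian mixture with an exact E step."""
    rng = np.random.default_rng(seed)
    # random initialization of the generative-model parameters
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(y, K, replace=False)
    sigma = np.full(K, y.std())

    for _ in range(n_iters):
        # E ("discriminative") step: recognition distribution = exact posterior p(z | y; theta)
        log_joint = np.log(pi) + norm.logpdf(y[:, None], mu, sigma)          # shape (N, K)
        log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        r = np.exp(log_post)                                                 # responsibilities

        # M ("generative") step: closed-form minimization of the joint cross entropy
        Nk = r.sum(axis=0) + 1e-12                                           # guard against empty components
        pi = Nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk)

    return pi, mu, sigma
```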
Nomenclature.
The E and M in the “EM algorithm” originally stood for the two optimization steps that we have called “discriminative” and “generative.” Historically, the discriminative optimization was known as the “E step” because it referred to taking expectations under, rather than finding via optimization, the recognition distribution. This is the expectation in the final line of Eq. 6.5 (ignoring the additional average under the data distribution). Indeed, as we shall see shortly, the process of computing the expectations can be rather involved, since it requires running an inference algorithm (recall, for example, the smoothers of Section 2.2). Nevertheless, the “E step” has come to refer to the optimization, as this plays a larger role in modern variants of EM—and this allows us to see EM as a form of coordinate descent in the joint relative entropy.
Since EM was originally written in terms of expected log-likelihood rather than cross entropy, the “M step” originally referred to a maximization, but it will equally well refer to a minimization as in our setup.
6.3.2 Information-theoretic perspectives on EM
We now derive EM from a slightly different, more informal, perspective. This subsection (6.3.2) can be skipped without loss of continuity.
The relationship between the marginal and joint cross entropies.
We saw above (Eq. 6.3) that the marginal and joint cross entropies do not have the same gradient with respect to the model parameters.
Thus, in order to use the joint cross entropy as a proxy for the marginal cross entropy, we would have to “regularize” the optimization with the negative entropy of the (generative) posterior, averaged under the data.³
³ This conditional entropy should not be confused with
The optimal inferential vagueness (posterior entropy) will seldom be minimal (0, for discrete latent variables), but neither will it be maximal.
Rather, the posterior entropy should be allowed to take on whatever value provides the best fit to the observations.
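For concreteness, the relationship invoked above can be written out in generic notation (again our own assumption): with the inner expectation taken under the model's own posterior,
\[
\underbrace{\mathbb{E}_{p(y)}\!\left[-\log \hat p(y;\theta)\right]}_{\text{marginal cross entropy}}
= \underbrace{\mathbb{E}_{p(y)}\mathbb{E}_{\hat p(z\mid y;\theta)}\!\left[-\log \hat p(z,y;\theta)\right]}_{\text{joint cross entropy}}
\;-\; \mathbb{E}_{p(y)}\!\left[\mathrm{H}\!\left[\hat p(z\mid y;\theta)\right]\right],
\]
so the joint cross entropy equals the true objective only after adding the negative of the averaged posterior entropy, which is exactly the “regularizer” in question.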
Averaging under a recognition distribution.
EM has two ingredients that distinguish it from direct minimization of
(6.8a)
Note well that this important distinction is conveyed by some subtle notational differences:
The generative-model joint surprisal
What are the implications of regularizing with a cross entropy?
As in Eq. 6.7, this prevents false solutions that simply concentrate posterior probability at single points.
(Otherwise, as long as these points have support under the recognition model, the posterior cross entropy could be driven to arbitrarily large negative values.)
However, in contrast to Eq. 6.7, this regularizer can also be seen as discouraging the generative posterior,
in which case the regularizer becomes the relative entropy (KL divergence) of the recognition and generative posterior distributions, but the gradient is unchanged since
At first blush, this may be surprising: don’t we want the recognition and generative posterior distributions to resemble each other? Certainly they will match at the optimum (see again Eq. 6.6). But in the present thought experiment, the recognition model has not been specified and is therefore totally arbitrary! We do not want to improve the joint cross entropy merely by matching the model posterior to an arbitrary distribution. The relative-entropy “regularizer” in Eq. 6.8b prevents this.
An alternative view of the M step.
From a computational perspective, Eqs. 6.8a and 6.8b are no farther from our goal than Eq. 6.7, but neither are they closer, since the problematic marginal cross entropy (
Equivalently, we have shifted the relative entropy from the right- to the left-hand side of Eq. 6.8b (and shifted the “constant” term in the other direction).
The only
But this new objective is not the one we want! Conceptually, it amounts to “giving up on” the regularizer, allowing improvements in joint cross entropy to come by way either of reduced relative entropy between the posterior distributions or of reduced marginal cross entropy.
An alternative view of the E step.
One way to solve this problem is to find another mechanism for shrinking the relative entropy,
Relative vs. cross entropies.
In exchanging Eq. 6.8a for Eq. 6.9 as our objective, we “gave up on” penalizing overly sharp generative posteriors in the M step. Tellingly, the price for this exchange was an E step in which we have to penalize overly sharp recognition posteriors. That is, in the E step, our goal is to improve the fit of the recognition to the generative model. As in our development of the M step objective, we would like to use the joint cross entropy as a proxy objective, since it is tractable. But the joint cross entropy can be reduced simply by reducing the entropy, rather than the misfit, of the recognition model. To prevent this, the recognition entropy must be penalized. Once again, the optimal amount of posterior uncertainty typically will be neither minimal nor maximal.
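In generic notation (our assumption), the entropy-penalized objective for the E step, viewed as a function of the recognition distribution $q$ alone, is
\[
\mathbb{E}_{q(z\mid y)}\!\left[-\log \hat p(z,y;\theta)\right] - \mathrm{H}\!\left[q(z\mid y)\right]
= \mathrm{D}_{\mathrm{KL}}\!\left\{ q(z\mid y) \,\middle\|\, \hat p(z\mid y;\theta) \right\} - \log \hat p(y;\theta),
\]
and since the final term does not depend on $q$, minimizing the penalized joint cross entropy over $q$ is precisely minimizing the misfit of the recognition distribution to the generative posterior.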
The recurrence of negative-entropy penalties is not a coincidence.
It is closely connected with the fact that cross entropy is not invariant under reparameterization.
For example, the joint cross entropy for continuous latent variables could be increased arbitrarily simply by discretizing at finer bin widths, or (say) multiplying all values by a large number,
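A quick numerical illustration of this non-invariance (a sketch under our own assumptions, with a standard normal playing both roles): rescaling the latent variable by a factor $a$ shifts the differential cross entropy by $\log a$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a = 100.0                                   # a large rescaling of the latent variable
z = rng.standard_normal(100_000)            # samples from q = N(0, 1)

# differential cross entropy -E_q[log p], with p = N(0, 1)
ce = -norm.logpdf(z).mean()

# the same quantity after the reparameterization z -> a*z (both densities transformed)
ce_scaled = -norm.logpdf(a * z, scale=a).mean()

print(ce_scaled - ce, np.log(a))            # both ~4.605: the cross entropy grew by log(a)
```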
Minimum description length and the “bits-back” argument
EM can also be understood through the lens of a thought experiment involving information transmission. In the original formulation [16], that information is reckoned in terms of the free energy, but the considerations of the previous section suggest using joint relative entropy instead. In any case, the thought experiment works best with (once again) the Gaussian mixture model (see Section 2.1.1 and again Section 7.1 below, and Fig. LABEL:fig:). ††margin: Put a figure of GMM data here!
The sender-receiver game.
Imagine that I (the “sender”) want to communicate an observed datum,
Of course, I do not have direct access to this distribution, so generally the best I can do is assign code words according to the model marginal,
But suppose that I cannot get an analytic expression even for the model marginal.
(Recall from Chapter 2 that this is the position we are in for many models.)
Still, I may be able to fit a latent-variable generative model to the data (as in this chapter).
It seems intuitive that I should be able to use this to encode data for transmission to the receiver.
For the GMM, for example, I could assign the observation
To work out the total costs of such a scheme, we need to consider precisely how we assign the observation (
and the average costs are the surprisals,
The surcharge for latent-variable models with hard cluster assignments.
The relative sizes of the surcharges are not (perhaps) immediately obvious from Eqs. 6.10 and 6.12. Let us try to rewrite Eq. 6.12 in terms of Eq. 6.10 by factoring the generative model the other way:
Now when the averaging distribution
So the excess surcharge of hard cluster assignments is simply the relative entropy of the posteriors. We could try to reduce this cost by choosing a better recognition model, but it seems that this would reimpose a recognition-entropy cost, which is zero only for delta distributions.
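The identity can be checked numerically. The sketch below uses a hypothetical two-component, one-dimensional GMM (all numbers are made up for illustration) and confirms that the cost of transmitting a hard assignment plus the observation exceeds the ideal cost by exactly the negative log posterior probability of the assigned cluster, i.e., the relative entropy of a delta distribution at the assignment from the model posterior.

```python
import numpy as np
from scipy.stats import norm

# hypothetical two-component GMM (illustrative numbers only)
pi = np.array([0.3, 0.7])                            # mixing weights p(z)
mu = np.array([-2.0, 1.0])
sigma = np.array([1.0, 0.5])
y = 0.2                                              # a single observation

log_joint = np.log(pi) + norm.logpdf(y, mu, sigma)   # log p(z, y) for each z
log_marg = np.logaddexp.reduce(log_joint)            # log p(y)
log_post = log_joint - log_marg                      # log p(z | y)

z_hat = np.argmax(log_post)                          # hard ("MAP") cluster assignment
hard_cost = -log_joint[z_hat]                        # -log p(z_hat) - log p(y | z_hat)
ideal_cost = -log_marg                               # -log p(y)

# the surcharge equals -log p(z_hat | y), the KL of a delta at z_hat from the posterior
print(hard_cost - ideal_cost, -log_post[z_hat])      # the two numbers agree
```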
The surcharge for latent-variable models with soft cluster assignments.
Still, let us be optimistic and consider trying to reduce cost 1.
Mathematically, this would amount to using a recognition distribution
On average across all observations, communicating these “softly assigned” cluster identities will cost
Getting bits back.
For each “message” I send, the receiver can recover
in short, the joint relative entropy (cf. Eq. 6.6).
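Continuing the same hypothetical two-component example, the sketch below tallies the bits-back accounting for a single observation: the gross cost of softly assigning and transmitting, minus the bits the receiver can recover, exceeds the ideal cost by the relative entropy between the recognition distribution and the model posterior. Averaged over the data and measured against the entropy of the data, this surcharge is (as stated above) the joint relative entropy.

```python
import numpy as np
from scipy.stats import norm

# same hypothetical two-component GMM as above
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sigma = np.array([1.0, 0.5])
y = 0.2

log_joint = np.log(pi) + norm.logpdf(y, mu, sigma)   # log p(z, y)
log_marg = np.logaddexp.reduce(log_joint)            # log p(y)
log_post = log_joint - log_marg                      # log p(z | y)

q = np.array([0.5, 0.5])                             # an arbitrary recognition distribution

gross_cost = -(q * log_joint).sum()                  # E_q[-log p(z) - log p(y | z)]
bits_back = -(q * np.log(q)).sum()                   # H[q]: bits the receiver recovers
net_cost = gross_cost - bits_back

kl_q_post = (q * (np.log(q) - log_post)).sum()       # KL(q || p(z | y))
print(net_cost, -log_marg + kl_q_post)               # agree; the surcharge vanishes at q = posterior
```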
Comparing Eqs. 6.14 and 6.15, we see that soft and hard assignments of observations (