Chapter 7 Learning Invertible Generative Models
We saw in Chapter 2 that computing a generative model’s posterior is frequently impossible, because computing the normalizer requires marginalizing out the latent variables, an integral (or sum) that generally has no closed form.
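In symbols (generic notation rather than the book’s), for observations $\boldsymbol{x}$ and latent variables $\boldsymbol{z}$ the offending normalizer is the marginal distribution
\[
p(\boldsymbol{x};\boldsymbol{\theta}) \;=\; \int p(\boldsymbol{x}\mid\boldsymbol{z};\boldsymbol{\theta})\, p(\boldsymbol{z};\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{z},
\qquad
p(\boldsymbol{z}\mid\boldsymbol{x};\boldsymbol{\theta}) \;=\; \frac{p(\boldsymbol{x}\mid\boldsymbol{z};\boldsymbol{\theta})\, p(\boldsymbol{z};\boldsymbol{\theta})}{p(\boldsymbol{x};\boldsymbol{\theta})},
\]
and for most models of interest the integral (or, for discrete latent variables, the sum) is intractable.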
In this classical version of EM, the two optimizations of the joint relative entropy (JRE) are carried out in consecutive steps.
At the discriminative step (the E step), the JRE is minimized with respect to the recognition model while the generative model is held fixed. Under exact inference this minimization can be carried out exactly: the optimal recognition distribution is simply the generative model’s own posterior, evaluated at the current parameters, and choosing it drives the posterior relative entropy (PRE) to zero.
The conclusion follows because the marginal relative entropy (MRE) doesn’t depend on the recognition model (it measures only the mismatch between the data distribution and the generative marginal), whereas the PRE is non-negative and vanishes precisely when the recognition posterior matches the generative posterior.
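To make the bookkeeping explicit (again in generic notation, with $q(\boldsymbol{z}\mid\boldsymbol{x})$ the recognition model and $p(\boldsymbol{z},\boldsymbol{x};\boldsymbol{\theta})$ the generative model), the decomposition at work is
\[
\text{JRE}
\;=\;
\underbrace{\mathrm{KL}\!\left( p_{\text{data}}(\boldsymbol{X}) \,\middle\|\, p(\boldsymbol{X};\boldsymbol{\theta}) \right)}_{\text{MRE}}
\;+\;
\underbrace{\mathbb{E}_{p_{\text{data}}}\!\left[ \mathrm{KL}\!\left( q(\boldsymbol{Z}\mid\boldsymbol{X}) \,\middle\|\, p(\boldsymbol{Z}\mid\boldsymbol{X};\boldsymbol{\theta}) \right) \right]}_{\text{PRE}\;\ge\;0}.
\]
Only the second term involves the recognition model, so setting $q(\boldsymbol{z}\mid\boldsymbol{x}) = p(\boldsymbol{z}\mid\boldsymbol{x};\boldsymbol{\theta})$ zeroes it out, which is the best that can be done.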
[Algorithm box: EM algorithm under exact inference]
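In a single pass, the alternation can be summarized as follows (a sketch in generic notation, not the box’s exact statement):
\[
\text{E step (discriminative):}\quad
q^{(t)}(\boldsymbol{z}\mid\boldsymbol{x}) \;=\; p(\boldsymbol{z}\mid\boldsymbol{x};\boldsymbol{\theta}^{(t)}),
\qquad
\text{M step (generative):}\quad
\boldsymbol{\theta}^{(t+1)} \;=\; \operatorname*{argmin}_{\boldsymbol{\theta}}\; \text{JRE}\!\left(q^{(t)},\boldsymbol{\theta}\right),
\]
iterated until the parameters stop changing.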
That is, we will carry out the optimization in Eq. 6.5 with an iterative process, at each iteration of which we take expectations under the generative posterior from the previous iteration.
For example, to update our estimate of the mean of the GMM, we will indeed use Eq. 6.4, except that the posterior under which the expectations are taken will be evaluated at the parameters from the previous iteration.
This eliminates the dependence of the right-hand side on the very parameters we are trying to compute, turning what would otherwise be an implicit, fixed-point characterization into an explicit update rule.
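To make the procedure concrete, here is a minimal NumPy/SciPy sketch of one EM iteration for a Gaussian mixture model. The function name and array layout are illustrative assumptions rather than the book’s notation: the responsibilities (the E step) are computed at the previous iteration’s parameters, and the M step then updates the mixture weights, means, and covariances in closed form, the mean update playing the role of the rule referred to above.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a Gaussian mixture model (illustrative sketch).

    X: (N, D) data array.
    pis: (K,) mixing proportions from the previous iteration.
    mus: (K, D) component means from the previous iteration.
    Sigmas: (K, D, D) component covariances from the previous iteration.
    Returns the updated (pis, mus, Sigmas).
    """
    N, D = X.shape
    K = len(pis)

    # E step: responsibilities, i.e. the generative posterior over the class
    # label, evaluated at the *previous* iteration's parameters.
    resp = np.stack(
        [pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X) for k in range(K)],
        axis=1)                                   # shape (N, K)
    resp /= resp.sum(axis=1, keepdims=True)

    # M step: closed-form updates; the expectations are taken under `resp`,
    # which no longer depends on the parameters being updated.
    Nk = resp.sum(axis=0)                         # effective counts, shape (K,)
    new_pis = Nk / N
    new_mus = (resp.T @ X) / Nk[:, None]          # weighted means, shape (K, D)
    new_Sigmas = np.empty_like(Sigmas)
    for k in range(K):
        diff = X - new_mus[k]
        new_Sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
        new_Sigmas[k] += 1e-6 * np.eye(D)         # small jitter for stability
    return new_pis, new_mus, new_Sigmas

Iterating em_step to convergence is the closed-form version of the M step described below; with a gradient-based M step, the three closed-form updates would instead be replaced by a few gradient steps on the expected complete-data log-likelihood.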
As we saw in the previous chapter, this procedure is guaranteed to decrease (or, more precisely, not to increase) an upper bound on the loss we actually care about, the MRE. In this version of the algorithm, we can say more. At each E step, the PRE in Eq. 6.6 is not merely reduced but eliminated, so at the start of every M step the bound is tight (Fig. LABEL:fig:EMtightBound). That means that any decrease in JRE at this point entails a decrease in MRE, that is, a better model for the data. Typically, decreases in JRE will also be accompanied by an increase in the PRE. But far from being a bad thing, this corresponds to an even larger decrease in MRE than the decrease in JRE (see again Fig. LABEL:fig:EMtightBound). And at the next E step, the PRE is again eliminated: the bound is again made tight.
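Written out, the standard chain of inequalities behind this guarantee is (with $q^{(t)}$ the recognition model produced by the E step at parameters $\boldsymbol{\theta}^{(t)}$; generic notation again)
\[
\text{MRE}(\boldsymbol{\theta}^{(t+1)})
\;\le\;
\text{JRE}\!\left(q^{(t)}, \boldsymbol{\theta}^{(t+1)}\right)
\;\le\;
\text{JRE}\!\left(q^{(t)}, \boldsymbol{\theta}^{(t)}\right)
\;=\;
\text{MRE}(\boldsymbol{\theta}^{(t)}).
\]
The first inequality holds because the PRE is non-negative, the second because the M step does not increase the JRE, and the final equality because the E step made the bound tight. The gap in the first inequality is exactly the amount by which the PRE opens back up during the M step.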
In the examples we consider next, the M step is carried out in closed form, so every parameter update either decreases the MRE or does nothing at all. In contrast, for models in which the M step is carried out by gradient descent, a decrease in JRE need not correspond to a decrease in MRE: as the JRE decreases over the course of the M step, the PRE can open back up (the bound can loosen), and therefore further decreases in JRE can correspond to the bound retightening rather than to the MRE decreasing. Indeed, the MRE can increase during this period of bound retightening. But it can never increase above its value at the beginning of the M step, when the bound was tight.
Applying EM.
The EM “algorithm” is in fact a kind of meta-algorithm, or recipe, for estimating densities with latent variables. To derive actual algorithms we need to apply EM to specific graphical models. In the following sections we apply EM to some of the most classic latent-variable models.
Our recipe instructs us to minimize the joint relative entropy between the joint distribution assembled from the data distribution and the recognition model, on the one hand, and the generative model’s joint distribution, on the other.
as they do to a single step in EM, in which case
In either case, the entropy of the distribution in the first argument of the relative entropy does not depend on the generative model’s parameters, so it can be dropped from the loss: minimizing the JRE with respect to those parameters is the same as minimizing the cross entropy, i.e., maximizing the expected complete-data log-likelihood.
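In symbols (generic notation), with the recognition model held fixed at $q$,
\[
\text{JRE}(q, \boldsymbol{\theta})
\;=\;
\underbrace{-\,\mathbb{E}_{p_{\text{data}}(\boldsymbol{X})\,q(\boldsymbol{Z}\mid\boldsymbol{X})}\!\left[ \log p(\boldsymbol{Z}, \boldsymbol{X}; \boldsymbol{\theta}) \right]}_{\text{cross entropy}}
\;-\;
\underbrace{\mathbb{H}\!\left[ p_{\text{data}}(\boldsymbol{X})\, q(\boldsymbol{Z}\mid\boldsymbol{X}) \right]}_{\text{independent of } \boldsymbol{\theta}},
\]
so the generative-parameter update amounts to maximizing the expected complete-data log-likelihood.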