8.1 Gaussian recognition distributions and sparse coding
For continuous random latent variables, the simplest choice is perhaps the (multivariate) normal distribution:
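For concreteness, and in purely illustrative notation (an observation $\bm{y}$, latent variables $\bm{x}$, a mean $\bm{\mu}$, a covariance $\bm{\Sigma}$, and recognition parameters $\bm{\phi}$ are assumptions of this sketch rather than fixed conventions), such a recognition distribution takes the form
\[
q\big(\bm{x} \mid \bm{y};\, \bm{\phi}\big) \;=\; \mathcal{N}\!\big(\bm{x};\; \bm{\mu}(\bm{y}),\; \bm{\Sigma}(\bm{y})\big),
\]
with observation-dependent mean $\bm{\mu}(\bm{y})$ and precision $\bm{\Sigma}^{-1}(\bm{y})$.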
Note that we are allowing the mean and precision to depend on the observations, although we have not as yet specified how. A single symbol ($\bm{\phi}$ in the sketch above) stands for all the currently unspecified parameters of the recognition distributions.
To derive an EM algorithm under this recognition model, we start as usual with the joint relative entropy. This consists of (among other things) the entropy of the recognition model, which under our current assumptions is just the entropy of a multivariate normal distribution, averaged under the data:
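In the sketch notation, this averaged entropy is the standard multivariate-normal expression
\[
\mathbb{E}_{\bm{Y}}\Big[ \mathrm{H}\big[q(\bm{X}\mid\bm{Y};\bm{\phi})\big] \Big]
\;=\; \mathbb{E}_{\bm{Y}}\Big[ \tfrac{1}{2}\log\big|2\pi e\,\bm{\Sigma}(\bm{Y})\big| \Big].
\]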
So the joint relative entropy is
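Schematically, writing $E_{\bm{\theta}}(\bm{x}, \bm{y})$ for the energy (negative logarithm) of the model joint and absorbing terms that depend on neither $\bm{\phi}$ nor $\bm{\theta}$ into a constant, the quantity denoted by Eq. 8.1 should take a form like
\[
\mathcal{L}(\bm{\phi}, \bm{\theta})
\;=\; \mathbb{E}_{\bm{Y}}\Big[ -\tfrac{1}{2}\log\big|\bm{\Sigma}(\bm{Y})\big|
\;+\; \mathbb{E}_{q}\big[ E_{\bm{\theta}}(\bm{X}, \bm{Y}) \big] \Big] \;+\; \text{const},
\]
the negative entropy of the recognition model plus the average energy under it.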
Local parameters
Now let us assume that we will learn (in the E step) one pair of parameters, a mean and a covariance, for each observation.
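In the sketch notation, with $n = 1, \dots, N$ indexing the observations, this amounts to assuming
\[
\bm{\phi} \;=\; \big\{ \bm{\mu}_n, \bm{\Sigma}_n \big\}_{n=1}^{N},
\qquad
q\big(\bm{x} \mid \bm{y}_n;\, \bm{\phi}\big) \;=\; \mathcal{N}\!\big(\bm{x};\; \bm{\mu}_n,\; \bm{\Sigma}_n\big).
\]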
The advantage of this approach is flexibility: the mean vector and precision matrix for each datum can take on arbitrary values. The (potential) disadvantage is inefficiency: Unless the parameters can be derived in closed form, this approach will not scale well to a large number of observations. Indeed, the recognition distribution for a new (or held-out) observation cannot be fit any faster, or predicted, on the basis of the previous observations. We shall explore alternatives shortly.
The E step.
It is instructive to begin the E-step optimization, even without having specified a generative model. Replacing the expectation over the data in Eq. 8.1 with a sample average and differentiating with respect to the recognition parameters ($\bm{\mu}_n$ and $\bm{\Sigma}_n$, in the notation of the sketch above) associated with a single datum ($\bm{y}_n$) clearly eliminates all the terms that depend on the parameters for other data. So we can drop the expectation over the data altogether. The gradient with respect to a recognition mean vector is then [36]
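presumably a chain of identities along these lines (a sketch, with $q_n$ the recognition distribution for datum $\bm{y}_n$):
\[
\begin{aligned}
\frac{\partial}{\partial \bm{\mu}_n}\Big( -\mathrm{H}[q_n] + \mathbb{E}_{q_n}\big[{-\log p(\bm{X}, \bm{y}_n; \bm{\theta})}\big] \Big)
&= \frac{\partial}{\partial \bm{\mu}_n}\,\mathbb{E}_{q_n}\big[{-\log p(\bm{X}, \bm{y}_n; \bm{\theta})}\big] \\
&= \mathbb{E}_{q_n}\!\left[ -\frac{\partial \log p}{\partial \bm{x}}\big(\bm{X}, \bm{y}_n; \bm{\theta}\big) \right] \\
&= \mathbb{E}_{q_n}\!\left[ \frac{\partial E_{\bm{\theta}}}{\partial \bm{x}}\big(\bm{X}, \bm{y}_n\big) \right],
\end{aligned}
\]
which must vanish at the optimal mean.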
The second line can be derived by exploiting the symmetry of the latent variables and the mean in a Gaussian function, and then integrating by parts.††margin: Exercise LABEL:ex:derivativesOfGaussianExpectations The energy consists of the latent-variable-dependent terms in the negative log of the model joint, hence the third line. Similarly, the gradient with respect to a recognition covariance matrix is
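again schematically, and with the same caveats about notation,
\[
\begin{aligned}
\frac{\partial}{\partial \bm{\Sigma}_n}\Big( -\mathrm{H}[q_n] + \mathbb{E}_{q_n}\big[E_{\bm{\theta}}(\bm{X}, \bm{y}_n)\big] \Big)
&= \frac{\partial}{\partial \bm{\Sigma}_n}\Big( {-\tfrac{1}{2}\log\big|\bm{\Sigma}_n\big|} \Big)
+ \frac{\partial}{\partial \bm{\Sigma}_n}\,\mathbb{E}_{q_n}\big[E_{\bm{\theta}}(\bm{X}, \bm{y}_n)\big] \\
&= -\tfrac{1}{2}\bm{\Sigma}_n^{-1}
+ \frac{\partial}{\partial \bm{\Sigma}_n}\,\mathbb{E}_{q_n}\big[E_{\bm{\theta}}(\bm{X}, \bm{y}_n)\big] \\
&= -\tfrac{1}{2}\bm{\Sigma}_n^{-1}
+ \tfrac{1}{2}\,\mathbb{E}_{q_n}\!\left[ \frac{\partial^2 E_{\bm{\theta}}}{\partial \bm{x}\,\partial \bm{x}^{\mathrm{T}}}\big(\bm{X}, \bm{y}_n\big) \right],
\end{aligned}
\]
so that, at the optimum, the recognition precision equals the expected energy Hessian, $\bm{\Sigma}_n^{-1} = \mathbb{E}_{q_n}\big[\partial^2 E_{\bm{\theta}}/\partial\bm{x}\,\partial\bm{x}^{\mathrm{T}}\big]$.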
The third line can be derived along similar lines to the identity for gradients with respect to the mean (but the integration by parts must be done twice††margin: Exercise LABEL:ex:derivativesOfGaussianExpectations [36]).
We have written the optimizations for the mean and precision in the form of Eqs. 8.3 and 8.4 in order to highlight the resemblance to the Laplace approximation encountered in Chapter 2—see especially Eq. 2.31 [36]. Note that the joint (here) and posterior (there) energies are identical up to terms constant in the latent variables. Therefore, Eqs. 8.3 and 8.4 become the Laplace approximation whenever the expectations commute with the energy-gradient and energy-Hessian functions of the latent variables. This happens precisely when the energy is at most quadratic in the latent variables.
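For instance, if $E_{\bm{\theta}}$ is at most quadratic in $\bm{x}$, then its gradient is affine and its Hessian constant in $\bm{x}$, so (in the sketch notation)
\[
\mathbb{E}_{q_n}\!\left[ \frac{\partial E_{\bm{\theta}}}{\partial \bm{x}} \right]
= \frac{\partial E_{\bm{\theta}}}{\partial \bm{x}}\big(\bm{\mu}_n, \bm{y}_n\big),
\qquad
\mathbb{E}_{q_n}\!\left[ \frac{\partial^2 E_{\bm{\theta}}}{\partial \bm{x}\,\partial \bm{x}^{\mathrm{T}}} \right]
= \frac{\partial^2 E_{\bm{\theta}}}{\partial \bm{x}\,\partial \bm{x}^{\mathrm{T}}}\big(\bm{\mu}_n, \bm{y}_n\big),
\]
and the optimal mean sits at a stationary point of the energy while the optimal precision is the energy Hessian there, which is exactly Laplace's approximation.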
8.1.1 Independent sources and Gaussian emissions: sparse coding and other models
To make the procedure more concrete, let us make a few further assumptions. Although this is by no means obligatory, we let the emissions be Gaussians, as they have been in all preceding examples. For simplicity, we even let the noise be isotropic. The “source” random variables, on the other hand, we shall assume to be independent, but otherwise distributed generically:
\[
p(\bm{x}, \bm{y}; \bm{\theta})
\;=\; \left[ \prod_{k} p_k(x_k) \right]
\mathcal{N}\!\big( \bm{y};\; \mathbf{C}\bm{x},\; \sigma^2\mathbf{I} \big),
\qquad
p_k(x_k) \;\propto\; \exp\{ -e_k(x_k) \}.
\tag{8.5}
\]
Among other models, Eq. 8.5 includes the sparse-coding models encountered in Section 2.1.3, in which each emission is assembled from a modest number of basis vectors drawn from a large (overcomplete) dictionary—i.e., an emission matrix that is “fat.” To enforce sparsity, the energy functions for the source variables can be chosen to correspond to leptokurtotic (fat-tailed) distributions. For example, we considered energy functions for the Laplace distribution and a smooth approximation thereof:
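A sketch of such a pair of energies (the scale $b$ and the smoothing constant $\epsilon$ are illustrative parameters, and the particular smoothing shown is one common choice rather than necessarily that of Eq. 8.6):
\[
e_k(x_k) \;=\; \frac{|x_k|}{b},
\qquad\text{or, smoothly,}\qquad
e_k(x_k) \;=\; \frac{\sqrt{x_k^2 + \epsilon}}{b}.
\]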
This ensures that the source variables take on values close to zero more frequently than normal random variables with similar variance. Unfortunately, as we saw, such fat-tailed distributions are not conjugate to the likelihood of the source variables specified by Eq. 8.5, i.e. the Gaussian emission density, so they require approximate inference.
Here we are considering Gaussian recognition distributions. For the sake of generality, however, we shall refrain at first from assuming any form for the energy functions, deriving our results for the more general class described by Eq. 8.5. Now, the cross entropy of this joint distribution is:
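In sketch form, grouping the terms as this sketch does (normalizers into a constant, then the source energies, then the reconstruction error):
\[
\mathbb{E}_{\bm{Y}}\mathbb{E}_{q}\big[{-\log p(\bm{X}, \bm{Y}; \bm{\theta})}\big]
\;=\; \text{const}
\;+\; \mathbb{E}_{\bm{Y}}\!\left[ \sum_k \mathbb{E}_{q}\big[e_k(X_k)\big] \right]
\;+\; \mathbb{E}_{\bm{Y}}\,\mathbb{E}_{q}\!\left[ \frac{\big\|\bm{Y} - \mathbf{C}\bm{X}\big\|^2}{2\sigma^2} \right].
\]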
So far this follows directly from Eq. 8.5. The third term can be simplified further, however, by applying our assumption that the recognition distribution is normal, and using Eq. B.13 from the appendix:
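Presumably this is the identity for the Gaussian expectation of a quadratic form,
\[
\mathbb{E}_{q}\!\left[ \big\|\bm{y} - \mathbf{C}\bm{X}\big\|^2 \right]
\;=\; \big\|\bm{y} - \mathbf{C}\bm{\mu}(\bm{y})\big\|^2
\;+\; \operatorname{tr}\!\big( \mathbf{C}\,\bm{\Sigma}(\bm{y})\,\mathbf{C}^{\mathrm{T}} \big),
\]
so that the average reconstruction error decomposes into the error at the recognition mean plus a term measuring the recognition uncertainty.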
The M step.
We can now write a more specific expression for the joint relative entropy that will allow us to see how to carry out the M step, as well. Let us simplify even further and treat the emission noise variance and the parameters of the prior as fixed. Then substituting the cross entropy of the joint, Eq. 8.7, into the joint relative entropy for Gaussian recognition distributions, Eq. 8.1, yields
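a loss of roughly this form (a sketch, with all constants lumped together):
\[
\mathcal{L}(\bm{\phi}, \mathbf{C})
\;=\; \mathbb{E}_{\bm{Y}}\!\left[
-\tfrac{1}{2}\log\big|\bm{\Sigma}(\bm{Y})\big|
\;+\; \sum_k \mathbb{E}_{q}\big[e_k(X_k)\big]
\;+\; \frac{\big\|\bm{Y} - \mathbf{C}\bm{\mu}(\bm{Y})\big\|^2
+ \operatorname{tr}\!\big(\mathbf{C}\bm{\Sigma}(\bm{Y})\mathbf{C}^{\mathrm{T}}\big)}{2\sigma^2}
\right] \;+\; \text{const}.
\]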
(All “constant” terms have been lumped into a single scalar.) The only remaining model parameter to be optimized is the emission matrix. We can compute the gradient of the joint relative entropy, Eq. 8.8, with respect to it using some results from the appendix (Section B.1):
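In the sketch notation, the derivatives of the reconstruction and trace terms give
\[
\frac{\partial \mathcal{L}}{\partial \mathbf{C}}
\;=\; \frac{1}{\sigma^2}\,
\mathbb{E}_{\bm{Y}}\!\Big[
\big(\mathbf{C}\bm{\mu}(\bm{Y}) - \bm{Y}\big)\,\bm{\mu}(\bm{Y})^{\mathrm{T}}
\;+\; \mathbf{C}\,\bm{\Sigma}(\bm{Y})
\Big],
\]
and setting this to zero yields
\[
\mathbf{C} \;=\;
\mathbb{E}_{\bm{Y}}\!\big[ \bm{Y}\bm{\mu}(\bm{Y})^{\mathrm{T}} \big]\,
\mathbb{E}_{\bm{Y}}\!\big[ \bm{\mu}(\bm{Y})\bm{\mu}(\bm{Y})^{\mathrm{T}} + \bm{\Sigma}(\bm{Y}) \big]^{-1}.
\]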
Thus, the emission matrix is once again given by some version of the normal equations. In practice, however, if each observation is (e.g.) an image, and the model is overcomplete, then the matrix inversions will be computationally challenging. Instead, the gradient in Eq. 8.9 (or the natural gradient [29, 31]) can be descended.
The E step.
To carry out the E step, we apply Eqs. 8.3 and 8.4 to the energy of the generative model’s joint or posterior distribution (they are equivalent up to constant terms), to wit, the bracketed terms in Eq. 8.7. Let us also commit to the source energies specified by Eq. 8.6, i.e., the energy of the Laplace distribution. The resulting joint energy is quadratic in the latent variables everywhere except at points where any source variable is zero. Therefore the energy-gradient function (of the latent variables) and the expectation in Eq. 8.3 “almost” commute with each other, and the gradient of the joint relative entropy becomes
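(in the sketch notation, with Laplace scale $b$, the joint energy being, up to constants, $\|\bm{y}_n - \mathbf{C}\bm{x}\|^2/(2\sigma^2) + \|\bm{x}\|_1/b$)
\[
\begin{aligned}
\frac{\partial \mathcal{L}_n}{\partial \bm{\mu}_n}
&\approx \frac{1}{\sigma^2}\,\mathbf{C}^{\mathrm{T}}\big(\mathbf{C}\bm{\mu}_n - \bm{y}_n\big)
+ \frac{1}{b}\,\operatorname{sign}(\bm{\mu}_n)\\
&= \frac{\partial}{\partial \bm{\mu}_n}\!\left[
\frac{\big\|\bm{y}_n - \mathbf{C}\bm{\mu}_n\big\|^2}{2\sigma^2}
+ \frac{\|\bm{\mu}_n\|_1}{b}
\right],
\end{aligned}
\]
which vanishes at
\[
\bm{\mu}_n \;=\; \operatorname*{argmin}_{\bm{\mu}}
\left\{ \frac{\big\|\bm{y}_n - \mathbf{C}\bm{\mu}\big\|^2}{2\sigma^2} + \frac{\|\bm{\mu}\|_1}{b} \right\}.
\]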
Thus the optimal choice for the mean is the solution to the $\ell_1$-penalized least-squares problem. (Technically the final line is only a sufficient, not a necessary, condition for the gradient to be zero, but we are assuming the Hessian is locally positive definite.) For “fat” emission matrices, this is known as basis-pursuit denoising (also sometimes referred to as the LASSO problem, although typically that term is reserved for “tall” output matrices), a convex optimization for which efficient solvers exist.
That leaves the covariance matrix, given by Eq. 8.4. Again we exploit the fact that the Hessian of the energy is a sub-cubic—indeed, constant—function of the latent variables, except at points where any source variable is zero. Therefore,
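sketching this with the smoothed Laplace energy from above (again, the particular smoothing and the constant $\epsilon$ are assumptions of the sketch),
\[
\begin{aligned}
\bm{\Sigma}_n^{-1}
&= \mathbb{E}_{q_n}\!\left[ \frac{\partial^2 E_{\bm{\theta}}}{\partial\bm{x}\,\partial\bm{x}^{\mathrm{T}}}\big(\bm{X}, \bm{y}_n\big) \right]\\
&\approx \frac{\mathbf{C}^{\mathrm{T}}\mathbf{C}}{\sigma^2}
+ \operatorname{diag}_k\!\big( e_k''(\mu_{n,k}) \big)\\
&\approx \frac{\mathbf{C}^{\mathrm{T}}\mathbf{C}}{\sigma^2}
+ \operatorname{diag}_k\!\left( \frac{\epsilon}{b\,\big(\mu_{n,k}^2 + \epsilon\big)^{3/2}} \right).
\end{aligned}
\]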
The approximation in the final line follows from the approximation to the Laplace energy in Eq. 8.6.
Let us collect up our optimizations for the complete sparse-coding algorithm:
EM Algorithm for Laplace-Sparse Coding
E step: for each observation $\bm{y}_n$, set the recognition mean $\bm{\mu}_n$ to the solution of the $\ell_1$-penalized least-squares (basis-pursuit denoising) problem, and set the recognition precision $\bm{\Sigma}_n^{-1}$ to the (approximate) energy Hessian of Eq. 8.11.
M step: update the emission matrix by descending the gradient of the joint relative entropy, Eq. 8.9.
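As a concrete illustration only, here is a minimal NumPy sketch of the loop above. Everything in it is an assumption of the sketch rather than a prescription of the text: ISTA stands in as the basis-pursuit-denoising solver, the recognition precision uses the smoothed-Laplace curvature from the earlier sketch (with constant $\epsilon$), and the M step is a plain (not natural) gradient step.

```python
# Minimal NumPy sketch of EM for Laplace-sparse coding (illustrative assumptions
# throughout: ISTA as the l1 solver, a smoothed-Laplace curvature with constant
# eps for the recognition precision, and a plain gradient step for the M step).
import numpy as np


def ista(y, C, sigma2, b, n_iters=200):
    """Minimize ||y - C x||^2 / (2 sigma2) + ||x||_1 / b by proximal gradient."""
    x = np.zeros(C.shape[1])
    step = sigma2 / np.linalg.norm(C, 2) ** 2       # 1 / Lipschitz constant
    thresh = step / b                               # soft-threshold level
    for _ in range(n_iters):
        grad = C.T @ (C @ x - y) / sigma2           # gradient of the smooth part
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)   # soft threshold
    return x


def em_sparse_coding(Y, K, sigma2=0.01, b=1.0, eps=1e-2, lr=0.05, n_epochs=20, seed=0):
    """Y: (N, M) data; K: number of sources (K > M gives an overcomplete dictionary)."""
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    C = 0.1 * rng.standard_normal((M, K))           # emission (dictionary) matrix
    for _ in range(n_epochs):
        # E step: one recognition mean and covariance per datum.
        Mu = np.stack([ista(y, C, sigma2, b) for y in Y])               # (N, K)
        # Curvature of the smoothed Laplace energy sqrt(x^2 + eps)/b:
        D = eps / (b * (Mu ** 2 + eps) ** 1.5)                          # (N, K)
        Prec = C.T @ C / sigma2 + D[:, :, None] * np.eye(K)             # (N, K, K)
        Sigma = np.linalg.inv(Prec)
        # M step: gradient step on C, using E[(C mu - y) mu^T + C Sigma] / sigma2.
        resid = Mu @ C.T - Y                                            # (N, M)
        grad = (resid.T @ Mu + C @ Sigma.sum(axis=0)) / (sigma2 * N)
        C -= lr * grad
    return C


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    C_true = rng.standard_normal((8, 16))           # "fat": 8 outputs, 16 sources
    X = rng.laplace(scale=1.0, size=(500, 16)) * (rng.random((500, 16)) < 0.2)
    Y = X @ C_true.T + 0.1 * rng.standard_normal((500, 8))
    C_hat = em_sparse_coding(Y, K=16)
    print("learned dictionary shape:", C_hat.shape)
```

In practice one would substitute a dedicated basis-pursuit or LASSO solver for the E step and, as noted above, a natural-gradient update for the dictionary.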
Relationship to classical sparse coding.
We have arrived at a sparse-coding algorithm very similar to that of Lewicki and colleagues [29, 31], but by a somewhat different route. In those papers, the matrix of basis functions is fit by directly descending along the gradient of the marginal relative entropy (MRE), rather than descending via the gradient of its upper bound, the joint relative entropy (JRE). The price, of course, is the intractability of the MRE. In the classical papers, this price is paid with Laplace’s method, approximating the model marginal with the integral of the model joint in the vicinity of one of its modes (Eq. 2.34). This in some sense removes the latent variables from the problem, although the modes still need to be estimated. Consequently, there is no E step, and no fixed recognition mean or precision for any datum.
Nevertheless, the Hessian of the joint (or posterior) energy still shows up in the classical derivation’s loss, this time as a byproduct of Laplace’s method, as a measure of the volume of the posterior. This is no coincidence: we have allowed the E step to collapse into a Laplace approximation [36] by using an energy function that is sub-cubic almost everywhere. However, in the classical method, this precision matrix is not fixed during optimization of the emission matrix, but rather is interpreted as a function of the emission matrix and the mode. On the other hand, the trace term in Eq. 8.8 never materializes, because the squared reconstruction error is simply evaluated at the mode, rather than being averaged under a recognition distribution.
Conceptually, these two terms encode the contribution of the recognition precision to the two losses, ours and the classical one. Evidently, increasing the precision has opposite effects on the two losses, lowering the JRE (via the trace term) but raising the approximate MRE (via the log-determinant of the Hessian). But increasing the magnitude of the emission matrix has the same effect on these two terms: it increases the contribution of recognition uncertainty to the average reconstruction error (the trace term), and it increases the recognition precision itself (and therefore the log-determinant) via Eq. 8.11. Indeed, the partial derivative of the log-determinant term with respect to the emission matrix is precisely the same as the derivative of the trace term.
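To see the match in the sketch notation, write $\mathbf{H}_n = \mathbf{C}^{\mathrm{T}}\mathbf{C}/\sigma^2 + \mathbf{D}_n$ for the posterior Hessian at the mode, with $\mathbf{D}_n$ the prior-curvature part (treated as constant in $\mathbf{C}$), and identify $\bm{\Sigma}_n = \mathbf{H}_n^{-1}$. Then
\[
\frac{\partial}{\partial \mathbf{C}}\!\left[ \tfrac{1}{2}\log\big|\mathbf{H}_n\big| \right]
\;=\; \frac{\mathbf{C}\,\mathbf{H}_n^{-1}}{\sigma^2}
\;=\; \frac{\mathbf{C}\,\bm{\Sigma}_n}{\sigma^2}
\;=\; \frac{\partial}{\partial \mathbf{C}}\!\left[ \frac{\operatorname{tr}\!\big(\mathbf{C}\bm{\Sigma}_n\mathbf{C}^{\mathrm{T}}\big)}{2\sigma^2} \right],
\]
the last derivative being taken with $\bm{\Sigma}_n$ held fixed.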
There is one last discrepancy between the loss gradients. In the classical derivation, the mode is not fixed, either, and in turn depends on the emission matrix. So the total derivative of the approximate MRE generates extra terms. On the other hand, in the original papers, these are derived and then promptly dropped, under the conjecture that curvature unrelated to volume is irrelevant to the optimization.
8.1.2 Independent-components analysis
Just as EM becomes $K$-means for the Gaussian mixture model and (iterative) principal-components analysis for the factor analyzer, as the variance of the (Gaussian) emission is shrunk to zero, EM for the sparse-coding model becomes independent-components analysis (ICA)…