8.1 Gaussian recognition distributions and sparse coding
For continuous random latent variables, the simplest choice is perhaps the (multivariate) normal distribution:
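As a minimal sketch, assuming the symbols $q$ for the recognition density, $\boldsymbol{z}$ for the latent vector, $\boldsymbol{x}$ for the observation, and $\boldsymbol{\mu}(\boldsymbol{x})$, $\boldsymbol{\Lambda}(\boldsymbol{x})$ for a data-dependent mean and precision:
\[
q(\boldsymbol{z}\mid\boldsymbol{x}) \;=\; \mathcal{N}\!\big(\boldsymbol{z};\ \boldsymbol{\mu}(\boldsymbol{x}),\ \boldsymbol{\Lambda}(\boldsymbol{x})^{-1}\big).
\]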
Note that we are allowing the mean and precision to depend on the observations, although we have not as yet specified how.
To derive an EM algorithm under this recognition model, we start as usual with the joint relative entropy. This consists of (among other things) the entropy of the recognition model, which under our current assumptions is just the entropy of a multivariate normal distribution, averaged under the data:
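In the same assumed notation, with $K$ the dimension of the latent vector, this average entropy is
\[
\mathbb{E}_{\boldsymbol{x}}\Big[\mathrm{H}\big[q(\cdot\mid\boldsymbol{x})\big]\Big]
\;=\; \mathbb{E}_{\boldsymbol{x}}\!\left[\tfrac{K}{2}\log(2\pi e)\;-\;\tfrac{1}{2}\log\big\lvert\boldsymbol{\Lambda}(\boldsymbol{x})\big\rvert\right].
\]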
So the joint relative entropy is
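As a sketch, with $\boldsymbol{\theta}$ standing in for the generative parameters, it has the schematic form
\[
\mathcal{L}
\;=\; \mathbb{E}_{\boldsymbol{x}}\Big[\mathbb{E}_{q(\boldsymbol{z}\mid\boldsymbol{x})}\big[-\log p(\boldsymbol{z},\boldsymbol{x};\boldsymbol{\theta})\big]
\;-\;\mathrm{H}\big[q(\cdot\mid\boldsymbol{x})\big]\Big] \;+\; \text{const},
\]
with the constant independent of both the recognition and the generative parameters.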
Local parameters
Now let us assume that we will learn (in the E step) one pair of parameters, a mean vector and a precision matrix, for each observation.
The advantage of this approach is flexibility: the mean vector and precision matrix for each datum can take on arbitrary values.
The (potential) disadvantage is inefficiency:
Unless the parameters can be derived in closed form, this approach will not scale well to a large number of observations.
Indeed, the recognition distribution for a new (or held-out) observation can be obtained only by carrying out a fresh optimization.
The E step.
It is instructive to begin the E-step optimization, even without having specified a generative model.
Replacing the expectation over the data in Eq. 8.1 with a sample average and differentiating with respect to the recognition parameters (the mean and precision assigned to each observation), we obtain optimality conditions for both, written out as Eqs. 8.3 and 8.4.
The second line can be derived by exploiting the symmetry of the precision (equivalently, the covariance) matrix.
The third line can be derived along similar lines to the identity for gradients with respect to the mean, although the integration by parts must be done twice (see the margin exercise on derivatives of Gaussian expectations) [36].
We have written the optimizations for the mean and precision in the form of Eqs. 8.3 and 8.4 in order to highlight the resemblance to the Laplace approximation encountered in Chapter 2—see especially Eq. 2.31 [36].
Note that the joint (here) and posterior (there) energies are identical up to terms constant in the latent variables.
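A sketch of the resulting conditions, with $E(\boldsymbol{z},\boldsymbol{x})$ an assumed symbol for the joint energy, is
\[
\boldsymbol{0} \;=\; \mathbb{E}_{q(\boldsymbol{z}\mid\boldsymbol{x})}\!\left[\frac{\partial E}{\partial\boldsymbol{z}}\right],
\qquad
\boldsymbol{\Lambda}(\boldsymbol{x}) \;=\; \mathbb{E}_{q(\boldsymbol{z}\mid\boldsymbol{x})}\!\left[\frac{\partial^{2} E}{\partial\boldsymbol{z}\,\partial\boldsymbol{z}^{\mathrm{T}}}\right];
\]
that is, set the averaged gradient of the energy to zero, and equate the precision to the averaged Hessian, just as the Laplace approximation does at the posterior mode.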
8.1.1 Independent sources and Gaussian emissions: sparse coding and other models
To make the procedure more concrete, let us make a few further assumptions. Although this is by no means obligatory, we let the emissions be Gaussians, as they have been in all preceding examples. For simplicity, we even let the noise be isotropic. The “source” random variables, on the other hand, we shall assume to be independent, but otherwise distributed generically:
\[
p(\boldsymbol{z},\boldsymbol{x};\,\mathbf{C},\sigma^{2})
\;=\;
\Bigg[\prod_k \frac{\exp\{-E_k(z_k)\}}{\zeta_k}\Bigg]\;
\mathcal{N}\!\big(\boldsymbol{x};\ \mathbf{C}\boldsymbol{z},\ \sigma^{2}\mathbf{I}\big).
\tag{8.5}
\]
Among other models, Eq. 8.5 includes the sparse-coding models encountered in Section 2.1.3, in which each emission is an image (or image patch) and the source energies are chosen to be sharply peaked at zero and heavy-tailed.
This ensures that the source variables are sparse: each is usually at or near zero, only occasionally taking on values of large magnitude.
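For concreteness, the energy of the Laplace distribution, to which Eq. 8.6 evidently refers, is (in an assumed parameterization, with steepness $\alpha$)
\[
E_k(z_k) \;=\; \alpha\,\lvert z_k\rvert,
\qquad
p(z_k) \;=\; \frac{\alpha}{2}\,\exp\{-\alpha\lvert z_k\rvert\}.
\]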
Here we are considering Gaussian recognition distributions. For the sake of generality, however, we shall refrain at first from assuming any form for the energy functions, deriving our results for the more general class described by Eq. 8.5. Now, the cross entropy of this joint distribution is:
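As a sketch in the assumed notation of Eq. 8.5, with $D$ the dimension of the observation, the cross entropy reads
\[
\mathbb{E}_{q(\boldsymbol{z}\mid\boldsymbol{x})}\big[-\log p(\boldsymbol{z},\boldsymbol{x})\big]
\;=\; \sum_k \mathbb{E}_{q}\big[E_k(z_k)\big]
\;+\; \sum_k \log \zeta_k
\;+\; \frac{1}{2\sigma^{2}}\,\mathbb{E}_{q}\big[\lVert\boldsymbol{x}-\mathbf{C}\boldsymbol{z}\rVert^{2}\big]
\;+\; \frac{D}{2}\log\big(2\pi\sigma^{2}\big).
\]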
So far this follows directly from Eq. 8.5. The third term can be simplified further, however, by applying our assumption that the recognition distribution is normal, and using Eq. B.13 from the appendix:
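The simplification rests on the Gaussian expectation of a quadratic form; in the assumed notation,
\[
\mathbb{E}_{q(\boldsymbol{z}\mid\boldsymbol{x})}\big[\lVert\boldsymbol{x}-\mathbf{C}\boldsymbol{z}\rVert^{2}\big]
\;=\; \lVert\boldsymbol{x}-\mathbf{C}\boldsymbol{\mu}(\boldsymbol{x})\rVert^{2}
\;+\; \operatorname{tr}\!\big[\mathbf{C}^{\mathrm{T}}\mathbf{C}\,\boldsymbol{\Sigma}(\boldsymbol{x})\big],
\]
so the expected reconstruction error decomposes into the error at the recognition mean plus a penalty involving the recognition covariance.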
The M step.
We can now write a more specific expression for the joint relative entropy that will allow us to see how to carry out the M step, as well.
Let us simplify even further and treat everything other than the emission matrix as fixed rather than learned.
(All “constant” terms have been lumped into a single scalar.)
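A sketch of the sort of expression that results, in the assumed notation, with $c$ absorbing the constants and with the joint energy bracketed, is
\[
\mathcal{L} \;=\;
\mathbb{E}_{\boldsymbol{x}}\!\left[
\mathbb{E}_{q(\boldsymbol{z}\mid\boldsymbol{x})}\!\left[
\;\sum_k E_k(z_k)
\;+\;\frac{1}{2\sigma^{2}}\,\lVert\boldsymbol{x}-\mathbf{C}\boldsymbol{z}\rVert^{2}
\;\right]
\;-\;\frac{1}{2}\log\big\lvert\boldsymbol{\Sigma}(\boldsymbol{x})\big\rvert
\right] \;+\; c.
\]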
Thus, the emission matrix is once again given by some version of the normal equations.
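As a sketch of such an update in the assumed notation, with observations indexed by $n$ and each assigned a recognition mean $\boldsymbol{\mu}_n$ and covariance $\boldsymbol{\Sigma}_n$,
\[
\mathbf{C} \;\leftarrow\;
\Bigg[\sum_{n}\boldsymbol{x}_n\boldsymbol{\mu}_n^{\mathrm{T}}\Bigg]
\Bigg[\sum_{n}\Big(\boldsymbol{\mu}_n\boldsymbol{\mu}_n^{\mathrm{T}}+\boldsymbol{\Sigma}_n\Big)\Bigg]^{-1},
\]
which minimizes the averaged expected reconstruction error.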
In practice, however, if each datum is processed one at a time, a (stochastic) gradient step may be preferred to solving the normal equations outright.
The E step.
To carry out the E step, we apply Eqs. 8.3 and 8.4 to the energy of the generative model’s joint or posterior distribution (they are equivalent up to constant terms), to wit, the bracketed terms in Eq. 8.7. Let us also commit to the source energies specified by Eq. 8.6, i.e., the energy of the Laplace distribution. Then the resulting joint energy is quadratic in the latent variables, apart from the Laplace terms, which are piecewise linear.
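Written out for a single observation in the assumed notation, and dropping additive constants, this energy is (as a sketch)
\[
E(\boldsymbol{z};\boldsymbol{x}) \;=\;
\frac{1}{2\sigma^{2}}\,\lVert\boldsymbol{x}-\mathbf{C}\boldsymbol{z}\rVert^{2}
\;+\;\alpha\sum_k\lvert z_k\rvert.
\]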
Thus,¹ the optimal choice for the mean is the solution to an $\ell_1$-penalized least-squares problem.

¹Technically, this is only a sufficient, not a necessary, condition for the gradient to be zero, but we are assuming the Hessian is locally positive definite.
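Setting the gradient of this energy to zero where it exists gives, as a sketch, the first-order condition
\[
\frac{1}{\sigma^{2}}\,\mathbf{C}^{\mathrm{T}}\big(\mathbf{C}\boldsymbol{\mu}-\boldsymbol{x}\big)
\;+\;\alpha\,\operatorname{sign}(\boldsymbol{\mu}) \;=\; \boldsymbol{0}
\qquad (\text{componentwise, wherever } \mu_k \neq 0),
\]
which is the stationarity condition of an $\ell_1$-penalized least-squares problem and is typically solved iteratively.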
That leaves the covariance matrix, given by Eq. 8.4.
Again we exploit the fact that the Hessian of the energy is a sub-cubic (indeed, constant) function of the latent variables.
The approximation in the final line follows from the approximation to the Laplace energy in Eq. 8.6.
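Concretely, under the same assumptions,
\[
\frac{\partial^{2} E}{\partial\boldsymbol{z}\,\partial\boldsymbol{z}^{\mathrm{T}}}
\;=\; \frac{1}{\sigma^{2}}\,\mathbf{C}^{\mathrm{T}}\mathbf{C}
\;+\; \operatorname{diag}\!\big(E_k''(z_k)\big)
\;\approx\; \frac{1}{\sigma^{2}}\,\mathbf{C}^{\mathrm{T}}\mathbf{C},
\qquad
\boldsymbol{\Sigma} \;\approx\; \sigma^{2}\big(\mathbf{C}^{\mathrm{T}}\mathbf{C}\big)^{-1},
\]
the approximation reflecting the negligible curvature of the (approximated) Laplace energy.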
Let us collect up our optimizations for the complete sparse-coding algorithm:
EM Algorithm for Laplace-Sparse Coding
\[
\begin{aligned}
&\text{E step (for each observation $\boldsymbol{x}_n$):}\\
&\qquad \boldsymbol{\mu}_n \;\leftarrow\; \operatorname*{argmin}_{\boldsymbol{z}}\;
\frac{1}{2\sigma^{2}}\,\lVert\boldsymbol{x}_n-\mathbf{C}\boldsymbol{z}\rVert^{2}
\;+\;\alpha\sum_k\lvert z_k\rvert,\\
&\qquad \boldsymbol{\Sigma} \;\leftarrow\; \sigma^{2}\big(\mathbf{C}^{\mathrm{T}}\mathbf{C}\big)^{-1};\\
&\text{M step:}\\
&\qquad \mathbf{C} \;\leftarrow\;
\Bigg[\sum_{n}\boldsymbol{x}_n\boldsymbol{\mu}_n^{\mathrm{T}}\Bigg]
\Bigg[\sum_{n}\Big(\boldsymbol{\mu}_n\boldsymbol{\mu}_n^{\mathrm{T}}+\boldsymbol{\Sigma}\Big)\Bigg]^{-1}.
\end{aligned}
\]
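Because the whole procedure reduces to a handful of linear-algebraic updates, a small numerical sketch may be useful. The code below is illustrative only: the names (X for the data matrix, C for the emission matrix, sigma2 for the noise variance, alpha for the Laplace steepness) are assumptions, and each E-step mean is computed with a generic proximal-gradient (ISTA) loop rather than any particular solver from the text.

import numpy as np

def e_step_mean(C, x, sigma2, alpha, n_iters=200):
    """Minimize ||x - C z||^2 / (2*sigma2) + alpha*||z||_1 by ISTA (proximal gradient)."""
    K = C.shape[1]
    z = np.zeros(K)
    # Step size from the Lipschitz constant of the smooth (quadratic) part.
    L = np.linalg.norm(C, 2) ** 2 / sigma2
    for _ in range(n_iters):
        grad = C.T @ (C @ z - x) / sigma2                         # gradient of the quadratic part
        z = z - grad / L                                          # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - alpha / L, 0.0)   # soft threshold: prox of the l1 term
    return z

def em_laplace_sparse_coding(X, K, sigma2=0.01, alpha=1.0, n_epochs=20, seed=0):
    """X: (D, N) data matrix; returns a learned (D, K) emission matrix."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    C = 0.1 * rng.standard_normal((D, K))
    for _ in range(n_epochs):
        # E step: one mean per observation, but a single covariance, since the
        # Hessian of the joint energy, C^T C / sigma2, does not depend on the data.
        M = np.column_stack([e_step_mean(C, X[:, n], sigma2, alpha) for n in range(N)])
        Sigma = sigma2 * np.linalg.inv(C.T @ C + 1e-8 * np.eye(K))
        # M step: normal equations built from the expected sufficient statistics.
        S_xz = X @ M.T                   # sum_n x_n mu_n^T
        S_zz = M @ M.T + N * Sigma       # sum_n (mu_n mu_n^T + Sigma)
        C = S_xz @ np.linalg.inv(S_zz)
    return C

For example, C = em_laplace_sparse_coding(X, K=64) would fit 64 basis functions to (say, whitened) image patches stacked as the columns of X.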
Relationship to classical sparse coding.
We have arrived at a sparse-coding algorithm very similar to that of Lewicki and colleagues [29, 31], but by a somewhat different route.
In those papers, the matrix of basis functions, which plays the role of our emission matrix, is fit by ascending the gradient of an approximation to the log marginal likelihood directly, without any recognition distribution being introduced.
Nevertheless, the Hessian of the joint (or posterior) energy still shows up in the classical derivation’s loss, this time as a byproduct of Laplace’s method, as a measure of the volume of the posterior.
This is no coincidence: we have allowed the E step to collapse into a Laplace approximation [36] by using an energy function that is sub-cubic almost everywhere.
However, in the classical method, this precision matrix is not fixed during optimization of the basis functions.
Conceptually, these two terms encode the contribution of the recognition precision (equivalently, the local curvature of the posterior) to their respective losses.
There is one last discrepancy between the loss gradients.
In the classical derivation, the mode itself depends on the basis functions, and strictly speaking this dependence contributes additional terms to the gradient, which are simply dropped; in our EM algorithm, by contrast, the recognition parameters are held fixed during the M step by construction.
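For comparison, a sketch of the classical objective obtained by Laplace's method about the posterior mode $\hat{\boldsymbol{z}}$ (symbols assumed rather than taken from those papers) is
\[
\log p(\boldsymbol{x};\mathbf{C}) \;\approx\;
-\frac{1}{2\sigma^{2}}\,\lVert\boldsymbol{x}-\mathbf{C}\hat{\boldsymbol{z}}\rVert^{2}
\;-\;\alpha\sum_k\lvert\hat z_k\rvert
\;-\;\frac{1}{2}\log\big\lvert\mathbf{H}(\hat{\boldsymbol{z}})\big\rvert
\;+\;\text{const},
\]
with $\mathbf{H}$ the Hessian of the joint (or posterior) energy at the mode. Differentiating this with respect to the basis functions therefore picks up terms, from the log-determinant and from the mode itself, that have no counterpart in the M step above.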
8.1.2 Independent-components analysis
Just as EM becomes