2.1 Static models
As promised, we consider models with independent and identically distributed (i.i.d.) Gaussian emissions, but with different prior distributions of the “source” variable. We start with the “sparsest,” a mixture model.
2.1.1 Mixture models and the GMM
Consider the graphical model in Fig. 2.1.
The generative process could be described as first rolling a (possibly loaded) die to determine the class, and then drawing a sample from the Gaussian emission distribution associated with that class.
Here,
In Eq. 2.2, the support of the source variable is the set of one-hot (standard basis) vectors. This allows the categorical distribution to be written as a product of the class probabilities, each raised to the power of the corresponding element of the source vector,
which is particularly useful when working with log probabilities. The emission can be expressed similarly,
or, alternatively, its mean can be expressed as a linear function of the die roll,
where the columns of the matrix of this linear map are the class means, so that multiplication by a one-hot source vector simply selects the mean of the active class.
Inference in the GMM.
Inference in this model is a simple application of Bayes’ rule. We start by noting that the posterior is necessarily another categorical distribution (i.e., a toss of a die); the only question is what the probabilities of each category (each side of the die) are.
We begin somewhat pedantically with Bayes’s theorem as it is expressed in Eq. 2.1, but for a categorical random variable represented one-hot.
To emphasize that we are making inferences from actual observations, we write this in terms of an observation from the data distribution,
i.e., the
The derivation so far is general to any mixture model.
For the GMM’s Gaussian emissions, Eq. 2.4,
where the constant is the same for every class and can therefore be ignored.
One way to interpret Eq. 2.8 is (mentally) to fix the value of the posterior and ask which observations are consistent with it: the resulting contours are, in general, quadratic in the observations.
Moving toward simpler, rather than more complex, contours, consider now the special case of constant covariance matrices across classes.
Then the terms quadratic in the observations are identical across classes and cancel in the posterior, leaving in each class's exponent only
a linear function of the observations.
Class boundaries, i.e. the sets of observations at which two classes are equally probable under the posterior, are correspondingly linear: they are hyperplanes.
We can even ignore the covariance matrix if we are willing to work in the whitened space (i.e., absorb a factor of the inverse matrix square root of the covariance, $\boldsymbol{\Sigma}^{-1/2}$, into the observations and the class means).
Acquiring the parameters of the Gaussian mixture model with identical covariances (Eq. 2.1.1), and then using it to classify points according to the posterior distribution (Eq. 2.7), is known as linear discriminant analysis (LDA), a nearly century-old method that goes back to Fisher. When the covariances differ, as in the more generic GMM of Eq. 2.8, the discriminating boundary is quadratic, and the method is known, appropriately, as QDA. If we generalize along a different avenue, however, allowing the emissions to be members of the same but otherwise arbitrary exponential family with identical dispersions, the boundary remains linear [21]. (We will discuss exponential families in detail in Chapter 5.) In all of these cases, estimating the parameters is conceptually straightforward: we separate the data into classes and compute the sample statistics—e.g., means and covariances for the GMM—within each. We discuss this parameter estimation more formally in Chapter 6.
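To make this concrete, here is a minimal numpy sketch of LDA as just described: estimate class priors, class means, and a single pooled covariance from labeled data, then classify new observations by Bayes' rule. The toy data, names, and values are illustrative, not the text's notation.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy labeled data: K classes, M-dimensional observations, shared emission noise.
K, M, N = 3, 2, 600
class_means = 3.0 * rng.normal(size=(K, M))
labels = rng.integers(K, size=N)
Y = class_means[labels] + rng.normal(size=(N, M))

# "Training": class priors, class means, and one pooled covariance.
pi = np.bincount(labels, minlength=K) / N
mus = np.stack([Y[labels == k].mean(axis=0) for k in range(K)])
Sigma = sum((labels == k).mean() * np.cov(Y[labels == k].T, bias=True) for k in range(K))

def class_posterior(y):
    """Posterior probabilities over the K classes for a single observation y."""
    log_joint = np.array([np.log(pi[k]) + multivariate_normal.logpdf(y, mus[k], Sigma)
                          for k in range(K)])
    log_joint -= log_joint.max()          # subtract the max for numerical stability
    post = np.exp(log_joint)
    return post / post.sum()

print(class_posterior(Y[0]).round(3), "true class:", labels[0])
```

With a single shared covariance, differences of log posteriors are linear in the observation, so the implied decision boundaries are hyperplanes; giving each class its own covariance (QDA) would make them quadratic.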
Marginalization in mixture models.
As noted at the outset, our evaluation of the model often requires the marginal distribution of the observations, with the source variable summed out.
For any easily evaluable emission distributions, this is computed painlessly: the probability of the datum is evaluated under each class's emission distribution, weighted by the prior probability of that class, and the results are summed.
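In code, this sum is most safely carried out in log space with a log-sum-exp; a short sketch, assuming Gaussian emissions and with my own argument names:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_marginal(y, pi, mus, Sigmas):
    """log p(y) = log sum_k pi_k * N(y; mu_k, Sigma_k) for a mixture of Gaussians."""
    log_terms = [np.log(pi_k) + multivariate_normal.logpdf(y, mu_k, Sigma_k)
                 for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)]
    return logsumexp(log_terms)
```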
Mixture models and conjugacy.
Comparing Eqs. 2.7 and 2.2, we see that the posterior distribution over the source variable is again categorical, i.e. of the same family as the prior. Indeed, we can organize the emission distribution into exponential-family form,
and the prior distribution into the standard form for conjugate priors:
The lack of a log-partition function should give us pause—it says that the partition function is unity for all values of the “parameters”—but we could nevertheless interpret the update from prior to posterior as the standard conjugate-prior update.
In fine, the categorical distribution is in some sense “conjugate” to any likelihood function whose log probability we can evaluate. This is one way of understanding why mixture models are amenable to Bayesian analyses….
2.1.2 Jointly Gaussian models and factor analyzers
We begin with a more generic model than the ones which will be useful to us later.
In particular, we begin simply by replacing the categorical source variable from the previous section with a (vector) standard normal variate, as in Fig. 2.2A.
We refine this slightly by assuming further that (1) the emission mean is an affine function of the source, and (2) the emission covariance does not depend on the source; and we allow the source, more generally, to have arbitrary mean and covariance:
$$\boldsymbol{X} \sim \mathcal{N}\!\left(\boldsymbol{\mu}_x,\; \boldsymbol{\Sigma}_x\right), \qquad\qquad \boldsymbol{Y} \mid \boldsymbol{x} \sim \mathcal{N}\!\left(\mathbf{C}\boldsymbol{x} + \boldsymbol{c},\; \boldsymbol{\Sigma}_{y|x}\right). \tag{2.11}$$
Inference in jointly Gaussian models.
To compute the posterior, we of course use Bayes’s theorem, Eq. 2.1, but this time we take advantage of a property of the prior and emission distributions alluded to at the start of the chapter.
As this example nicely illustrates, it will not be necessary to compute the normalizer explicitly.
That is because we will recognize the form of the distribution—in this case, another Gaussian—even without the normalizer.
This property follows from the fact that both the prior and the emission are exponentiated second-order polynomials in the source variable; their product is therefore also an exponentiated second-order polynomial in the source, i.e. an unnormalized Gaussian.
There are circumstances under which this is really all we need to know; but what if we want the normalizer explicitly?
We still don’t have to take the integral, because the normalizer for a Gaussian can be expressed simply as a function of its covariance matrix,
where we have defined
$$\boldsymbol{\mu}_{x|y} := \boldsymbol{\Sigma}_{x|y}\!\left(\boldsymbol{\Sigma}_x^{-1}\boldsymbol{\mu}_x + \mathbf{C}^{\mathrm{T}}\boldsymbol{\Sigma}_{y|x}^{-1}\left(\boldsymbol{y} - \boldsymbol{c}\right)\right) \tag{2.13}$$
and
$$\boldsymbol{\Sigma}_{x|y} := \left(\boldsymbol{\Sigma}_x^{-1} + \mathbf{C}^{\mathrm{T}}\boldsymbol{\Sigma}_{y|x}^{-1}\mathbf{C}\right)^{-1}. \tag{2.14}$$
These posterior cumulants should be somewhat intuitive.
The posterior mean is a convex combination of the information from the emission and the prior distributions.
More precisely, it is a convex combination of the prior mean and of the emission-derived estimate of the source, with matrix-valued weights determined by the relative precisions of prior and emission; the weights are “convex” in the sense that they are positive semi-definite and sum to the identity. The more precise the emission, the more the posterior mean follows the data; the more precise the prior, the closer it stays to the prior mean.
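The following numpy sketch computes these posterior cumulants via the standard Gaussian-conditioning formulas, in the notation used for Eq. 2.11 above (prior mean and covariance mu_x and Sigma_x; emission N(C x + c, Sigma_yx)); the function and variable names are mine.

```python
import numpy as np

def gaussian_posterior(y, mu_x, Sigma_x, C, c, Sigma_yx):
    """Posterior cumulants of p(x | y) for x ~ N(mu_x, Sigma_x), y | x ~ N(C x + c, Sigma_yx)."""
    P_x = np.linalg.inv(Sigma_x)       # prior precision
    P_yx = np.linalg.inv(Sigma_yx)     # emission precision
    Sigma_post = np.linalg.inv(P_x + C.T @ P_yx @ C)
    mu_post = Sigma_post @ (P_x @ mu_x + C.T @ P_yx @ (y - c))
    # Note: Sigma_post @ P_x and Sigma_post @ (C.T @ P_yx @ C) sum to the identity,
    # which is the sense in which mu_post is a convex combination of prior and data.
    return mu_post, Sigma_post
```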
Marginalization in jointly Gaussian models.
So much for the posterior.
What about the marginal distribution of the emissions? Since the source and emission are jointly Gaussian, the marginal is again normal, so it suffices to compute its first two cumulants. These follow from the law of total expectation,
$$\mathbb{E}\!\left[\boldsymbol{Y}\right] = \mathbb{E}\!\left[\mathbb{E}\!\left[\boldsymbol{Y} \mid \boldsymbol{X}\right]\right] = \mathbf{C}\boldsymbol{\mu}_x + \boldsymbol{c}, \tag{2.16}$$
and of total covariance,
$$\mathrm{Cov}\!\left[\boldsymbol{Y}\right] = \mathbb{E}\!\left[\mathrm{Cov}\!\left[\boldsymbol{Y} \mid \boldsymbol{X}\right]\right] + \mathrm{Cov}\!\left[\mathbb{E}\!\left[\boldsymbol{Y} \mid \boldsymbol{X}\right]\right] = \boldsymbol{\Sigma}_{y|x} + \mathbf{C}\boldsymbol{\Sigma}_x\mathbf{C}^{\mathrm{T}}. \tag{2.17}$$
The higher cumulants of the marginal all vanish, of course, since the marginal distribution is itself normal.
The source-emission covariance.
There is a third convenient way to characterize the joint distribution, and that is to give the distribution of the vector formed by concatenating the source and emission variables.
From Eq. 2.11, we can write the cross-covariance directly, $\mathrm{Cov}\!\left[\boldsymbol{X}, \boldsymbol{Y}\right] = \boldsymbol{\Sigma}_x\mathbf{C}^{\mathrm{T}}$, since the additive emission noise is independent of the source.
This covariance will turn out to be useful later when we take on the learning problem.
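Again in the notation of the sketch above, the marginal and cross cumulants are one-liners; a short Monte Carlo check follows (all numbers arbitrary).

```python
import numpy as np

def joint_cumulants(mu_x, Sigma_x, C, c, Sigma_yx):
    """Cumulants implied by x ~ N(mu_x, Sigma_x), y | x ~ N(C x + c, Sigma_yx)."""
    mu_y = C @ mu_x + c                       # law of total expectation
    Sigma_y = C @ Sigma_x @ C.T + Sigma_yx    # law of total covariance
    Sigma_xy = Sigma_x @ C.T                  # Cov[x, y]
    return mu_y, Sigma_y, Sigma_xy

rng = np.random.default_rng(1)
mu_x, Sigma_x = np.zeros(2), np.eye(2)
C, c, Sigma_yx = rng.normal(size=(3, 2)), np.ones(3), 0.1 * np.eye(3)
X = rng.multivariate_normal(mu_x, Sigma_x, size=100_000)
Y = X @ C.T + c + rng.multivariate_normal(np.zeros(3), Sigma_yx, size=100_000)
mu_y, Sigma_y, _ = joint_cumulants(mu_x, Sigma_x, C, c, Sigma_yx)
print(np.allclose(Y.mean(axis=0), mu_y, atol=0.05), np.allclose(np.cov(Y.T), Sigma_y, atol=0.1))
```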
Jointly Gaussian random variables and conjugacy.
For the models described by Eq. 2.11, it is perhaps unsurprising that the posterior is again normal (Eq. 2.12). After all, the normal distribution is well known to be the conjugate prior for the mean of another normal distribution (it induces a normal posterior distribution over that mean). However, in the family of jointly Gaussian models we are now considering (Eq. 2.11), the mean of the emission distribution is an affine function of, rather than identical with, the source random variable. Evidently, conjugacy holds for this more generic class as well; the classical result, for a normally distributed mean, can be construed as a special case.
Factor analysis.
We consider now the special case where the underlying normal source variable is unobserved and of lower dimension than the emission: a small number of latent “factors” is asked to explain the shared structure in the observations.
This sounds like the jointly Gaussian model just discussed, but the fact that we never observe the source makes some of its parameters redundant. In particular, from Eq. 2.16, a nonzero prior mean is indistinguishable from a shift in the emission offset, so we may as well fix the prior mean at zero.
A similar consideration applies to the prior covariance.
From Eq. 2.17, we see that any non-isotropic covariance could simply be absorbed into the emission matrix, so we may as well take the prior covariance to be the identity.
We remove one more degree of freedom.
In general, the emission covariance could be an arbitrary positive-definite matrix; but then it could by itself account for all of the covariance in the observations, leaving nothing for the source variables to explain. In a factor analyzer, the emission covariance is therefore required to be diagonal: all correlations among the observed variables must be attributed to the latent factors.
So our complete model is:
$$\boldsymbol{X} \sim \mathcal{N}\!\left(\mathbf{0},\; \mathbf{I}\right), \qquad\qquad \boldsymbol{Y} \mid \boldsymbol{x} \sim \mathcal{N}\!\left(\mathbf{C}\boldsymbol{x} + \boldsymbol{c},\; \mathbf{D}\right), \qquad \mathbf{D}\ \text{diagonal}. \tag{2.20}$$
In what follows, we shall assume that one has the correct model, and needs to solve only the inference problem. We shall return to learning in Section 7.3.
Inference and marginalization in a factor analyzer.
We take the marginal probability first this time. Applying Eqs. 2.16, 2.17, and 2.18 to the special case of Eq. 2.20, we have a marginal that is again normal, with mean $\boldsymbol{c}$ and covariance $\mathbf{C}\mathbf{C}^{\mathrm{T}} + \mathbf{D}$.
Inference in a factor analyzer is just a special case of inference in the jointly Gaussian model introduced at the beginning of this section. That is, it requires only the computation of the posterior cumulants, Eqs. 2.13 and 2.14, under the distributions that define the model, Eq. 2.20.
Anticipating the computational costs of this, and more complicated, inference problems with Gaussian random variables, we note with some dismay the matrix inversions, which cost at least quadratic, and in practice nearly cubic, time in the dimension of the matrix being inverted.
Alternative expressions for the posterior cumulants.
To change the space in which the posterior covariance is computed, we can apply the Woodbury matrix-inversion lemma, Eq. B.14, to Eq. 2.14:
$$\boldsymbol{\Sigma}_{x|y} = \left(\boldsymbol{\Sigma}_x^{-1} + \mathbf{C}^{\mathrm{T}}\boldsymbol{\Sigma}_{y|x}^{-1}\mathbf{C}\right)^{-1} = \boldsymbol{\Sigma}_x - \boldsymbol{\Sigma}_x\mathbf{C}^{\mathrm{T}}\!\left(\mathbf{C}\boldsymbol{\Sigma}_x\mathbf{C}^{\mathrm{T}} + \boldsymbol{\Sigma}_{y|x}\right)^{-1}\!\mathbf{C}\boldsymbol{\Sigma}_x.$$
This moves the inversion from the source space into the emission space.
The expression for the posterior mean, Eq. 2.13, likewise invokes an inversion in source space (via the posterior covariance); it, too, can be rewritten so that the only inversion required lives in emission space (Eq. 2.24).
Intuitively, the posterior mean is the prior mean, adjusted by a “correction factor.”
The correction measures the difference between the emission as observed and the emission as predicted from the prior mean alone; this discrepancy is then mapped back into the source space.
Despite appearances, matrix inversions are still required in Eq. 2.24; they occur in the calculation of the matrix that maps the discrepancy back into source space, which involves the inverse of the covariance matrix of the marginal distribution over emissions.
Thus in fact only one matrix need be inverted in computing the entire posterior distribution, namely the marginal covariance of the emissions, and the same inverse serves for both the posterior mean and the posterior covariance.
To make the complete calculation more perspicuous, we collect here the marginal cumulants from Eqs. 2.16 and 2.17:
along with this alternative (“Woodburied”) computation of the posterior cumulants:
where on the last line we have rewritten the posterior covariance yet again using the expression for the emission precision.
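Here is a numpy sketch of the two routes to the posterior of a factor analyzer, in the notation of the sketches above (diagonal emission covariance D, standard-normal prior); the check at the end confirms that the naive and “Woodburied” computations agree.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 5, 50                                 # source and emission dimensions
C = rng.normal(size=(M, K))
c = rng.normal(size=M)
D = np.diag(rng.uniform(0.5, 1.5, size=M))   # diagonal emission covariance
y = rng.normal(size=M)                       # an "observation"

# Route 1: source-space inversion (Eqs. 2.13-2.14 specialized to the factor analyzer).
Sigma_post = np.linalg.inv(np.eye(K) + C.T @ np.linalg.inv(D) @ C)
mu_post = Sigma_post @ (C.T @ np.linalg.inv(D) @ (y - c))

# Route 2: "Woodburied" -- the only inverse is of the M x M marginal covariance.
Sigma_marg = C @ C.T + D                     # marginal covariance of the emissions
gain = C.T @ np.linalg.inv(Sigma_marg)       # maps emission-space discrepancies to source space
mu_post_w = gain @ (y - c)                   # prior mean is zero for the factor analyzer
Sigma_post_w = np.eye(K) - gain @ C

print(np.allclose(mu_post, mu_post_w), np.allclose(Sigma_post, Sigma_post_w))
```

Route 2 inverts only the single M x M marginal covariance, and that same inverse serves both the posterior mean and the posterior covariance.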
2.1.3 Sparse coding
To motivate “sparse” codes, we consider more complicated marginal distributions of data, focusing in particular on the distribution of natural images, for which mixture models and factor analyzers are inadequate.
To see this, we (following Hinton and Ghahramani [15]) cast these two model types as poles of a continuum of models, from sparsest to densest.
As we saw above, one possible formulation of mixture models interprets the “source” variable, represented one-hot, as a bank of “hidden units,” exactly one of which is on for any given observation.
At the other extreme, in factor analyzers, all of the “hidden units” (elements of the source vector) are generically nonzero, and all of them participate in generating every observation.
To make matters even more concrete, consider a GMM whose emission covariance is fixed across classes, so that the activity of the source variables determines only the mean of the emission.
Then the emissions for the GMM and factor analyzer can be written identically, as a normal distribution about a linear combination of basis vectors (the columns of the emission matrix), with the source variables supplying the coefficients; the two models differ only in how many of those coefficients can be nonzero at once.
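The shared form is easy to see in a few lines of numpy: in both cases the observation is normally distributed about a linear combination of basis vectors (here the columns of an arbitrary matrix C); only the distribution of the coefficients differs.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 16, 4                    # observation and source dimensions
C = rng.normal(size=(M, K))     # columns = basis vectors ("parts")

# GMM-style source: one-hot, so the mean is a single column of C.
x_gmm = np.eye(K)[rng.integers(K)]
y_gmm = rng.multivariate_normal(C @ x_gmm, np.eye(M))

# Factor-analysis-style source: densely distributed, so the mean blends all columns.
x_fa = rng.normal(size=K)
y_fa = rng.multivariate_normal(C @ x_fa, np.eye(M))
```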
The statistics of natural scenes.
Both types of model are problematic for complex distributions, like that of natural images.
Gaussian mixture models can indeed approximate any marginal density over the observations, given sufficiently many mixture components. But precisely because only one source variable is “on” for each observation, the model cannot compose an image out of reusable parts: every distinct image (up to emission noise) requires its own basis vector, and the number of components required grows combinatorially with the complexity of the scenes.
A factor analyzer, in contrast, can avail itself of all basis vectors at once in attempting to explain or generate an observation, and consequently can in theory recognize an image in terms of its constituent parts.
By the same token, it needs no more hidden units than there are pixels in the images: with as many source variables as pixels, the model marginal can already realize any Gaussian distribution over images. The trouble is that it can realize only Gaussian distributions, whereas the distribution of natural images has considerable structure beyond its second-order statistics.
The existence of such higher-order structure as seen in Fig. LABEL:fig:pixelDistribution is perhaps not obvious a priori.
After all, the size of the covariance matrix for the model marginal grows quadratically with the number of pixels, which would seem to provide plenty of flexibility. But most of these entries are redundant: to a first approximation, the statistics of natural images are invariant to spatial shifts, so the covariance between two pixels depends only on their relative displacement, not on their absolute positions.
Covariance matrices arising from shift-invariant data will in general have this “symmetric Toeplitz” structure.
More specifically, when the shift-invariant data are two-dimensional, like images, the covariance matrix will be block-Toeplitz with Toeplitz blocks [10].
Furthermore, if we assume that the covariance drops to zero at some distance within the image, and we ignore edge effects, the covariance matrix can be approximated as block-circulant with circulant blocks, in which case its eigenvectors are the (two-dimensional) discrete Fourier modes, and its eigenvalues constitute the power spectrum [10].⁷
⁷ Since the eigenvectors of the covariance matrix are the principal components of the data (see also Section 7.3), this likewise implies that the Fourier modes are the principal components of shift-invariant data.
In natural images, the power spectrum appears to fall off with (spatial) frequency approximately as $1/f^2$; equivalently, the amplitude spectrum falls off approximately as $1/f$.
Thus, natural images can be decorrelated by projection onto the Fourier modes, and then “whitened” by normalizing by the amplitude spectrum (the square roots of the power spectrum).
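A one-dimensional toy illustration of these claims, with an invented exponentially decaying covariance standing in for real image statistics: the DFT diagonalizes a circulant (shift-invariant) covariance, and dividing each Fourier coefficient by the amplitude spectrum whitens samples drawn from it.

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(4)
M = 128

# A shift-invariant covariance: entries depend only on (circular) pixel separation.
lags = np.minimum(np.arange(M), M - np.arange(M))
kernel = np.exp(-lags / 4.0)                      # covariance falls off with distance
Sigma = circulant(kernel)

# The DFT of the first column gives the eigenvalues (the power spectrum),
# and the unitary DFT matrix diagonalizes the covariance.
power = np.fft.fft(kernel).real
F = np.fft.fft(np.eye(M)) / np.sqrt(M)
print(np.allclose(F @ Sigma @ F.conj().T, np.diag(power), atol=1e-8))

# Whitening: divide each Fourier coefficient by the amplitude spectrum (sqrt of the power).
samples = rng.multivariate_normal(np.zeros(M), Sigma, size=2000)
white = np.fft.ifft(np.fft.fft(samples, axis=1) / np.sqrt(power), axis=1).real
print(np.allclose(np.cov(white.T), np.eye(M), atol=0.2))
```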
Yet it is easily seen from Fig. LABEL:fig:SimoncelliOlshausenFig5 that this process leaves most of the structure intact [44].
Or again, coming from the other direction, restoring the (approximately) $1/f$ amplitude spectrum to white noise yields an image that looks nothing like a natural scene.
Moderately sparse generative models.
We have established, then, that neither factor analyzers nor GMMs are good models for natural images, albeit for roughly opposite reasons.
Furthermore, the inadequacy of the multivariate Gaussian as a marginal distribution for natural images is relevant to a wider range of models than the factor analyzer.
When the source variables are densely, even though non-normally, distributed, the model marginal will still tend to be approximately Gaussian: each element of the emission is then a weighted sum of many independent source variables, and something like the central limit theorem takes hold.
Between the “extremes” of the GMM and the factor analyzer lie models in which the “source” variables are sparsely distributed: some small subset are “on” for any given image. For example, they could be distributed according to a product over Bernoulli distributions with small means. Or again, we could interpret “on” more liberally and assign fat-tailed (leptokurtic) distributions to the source variables, so that they usually take on values close to zero but take on extreme values more often than normally distributed variables would. Common choices are the Laplace and Cauchy distributions [35, 29]. In this case it is convenient to write the prior distribution in the form of a product of Boltzmann distributions,
$$p(\boldsymbol{x}) = \prod_{k=1}^{K} p(x_k) = \prod_{k=1}^{K} \frac{1}{Z_k}\exp\!\left\{-E\!\left(x_k\right)\right\}, \tag{2.28}$$
with the energy functions $E(x) = |x|$ for the Laplace distribution, $E(x) = \log\!\left(1 + x^2\right)$ for the Cauchy, or, conveniently, $E(x) = \log\cosh(x)$.
The last is not a standard distribution, but for large $|x|$ its energy approaches that of the Laplace distribution (up to an additive constant), while remaining smooth at the origin.
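For reference, the three energy functions just mentioned, in scale-free form (in practice each usually carries a scale parameter), along with a numerical check that the hyperbolic-cosine energy matches the Laplace energy, up to an additive constant, far from the origin:

```python
import numpy as np

laplace_energy = lambda x: np.abs(x)            # Laplace prior:  p(x) proportional to exp(-|x|)
cauchy_energy  = lambda x: np.log1p(x**2)       # Cauchy prior:   p(x) proportional to 1/(1 + x^2)
cosh_energy    = lambda x: np.log(np.cosh(x))   # smooth, with Laplace-like tails

x = np.array([5.0, 10.0, 20.0])
print(cosh_energy(x) - (laplace_energy(x) - np.log(2)))   # ~0: log cosh(x) = |x| - log 2 for large |x|
```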
Since we are aiming at economy of expression (few descriptors per image), we must concede prodigality in vocabulary (many descriptors) [39]—i.e., an overcomplete set of basis vectors. This is made explicit in Fig. 2.4B, along with the independence statements implied by our choices of prior and emission distributions. A factor of 2 overcompleteness is common [29].
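As a generative sketch (all parameter values invented): a twice-overcomplete dictionary of unit-norm basis vectors, a factorial Laplace prior on the coefficients, and an isotropic Gaussian emission.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 64                                    # "pixels" per image patch
K = 2 * M                                 # twice-overcomplete set of basis vectors
C = rng.normal(size=(M, K))
C /= np.linalg.norm(C, axis=0)            # unit-norm basis vectors (columns)
sigma = 0.05                              # emission noise scale

def sample_patch():
    x = rng.laplace(scale=0.2, size=K)    # sparse-ish coefficients: mostly near zero
    y = C @ x + sigma * rng.normal(size=M)
    return x, y

x, y = sample_patch()
```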
…
Inference and marginalization in a sparse-coding model: Laplace’s method.
There is a reason that factor analyzers and GMMs remain popular despite their limitations: the model posterior and marginal distributions are computable in closed form. This computability is due to the prior distributions being conjugate (or pseudo-conjugate) to the emission likelihoods. When they are not, the marginal and posterior distributions can typically be computed only approximately. When the prior is Laplace, e.g., it is not clear that the posterior can be computed (although see [38]).
Perhaps the simplest approach is to make a Taylor approximation.⁸
⁸ Another, more powerful class of techniques for approximate inference is based on (iterative) optimization of the posterior. However, this endows it with the character of learning, and accordingly we shall defer discussing these techniques until Chapter 6.
In particular, consider approximating the energy (the negative logarithm of the joint distribution, considered as a function of the source variables) by a second-order Taylor expansion about a local minimum,
i.e., at the mode, the energy's gradient must be zero and its Hessian must be positive definite. For a generic energy, the second-order Taylor approximation at the mode is therefore the energy at the mode, plus a quadratic form in the deviation from the mode with the Hessian of the energy as its matrix.
The only remaining step is to exponentiate the negated approximation: the result is an unnormalized Gaussian in the source variables, centered at the mode, with covariance given by the inverse of the Hessian there (Eq. 2.31).
Note that this approximation is valid at realizations of the source variables near the mode; far from it, the quadratic, and hence the Gaussian, may match the true energy poorly.
So much for approximating the posterior; we now move on to the model marginal. Using the definition of conditional probability with the approximate posterior distribution, Eq. 2.31, and then evaluating at the mode, we see that the marginal probability is approximately the joint probability evaluated at the mode, multiplied by the normalizer of the approximate posterior: $p(\boldsymbol{y}) \approx p(\hat{\boldsymbol{x}}, \boldsymbol{y})\,(2\pi)^{K/2}\left|\mathbf{H}(\hat{\boldsymbol{x}})\right|^{-1/2}$, where $\hat{\boldsymbol{x}}$ is the posterior mode, $\mathbf{H}$ the Hessian of the energy, and $K$ the dimension of the source (Eq. 2.32).
This is the approximate model marginal under Laplace’s method.
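A sketch of Laplace's method for a generic smooth energy, using scipy to locate the mode and a finite-difference Hessian; the helper name and the sanity check are mine.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_marginal(neg_log_joint, x0):
    """Approximate log p(y), the log of the integral of p(y, x) over x, by Laplace's method.

    neg_log_joint(x) must return -log p(y, x) for the observation of interest
    (y is baked in); x0 is an initial guess for the posterior mode.
    """
    x_hat = minimize(neg_log_joint, x0, method="BFGS").x   # the posterior mode
    K, eps = x0.size, 1e-4

    # Finite-difference Hessian of the energy at the mode.
    H = np.zeros((K, K))
    E0 = neg_log_joint(x_hat)
    for i in range(K):
        for j in range(K):
            ei, ej = eps * np.eye(K)[i], eps * np.eye(K)[j]
            H[i, j] = (neg_log_joint(x_hat + ei + ej) - neg_log_joint(x_hat + ei)
                       - neg_log_joint(x_hat + ej) + E0) / eps**2

    _, logdet = np.linalg.slogdet(H)
    return -E0 + 0.5 * K * np.log(2 * np.pi) - 0.5 * logdet

# Sanity check: for a quadratic energy (a Gaussian joint), Laplace's method is exact.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
quad = lambda x: 0.5 * x @ A @ x
print(laplace_log_marginal(quad, np.zeros(2)),
      0.5 * (2 * np.log(2 * np.pi) - np.linalg.slogdet(A)[1]))
```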
Let us specialize to an isotropic Gaussian emission and a factorial prior distribution, Eq. 2.28. Start with the Hessian:
$$\mathbf{H}(\boldsymbol{x}) = \frac{1}{\sigma^2}\mathbf{C}^{\mathrm{T}}\mathbf{C} + \mathrm{diag}\!\left(E''(x_1), \ldots, E''(x_K)\right),$$
where $\sigma^2$ is the (isotropic) emission variance and $E''$ denotes the second derivative of the energy function in Eq. 2.28.
Hence, from Eq. 2.31, an approximate posterior distribution for the sparse-coding model is a normal distribution centered at the posterior mode, with covariance given by the inverse of this Hessian evaluated there, where the mode itself has no closed-form expression and must be found numerically.
To specialize the marginal distribution to the Laplace prior, we use Eq. 2.32, substituting the Laplace energies into the joint, but the hyperbolic-cosine energies into the Hessian (to maintain differentiability):
…
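Finally, a sketch that puts the pieces of this subsection together for the sparse-coding model itself: an isotropic Gaussian emission with an invented noise level, hyperbolic-cosine energies throughout (a simplification of the recipe above, which keeps the Laplace energies in the joint), the posterior mode found by quasi-Newton optimization, and the Laplace-approximate posterior covariance taken from the Hessian at that mode.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
M, K, sigma = 64, 128, 0.05                      # pixels, (overcomplete) coefficients, noise scale
C = rng.normal(size=(M, K))
C /= np.linalg.norm(C, axis=0)                   # unit-norm basis vectors
x_true = rng.laplace(scale=0.5, size=K) * (rng.random(K) < 0.1)   # mostly zeros
y = C @ x_true + sigma * rng.normal(size=M)

def energy(x):
    """-log p(y, x), up to additive constants, with log-cosh prior energies."""
    r = y - C @ x
    return r @ r / (2 * sigma**2) + np.sum(np.log(np.cosh(x)))

def grad(x):
    return -C.T @ (y - C @ x) / sigma**2 + np.tanh(x)

# MAP estimate of the coefficients (the posterior mode).
x_map = minimize(energy, np.zeros(K), jac=grad, method="L-BFGS-B").x

# Laplace-approximate posterior covariance: inverse Hessian of the energy at the mode.
H = C.T @ C / sigma**2 + np.diag(1.0 - np.tanh(x_map)**2)
Sigma_post = np.linalg.inv(H)
```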