2.1 Static models
As promised, we consider models with independent and identically distributed (i.i.d.) Gaussian emissions, but with different prior distributions of the “source” variable. We start with the “sparsest,” a mixture model.
2.1.1 Mixture models and the GMM
††margin: include a plot of example data Consider the graphical model in Fig. 2.1. The generative process could be described as first rolling a -sided die in order to pick a mean (vector) and covariance (matrix) from a set of means and covariances, followed by a draw from a multivariate normal distribution described by these parameters:
Here, is a vector of the probabilities of the classes (sides of a possibly loaded die); its elements sum to one.
In Eq. 2.2, the support of could be the integers between 1 and (inclusive). However, a common alternative representation for the categorical random variable is a one-hot vector, , i.e. a vector consisting of all zeros except for a single 1 at the category being represented. This allows us to rewrite the model in other convenient forms. For example, the prior distribution can place the random variable into the exponent,
which is particularly useful when working with log probabilities. The emission can be expressed similarly,
or, alternatively, its mean can be expressed as a linear function of the die roll,
where the columns of the matrix are the mean vectors. But this formulation cannot be extended elegantly to the covariance matrix. Herein lies the chief advantage of Eq. 2.4 over Eq. 2.5: in Eq. 2.4, the dependence of the covariance matrix, as well as of the mean, on the source variable can be expressed algebraically. We shall see an example of its usefulness when we turn to learning in the Gaussian mixture model, Section 7.1.
Inference in the GMM.
Inference in this model is a simple application of Bayes’ rule. We start by noting that the posterior is necessarily another categorical distribution (i.e., a toss of a die); the only question is what the probabilities of each category (each side of the die) are.
We begin somewhat pedantically with Bayes’s theorem as it is expressed in Eq. 2.1, but for a categorical random variable represented one-hot. To emphasize that we are making inferences from actual observations, we write this in terms of an observation from the data distribution, :
i.e., the output of the softmax function††margin: softmax function, or more properly the soft argmax , where . For the vector of all possible categories, , the posterior is therefore
The derivation so far is general to any mixture model. For the GMM’s Gaussian emissions, Eq. 2.4, becomes
where the constant is irrelevant since it occurs in all the terms and therefore factors out of the softmax. This quantity, , is (up to an additive constant) the log of the posterior probability of class . What does it mean?
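Before answering that question, it may help to see the computation of Eq. 2.7 spelled out numerically. The following NumPy sketch evaluates the posterior class probabilities for a toy GMM; the variable names (log_pi for the log prior probabilities, mus and Sigmas for the class means and covariances) are illustrative choices rather than the text's notation, and subtracting the maximum logit before exponentiating is exactly the irrelevant constant factoring out of the softmax.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_posterior(y, log_pi, mus, Sigmas):
    """Posterior class probabilities for one observation y under a mixture model
    with Gaussian emissions: softmax_k( log pi_k + log N(y; mu_k, Sigma_k) )."""
    logits = np.array([
        lp + multivariate_normal.logpdf(y, mean=mu, cov=S)
        for lp, mu, S in zip(log_pi, mus, Sigmas)
    ])
    logits -= logits.max()            # the shared constant factors out of the softmax
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum()

# toy example: two classes in two dimensions (all numbers invented)
log_pi = np.log(np.array([0.7, 0.3]))
mus    = [np.zeros(2), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_posterior(np.array([1.5, 0.2]), log_pi, mus, Sigmas))
```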
One way to interpret Eq. 2.8 is (mentally) to fix and see what the contours of constant (log-)probability look like. Or again, one can set for some , in which case the resulting expression (in ) is the boundary between classes and . In either case, these expressions are second-order polynomials in , a property inherited from the normal distribution. This is illustrated in Fig. LABEL:fig:XXX. If we sought fancier (higher-order) boundaries between classes, we would need a distribution with more “shape” parameters. ††margin: insert figure of quadratic class boundaries here; also a subfig of linear boundaries with the projection…..
Moving toward simpler, rather than more complex, contours, consider now the special case of constant covariance matrices across classes. Then the terms and could be factored out of all the elements of , and cancelled (like ). That leaves
a linear function of . Setting two such expressions equal to each other for different classes and (and applying a little bit of algebra), we see that two classes are equiprobable when
Class is more probable than class when the left-hand side is larger than the right, and vice versa. (Notice, though, that in neither case is one of these classes guaranteed to be most likely, since we are here ignoring all other classes).
We can even ignore the covariance matrix if we are willing to work in the whitened space (i.e., absorb a factor of into and ). Then Eq. 2.9 becomes transparent. To decide the relative probability of a (whitened) observation belonging to class or class , we first measure how far it is from the midpoint between the class centers, . We then project this (whitened) displacement onto the vector connecting their (whitened) centers. Here the vector runs from to , so the probability of class increases with (positive) distance along the projection. But the point of equality on the projection is not zero displacement, because one class may be a priori more probable than the other. The term on the right-hand side accounts for this.
Acquiring the parameters of the Gaussian mixture model with identical covariances (Eq. 2.1.1), and then using it to classify points according to the posterior distribution (Eq. 2.7), is known as linear discriminant analysis (LDA)††margin: linear discriminant analysis (LDA) , a nearly century-old method that goes back to Fisher. When the covariances differ, as in the more generic GMM of Eq. 2.8, the discriminating boundary is quadratic, and the method is known, appropriately, as QDA††margin: QDA . If we generalize along a different avenue, however, allowing the emissions to be members of the same but otherwise arbitrary exponential family with identical dispersions, the boundary remains linear [21]. (We will discuss exponential families in detail in Chapter 5.) In all of these cases, estimating the parameters is conceptually straightforward: we separate the data into classes and compute the sample statistics—e.g., means and covariances for the GMM—within each. We discuss this parameter estimation more formally in Chapter 6.
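As a rough sketch of how such a classifier is assembled in practice (the estimators themselves are treated properly in Chapter 6), the following NumPy fragment computes class priors, class means, and a single pooled covariance from labeled data, and then scores a new point with one linear discriminant per class. All names and the toy data are invented for illustration.

```python
import numpy as np

def fit_lda(Y, labels, K):
    """Class priors, class means, and one pooled covariance from labeled data."""
    N, D = Y.shape
    pi  = np.array([(labels == k).mean() for k in range(K)])
    mus = np.array([Y[labels == k].mean(axis=0) for k in range(K)])
    resid = Y - mus[labels]                    # deviations about each class mean
    Sigma = resid.T @ resid / N                # pooled ("identical") covariance
    return pi, mus, Sigma

def lda_scores(y, pi, mus, Sigma):
    """One linear discriminant per class; the argmax gives the most probable class."""
    P = np.linalg.solve(Sigma, mus.T).T        # row k holds Sigma^{-1} mu_k
    return P @ y - 0.5 * np.sum(P * mus, axis=1) + np.log(pi)

# toy usage with invented, labeled, two-class data
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([3.0, 1.0], 1.0, size=(100, 2))])
labels = np.repeat([0, 1], 100)
pi, mus, Sigma = fit_lda(Y, labels, K=2)
print(np.argmax(lda_scores(np.array([1.5, 0.5]), pi, mus, Sigma)))
```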
Marginalization in mixture models.
As noted at the outset, our evaluation of the model often requires . It is easily read off the denominator in our derivation of a mixture model’s posterior distribution:
For any easily evaluable emission distributions, this is computed painlessly: the probability of the datum under mixture component , scaled by the prior probability of component , and then summed over all .
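In code, this marginalization is most safely carried out in the log domain with a log-sum-exp, as in the following sketch (again with illustrative rather than canonical variable names):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_marginal(y, log_pi, mus, Sigmas):
    """log p(y) = logsumexp_k [ log pi_k + log N(y; mu_k, Sigma_k) ]."""
    logliks = np.array([multivariate_normal.logpdf(y, mean=mu, cov=S)
                        for mu, S in zip(mus, Sigmas)])
    return logsumexp(log_pi + logliks)

# same invented toy mixture as above
log_pi = np.log(np.array([0.7, 0.3]))
mus    = [np.zeros(2), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_log_marginal(np.array([1.5, 0.2]), log_pi, mus, Sigmas))
```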
Mixture models and conjugacy.
Comparing Eqs. 2.7 and 2.2, we see that the posterior distribution over is in the same family as the prior distribution—to wit, the family of categorical distributions. We need not even have carried out any derivation to see this: the support of is the set of length- one-hot vectors (or alternatively, the support of is the set of integers from 1 to ), any distribution over which can be described as categorical. Now recall the definition of a conjugate prior††margin: conjugate prior : the prior distribution of a parameter is conjugate to the likelihood of that parameter when the resulting posterior distribution over the parameter is in the same family as the prior. So we interpret the “source” variable as a set of parameters, albeit with some trepidation, since the parameters of exponential families are always continuous rather than discrete (although see [47]). Still, comparing Eqs. 2.3 and 2.4, we see that the “prior” distribution mimics the “likelihood” in the usual way, with the sufficient statistics being the vector of log probabilities. Thus encouraged, we put the emission in exponential-family form,
and the prior distribution into the standard form for conjugate priors:
The lack of a log-partition function should give us pause—it says that the partition function is unity for all values of the “parameters”—but we could nevertheless interpret as the (first) standard hyperparameter of the conjugate prior. To compute the posterior distribution, this gets updated in the standard way for conjugate priors, namely, by adding the sufficient statistics: for all .
In fine, the categorical distribution is in some sense “conjugate” to any likelihood function of log probability. This is one way of understanding why mixture models are amenable to Bayesian analyses….
2.1.2 Jointly Gaussian models and factor analyzers
We begin with a more generic model than the ones which will be useful to us later. In particular, we begin simply by replacing the categorical source variable from the previous section with a normally distributed random vector, as in Fig. 2.2A. We refine this slightly by assuming further that (1) the emission mean is an affine function of the source, ; and (2) the emission covariance is fixed for all values of :
(2.11)
Inference in jointly Gaussian models.
To compute the posterior, we of course use Bayes’s theorem, Eq. 2.1, but this time we take advantage of a property of the prior and emission distributions alluded to at the start of the chapter. As this example nicely illustrates, it will not be necessary to compute the normalizer explicitly. That is because we will recognize the form of the distribution—in this case, another Gaussian—even without the normalizer. This property follows from the fact that both the prior and the emission are exponentiated second-order polynomials in . Multiplying two such functions together yields a third exponentiated second-order polynomial in (adding second-order polynomials can’t increase their order), which must be a third normal distribution.
There are circumstances under which this is really all we need to know; but what if we want the normalizer explicitly? We still don’t have to take the integral, because the normalizer for a Gaussian can be expressed simply as a function of the covariance matrix: . (We use throughout this book [11].) (Another way to think about this is that someone had to take the integral once, but we don’t have to do it again.) Naturally, to compute this we do need an expression for the covariance matrix, so it won’t do to leave the second-order polynomial in any form whatever; we need to re-arrange it into a quadratic form, with the inverse covariance matrix in the middle. That is the only algebraic work that actually needs to be done in applying Bayes’s theorem:
where we have defined
and
These posterior cumulants should be somewhat intuitive. The posterior mean is a convex combination of the information from the emission and the prior distributions. More precisely, it is a convex combination of the prior mean, , and the centered observation that has been transformed into the source space, . Here is a pseudo-inverse of . (When is higher-dimensional than , this could be any right inverse; when is higher-dimensional than , the pseudo-inverse should be interpreted in the least-squares sense.) The weights are the (normalized) precisions of these two distributions, although the precision of the emission about must be transformed into the source space first with . The posterior precision, for its part, is just the sum of these two (unnormalized) precisions. As we shall see shortly, it will be convenient to have a name for the normalized precision of the emission:
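A minimal numerical sketch of these posterior cumulants may be useful here. It assumes a linear-Gaussian model with (hypothetical) prior mean mu_x, prior covariance Sigma_x, emission matrix C, offset c, and emission covariance Sigma_yx; the posterior precision is the sum of the prior precision and the transformed emission precision, and the posterior mean weights the prior mean and the centered, back-projected observation by those precisions.

```python
import numpy as np

def gaussian_posterior(y, mu_x, Sigma_x, C, c, Sigma_yx):
    """Posterior cumulants of the source x given emission y, for the model
    x ~ N(mu_x, Sigma_x),  y | x ~ N(C x + c, Sigma_yx)."""
    P_x  = np.linalg.inv(Sigma_x)                      # prior precision
    P_yx = np.linalg.inv(Sigma_yx)                     # emission precision
    Sigma_post = np.linalg.inv(P_x + C.T @ P_yx @ C)   # sum of (transformed) precisions
    mu_post = Sigma_post @ (P_x @ mu_x + C.T @ P_yx @ (y - c))
    return mu_post, Sigma_post

# invented numbers: two-dimensional source, three-dimensional emission
rng = np.random.default_rng(1)
mu_x, Sigma_x = np.zeros(2), np.eye(2)
C, c, Sigma_yx = rng.normal(size=(3, 2)), np.zeros(3), 0.5 * np.eye(3)
print(gaussian_posterior(rng.normal(size=3), mu_x, Sigma_x, C, c, Sigma_yx))
```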
Marginalization in jointly Gaussian models.
So much for the posterior. What about ? We suspect it will be yet another normal distribution, although we need to make sure. Given this suspicion, and since a normal distribution is completely characterized by its first two cumulants, we use the laws of total expectation,
and of total covariance,
The higher cumulants of are necessarily zero because each term in the law of total cumulance invokes a higher-order (than two) cumulant of either or , and these are all zero. So the marginal distribution is indeed again normal:
The source-emission covariance.
There is a third convenient way to characterize the joint distribution, and that is to give the distribution of the vector of and concatenated together. Unsurprisingly, this vector is also normally distributed, and indeed we have almost all the pieces for this already in hand. (Notice that the joint distribution of and does not have a convenient form for the Gaussian mixture model.) The mean is given by the concatenation of and . The covariance matrix has and in its upper-left and lower-right blocks. What remains is the upper-right (or its transpose in the lower-left) block, , i.e. the covariance between and .
From Eq. 2.11, we can write , where is independent of . Hence:
This covariance will turn out to be useful later when we take on the learning problem.
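All three second-order quantities (the marginal mean, the marginal covariance, and the source-emission covariance) can be checked by brute force against samples from a made-up linear-Gaussian model, as in the sketch below; agreement holds only up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000                                   # Monte Carlo sample size

# invented model parameters: x ~ N(mu_x, Sigma_x), y = C x + c + noise
mu_x    = np.array([1.0, -0.5])
Sigma_x = np.array([[1.0, 0.3], [0.3, 0.5]])
C       = rng.normal(size=(3, 2))
c       = rng.normal(size=3)
Sigma_n = 0.4 * np.eye(3)

X = rng.multivariate_normal(mu_x, Sigma_x, size=N)
Y = X @ C.T + c + rng.multivariate_normal(np.zeros(3), Sigma_n, size=N)

# marginal mean, marginal covariance, and source-emission covariance
print(np.allclose(Y.mean(0), C @ mu_x + c, atol=0.05))
print(np.allclose(np.cov(Y.T), C @ Sigma_x @ C.T + Sigma_n, atol=0.05))
cross = np.cov(np.hstack([X, Y]).T)[:2, 2:]   # the Cov[x, y] block
print(np.allclose(cross, Sigma_x @ C.T, atol=0.05))
```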
Jointly Gaussian random variables and conjugacy.
For the models described by Eq. 2.11, it is perhaps unsurprising that the posterior is again normal (Eq. 2.12). After all, the normal distribution is well known to be the conjugate prior for the mean of another normal distribution (it induces a normal posterior distribution over that mean). However, in the family of jointly Gaussian models we are now considering (Eq. 2.11), the mean of the emission distribution is an affine function of, rather than identical with, the source random variable. Evidently, conjugacy holds for this more generic class as well; the classical result for a normally distributed mean can be construed as a special case.
Factor analysis.
We consider now the special case where the underlying normal source variable is never observed—it is latent. For example, suppose that is the (random) vector of performances on a battery of intelligence tests of various kinds. Each sample corresponds to individual ’s performance on all exams. We hypothesize that this performance can be explained by a smaller, , number of latent intelligence “factors,” , that are normally distributed across individuals. But we can’t observe them directly; we only observe test scores, which are (ideally) an affine function of these latent factors. Finally, we allow for errors in the model, errors in our measurements, unmodelled environmental inputs, etc. by allowing our test results to have been corrupted by Gaussian noise.
This sounds like the jointly Gaussian model just discussed, but the fact that we never observe obliges us to restrict the degrees of freedom of the model, as discussed at the outset of this chapter. In particular, we have just seen that the marginal distribution of observed data is normal, Eq. 2.18, and in factor analysis, this is the only distribution we ever see. Eq. 2.16 shows that, as far as this distribution is concerned, any change in the mean of the prior distribution, , could equally well be attributed to a change in the emission offset, . So without loss of generality, we can let .
A similar consideration applies to the prior covariance. From Eq. 2.17, we see that any non-isotropic covariance could simply be absorbed into . Indeed, this makes the interpretation of the model more intuitive: we are asking for the independent factors that together explain test performances. Any covariance across different kinds of intelligence tests is due to their having similar “loading” onto these independent factors.
[[We remove one more degree of freedom. In general, could be any positive definite matrix, but assuming it to be diagonal, , yields a useful interpretation…..]].
So our complete model is:
(2.20)
In what follows, we shall assume that one has the correct model, and needs to solve only the inference problem. We shall return to learning in Section 7.3.
Inference and marginalization in a factor analyzer.
We take the marginal probability first this time. Applying Eqs. 2.16, 2.17, and 2.18 to the special case of Eq. 2.20, we have
Inference in a factor analyzer is just a special case of inference in the jointly Gaussian model introduced at the beginning of this section. That is, it requires only the computation of the posterior cumulants, Eqs. 2.13 and 2.14, under the distributions that define the model, Eq. 2.20:
Anticipating the computational costs of this, and more complicated, inference problems with Gaussian random variables, we note with some dismay the matrix inversions, which cost at least for matrices. In factor analysis, as we have noted, the observations are higher-dimensional than the latent variables . It is fortunate, then, that the only matrix inversion carried out in the observation space is of , which is just (invert the entries on the diagonal). But what about models where the “source space” is larger than the “emission space”? What if the emission covariance is not diagonal? And how can we limit the total number of inversions when the inference step has to be applied repeatedly?
Alternative expressions for the posterior cumulants.
To change the space in which the posterior covariance is computed, we can apply the Woodbury matrix-inversion lemma, Eq. B.14, to Eq. 2.14:
This moves the inversion from source () to emission () space. (Notice, incidentally, that the inverted quantity is precisely the marginal covariance, .)
The expression for the posterior mean, Eq. 2.13, likewise invokes an inversion in source space (), as well as one in the emission space (). In factor analysis, the first inversion was not required because the prior mean, , was assumed to be zero. Here we eliminate it by substituting the “Woodburied” posterior covariance, Eq. 2.23, into Eq. 2.13. We also make use of the definition of the emission precision, Eq. 2.15:
Intuitively, the posterior mean is the prior mean, adjusted by a “correction factor.” The correction measures the difference between the emission as observed () and as predicted by the prior mean (), weighted by the precision of the emission ().
Despite appearances, matrix inversions are still required in Eq. 2.24; they occur in the calculation of . Consulting its definition, Eq. 2.15, we see one inversion of the emission covariance, and another in the computation of the posterior covariance, albeit in the space of emissions via Eq. 2.23. With a little work, one of these inversions can be eliminated:
Thus in fact only one matrix need be inverted in computing the entire posterior distribution, namely —which, by Eq. 2.17, is just the marginal covariance, . So we can also write
To make the complete calculation more perspicuous, we collect here the marginal cumulants from Eqs. 2.16 and 2.17:
along with this alternative (“Woodburied”) computation of the posterior cumulants:
where on the last line we have rewritten the posterior covariance yet again using the expression for the emission precision.
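The equivalence of the two routes to the posterior cumulants, and the fact that the “Woodburied” route inverts only the marginal covariance, are easy to confirm numerically. The sketch below does so for a factor analyzer with invented loadings; the names (C for the loading matrix, Psi for the diagonal emission covariance) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
K, D = 3, 10                                   # latent factors, observed dimensions
C   = rng.normal(size=(D, K))                  # invented loading matrix
Psi = np.diag(rng.uniform(0.5, 1.5, size=D))   # diagonal emission covariance

# direct route: an inversion in the K-dimensional source space
Sigma_post_direct = np.linalg.inv(np.eye(K) + C.T @ np.linalg.inv(Psi) @ C)

# "Woodburied" route: the only inversion is of the D-dimensional marginal covariance
marginal_cov = C @ C.T + Psi
Sigma_post_woodbury = np.eye(K) - C.T @ np.linalg.solve(marginal_cov, C)

print(np.allclose(Sigma_post_direct, Sigma_post_woodbury))        # True

# the posterior mean can be written with that same single inversion
y = rng.normal(size=D)
mu_post = C.T @ np.linalg.solve(marginal_cov, y)                  # prior mean is zero
print(np.allclose(mu_post, Sigma_post_direct @ C.T @ np.linalg.solve(Psi, y)))
```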
2.1.3 Sparse coding
To motivate “sparse” codes, we consider more complicated marginal distributions of data, focusing in particular on the distribution of natural images, for which mixture models and factor analyzers are inadequate models. To see this, we (following Hinton and Ghahramani [15]) cast these two model types as poles of a continuum of models, from sparsest to densest. As we saw above, one possible formulation of mixture models interprets the “source” variable, , as a one-hot vector of length (as opposed to a scalar with support in the set ). We can even think of each element of as a separate “unit” in a neural network. Indeed, the winner-take-all neural networks of the 1980s and ’90s [12, Chapter 9] can be thought of as implying (perhaps approximate) posterior distributions for generative models that are mixtures. From this perspective, then, the term “mixture” is potentially misleading: the marginal distribution of observations is indeed a mixture, whence the name, but each individual example is not mixed. That is, each is generated by just one element of , and the code for generating it is maximally sparse.
At the other extreme, in factor analyzers, all of the “hidden units” (elements of ) typically will have non-zero values, since they are independent and normally distributed. All elements of a sample therefore potentially contribute to a single emission . This is a dense code.
To make matters even more concrete, consider a GMM whose emission covariance is fixed across classes, so that the activity of the source variables determines only the mean of the emission. Then the emissions for the GMM and factor analyzer can be written identically, as a normal distribution about (cf. Eq. 2.5 and Eq. 2.20). That is, both models generate data as a “combination” of basis vectors, the columns of , but differ in how many columns are allowed to contribute to a single observation.
The statistics of natural scenes.
Both types of model are problematic for complex distributions, like that of natural images. Gaussian mixture models can indeed approximate any marginal density over (Eq. 2.10) to arbitrary accuracy [49]: intuitively, a distribution of any shape can be constructed by placing little blobs of probability mass at various locations. This is essential for the distribution of images, where the presence of (e.g.) lines makes the marginal distribution of (say) triplets of adjacent pixels inexplicable by their mean values or even pairwise correlations alone (Fig. LABEL:fig:pixelDistribution). But each such bit of structure must be “explained” with a separate basis vector! There is no way to explain a letter “X” as a superimposition of two lines; rather, the “X” and the two lines into which it might be decomposed would each require its own basis vector. More generally, every image (or image patch) that could not be expressed as a Gaussian disturbance of some other image would require another column in . Thus the ability to model structure comes at the price of a very large number of mixture components. It must be emphasized that we believe this price to be too high precisely because we believe the structure in images to be “componential,” explicable in terms of combinations of simpler basis elements.
A factor analyzer, in contrast, can avail itself of all basis vectors at once in attempting to explain or generate an observation, and consequently can in theory recognize an image in terms of its constituent parts. By the same token, it needs no more hidden units than there are pixels in the images: can be square, i.e. a complete basis. However, the marginal distribution of observations under a factor analyzer is just another normal distribution (recall Eq. 2.21)—a fact that looked helpful from the perspective of calculating with the model, but is fatal for modeling natural images. This is a consequence of the higher-order structure in images alluded to above.
The existence of such higher-order structure as seen in Fig. LABEL:fig:pixelDistribution is perhaps not obvious a priori. After all, the size of the covariance matrix for the model marginal, , scales quadratically in the number of pixels—a lot of representational power to work with! For example, the covariance matrix of even modestly sized, (256×256)-pixel, images has independent entries. It would seem that we could represent any distribution with this many parameters. But in fact, by taking certain symmetries into account, we reduce the number of covariance entries to a linear function of the number of pixels. The statistics of natural images are, for the most part, stationary: the correlation between any two pixels depends on the distance between them but not their absolute locations. (Many natural images, as seen by land-dwelling animals, have sky in the upper portion and ground below, which does introduce statistical regularities that depend on absolute position; overrepresentation of vertical and (especially) horizontal lines would likewise introduce asymmetries.) In our example, there are only distances—from, say, the upper left pixel to any pixel in the image, including itself. So the covariance cannot encode so much information after all.
Covariance matrices arising from shift-invariant data will in general have this “symmetric Toeplitz” structure. More specifically, when the shift-invariant data are two-dimensional, like images, the covariance matrix will be block-Toeplitz with Toeplitz blocks [10]. Furthermore, if we assume that the covariance drops to zero at some distance within the image, and we ignore edge effects, the covariance matrix can be approximated as block-circulant with circulant blocks, in which case its eigenvectors are the (two-dimensional) discrete Fourier modes, and its eigenvalues constitute the power spectrum [10]. (Since the eigenvectors of the covariance matrix are the principal components of the data (see also Section 7.3), this likewise implies that the Fourier modes are the principal components of shift-invariant data.) In natural images, the power spectrum appears to fall off with frequency at a rate of very nearly [44].
Thus, natural images can be decorrelated by projection onto the Fourier modes, and then “whitened” by normalizing by the amplitude spectrum (the square roots of the power spectrum). Yet it is easily seen (Fig. LABEL:fig:SimoncelliOlshausenFig5) that this process leaves most of the structure intact [44]. Or again, coming from the other direction, restoring the (approximately) power spectrum to white noise does not produce a natural image (Fig. LABEL:fig:SimoncelliOlshausenFig5). Factor analyzers thus have no means of capturing the structure in natural images that lies beyond the mean and the power spectrum.
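The whitening operation itself is simple to write down. The sketch below applies it to a synthetic image with a 1/f amplitude spectrum and random phases (a stand-in, since no real image is loaded here); for such a phase-randomized image, whitening leaves only noise, whereas for a genuine natural image, as just noted, most of the structure would survive.

```python
import numpy as np

def whiten_spectrally(img):
    """Normalize an image's Fourier coefficients by its amplitude spectrum."""
    F = np.fft.fft2(img)
    amplitude = np.abs(F)
    amplitude[amplitude == 0] = 1.0            # guard against exact spectral nulls
    return np.real(np.fft.ifft2(F / amplitude))

# stand-in "image": random phases under a 1/f amplitude spectrum
n = 256
fx, fy = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n))
radius = np.sqrt(fx**2 + fy**2)
radius[0, 0] = 1.0                             # avoid dividing by zero at DC
phases = np.exp(2j * np.pi * np.random.default_rng(4).uniform(size=(n, n)))
img = np.real(np.fft.ifft2(phases / radius))

white = whiten_spectrally(img)
# after whitening, the amplitude spectrum is flat (all ones, up to numerical error)
print(np.abs(np.fft.fft2(white)).std())
```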
Moderately sparse generative models.
We have established, then, that neither factor analyzers nor GMMs are good models for natural images, albeit for roughly opposite reasons. Furthermore, the inadequacy of the multivariate Gaussian as a marginal distribution for natural images is relevant to a wider range of models than the factor analyzer. When the source variables are densely, even though non-normally, distributed, the model marginal may very well end up approximately normal anyway, for central-limit-like reasons: the marginal distribution is an average of many independent, “random” basis vectors. Averaging can wash out higher-order structure in these vectors. This is (I take it) another way of expressing Hinton’s conjecture that the most useful codes will be componential, in order to represent the properties—pose, deformation, etc.—of the objects in the image; but also non-dense, in order to accommodate multiple objects in the same image [15].
Between the “extremes” of the GMM and the factor analyzer lie models in which the “source” variables are sparsely distributed: some small subset are “on” for any given image. For example, they could be distributed according to a product over Bernoulli distributions with small means. Or again, we could interpret “on” more liberally and assign fat-tailed (leptokurtotic) distributions to the source variables, so that they usually take on values close to zero but can take on extreme values more often than normally distributed variables. Common choices are the Laplace and Cauchy distributions [35, 29]. In this case it is convenient to write the prior distribution in the form of a product of Boltzmann distributions,
(2.28)
with the energy functions chosen so as to enforce some form of sparseness. For example,
The last is not a standard distribution, but for large it approximates the Laplace distribution while retaining differentiability at 0 [30]; see Fig. LABEL:fig:sparseDistributions. It will serve as our default example in the sparse-coding model that follows. The emissions, for their part, are once again normally distributed about a linear function of the source variables, and for simplicity we have given them identical variance, . Eq. 2.28 describes the complete sparse-coding model, a graphical model that is shown in Fig. 2.4A.
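To see how a smooth energy can mimic the Laplace energy, the sketch below compares |x| with the log-hyperbolic-cosine energy (1/beta) log cosh(beta x), one common smooth surrogate that approaches |x| as beta grows; the exact parameterization used in Eq. 2.29 may differ.

```python
import numpy as np

def laplace_energy(x):
    return np.abs(x)

def cosh_energy(x, beta=10.0):
    """(1/beta) log cosh(beta x): smooth at zero, close to |x| once |x| >> 1/beta."""
    return (np.logaddexp(beta * x, -beta * x) - np.log(2.0)) / beta

xs = np.linspace(-3.0, 3.0, 7)
print(np.round(laplace_energy(xs), 3))
print(np.round(cosh_energy(xs), 3))   # nearly |x| away from the origin, but smooth at 0
```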
Since we are aiming at economy of expression (few descriptors per image), we must concede prodigality in vocabulary (many descriptors) [39]—i.e., an overcomplete set of basis vectors. This is made explicit in Fig. 2.4B, along with the independence statements implied by our choices of prior and emission distributions. A factor of 2 overcompleteness is common [29].
…
Inference and marginalization in a sparse-coding model: Laplace’s method.
There is a reason that factor analyzers and GMMs remain popular despite their limitations: the model posterior and marginal distributions are computable in closed form. This computability is due to the prior distributions being conjugate (or pseudo-conjugate) to the emission likelihoods. When they are not, the marginal and posterior distributions can typically be computed only approximately. When the prior is Laplace, e.g., it is not clear that the posterior can be computed (although see [38]).
Perhaps the simplest approach is to make a Taylor approximation. (Another, more powerful class of techniques for approximate inference is based on (iterative) optimization of the posterior; however, this endows it with the character of learning, and accordingly we shall defer discussing these techniques until Chapter 6.) In particular, consider approximating the energy, , of a distribution near its mode, , with the first three terms of the Taylor series. It is easy to see that, for to be a mode of the distribution, it must satisfy the following conditions:
i.e., at the mode, the energy’s gradient must be zero and its Hessian must be positive definite. For a generic energy, the second-order Taylor approximation at the mode is therefore
The only remaining -dependent term should look familiar: it is the energy of a normal distribution with mean at the mode and inverse covariance equal to the Hessian of the energy (evaluated at the mode). The first term, , on the other hand, is constant in the argument and can therefore be absorbed into the normalizer. Hence we can approximate the posterior as
Note that this approximation is valid at realizations of even “moderately” close to , since the neglected higher-order terms need only be less than unity in order for the exponentiation to drive their contribution to the distribution down to much smaller values. On the other hand, it is obviously inappropriate for distributions having multiple modes, or over discrete random variables.
So much for approximating the posterior; we now move on to the model marginal. Using the definition of conditional probability with the approximate posterior distribution, Eq. 2.31, and then evaluating at the mode, we see that
This is the approximate model marginal under Laplace’s method.
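Laplace's method is easy to carry out numerically in one dimension, which makes for a useful sanity check. The sketch below finds the mode of a made-up unimodal energy, approximates the curvature at the mode by finite differences, and compares the resulting Gaussian estimate of the normalizer with brute-force quadrature.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

# a made-up, unimodal, non-Gaussian energy E(x); the density is proportional to exp(-E(x))
def energy(x):
    return 0.5 * (x - 1.0) ** 2 + 0.3 * np.log(1.0 + x ** 2)

x_hat = minimize_scalar(energy).x              # the mode minimizes the energy
eps = 1e-4                                     # finite-difference curvature at the mode
curvature = (energy(x_hat + eps) - 2 * energy(x_hat) + energy(x_hat - eps)) / eps ** 2

# Laplace's method: Gaussian estimate of the normalizer vs. brute-force quadrature
Z_laplace = np.exp(-energy(x_hat)) * np.sqrt(2.0 * np.pi / curvature)
Z_exact, _ = quad(lambda x: np.exp(-energy(x)), -np.inf, np.inf)
print(Z_laplace, Z_exact)                      # close, though not identical
```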
Let us specialize to an isotropic Gaussian emission and a factorial prior distribution, Eq. 2.28. Start with the Hessian:
where is a diagonal matrix with the vector as its diagonal. For concreteness, let us specialize further to Laplace-distributed source variables, as in Eq. 2.29. Unfortunately, the curvature of this energy function is undefined at the origin, and (equally bad) zero everywhere else. Lewicki and colleagues [29, 31] therefore suggest approximating this energy—in the Hessian only—with the hyperbolic-cosine energy (see again Eq. 2.29), which approaches the Laplace energy in the limit of large . The second derivative of this energy is
Hence, from Eq. 2.31, an approximate posterior distribution for the sparse-coding model is
where is the element-wise product and is supposed to act element-wise. Notice that although this may appear to be independent of the observations , they enter through the mode, which obviously varies as a function of .
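Putting the pieces together, the sketch below performs approximate inference in a toy sparse-coding model: it finds the posterior mode by minimizing the negative log joint with a log-cosh prior energy, and then forms the Laplace-approximate posterior covariance from the Hessian at that mode. The basis A, noise variance sigma2, and sharpness beta are all invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
D, K = 8, 16                                   # overcomplete by a factor of 2
A = rng.normal(size=(D, K)) / np.sqrt(D)       # invented basis (columns = features)
sigma2, beta = 0.1, 10.0                       # noise variance and prior sharpness
y = A[:, 2] + 0.5 * A[:, 11] + np.sqrt(sigma2) * rng.normal(size=D)

def neg_log_joint(x):
    """Quadratic reconstruction error plus the smooth (log-cosh) sparseness energy."""
    resid = y - A @ x
    prior_energy = np.sum(np.logaddexp(beta * x, -beta * x) - np.log(2.0)) / beta
    return resid @ resid / (2.0 * sigma2) + prior_energy

x_hat = minimize(neg_log_joint, np.zeros(K), method="L-BFGS-B").x

# Laplace approximation: Gaussian posterior with mean x_hat and this precision matrix
prior_curvature = beta / np.cosh(beta * x_hat) ** 2    # 2nd derivative of the prior energy
posterior_precision = A.T @ A / sigma2 + np.diag(prior_curvature)
posterior_cov = np.linalg.inv(posterior_precision)

print(np.round(x_hat, 2))     # a sparse code: most entries should sit near zero
print(posterior_cov.shape)
```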
To specialize the marginal distribution to the Laplace prior, we use Eq. 2.32, substituting the Laplace energies into the joint, but the hyperbolic-cosine energies into the Hessian (to maintain differentiability):
…