2.1 Static models

As promised, we consider models with independent and identically distributed (i.i.d.) Gaussian emissions, but with different prior distributions of the “source” variable. We start with the “sparsest,” a mixture model.

2.1.1 Mixture models and the GMM

Figure 2.1: Gaussian mixture model. [Graphical model: a source node $\bm{\hat{X}}$ with $\hat{p}(\bm{\hat{x}};\bm{\theta}) = \text{Categ}(\bm{\pi})$ and an emission node $\bm{\hat{Y}}$ with $\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \mathcal{N}(\bm{\mu}_{\hat{x}},\>\mathbf{\Sigma}_{\hat{x}})$, enclosed in a plate over $N$ samples.]
margin: include a plot of example data

Consider the graphical model in Fig. 2.1. The generative process could be described as first rolling a $K$-sided die in order to pick a mean (vector) and covariance (matrix) from a set of means and covariances, followed by a draw from a multivariate normal distribution described by these parameters:

\begin{equation}
\begin{split}
\hat{p}(\hat{x};\bm{\theta}) &= \text{Categ}(\bm{\pi})\\
\hat{p}(\bm{\hat{y}} \mid \hat{x};\bm{\theta}) &= \mathcal{N}\!\left(\bm{\mu}_{\hat{x}},\>\mathbf{\Sigma}_{\hat{x}}\right).
\end{split}
\tag{2.2}
\end{equation}

Here, $\bm{\pi}$ is the vector of probabilities of the $K$ classes (sides of a possibly loaded die); its elements sum to one.

In Eq. 2.2, the support of $\hat{X}$ could be the integers between 1 and $K$ (inclusive). However, a common alternative representation for the categorical random variable is a one-hot vector, $\bm{\hat{X}}$, i.e. a vector consisting of all zeros except for a single 1 at the category being represented. This allows us to rewrite the model in other convenient forms. For example, the prior distribution can place the random variable into the exponent,

\begin{equation}
\hat{p}(\bm{\hat{x}};\bm{\theta}) = \prod_{k=1}^{K} \pi_k^{\hat{x}_k},
\tag{2.3}
\end{equation}

which is particularly useful when working with log probabilities. The emission can be expressed similarly,

\begin{equation}
\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \prod_{k=1}^{K} \left[\mathcal{N}\!\left(\bm{\mu}_k,\>\mathbf{\Sigma}_k\right)\right]^{\hat{x}_k};
\tag{2.4}
\end{equation}

or, alternatively, its mean can be expressed as a linear function of the die roll,

\begin{equation}
\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \mathcal{N}\!\left(\mathbf{C}\bm{\hat{x}},\>\mathbf{\Sigma}_{\hat{x}}\right),
\tag{2.5}
\end{equation}

where the columns of the matrix $\mathbf{C} = \left[\bm{\mu}_1,\ldots,\bm{\mu}_K\right]$ are the mean vectors. But this formulation cannot be extended elegantly to the covariance matrix, and therein lies the chief advantage of Eq. 2.4 over Eq. 2.5: in Eq. 2.4, the dependence of the covariance matrix, as well as of the mean, on the source variable can be expressed algebraically. We shall see an example of its usefulness when we turn to learning in the Gaussian mixture model, Section 7.1.
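As a concrete (if toy) illustration, the generative process of Eqs. 2.2–2.5 might be simulated as follows in Python/NumPy; the parameter values and the helper name sample_gmm are invented purely for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: K = 3 classes, 2-dimensional emissions.
pi = np.array([0.5, 0.3, 0.2])                    # class probabilities (sum to one)
mus = np.array([[0., 0.], [3., 3.], [-3., 2.]])   # one mean vector per class
Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([2., 0.5])])

def sample_gmm(N):
    """Draw N samples (one-hot x, y) from the GMM of Eq. 2.2."""
    X = rng.multinomial(1, pi, size=N)            # one-hot "die rolls" (Eq. 2.3)
    ks = X.argmax(axis=1)                         # integer class labels
    C = mus.T                                     # columns are the class means (Eq. 2.5)
    Y = np.stack([rng.multivariate_normal(C @ x, Sigmas[k])
                  for x, k in zip(X, ks)])        # class-dependent covariance (Eq. 2.4)
    return X, Y

X, Y = sample_gmm(5)
\end{verbatim}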

Inference in the GMM.

Inference in this model is a simple application of Bayes’ rule. We start by noting that the posterior is necessarily another categorical distribution (i.e., a toss of a die); the only question is what the probabilities of each category (each side of the die) are.

We begin somewhat pedantically with Bayes's theorem as it is expressed in Eq. 2.1, but for a categorical random variable represented one-hot. To emphasize that we are making inferences from actual observations, we write this in terms of an observation from the data distribution, $\bm{y}$:

\begin{equation}
\begin{split}
\hat{p}\!\left(\hat{X}_j = 1 \mid \bm{y};\bm{\theta}\right)
&= \frac{\hat{p}\!\left(\bm{y} \mid \hat{X}_j = 1;\bm{\theta}\right)\hat{p}\!\left(\hat{X}_j = 1;\bm{\theta}\right)}{\sum_{\bm{\hat{x}}}\hat{p}\!\left(\bm{y} \mid \bm{\hat{x}};\bm{\theta}\right)\hat{p}\!\left(\bm{\hat{x}};\bm{\theta}\right)}\\
&= \frac{\hat{p}\!\left(\bm{y} \mid \hat{X}_j = 1;\bm{\theta}\right)\hat{p}\!\left(\hat{X}_j = 1;\bm{\theta}\right)}{\sum_{k=1}^{K}\hat{p}\!\left(\bm{y} \mid \hat{X}_k = 1;\bm{\theta}\right)\hat{p}\!\left(\hat{X}_k = 1;\bm{\theta}\right)}\\
&= \frac{\hat{p}\!\left(\bm{y} \mid \hat{X}_j = 1;\bm{\theta}\right)\pi_j}{\sum_{k=1}^{K}\hat{p}\!\left(\bm{y} \mid \hat{X}_k = 1;\bm{\theta}\right)\pi_k}\\
&= \frac{\exp\left\{\log\left(\hat{p}\!\left(\bm{y} \mid \hat{X}_j = 1;\bm{\theta}\right)\pi_j\right)\right\}}{\sum_{k=1}^{K}\exp\left\{\log\left(\hat{p}\!\left(\bm{y} \mid \hat{X}_k = 1;\bm{\theta}\right)\pi_k\right)\right\}}\\
&= \operatorname{softmax}\left\{\bm{\hat{z}}\right\}_j,
\end{split}
\tag{2.6}
\end{equation}

i.e., the $j^{\text{th}}$ output of the softmax function margin: softmax function, or more properly the soft argmax , where $\hat{z}_k = \log\left(\hat{p}(\bm{y} \mid \hat{X}_k = 1;\bm{\theta})\,\pi_k\right)$. For the vector of all possible categories, $\bm{\hat{X}}$, the posterior is therefore

\begin{equation}
\hat{p}(\bm{\hat{x}} \mid \bm{y};\bm{\theta}) = \text{Categ}\!\left(\operatorname{softmax}\left\{\bm{\hat{z}}\right\}\right).
\tag{2.7}
\end{equation}

The derivation so far is general to any mixture model. For the GMM’s Gaussian emissions, Eq. 2.4, $\hat{z}_k$ becomes

\begin{equation}
\hat{z}_k = c - \frac{1}{2}\log\left|\mathbf{\Sigma}_k\right| - \frac{1}{2}\left(\bm{y} - \bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{y} - \bm{\mu}_k\right) + \log\pi_k,
\tag{2.8}
\end{equation}

where the constant $c$ is irrelevant, since it occurs in every $\hat{z}_k$ and therefore factors out of the softmax. This quantity, $\hat{z}_k$, is (up to an additive constant) the log of the posterior probability of class $k$. What does it mean?

One way to interpret Eq. 2.8 is (mentally) to fix $\hat{z}_k$ and see what the contours of constant (log-)probability look like. Or again, one can set $\hat{z}_k \stackrel{\text{set}}{=} \hat{z}_j$ for some $j \neq k$, in which case the resulting expression (in $\bm{y}$) is the boundary between classes $k$ and $j$. In either case, these expressions are second-order polynomials in $\bm{y}$, a property inherited from the normal distribution. This is illustrated in Fig. LABEL:fig:XXX. If we sought fancier (higher-order) boundaries between classes, we would need a distribution with more “shape” parameters. margin: insert figure of quadratic class boundaries here; also a subfig of linear boundaries with the projection…..

Moving toward simpler, rather than more complex, contours, consider now the special case of constant covariance matrices across classes. Then the terms $-\frac{1}{2}\log|\mathbf{\Sigma}|$ and $-\frac{1}{2}\bm{y}^{\text{T}}\mathbf{\Sigma}^{-1}\bm{y}$ could be factored out of all the elements of $\bm{\hat{z}}$, and cancelled (like $c$). That leaves

\[
-\frac{1}{2}\left(\bm{\mu}_k^{\text{T}}\mathbf{\Sigma}^{-1}\bm{\mu}_k - 2\bm{\mu}_k^{\text{T}}\mathbf{\Sigma}^{-1}\bm{y}\right) + \log\pi_k,
\]

a linear function of $\bm{y}$. Setting two such expressions equal to each other for different classes $k$ and $j$ (and applying a little bit of algebra), we see that two classes are equiprobable when

\begin{equation}
\left(\bm{y} - \frac{1}{2}\left(\bm{\mu}_k + \bm{\mu}_j\right)\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{\mu}_k - \bm{\mu}_j\right) = \log\left(\frac{\pi_j}{\pi_k}\right).
\tag{2.9}
\end{equation}

Class $k$ is more probable than class $j$ when the left-hand side is larger than the right, and vice versa. (Notice, though, that in neither case is either class guaranteed to be the most likely overall, since we are here ignoring all other classes.)

We can even ignore the covariance matrix if we are willing to work in the whitened space (i.e., absorb a factor of $\mathbf{\Sigma}^{-1/2}$ into $\bm{y}$ and $\bm{\mu}$). Then Eq. 2.9 becomes transparent. To decide the relative probability of a (whitened) observation belonging to class $k$ or class $j$, we first measure how far it is from the midpoint between the class centers, $\frac{1}{2}\left(\bm{\mu}_k + \bm{\mu}_j\right)$. We then project this (whitened) displacement onto the vector connecting their (whitened) centers. Here the vector runs from $j$ to $k$, so the probability of class $k$ increases with (positive) distance along the projection. But the point of equality on the projection is not zero displacement, because one class may be a priori more probable than the other. The term on the right-hand side accounts for this.

Acquiring the parameters of the Gaussian mixture model with identical covariances (Eq. 2.1.1), and then using it to classify points according to the posterior distribution (Eq. 2.7), is known as linear discriminant analysis (LDA) margin: linear discriminant analysis (LDA) , a nearly century-old method that goes back to Fisher. When the covariances differ, as in the more generic GMM of Eq. 2.8, and the discriminating boundary is quadratic, the method is known, appropriately, as QDA margin: QDA . If we generalize along a different avenue, however, allowing the emissions to be members of the same but otherwise arbitrary exponential family with identical dispersions, the boundary remains linear [21]. (We will discuss exponential families in detail in Chapter 5.) In all of these cases, estimating the parameters is conceptually straightforward: we separate the data into classes and compute the sample statistics—e.g., means and covariances for the GMM—within each. We discuss this parameter estimation more formally in Chapter 6.
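As a rough sketch of how LDA might be carried out in practice, the following Python/NumPy fragment computes the per-class sample statistics and the left-hand side of Eq. 2.9; the helper names fit_lda and lda_boundary_statistic are invented for illustration.

\begin{verbatim}
import numpy as np

def fit_lda(Y, labels, K):
    """Sample statistics for a shared-covariance GMM (the LDA model)."""
    N, L = Y.shape
    pi = np.array([(labels == k).mean() for k in range(K)])        # class priors
    mus = np.stack([Y[labels == k].mean(axis=0) for k in range(K)])  # class means
    resid = Y - mus[labels]
    Sigma = resid.T @ resid / N          # pooled (shared) within-class covariance
    return pi, mus, Sigma

def lda_boundary_statistic(y, k, j, mus, Sigma):
    """Left-hand side of Eq. 2.9: whitened displacement from the midpoint,
    projected onto the vector connecting the class centers."""
    midpoint = 0.5 * (mus[k] + mus[j])
    return (y - midpoint) @ np.linalg.solve(Sigma, mus[k] - mus[j])
\end{verbatim}

Class $k$ is then favored over class $j$ whenever the returned statistic exceeds $\log(\pi_j/\pi_k)$.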

Marginalization in mixture models.

As noted at the outset, our evaluation of the model often requires $\hat{p}(\bm{y};\bm{\theta})$. It is easily read off the denominator in our derivation of a mixture model’s posterior distribution:

\begin{equation}
\hat{p}(\bm{y};\bm{\theta}) = \sum_{k=1}^{K} \hat{p}\!\left(\bm{y} \mid \hat{X}_k = 1;\bm{\theta}\right)\pi_k.
\tag{2.10}
\end{equation}

For any easily evaluable emission distributions, this is computed painlessly: evaluate the probability of the datum $\bm{y}$ under each mixture component $k$, scale it by the prior probability of that component, and sum over all $k$.
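In code, the posterior (Eq. 2.7) and the marginal (Eq. 2.10) can be computed together in the log domain, which avoids underflow when the per-component densities are small. The following Python sketch (using SciPy's logsumexp and softmax; the function name is invented) is one way to do it.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import softmax, logsumexp

def gmm_posterior_and_marginal(y, pi, mus, Sigmas):
    """Posterior over classes (Eq. 2.7) and log marginal (Eq. 2.10) for one datum y."""
    # z_k = log N(y; mu_k, Sigma_k) + log pi_k, as in Eq. 2.8 (up to the constant c).
    z = np.array([multivariate_normal.logpdf(y, mus[k], Sigmas[k]) + np.log(pi[k])
                  for k in range(len(pi))])
    posterior = softmax(z)          # Eq. 2.6: softmax of the z's
    log_marginal = logsumexp(z)     # Eq. 2.10 in the log domain
    return posterior, log_marginal
\end{verbatim}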

Mixture models and conjugacy.

Comparing Eqs. 2.7 and 2.2, we see that the posterior distribution over $\bm{\hat{X}}$ is in the same family as the prior distribution—to wit, the family of categorical distributions. We need not even have carried out any derivation to see this: the support of $\bm{\hat{X}}$ is the set of length-$K$ one-hot vectors (or alternatively, the support of $\hat{X}$ is the set of integers from 1 to $K$), any distribution over which can be described as categorical. Now recall the definition of a conjugate prior margin: conjugate prior : the prior distribution of a parameter is conjugate to the likelihood of that parameter when the resulting posterior distribution over the parameter is in the same family as the prior. So we interpret the “source” variable $\bm{\hat{X}}$ as a set of parameters, albeit with some trepidation, since the parameters of exponential families are always continuous rather than discrete (although see [47]). Still, comparing Eqs. 2.3 and 2.4, we see that the “prior” distribution mimics the “likelihood” in the usual way, with the sufficient statistics being the vector of log probabilities. Thus encouraged, we put the emission in exponential-family form,

\[
\hat{p}(\bm{y} \mid \bm{\hat{x}};\bm{\theta}) = \exp\left\{\sum_{k=1}^{K} \hat{x}_k \log\left(\hat{p}(\bm{y} \mid \hat{x}_k;\bm{\theta})\right)\right\},
\]

and the prior distribution into the standard form for conjugate priors:

\[
\hat{p}(\bm{\hat{x}};\bm{\theta}) = \exp\left\{\sum_{k=1}^{K} \hat{x}_k \log\pi_k\right\}.
\]

The lack of a log-partition function should give us pause—it says that the partition function is unity for all values of the “parameters”—but we could nevertheless interpret $\log\bm{\pi}$ as the (first) standard hyperparameter of the conjugate prior. To compute the posterior distribution, this gets updated in the standard way for conjugate priors, namely, by adding the sufficient statistics: $\log\pi_k + \log\left(\hat{p}(\bm{y} \mid \hat{x}_k;\bm{\theta})\right)$ for all $k$.

In fine, the categorical distribution is in some sense “conjugate” to any likelihood function of log probability. This is one way of understanding why mixture models are amenable to Bayesian analyses….

2.1.2 Jointly Gaussian models and factor analyzers

Figure 2.2: Emission and source. These models make equivalent independence statements about $\bm{\hat{X}}$ and $\bm{\hat{Y}}$, but parameterize the distribution differently. [Panel (A): $\hat{p}(\bm{\hat{x}};\bm{\theta}) = \mathcal{N}(\bm{\mu}_{\hat{x}},\>\mathbf{\Sigma}_{\hat{x}})$ and $\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \mathcal{N}(\bm{\mu}_{\hat{y}|\hat{x}},\>\mathbf{\Sigma}_{\hat{y}|\hat{x}})$. Panel (B): $\hat{p}(\bm{\hat{y}};\bm{\theta}) = \mathcal{N}(\bm{\mu}_{\hat{y}},\>\mathbf{\Sigma}_{\hat{y}})$ and $\hat{p}(\bm{\hat{x}} \mid \bm{\hat{y}};\bm{\theta}) = \mathcal{N}(\bm{\mu}_{\hat{x}|\hat{y}},\>\mathbf{\Sigma}_{\hat{x}|\hat{y}})$.]

We begin with a more generic model than the ones which will be useful to us later. In particular, we begin simply by replacing the categorical source variable from the previous section with a (vector) normally distributed variate, as in Fig. 2.2A. We refine this slightly by assuming further that (1) the emission mean is an affine function of the source, $\bm{\hat{X}}$; and (2) the emission covariance is fixed for all values of $\bm{\hat{X}}$:

\begin{equation}
\hat{p}(\bm{\hat{x}};\bm{\theta}) = \mathcal{N}\!\left(\bm{\mu}_{\hat{x}},\>\mathbf{\Sigma}_{\hat{x}}\right),
\qquad
\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \mathcal{N}\!\left(\mathbf{C}\bm{\hat{x}} + \bm{c},\>\mathbf{\Sigma}_{\hat{y}|\hat{x}}\right).
\tag{2.11}
\end{equation}

Inference in jointly Gaussian models.

To compute the posterior, we of course use Bayes’s theorem, Eq. 2.1, but this time we take advantage of a property of the prior and emission distributions alluded to at the start of the chapter. As this example nicely illustrates, it will not be necessary to compute the normalizer explicitly. That is because we will recognize the form of the distribution—in this case, another Gaussian—even without the normalizer. This property follows from the fact that both the prior and the emission are exponentiated second-order polynomials in $\bm{\hat{x}}$. Multiplying two such functions together yields a third exponentiated second-order polynomial in $\bm{\hat{x}}$ (adding second-order polynomials can’t increase their order), which must correspond to a third normal distribution.

There are circumstances under which this is really all we need to know; but what if we want the normalizer explicitly? We still don’t have to take the integral, because the normalizer for a Gaussian can be expressed simply as a function\footnote{We use $\tau = 2\pi$ throughout this book [11].} of the covariance matrix: $Z = \tau^{K/2}|\mathbf{\Sigma}_{\hat{x}|\hat{y}}|^{1/2}$. (Another way to think about this is that someone had to take the integral once, but we don’t have to do it again.) Naturally, to compute this we do need an expression for the covariance matrix, so it won’t do to leave the second-order polynomial in any form whatever; we need to re-arrange it into a quadratic form, with the inverse covariance matrix in the middle. That is the only algebraic work that actually needs to be done in applying Bayes’s theorem:

\begin{equation}
\begin{split}
\hat{p}(\bm{\hat{x}} \mid \bm{\hat{y}};\bm{\theta})
&\propto \hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta})\,\hat{p}(\bm{\hat{x}};\bm{\theta})\\
&\propto \exp\left\{-\frac{1}{2}\left(\mathbf{C}\bm{\hat{x}} + \bm{c} - \bm{\hat{y}}\right)^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\left(\mathbf{C}\bm{\hat{x}} + \bm{c} - \bm{\hat{y}}\right) - \frac{1}{2}\left(\bm{\hat{x}} - \bm{\mu}_{\hat{x}}\right)^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{x}}\left(\bm{\hat{x}} - \bm{\mu}_{\hat{x}}\right)\right\}\\
&\propto \exp\left\{-\frac{1}{2}\left[\bm{\hat{x}}^{\text{T}}\left(\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} + \mathbf{\Sigma}^{-1}_{\hat{x}}\right)\bm{\hat{x}} - 2\left(\left(\bm{\hat{y}} - \bm{c}\right)^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} + \bm{\mu}_{\hat{x}}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{x}}\right)\bm{\hat{x}}\right]\right\}\\
&= \exp\left\{-\frac{1}{2}\left[\bm{\hat{x}}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{x}|\hat{y}}\bm{\hat{x}} - 2\bm{\mu}_{\hat{x}|\hat{y}}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{x}|\hat{y}}\bm{\hat{x}}\right]\right\}\\
&\propto \exp\left\{-\frac{1}{2}\left(\bm{\hat{x}} - \bm{\mu}_{\hat{x}|\hat{y}}\right)^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{x}|\hat{y}}\left(\bm{\hat{x}} - \bm{\mu}_{\hat{x}|\hat{y}}\right)\right\}\\
&= \mathcal{N}\!\left(\bm{\mu}_{\hat{x}|\hat{y}},\>\mathbf{\Sigma}_{\hat{x}|\hat{y}}\right),
\end{split}
\tag{2.12}
\end{equation}

where we have defined

\begin{equation}
\bm{\mu}_{\hat{x}|\hat{y}} := \mathbf{\Sigma}_{\hat{x}|\hat{y}}\left[\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\left(\bm{\hat{y}} - \bm{c}\right) + \mathbf{\Sigma}^{-1}_{\hat{x}}\bm{\mu}_{\hat{x}}\right]
\tag{2.13}
\end{equation}

and

\begin{equation}
\mathbf{\Sigma}^{-1}_{\hat{x}|\hat{y}} := \mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} + \mathbf{\Sigma}^{-1}_{\hat{x}}.
\tag{2.14}
\end{equation}

These posterior cumulants should be somewhat intuitive. The posterior mean is a convex combination of the information from the emission and the prior distributions. More precisely, it is a convex combination of the prior mean, $\bm{\mu}_{\hat{x}}$, and the centered observation that has been transformed into the source space, $\mathbf{C}^{\dagger}(\bm{\hat{y}} - \bm{c})$. Here $\mathbf{C}^{\dagger}$ is a pseudo-inverse of $\mathbf{C}$.\footnote{When $\bm{\hat{X}}$ is higher-dimensional than $\bm{\hat{Y}}$, this could be any right inverse. When $\bm{\hat{Y}}$ is higher-dimensional than $\bm{\hat{X}}$, the pseudo-inverse should be interpreted in the least-squares sense.} The weights are the (normalized) precisions of these two distributions, although the precision of the emission about $\bm{\hat{X}}$ must be transformed into the source space first with $\mathbf{C}^{\text{T}}$. The posterior precision, for its part, is just the sum of these two (unnormalized) precisions. As we shall see shortly, it will be convenient to have a name for the normalized precision of the emission:

\begin{equation}
\mathbf{K} := \mathbf{\Sigma}_{\hat{x}|\hat{y}}\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}.
\tag{2.15}
\end{equation}
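Collecting Eqs. 2.13–2.15, a direct (source-space) implementation might look as follows; this Python/NumPy sketch, with an invented function name, naively inverts the prior and emission covariances, a point we return to below.

\begin{verbatim}
import numpy as np

def gaussian_posterior(y, mu_x, Sigma_x, C, c, Sigma_y_x):
    """Posterior cumulants of Eqs. 2.13-2.15, computed in the source space."""
    Prec_x = np.linalg.inv(Sigma_x)                                   # prior precision
    Prec_y_x = np.linalg.inv(Sigma_y_x)                               # emission precision
    Sigma_post = np.linalg.inv(C.T @ Prec_y_x @ C + Prec_x)           # Eq. 2.14
    mu_post = Sigma_post @ (C.T @ Prec_y_x @ (y - c) + Prec_x @ mu_x) # Eq. 2.13
    K = Sigma_post @ C.T @ Prec_y_x                                   # Eq. 2.15
    return mu_post, Sigma_post, K
\end{verbatim}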

Marginalization in jointly Gaussian models.

So much for the posterior. What about $\hat{p}(\bm{\hat{y}};\bm{\theta})$? We suspect it will be yet another normal distribution, although we need to make sure. Given this suspicion, and since a normal distribution is completely characterized by its first two cumulants, we use the laws of total expectation,

\begin{equation}
\begin{split}
\mathbb{E}_{\bm{\hat{Y}}}\!\left[\bm{\hat{Y}}\right]
&= \mathbb{E}_{\bm{\hat{X}}}\!\left[\mathbb{E}_{\bm{\hat{Y}}|\bm{\hat{X}}}\!\left[\bm{\hat{Y}} \mid \bm{\hat{X}}\right]\right]\\
&= \mathbb{E}_{\bm{\hat{X}}}\!\left[\mathbf{C}\bm{\hat{X}} + \bm{c}\right]\\
&= \mathbf{C}\bm{\mu}_{\hat{x}} + \bm{c} =: \bm{\mu}_{\hat{y}},
\end{split}
\tag{2.16}
\end{equation}

and of total covariance,

\begin{equation}
\begin{split}
\text{Cov}_{\bm{\hat{Y}}}\!\left[\bm{\hat{Y}}\right]
&= \text{Cov}_{\bm{\hat{X}}}\!\left[\mathbb{E}_{\bm{\hat{Y}}|\bm{\hat{X}}}\!\left[\bm{\hat{Y}} \mid \bm{\hat{X}}\right]\right] + \mathbb{E}_{\bm{\hat{X}}}\!\left[\text{Cov}_{\bm{\hat{Y}}|\bm{\hat{X}}}\!\left[\bm{\hat{Y}} \mid \bm{\hat{X}}\right]\right]\\
&= \text{Cov}_{\bm{\hat{X}}}\!\left[\mathbf{C}\bm{\hat{X}} + \bm{c}\right] + \mathbb{E}_{\bm{\hat{X}}}\!\left[\mathbf{\Sigma}_{\hat{y}|\hat{x}}\right]\\
&= \mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}} =: \mathbf{\Sigma}_{\hat{y}}.
\end{split}
\tag{2.17}
\end{equation}

The higher cumulants of $\bm{\hat{Y}}$ are necessarily zero because each term in the law of total cumulance invokes a higher-order (than two) cumulant of either $\bm{\hat{X}}$ or $\bm{\hat{Y}}|\bm{\hat{X}}$, and these are all zero. So the marginal distribution is indeed again normal:

\begin{equation}
\hat{p}(\bm{\hat{y}};\bm{\theta}) = \mathcal{N}\!\left(\bm{y};\,\bm{\mu}_{\hat{y}},\>\mathbf{\Sigma}_{\hat{y}}\right).
\tag{2.18}
\end{equation}

Between this marginal distribution, Eqs. 2.18, 2.16, and 2.17, and the posterior, Eqs. 2.12, 2.13, and 2.14, we have “reversed the arrow” in Fig. 2.2A, providing the alternative parameterization in the equivalent graphical model of Fig. 2.2B.

The source-emission covariance.

There is a third convenient way to characterize the joint distribution, and that is to give the distribution of the vector of $\bm{\hat{X}}$ and $\bm{\hat{Y}}$ concatenated together. Unsurprisingly, this vector is also normally distributed, and indeed we have almost all the pieces for this already in hand.\footnote{Notice that the joint distribution of $\bm{\hat{X}}$ and $\bm{\hat{Y}}$ does not have a convenient form for the Gaussian mixture model.} The mean is given by the concatenation of $\bm{\mu}_{\hat{x}}$ and $\bm{\mu}_{\hat{y}}$. The covariance matrix has $\mathbf{\Sigma}_{\hat{x}}$ and $\mathbf{\Sigma}_{\hat{y}}$ in its upper-left and lower-right blocks. What remains is the upper-right block (or its transpose in the lower-left), $\mathbf{\Sigma}_{\hat{x},\hat{y}}$, i.e. the covariance between $\bm{\hat{X}}$ and $\bm{\hat{Y}}$.

From Eq. 2.11, we can write $\bm{\hat{Y}} = \mathbf{C}\bm{\hat{X}} + \bm{c} + \bm{\hat{Z}}$, where $\bm{\hat{Z}} \sim \mathcal{N}\!\left(\bm{0},\>\mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)$ is independent of $\bm{\hat{X}}$. Hence:

\begin{equation}
\begin{split}
\text{Cov}\!\left[\bm{\hat{X}},\bm{\hat{Y}}\right]
&= \mathbb{E}\!\left[\bm{\hat{X}}\bm{\hat{Y}}^{\text{T}}\right] - \mathbb{E}\!\left[\bm{\hat{X}}\right]\mathbb{E}\!\left[\bm{\hat{Y}}\right]^{\text{T}}\\
&= \mathbb{E}\!\left[\bm{\hat{X}}\left(\mathbf{C}\bm{\hat{X}} + \bm{c} + \bm{\hat{Z}}\right)^{\text{T}}\right] - \mathbb{E}\!\left[\bm{\hat{X}}\right]\mathbb{E}\!\left[\mathbf{C}\bm{\hat{X}} + \bm{c} + \bm{\hat{Z}}\right]^{\text{T}}\\
&= \mathbb{E}\!\left[\bm{\hat{X}}\bm{\hat{X}}^{\text{T}}\right]\mathbf{C}^{\text{T}} + \mathbb{E}\!\left[\bm{\hat{X}}\right]\bm{c}^{\text{T}} + \mathbb{E}\!\left[\bm{\hat{X}}\right]\mathbb{E}\!\left[\bm{\hat{Z}}\right]^{\text{T}} - \mathbb{E}\!\left[\bm{\hat{X}}\right]\mathbb{E}\!\left[\bm{\hat{X}}\right]^{\text{T}}\mathbf{C}^{\text{T}} - \mathbb{E}\!\left[\bm{\hat{X}}\right]\bm{c}^{\text{T}} - \mathbb{E}\!\left[\bm{\hat{X}}\right]\mathbb{E}\!\left[\bm{\hat{Z}}\right]^{\text{T}}\\
&= \text{Cov}\!\left[\bm{\hat{X}}\right]\mathbf{C}^{\text{T}}\\
&= \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} =: \mathbf{\Sigma}_{\hat{x},\hat{y}}.
\end{split}
\tag{2.19}
\end{equation}

This covariance will turn out to be useful later when we take on the learning problem.
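Since all three sets of cumulants (posterior, marginal, and cross) follow from the same two distributions in Eq. 2.11, they are easy to check by simulation. The following Python/NumPy sketch (with made-up parameter values) draws samples from the joint model and compares the sample moments against Eqs. 2.16, 2.17, and 2.19; the tolerances are loose because the comparison is only Monte Carlo.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: 2-dimensional source, 3-dimensional emission.
mu_x = np.array([1., -1.])
Sigma_x = np.array([[1.0, 0.3], [0.3, 0.5]])
C = rng.standard_normal((3, 2))
c = np.array([0.5, 0.0, -0.5])
Sigma_y_x = 0.1 * np.eye(3)

# Sample (x, y) pairs from the joint model of Eq. 2.11.
N = 200_000
X = rng.multivariate_normal(mu_x, Sigma_x, size=N)
Y = X @ C.T + c + rng.multivariate_normal(np.zeros(3), Sigma_y_x, size=N)

# Compare sample moments with the analytic cumulants (stochastic, loose tolerances).
print(np.allclose(Y.mean(0), C @ mu_x + c, atol=0.02))                      # Eq. 2.16
print(np.allclose(np.cov(Y.T), C @ Sigma_x @ C.T + Sigma_y_x, atol=0.02))   # Eq. 2.17
Sxy = np.cov(X.T, Y.T)[:2, 2:]            # cross block of the joint covariance
print(np.allclose(Sxy, Sigma_x @ C.T, atol=0.02))                           # Eq. 2.19
\end{verbatim}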

Jointly Gaussian random variables and conjugacy.

For the models described by Eq. 2.11, it is perhaps unsurprising that the posterior is again normal (Eq. 2.12). After all, the normal distribution is well known to be the conjugate prior for the mean of another normal distribution (it induces a normal posterior distribution over that mean). However, in the family of jointly Gaussian models we are now considering (Eq. 2.11), the mean of the emission distribution is an affine function of, rather than identical with, the source random variable. Evidently, conjugacy holds for this more generic class as well; the classical result for a normally distributed mean can be construed as a special case.

margin: include a plot of example data margin: include the more explicit graphical model….

Factor analysis.

We consider now the special case where the underlying normal source variable $\bm{\hat{X}}$ is never observed—it is latent. For example, suppose that $\bm{Y}$ is the (random) vector of performances on a battery of intelligence tests of various kinds. Each sample $\bm{y}_n$ corresponds to individual $n$’s performance on all $L$ exams. We hypothesize that this performance can be explained by a smaller number, $K < L$, of latent intelligence “factors,” $\bm{\hat{X}}$, that are normally distributed across individuals. But we can’t observe them directly; we only observe test scores, which are (ideally) an affine function of these latent factors. Finally, we allow for errors in the model, errors in our measurements, unmodelled environmental inputs, etc., by allowing our test results to have been corrupted by Gaussian noise.

Figure 2.3: The factor analyzer. (A) The most abstract version of the model (cf. the GMM in Fig. 2.1): $\hat{p}(\bm{\hat{x}};\bm{\theta}) = \mathcal{N}(\bm{0},\>\mathbf{I})$ and $\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \mathcal{N}(\mathbf{C}\bm{\hat{x}} + \bm{c},\>\mathbf{\Lambda})$, with a plate over $N$ samples. (B) The independence statements made explicit, with individual latent factors $\hat{X}_1, \hat{X}_2, \hat{X}_3$ and observations $\hat{Y}_1,\ldots,\hat{Y}_5$ drawn as separate nodes.

This sounds like the jointly Gaussian model just discussed, but the fact that we never observe $\bm{\hat{X}}$ obliges us to restrict the degrees of freedom of the model, as discussed at the outset of this chapter. In particular, we have just seen that the marginal distribution of the observed data is normal, Eq. 2.18, and in factor analysis this is the only distribution we ever see. Eq. 2.16 shows that, as far as this distribution is concerned, any change in the mean of the prior distribution, $\bm{\mu}_{\hat{x}}$, could equally well be attributed to a change in the emission offset, $\bm{c}$. So without loss of generality, we can let $\bm{\mu}_{\hat{x}} = \bm{0}$.

A similar consideration applies to the prior covariance. From Eq. 2.17, we see that any non-isotropic covariance could simply be absorbed into $\mathbf{C}$. Indeed, this makes the interpretation of the model more intuitive: we are asking for the independent factors that together explain test performances. Any covariance across different kinds of intelligence tests is due to their having similar “loading” onto these independent factors.

[[We remove one more degree of freedom. In general, $\mathbf{\Sigma}_{\hat{y}|\hat{x}}$ could be any positive definite matrix, but assuming it to be diagonal, $\mathbf{\Lambda}$, yields a useful interpretation…..]]

So our complete model is:

\begin{equation}
\hat{p}(\bm{\hat{x}};\bm{\theta}) = \mathcal{N}\!\left(\bm{0},\>\mathbf{I}\right),
\qquad
\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}};\bm{\theta}) = \mathcal{N}\!\left(\mathbf{C}\bm{\hat{x}} + \bm{c},\>\mathbf{\Lambda}\right).
\tag{2.20}
\end{equation}

In what follows, we shall assume that one has the correct model, and needs to solve only the inference problem. We shall return to learning in Section 7.3.

Inference and marginalization in a factor analyzer.

We take the marginal probability first this time. Applying Eqs. 2.16, 2.17, and 2.18 to the special case of Eq. 2.20, we have

\begin{equation}
\hat{p}(\bm{\hat{y}};\bm{\theta}) = \mathcal{N}\!\left(\bm{c},\>\mathbf{C}\mathbf{C}^{\text{T}} + \mathbf{\Lambda}\right).
\tag{2.21}
\end{equation}

Inference in a factor analyzer is just a special case of inference in the jointly Gaussian model introduced at the beginning of this section. That is, it requires only the computation of the posterior cumulants, Eqs. 2.13 and 2.14, under the distributions that define the model, Eq. 2.20:

\begin{equation}
\hat{p}(\bm{\hat{x}} \mid \bm{\hat{y}};\bm{\theta}) = \mathcal{N}\!\left(\mathbf{\Sigma}_{\hat{x}|\hat{y}}\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\left(\bm{y} - \bm{c}\right),\>\left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C} + \mathbf{I}\right)^{-1}\right).
\tag{2.22}
\end{equation}

Anticipating the computational costs of this, and more complicated, inference problems with Gaussian random variables, we note with some dismay the matrix inversions, which cost at least $\mathcal{O}(M^{2.373})$ for $M \times M$ matrices. In factor analysis, as we have noted, the observations $\bm{y}$ are higher-dimensional than the latent variables $\bm{\hat{X}}$. It is fortunate, then, that the only matrix inversion carried out in the observation space is of $\mathbf{\Lambda}$, which is just $\mathcal{O}(M)$ (invert the entries on the diagonal). But what about models where the “source space” is larger than the “emission space”? What if the emission covariance is not diagonal? And how can we limit the total number of inversions when the inference step has to be applied repeatedly?
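Before taking up those questions, here is a minimal sketch of the factor-analyzer posterior, Eq. 2.22, in Python/NumPy (the function name is invented). Because $\mathbf{\Lambda}$ is diagonal, "inverting" it amounts to an elementwise division; the only genuine inversion is of a $K \times K$ matrix in the source space.

\begin{verbatim}
import numpy as np

def fa_posterior(y, C, c, lam):
    """Posterior cumulants for the factor analyzer, Eq. 2.22.

    `lam` holds the diagonal of Lambda, so Lambda^{-1} is an elementwise
    reciprocal (O(M)) rather than a full matrix inversion."""
    CtLinv = C.T / lam                     # C^T Lambda^{-1}, columnwise division
    Sigma_post = np.linalg.inv(CtLinv @ C + np.eye(C.shape[1]))  # (C^T L^{-1} C + I)^{-1}
    mu_post = Sigma_post @ (CtLinv @ (y - c))
    return mu_post, Sigma_post
\end{verbatim}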

Alternative expressions for the posterior cumulants.

To change the space in which the posterior covariance is computed, we can apply the Woodbury matrix-inversion lemma, Eq. B.14, to Eq. 2.14:

\begin{equation}
\begin{split}
\mathbf{\Sigma}_{\hat{x}|\hat{y}}
&= \left(\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} + \mathbf{\Sigma}^{-1}_{\hat{x}}\right)^{-1}\\
&= \mathbf{\Sigma}_{\hat{x}} - \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\left(\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)^{-1}\mathbf{C}\mathbf{\Sigma}_{\hat{x}}.
\end{split}
\tag{2.23}
\end{equation}

This moves the inversion from source ($\bm{\hat{X}}$) to emission ($\bm{\hat{Y}}$) space. (Notice, incidentally, that the inverted quantity is precisely the marginal covariance, $\mathbf{\Sigma}_{\hat{y}}$.)

The expression for the posterior mean, Eq. 2.13, likewise invokes an inversion in source space ($\mathbf{\Sigma}^{-1}_{\hat{x}}$), as well as one in the emission space ($\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}$). In factor analysis, the first inversion was not required because the prior mean, $\bm{\mu}_{\hat{x}}$, was assumed to be zero. Here we eliminate it by substituting the “Woodburied” posterior covariance, Eq. 2.23, into Eq. 2.13. We also make use of the definition of the emission precision, Eq. 2.15:

\begin{equation}
\begin{split}
\bm{\mu}_{\hat{x}|\hat{y}}
&= \mathbf{\Sigma}_{\hat{x}|\hat{y}}\left[\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\left(\bm{\hat{y}} - \bm{c}\right) + \mathbf{\Sigma}^{-1}_{\hat{x}}\bm{\mu}_{\hat{x}}\right]\\
&= \mathbf{K}\left(\bm{\hat{y}} - \bm{c}\right) + \mathbf{\Sigma}_{\hat{x}|\hat{y}}\left(\mathbf{\Sigma}^{-1}_{\hat{x}}\right)\bm{\mu}_{\hat{x}}\\
&= \mathbf{K}\left(\bm{\hat{y}} - \bm{c}\right) + \mathbf{\Sigma}_{\hat{x}|\hat{y}}\left(\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} - \mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} + \mathbf{\Sigma}^{-1}_{\hat{x}}\right)\bm{\mu}_{\hat{x}}\\
&= \mathbf{K}\left(\bm{\hat{y}} - \bm{c}\right) - \mathbf{K}\mathbf{C}\bm{\mu}_{\hat{x}} + \mathbf{\Sigma}_{\hat{x}|\hat{y}}\left(\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\mathbf{C} + \mathbf{\Sigma}^{-1}_{\hat{x}}\right)\bm{\mu}_{\hat{x}}\\
&= \mathbf{K}\left(\bm{\hat{y}} - \bm{c}\right) - \mathbf{K}\mathbf{C}\bm{\mu}_{\hat{x}} + \bm{\mu}_{\hat{x}}\\
&= \bm{\mu}_{\hat{x}} + \mathbf{K}\left[\bm{\hat{y}} - \left(\mathbf{C}\bm{\mu}_{\hat{x}} + \bm{c}\right)\right].
\end{split}
\tag{2.24}
\end{equation}

Intuitively, the posterior mean is the prior mean, adjusted by a “correction factor.” The correction measures the difference between the emission as observed ($\bm{\hat{y}}$) and as predicted by the prior mean ($\mathbf{C}\bm{\mu}_{\hat{x}} + \bm{c}$), weighted by the precision of the emission ($\mathbf{K}$).

Despite appearances, matrix inversions are still required in Eq. 2.24; they occur in the calculation of $\mathbf{K}$. Consulting its definition, Eq. 2.15, we see one inversion of the emission covariance, and another in the computation of the posterior covariance, albeit in the space of emissions via Eq. 2.23. With a little work, one of these inversions can be eliminated:

\begin{equation}
\begin{split}
\mathbf{K} = \mathbf{\Sigma}_{\hat{x}|\hat{y}}\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}
&= \left[\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} - \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\left(\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)^{-1}\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\right]\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\\
&= \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\left[\mathbf{I} - \left(\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)^{-1}\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\right]\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\\
&= \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\left(\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)^{-1}\left[\left(\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}\right) - \mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\right]\mathbf{\Sigma}^{-1}_{\hat{y}|\hat{x}}\\
&= \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\left(\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)^{-1}.
\end{split}
\tag{2.25}
\end{equation}

Thus in fact only one matrix need be inverted in computing the entire posterior distribution, namely $\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}}$—which, by Eq. 2.17, is just the marginal covariance, $\mathbf{\Sigma}_{\hat{y}}$. So we can also write

\[
\mathbf{K} = \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}}.
\]

To make the complete calculation more perspicuous, we collect here the marginal cumulants from Eqs. 2.16 and 2.17:

\begin{equation}
\begin{split}
\bm{\mu}_{\hat{y}} &= \mathbf{C}\bm{\mu}_{\hat{x}} + \bm{c},\\
\mathbf{\Sigma}_{\hat{y}} &= \mathbf{C}\mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}} + \mathbf{\Sigma}_{\hat{y}|\hat{x}},
\end{split}
\tag{2.26}
\end{equation}

along with this alternative (“Woodburied”) computation of the posterior cumulants:

\begin{equation}
\begin{split}
\mathbf{K} &= \mathbf{\Sigma}_{\hat{x}}\mathbf{C}^{\text{T}}\mathbf{\Sigma}^{-1}_{\hat{y}}\\
\bm{\mu}_{\hat{x}|\hat{y}} &= \bm{\mu}_{\hat{x}} + \mathbf{K}\left(\bm{\hat{y}} - \bm{\mu}_{\hat{y}}\right)\\
\mathbf{\Sigma}_{\hat{x}|\hat{y}} &= \mathbf{\Sigma}_{\hat{x}} - \mathbf{K}\mathbf{C}\mathbf{\Sigma}_{\hat{x}}\\
&= \mathbf{\Sigma}_{\hat{x}} - \mathbf{K}\mathbf{\Sigma}_{\hat{y}}\mathbf{K}^{\text{T}},
\end{split}
\tag{2.27}
\end{equation}

where on the last line we have rewritten the posterior covariance yet again using the expression for the emission precision.
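As a sketch of the "Woodburied" computation, Eqs. 2.26 and 2.27 can be implemented with a single inversion, of the marginal covariance; the Python/NumPy fragment below (function name invented) should agree, up to round-off, with the source-space computation of Eqs. 2.13 and 2.14 sketched earlier.

\begin{verbatim}
import numpy as np

def gaussian_posterior_woodbury(y, mu_x, Sigma_x, C, c, Sigma_y_x):
    """Posterior cumulants via Eqs. 2.26-2.27: one inversion, in emission space."""
    mu_y = C @ mu_x + c                              # marginal mean (Eq. 2.26)
    Sigma_y = C @ Sigma_x @ C.T + Sigma_y_x          # marginal covariance (Eq. 2.26)
    K = Sigma_x @ C.T @ np.linalg.inv(Sigma_y)       # emission "gain" (Eq. 2.27)
    mu_post = mu_x + K @ (y - mu_y)
    Sigma_post = Sigma_x - K @ C @ Sigma_x
    return mu_post, Sigma_post
\end{verbatim}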

2.1.3 Sparse coding

To motivate “sparse” codes, we consider more complicated marginal distributions of data, focusing in particular on the distribution of natural images, for which mixture models and factor analyzers are inadequate models. To see this, we (following Hinton and Ghahramani [15]) cast these two model types as poles of a continuum of models, from sparsest to densest. As we saw above, one possible formulation of mixture models interprets the “source” variable, $\bm{\hat{X}}$, as a one-hot vector of length $K$ (as opposed to a scalar with support in the set $\{1,\ldots,K\}$). We can even think of each element $\hat{X}_k$ of $\bm{\hat{X}}$ as a separate “unit” in a neural network. Indeed, the winner-take-all neural networks of the 1980s and ’90s [12, Chapter 9] can be thought of as implying (perhaps approximate) posterior distributions for generative models that are mixtures. From this perspective, then, the term “mixture” is potentially misleading: the marginal distribution of observations $\bm{\hat{Y}}$ is indeed a mixture, whence the name, but each individual example $\bm{\hat{y}}$ is not mixed. That is, each $\bm{\hat{y}}$ is generated by just one element of $\bm{\hat{x}}$, and the code for generating it is maximally sparse.

At the other extreme, in factor analyzers, all of the “hidden units” (elements of $\bm{\hat{X}}$) typically will have non-zero values, since they are independent and normally distributed. All elements of a sample $\bm{\hat{x}}$ therefore potentially contribute to a single emission $\bm{\hat{y}}$. This is a dense code.

To make matters even more concrete, consider a GMM whose emission covariance is fixed across classes, so that the activity of the source variables determines only the mean of the emission. Then the emissions for the GMM and the factor analyzer can be written identically, as a normal distribution about $\mathbf{C}\bm{\hat{X}}$ (cf. Eq. 2.5 and Eq. 2.20). That is, both models generate data as a “combination” of basis vectors, the columns of $\mathbf{C}$, but differ in how many columns are allowed to contribute to a single observation.

The statistics of natural scenes.

Both types of model are problematic for complex distributions, like that of natural images. Gaussian mixture models can indeed approximate any marginal density over $\bm{Y}$ (Eq. 2.10) to arbitrary accuracy [49]: intuitively, a distribution of any shape can be constructed by placing little blobs of probability mass at various locations. This is essential for the distribution of images, where the presence of (e.g.) lines makes the marginal distribution of (say) triplets of adjacent pixels inexplicable by their mean values or even pairwise correlations alone (Fig. LABEL:fig:pixelDistribution). But each such bit of structure must be “explained” with a separate basis vector! There is no way to explain a letter “X” as a superimposition of two lines; rather, the “X” and the two lines into which it might be decomposed would each require its own basis vector. More generally, every image (or image patch) that could not be expressed as a Gaussian disturbance of some other image would require another column in $\mathbf{C}$. Thus the ability to model structure comes at the price of a very large number of mixture components. It must be emphasized that we believe this price to be too high precisely because we believe the structure in images to be “componential,” explicable in terms of combinations of simpler basis elements.

A factor analyzer, in contrast, can avail itself of all basis vectors at once in attempting to explain or generate an observation, and consequently can in theory recognize an image in terms of its constituent parts. By the same token, it needs no more hidden units than there are pixels in the images: $\mathbf{C}$ can be square, i.e. a complete basis. However, the marginal distribution of observations under a factor analyzer is just another normal distribution (recall Eq. 2.21)—a fact that looked helpful from the perspective of calculating with the model, but is fatal for modeling natural images. This is a consequence of the higher-order structure in images alluded to above.

The existence of such higher-order structure as seen in Fig. LABEL:fig:pixelDistribution is not perhaps obvious a priori. After all, the size of the covariance matrix for the model marginal, $\hat{p}(\bm{\hat{y}};\bm{\theta})$, scales quadratically in the number of pixels—a lot of representational power to work with! For example, the covariance matrix of even modestly sized, (256$\times$256)-pixel, images has $256^2(256^2-1)/2 \approx 2\times 10^9$ independent entries. It would seem that we could represent any distribution with this many parameters. But in fact, by taking certain symmetries into account, we reduce the number of covariance entries to a linear function of the number of pixels. The statistics of natural images are, for the most part,\footnote{Many natural images, as seen by land-dwelling animals, have sky in the upper portion and ground below, which does introduce statistical regularities that depend on absolute position. Overrepresentation of vertical and (especially) horizontal lines would likewise introduce asymmetries.} stationary: the correlation between any pair of pixels depends on the distance between them but not on their absolute locations. In our example, there are only $256^2$ distances—from, say, the upper left pixel to any pixel in the image, including itself. So the covariance cannot encode so much information after all.

Covariance matrices arising from shift-invariant data will in general have this “symmetric Toeplitz” structure. More specifically, when the shift-invariant data are two-dimensional, like images, the covariance matrix will be block-Toeplitz with Toeplitz blocks [10]. Furthermore, if we assume that the covariance drops to zero at some distance within the image, and we ignore edge effects, the covariance matrix can be approximated as block-circulant with circulant blocks, in which case its eigenvectors are the (two-dimensional) discrete Fourier modes, and its eigenvalues constitute the power spectrum [10].\footnote{Since the eigenvectors of the covariance matrix are the principal components of the data (see also Section 7.3), this likewise implies that the Fourier modes are the principal components of shift-invariant data.} In natural images, the power spectrum appears to fall off with frequency $f$ at a rate of very nearly $1/f^2$ [44].
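
The diagonalization claim is easy to check numerically in one dimension. The sketch below (my own, not from the text) builds a circulant covariance from a correlation that depends only on circular distance, and verifies that the DFT matrix diagonalizes it, with eigenvalues given by the DFT of one row, i.e. the power spectrum. The exponential correlation and its length scale are arbitrary assumptions.

\begin{verbatim}
import numpy as np

n = 64
d = np.minimum(np.arange(n), n - np.arange(n))            # circular distance from pixel 0
row = np.exp(-d / 5.0)                                     # covariance depends only on distance
Sigma = np.array([np.roll(row, i) for i in range(n)])      # circulant covariance matrix

F = np.fft.fft(np.eye(n))                                  # (unnormalized) DFT matrix
D = (F @ Sigma @ F.conj()) / n                             # should come out diagonal
spectrum = np.fft.fft(row).real                            # eigenvalues = power spectrum

assert np.allclose(D, np.diag(np.diag(D)), atol=1e-8)      # off-diagonal entries vanish
assert np.allclose(np.diag(D).real, spectrum, atol=1e-8)   # diagonal matches the spectrum
\end{verbatim}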

Thus, natural images can be decorrelated by projection onto the Fourier modes, and then “whitened” by normalizing by the amplitude spectrum (the square root of the power spectrum). Yet it is easily seen in Fig. LABEL:fig:SimoncelliOlshausenFig5 that this process leaves most of the structure intact [44]. Or again, coming from the other direction, restoring the (approximately) $1/f^2$ power spectrum to white noise does not produce a natural image (Fig. LABEL:fig:SimoncelliOlshausenFig5). Factor analyzers thus have no means of capturing the structure in natural images that lies beyond the mean and the power spectrum.
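
A rough sketch of both operations, with the array `img` standing in for a natural-image patch (here it is just noise, purely as a placeholder): whitening divides each Fourier coefficient by its amplitude, while “pinking” imposes a $1/f$ amplitude (hence roughly $1/f^2$ power) spectrum on white noise.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal((256, 256))     # placeholder for a natural-image patch

# Whiten: divide each Fourier coefficient by its amplitude (square root of its power).
F = np.fft.fft2(img)
whitened = np.fft.ifft2(F / (np.abs(F) + 1e-12)).real     # flat power spectrum

# Conversely, impose a ~1/f amplitude (~1/f^2 power) spectrum on white noise.
fy = np.fft.fftfreq(256)[:, None]
fx = np.fft.fftfreq(256)[None, :]
f = np.sqrt(fx**2 + fy**2)
f[0, 0] = 1.0                                             # leave the DC component alone
pink = np.fft.ifft2(np.fft.fft2(rng.standard_normal((256, 256))) / f).real
# 'pink' matches natural images up to second order, but still looks nothing like one.
\end{verbatim}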

Moderately sparse generative models.

We have established, then, that neither factor analyzers nor GMMs are good models for natural images, albeit for roughly opposite reasons. Furthermore, the inadequacy of the multivariate Gaussian as a marginal distribution for natural images is relevant to a wider range of models than the factor analyzer. When the source variables are densely, even though non-normally, distributed, the model marginal $\hat{p}(\bm{\hat{y}};\bm{\theta})$ may very well end up approximately normal anyway, for central-limit-like reasons: the marginal distribution is an average of many independent, “random” basis vectors. Averaging can wash out higher-order structure in these vectors. This is (I take it) another way of expressing Hinton’s conjecture that the most useful codes will be componential, in order to represent the properties—pose, deformation, etc.—of the objects in the image; but also non-dense, in order to accommodate multiple objects in the same image [15].

Between the “extremes” of the GMM and the factor analyzer lie models in which the “source” variables are sparsely distributed: some small subset are “on” for any given image. For example, they could be distributed according to a product over Bernoulli distributions with small means. Or again, we could interpret “on” more liberally and assign fat-tailed (leptokurtotic) distributions to the source variables, so that they usually take on values close to zero but take on extreme values more often than normally distributed variables. Common choices are the Laplace and Cauchy distributions [35, 29]. In this case it is convenient to write the prior distribution in the form of a product of Boltzmann distributions,

\begin{equation}
\hat{p}(\bm{\hat{x}};\bm{\theta})
= \frac{1}{Z(\bm{\theta})}\prod_{k=1}^{K}\exp\!\left\{-E_k(\hat{x}_k,\bm{\theta})\right\},
\qquad
\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})
= \mathcal{N}\!\left(\mathbf{C}\bm{\hat{x}},\ \tfrac{1}{\lambda}\mathbf{I}\right),
\tag{2.28}
\end{equation}

with the energy functions $E_k$ chosen so as to enforce some form of sparseness. For example,

\begin{equation}
E_k(\hat{x}_k) :=
\begin{cases}
\log\!\left(\beta_k^2 + \hat{x}_k^2\right) & \text{Cauchy distribution}\\[2pt]
\alpha_k\lvert\hat{x}_k\rvert & \text{Laplace distribution}\\[2pt]
\dfrac{\alpha_k}{\beta}\log\!\left(\cosh(\beta\hat{x}_k)\right) & \text{hyperbolic-cosine distribution}.
\end{cases}
\tag{2.29}
\end{equation}

The last is not a standard distribution, but for large $\beta$ it approximates the Laplace distribution while retaining differentiability at 0 [30]; see Fig. LABEL:fig:sparseDistributions. It will serve as our default example in the sparse-coding model that follows. The emissions, for their part, are once again normally distributed about a linear function of the source variables, and for simplicity we have given them identical variance, $1/\lambda$. Eq. 2.28 describes the complete sparse-coding model, a graphical model that is shown in Fig. 2.4A.
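
As a concrete (and entirely hypothetical) instance of Eq. 2.28, the following sketch samples from a sparse-coding model with a Laplace prior and an overcomplete basis, and numerically checks that the hyperbolic-cosine energy approaches the Laplace energy for large $\beta$. All sizes and parameter values are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
K, M, N = 128, 64, 1000                      # twice-overcomplete: K = 2M source variables
C = rng.standard_normal((M, K)) / np.sqrt(K) # basis vectors (columns of C)
lam = 100.0                                  # emission precision; noise variance is 1/lam

x = rng.laplace(scale=1.0, size=(N, K))      # sparse (heavy-tailed) source samples
y = x @ C.T + rng.standard_normal((N, M)) / np.sqrt(lam)

# For large beta, the hyperbolic-cosine energy approximates the Laplace energy alpha*|x|.
alpha, beta = 1.0, 50.0
t = np.linspace(-3.0, 3.0, 7)
assert np.allclose((alpha / beta) * np.log(np.cosh(beta * t)), alpha * np.abs(t), atol=0.02)
\end{verbatim}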

[Graphical model: source node $\bm{\hat{X}}$ with $\hat{p}(\bm{\hat{x}};\bm{\theta})=\frac{1}{Z}\prod_{k=1}^{K}\exp\{-E(\hat{x}_k)\}$; emission node $\bm{\hat{Y}}$ with $\hat{p}(\bm{\hat{y}}\mid\bm{\hat{x}};\bm{\theta})=\mathcal{N}(\mathbf{C}\bm{\hat{x}},\ \frac{1}{\lambda}\mathbf{I})$; plate over $N$ observations.]
(A)
[Unit-level graph: source units $\hat{X}_1,\ldots,\hat{X}_5$ all connected to emission units $\hat{Y}_1,\ldots,\hat{Y}_3$.]
(B)
Figure 2.4: A sparse-coding model. (A) The most abstract version of the model (cf. the GMM in Fig. 2.1 and the factor analyzer in Fig. 2.3A). The energy function $E$ enforces sparseness; e.g., $E(\hat{x}_k)=\alpha_k\lvert\hat{x}_k\rvert$, or $E(\hat{x}_k)=\log(\beta^2+\hat{x}_k^2)$. (B) The independence statements and overcompleteness made explicit.

Since we are aiming at economy of expression (few descriptors per image), we must concede prodigality in vocabulary (many descriptors) [39]—i.e., an overcomplete set of basis vectors. This is made explicit in Fig. 2.4B, along with the independence statements implied by our choices of prior and emission distributions. A factor of 2 overcompleteness is common [29].

Inference and marginalization in a sparse-coding model: Laplace’s method.

There is a reason that factor analyzers and GMMs remain popular despite their limitations: the model posterior and marginal distributions are computable in closed form. This computability is due to the prior distributions being conjugate (or pseudo-conjugate) to the emission likelihoods. When they are not, the marginal and posterior distributions can typically be computed only approximately. When the prior is Laplace, for example, it is not clear that the posterior can be computed in closed form (although see [38]).

Perhaps the simplest approach is to make a Taylor approximation.\footnote{Another, more powerful class of techniques for approximate inference is based on (iterative) optimization of the posterior. However, this endows it with the character of learning, and accordingly we shall defer discussing these techniques until Chapter 6.} In particular, consider approximating the energy, $E(\bm{\hat{x}})$, of a distribution near its mode, $\bm{\hat{x}}^0$, with the first three terms of the Taylor series. It is easy to see that, for $\bm{\hat{x}}^0$ to be a mode of the distribution, it must satisfy the following conditions:

\begin{equation*}
\frac{\partial E}{\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0) = \bm{0},
\qquad
\frac{\partial^2 E}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0) \succ 0;
\end{equation*}

i.e., at the mode, the energy’s gradient must be zero and its Hessian must be positive definite. For a generic energy, the second-order Taylor approximation at the mode is therefore

\begin{equation}
\begin{split}
E(\bm{\hat{x}}) &\approx E(\bm{\hat{x}}^0)
+ \frac{\partial E}{\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0)\,(\bm{\hat{x}}-\bm{\hat{x}}^0)
+ \frac{1}{2}(\bm{\hat{x}}-\bm{\hat{x}}^0)^{\text{T}}\,
\frac{\partial^2 E}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0)\,
(\bm{\hat{x}}-\bm{\hat{x}}^0)\\
&= E(\bm{\hat{x}}^0)
+ \frac{1}{2}(\bm{\hat{x}}-\bm{\hat{x}}^0)^{\text{T}}\,
\frac{\partial^2 E}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0)\,
(\bm{\hat{x}}-\bm{\hat{x}}^0).
\end{split}
\tag{2.30}
\end{equation}

The only remaining $\bm{\hat{x}}$-dependent term should look familiar: it is the energy of a normal distribution with mean at the mode and inverse covariance equal to the Hessian of the energy (evaluated at the mode). The first term, $E(\bm{\hat{x}}^0)$, on the other hand, is constant in the argument $\bm{\hat{x}}$ and can therefore be absorbed into the normalizer. Hence we can approximate the posterior as

\begin{equation}
\hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})
= \frac{1}{Z}\exp\!\left\{-E_{\text{post}}(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})\right\}
\approx \mathcal{N}\!\left(\bm{\hat{x}}^0,\
\left(\frac{\partial^2 E_{\text{post}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0,\bm{\hat{y}})\right)^{-1}\right).
\tag{2.31}
\end{equation}

Note that this approximation is valid at realizations of $\bm{\hat{X}}$ even “moderately” close to $\bm{\hat{x}}^0$, since the neglected higher-order terms need only be less than unity in order for the exponentiation to drive their contribution to the distribution down to much smaller values. On the other hand, it is obviously inappropriate for distributions having multiple modes, or over discrete random variables.

So much for approximating the posterior; we now move on to the model marginal. Using the definition of conditional probability with the approximate posterior distribution, Eq. 2.31, and then evaluating at the mode, we see that

\begin{equation}
\begin{split}
\hat{p}(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta})
&= \hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})\,\hat{p}(\bm{\hat{y}};\bm{\theta})\\
&\approx \mathcal{N}\!\left(\bm{\hat{x}}^0,\
\left(\frac{\partial^2 E_{\text{post}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0,\bm{\hat{y}})\right)^{-1}\right)
\hat{p}(\bm{\hat{y}};\bm{\theta})\\
\implies \hat{p}(\bm{\hat{x}}^0,\bm{\hat{y}};\bm{\theta})
&\approx \tau^{-k/2}
\left\lvert\frac{\partial^2 E_{\text{post}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0,\bm{\hat{y}})\right\rvert^{1/2}
\hat{p}(\bm{\hat{y}};\bm{\theta})\\
\implies \hat{p}(\bm{\hat{y}};\bm{\theta})
&\approx \hat{p}(\bm{\hat{x}}^0,\bm{\hat{y}};\bm{\theta})\,\tau^{k/2}
\left\lvert\frac{\partial^2 E_{\text{post}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}}^0,\bm{\hat{y}})\right\rvert^{-1/2}.
\end{split}
\tag{2.32}
\end{equation}

This is the approximate model marginal under Laplace’s method.
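
In one dimension, Laplace’s method can be checked directly against numerical integration. The sketch below (an illustration of the idea, not the text’s procedure) uses a hyperbolic-cosine prior energy plus a quadratic emission term, finds the mode with a generic optimizer, and compares the Laplace estimate of the normalizer with quadrature; all parameter values are arbitrary, and a numerically stable form of $\log\cosh$ is assumed.

\begin{verbatim}
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# A 1-D "posterior" energy: hyperbolic-cosine prior term plus quadratic emission term.
alpha, beta, lam, c, y = 1.0, 10.0, 4.0, 1.0, 2.0
logcosh = lambda t: np.abs(t) + np.log1p(np.exp(-2 * np.abs(t))) - np.log(2)  # stable log(cosh)
E = lambda x: (alpha / beta) * logcosh(beta * x) + 0.5 * lam * (y - c * x) ** 2
d2E = lambda x: alpha * beta / np.cosh(beta * x) ** 2 + lam * c ** 2          # energy curvature

x0 = minimize_scalar(E).x                                      # mode of the distribution
Z_laplace = np.exp(-E(x0)) * np.sqrt(2 * np.pi / d2E(x0))      # Laplace's method
Z_exact, _ = quad(lambda u: np.exp(-E(u)), -np.inf, np.inf)    # numerical "ground truth"
print(Z_laplace, Z_exact)                                      # the two should agree closely
\end{verbatim}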

Let us specialize to an isotropic Gaussian emission and a factorial prior distribution, Eq. 2.28. Start with the Hessian:

\begin{equation*}
\begin{split}
\frac{\partial^2 E_{\text{post}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})
&= \frac{\partial^2 E_{\text{joint}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})\\
&= \frac{\partial^2}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}
\left(\sum_{k=1}^{K} E_k(\hat{x}_k,\bm{\theta})
+ \frac{\lambda}{2}\left\lVert\bm{\hat{y}}-\mathbf{C}\bm{\hat{x}}\right\rVert^2\right)\\
&= \text{diag}\!\left(\left[\frac{\partial^2 E_1}{\partial\hat{x}_1^2}(\hat{x}_1,\bm{\theta}),\ \cdots,\ \frac{\partial^2 E_K}{\partial\hat{x}_K^2}(\hat{x}_K,\bm{\theta})\right]^{\text{T}}\right)
+ \lambda\mathbf{C}^{\text{T}}\mathbf{C},
\end{split}
\end{equation*}

where $\text{diag}\!\left(\bm{a}\right)$ is a diagonal matrix with the vector $\bm{a}$ as its diagonal. For concreteness, let us specialize further to Laplace-distributed source variables, as in Eq. 2.29. Unfortunately, the curvature of this energy function is undefined at the origin, and (equally bad) zero everywhere else. Lewicki and colleagues [29, 31] therefore suggest approximating this energy—in the Hessian only—with the hyperbolic-cosine energy (see again Eq. 2.29), which approaches the Laplace energy in the limit of large $\beta$. The second derivative of this energy is

\begin{equation*}
\frac{\partial^2 E_k}{\partial\hat{x}_k^2}(\hat{x}_k,\bm{\theta})
= \frac{\partial^2}{\partial\hat{x}_k^2}\,\frac{\alpha_k}{\beta}\log\!\left(\cosh(\beta\hat{x}_k)\right)
= \alpha_k\frac{\partial}{\partial\hat{x}_k}\tanh(\beta\hat{x}_k)
= \alpha_k\beta\operatorname{sech}^2(\beta\hat{x}_k).
\end{equation*}

Hence, from Eq. 2.31, an approximate posterior distribution for the sparse-coding model is

\begin{equation}
\hat{p}(\bm{\hat{x}}\mid\bm{\hat{y}};\bm{\theta})
\approx \mathcal{N}\!\left(\bm{\hat{x}}^0,\
\left(\lambda\mathbf{C}^{\text{T}}\mathbf{C}
+ \text{diag}\!\left(\bm{\alpha}\circ\beta\operatorname{sech}^2(\beta\bm{\hat{x}}^0)\right)\right)^{-1}\right),
\tag{2.33}
\end{equation}

where $\circ$ is the element-wise product and $\operatorname{sech}(\cdot)$ is supposed to act element-wise. Notice that although this may appear to be independent of the observations $\bm{y}$, they enter through the mode, which obviously varies as a function of $\bm{y}$.
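
A minimal NumPy/SciPy sketch of Eq. 2.33 (not from the text), with the mode found by a generic optimizer (the iterative schemes deferred to Chapter 6 would do this more carefully). The problem sizes and parameters are arbitrary, and the smooth log-cosh form is used throughout for numerical stability.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
K, M = 8, 4
C = rng.standard_normal((M, K)) / np.sqrt(K)
lam, alpha, beta = 10.0, np.ones(K), 20.0
y = rng.standard_normal(M)                                 # a made-up observation

logcosh = lambda t: np.abs(t) + np.log1p(np.exp(-2 * np.abs(t))) - np.log(2)

def E_post(x):
    """Posterior energy (up to a constant): smoothed sparse prior plus emission term."""
    return np.sum((alpha / beta) * logcosh(beta * x)) + 0.5 * lam * np.sum((y - C @ x) ** 2)

x0 = minimize(E_post, np.zeros(K)).x                       # posterior mode (MAP code)

# Eq. 2.33: Gaussian approximation with mean x0 and the inverse covariance below.
precision = lam * C.T @ C + np.diag(alpha * beta / np.cosh(beta * x0) ** 2)
covariance = np.linalg.inv(precision)
\end{verbatim}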

To specialize the marginal distribution to the Laplace prior, we use Eq. 2.32, substituting the Laplace energies into the joint, but the hyperbolic-cosine energies into the Hessian (to maintain differentiability):

\begin{equation}
\hat{p}(\bm{\hat{y}};\bm{\theta})
\approx \frac{\tau^{k/2}}{Z_{\text{joint}}}
\exp\!\left\{-\bm{\alpha}^{\text{T}}\lvert\bm{\hat{x}}^0\rvert
- \frac{\lambda}{2}\left\lVert\bm{\hat{y}}-\mathbf{C}\bm{\hat{x}}^0\right\rVert^2\right\}
\left\lvert\lambda\mathbf{C}^{\text{T}}\mathbf{C}
+ \text{diag}\!\left(\bm{\alpha}\circ\beta\operatorname{sech}^2(\beta\bm{\hat{x}}^0)\right)\right\rvert^{-1/2}.
\tag{2.34}
\end{equation}
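
Finally, a self-contained sketch of Eq. 2.34 (again my own illustration). The text’s $Z_{\text{joint}}$ absorbs the prior and emission normalizers, so the sketch writes them out explicitly: the Laplace prior contributes $\alpha_k/2$ per coordinate and the isotropic Gaussian emission contributes $(\lambda/2\pi)^{M/2}$, with $\tau=2\pi$. As above, the mode is found with a generic optimizer, and all parameter values are arbitrary.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def log_marginal(y, C, alpha, beta, lam):
    """Approximate log p(y) under the sparse-coding model, via Eq. 2.34 (tau = 2*pi)."""
    M, K = C.shape
    logcosh = lambda t: np.abs(t) + np.log1p(np.exp(-2 * np.abs(t))) - np.log(2)
    # MAP source vector under the (smoothed) Laplace prior.
    E = lambda x: np.sum((alpha / beta) * logcosh(beta * x)) + 0.5 * lam * np.sum((y - C @ x) ** 2)
    x0 = minimize(E, np.zeros(K)).x
    # Hessian of the posterior energy at the mode (hyperbolic-cosine curvature).
    H = lam * C.T @ C + np.diag(alpha * beta / np.cosh(beta * x0) ** 2)
    # Log of the normalized joint at the mode: Laplace prior times isotropic Gaussian emission.
    log_joint = (np.sum(np.log(alpha / 2.0)) - alpha @ np.abs(x0)
                 + 0.5 * M * np.log(lam / (2 * np.pi)) - 0.5 * lam * np.sum((y - C @ x0) ** 2))
    return log_joint + 0.5 * K * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(H)[1]

rng = np.random.default_rng(4)
C = rng.standard_normal((4, 8)) / np.sqrt(8)
print(log_marginal(rng.standard_normal(4), C, alpha=np.ones(8), beta=20.0, lam=10.0))
\end{verbatim}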