8.1 Gaussian recognition distributions and sparse coding

For continuous random latent variables, the simplest choice is perhaps the (multivariate) normal distribution:

\check{p}\left(\bm{\check{x}}\middle|\bm{y};\bm{\phi}\right)=\mathcal{N}\left(\bm{\nu}_{\bm{y}},\>\mathbf{P}^{-1}_{\bm{y}}\right).

Note that we are allowing the mean and precision to depend on the observations, although we have not as yet specified how. The symbol \bm{\phi} stands for all the currently unspecified parameters of the recognition distributions.

To derive an EM algorithm under this recognition model, we start as usual with the joint relative entropy. This consists of (among other things) the entropy of the recognition model, which under our current assumptions is just the entropy of a multivariate normal distribution, averaged under the data:

\begin{split}
\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[-\log\check{p}\left(\bm{\check{X}}\middle|\bm{Y};\bm{\phi}\right)\right]
&=\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[-\frac{1}{2}\log\left\lvert\mathbf{P}_{\bm{Y}}\right\rvert+\frac{K}{2}\log\tau+\frac{1}{2}\left(\bm{\check{X}}-\bm{\nu}_{\bm{Y}}\right)^{\text{T}}\mathbf{P}_{\bm{Y}}\left(\bm{\check{X}}-\bm{\nu}_{\bm{Y}}\right)\right]\\
&=\frac{1}{2}\left(K\log\tau+K-\mathbb{E}_{\bm{Y}}\left[\log\left\lvert\mathbf{P}_{\bm{Y}}\right\rvert\right]\right).
\end{split}
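As a quick sanity check, the closing expression can be compared against a Monte Carlo estimate of the averaged negative log recognition density. A minimal sketch in Python (assuming that τ here denotes 2π, and using arbitrary illustrative values for a single datum's recognition parameters):

import numpy as np

rng = np.random.default_rng(0)
K = 3
tau = 2 * np.pi                      # assumption: tau denotes 2*pi

# recognition parameters for one datum (illustrative values)
nu = rng.standard_normal(K)
A = rng.standard_normal((K, K))
P = A @ A.T + K * np.eye(K)          # precision matrix (positive definite)
Sigma = np.linalg.inv(P)             # covariance

# Monte Carlo estimate of E[-log p_check(X | y)]
X = rng.multivariate_normal(nu, Sigma, size=200_000)
diff = X - nu
quad = np.einsum('ni,ij,nj->n', diff, P, diff)
neg_log_p = 0.5 * (K * np.log(tau) - np.log(np.linalg.det(P)) + quad)
print(neg_log_p.mean())

# closed form: (K log tau + K - log|P|) / 2
print(0.5 * (K * np.log(tau) + K - np.log(np.linalg.det(P))))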

So the joint relative entropy is

\begin{split}
\mathcal{L}_{\text{JRE}}&=\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[-\log\hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right)\right]-\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[-\log\check{p}\left(\bm{\check{X}}\middle|\bm{Y};\bm{\phi}\right)\right]-\mathbb{E}_{\bm{Y}}\left[-\log p\left(\bm{Y}\right)\right]\\
&=\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[-\log\hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right)\right]+\mathbb{E}_{\bm{Y}}\left[\frac{1}{2}\log\left\lvert\mathbf{P}_{\bm{Y}}\right\rvert\right]+c.
\end{split}    (8.1)

Local parameters

Now let us assume that we will learn (in the E step) one pair of parameters for each observation, \bm{y}; that is,

\bm{\phi}=\left\{\bm{\nu}_{\bm{y}_{1}},\mathbf{P}_{\bm{y}_{1}},\ldots,\bm{\nu}_{\bm{y}_{N}},\mathbf{P}_{\bm{y}_{N}}\right\}.    (8.2)

The advantage of this approach is flexibility: the mean vector and precision matrix for each datum can take on arbitrary values. The (potential) disadvantage is inefficiency: Unless the parameters can be derived in closed form, this approach will not scale well to a large number of observations. Indeed, the recognition distribution for a new (or held-out) observation \bm{y}_{N+1} cannot be fit any faster, or predicted, on the basis of the previous observations. We shall explore alternatives shortly.

The E step.

It is instructive to begin the E-step optimization, even without having specified a generative model. Replacing the expectation over the data in Eq. 8.1 with a sample average and differentiating with respect to the recognition parameters (\bm{\nu}_{\bm{y}}, \mathbf{P}_{\bm{y}}) associated with a single datum (\bm{y}) clearly eliminates all the terms that depend on the parameters for other data. So we can drop the expectation over the data altogether. The gradient with respect to a recognition mean vector is then [36]

\begin{split}
\frac{\partial\mathcal{L}_{\text{JRE}}}{\partial\bm{\nu}_{\bm{y}}}&=\frac{\mathrm{d}}{\mathrm{d}\bm{\nu}_{\bm{y}}}\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[-\log\hat{p}\left(\bm{\check{X}},\bm{y};\bm{\theta}\right)\middle|\bm{y}\right]\\
&=\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[-\frac{\partial\log\hat{p}}{\partial\bm{\hat{x}}}\left(\bm{\check{X}},\bm{y};\bm{\theta}\right)\middle|\bm{y}\right]\\
&=\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[\frac{\partial E_{\text{joint}}}{\partial\bm{\hat{x}}}\left(\bm{\check{X}},\bm{y},\bm{\theta}\right)\middle|\bm{y}\right]\stackrel{\text{set}}{=}0.
\end{split}    (8.3)

The second line can be derived by exploiting the symmetry of \bm{\nu}_{\bm{y}} and \bm{\hat{X}} in a Gaussian function, and then integrating by parts (see Exercise LABEL:ex:derivativesOfGaussianExpectations). The energy consists of the \bm{\hat{x}}-dependent terms in -\log\hat{p}, hence the third line. Similarly, the gradient with respect to a recognition covariance matrix is

\begin{split}
\frac{\partial\mathcal{L}_{\text{JRE}}}{\partial\mathbf{P}^{-1}_{\bm{y}}}&=\frac{\mathrm{d}}{\mathrm{d}\mathbf{P}^{-1}_{\bm{y}}}\left(\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[-\log\hat{p}\left(\bm{\check{X}},\bm{y};\bm{\theta}\right)\middle|\bm{y}\right]-\frac{1}{2}\log\left\lvert\mathbf{P}^{-1}_{\bm{y}}\right\rvert\right)\stackrel{\text{set}}{=}0\\
\implies\mathbf{P}_{\bm{y}}&=2\frac{\mathrm{d}}{\mathrm{d}\mathbf{P}^{-1}_{\bm{y}}}\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[-\log\hat{p}\left(\bm{\check{X}},\bm{y};\bm{\theta}\right)\middle|\bm{y}\right]\\
&=\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[-\frac{\partial^{2}\log\hat{p}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}\left(\bm{\check{X}},\bm{y};\bm{\theta}\right)\middle|\bm{y}\right]\\
&=\mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[\frac{\partial^{2}E_{\text{joint}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}\left(\bm{\check{X}},\bm{y},\bm{\theta}\right)\middle|\bm{y}\right].
\end{split}    (8.4)

The third line can be derived along similar lines to the identity for gradients with respect to the mean, although the integration by parts must be done twice (see Exercise LABEL:ex:derivativesOfGaussianExpectations) [36].
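The identities used in the last steps of Eqs. 8.3 and 8.4 (the gradient of a Gaussian expectation with respect to the mean equals the expected gradient; twice its derivative with respect to the covariance equals the expected Hessian) can be spot-checked numerically. A one-dimensional sketch, with an arbitrary non-quadratic test function standing in for the negative log joint:

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.log(np.cosh(x))        # arbitrary non-quadratic "energy"
df = lambda x: np.tanh(x)
d2f = lambda x: 1.0 / np.cosh(x) ** 2

nu, sigma2, eps = 0.7, 0.5, 1e-3
z = rng.standard_normal(2_000_000)      # common random numbers for all estimates
x = nu + np.sqrt(sigma2) * z

# d/dnu E[f(X)]  vs  E[f'(X)]
dmean = (f(nu + eps + np.sqrt(sigma2) * z).mean()
         - f(nu - eps + np.sqrt(sigma2) * z).mean()) / (2 * eps)
print(dmean, df(x).mean())

# 2 d/dSigma E[f(X)]  vs  E[f''(X)]   (scalar case)
dcov = (f(nu + np.sqrt(sigma2 + eps) * z).mean()
        - f(nu + np.sqrt(sigma2 - eps) * z).mean()) / (2 * eps)
print(2 * dcov, d2f(x).mean())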

We have written the optimizations for the mean and precision in the form of Eqs. 8.3 and 8.4 in order to highlight the resemblance to the Laplace approximation encountered in Chapter 2 (see especially Eq. 2.31 [36]). Note that the joint (here) and posterior (there) energies are identical up to terms constant in \bm{\hat{x}}. Therefore, Eqs. 8.3 and 8.4 become the Laplace approximation whenever the expectations commute with the energy-gradient and energy-Hessian functions of \bm{\hat{x}}. This happens precisely when the order of the energy in \bm{\hat{x}} is quadratic or less.

8.1.1 Independent sources and Gaussian emissions: sparse coding and other models

To make the procedure more concrete, let us make a few further assumptions. Although this is by no means obligatory, we let the emissions be Gaussians, as they have been in all preceding examples. For simplicity, we even let the noise be isotropic. The “source” random variables, on the other hand, we shall assume to be independent, but otherwise distributed generically:

\hat{p}\left(\bm{\hat{x}};\bm{\theta}\right)=\frac{1}{Z(\bm{\theta})}\prod_{k=1}^{K}\exp\left\{-E_{k}(\hat{x}_{k},\bm{\theta})\right\},
\qquad
\hat{p}\left(\bm{\hat{y}}\middle|\bm{\hat{x}};\bm{\theta}\right)=\mathcal{N}\left(\mathbf{C}\bm{\hat{x}},\>\frac{1}{\lambda}\mathbf{I}\right).    (8.5)

Among other models, Eq. 8.5 includes the sparse-coding models encountered in Section 2.1.3, in which each emission \bm{\hat{y}} is assembled from a modest number of basis vectors drawn from a large (overcomplete) dictionary (i.e., an emission matrix \mathbf{C} that is “fat”). To enforce sparsity, the energy functions for the source variables can be chosen to be those of leptokurtic (fat-tailed) distributions. For example, we considered the energy function of the Laplace distribution and a smooth approximation thereof:

\begin{split}
E_{k}(\hat{x}_{k})&:=\alpha_{k}\left\lvert\hat{x}_{k}\right\rvert\\
&\approx\frac{\alpha_{k}}{\beta}\log\left(\cosh(\beta\hat{x}_{k})\right)\>\>\>\text{for large }\beta.
\end{split}    (8.6)

This ensures that the source variables \hat{X}_{k} take on values close to zero more frequently than normal random variables with similar variance. Unfortunately, as we saw, such fat-tailed distributions are not conjugate to the likelihood of \bm{\hat{X}} specified by Eq. 8.5, i.e., the Gaussian emission density, so they require approximate inference.
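Before turning to inference, it may help to see the generative side of Eqs. 8.5 and 8.6 in code. The sketch below draws data from an overcomplete model with Laplace sources and isotropic Gaussian emissions; the dimensions and parameter values are arbitrary illustrations rather than anything prescribed by the text:

import numpy as np

rng = np.random.default_rng(0)
K, M = 64, 16                 # K sources, M-dimensional emissions (K > M: overcomplete)
lam = 25.0                    # emission precision; noise variance is 1/lam
alpha = np.full(K, 2.0)       # Laplace energy weights; source k has scale 1/alpha_k
N = 5000

C = rng.standard_normal((M, K)) / np.sqrt(K)     # a "fat" emission matrix

# sources: independent Laplace variables, p(x_k) proportional to exp(-alpha_k |x_k|)
X = rng.laplace(loc=0.0, scale=1.0 / alpha, size=(N, K))
# emissions: y = C x + isotropic Gaussian noise
Y = X @ C.T + rng.standard_normal((N, M)) / np.sqrt(lam)

# fat tails: excess kurtosis of a Laplace source is 3 (a Gaussian's is 0)
x0 = X[:, 0]
print(((x0 - x0.mean()) ** 4).mean() / x0.var() ** 2 - 3)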

Here we are considering Gaussian recognition distributions. For the sake of generality, however, we shall refrain at first from assuming any form for the energy functions, deriving our results for the more general class described by Eq. 8.5. Now, the cross entropy of this joint distribution is:

\begin{split}
\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[-\log\hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right)\right]
&=\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[\frac{K}{2}\log\lambda+\frac{\lambda}{2}\left\lVert\bm{Y}-\mathbf{C}\bm{\check{X}}\right\rVert^{2}+\log Z(\bm{\theta})+\sum_{k=1}^{K}E_{k}(\check{X}_{k},\bm{\theta})\right]\\
&=\frac{K}{2}\log\lambda+\log Z(\bm{\theta})+\frac{1}{2}\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[\lambda\left\lVert\bm{Y}-\mathbf{C}\bm{\check{X}}\right\rVert^{2}+2\sum_{k=1}^{K}E_{k}(\check{X}_{k},\bm{\theta})\right].
\end{split}    (8.7)

So far this follows directly from Eq. 8.5. The expectation of the quadratic term can be simplified further, however, by applying our assumption that the recognition distribution is normal, and using Eq. B.13 from the appendix:

\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[\lambda\left\lVert\bm{Y}-\mathbf{C}\bm{\check{X}}\right\rVert^{2}\right]=\lambda\,\mathbb{E}_{\bm{Y}}\left[\text{tr}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}\mathbf{C}^{\text{T}}\right]+\left\lVert\bm{Y}-\mathbf{C}\bm{\nu}_{\bm{y}}\right\rVert^{2}\right].
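(This is just the standard expression for the expectation of a quadratic form under a Gaussian.) A Monte Carlo spot-check, for a single arbitrary datum y and arbitrary recognition parameters:

import numpy as np

rng = np.random.default_rng(2)
K, M, lam = 8, 5, 4.0
C = rng.standard_normal((M, K))
y = rng.standard_normal(M)
nu = rng.standard_normal(K)
A = rng.standard_normal((K, K))
P = A @ A.T + K * np.eye(K)                       # recognition precision
Sigma = np.linalg.inv(P)

X = rng.multivariate_normal(nu, Sigma, size=300_000)
mc = lam * ((y - X @ C.T) ** 2).sum(axis=1).mean()
closed = lam * (np.trace(C @ Sigma @ C.T) + ((y - C @ nu) ** 2).sum())
print(mc, closed)                                 # should approximately agree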

The M step.

We can now write a more specific expression for the joint relative entropy that will allow us to see how to carry out the M step, as well. Let us simplify even further and treat \lambda and the parameters of the prior as fixed. Then substituting the cross entropy of the joint, Eq. 8.7, into the joint relative entropy for Gaussian recognition distributions, Eq. 8.1, yields

\mathcal{L}_{\text{JRE}}(\bm{\theta},\bm{\phi})=\mathbb{E}_{\bm{Y}}\left[\frac{\lambda}{2}\text{tr}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}\mathbf{C}^{\text{T}}\right]+\frac{\lambda}{2}\left\lVert\bm{Y}-\mathbf{C}\bm{\nu}_{\bm{y}}\right\rVert^{2}+\frac{1}{2}\log\left\lvert\mathbf{P}_{\bm{y}}\right\rvert\right]+\sum_{k=1}^{K}\mathbb{E}_{\bm{\check{X}},\bm{Y}}\left[E_{k}(\check{X}_{k},\bm{\theta})\right]+c.    (8.8)

(All “constant” terms have been lumped into a single scalar c.) The only remaining model parameter to be optimized is the emission matrix, \mathbf{C}. We can compute the gradient of the joint relative entropy, Eq. 8.8, with respect to \mathbf{C} using some results from the appendix (Section B.1):

\begin{split}
\frac{\partial\mathcal{L}_{\text{JRE}}}{\partial\mathbf{C}}&=\lambda\,\mathbb{E}_{\bm{Y}}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}-\left(\bm{Y}-\mathbf{C}\bm{\nu}_{\bm{y}}\right)\bm{\nu}_{\bm{y}}^{\text{T}}\right]\stackrel{\text{set}}{=}0\\
\implies\mathbf{C}&=\left\langle\bm{Y}\bm{\nu}_{\bm{Y}}^{\text{T}}\right\rangle_{\bm{Y}}\left\langle\mathbf{P}^{-1}_{\bm{Y}}+\bm{\nu}_{\bm{Y}}\bm{\nu}_{\bm{Y}}^{\text{T}}\right\rangle_{\bm{Y}}^{-1}.
\end{split}    (8.9)

Thus, the emission matrix is once again given by some version of the normal equations. In practice, however, if each \bm{y} is (e.g.) an image, and the model is overcomplete, then the matrix inversions will be computationally challenging. Instead, the gradient in Eq. 8.9 (or the natural gradient [29, 31]) can be descended.
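In code, the two routes to the M step might look as follows: the closed-form normal equations of Eq. 8.9, and the gradient that one would descend instead when the inversion is too expensive. Here Y (N×M), nu (N×K), and Pinv (N×K×K) are assumed to stack the data and the per-datum recognition means and covariances; the function names are illustrative:

import numpy as np

def m_step_closed_form(Y, nu, Pinv):
    """C <- <Y nu^T> <P^{-1} + nu nu^T>^{-1}, with <.> an average over the data."""
    N = Y.shape[0]
    Ynu = Y.T @ nu / N                                  # <Y nu^T>, M x K
    second_moment = Pinv.mean(axis=0) + nu.T @ nu / N   # <P^{-1} + nu nu^T>, K x K
    return Ynu @ np.linalg.inv(second_moment)

def m_step_gradient(C, Y, nu, Pinv, lam):
    """Gradient of the JRE with respect to C (first line of Eq. 8.9)."""
    N = Y.shape[0]
    resid = Y - nu @ C.T                                # N x M
    return lam * (C @ Pinv.mean(axis=0) - resid.T @ nu / N)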

The E step.

To carry out the E step, we apply Eqs. 8.3 and 8.4 to the energy of the generative model’s joint or posterior distribution (they are equivalent up to constant terms), to wit, the bracketed terms in Eq. 8.7. Let us also commit to the source energies specified by Eq. 8.6, i.e., the energy of the Laplace distribution. Then the resulting joint energy,

E_{\text{joint}}(\bm{\hat{x}},\bm{y},\bm{\theta})=\frac{\lambda}{2}\left\lVert\bm{y}-\mathbf{C}\bm{\hat{x}}\right\rVert^{2}+\sum_{k=1}^{K}\alpha_{k}\left\lvert\hat{x}_{k}\right\rvert,

is quadratic in \bm{\hat{x}} everywhere except at points where \hat{x}_{k} is zero (for any k). Therefore the energy-gradient function (of \bm{\hat{x}}) and the expectation in Eq. 8.3 “almost” commute with each other, and the gradient of the joint relative entropy becomes

\begin{split}
\frac{\partial\mathcal{L}_{\text{JRE}}}{\partial\bm{\nu}_{\bm{y}}}&=\mathbb{E}_{\bm{\check{X}}|\bm{y}}\left[\frac{\partial E_{\text{joint}}}{\partial\bm{\hat{x}}}\left(\bm{\check{X}},\bm{y},\bm{\theta}\right)\middle|\bm{y}\right]\\
&\approx\frac{\partial E_{\text{joint}}}{\partial\bm{\hat{x}}}\left(\mathbb{E}_{\bm{\check{X}}|\bm{y}}\left[\bm{\check{X}}\middle|\bm{y}\right],\bm{y},\bm{\theta}\right)\\
&=\frac{\mathrm{d}}{\mathrm{d}\bm{\nu}_{\bm{y}}}\left(\frac{\lambda}{2}\left\lVert\bm{y}-\mathbf{C}\bm{\nu}_{\bm{y}}\right\rVert^{2}+\sum_{k=1}^{K}\alpha_{k}\left\lvert\nu^{k}_{\bm{y}}\right\rvert\right)\stackrel{\text{set}}{=}0\\
\implies\bm{\nu}^{*}_{\bm{y}}&=\operatorname*{argmin}_{\bm{\nu}_{\bm{y}}}\left\{\frac{\lambda}{2}\left\lVert\bm{y}-\mathbf{C}\bm{\nu}_{\bm{y}}\right\rVert^{2}+\sum_{k=1}^{K}\alpha_{k}\left\lvert\nu^{k}_{\bm{y}}\right\rvert\right\}.
\end{split}    (8.10)

Thus, the optimal choice for the mean is the solution to the L^{1}-penalized least-squares problem. (Technically, the final line of Eq. 8.10 is only a sufficient, not a necessary, condition for the gradient to be zero, but we are assuming the Hessian is locally positive definite.) For “fat” matrices \mathbf{C}, this problem is known as basis-pursuit denoising (it is also sometimes referred to as the LASSO, although typically that term is reserved for “tall” output matrices \mathbf{C}), a convex optimization for which efficient solvers exist.
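Any number of standard solvers will do; as one concrete (if not especially fast) choice, here is a proximal-gradient (ISTA-style) sketch of the E-step mean update. The step size comes from the Lipschitz constant of the quadratic term, and the iteration count is arbitrary:

import numpy as np

def e_step_mean(y, C, lam, alpha, n_iter=500):
    """Minimize (lam/2)||y - C nu||^2 + sum_k alpha_k |nu_k| by soft-thresholded gradient steps."""
    nu = np.zeros(C.shape[1])
    step = 1.0 / (lam * np.linalg.norm(C, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = lam * C.T @ (C @ nu - y)              # gradient of the quadratic term
        z = nu - step * grad
        nu = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)   # prox of the L1 penalty
    return nu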

That leaves the covariance matrix, given by Eq. 8.4. Again we exploit the fact that the energy is sub-cubic in \bm{\hat{x}} (indeed, its Hessian is constant), except at points where any \hat{x}_{k} is zero. Therefore,

\begin{split}
\mathbf{P}_{\bm{y}}&=\mathbb{E}_{\bm{\check{X}}|\bm{y}}\left[\frac{\partial^{2}E_{\text{joint}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}\left(\bm{\check{X}},\bm{y},\bm{\theta}\right)\middle|\bm{y}\right]\\
&\approx\frac{\partial^{2}E_{\text{joint}}}{\partial\bm{\hat{x}}\,\partial\bm{\hat{x}}^{\text{T}}}\left(\mathbb{E}_{\bm{\check{X}}|\bm{y}}\left[\bm{\check{X}}\middle|\bm{y}\right],\bm{y},\bm{\theta}\right)\\
&\approx\lambda\mathbf{C}^{\text{T}}\mathbf{C}+\text{diag}\left(\bm{\alpha}\circ\beta\operatorname{sech}^{2}(\beta\bm{\nu}_{\bm{y}})\right).
\end{split}    (8.11)

The approximation in the final line follows from the approximation to the Laplace energy in Eq. 8.6.

Let us collect up our optimizations for the complete sparse-coding algorithm:

EM Algorithm for Laplace-Sparse Coding

\bullet\> E step: \bm{\nu}^{(i+1)}_{\bm{y}}\leftarrow\operatorname*{argmin}_{\bm{\nu}_{\bm{y}}}\left\{\frac{\lambda}{2}\left\lVert\bm{y}-\mathbf{C}^{(i)}\bm{\nu}_{\bm{y}}\right\rVert^{2}+\sum_{k=1}^{K}\alpha_{k}\left\lvert\nu^{k}_{\bm{y}}\right\rvert\right\}
\mathbf{P}^{(i+1)}_{\bm{y}}\leftarrow\lambda{\mathbf{C}^{(i)}}^{\text{T}}\mathbf{C}^{(i)}+\text{diag}\left(\bm{\alpha}\circ\beta\operatorname{sech}^{2}(\beta\bm{\nu}^{(i)}_{\bm{y}})\right)
\bullet\> M step: \mathbf{C}^{(i+1)}\leftarrow\left\langle\bm{Y}{\bm{\nu}^{(i+1)}_{\bm{Y}}}^{\text{T}}\right\rangle_{\bm{Y}}\left\langle{\mathbf{P}^{(i+1)}_{\bm{Y}}}^{-1}+\bm{\nu}^{(i+1)}_{\bm{Y}}{\bm{\nu}^{(i+1)}_{\bm{Y}}}^{\text{T}}\right\rangle_{\bm{Y}}^{-1}.
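Assembled into code, the algorithm might look like the following self-contained sketch (lam, alpha, and beta are treated as fixed hyperparameters, as in the derivation; the inner solver is the same ISTA-style update sketched above):

import numpy as np

def laplace_sparse_coding_em(Y, K, lam, alpha, beta, n_em=50, n_ista=300, seed=0):
    """EM for Laplace sparse coding with per-datum Gaussian recognition parameters."""
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    C = rng.standard_normal((M, K)) / np.sqrt(K)     # initial emission matrix
    nu = np.zeros((N, K))                            # recognition means
    Pinv = np.tile(np.eye(K), (N, 1, 1))             # recognition covariances
    for _ in range(n_em):
        # E step, means: L1-penalized least squares for every datum (vectorized ISTA)
        step = 1.0 / (lam * np.linalg.norm(C, 2) ** 2)
        for _ in range(n_ista):
            grad = lam * (nu @ C.T - Y) @ C
            z = nu - step * grad
            nu = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
        # E step, precisions: Eq. 8.11, evaluated at the current recognition means
        CtC = lam * C.T @ C
        for n in range(N):
            P_n = CtC + np.diag(alpha * beta / np.cosh(beta * nu[n]) ** 2)
            Pinv[n] = np.linalg.inv(P_n)
        # M step: normal equations for the emission matrix, Eq. 8.9
        second_moment = Pinv.mean(axis=0) + nu.T @ nu / N
        C = (Y.T @ nu / N) @ np.linalg.inv(second_moment)
    return C, nu, Pinv

Warm-starting the recognition means across EM iterations, as here, is a convenience rather than a requirement.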

Relationship to classical sparse coding.

We have arrived at a sparse-coding algorithm very similar to that of Lewicki and colleagues [29, 31], but by a somewhat different route. In those papers, the matrix of basis functions, \mathbf{C}, is fit by directly descending along the gradient of the marginal relative entropy (MRE), \text{H}_{p\hat{p}}\left[\bm{Y};\mathbf{C}\right], rather than descending via the gradient of its upper bound, the joint relative entropy (JRE). The price, of course, is the intractability of the MRE. In the classical papers, this price is paid with Laplace’s method, approximating the model marginal, \hat{p}\left(\bm{\hat{y}};\mathbf{C}\right)=\int_{\bm{\hat{x}}}\hat{p}\left(\bm{\hat{x}}\right)\hat{p}\left(\bm{\hat{y}}\middle|\bm{\hat{x}};\mathbf{C}\right)\mathrm{d}\bm{\hat{x}}, in the vicinity of one of the modes, \bm{\hat{x}}^{0}, of the model joint (Eq. 2.34). This in some sense removes the latent variables from the problem, although the modes still need to be estimated. Consequently, there is no E step, and no fixed parameters \bm{\nu}_{\bm{y}} and \mathbf{P}_{\bm{y}} (for any \bm{y}).

Nevertheless, the Hessian of the joint (or posterior) energy still shows up in the classical derivation’s loss, this time as a byproduct of Laplace’s method, as a measure of the volume of the posterior. This is no coincidence: we have allowed the E step to collapse into a Laplace approximation [36] by using an energy function that is sub-cubic almost everywhere. However, in the classical method, this precision matrix is not fixed during optimization of \mathbf{C}, but rather is interpreted as a function of \mathbf{C} and \bm{\hat{x}}^{0}: f(\mathbf{C},\bm{\hat{x}}^{0}):=\log\left\lvert\mathbf{P}_{\bm{y}}\left(\mathbf{C},\bm{\hat{x}}^{0}\right)\right\rvert. On the other hand, the trace term in Eq. 8.8, \lambda\text{tr}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}\mathbf{C}^{\text{T}}\right], never materializes, because the squared reconstruction error is simply evaluated at the mode, rather than being averaged under a recognition distribution.

Conceptually, these two terms encode the contribution of the recognition precision (\mathbf{P}_{\bm{y}}) to the two losses, ours and the classical. Evidently, increasing the precision has opposite effects on the two losses, lowering the JRE (via \lambda\text{tr}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}\mathbf{C}^{\text{T}}\right]) but raising the approximate MRE (via \log\left\lvert\mathbf{P}_{\bm{y}}\left(\mathbf{C},\bm{\hat{x}}^{0}\right)\right\rvert). But increasing the magnitude of the emission matrix \mathbf{C} has the same effect on these two terms: It increases the contribution of recognition uncertainty to the average reconstruction error (\lambda\text{tr}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}\mathbf{C}^{\text{T}}\right]), and it increases the recognition precision itself (and therefore \log\left\lvert\mathbf{P}_{\bm{y}}\left(\mathbf{C},\bm{\hat{x}}^{0}\right)\right\rvert) via Eq. 8.11. Indeed, the partial derivative of f(\mathbf{C},\bm{\hat{x}}^{0}):=\log\left\lvert\mathbf{P}_{\bm{y}}\left(\mathbf{C},\bm{\hat{x}}^{0}\right)\right\rvert with respect to \mathbf{C} is precisely the same as the derivative of the trace term \lambda\text{tr}\left[\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}\mathbf{C}^{\text{T}}\right] with respect to \mathbf{C} (namely, 2\lambda\mathbf{C}\mathbf{P}^{-1}_{\bm{y}}).
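This equality is easy to confirm with finite differences: hold the diagonal prior-curvature term of Eq. 8.11 fixed at some D, define P_y(C) = λC^T C + D, and compare the numerical derivative of log|P_y(C)| with 2λ C P_y^{-1}. A sketch with arbitrary small matrices:

import numpy as np

rng = np.random.default_rng(3)
M, K, lam = 4, 6, 2.0
C = rng.standard_normal((M, K))
D = np.diag(rng.uniform(0.5, 1.5, size=K))        # fixed diagonal prior-curvature term

P = lambda C: lam * C.T @ C + D                   # cf. Eq. 8.11
analytic = 2 * lam * C @ np.linalg.inv(P(C))      # claimed gradient of log|P(C)| w.r.t. C

numeric = np.zeros_like(C)
eps = 1e-6
for i in range(M):
    for j in range(K):
        Cp, Cm = C.copy(), C.copy()
        Cp[i, j] += eps
        Cm[i, j] -= eps
        numeric[i, j] = (np.linalg.slogdet(P(Cp))[1]
                         - np.linalg.slogdet(P(Cm))[1]) / (2 * eps)

print(np.abs(numeric - analytic).max())           # small: the two gradients agree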

There is one last discrepancy between the loss gradients. In the classical derivation, the mode \bm{\hat{x}}^{0} is not fixed, either, and in turn depends on \mathbf{C}. So the total derivative of \log\left\lvert\mathbf{P}_{\bm{y}}\left(\mathbf{C},\bm{\hat{x}}^{0}(\mathbf{C})\right)\right\rvert generates extra terms. In the original papers, however, these are derived and then promptly dropped, under the conjecture that curvature unrelated to volume is irrelevant to the optimization.

8.1.2 Independent-components analysis

Just as EM becomes K-means for the Gaussian mixture model and (iterative) principal-components analysis for the factor analyzer as the variance of the (Gaussian) emission is shrunk to zero, EM for the sparse-coding model becomes independent-components analysis (ICA)…