7.3 Factor analysis and principal-components analysis

Retaining the Gaussian emissions from the GMM but exchanging the categorical latent variable for a standard normal variate yields “factor analysis” (see Section 2.1.2). We also restrict the emission covariance to be diagonal in order to remove a degree of freedom that is not (as we shall see) identifiable from the data. The model is fully described by Eq. 2.20. However, we depart slightly from that formulation by augmenting the latent variable vector $\check{\bm{X}}$ with a “random” scalar that is 1 with probability 1. This allows us to absorb the bias $\bm{c}$ into a column of the emission matrix $\mathbf{C}$, reducing clutter without reducing generality.
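To make the generative model concrete, here is a minimal NumPy sketch of sampling from it, with the augmented latent handled explicitly; the function name `sample_fa` and its signature are illustrative rather than canonical.

```python
import numpy as np

def sample_fa(C, Lambda_diag, n_samples, seed=None):
    """Draw samples from the factor-analysis generative model.

    C           : (Ny, Nx+1) emission matrix; its last column is the absorbed bias c.
    Lambda_diag : (Ny,) diagonal of the (diagonal) emission covariance Lambda.
    """
    rng = np.random.default_rng(seed)
    n_y, n_x_aug = C.shape
    # Latent: standard-normal factors, augmented with a "random" scalar that is 1 w.p. 1.
    X = rng.standard_normal((n_samples, n_x_aug - 1))
    X_check = np.hstack([X, np.ones((n_samples, 1))])
    # Emission: Y | X_check ~ N(C X_check, Lambda).
    noise = rng.standard_normal((n_samples, n_y)) * np.sqrt(Lambda_diag)
    Y = X_check @ C.T + noise
    return X_check, Y
```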

The learning problem starts once again with minimization of the joint cross entropy:

\begin{equation}
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\check{\bm{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\!\left(\check{\bm{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\log\!\left( \hat{p}\!\left(\bm{Y}\middle|\check{\bm{X}};\bm{\theta}\right)\hat{p}\!\left(\check{\bm{X}};\bm{\theta}\right) \right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\log\mathcal{N}\!\left(\mathbf{C}\check{\bm{X}},\,\mathbf{\Lambda}\right) - \log\mathcal{N}\!\left(\bm{0},\,\mathbf{I}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}.
\end{split}
\tag{7.2}
\end{equation}

The M step.

The model prior distribution does not depend on any parameters, so only the model emission is differentiated. Starting with the emission matrix $\mathbf{C}$:

\begin{equation*}
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{C}}
&= \left\langle -\frac{\mathrm{d}}{\mathrm{d}\mathbf{C}} \log\mathcal{N}\!\left(\mathbf{C}\check{\bm{X}},\,\mathbf{\Lambda}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle \frac{1}{2}\frac{\mathrm{d}}{\mathrm{d}\mathbf{C}} \left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}}\mathbf{\Lambda}^{-1}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\mathbf{\Lambda}^{-1}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}\\
\implies \mathbf{C} &= \left\langle \bm{Y}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}} \left\langle \check{\bm{X}}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}^{-1},
\end{split}
\end{equation*}

the normal equations. Thus, in a fully observed model, finding $\mathbf{C}$ amounts to linear regression.
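To illustrate, here is a minimal NumPy sketch of this update from sample averages, treating the latent vectors as if they were observed (as in a fully observed model); the helper name `m_step_C` is illustrative.

```python
import numpy as np

def m_step_C(Y, X_check):
    """Normal equations: C = <Y Xc^T> <Xc Xc^T>^{-1}, with <.> a sample average.

    Y       : (N, Ny) observations, one sample per row.
    X_check : (N, Nx+1) latent vectors (last entry 1), treated here as observed.
    """
    N = len(Y)
    YX = Y.T @ X_check / N          # <Y Xc^T>
    XX = X_check.T @ X_check / N    # <Xc Xc^T>
    # Solve C XX = YX for C rather than forming the inverse explicitly.
    return np.linalg.solve(XX, YX.T).T
```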

The emission covariance also takes on a familiar form:

\begin{equation*}
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{\Lambda}^{-1}}
&= \left\langle -\frac{\mathrm{d}}{\mathrm{d}\mathbf{\Lambda}^{-1}} \log\mathcal{N}\!\left(\mathbf{C}\check{\bm{X}},\,\mathbf{\Lambda}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\frac{\mathrm{d}}{\mathrm{d}\mathbf{\Lambda}^{-1}} \left[ \frac{1}{2}\log\left\lvert\mathbf{\Lambda}^{-1}\right\rvert - \frac{1}{2}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}}\mathbf{\Lambda}^{-1}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right) \right] \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \frac{1}{2}\left\langle \left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}} - \mathbf{\Lambda} \right\rangle_{\check{\bm{X}},\bm{Y}}\\
\implies \mathbf{\Lambda} &= \left\langle \left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}.
\end{split}
\end{equation*}

The final line can be simplified using our newly acquired formula for $\mathbf{C}$, first expanding the quadratic and then applying the normal equations:

\begin{equation*}
\begin{split}
\mathbf{\Lambda} &= \left\langle \bm{Y}\bm{Y}^{\text{T}} - \bm{Y}\check{\bm{X}}^{\text{T}}\mathbf{C}^{\text{T}} - \mathbf{C}\check{\bm{X}}\bm{Y}^{\text{T}} + \mathbf{C}\check{\bm{X}}\check{\bm{X}}^{\text{T}}\mathbf{C}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}} - \left\langle \bm{Y}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}\mathbf{C}^{\text{T}} - \mathbf{C}\left\langle \check{\bm{X}}\bm{Y}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}} + \mathbf{C}\left\langle \check{\bm{X}}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}}}\mathbf{C}^{\text{T}}\\
&= \left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}} - \mathbf{C}\left\langle \check{\bm{X}}\bm{Y}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}.
\end{split}
\end{equation*}

Now, we require $\mathbf{\Lambda}$ to be diagonal. It may be observed that the derivative with respect to any particular entry of $\mathbf{\Lambda}$ is independent of all other entries, so simply setting some components to zero does not change the optimum for the other components. So we merely extract the diagonal from the final equation.
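A corresponding sketch for the diagonal $\mathbf{\Lambda}$ update, under the same assumptions (NumPy, latents treated as observed); it computes only the diagonal of $\langle\bm{Y}\bm{Y}^{\text{T}}\rangle - \mathbf{C}\langle\check{\bm{X}}\bm{Y}^{\text{T}}\rangle$, and the helper name is again illustrative.

```python
import numpy as np

def m_step_Lambda_diag(Y, X_check, C):
    """Diagonal of  <Y Y^T> - C <Xc Y^T>,  with <.> a sample average."""
    N = len(Y)
    YY_diag = np.einsum('ni,ni->i', Y, Y) / N    # diagonal of <Y Y^T>
    XY = X_check.T @ Y / N                       # <Xc Y^T>
    CXY_diag = np.einsum('ij,ji->i', C, XY)      # diagonal of C <Xc Y^T>
    return YY_diag - CXY_diag
```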

The E step.

In Section 2.1.2, we derived the posterior distribution for factor analysis:

\begin{equation}
\hat{p}\!\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right) = \mathcal{N}\!\left(\mathbf{K}\bm{y},\;\left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1}\right), \qquad
\mathbf{K} := \left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1}\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}.
\tag{7.3}
\end{equation}

In the E step, then, the expected sufficient statistics for $\mathbf{C}$ and $\mathbf{\Lambda}$ are calculated as

\begin{equation*}
\left\langle \bm{Y}\,\mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}^{\text{T}}\middle|\bm{Y}\right] \right\rangle_{\bm{Y}} = \left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}}\mathbf{K}^{\text{T}}
\end{equation*}

and

\begin{equation*}
\begin{split}
\left\langle \mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}\check{\bm{X}}^{\text{T}}\middle|\bm{Y}\right] \right\rangle_{\bm{Y}}
&= \left\langle \text{Cov}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}\middle|\bm{Y}\right] + \mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}\middle|\bm{Y}\right]\mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}^{\text{T}}\middle|\bm{Y}\right] \right\rangle_{\bm{Y}}\\
&= \left\langle \left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1} + \mathbf{K}\bm{Y}\bm{Y}^{\text{T}}\mathbf{K}^{\text{T}} \right\rangle_{\bm{Y}}\\
&= \left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1} + \mathbf{K}\left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}}\mathbf{K}^{\text{T}}.
\end{split}
\end{equation*}
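Putting Eq. 7.3 together with these two statistics, a sketch of the E step in NumPy follows; it assumes the diagonal of $\mathbf{\Lambda}$ is stored as a vector and that data are arranged one sample per row, and the function name `e_step` is illustrative.

```python
import numpy as np

def e_step(Y, C, Lambda_diag):
    """Posterior gain K and covariance (Eq. 7.3), plus the two expected sufficient statistics."""
    N = len(Y)
    Linv_C = C / Lambda_diag[:, None]                              # Lambda^{-1} C (Lambda is diagonal)
    Sigma_post = np.linalg.inv(C.T @ Linv_C + np.eye(C.shape[1]))  # (C^T L^{-1} C + I)^{-1}
    K = Sigma_post @ Linv_C.T                                      # (C^T L^{-1} C + I)^{-1} C^T L^{-1}
    YY = Y.T @ Y / N                                               # <Y Y^T>
    stat_YX = YY @ K.T                                             # <Y E[Xc^T | Y]>
    stat_XX = Sigma_post + K @ YY @ K.T                            # <E[Xc Xc^T | Y]>
    return stat_YX, stat_XX
```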

7.3.1 Principal-components analysis

We saw that in the limit of equal and infinite emission precisions, EM for the GMM reduces to $K$-means. Now we investigate this limit in the case of EM for the factor analyzer. In this case the only parameter to estimate is $\mathbf{C}$.

From Eq. 7.3, we see that the posterior covariance goes to zero as $\mathbf{\Lambda}^{-1}$ goes to infinity: inference becomes deterministic. With slightly more work, we can also determine the mean, $\bar{\bm{x}}$, to which each $\bm{y}$ is deterministically assigned. Setting $\mathbf{\Lambda}=\epsilon\mathbf{I}$, we find

\begin{equation*}
\bar{\bm{X}} := \mathbb{E}_{\hat{\bm{X}}|\bm{Y}}\!\left[\hat{\bm{X}}\middle|\bm{Y}\right]
= \left(\mathbf{C}^{\text{T}}\frac{\mathbf{I}}{\epsilon}\mathbf{C}+\mathbf{I}\right)^{-1}\mathbf{C}^{\text{T}}\frac{\mathbf{I}}{\epsilon}\bm{Y}
= \left(\mathbf{C}^{\text{T}}\mathbf{C}+\epsilon\mathbf{I}\right)^{-1}\mathbf{C}^{\text{T}}\bm{Y}
\xrightarrow[\epsilon\to 0]{} \left(\mathbf{C}^{\text{T}}\mathbf{C}\right)^{-1}\mathbf{C}^{\text{T}}\bm{Y}.
\end{equation*}

The final expression applies the Moore-Penrose pseudo-inverse of $\mathbf{C}$ to $\bm{Y}$; that is, it is the latent-space projection of $\bm{Y}$ that yields the smallest reconstruction error under the emission matrix $\mathbf{C}$.
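A brief numerical check of this claim (a sketch assuming NumPy and a full-column-rank $\mathbf{C}$; here the columns of $\bm{Y}$ are the observations):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 2))                   # emission matrix, full column rank
Y = rng.standard_normal((5, 100))                 # columns are observations
X_bar = np.linalg.solve(C.T @ C, C.T @ Y)         # (C^T C)^{-1} C^T Y
assert np.allclose(X_bar, np.linalg.pinv(C) @ Y)  # matches the Moore-Penrose pseudo-inverse
```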

[……]

[Tipping1999]

Iterative Principal-Components Analysis

$\bullet\>$ E step: $\bar{\bm{X}}^{(i+1)} \leftarrow \left({\mathbf{C}^{(i)}}^{\text{T}}\mathbf{C}^{(i)}\right)^{-1}{\mathbf{C}^{(i)}}^{\text{T}}\bm{Y}$
$\bullet\>$ M step: $\mathbf{C}^{(i+1)} \leftarrow \left\langle \bm{Y}\,{\bar{\bm{X}}^{(i+1)}}^{\text{T}} \right\rangle_{\bm{Y}} \left\langle \bar{\bm{X}}^{(i+1)}{\bar{\bm{X}}^{(i+1)}}^{\text{T}} \right\rangle_{\bm{Y}}^{-1}$
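A minimal NumPy sketch of this iteration, assuming centered data and omitting the bias column; without an additional orthogonalization step, the learned columns of $\mathbf{C}$ span the principal subspace but are not in general the principal axes themselves.

```python
import numpy as np

def iterative_pca(Y, n_components, n_iters=100, seed=None):
    """EM for the zero-noise factor analyzer (iterative PCA).

    Y : (N, Ny) centered data, one sample per row.
    """
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((Y.shape[1], n_components))
    for _ in range(n_iters):
        # E step: X_bar <- (C^T C)^{-1} C^T Y   (deterministic inference)
        X_bar = np.linalg.solve(C.T @ C, C.T @ Y.T)           # (n_components, N)
        # M step: C <- <Y X_bar^T> <X_bar X_bar^T>^{-1}
        C = np.linalg.solve(X_bar @ X_bar.T, X_bar @ Y).T     # (Ny, n_components)
    return C
```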