7.3 Factor analysis and principal-components analysis

Retaining the Gaussian emissions from the GMM but exchanging the categorical latent variable for a standard normal variate yields “factor analysis” (see Section 2.1.2). We also restrict the emission covariance to be diagonal in order to remove a degree of freedom that is not (as we shall see) identifiable from the data. The model is fully described by Eq. 2.20. However, we depart slightly from that formulation by augmenting the latent variable vector $\check{\bm{X}}$ with a “random” scalar that is 1 with probability 1. This allows us to absorb the bias $\bm{c}$ into a column of the emission matrix $\mathbf{C}$, reducing clutter without reducing generality.
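To make the generative model concrete, here is a minimal NumPy sketch of sampling from it, with the augmented latent handled explicitly; the function name `sample_fa` and its signature are illustrative rather than canonical.

```python
import numpy as np

def sample_fa(C, Lambda_diag, n_samples, seed=None):
    """Draw samples from the factor-analysis generative model.

    C           : (Ny, Nx+1) emission matrix; its last column is the absorbed bias c.
    Lambda_diag : (Ny,) diagonal of the (diagonal) emission covariance Lambda.
    """
    rng = np.random.default_rng(seed)
    n_y, n_x_aug = C.shape
    # Latent: standard-normal factors, augmented with a "random" scalar that is 1 w.p. 1.
    X = rng.standard_normal((n_samples, n_x_aug - 1))
    X_check = np.hstack([X, np.ones((n_samples, 1))])
    # Emission: Y | X_check ~ N(C X_check, Lambda).
    noise = rng.standard_normal((n_samples, n_y)) * np.sqrt(Lambda_diag)
    Y = X_check @ C.T + noise
    return X_check, Y
```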

The learning problem starts once again with minimization of the joint cross entropy:

\begin{equation}
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\check{\bm{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\!\left(\check{\bm{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\log\!\left( \hat{p}\!\left(\bm{Y}\middle|\check{\bm{X}};\bm{\theta}\right)\hat{p}\!\left(\check{\bm{X}};\bm{\theta}\right) \right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\log\mathcal{N}\!\left(\mathbf{C}\check{\bm{X}},\,\mathbf{\Lambda}\right) - \log\mathcal{N}\!\left(\bm{0},\,\mathbf{I}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}.
\end{split}
\tag{7.2}
\end{equation}

The M step.

The model prior distribution does not depend on any parameters, so only the model emission is differentiated. Starting with the emission matrix $\mathbf{C}$:

\begin{equation*}
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{C}}
&= \left\langle -\frac{\mathrm{d}}{\mathrm{d}\mathbf{C}} \log\mathcal{N}\!\left(\mathbf{C}\check{\bm{X}},\,\mathbf{\Lambda}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle \frac{1}{2}\frac{\mathrm{d}}{\mathrm{d}\mathbf{C}} \left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}}\mathbf{\Lambda}^{-1}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\mathbf{\Lambda}^{-1}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}\\
\implies \mathbf{C} &= \left\langle \bm{Y}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}} \left\langle \check{\bm{X}}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}^{-1},
\end{split}
\end{equation*}

the normal equations. Thus, in a fully observed model, finding $\mathbf{C}$ amounts to linear regression.
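To illustrate, here is a minimal NumPy sketch of this update from sample averages, treating the latent vectors as if they were observed (as in a fully observed model); the helper name `m_step_C` is illustrative.

```python
import numpy as np

def m_step_C(Y, X_check):
    """Normal equations: C = <Y Xc^T> <Xc Xc^T>^{-1}, with <.> a sample average.

    Y       : (N, Ny) observations, one sample per row.
    X_check : (N, Nx+1) latent vectors (last entry 1), treated here as observed.
    """
    N = len(Y)
    YX = Y.T @ X_check / N          # <Y Xc^T>
    XX = X_check.T @ X_check / N    # <Xc Xc^T>
    # Solve C XX = YX for C rather than forming the inverse explicitly.
    return np.linalg.solve(XX, YX.T).T
```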

The emission covariance also takes on a familiar form:

\begin{equation*}
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{\Lambda}^{-1}}
&= \left\langle -\frac{\mathrm{d}}{\mathrm{d}\mathbf{\Lambda}^{-1}} \log\mathcal{N}\!\left(\mathbf{C}\check{\bm{X}},\,\mathbf{\Lambda}\right) \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle -\frac{\mathrm{d}}{\mathrm{d}\mathbf{\Lambda}^{-1}} \left[ \frac{1}{2}\log\left\lvert\mathbf{\Lambda}^{-1}\right\rvert - \frac{1}{2}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}}\mathbf{\Lambda}^{-1}\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right) \right] \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \frac{1}{2}\left\langle \left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}} - \mathbf{\Lambda} \right\rangle_{\check{\bm{X}},\bm{Y}}\\
\implies \mathbf{\Lambda} &= \left\langle \left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)\left(\bm{Y}-\mathbf{C}\check{\bm{X}}\right)^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}.
\end{split}
\end{equation*}

The final line can be simplified using our newly acquired formula for $\mathbf{C}$, first expanding the quadratic and then applying the normal equations:

\begin{equation*}
\begin{split}
\mathbf{\Lambda} &= \left\langle \bm{Y}\bm{Y}^{\text{T}} - \bm{Y}\check{\bm{X}}^{\text{T}}\mathbf{C}^{\text{T}} - \mathbf{C}\check{\bm{X}}\bm{Y}^{\text{T}} + \mathbf{C}\check{\bm{X}}\check{\bm{X}}^{\text{T}}\mathbf{C}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}\\
&= \left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}} - \left\langle \bm{Y}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}\mathbf{C}^{\text{T}} - \mathbf{C}\left\langle \check{\bm{X}}\bm{Y}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}} + \mathbf{C}\left\langle \check{\bm{X}}\check{\bm{X}}^{\text{T}} \right\rangle_{\check{\bm{X}}}\mathbf{C}^{\text{T}}\\
&= \left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}} - \mathbf{C}\left\langle \check{\bm{X}}\bm{Y}^{\text{T}} \right\rangle_{\check{\bm{X}},\bm{Y}}.
\end{split}
\end{equation*}

Now, we require $\mathbf{\Lambda}$ to be diagonal. It may be observed that the derivative with respect to any particular entry of $\mathbf{\Lambda}$ is independent of all other entries, so simply setting some components to zero does not change the optimum for the other components. So we merely extract the diagonal from the final equation.
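A corresponding sketch for the diagonal $\mathbf{\Lambda}$ update, under the same assumptions (NumPy, latents treated as observed); it computes only the diagonal of $\langle\bm{Y}\bm{Y}^{\text{T}}\rangle - \mathbf{C}\langle\check{\bm{X}}\bm{Y}^{\text{T}}\rangle$, and the helper name is again illustrative.

```python
import numpy as np

def m_step_Lambda_diag(Y, X_check, C):
    """Diagonal of  <Y Y^T> - C <Xc Y^T>,  with <.> a sample average."""
    N = len(Y)
    YY_diag = np.einsum('ni,ni->i', Y, Y) / N    # diagonal of <Y Y^T>
    XY = X_check.T @ Y / N                       # <Xc Y^T>
    CXY_diag = np.einsum('ij,ji->i', C, XY)      # diagonal of C <Xc Y^T>
    return YY_diag - CXY_diag
```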

The E step.

In Section 2.1.2, we derived the posterior distribution for factor analysis:

\begin{equation}
\hat{p}\!\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right) = \mathcal{N}\!\left(\mathbf{K}\bm{y},\;\left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1}\right), \qquad
\mathbf{K} := \left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1}\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}.
\tag{7.3}
\end{equation}

In the E step, then, the expected sufficient statistics for $\mathbf{C}$ and $\mathbf{\Lambda}$ are calculated as

\begin{equation*}
\left\langle \bm{Y}\,\mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}^{\text{T}}\middle|\bm{Y}\right] \right\rangle_{\bm{Y}} = \left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}}\mathbf{K}^{\text{T}}
\end{equation*}

and

\begin{equation*}
\begin{split}
\left\langle \mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}\check{\bm{X}}^{\text{T}}\middle|\bm{Y}\right] \right\rangle_{\bm{Y}}
&= \left\langle \text{Cov}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}\middle|\bm{Y}\right] + \mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}\middle|\bm{Y}\right]\mathbb{E}_{\check{\bm{X}}|\bm{Y}}\!\left[\check{\bm{X}}^{\text{T}}\middle|\bm{Y}\right] \right\rangle_{\bm{Y}}\\
&= \left\langle \left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1} + \mathbf{K}\bm{Y}\bm{Y}^{\text{T}}\mathbf{K}^{\text{T}} \right\rangle_{\bm{Y}}\\
&= \left(\mathbf{C}^{\text{T}}\mathbf{\Lambda}^{-1}\mathbf{C}+\mathbf{I}\right)^{-1} + \mathbf{K}\left\langle \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}}\mathbf{K}^{\text{T}}.
\end{split}
\end{equation*}
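Putting Eq. 7.3 together with these two statistics, a sketch of the E step in NumPy follows; it assumes the diagonal of $\mathbf{\Lambda}$ is stored as a vector and that data are arranged one sample per row, and the function name `e_step` is illustrative.

```python
import numpy as np

def e_step(Y, C, Lambda_diag):
    """Posterior gain K and covariance (Eq. 7.3), plus the two expected sufficient statistics."""
    N = len(Y)
    Linv_C = C / Lambda_diag[:, None]                              # Lambda^{-1} C (Lambda is diagonal)
    Sigma_post = np.linalg.inv(C.T @ Linv_C + np.eye(C.shape[1]))  # (C^T L^{-1} C + I)^{-1}
    K = Sigma_post @ Linv_C.T                                      # (C^T L^{-1} C + I)^{-1} C^T L^{-1}
    YY = Y.T @ Y / N                                               # <Y Y^T>
    stat_YX = YY @ K.T                                             # <Y E[Xc^T | Y]>
    stat_XX = Sigma_post + K @ YY @ K.T                            # <E[Xc Xc^T | Y]>
    return stat_YX, stat_XX
```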

7.3.1 Principal-components analysis

We saw that in the limit of equal and infinite emission precisions, EM for the GMM reduces to $K$-means. Now we investigate this limit in the case of EM for the factor analyzer. In this case the only parameter to estimate is $\mathbf{C}$.

From Eq. 7.3, we see that the posterior covariance goes to zero as $\mathbf{\Lambda}^{-1}$ goes to infinity: inference becomes deterministic. With slightly more work, we can also determine the mean, $\bar{\bm{x}}$, to which each $\bm{y}$ is deterministically assigned. Setting $\mathbf{\Lambda}=\epsilon\mathbf{I}$, we find

\begin{equation*}
\bar{\bm{X}} := \mathbb{E}_{\hat{\bm{X}}|\bm{Y}}\!\left[\hat{\bm{X}}\middle|\bm{Y}\right]
= \left(\mathbf{C}^{\text{T}}\frac{\mathbf{I}}{\epsilon}\mathbf{C}+\mathbf{I}\right)^{-1}\mathbf{C}^{\text{T}}\frac{\mathbf{I}}{\epsilon}\bm{Y}
= \left(\mathbf{C}^{\text{T}}\mathbf{C}+\epsilon\mathbf{I}\right)^{-1}\mathbf{C}^{\text{T}}\bm{Y}
\xrightarrow[\epsilon\to 0]{} \left(\mathbf{C}^{\text{T}}\mathbf{C}\right)^{-1}\mathbf{C}^{\text{T}}\bm{Y}.
\end{equation*}

The final expression applies the Moore-Penrose pseudo-inverse of $\mathbf{C}$ to $\bm{Y}$; that is, it is the latent-space projection of $\bm{Y}$ that yields the smallest reconstruction error under the emission matrix $\mathbf{C}$.
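A brief numerical check of this claim (a sketch assuming NumPy and a full-column-rank $\mathbf{C}$; here the columns of $\bm{Y}$ are the observations):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 2))                   # emission matrix, full column rank
Y = rng.standard_normal((5, 100))                 # columns are observations
X_bar = np.linalg.solve(C.T @ C, C.T @ Y)         # (C^T C)^{-1} C^T Y
assert np.allclose(X_bar, np.linalg.pinv(C) @ Y)  # matches the Moore-Penrose pseudo-inverse
```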

[……]

[Tipping1999]

Iterative Principal-Components Analysis

$\bullet\>$ E step: $\bar{\bm{X}}^{(i+1)} \leftarrow \left({\mathbf{C}^{(i)}}^{\text{T}}\mathbf{C}^{(i)}\right)^{-1}{\mathbf{C}^{(i)}}^{\text{T}}\bm{Y}$
$\bullet\>$ M step: $\mathbf{C}^{(i+1)} \leftarrow \left\langle \bm{Y}\,{\bar{\bm{X}}^{(i+1)}}^{\text{T}} \right\rangle_{\bm{Y}} \left\langle \bar{\bm{X}}^{(i+1)}{\bar{\bm{X}}^{(i+1)}}^{\text{T}} \right\rangle_{\bm{Y}}^{-1}$
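A minimal NumPy sketch of this iteration, assuming centered data and omitting the bias column; without an additional orthogonalization step, the learned columns of $\mathbf{C}$ span the principal subspace but are not in general the principal axes themselves.

```python
import numpy as np

def iterative_pca(Y, n_components, n_iters=100, seed=None):
    """EM for the zero-noise factor analyzer (iterative PCA).

    Y : (N, Ny) centered data, one sample per row.
    """
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((Y.shape[1], n_components))
    for _ in range(n_iters):
        # E step: X_bar <- (C^T C)^{-1} C^T Y   (deterministic inference)
        X_bar = np.linalg.solve(C.T @ C, C.T @ Y.T)           # (n_components, N)
        # M step: C <- <Y X_bar^T> <X_bar X_bar^T>^{-1}
        C = np.linalg.solve(X_bar @ X_bar.T, X_bar @ Y).T     # (Ny, n_components)
    return C
```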