7.1 The Gaussian mixture model and $K$-means

We return once again to the GMM, but this time, armed with the EM algorithm, we finally derive the learning rules. The joint cross entropy for a Gaussian mixture model is

\[
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \hat{p}\left(\bm{Y}\middle|\bm{\check{X}};\bm{\theta}\right) \hat{p}\left(\bm{\check{X}};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \prod_{k=1}^{K} \left[\mathcal{N}\left(\bm{Y};\,\bm{\mu}_k,\,\mathbf{\Sigma}_k\right)\pi_k\right]^{\check{X}_k} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\sum_{k=1}^{K} \check{X}_k \left[\frac{1}{2}\log\left\lvert\mathbf{\Sigma}_k^{-1}\right\rvert - \frac{1}{2}\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{\mu}_k-\bm{Y}\right) + \log\pi_k\right] \right\rangle_{\bm{\check{X}},\bm{Y}} + C.
\end{split}
\]

To enforce the fact that the prior probabilities sum to one, we can augment the loss with a Lagrangian term:

\[
\begin{split}
\mathcal{L}(\bm{\theta})
&= \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right) + \text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]\\
&= \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right) + \left\langle -\sum_{k=1}^{K} \check{X}_k \left[\frac{1}{2}\log\left\lvert\mathbf{\Sigma}_k^{-1}\right\rvert - \frac{1}{2}\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{\mu}_k-\bm{Y}\right) + \log\pi_k\right] \right\rangle_{\bm{\check{X}},\bm{Y}} + C.
\end{split}
\]

The M step.

We take the derivatives in turn. First the mixing proportions:

\[
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\pi_k} = \lambda - \frac{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}}{\pi_k}
\implies \sum_{k=1}^{K}\pi_k\lambda = \sum_{k=1}^{K}\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}
\implies \lambda = 1
\implies \pi_k = \left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}};
\]

where $\lambda = 1$ follows because the mixing proportions sum to one and, $\bm{\check{X}}$ being one-hot, so do the expected statistics $\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}$; then the class-conditional means:

\[
0 \stackrel{\text{set}}{=} \frac{\partial\mathcal{L}}{\partial\bm{\mu}_k^{\text{T}}} = \left\langle \check{X}_k \left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\mathbf{\Sigma}_k^{-1} \right\rangle_{\bm{\check{X}},\bm{Y}}
\implies \bm{\mu}_k = \frac{\left\langle \check{X}_k \bm{Y} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}};
\]

and the class-conditional covariances:

\[
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{\Sigma}_k^{-1}}
&= \left\langle -\check{X}_k \left[\frac{1}{2}\mathbf{\Sigma}_k - \frac{1}{2}\left(\bm{\mu}_k-\bm{Y}\right)\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\right] \right\rangle_{\bm{\check{X}},\bm{Y}}\\
\implies \mathbf{\Sigma}_k
&= \frac{\left\langle \check{X}_k \left(\bm{\mu}_k-\bm{Y}\right)\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}}
= \frac{\left\langle \check{X}_k \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}} - \bm{\mu}_k\bm{\mu}_k^{\text{T}}.
\end{split}
\]
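As an aside, the covariance step uses two standard matrix-calculus identities, stated here for an unconstrained matrix argument $\mathbf{A}$ (the symmetry of the precision matrix is, as is standard in this derivation, ignored when differentiating):

\[
\frac{\partial}{\partial \mathbf{A}} \log\left\lvert \mathbf{A} \right\rvert = \mathbf{A}^{-\text{T}},\qquad
\frac{\partial}{\partial \mathbf{A}}\, \bm{x}^{\text{T}}\mathbf{A}\bm{x} = \bm{x}\bm{x}^{\text{T}},
\]

applied with $\mathbf{A} = \mathbf{\Sigma}_k^{-1}$ (which is symmetric, so $\mathbf{A}^{-\text{T}} = \mathbf{\Sigma}_k$) and $\bm{x} = \bm{\mu}_k - \bm{Y}$.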

Whether in EM or under a fully observed model, the optimal parameters are intuitive. The optimal mixing proportion, emission mean, and emission covariance for class $k$ are their sample counterparts, i.e. the sample proportion, sample mean, and sample covariance (resp.); or, to put it yet another way, the average number of occurrences of class $k$, the average value of the samples $\bm{Y}$ from class $k$, and the average covariance of the samples $\bm{Y}$ from class $k$. The difference between learning (in EM) with latent, rather than fully observed, classes is that these are weighted, rather than unweighted, averages. In particular, class assignments are soft: each class takes some continuous-valued responsibility (margin: class responsibilities) for each datum $\bm{y}_n$, namely $\mathbb{E}_{\hat{p}}\left[\check{X}_k\middle|\bm{y}_n\right] = \hat{p}\left(\check{X}_k=1\middle|\bm{y}_n;\bm{\theta}_{\text{old}}\right)$, the probability of that class under the (previous) posterior distribution.

For example, when the class labels are observed, the denominator in the equation for the optimal mean is just the (relative) number of times class $k$ occurred, and the numerator picks out just those samples $\bm{y}_n$ associated with class $k$. Their ratio is the sample average of the $\bm{y}_n$ in class $k$. But when the class $\bm{\check{X}}$ is latent, the denominator is the average soft assignment or responsibility of class $k$, and the numerator is the (soft) average of all observations $\bm{y}_n$, each weighted by the responsibility that class $k$ takes for it.
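To make the weighted averages concrete, here is a minimal NumPy sketch of the M step, assuming the responsibilities have already been collected into an $N \times K$ array; the function and variable names (`m_step`, `Y`, `R`, the small `eps` guard) are illustrative, not from the text:

```python
import numpy as np

def m_step(Y, R, eps=1e-12):
    """One M step for a GMM: responsibility-weighted sample statistics.

    Y : (N, D) array of observations y_n.
    R : (N, K) array of responsibilities, R[n, k] = E[X_k | y_n]; rows sum to one.
    """
    N, D = Y.shape
    K = R.shape[1]
    Nk = R.sum(axis=0) + eps                 # soft counts (N times <X_k>)
    pi = Nk / N                              # optimal mixing proportions
    mu = (R.T @ Y) / Nk[:, None]             # responsibility-weighted means
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = Y - mu[k]                     # deviations from the class-k mean
        Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return pi, mu, Sigma
```

With hard, one-hot responsibilities the same code reduces to the fully observed case: counts, per-class means, and per-class covariances.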

The E step.

Evidently, the expected sufficient statistics are $\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}$, $\left\langle \check{X}_k \bm{Y} \right\rangle_{\bm{\check{X}},\bm{Y}}$, and $\left\langle \check{X}_k \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}$. During EM, i.e. when the averaging distribution is $\hat{p}\left(\bm{\hat{x}}\middle|\bm{y};\bm{\theta}^{\text{old}}\right) p\left(\bm{y}\right)$, these are written more explicitly as

\[
\left\langle \mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{Y}\right] \right\rangle_{\bm{Y}},\qquad
\left\langle \mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{Y}\right]\bm{Y} \right\rangle_{\bm{Y}},\qquad
\left\langle \mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{Y}\right]\bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}}.
\]

We derived the posterior mean that occurs in all of these expressions in Section 2.1.1. We repeat Eqs. 2.7 and 2.8 here for convenience:

\[
\mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{y}\right] = \operatorname{softmax}\left\{\bm{z}\right\}_k,\qquad
z_k = \log\pi_k + \frac{1}{2}\log\left\lvert\mathbf{\Sigma}_k^{-1}\right\rvert - \frac{1}{2}\left(\bm{y}-\bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{y}-\bm{\mu}_k\right).
\]
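A matching NumPy sketch of the E step, under the same illustrative conventions, computes the scores $z_k$ and pushes them through a (numerically stabilized) softmax:

```python
import numpy as np

def e_step(Y, pi, mu, Sigma):
    """Responsibilities E[X_k | y_n]: a softmax over the log-scores z_k."""
    N, _ = Y.shape
    K = len(pi)
    Z = np.empty((N, K))
    for k in range(K):
        P = np.linalg.inv(Sigma[k])                        # precision Sigma_k^{-1}
        diff = Y - mu[k]
        quad = np.einsum('nd,de,ne->n', diff, P, diff)     # (y - mu_k)^T Sigma_k^{-1} (y - mu_k)
        _, logdet = np.linalg.slogdet(P)                   # log |Sigma_k^{-1}|
        Z[:, k] = np.log(pi[k]) + 0.5 * logdet - 0.5 * quad
    Z -= Z.max(axis=1, keepdims=True)                      # stabilize the exponentials
    R = np.exp(Z)
    return R / R.sum(axis=1, keepdims=True)
```

With the earlier sketch, one EM iteration is then `pi, mu, Sigma = m_step(Y, e_step(Y, pi, mu, Sigma))`.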

7.1.1 $K$-means

In Section 2.1.1, we saw what happens to the posterior of the GMM when all classes use the same covariance matrix, $\mathbf{\Sigma}$. The class boundaries become lines, and the responsibility of class $j$ for observation $\bm{y}$ becomes

\[
\hat{p}\left(\hat{X}_j=1\middle|\bm{y};\bm{\theta}\right)
= \frac{\exp\left\{-\frac{1}{2}\left(\bm{y}-\bm{\mu}_j\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{y}-\bm{\mu}_j\right)\right\}\pi_j}{\sum_{k=1}^{K}\exp\left\{-\frac{1}{2}\left(\bm{y}-\bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{y}-\bm{\mu}_k\right)\right\}\pi_k}.
\tag{7.1}
\]
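As a brief check, with a shared covariance the term $\frac{1}{2}\log\lvert\mathbf{\Sigma}^{-1}\rvert$ in $z_k$ is identical across classes and therefore cancels in the softmax, leaving

\[
\hat{p}\left(\hat{X}_j=1\middle|\bm{y};\bm{\theta}\right) = \operatorname{softmax}\left\{\bm{z}\right\}_j,\qquad
z_k = \log\pi_k - \frac{1}{2}\left(\bm{y}-\bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{y}-\bm{\mu}_k\right),
\]

which is Eq. 7.1 written out. Expanding the quadratic, the term $-\frac{1}{2}\bm{y}^{\text{T}}\mathbf{\Sigma}^{-1}\bm{y}$ is likewise shared by all classes, so the scores differ only in terms linear in $\bm{y}$; hence the linear class boundaries.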

It is not hard to see that, in the limit of infinite precision (e.g. a shared covariance $\mathbf{\Sigma} = \epsilon\mathbf{I}$ with $\epsilon \to 0$), the responsibility in Eq. 7.1 goes to zero unless $\bm{y}$ is closer to $\bm{\mu}_j$ than to any other mean, in which case it goes to 1. The prior probabilities $\bm{\pi}$ have become irrelevant. Then the algorithm becomes

$K$-Means

$\bullet$ E step: $\check{p}^{(i+1)}\left(\hat{X}_j=1\middle|\bm{y}\right) \leftarrow \begin{cases} 1, & \text{if } j = \operatorname{argmin}_k \left\lVert \bm{y} - \bm{\mu}_k^{(i)} \right\rVert\\ 0, & \text{otherwise} \end{cases}$
$\bullet$ M step: $\bm{\mu}_j^{(i+1)} \leftarrow \dfrac{\left\langle \check{p}^{(i+1)}\left(\hat{X}_j=1\middle|\bm{Y}\right)\bm{Y} \right\rangle_{\bm{Y}}}{\left\langle \check{p}^{(i+1)}\left(\hat{X}_j=1\middle|\bm{Y}\right) \right\rangle_{\bm{Y}}}$

This algorithm, which predates EM for the GMM, is known as $K$-means (margin: $K$-means).
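A minimal NumPy sketch of this loop, again with illustrative names (`k_means`, `Y`, `labels`) and assuming Euclidean distance (i.e. a shared isotropic covariance):

```python
import numpy as np

def k_means(Y, K, n_iters=100, seed=0):
    """K-means: EM for a GMM with hard 0/1 responsibilities and a shared isotropic covariance."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), size=K, replace=False)].astype(float)  # init at random data points
    labels = np.zeros(len(Y), dtype=int)
    for _ in range(n_iters):
        # E step: hard-assign each datum to its nearest mean.
        dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=-1)   # (N, K) distances
        labels = dists.argmin(axis=1)
        # M step: each mean becomes the (unweighted) average of the data assigned to it.
        for j in range(K):
            if np.any(labels == j):                 # leave empty clusters where they are
                mu[j] = Y[labels == j].mean(axis=0)
    return mu, labels
```

Each pair of steps can only decrease the total within-cluster squared distance, so the loop settles into a local optimum; in practice one would stop when the assignments no longer change rather than running a fixed number of iterations.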