7.1 The Gaussian mixture model and $K$-means

We return once again to the GMM, but this time, armed with the EM algorithm, we finally derive the learning rules. The joint cross entropy for a Gaussian mixture model is

\[
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \hat{p}\left(\bm{Y}\middle|\bm{\check{X}};\bm{\theta}\right) \hat{p}\left(\bm{\check{X}};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \prod_{k=1}^{K} \left[\mathcal{N}\left(\bm{Y};\,\bm{\mu}_k,\,\mathbf{\Sigma}_k\right)\pi_k\right]^{\check{X}_k} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\sum_{k=1}^{K} \check{X}_k \left[\frac{1}{2}\log\left\lvert\mathbf{\Sigma}_k^{-1}\right\rvert - \frac{1}{2}\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{\mu}_k-\bm{Y}\right) + \log\pi_k\right] \right\rangle_{\bm{\check{X}},\bm{Y}} + C.
\end{split}
\]

To enforce the fact that the prior probabilities sum to one, we can augment the loss with a Lagrangian term:

\[
\begin{split}
\mathcal{L}(\bm{\theta})
&= \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right) + \text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]\\
&= \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right) + \left\langle -\sum_{k=1}^{K} \check{X}_k \left[\frac{1}{2}\log\left\lvert\mathbf{\Sigma}_k^{-1}\right\rvert - \frac{1}{2}\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{\mu}_k-\bm{Y}\right) + \log\pi_k\right] \right\rangle_{\bm{\check{X}},\bm{Y}} + C.
\end{split}
\]

The M step.

We take the derivatives in turn. First the mixing proportions:

\[
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\pi_k} = \lambda - \frac{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}}{\pi_k}
\implies \sum_{k=1}^{K}\pi_k\lambda = \sum_{k=1}^{K}\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}
\implies \lambda = 1
\implies \pi_k = \left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}};
\]

where $\lambda = 1$ follows because the mixing proportions sum to one and, $\bm{\check{X}}$ being one-hot, so do the expected statistics $\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}$; then the class-conditional means:

\[
0 \stackrel{\text{set}}{=} \frac{\partial\mathcal{L}}{\partial\bm{\mu}_k^{\text{T}}} = \left\langle \check{X}_k \left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\mathbf{\Sigma}_k^{-1} \right\rangle_{\bm{\check{X}},\bm{Y}}
\implies \bm{\mu}_k = \frac{\left\langle \check{X}_k \bm{Y} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}};
\]

and the class-conditional covariances:

\[
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{\Sigma}_k^{-1}}
&= \left\langle -\check{X}_k \left[\frac{1}{2}\mathbf{\Sigma}_k - \frac{1}{2}\left(\bm{\mu}_k-\bm{Y}\right)\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}}\right] \right\rangle_{\bm{\check{X}},\bm{Y}}\\
\implies \mathbf{\Sigma}_k
&= \frac{\left\langle \check{X}_k \left(\bm{\mu}_k-\bm{Y}\right)\left(\bm{\mu}_k-\bm{Y}\right)^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}}
= \frac{\left\langle \check{X}_k \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}} - \bm{\mu}_k\bm{\mu}_k^{\text{T}}.
\end{split}
\]
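As an aside, the covariance step uses two standard matrix-calculus identities, stated here for an unconstrained matrix argument $\mathbf{A}$ (the symmetry of the precision matrix is, as is standard in this derivation, ignored when differentiating):

\[
\frac{\partial}{\partial \mathbf{A}} \log\left\lvert \mathbf{A} \right\rvert = \mathbf{A}^{-\text{T}},\qquad
\frac{\partial}{\partial \mathbf{A}}\, \bm{x}^{\text{T}}\mathbf{A}\bm{x} = \bm{x}\bm{x}^{\text{T}},
\]

applied with $\mathbf{A} = \mathbf{\Sigma}_k^{-1}$ (which is symmetric, so $\mathbf{A}^{-\text{T}} = \mathbf{\Sigma}_k$) and $\bm{x} = \bm{\mu}_k - \bm{Y}$.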

Whether in EM or under a fully observed model, the optimal parameters are intuitive. The optimal mixing proportion, emission mean, and emission covariance for class $k$ are their sample counterparts, i.e. the sample proportion, sample mean, and sample covariance (resp.); or, to put it yet another way, the average number of occurrences of class $k$, the average value of the samples $\bm{Y}$ from class $k$, and the average covariance of the samples $\bm{Y}$ from class $k$. The difference between learning (in EM) with latent, rather than fully observed, classes is that these are weighted, rather than unweighted, averages. In particular, class assignments are soft: each class takes some continuous-valued responsibility (margin: class responsibilities) for each datum $\bm{y}_n$, namely $\mathbb{E}_{\hat{p}}\left[\check{X}_k\middle|\bm{y}_n\right] = \hat{p}\left(\check{X}_k=1\middle|\bm{y}_n;\bm{\theta}_{\text{old}}\right)$, the probability of that class under the (previous) posterior distribution.

For example, when the class labels are observed, the denominator in the equation for the optimal mean is just the (relative) number of times class $k$ occurred, and the numerator picks out just those samples $\bm{y}_n$ associated with class $k$. Their ratio is the sample average of the $\bm{y}_n$ in class $k$. But when the class $\bm{\check{X}}$ is latent, the denominator is the average soft assignment or responsibility of class $k$, and the numerator is the (soft) average of all observations $\bm{y}_n$, each weighted by the responsibility that class $k$ takes for it.
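To make the weighted averages concrete, here is a minimal NumPy sketch of the M step, assuming the responsibilities have already been collected into an $N \times K$ array; the function and variable names (`m_step`, `Y`, `R`, the small `eps` guard) are illustrative, not from the text:

```python
import numpy as np

def m_step(Y, R, eps=1e-12):
    """One M step for a GMM: responsibility-weighted sample statistics.

    Y : (N, D) array of observations y_n.
    R : (N, K) array of responsibilities, R[n, k] = E[X_k | y_n]; rows sum to one.
    """
    N, D = Y.shape
    K = R.shape[1]
    Nk = R.sum(axis=0) + eps                 # soft counts (N times <X_k>)
    pi = Nk / N                              # optimal mixing proportions
    mu = (R.T @ Y) / Nk[:, None]             # responsibility-weighted means
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = Y - mu[k]                     # deviations from the class-k mean
        Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return pi, mu, Sigma
```

With hard, one-hot responsibilities the same code reduces to the fully observed case: counts, per-class means, and per-class covariances.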

The E step.

Evidently, the expected sufficient statistics are $\left\langle \check{X}_k \right\rangle_{\bm{\check{X}},\bm{Y}}$, $\left\langle \check{X}_k \bm{Y} \right\rangle_{\bm{\check{X}},\bm{Y}}$, and $\left\langle \check{X}_k \bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}$. During EM, i.e. when the averaging distribution is $\hat{p}\left(\bm{\hat{x}}\middle|\bm{y};\bm{\theta}^{\text{old}}\right) p\left(\bm{y}\right)$, these are written more explicitly as

\[
\left\langle \mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{Y}\right] \right\rangle_{\bm{Y}},\qquad
\left\langle \mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{Y}\right]\bm{Y} \right\rangle_{\bm{Y}},\qquad
\left\langle \mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{Y}\right]\bm{Y}\bm{Y}^{\text{T}} \right\rangle_{\bm{Y}}.
\]

We derived the posterior mean that occurs in all of these expressions in Section 2.1.1. We repeat Eqs. 2.7 and 2.8 here for convenience:

\[
\mathbb{E}_{\check{X}_k|\bm{Y}}\left[\check{X}_k\middle|\bm{y}\right] = \operatorname{softmax}\left\{\bm{z}\right\}_k,\qquad
z_k = \log\pi_k + \frac{1}{2}\log\left\lvert\mathbf{\Sigma}_k^{-1}\right\rvert - \frac{1}{2}\left(\bm{y}-\bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}_k^{-1}\left(\bm{y}-\bm{\mu}_k\right).
\]
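A matching NumPy sketch of the E step, under the same illustrative conventions, computes the scores $z_k$ and pushes them through a (numerically stabilized) softmax:

```python
import numpy as np

def e_step(Y, pi, mu, Sigma):
    """Responsibilities E[X_k | y_n]: a softmax over the log-scores z_k."""
    N, _ = Y.shape
    K = len(pi)
    Z = np.empty((N, K))
    for k in range(K):
        P = np.linalg.inv(Sigma[k])                        # precision Sigma_k^{-1}
        diff = Y - mu[k]
        quad = np.einsum('nd,de,ne->n', diff, P, diff)     # (y - mu_k)^T Sigma_k^{-1} (y - mu_k)
        _, logdet = np.linalg.slogdet(P)                   # log |Sigma_k^{-1}|
        Z[:, k] = np.log(pi[k]) + 0.5 * logdet - 0.5 * quad
    Z -= Z.max(axis=1, keepdims=True)                      # stabilize the exponentials
    R = np.exp(Z)
    return R / R.sum(axis=1, keepdims=True)
```

With the earlier sketch, one EM iteration is then `pi, mu, Sigma = m_step(Y, e_step(Y, pi, mu, Sigma))`.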

7.1.1 $K$-means

In Section 2.1.1, we saw what happens to the posterior of the GMM when all classes use the same covariance matrix, $\mathbf{\Sigma}$. The class boundaries become lines, and the responsibility of class $j$ for observation $\bm{y}$ becomes

\[
\hat{p}\left(\hat{X}_j=1\middle|\bm{y};\bm{\theta}\right)
= \frac{\exp\left\{-\frac{1}{2}\left(\bm{y}-\bm{\mu}_j\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{y}-\bm{\mu}_j\right)\right\}\pi_j}{\sum_{k=1}^{K}\exp\left\{-\frac{1}{2}\left(\bm{y}-\bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{y}-\bm{\mu}_k\right)\right\}\pi_k}.
\tag{7.1}
\]
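As a brief check, with a shared covariance the term $\frac{1}{2}\log\lvert\mathbf{\Sigma}^{-1}\rvert$ in $z_k$ is identical across classes and therefore cancels in the softmax, leaving

\[
\hat{p}\left(\hat{X}_j=1\middle|\bm{y};\bm{\theta}\right) = \operatorname{softmax}\left\{\bm{z}\right\}_j,\qquad
z_k = \log\pi_k - \frac{1}{2}\left(\bm{y}-\bm{\mu}_k\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{y}-\bm{\mu}_k\right),
\]

which is Eq. 7.1 written out. Expanding the quadratic, the term $-\frac{1}{2}\bm{y}^{\text{T}}\mathbf{\Sigma}^{-1}\bm{y}$ is likewise shared by all classes, so the scores differ only in terms linear in $\bm{y}$; hence the linear class boundaries.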

It is not hard to see that, in the limit of infinite precision (e.g. a shared covariance $\mathbf{\Sigma} = \epsilon\mathbf{I}$ with $\epsilon \to 0$), the responsibility in Eq. 7.1 goes to zero unless $\bm{y}$ is closer to $\bm{\mu}_j$ than to any other mean, in which case it goes to 1. The prior probabilities $\bm{\pi}$ have become irrelevant. Then the algorithm becomes

$K$-Means

$\bullet$ E step: $\check{p}^{(i+1)}\left(\hat{X}_j=1\middle|\bm{y}\right) \leftarrow \begin{cases} 1, & \text{if } j = \operatorname{argmin}_k \left\lVert \bm{y} - \bm{\mu}_k^{(i)} \right\rVert\\ 0, & \text{otherwise} \end{cases}$
$\bullet$ M step: $\bm{\mu}_j^{(i+1)} \leftarrow \dfrac{\left\langle \check{p}^{(i+1)}\left(\hat{X}_j=1\middle|\bm{Y}\right)\bm{Y} \right\rangle_{\bm{Y}}}{\left\langle \check{p}^{(i+1)}\left(\hat{X}_j=1\middle|\bm{Y}\right) \right\rangle_{\bm{Y}}}$

This algorithm, which predates EM for the GMM, is known as $K$-means (margin: $K$-means).
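A minimal NumPy sketch of this loop, again with illustrative names (`k_means`, `Y`, `labels`) and assuming Euclidean distance (i.e. a shared isotropic covariance):

```python
import numpy as np

def k_means(Y, K, n_iters=100, seed=0):
    """K-means: EM for a GMM with hard 0/1 responsibilities and a shared isotropic covariance."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), size=K, replace=False)].astype(float)  # init at random data points
    labels = np.zeros(len(Y), dtype=int)
    for _ in range(n_iters):
        # E step: hard-assign each datum to its nearest mean.
        dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=-1)   # (N, K) distances
        labels = dists.argmin(axis=1)
        # M step: each mean becomes the (unweighted) average of the data assigned to it.
        for j in range(K):
            if np.any(labels == j):                 # leave empty clusters where they are
                mu[j] = Y[labels == j].mean(axis=0)
    return mu, labels
```

Each pair of steps can only decrease the total within-cluster squared distance, so the loop settles into a local optimum; in practice one would stop when the assignments no longer change rather than running a fixed number of iterations.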