6.1 Introduction

[[ We are concerned with learning in generative models under two circumstances, “supervised” and “unsupervised.” The paradigm case of each is illustrated in Fig.~\ref{fig:clustering}. In Fig.~\ref{subfig:labeledClusters}, each datum ($\bm{y}$) comes with a class label. But despite the “missing” labels in Fig.~\ref{subfig:unlabeledClusters}, class structure is still perspicuous. Both sets of data can be fit with a Gaussian mixture model (Chapter 2), but in the case of Fig.~\ref{subfig:unlabeledClusters}, the “source” variables $\bm{\hat{X}}$ are “latent” or unobserved. The learning algorithms for the two cases are consequently different. Nevertheless, for the GMM, the supervised-learning algorithm can be written as a special case of the unsupervised-learning algorithm, and indeed this is often the case for other models as well. So we shall concentrate on the unsupervised problems, deriving the supervised solutions along the way. ]]

Density estimation.

Let us begin with an even simpler data set to model, Fig.~\ref{subfig:cluster}. The data look to be distributed normally, so it would be sensible simply to let $\hat{p}(\bm{\hat{y}};\bm{\theta})$ be $\mathcal{N}(\bm{m},\,\mathbf{S})$, with $\bm{m}$ and $\mathbf{S}$ the sample mean and sample covariance. But let us proceed somewhat naïvely according to the generic procedure introduced in Section 4.2. The procedure enjoins us to minimize the relative entropy; or, equivalently, since the entropy doesn't depend on the parameters, the cross entropy:

\begin{equation*}
\begin{split}
\mathcal{L} &= \mathbb{E}_{\bm{Y}}\left[-\log\hat{p}(\bm{Y};\bm{\theta})\right]\\
&= \mathbb{E}_{\bm{Y}}\left[-\log\left(\tau^{-M/2}\left\lvert\mathbf{\Sigma}\right\rvert^{-1/2}\exp\left\{-\frac{1}{2}\left(\bm{Y}-\bm{\mu}\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{Y}-\bm{\mu}\right)\right\}\right)\right]\\
&= \frac{1}{2}\mathbb{E}_{\bm{Y}}\left[M\log\tau - \log\left\lvert\mathbf{\Sigma}^{-1}\right\rvert + \left(\bm{Y}-\bm{\mu}\right)^{\text{T}}\mathbf{\Sigma}^{-1}\left(\bm{Y}-\bm{\mu}\right)\right].
\end{split}
\end{equation*}
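
To make the objective concrete, here is a minimal numerical sketch in Python/NumPy, assuming $\tau$ denotes $2\pi$ (as the Gaussian normalizer requires) and replacing the expectation with an average over samples, as above. The function name \texttt{gaussian\_cross\_entropy} and the toy data are illustrative choices only, not anything fixed by the text.

\begin{verbatim}
import numpy as np

def gaussian_cross_entropy(Y, mu, Sigma):
    """Sample-average estimate of E_Y[-log N(Y; mu, Sigma)], with tau = 2*pi."""
    N, M = Y.shape
    diff = Y - mu                               # (N, M): Y - mu, one row per sample
    solved = np.linalg.solve(Sigma, diff.T).T   # rows of Sigma^{-1} (Y - mu)
    quad = np.einsum('nm,nm->n', diff, solved)  # (Y - mu)^T Sigma^{-1} (Y - mu)
    logdet = np.linalg.slogdet(Sigma)[1]        # log|Sigma| = -log|Sigma^{-1}|
    return 0.5 * (M * np.log(2 * np.pi) + logdet + quad.mean())

# toy data standing in for samples from the (unavailable) data distribution
rng = np.random.default_rng(0)
Y = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.6], [0.6, 1.0]], size=5000)
print(gaussian_cross_entropy(Y, Y.mean(axis=0), np.cov(Y.T)))
\end{verbatim}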

Differentiating with respect to $\bm{\mu}$ indeed indicates that $\bm{\mu}$ should be set equal to the sample average:

\begin{equation*}
\frac{\mathop{}\!\mathrm{d}\mathcal{L}}{\mathop{}\!\mathrm{d}\bm{\mu}}
= -\mathbb{E}_{\bm{Y}}\left[\mathbf{\Sigma}^{-1}\left(\bm{Y}-\bm{\mu}\right)\right]
\stackrel{\text{set}}{=} 0
\implies \bm{\mu} = \mathbb{E}_{\bm{Y}}\left[\bm{Y}\right] \approx \left\langle\bm{Y}\right\rangle_{\bm{Y}},
\end{equation*}

where in the final step we approximate the expectation under the (unavailable) data distribution with an average over (available) samples from it. Likewise, differentiating with respect to $\mathbf{\Sigma}^{-1}$, we find (after consulting Section B.1)

\begin{equation*}
\begin{split}
\frac{\mathop{}\!\mathrm{d}\mathcal{L}}{\mathop{}\!\mathrm{d}\mathbf{\Sigma}^{-1}}
&= \frac{1}{2}\mathbb{E}_{\bm{Y}}\left[-\mathbf{\Sigma} + \left(\bm{Y}-\bm{\mu}\right)\left(\bm{Y}-\bm{\mu}\right)^{\text{T}}\right]
\stackrel{\text{set}}{=} 0\\
\implies \mathbf{\Sigma} &= \mathbb{E}_{\bm{Y}}\left[\left(\bm{Y}-\bm{\mu}\right)\left(\bm{Y}-\bm{\mu}\right)^{\text{T}}\right]
\approx \left\langle\left(\bm{Y}-\bm{\mu}\right)\left(\bm{Y}-\bm{\mu}\right)^{\text{T}}\right\rangle_{\bm{Y}}.
\end{split}
\end{equation*}
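
Continuing the sketch above, one can verify numerically that the sample mean and the (divide-by-$N$) sample covariance do satisfy these stationarity conditions and minimize the sample-averaged cross entropy. The finite-difference check and the perturbation scheme below are arbitrary illustrative choices; the names reuse \texttt{Y} and \texttt{gaussian\_cross\_entropy} from the previous sketch.

\begin{verbatim}
mu_hat = Y.mean(axis=0)                       # the sample mean
diff = Y - mu_hat
Sigma_hat = diff.T @ diff / Y.shape[0]        # <(Y - mu)(Y - mu)^T>: divide-by-N covariance

# (1) finite-difference check that the gradient w.r.t. mu vanishes at the sample mean
eps = 1e-5
fd_grad = np.array([
    (gaussian_cross_entropy(Y, mu_hat + eps * e, Sigma_hat)
     - gaussian_cross_entropy(Y, mu_hat - eps * e, Sigma_hat)) / (2 * eps)
    for e in np.eye(mu_hat.size)])
print(fd_grad)                                # approximately [0, 0]

# (2) nearby perturbations of (mu, Sigma) should not do better
best = gaussian_cross_entropy(Y, mu_hat, Sigma_hat)
rng = np.random.default_rng(1)
for _ in range(5):
    A = 0.05 * rng.standard_normal(Sigma_hat.shape)     # A @ A.T keeps Sigma positive definite
    mu_pert = mu_hat + 0.05 * rng.standard_normal(mu_hat.shape)
    assert best <= gaussian_cross_entropy(Y, mu_pert, Sigma_hat + A @ A.T)
\end{verbatim}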

So far, so good. We now proceed to the dataset shown in Fig.~\ref{subfig:unlabeledClusters}. Here, by all appearances, is a mixture of Gaussians. In Section 2.1.1 we derived the marginal distribution for the GMM, Eq. 2.10, so perhaps we can use the same procedure as for the single Gaussian. The cross-entropy loss is

\begin{equation}
\begin{split}
\mathcal{L} &= \mathbb{E}_{\bm{Y}}\left[-\log\hat{p}(\bm{Y};\bm{\theta})\right]\\
&= \mathbb{E}_{\bm{Y}}\left[-\log\left(\sum_{k=1}^{K}\hat{p}\left(\bm{Y}\,\middle|\,\hat{X}_{k}=1;\bm{\theta}\right)\pi_{k}\right)\right]\\
&= \mathbb{E}_{\bm{Y}}\left[-\log\left(\sum_{k=1}^{K}\tau^{-M/2}\left\lvert\mathbf{\Sigma}_{k}\right\rvert^{-1/2}\exp\left\{-\frac{1}{2}\left(\bm{Y}-\bm{\mu}_{k}\right)^{\text{T}}\mathbf{\Sigma}_{k}^{-1}\left(\bm{Y}-\bm{\mu}_{k}\right)\right\}\pi_{k}\right)\right].
\end{split}
\tag{6.1}
\end{equation}
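
As before, here is a minimal sketch of this loss, again assuming $\tau = 2\pi$ and averaging over samples. It uses SciPy's \texttt{logsumexp} and \texttt{multivariate\_normal} for numerical stability and the component log-densities; the function name \texttt{gmm\_cross\_entropy} and the toy mixture parameters are placeholders, not anything specified in the text.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_cross_entropy(Y, pis, mus, Sigmas):
    """Sample-average estimate of E_Y[-log sum_k pi_k N(Y; mu_k, Sigma_k)]."""
    # log pi_k + log N(y_n; mu_k, Sigma_k), for every sample n and class k
    log_joint = np.stack([np.log(pi) + multivariate_normal(mu, Sig).logpdf(Y)
                          for pi, mu, Sig in zip(pis, mus, Sigmas)], axis=1)
    return -logsumexp(log_joint, axis=1).mean()  # -log of the marginal, averaged

# toy mixture: K = 2 components in M = 2 dimensions
pis = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
rng = np.random.default_rng(2)
Y2 = np.concatenate([rng.multivariate_normal(m, S, size=500)
                     for m, S in zip(mus, Sigmas)])
print(gmm_cross_entropy(Y2, pis, mus, Sigmas))
\end{verbatim}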

We have encountered a problem. The summation (across classes) inside the logarithm couples the parameters of the $K$ Gaussians together. In particular, differentiating with respect to $\bm{\mu}_{k}$ or $\mathbf{\Sigma}_{k}$ will yield an expression involving all the means and covariances. Solving these coupled equations (after setting the gradients to zero) is not straightforward.
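
To see the coupling concretely: differentiating Eq. 6.1 with respect to $\bm{\mu}_{k}$ (a standard calculation, not carried out here) weights each sample by a posterior factor $\pi_{k}\hat{p}(\bm{y}\,|\,\hat{X}_{k}=1;\bm{\theta})/\sum_{j}\pi_{j}\hat{p}(\bm{y}\,|\,\hat{X}_{j}=1;\bm{\theta})$ that depends on every component's mean, covariance, and mixing proportion; so the stationarity conditions no longer decouple. The sketch below, which continues the previous one (reusing \texttt{gmm\_cross\_entropy}, \texttt{Y2}, and the toy mixture parameters), computes this gradient and checks it against finite differences.

\begin{verbatim}
def grad_mu_k(Y, pis, mus, Sigmas, k):
    """Gradient of gmm_cross_entropy with respect to mu_k (analytic form)."""
    # pi_j N(y_n; mu_j, Sigma_j) for every sample n and component j
    dens = np.stack([pi * multivariate_normal(mu, Sig).pdf(Y)
                     for pi, mu, Sig in zip(pis, mus, Sigmas)], axis=1)
    weights = dens[:, k] / dens.sum(axis=1)   # depends on ALL components' parameters
    diff = Y - mus[k]
    return -(weights[:, None] * (diff @ np.linalg.inv(Sigmas[k]))).mean(axis=0)

def grad_mu_k_fd(Y, pis, mus, Sigmas, k, eps=1e-5):
    """Central-difference check of the same gradient."""
    g = np.zeros_like(mus[k], dtype=float)
    for i in range(mus[k].size):
        e = np.zeros_like(mus[k], dtype=float)
        e[i] = eps
        plus = [m + e if j == k else m for j, m in enumerate(mus)]
        minus = [m - e if j == k else m for j, m in enumerate(mus)]
        g[i] = (gmm_cross_entropy(Y, pis, plus, Sigmas)
                - gmm_cross_entropy(Y, pis, minus, Sigmas)) / (2 * eps)
    return g

print(grad_mu_k(Y2, pis, mus, Sigmas, k=0))
print(grad_mu_k_fd(Y2, pis, mus, Sigmas, k=0))  # the two should agree
\end{verbatim}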