6.2 Latent-variable density estimation

Now notice that the root of the problem, the summation sign in Eq. 6.1, was introduced by the marginalization. Indeed, the joint distribution, the product of Eqs. 2.3 and 2.4, is

\[
\hat{p}\left(\hat{\bm{x}},\hat{\bm{y}};\bm{\theta}\right)=\prod_{k}^{K}\left[\mathcal{N}\!\left(\bm{\mu}_{k},\,\mathbf{\Sigma}_{k}\right)\pi_{k}\right]^{\hat{x}_{k}}.
\tag{6.2}
\]

The log of this distribution evidently does decouple into a sum of $K$ terms, each involving only $\bm{\mu}_{k}$, $\mathbf{\Sigma}_{k}$, and $\pi_{k}$ (for a single $k$). So perhaps we should try to re-express the marginal cross entropy, or anyway its gradient, in terms of the joint cross entropy.
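To see the decoupling explicitly, take the logarithm of Eq. 6.2 (nothing beyond the product-and-exponent structure is used):

\[
\log\hat{p}\left(\hat{\bm{x}},\hat{\bm{y}};\bm{\theta}\right)=\sum_{k}^{K}\hat{x}_{k}\left[\log\mathcal{N}\!\left(\bm{\mu}_{k},\,\mathbf{\Sigma}_{k}\right)+\log\pi_{k}\right],
\]

a sum whose $k^{\text{th}}$ term touches only the parameters of component $k$.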

Introducing the joint distribution.

Starting with the gradient of a generic marginal cross entropy, we move the derivative into the expectation, “anti-marginalize” to restore the latent variables, and then rearrange terms:

\[
\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\mathbb{E}_{\bm{Y}}\!\left[-\log\hat{p}\left(\bm{Y};\bm{\theta}\right)\right]
&=\mathbb{E}_{\bm{Y}}\!\left[-\frac{1}{\hat{p}\left(\bm{Y};\bm{\theta}\right)}\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\hat{p}\left(\bm{Y};\bm{\theta}\right)\right]\\
&=\mathbb{E}_{\bm{Y}}\!\left[-\frac{1}{\hat{p}\left(\bm{Y};\bm{\theta}\right)}\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\int_{\hat{\bm{x}}}\hat{p}\left(\hat{\bm{x}},\bm{Y};\bm{\theta}\right)\mathrm{d}\hat{\bm{x}}\right]\\
&=\mathbb{E}_{\bm{Y}}\!\left[-\frac{1}{\hat{p}\left(\bm{Y};\bm{\theta}\right)}\int_{\hat{\bm{x}}}\hat{p}\left(\hat{\bm{x}},\bm{Y};\bm{\theta}\right)\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}\left(\hat{\bm{x}},\bm{Y};\bm{\theta}\right)\mathrm{d}\hat{\bm{x}}\right]\\
&=\mathbb{E}_{\bm{Y}}\!\left[-\int_{\hat{\bm{x}}}\hat{p}\left(\hat{\bm{x}}\,\middle|\,\bm{Y};\bm{\theta}\right)\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}\left(\hat{\bm{x}},\bm{Y};\bm{\theta}\right)\mathrm{d}\hat{\bm{x}}\right]\\
&=\mathbb{E}_{\bm{Y}}\!\left[\mathbb{E}_{\hat{\bm{X}}|\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}\left(\hat{\bm{X}},\bm{Y};\bm{\theta}\right)\,\middle|\,\bm{Y}\right]\right]\\
&=\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}\left(\hat{\bm{X}},\bm{Y};\bm{\theta}\right)\right],
\end{split}
\tag{6.3}
\]

where in the final line we have combined the data marginal and the model posterior into a single hybrid joint distribution, $\hat{p}(\hat{\bm{x}}\,|\,\bm{y};\bm{\theta})\,p(\bm{y})$. This looks promising. It almost says that the gradient of the marginal (or “incomplete”) cross entropy is the same as the gradient of a joint (or “complete”) cross entropy. But the derivative cannot pass outside the expectation, because the hybrid joint distribution depends, like the model distribution, on the parameters $\bm{\theta}$.
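The identity in Eq. 6.3 can also be checked numerically, provided the posterior is evaluated at the same parameters as the gradient. The sketch below is not from the text: it uses a toy one-dimensional, two-component mixture with illustrative variable names, and compares a finite-difference gradient of the marginal negative log-likelihood with the posterior-weighted gradient of the joint.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)                      # a few "data" points
mu = np.array([-1.0, 2.0])                  # component means
sigma2 = np.array([1.0, 0.5])               # component variances
pi = np.array([0.3, 0.7])                   # mixing weights

def marginal_nll(mu0):
    """Marginal negative log-likelihood, as a function of the first mean only."""
    means = np.array([mu0, mu[1]])
    dens = np.exp(-0.5 * (y[:, None] - means) ** 2 / sigma2) \
           / np.sqrt(2 * np.pi * sigma2)
    return -np.log(dens @ pi).sum()

# Left-hand side of Eq. 6.3 (w.r.t. mu_0), by central finite differences.
eps = 1e-5
lhs = (marginal_nll(mu[0] + eps) - marginal_nll(mu[0] - eps)) / (2 * eps)

# Right-hand side: posterior-weighted gradient of the joint negative log-likelihood.
dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
resp = dens * pi
resp /= resp.sum(axis=1, keepdims=True)     # p(x_k = 1 | y; theta)
rhs = np.sum(resp[:, 0] * (-(y - mu[0]) / sigma2[0]))

print(lhs, rhs)                             # agree up to finite-difference error
\end{verbatim}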

To see the implications of this dependence on the parameters, we return to our workhorse example, the Gaussian mixture model. Inserting Eq. 6.2 into the final line of Eq. 6.3 shows that

\[
\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\mathbb{E}_{\bm{Y}}\!\left[-\log\hat{p}\left(\bm{Y};\bm{\theta}\right)\right]
&=\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}\left(\hat{\bm{X}},\bm{Y};\bm{\theta}\right)\right]\\
&=\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\prod_{k}^{K}\left[\mathcal{N}\!\left(\bm{\mu}_{k},\,\mathbf{\Sigma}_{k}\right)\pi_{k}\right]^{\hat{X}_{k}}\right]\\
&=\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\sum_{k}^{K}\hat{X}_{k}\left(\frac{1}{2}\left(M\log\tau+\log\left\lvert\mathbf{\Sigma}_{k}\right\rvert+\left(\bm{Y}-\bm{\mu}_{k}\right)^{\text{T}}\mathbf{\Sigma}_{k}^{-1}\left(\bm{Y}-\bm{\mu}_{k}\right)\right)-\log\pi_{k}\right)\right].
\end{split}
\]

The gradient with respect to (e.g.) $\bm{\mu}_{k}$ is therefore

\[
\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\bm{\mu}_{k}}\mathbb{E}_{\bm{Y}}\!\left[-\log\hat{p}\left(\bm{Y};\bm{\theta}\right)\right]
&=\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[-\hat{X}_{k}\mathbf{\Sigma}_{k}^{-1}\left(\bm{Y}-\bm{\mu}_{k}\right)\right]\stackrel{\text{set}}{=}0\\
\implies\bm{\mu}_{k}&=\frac{\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[\hat{X}_{k}\bm{Y}\right]}{\mathbb{E}_{\hat{\bm{X}},\bm{Y}}\!\left[\hat{X}_{k}\right]}.
\end{split}
\tag{6.4}
\]

This formula looks elegant only if we forget that $\bm{\mu}_{k}$ appears on both sides. The expectations under the model posterior, $\hat{p}(\hat{\bm{x}}\,|\,\bm{Y};\bm{\theta})$, involve $\bm{\mu}_{k}$; indeed, they involve all of the parameters (recall Eq. 2.6)! This is reminiscent of the problem with the direct optimization of the marginal cross entropy, Eq. 6.1, so it may seem that we have made no progress.
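To make the circularity concrete, here is a minimal sketch, again not from the text: a toy one-dimensional mixture with two components, variances and mixing weights held fixed, and illustrative names throughout. Evaluating the right-hand side of Eq. 6.4 requires responsibilities computed from the current $\bm{\mu}_{k}$, so the assignment cannot be solved in closed form; it can only be re-evaluated, here as a naive fixed-point update.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
mu = np.array([0.0, 1.0])          # current guess for the means
sigma2 = np.array([1.0, 1.0])      # held fixed in this sketch
pi = np.array([0.5, 0.5])          # held fixed in this sketch

for _ in range(25):
    # The model posterior p(x_k = 1 | y; theta) depends on the current mu ...
    dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2 * np.pi * sigma2)
    resp = dens * pi
    resp /= resp.sum(axis=1, keepdims=True)

    # ... so Eq. 6.4, mu_k = E[X_k Y] / E[X_k], is only a re-evaluation, not a solution.
    mu = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)                          # settles near the data's component means (-2 and 3)
\end{verbatim}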