Starting with the gradient of a generic marginal cross entropy, we move the derivative into the expectation, “anti-marginalize” to restore the latent variables, and then rearrange terms:
$$\begin{aligned}
\nabla_{\theta}\, \mathbb{E}_{Y}\!\left[-\log \hat{p}(Y;\theta)\right]
 &= \mathbb{E}_{Y}\!\left[-\nabla_{\theta} \log \hat{p}(Y;\theta)\right]\\
 &= \mathbb{E}_{Y}\!\left[-\frac{\nabla_{\theta}\, \hat{p}(Y;\theta)}{\hat{p}(Y;\theta)}\right]\\
 &= \mathbb{E}_{Y}\!\left[-\frac{1}{\hat{p}(Y;\theta)} \sum_{\hat{x}} \nabla_{\theta}\, \hat{p}(\hat{x}, Y;\theta)\right]\\
 &= \mathbb{E}_{Y}\!\left[-\sum_{\hat{x}} \hat{p}(\hat{x}\,|\,Y;\theta)\, \nabla_{\theta} \log \hat{p}(\hat{x}, Y;\theta)\right]\\
 &= \mathbb{E}_{\hat{X},Y}\!\left[-\nabla_{\theta} \log \hat{p}(\hat{X}, Y;\theta)\right],
\end{aligned} \tag{6.3}$$
where in the final line we have combined the data marginal and the model posterior into a single hybrid joint distribution††margin: hybrid joint distribution, $p(\hat{x}, y; \theta) := \hat{p}(\hat{x}\,|\,y;\theta)\, p(y)$.
This looks promising.
It almost says that the gradient of the marginal (or “incomplete”) cross entropy is the same as the gradient of a joint (or “complete”) cross entropy.
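The final line of Eq. 6.3 is easy to check numerically. The sketch below is a minimal illustration in JAX, assuming a toy one-dimensional, two-component Gaussian mixture; the function names (log_joint, log_marginal) and the particular parameter values are ours, purely for illustration. It confirms that the gradient of the marginal surprisal, $-\log \hat{p}(y;\theta)$, equals the posterior-weighted average of the gradients of the joint surprisal, $-\log \hat{p}(\hat{x}, y;\theta)$, when the posterior weights are treated as fixed numbers.

```python
# Numerical check of the final line of Eq. 6.3 on a toy 1-D, two-component
# Gaussian mixture (illustrative parameterization, names, and values).
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

jax.config.update("jax_enable_x64", True)

def log_joint(theta, y):
    """log p(x=k, y; theta) for every component k (shape [K])."""
    log_prior = jax.nn.log_softmax(theta["logits"])
    log_lik = (-0.5 * jnp.log(2 * jnp.pi * theta["sigma2"])
               - 0.5 * (y - theta["mu"]) ** 2 / theta["sigma2"])
    return log_prior + log_lik

def log_marginal(theta, y):
    """log p(y; theta) = log sum_k p(x=k, y; theta)."""
    return logsumexp(log_joint(theta, y))

theta = {"logits": jnp.array([0.3, -0.3]),
         "mu": jnp.array([-1.0, 2.0]),
         "sigma2": jnp.array([0.5, 1.5])}
y = 0.7

# Left-hand side: gradient of the marginal surprisal.
lhs = jax.grad(lambda th: -log_marginal(th, y))(theta)

# Right-hand side: posterior-weighted average of the per-component gradients
# of the joint surprisal, with the posterior treated as fixed numbers.
posterior = jnp.exp(log_joint(theta, y) - log_marginal(theta, y))
per_component = jax.jacobian(lambda th: -log_joint(th, y))(theta)
rhs = {name: posterior @ jac for name, jac in per_component.items()}

for name in lhs:
    print(name, jnp.allclose(lhs[name], rhs[name]))  # each prints True
```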
But the derivative cannot pass outside the expectation, because the hybrid joint distribution depends, like the model distribution, on the parameters $\theta$.
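This dependence is easy to exhibit with the same toy mixture. In the sketch below (again illustrative names and values, not the text's), differentiating the posterior-weighted average of the joint surprisal through the $\theta$-dependent weights does not reproduce the gradient of the marginal surprisal; freezing the weights with stop_gradient does.

```python
# Why the derivative cannot pass outside the expectation in Eq. 6.3:
# differentiating "through" the theta-dependent posterior gives a different
# answer than the marginal gradient; freezing the posterior recovers it.
# (Toy 1-D, two-component mixture; names and values are illustrative.)
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

jax.config.update("jax_enable_x64", True)

def log_joint(mu, y):
    # two equiprobable components with unit variance and means mu[0], mu[1]
    return jnp.log(0.5) - 0.5 * jnp.log(2 * jnp.pi) - 0.5 * (y - mu) ** 2

def neg_log_marginal(mu, y):
    return -logsumexp(log_joint(mu, y))

def complete_ce(mu, y, freeze_posterior):
    lj = log_joint(mu, y)
    post = jax.nn.softmax(lj)            # model posterior p(x | y; theta)
    if freeze_posterior:
        post = jax.lax.stop_gradient(post)
    return -(post * lj).sum()            # posterior-weighted joint surprisal

mu, y = jnp.array([-1.0, 2.0]), 0.7

g_marginal = jax.grad(neg_log_marginal)(mu, y)
g_frozen   = jax.grad(complete_ce)(mu, y, freeze_posterior=True)
g_through  = jax.grad(complete_ce)(mu, y, freeze_posterior=False)

print(jnp.allclose(g_frozen, g_marginal))   # True:  posterior held fixed
print(jnp.allclose(g_through, g_marginal))  # False: posterior depends on theta
```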
To see the implications of this dependence on the parameters, we return to our workhorse example, the Gaussian mixture model.
Inserting Eq. 6.2 into the final line of Eq. 6.3 shows that
$$\nabla_{\theta}\, \mathbb{E}_{Y}\!\left[-\log \hat{p}(Y;\theta)\right] = \mathbb{E}_{\hat{X},Y}\!\left[-\nabla_{\theta}\!\left(\log \pi_{\hat{X}} + \log \mathcal{N}\!\left(Y;\; \mu_{\hat{X}},\, \Sigma_{\hat{X}}\right)\right)\right].$$
The gradient with respect to (e.g.) $\mu_k$ is therefore
$$\frac{\partial}{\partial \mu_k}\, \mathbb{E}_{Y}\!\left[-\log \hat{p}(Y;\theta)\right] = \mathbb{E}_{Y}\!\left[\hat{p}(\hat{x}=k\,|\,Y;\theta)\, \Sigma_k^{-1}\!\left(\mu_k - Y\right)\right]. \tag{6.4}$$
This formula looks elegant only if we forget that $\mu_k$ is on both sides.
The expectations under the model posterior, $\hat{p}(\hat{x}\,|\,y;\theta)$, involve $\mu_k$—indeed, they involve all of the parameters (recall Eq. 2.6)!
This is reminiscent of the problem with the direct optimization of the marginal cross entropy, Eq. 6.1, so it seems perhaps we have made no progress.
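Eq. 6.4 itself is still easy to verify, and the verification makes the circularity concrete. The sketch below (a toy one-dimensional mixture with illustrative names, parameters, and data) checks the responsibility-weighted formula for the gradient with respect to the means against automatic differentiation of the sampled marginal cross entropy. The responsibilities, $\hat{p}(\hat{x}=k\,|\,y;\theta)$, are computed from every entry of $\theta$, so Eq. 6.4 yields no closed-form solution for $\mu_k$.

```python
# Check of Eq. 6.4 on synthetic data (toy 1-D, two-component mixture;
# names, parameters, and data are illustrative).  The autodiff gradient of
# the sampled marginal cross entropy w.r.t. the means should match the
# responsibility-weighted average of (mu_k - y) / sigma2_k.
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

jax.config.update("jax_enable_x64", True)

def log_joint(theta, ys):
    """log p(x=k, y; theta) for each component k and each datum (shape [N, K])."""
    log_prior = jnp.log(theta["pi"])
    log_lik = (-0.5 * jnp.log(2 * jnp.pi * theta["sigma2"])
               - 0.5 * (ys[..., None] - theta["mu"]) ** 2 / theta["sigma2"])
    return log_prior + log_lik

def marginal_cross_entropy(theta, ys):
    """Sample average of -log p(y; theta) over the data ys."""
    return -logsumexp(log_joint(theta, ys), axis=-1).mean()

theta = {"pi": jnp.array([0.4, 0.6]),
         "mu": jnp.array([-1.0, 2.0]),
         "sigma2": jnp.array([0.5, 1.5])}
ys = jnp.array([-1.3, -0.2, 0.7, 1.9, 2.4, 3.1])

# Responsibilities p(x=k | y; theta): note they depend on *all* of theta.
lj = log_joint(theta, ys)
resp = jnp.exp(lj - logsumexp(lj, axis=-1, keepdims=True))        # [N, K]

# Right-hand side of Eq. 6.4 (sample average over ys), for every k at once.
rhs = (resp * (theta["mu"] - ys[:, None]) / theta["sigma2"]).mean(axis=0)

# Left-hand side: autodiff of the marginal cross entropy w.r.t. the means.
lhs = jax.grad(marginal_cross_entropy)(theta, ys)["mu"]

print(jnp.allclose(lhs, rhs))   # True
```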