6.2 Latent-variable density estimation
Now notice that the root of the problem, the summation sign in Eq. 6.1, was introduced by the marginalization. Indeed, the joint distribution, the product of Eqs. 2.3 and 2.4, is
The log of this distribution evidently does decouple into a sum of
Introducing the joint distribution.
Starting with the gradient of a generic marginal cross entropy, we move the derivative into the expectation, “anti-marginalize” to restore the latent variables, and then rearrange terms:
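As a sketch of these steps in one possible notation (assumed here, not taken from the text: observed variables $y$ with data distribution $p(y)$, discrete latent variables $x$, and a model $\hat{p}(x, y; \theta)$ with marginal $\hat{p}(y; \theta) = \sum_x \hat{p}(x, y; \theta)$), the calculation runs:

```latex
\begin{align*}
\frac{d}{d\theta}\, \mathbb{E}_{p(y)}\!\left[ -\log \hat{p}(y;\theta) \right]
  &= -\mathbb{E}_{p(y)}\!\left[ \frac{1}{\hat{p}(y;\theta)}
       \frac{d}{d\theta}\, \hat{p}(y;\theta) \right] \\
  &= -\mathbb{E}_{p(y)}\!\left[ \frac{1}{\hat{p}(y;\theta)}
       \sum_x \frac{d}{d\theta}\, \hat{p}(x,y;\theta) \right] \\
  &= -\mathbb{E}_{p(y)}\!\left[ \sum_x
       \frac{\hat{p}(x,y;\theta)}{\hat{p}(y;\theta)}\,
       \frac{d}{d\theta} \log \hat{p}(x,y;\theta) \right] \\
  &= -\mathbb{E}_{p(y)\,\hat{p}(x \mid y;\theta)}\!\left[
       \frac{d}{d\theta} \log \hat{p}(x,y;\theta) \right],
\end{align*}
```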
where in the final line we have combined the data marginal and the model posterior into a single hybrid joint distribution, which, through the model posterior, itself depends on the parameters.
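The identity can be checked numerically. The sketch below does so for a small one-dimensional Gaussian mixture, comparing a finite-difference gradient of the marginal cross entropy with respect to the component means against the posterior-weighted average of the joint log-likelihood gradient. All names (`log_joint`, `posterior`, and so on), the synthetic data, and the parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def log_joint(y, means, stds, weights):
    """log p(x=k, y_n) for every sample n and class k; shape (N, K)."""
    y = y[:, None]
    log_prior = np.log(weights)[None, :]
    log_emission = (-0.5 * ((y - means) / stds) ** 2
                    - np.log(stds) - 0.5 * np.log(2 * np.pi))
    return log_prior + log_emission

def marginal_cross_entropy(y, means, stds, weights):
    """Sample average of -log p(y_n), marginalizing over the class."""
    lj = log_joint(y, means, stds, weights)
    m = lj.max(axis=1, keepdims=True)  # log-sum-exp for stability
    log_marginal = m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))
    return -log_marginal.mean()

def posterior(y, means, stds, weights):
    """Model posterior p(x=k | y_n); shape (N, K), rows sum to 1."""
    lj = log_joint(y, means, stds, weights)
    r = np.exp(lj - lj.max(axis=1, keepdims=True))
    return r / r.sum(axis=1, keepdims=True)

def grad_means_via_posterior(y, means, stds, weights):
    """Average, under data samples and the model posterior, of the
    gradient of the joint log-likelihood with respect to each mean."""
    r = posterior(y, means, stds, weights)
    dlog_joint = (y[:, None] - means) / stds ** 2  # d log p(x=k, y)/d mu_k
    return -(r * dlog_joint).mean(axis=0)

def grad_means_finite_diff(y, means, stds, weights, eps=1e-6):
    """Central-difference gradient of the marginal cross entropy."""
    g = np.zeros_like(means)
    for k in range(len(means)):
        step = np.zeros_like(means)
        step[k] = eps
        up = marginal_cross_entropy(y, means + step, stds, weights)
        down = marginal_cross_entropy(y, means - step, stds, weights)
        g[k] = (up - down) / (2 * eps)
    return g

# Synthetic data and deliberately mismatched model parameters (assumed values).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
means = np.array([-1.0, 1.5])
stds = np.array([1.2, 0.8])
weights = np.array([0.4, 0.6])

g_post = grad_means_via_posterior(y, means, stds, weights)
g_fd = grad_means_finite_diff(y, means, stds, weights)
print("posterior-weighted gradient:", g_post)
print("finite-difference gradient: ", g_fd)
```

The two printed vectors should agree up to finite-difference error. The posterior must be recomputed whenever the parameters change, since the distribution being averaged over moves with them.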
To see the implications of this dependence on the parameters, we return to our workhorse example, the Gaussian mixture model. Inserting Eq. 6.2 into the final line of Eq. 6.3 shows that
The gradient with respect to (e.g.)
This formula looks elegant only if we forget that