Starting with the gradient of a generic marginal cross entropy, we move the derivative into the expectation, “anti-marginalize” to restore the latent variables, and then rearrange terms:
$$\begin{aligned}
\nabla_{\theta}\, \mathbb{E}_{Y}\!\left[-\log \hat{p}(Y;\theta)\right]
 &= \mathbb{E}_{Y}\!\left[-\nabla_{\theta} \log \hat{p}(Y;\theta)\right]\\
 &= \mathbb{E}_{Y}\!\left[-\frac{\nabla_{\theta}\, \hat{p}(Y;\theta)}{\hat{p}(Y;\theta)}\right]\\
 &= \mathbb{E}_{Y}\!\left[-\frac{1}{\hat{p}(Y;\theta)} \sum_{\hat{x}} \nabla_{\theta}\, \hat{p}(\hat{x}, Y;\theta)\right]\\
 &= \mathbb{E}_{Y}\!\left[-\sum_{\hat{x}} \hat{p}(\hat{x}\,|\,Y;\theta)\, \nabla_{\theta} \log \hat{p}(\hat{x}, Y;\theta)\right]\\
 &= \mathbb{E}_{\hat{X},Y}\!\left[-\nabla_{\theta} \log \hat{p}(\hat{X}, Y;\theta)\right],
\end{aligned} \tag{6.3}$$
where in the final line we have combined the data marginal and the model posterior into a single hybrid joint distribution††margin: hybrid joint distribution, $p(\hat{x}, y; \theta) := \hat{p}(\hat{x}\,|\,y;\theta)\, p(y)$.
This looks promising.
It almost says that the gradient of the marginal (or “incomplete”) cross entropy is the same as the gradient of a joint (or “complete”) cross entropy.
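The final line of Eq. 6.3 is easy to check numerically. The sketch below is a minimal illustration in JAX, assuming a toy one-dimensional, two-component Gaussian mixture; the function names (log_joint, log_marginal) and the particular parameter values are ours, purely for illustration. It confirms that the gradient of the marginal surprisal, $-\log \hat{p}(y;\theta)$, equals the posterior-weighted average of the gradients of the joint surprisal, $-\log \hat{p}(\hat{x}, y;\theta)$, when the posterior weights are treated as fixed numbers.

```python
# Numerical check of the final line of Eq. 6.3 on a toy 1-D, two-component
# Gaussian mixture (illustrative parameterization, names, and values).
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

jax.config.update("jax_enable_x64", True)

def log_joint(theta, y):
    """log p(x=k, y; theta) for every component k (shape [K])."""
    log_prior = jax.nn.log_softmax(theta["logits"])
    log_lik = (-0.5 * jnp.log(2 * jnp.pi * theta["sigma2"])
               - 0.5 * (y - theta["mu"]) ** 2 / theta["sigma2"])
    return log_prior + log_lik

def log_marginal(theta, y):
    """log p(y; theta) = log sum_k p(x=k, y; theta)."""
    return logsumexp(log_joint(theta, y))

theta = {"logits": jnp.array([0.3, -0.3]),
         "mu": jnp.array([-1.0, 2.0]),
         "sigma2": jnp.array([0.5, 1.5])}
y = 0.7

# Left-hand side: gradient of the marginal surprisal.
lhs = jax.grad(lambda th: -log_marginal(th, y))(theta)

# Right-hand side: posterior-weighted average of the per-component gradients
# of the joint surprisal, with the posterior treated as fixed numbers.
posterior = jnp.exp(log_joint(theta, y) - log_marginal(theta, y))
per_component = jax.jacobian(lambda th: -log_joint(th, y))(theta)
rhs = {name: posterior @ jac for name, jac in per_component.items()}

for name in lhs:
    print(name, jnp.allclose(lhs[name], rhs[name]))  # each prints True
```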
But the derivative cannot pass outside the expectation, because the hybrid joint distribution depends, like the model distribution, on the parameters $\theta$.
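This dependence is easy to exhibit with the same toy mixture. In the sketch below (again illustrative names and values, not the text's), differentiating the posterior-weighted average of the joint surprisal through the $\theta$-dependent weights does not reproduce the gradient of the marginal surprisal; freezing the weights with stop_gradient does.

```python
# Why the derivative cannot pass outside the expectation in Eq. 6.3:
# differentiating "through" the theta-dependent posterior gives a different
# answer than the marginal gradient; freezing the posterior recovers it.
# (Toy 1-D, two-component mixture; names and values are illustrative.)
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

jax.config.update("jax_enable_x64", True)

def log_joint(mu, y):
    # two equiprobable components with unit variance and means mu[0], mu[1]
    return jnp.log(0.5) - 0.5 * jnp.log(2 * jnp.pi) - 0.5 * (y - mu) ** 2

def neg_log_marginal(mu, y):
    return -logsumexp(log_joint(mu, y))

def complete_ce(mu, y, freeze_posterior):
    lj = log_joint(mu, y)
    post = jax.nn.softmax(lj)            # model posterior p(x | y; theta)
    if freeze_posterior:
        post = jax.lax.stop_gradient(post)
    return -(post * lj).sum()            # posterior-weighted joint surprisal

mu, y = jnp.array([-1.0, 2.0]), 0.7

g_marginal = jax.grad(neg_log_marginal)(mu, y)
g_frozen   = jax.grad(complete_ce)(mu, y, freeze_posterior=True)
g_through  = jax.grad(complete_ce)(mu, y, freeze_posterior=False)

print(jnp.allclose(g_frozen, g_marginal))   # True:  posterior held fixed
print(jnp.allclose(g_through, g_marginal))  # False: posterior depends on theta
```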
To see the implications of this dependence on the parameters, we return to our workhorse example, the Gaussian mixture model.
Inserting Eq. 6.2 into the final line of Eq. 6.3 shows that
$$\nabla_{\theta}\, \mathbb{E}_{Y}\!\left[-\log \hat{p}(Y;\theta)\right] = \mathbb{E}_{\hat{X},Y}\!\left[-\nabla_{\theta}\!\left(\log \pi_{\hat{X}} + \log \mathcal{N}\!\left(Y;\; \mu_{\hat{X}},\, \Sigma_{\hat{X}}\right)\right)\right].$$
The gradient with respect to (e.g.) $\mu_k$ is therefore
$$\frac{\partial}{\partial \mu_k}\, \mathbb{E}_{Y}\!\left[-\log \hat{p}(Y;\theta)\right] = \mathbb{E}_{Y}\!\left[\hat{p}(\hat{x}=k\,|\,Y;\theta)\, \Sigma_k^{-1}\!\left(\mu_k - Y\right)\right]. \tag{6.4}$$
This formula looks elegant only if we forget that $\mu_k$ is on both sides.
The expectations under the model posterior, $\hat{p}(\hat{x}\,|\,y;\theta)$, involve $\mu_k$—indeed, they involve all of the parameters (recall Eq. 2.6)!
This is reminiscent of the problem with the direct optimization of the marginal cross entropy, Eq. 6.1, so it seems perhaps we have made no progress.
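Eq. 6.4 itself is still easy to verify, and the verification makes the circularity concrete. The sketch below (a toy one-dimensional mixture with illustrative names, parameters, and data) checks the responsibility-weighted formula for the gradient with respect to the means against automatic differentiation of the sampled marginal cross entropy. The responsibilities, $\hat{p}(\hat{x}=k\,|\,y;\theta)$, are computed from every entry of $\theta$, so Eq. 6.4 yields no closed-form solution for $\mu_k$.

```python
# Check of Eq. 6.4 on synthetic data (toy 1-D, two-component mixture;
# names, parameters, and data are illustrative).  The autodiff gradient of
# the sampled marginal cross entropy w.r.t. the means should match the
# responsibility-weighted average of (mu_k - y) / sigma2_k.
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

jax.config.update("jax_enable_x64", True)

def log_joint(theta, ys):
    """log p(x=k, y; theta) for each component k and each datum (shape [N, K])."""
    log_prior = jnp.log(theta["pi"])
    log_lik = (-0.5 * jnp.log(2 * jnp.pi * theta["sigma2"])
               - 0.5 * (ys[..., None] - theta["mu"]) ** 2 / theta["sigma2"])
    return log_prior + log_lik

def marginal_cross_entropy(theta, ys):
    """Sample average of -log p(y; theta) over the data ys."""
    return -logsumexp(log_joint(theta, ys), axis=-1).mean()

theta = {"pi": jnp.array([0.4, 0.6]),
         "mu": jnp.array([-1.0, 2.0]),
         "sigma2": jnp.array([0.5, 1.5])}
ys = jnp.array([-1.3, -0.2, 0.7, 1.9, 2.4, 3.1])

# Responsibilities p(x=k | y; theta): note they depend on *all* of theta.
lj = log_joint(theta, ys)
resp = jnp.exp(lj - logsumexp(lj, axis=-1, keepdims=True))        # [N, K]

# Right-hand side of Eq. 6.4 (sample average over ys), for every k at once.
rhs = (resp * (theta["mu"] - ys[:, None]) / theta["sigma2"]).mean(axis=0)

# Left-hand side: autodiff of the marginal cross entropy w.r.t. the means.
lhs = jax.grad(marginal_cross_entropy)(theta, ys)["mu"]

print(jnp.allclose(lhs, rhs))   # True
```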