Let us begin with an even simpler data set to model, Fig.~\ref{subfig:cluster}.
The data look to be normally distributed, so it would be sensible simply to let the model be $\mathcal{N}\big(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}\big)$, with $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ the sample mean and sample covariance.
But let us proceed somewhat naïvely according to the generic procedure introduced in Section 4.2.
The procedure enjoins us to minimize the relative entropy; or, equivalently, since the entropy doesn't depend on the parameters, the cross entropy:
\[
  \mathcal{H}(\boldsymbol{\mu}, \boldsymbol{\Sigma})
  = -\mathbb{E}_{\boldsymbol{Y}}\!\left[\log \mathcal{N}\!\left(\boldsymbol{Y};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)\right]
  = \frac{1}{2}\,\mathbb{E}_{\boldsymbol{Y}}\!\left[\left(\boldsymbol{Y}-\boldsymbol{\mu}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \left(\boldsymbol{Y}-\boldsymbol{\mu}\right)\right]
  + \frac{1}{2}\log\left|\boldsymbol{\Sigma}\right| + \text{const}.
\]
Differentiating with respect to $\boldsymbol{\mu}$ indeed indicates that $\boldsymbol{\mu}$ should be set equal to the sample average:
\[
  \boldsymbol{0} = \mathbb{E}_{\boldsymbol{Y}}\!\left[\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{Y}-\boldsymbol{\mu}\right)\right]
  \quad\implies\quad
  \boldsymbol{\mu} = \mathbb{E}_{\boldsymbol{Y}}\!\left[\boldsymbol{Y}\right] = \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{y}_n,
\]
where in the final equality we approximate the expectation under the (unavailable) data distribution with an average under (available) samples from it.
Likewise, differentiating with respect to $\boldsymbol{\Sigma}$, we find (after consulting Section B.1)
\[
  \boldsymbol{\Sigma} = \mathbb{E}_{\boldsymbol{Y}}\!\left[\left(\boldsymbol{Y}-\boldsymbol{\mu}\right)\left(\boldsymbol{Y}-\boldsymbol{\mu}\right)^{\mathrm{T}}\right]
  = \frac{1}{N}\sum_{n=1}^{N} \left(\boldsymbol{y}_n-\hat{\boldsymbol{\mu}}\right)\left(\boldsymbol{y}_n-\hat{\boldsymbol{\mu}}\right)^{\mathrm{T}},
\]
the sample covariance, the expectation again approximated by a sample average.
So far, so good.
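As a quick sanity check, here is a minimal NumPy sketch of these two estimators; the data set, mean, and covariance below are invented stand-ins for the cluster of Fig.~\ref{subfig:cluster}, not values from the text.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for a single Gaussian cluster.
Y = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[1.0, 0.3], [0.3, 0.5]],
                            size=1000)                 # shape (N, 2)

# Minimum-cross-entropy (maximum-likelihood) estimates:
mu_hat = Y.mean(axis=0)                                # sample mean
resid = Y - mu_hat
Sigma_hat = resid.T @ resid / len(Y)                   # sample covariance (1/N, not 1/(N-1))

print(mu_hat)      # approx. [ 1.0, -2.0]
print(Sigma_hat)   # approx. [[1.0, 0.3], [0.3, 0.5]]
\end{verbatim}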
We now proceed to the dataset shown in Fig.~\ref{subfig:unlabeledClusters}.
Here the data distribution, by all appearances, is a mixture of Gaussians.
In Section 2.1.1 we derived the marginal distribution for the GMM, Eq. 2.10, so it seems that perhaps we can use the same procedure as for the single Gaussian.
The cross-entropy loss is
\[
  \mathcal{H} = -\mathbb{E}_{\boldsymbol{Y}}\!\left[\log \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(\boldsymbol{Y};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)\right], \tag{6.1}
\]
with $\pi_k$ the mixture weights.
We have encountered a problem.
The summation (across classes) inside the logarithm couples the parameters of the Gaussians together.
In particular, differentiating with respect to $\boldsymbol{\mu}_k$ or $\boldsymbol{\Sigma}_k$ will yield an expression involving all the means and covariances.
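To make the coupling concrete, here is a sketch of the gradient of Eq. 6.1 with respect to one mean; the posterior weight $\gamma_k$ is our notation for this sketch, not necessarily the text's:
\[
  \frac{\partial \mathcal{H}}{\partial \boldsymbol{\mu}_k}
  = -\mathbb{E}_{\boldsymbol{Y}}\!\left[\gamma_k(\boldsymbol{Y})\, \boldsymbol{\Sigma}_k^{-1}\left(\boldsymbol{Y}-\boldsymbol{\mu}_k\right)\right],
  \qquad
  \gamma_k(\boldsymbol{y}) := \frac{\pi_k\, \mathcal{N}\!\left(\boldsymbol{y};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)}{\sum_{k'=1}^{K} \pi_{k'}\, \mathcal{N}\!\left(\boldsymbol{y};\, \boldsymbol{\mu}_{k'}, \boldsymbol{\Sigma}_{k'}\right)}.
\]
The weight $\gamma_k$ depends, through its denominator, on every component's parameters.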
Solving these coupled equations (after setting the gradients to zero) is not straightforward.
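For readers who prefer to see the coupling numerically, here is a minimal NumPy/SciPy sketch; the toy data, the helper \texttt{grad\_mu\_k}, and all parameter values are invented for illustration. Moving only $\boldsymbol{\mu}_2$ changes the sample-average gradient with respect to $\boldsymbol{\mu}_1$.
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
# Invented stand-in for two unlabeled clusters.
Y = np.concatenate([
    rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=250),
    rng.multivariate_normal([ 2.0, 0.0], np.eye(2), size=250),
])

def grad_mu_k(Y, pis, mus, Sigmas, k):
    """Sample-average gradient of the cross entropy w.r.t. mu_k."""
    # Component densities at every sample: shape (N, K).
    dens = np.stack([multivariate_normal.pdf(Y, mean=m, cov=S)
                     for m, S in zip(mus, Sigmas)], axis=1)
    # Responsibilities gamma_k(y); each depends on ALL components.
    gamma = (pis * dens) / (pis * dens).sum(axis=1, keepdims=True)
    resid = (Y - mus[k]) @ np.linalg.inv(Sigmas[k])  # Sigma_k^{-1}(y - mu_k)
    return -(gamma[:, [k]] * resid).mean(axis=0)

pis = np.array([0.5, 0.5])
Sigmas = [np.eye(2), np.eye(2)]

# Move only mu_2; the gradient w.r.t. mu_1 changes nonetheless.
g_a = grad_mu_k(Y, pis, [np.array([-1.0, 0.0]), np.array([2.0, 0.0])], Sigmas, k=0)
g_b = grad_mu_k(Y, pis, [np.array([-1.0, 0.0]), np.array([9.0, 0.0])], Sigmas, k=0)
print(g_a, g_b)   # different vectors: the parameters are coupled
\end{verbatim}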