7.1 The Gaussian mixture model and $K$-means
We return once again to the GMM, but this time, armed with the EM algorithm, we finally derive the learning rules. Since the entropy term does not depend on the model parameters, it suffices to minimize the joint cross entropy. The joint cross entropy for a Gaussian mixture model is

$$\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma_{nk}\Big[\log \pi_k + \log \mathcal{N}\!\big(\boldsymbol{x}_n;\; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)\Big],$$

where the responsibilities $\gamma_{nk} := p(k \mid \boldsymbol{x}_n;\, \boldsymbol{\theta}^{\text{old}})$ are computed in the E step under the previous parameters.
To enforce the fact that the prior probabilities sum to one, we can augment the loss with a Lagrangian term:

$$\tilde{\mathcal{L}}(\boldsymbol{\theta}, \lambda) = \mathcal{L}(\boldsymbol{\theta}) + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big).$$
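To make the objective concrete, here is a minimal NumPy/SciPy sketch (my own illustration, not the text's; names such as `resp`, `pis`, `mus`, `covs` are assumptions) that evaluates the joint cross entropy with the responsibilities held fixed from the previous E step. The Lagrangian term is omitted; in code the constraint would simply be enforced by normalizing the mixing proportions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_joint_cross_entropy(X, resp, pis, mus, covs):
    """Average over data of -E_q[log p(k, x_n)], with q(k | x_n) = resp[n, k]."""
    N, K = resp.shape
    total = 0.0
    for k in range(K):
        # log of the joint p(k, x_n) = pi_k * N(x_n; mu_k, Sigma_k), for all n at once
        log_joint_k = np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=covs[k])
        total += np.dot(resp[:, k], log_joint_k)
    return -total / N
```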
The M step.
We take the derivatives in turn. First the mixing proportions:

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \pi_k} = -\frac{1}{N}\sum_{n=1}^{N}\frac{\gamma_{nk}}{\pi_k} + \lambda = 0 \quad\implies\quad \pi_k = \frac{1}{N}\sum_{n=1}^{N}\gamma_{nk},$$

where enforcing the constraint $\sum_k \pi_k = 1$ fixes $\lambda = 1$; then the class-conditional means:

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \boldsymbol{\mu}_k} = \boldsymbol{0} \quad\implies\quad \boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N}\gamma_{nk}\,\boldsymbol{x}_n}{\sum_{n=1}^{N}\gamma_{nk}};$$

and the class-conditional covariances:

$$\frac{\partial \tilde{\mathcal{L}}}{\partial \boldsymbol{\Sigma}_k} = \boldsymbol{0} \quad\implies\quad \boldsymbol{\Sigma}_k = \frac{\sum_{n=1}^{N}\gamma_{nk}\,(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}}{\sum_{n=1}^{N}\gamma_{nk}}.$$
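As a sanity check on these three updates, here is a minimal NumPy sketch (my own; `X` is an $N \times d$ data matrix and `resp` the $N \times K$ matrix of responsibilities $\gamma_{nk}$ from the E step):

```python
import numpy as np

def gmm_m_step(X, resp):
    """Responsibility-weighted sample proportions, means, and covariances."""
    N, d = X.shape
    Nk = resp.sum(axis=0)                      # effective number of points per class
    pis = Nk / N                               # mixing proportions
    mus = (resp.T @ X) / Nk[:, None]           # weighted means, one row per class
    covs = np.empty((resp.shape[1], d, d))
    for k in range(resp.shape[1]):
        diff = X - mus[k]                      # (N, d) deviations from the class mean
        covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return pis, mus, covs
```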
Whether in EM or under a fully observed model, the optimal parameters are intuitive. The optimal mixing proportion, emission mean, and emission covariance for class $k$ are their sample counterparts, i.e. the sample proportion, sample mean, and sample covariance (resp.); or, to put it yet another way, the average number of occurrences of class $k$, the average value of the samples from class $k$, and the average covariance of the samples from class $k$. The difference between learning (in EM) with latent, rather than fully observed, classes is that these are weighted, rather than unweighted, averages. In particular, class assignments are soft: each class takes some continuous-valued responsibility††margin: class responsibilities for each datum $\boldsymbol{x}_n$, namely $\gamma_{nk}$, the probability of that class under the (previous) posterior distribution.
For example, when the class labels are observed, the denominator in the equation for the optimal mean becomes just the number of times class $k$ occurred, and the numerator picks out just those samples associated with class $k$. This is the sample average of the $\boldsymbol{x}_n$ in class $k$. But when the class is latent, the denominator is the average soft assignment, or responsibility, of class $k$, and the numerator is the (soft) average of all observations $\boldsymbol{x}_n$, each weighted by the responsibility that class $k$ takes for it.
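To make the observed-label special case concrete, here is a tiny self-contained check (my own illustration, with made-up data): with one-hot responsibilities, the weighted mean above reduces to the ordinary per-class sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                        # six 2-D observations
labels = np.array([0, 0, 1, 1, 1, 0])              # fully observed classes
resp = np.eye(2)[labels]                           # one-hot "responsibilities" (6, 2)

weighted_mean_0 = (resp[:, 0:1] * X).sum(0) / resp[:, 0].sum()
plain_mean_0 = X[labels == 0].mean(axis=0)
print(np.allclose(weighted_mean_0, plain_mean_0))  # True: identical estimates
```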
7.1.1 $K$-means
In Section 2.1.1, we saw what happens to the posterior of the GMM when all classes use the same isotropic covariance matrix, $\sigma^2\mathbf{I}$. The class boundaries become lines, and the responsibility of class $k$ for observation $\boldsymbol{x}_n$ becomes

$$\gamma_{nk} = \frac{\pi_k \exp\!\big(-\tfrac{1}{2\sigma^2}\lVert\boldsymbol{x}_n - \boldsymbol{\mu}_k\rVert^2\big)}{\sum_{j=1}^{K} \pi_j \exp\!\big(-\tfrac{1}{2\sigma^2}\lVert\boldsymbol{x}_n - \boldsymbol{\mu}_j\rVert^2\big)}.$$
It is not hard to see that in the limit of infinite precision, $1/\sigma^2 \to \infty$, this quantity goes to zero unless $\boldsymbol{x}_n$ is closer to $\boldsymbol{\mu}_k$ than to any other mean, in which case it goes to 1. The prior probabilities have become irrelevant. Then the algorithm becomes
$K$-Means

E step: $\quad \gamma_{nk} \leftarrow \begin{cases} 1 & \text{if } k = \operatorname{argmin}_{j}\, \lVert \boldsymbol{x}_n - \boldsymbol{\mu}_j \rVert, \\ 0 & \text{otherwise;} \end{cases}$

M step: $\quad \boldsymbol{\mu}_k \leftarrow \dfrac{\sum_{n} \gamma_{nk}\,\boldsymbol{x}_n}{\sum_{n} \gamma_{nk}}.$
This algorithm, which pre-existed EM for the GMM, is known as $K$-means††margin: $K$-means.
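A compact NumPy sketch of the two alternating steps above, iterated until the means stop moving (the function and variable names are mine, and initializing the means at randomly chosen data points is just one common choice):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=K, replace=False)]   # initialize means at random data points
    for _ in range(n_iters):
        # E step: hard-assign each observation to its nearest mean
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # M step: each mean becomes the (unweighted) average of its assigned points
        new_mus = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):                     # converged: means stable
            break
        mus = new_mus
    return mus, z
```

The hard assignment in the E step is exactly the infinite-precision limit of the responsibilities above, and the priors $\pi_k$ no longer enter.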