7.2 The hidden Markov model
We recall the HMM from Section 2.2.
Here we emphasize that multiple sequences have been observed; hence the plate in Fig. LABEL:fig:HMMwithplate.
Note also that, below, elements of vectors are indicated with a superscript when the subscript is already taken by the time variable.
Bare random variables, i.e. without super- or subscripts, denote the complete set of observations.
The elements of the state-transition matrix give the probability of moving from state $j$ at one time step to state $k$ at the next:
$$A_{jk} = p\big(x_{t+1}^k = 1 \,\big|\, x_t^j = 1\big),$$
independent of $t$.
Again we need to enforce probabilities summing to one, in this case both the prior probabilities over the initial state, $\sum_k \pi_k = 1$, and each row of the state-transition matrix, $\sum_k A_{jk} = 1$ for every $j$; so the Lagrangian acquires one multiplier per constraint.
The M step.
The derivative with respect to the prior probabilities, set to zero under the sum-to-one constraint, yields
$$\pi_k = \frac{1}{N}\sum_{n=1}^{N} \big\langle x_{1,n}^k \big\rangle,$$
the posterior probability of starting in state $k$, averaged across the $N$ observed sequences.
Likewise for the class-conditional means and covariances, although here we have to remember to sum across all samples, that is, across both sequences and time steps:
$$\boldsymbol{\mu}_k = \frac{\sum_{n}\sum_{t} \big\langle x_{t,n}^k \big\rangle \boldsymbol{y}_{t,n}}{\sum_{n}\sum_{t} \big\langle x_{t,n}^k \big\rangle}, \qquad
\boldsymbol{\Sigma}_k = \frac{\sum_{n}\sum_{t} \big\langle x_{t,n}^k \big\rangle \big(\boldsymbol{y}_{t,n} - \boldsymbol{\mu}_k\big)\big(\boldsymbol{y}_{t,n} - \boldsymbol{\mu}_k\big)^{\mathrm{T}}}{\sum_{n}\sum_{t} \big\langle x_{t,n}^k \big\rangle}.$$
Finally, we derive the optimal state-transition probabilities:
$$A_{jk} = \frac{\sum_{n}\sum_{t=1}^{T-1} \big\langle x_{t,n}^j\, x_{t+1,n}^k \big\rangle}{\sum_{n}\sum_{t=1}^{T-1} \big\langle x_{t,n}^j \big\rangle}.$$
The optimal parameters again have intuitive interpretations. The class-conditional means and covariances are computed exactly as in the GMM, except that the averages now run across time steps and independent sequences rather than independent samples. In the case of EM, there is one other crucial distinction from the GMM: the posterior means of the HMM are conditioned on the observations from all time steps; that is, they are the smoother means. This distinction has no relevance in the GMM, which has no temporal dimension.
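To make these computations concrete, here is a minimal numerical sketch of the M step (not code from this text: the function, the array names and shapes, and the NumPy implementation are all assumptions), taking the smoother statistics as given:

```python
import numpy as np

def m_step(gammas, xis, ys):
    """M step for a Gaussian-emission HMM.

    gammas: list of (T, K) arrays of smoother marginals <x_t^k>, one per sequence
    xis:    list of (T-1, K, K) arrays of pairwise marginals <x_t^j x_{t+1}^k>
    ys:     list of (T, D) arrays of observations, one per sequence
    """
    K = gammas[0].shape[1]
    D = ys[0].shape[1]

    # Initial-state probabilities: average the t = 1 marginals across sequences.
    pi = np.mean([g[0] for g in gammas], axis=0)

    # Transition matrix: expected transition counts over expected visit counts.
    num = sum(xi.sum(axis=0) for xi in xis)        # (K, K)
    den = sum(g[:-1].sum(axis=0) for g in gammas)  # (K,)
    A = num / den[:, None]

    # Class-conditional means and covariances: responsibility-weighted
    # averages across both time steps and sequences.
    w = sum(g.sum(axis=0) for g in gammas)         # (K,) expected visit counts
    mu = sum(g.T @ y for g, y in zip(gammas, ys)) / w[:, None]  # (K, D)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        S = sum(((y - mu[k]).T * g[:, k]) @ (y - mu[k])
                for g, y in zip(gammas, ys))
        Sigma[k] = S / w[k]
    return pi, A, mu, Sigma
```

Note that every denominator is an expected count of visits to a state, so each update is a responsibility-weighted average, just as in the GMM.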
Similarly, the optimal mixing proportions for the initial state look like the optimal mixing proportions for the latent classes in the GMM, although the sample average for the HMM is over sequences, rather than individual samples. One upshot is that it would be impossible to estimate the initial mixing proportions properly without access to multiple, independent sequences (this makes sense). And again, we must be careful in the case of EM: the expectation should be taken under the smoother distribution over the initial state. That is, to estimate the initial mixing proportions, the inference algorithm should first run all the way to the end (filter) and back (smoother)! This may at first be surprising, but note that future observations should indeed have some (albeit diminishing) influence on our belief about the initial state. (We can imagine, colorfully, an unexpected future observation in light of which we revise our belief about the initial state: “Oh, I guess it must have started in state five, then….”)
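The full filter-smoother sweep just described can likewise be sketched (again hypothetically, with assumed names; the normalization follows the standard scaled forward-backward recursions):

```python
import numpy as np

def forward_backward(pi, A, B):
    """Smoother statistics for a discrete-state HMM (T >= 2 assumed).

    pi: (K,) initial-state probabilities
    A:  (K, K) transition matrix, A[j, k] = p(state k at t+1 | state j at t)
    B:  (T, K) emission likelihoods p(y_t | state k), e.g. Gaussian densities

    Returns gamma, (T, K) singleton marginals p(x_t | y), and
    xi, (T-1, K, K) pairwise marginals p(x_t, x_{t+1} | y).
    """
    T, K = B.shape

    # Forward (filter) pass, normalized at each step for numerical stability.
    alpha = np.empty((T, K))
    c = np.empty(T)  # per-step normalizers; their product is the likelihood
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward (smoother) pass, scaled by the same normalizers.
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta  # smoothed singleton marginals; each row sums to 1
    xi = (alpha[:-1, :, None] * A[None]
          * (B[1:] * beta[1:])[:, None, :]) / c[1:, None, None]
    return gamma, xi
```

Running the backward pass all the way to $t = 1$ is precisely what lets later observations revise the smoothed belief about the initial state.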
We turn to the remaining parameters, the elements of the state-transition matrix. The optimal $A_{jk}$ is again intuitive: a ratio of expected counts, namely the expected number of transitions from state $j$ to state $k$, divided by the expected number of visits to state $j$ (the final time step excluded, since no transition follows it). And once more, under EM, these expectations must be taken under the smoother distribution.
The E step.
Let us again make explicit the expected sufficient statistics for EM:
$$\big\langle x_t^k \big\rangle = p\big(x_t^k = 1 \,\big|\, \boldsymbol{y}\big), \qquad
\big\langle x_t^j x_{t+1}^k \big\rangle = p\big(x_t^j = 1,\, x_{t+1}^k = 1 \,\big|\, \boldsymbol{y}\big),$$
where (recall) a bare $\boldsymbol{y}$ denotes the complete set of observations.
There are a few points to note.
First, there are really only two kinds of expectations: over a single latent variable,