7.2 The hidden Markov model

We recall the HMM from Section 2.2. Here we emphasize that multiple sequences have been observed; hence the plate in Fig.~\ref{fig:HMMwithplate}. Note also that, below, elements of vectors are indicated with a superscript when the subscript is already taken by the time variable. Bare random variables, i.e.\ without super- or subscripts, denote the complete set of observations. The elements of the state-transition matrix $\mathbf{A}$ are referred to as $a_{ij}$, with $i$ and $j$ indicating row and column, respectively. We also use the abbreviation $\hat{p}\left(\bm{\hat{x}}_1 \middle| \bm{\hat{x}}_0; \bm{\theta}\right) \mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}= \hat{p}\left(\bm{\hat{x}}_1; \bm{\theta}\right) = \text{Categ}\left(\bm{\pi}\right)$. For concreteness we commit here to Gaussian emissions (the derivation is conceptually identical for multinomial emissions). Then the complete cross entropy is

\begin{equation*}
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \prod_{t=1}^{T} \hat{p}\left(\bm{\check{X}}_t \middle| \bm{\check{X}}_{t-1};\bm{\theta}\right) \hat{p}\left(\bm{Y}_t \middle| \bm{\check{X}}_t;\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \left(\prod_{t=2}^{T}\prod_{i=1}^{K}\prod_{j=1}^{K} a_{ij}^{\check{X}^i_t \check{X}^j_{t-1}}\right) \left(\prod_{k=1}^{K} \pi_k^{\check{X}^k_1}\right) \left(\prod_{t=1}^{T}\prod_{k=1}^{K} \mathcal{N}\left(\bm{\mu}_k,\,\mathbf{\Sigma}_k\right)^{\check{X}^k_t}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\sum_{t=2}^{T}\sum_{i=1}^{K}\sum_{j=1}^{K} \check{X}^i_t \check{X}^j_{t-1} \log a_{ij} - \sum_{k=1}^{K} \check{X}^k_1 \log \pi_k - \sum_{t=1}^{T}\sum_{k=1}^{K} \check{X}^k_t \log\mathcal{N}\left(\bm{\mu}_k,\,\mathbf{\Sigma}_k\right) \right\rangle_{\bm{\check{X}},\bm{Y}}.
\end{split}
\end{equation*}
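In code, the expression inside the final angle brackets is just a sum of three dot products per time step. The following is a minimal sketch, not taken from the text: the function name complete_data_nll, the array shapes, and the use of scipy for the Gaussian log-density are all assumptions, and the state indicators are treated as fully observed (one-hot) rather than averaged over.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_nll(X, Y, pi, A, mus, Sigmas):
    """Negative complete-data log-likelihood of one sequence (hypothetical helper).

    X      : (T, K) one-hot state indicators
    Y      : (T, D) observations
    pi     : (K,)   initial-state probabilities
    A      : (K, K) transitions, A[i, j] = p(state i at t | state j at t-1)
    mus    : (K, D) emission means;  Sigmas : (K, D, D) emission covariances
    """
    T, K = X.shape
    # initial-state term: -sum_k X_1^k log pi_k
    nll = -X[0] @ np.log(pi)
    # transition terms: -sum_{t>=2} sum_{ij} X_t^i X_{t-1}^j log a_ij
    for t in range(1, T):
        nll -= X[t] @ np.log(A) @ X[t - 1]
    # emission terms: -sum_t sum_k X_t^k log N(y_t; mu_k, Sigma_k)
    for t in range(T):
        log_emis = np.array([multivariate_normal.logpdf(Y[t], mus[k], Sigmas[k])
                             for k in range(K)])
        nll -= X[t] @ log_emis
    return nll
\end{verbatim}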

Again we need to enforce that probabilities sum to one, in this case both the prior probabilities $\bm{\pi}$ and all $K$ columns of the state-transition matrix $\mathbf{A}$. The Lagrangian becomes

\begin{equation*}
\mathcal{L}(\bm{\theta}) = \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right) + \sum_{j=1}^{K} \eta_j \left(\sum_{i=1}^{K} a_{ij} - 1\right) + \text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right].
\end{equation*}

The M step.

The derivative with respect to $\pi_k$ is exactly as above, just confined to the very first sample ($t=1$), and so the optimal prior probability is

\begin{equation*}
\pi_k = \left\langle \check{X}^k_1 \right\rangle_{\bm{\check{X}},\bm{Y}}.
\end{equation*}
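Under EM this update amounts to averaging the smoother marginals of the initial state across sequences. A minimal sketch, assuming an array gamma_first (a hypothetical name) whose row $n$ holds the smoother marginal over the initial state of sequence $n$:

\begin{verbatim}
import numpy as np

def update_prior(gamma_first):
    # gamma_first : (N, K) smoother marginals over the initial state, one row
    # per sequence.  Averaging rows gives pi; the result already sums to one.
    return gamma_first.mean(axis=0)
\end{verbatim}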

Likewise for the class-conditional means and covariances, although here we have to remember to sum across all samples:

\begin{equation*}
\bm{\mu}_k = \frac{\left\langle \sum_{t=1}^{T} \check{X}^k_t \bm{Y}_t \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \sum_{t=1}^{T} \check{X}^k_t \right\rangle_{\bm{\check{X}},\bm{Y}}}, \qquad
\mathbf{\Sigma}_k = \frac{\left\langle \sum_{t=1}^{T} \check{X}^k_t \bm{Y}_t \bm{Y}_t^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \sum_{t=1}^{T} \check{X}^k_t \right\rangle_{\bm{\check{X}},\bm{Y}}} - \bm{\mu}_k \bm{\mu}_k^{\text{T}}.
\end{equation*}
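These two updates reduce to a few array operations. The sketch below is only illustrative: it assumes per-sequence arrays of smoother marginals (gammas, a hypothetical name) and observations (Ys), and accumulates the three sufficient statistics before forming the ratios above.

\begin{verbatim}
import numpy as np

def update_emissions(gammas, Ys):
    """M-step for the Gaussian emission parameters (illustrative sketch).

    gammas : list of (T_n, K) smoother marginals E[X_t | y_{1:T}], one per sequence
    Ys     : list of (T_n, D) observation sequences
    """
    K, D = gammas[0].shape[1], Ys[0].shape[1]
    Nk  = np.zeros(K)           # <sum_t X_t^k>
    Sy  = np.zeros((K, D))      # <sum_t X_t^k y_t>
    Syy = np.zeros((K, D, D))   # <sum_t X_t^k y_t y_t^T>
    for gamma, Y in zip(gammas, Ys):
        Nk  += gamma.sum(axis=0)
        Sy  += gamma.T @ Y
        Syy += np.einsum('tk,td,te->kde', gamma, Y, Y)
    mus    = Sy / Nk[:, None]
    Sigmas = Syy / Nk[:, None, None] - np.einsum('kd,ke->kde', mus, mus)
    return mus, Sigmas
\end{verbatim}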

Finally, we derive the optimal state-transition probabilities:

\begin{equation*}
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\partial \mathcal{L}}{\partial a_{ij}} = \eta_j - \frac{\left\langle \sum_{t=2}^{T} \check{X}^i_t \check{X}^j_{t-1} \right\rangle_{\bm{\check{X}},\bm{Y}}}{a_{ij}}
&\implies a_{ij}\eta_j = \left\langle \sum_{t=2}^{T} \check{X}^i_t \check{X}^j_{t-1} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&\implies \sum_{i=1}^{K} a_{ij}\eta_j = \left\langle \sum_{t=2}^{T} \sum_{i=1}^{K} \check{X}^i_t \check{X}^j_{t-1} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&\implies \eta_j = \left\langle \sum_{t=2}^{T} \check{X}^j_{t-1} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&\implies a_{ij} = \frac{\left\langle \sum_{t=2}^{T} \check{X}^i_t \check{X}^j_{t-1} \right\rangle_{\bm{\check{X}},\bm{Y}}}{\left\langle \sum_{t=2}^{T} \check{X}^j_{t-1} \right\rangle_{\bm{\check{X}},\bm{Y}}}.
\end{split}
\end{equation*}
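Under EM the same formula is applied to the smoother's pairwise marginals. A minimal sketch, assuming per-sequence arrays xis (a hypothetical name) with xis[n][t, i, j] holding the smoothed pairwise marginal for the transition into time $t+1$ of sequence $n$:

\begin{verbatim}
import numpy as np

def update_transitions(xis):
    # Sum the pairwise marginals over time and sequences; entry [i, j] is then
    # <sum_t X_t^i X_{t-1}^j>.  Summing that over i recovers <sum_t X_{t-1}^j>,
    # so dividing column-wise yields a_ij with each column summing to one.
    num = sum(xi.sum(axis=0) for xi in xis)
    den = num.sum(axis=0, keepdims=True)
    return num / den
\end{verbatim}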

The optimal parameters again have intuitive interpretations. The class-conditional means and covariances are computed exactly as with the GMM, except that the averages now run across time and independent sequences, rather than across independent samples. In the case of EM, there is one other crucial distinction from the GMM: the posterior means of the HMM are conditioned on the observations from all time steps; that is, they are the smoother means. This distinction has no relevance in the GMM, which has no temporal dimension.

Similarly, the optimal mixing proportions for the initial state look like the optimal mixing proportions for the latent classes in the GMM, although the sample average for the HMM is over sequences, rather than individual samples. One upshot is that it would be impossible to estimate the initial mixing proportions properly without access to multiple, independent sequences (this makes sense). And again, we must be careful in the case of EM: the expectation should be taken under the smoother distribution over the initial state. That is, to estimate the initial mixing proportions, the inference algorithm should first run all the way to the end (filter) and back (smoother)! This may at first be surprising, but note that future observations should indeed have some (albeit diminishing) influence on our belief about initial state. (We can imagine, colorfully, an unexpected future observation in light of which we revise our belief about the initial state: “Oh, I guess it must have started in state five, then….”)

We turn to the remaining parameters, the elements of the state-transition matrix, $\mathbf{A}$. When the states have been fully observed, the optimal $a_{ij}$ is merely the proportion of times state $j$ transitioned to state $i$ (rather than to some other state). Under EM, the numerator is the expected frequency of state $j$ followed by state $i$, where the expectation is taken under the smoother. To transform this into a transition probability, the base frequency of state $j$ must be taken into account; under EM, we estimate it with its expected frequency under the smoother (averaged over sequences).
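The fully observed case is worth seeing explicitly, since the EM update simply replaces these counts with expected counts. A minimal sketch, assuming an integer-coded state sequence (the function name transition_counts is only illustrative):

\begin{verbatim}
import numpy as np

def transition_counts(states, K):
    # states : length-T integer array of visited states (fully observed case).
    # counts[i, j] = number of times state j was immediately followed by state i;
    # normalizing each column by its total gives the maximum-likelihood a_ij.
    counts = np.zeros((K, K))
    for prev, nxt in zip(states[:-1], states[1:]):
        counts[nxt, prev] += 1
    return counts / counts.sum(axis=0, keepdims=True)
\end{verbatim}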

The E step.

Let us again make explicit the expected sufficient statistics for EM:

\begin{align*}
&\left\langle \mathbb{E}_{\check{X}^k_1|\bm{Y}}\left[\check{X}^k_1 \middle| \bm{Y}_1,\ldots,\bm{Y}_T\right] \right\rangle_{\bm{Y}}, &
&\left\langle \sum_{t=2}^{T} \mathbb{E}_{\check{X}^j_{t-1}|\bm{Y}}\left[\check{X}^j_{t-1} \middle| \bm{Y}_1,\ldots,\bm{Y}_T\right] \right\rangle_{\bm{Y}},\\
&\left\langle \sum_{t=2}^{T} \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\left[\check{X}^i_t \check{X}^j_{t-1} \middle| \bm{Y}_1,\ldots,\bm{Y}_T\right] \right\rangle_{\bm{Y}}, &
&\left\langle \sum_{t=1}^{T} \mathbb{E}_{\check{X}^k_t|\bm{Y}}\left[\check{X}^k_t \middle| \bm{Y}_1,\ldots,\bm{Y}_T\right] \bm{Y}_t \right\rangle_{\bm{Y}},\\
&\left\langle \sum_{t=1}^{T} \mathbb{E}_{\check{X}^k_t|\bm{Y}}\left[\check{X}^k_t \middle| \bm{Y}_1,\ldots,\bm{Y}_T\right] \bm{Y}_t \bm{Y}_t^{\text{T}} \right\rangle_{\bm{Y}}.
\end{align*}

There are a few points to note. First, there are really only two kinds of expectations: over a single latent variable, $\check{X}^k_t$, and over adjacent latent variables, $\check{X}^i_t \check{X}^j_{t-1}$. We worked out how to compute these in Section 2.2. In particular, both expectations can be computed with the smoother—that is, after a complete forward and backward pass, since the expectations depend on the observations at all times. As we have lately noted, this is even true for the statistics at the very first time step. Finally, notice that we need not keep the statistics separately at each time step; we can sum across time. However, there is one subtlety: one of these sums omits the very last time step (since it is used to estimate the probability of the states out of which the system transitions).
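Putting these points together, the E step for one sequence can be sketched with a scaled forward-backward pass. The function below is only illustrative (the name e_step, the log_emission callable, and the array layout are assumptions, not the text's notation); it returns the smoother marginals and pairwise marginals from which all of the statistics above are summed.

\begin{verbatim}
import numpy as np

def e_step(Y, pi, A, log_emission):
    # Y : (T, D) observations; pi : (K,) prior; A[i, j] = p(i at t | j at t-1).
    # log_emission(Y) is assumed to return a (T, K) array of log p(y_t | x_t = k).
    B = np.exp(log_emission(Y))                  # emission likelihoods
    T, K = B.shape
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                        # forward pass (filter), scaled
        alpha[t] = B[t] * (A @ alpha[t - 1])
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):               # backward pass, same scaling
        beta[t] = A.T @ (B[t + 1] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                         # E[X_t | y_{1:T}]
    xi = np.zeros((T - 1, K, K))                 # E[X_{t+1}^i X_t^j | y_{1:T}]
    for t in range(T - 1):
        xi[t] = (B[t + 1] * beta[t + 1])[:, None] * A * alpha[t][None, :] / c[t + 1]
    return gamma, xi
\end{verbatim}

In terms of these outputs, the summed statistics are gamma[0] (averaged over sequences) for the prior, xi.sum(0) and gamma[:-1].sum(0) for the transitions (the latter being the sum that omits the last time step), and the gamma-weighted sums of $\bm{Y}_t$ and $\bm{Y}_t \bm{Y}_t^{\text{T}}$ for the emissions.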