The E step.
In Eq. 7.4, as in the preceding examples, the bracketed quantities (including the brackets) are the sufficient statistics.
They are all averages of either vectors or outer products of vectors, reflecting the quadratic structure inherent in normal distributions.
When the states are observed, all these quantities are computed as sample averages under the data distribution.
In the context of the EM algorithm, where the states are unobserved, the relevant averaging distributions are the RTS smoothing distributions, multiplied by the data distribution:
Once again, computing the sufficient statistics belongs to the E step.
So in this case, the E step requires first running the RTS smoother—more specifically, the Kalman filter followed by the RTS smoother—and then computing the sufficient statistics under them.
To make this extremely concrete, we note that having the smoother distribution in hand means having a mean (vector) and covariance matrix at every time step, since the distribution is normal.
To compute expected outer products, then, we combine the two: the expected outer product is the posterior covariance plus the outer product of the posterior means.
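This combination can be sketched in numpy. The names `m`, `V`, and `Vcross` for the smoother's means, per-step covariances, and lag-one cross-covariances (an additional smoother output, needed for the cross terms) are hypothetical:

```python
import numpy as np

def expected_outer_products(m, V, Vcross):
    """Combine smoother outputs into expected outer products.

    m:      (T, n)      posterior means E[x_t]
    V:      (T, n, n)   posterior covariances Cov(x_t)
    Vcross: (T-1, n, n) lag-one cross-covariances Cov(x_t, x_{t-1})
    """
    # E[x_t x_t^T] = Cov(x_t) + E[x_t] E[x_t]^T
    Exx = V + np.einsum('ti,tj->tij', m, m)
    # E[x_t x_{t-1}^T] = Cov(x_t, x_{t-1}) + E[x_t] E[x_{t-1}]^T
    Exx1 = Vcross + np.einsum('ti,tj->tij', m[1:], m[:-1])
    return Exx, Exx1
```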
For the parameters of the state transitions, we need
Notice that all sums start at the second index, and (consequently) are normalized by the number of summands, which is one fewer than the sequence length.
That is why the first and third statistics are not identical.
If the sequence is long enough, extending these sums to include all the terms would probably have little effect, but in any case both averages can be computed at very little computational cost.
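The bookkeeping for these sums can be sketched as follows (array names hypothetical), with the expected outer products formed by combining the smoother's means and covariances as just described. The first and third statistics differ precisely because one sum omits the last expected outer product and the other omits the first:

```python
import numpy as np

def transition_statistics(Exx, Exx1):
    """Averages over t = 2, ..., T, normalized by T - 1.

    Exx:  (T, n, n)   expected outer products E[x_t x_t^T]
    Exx1: (T-1, n, n) expected products E[x_t x_{t-1}^T]
    """
    Tm1 = Exx1.shape[0]                    # number of summands, T - 1
    S_cross = Exx1.sum(axis=0) / Tm1       # avg E[x_t x_{t-1}^T], t >= 2
    S_prev  = Exx[:-1].sum(axis=0) / Tm1   # avg E[x_{t-1} x_{t-1}^T], omits last step
    S_curr  = Exx[1:].sum(axis=0) / Tm1    # avg E[x_t x_t^T], omits first step
    return S_cross, S_prev, S_curr
```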
For the parameters of the emissions, in contrast, we collect statistics for all time.
Thus, even though we also need a statistic of the same form as one just encountered, in this case the average runs over all time steps:
We also need average outer products involving the observations, but since the observations themselves are known, these can be computed directly rather than via the posterior expectation and covariance.
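A sketch of these direct computations, with hypothetical names `Y` for the observed sequence and `m_post` for the smoothed state means:

```python
import numpy as np

def observation_statistics(Y, m_post):
    """Average outer products involving the observed sequence.

    Y:      (T, m) observations
    m_post: (T, n) posterior (smoothed) state means E[x_t]
    """
    T = Y.shape[0]
    # The observations are known, so no covariance correction is needed here.
    S_yy = np.einsum('ti,tj->ij', Y, Y) / T       # avg y_t y_t^T
    S_yx = np.einsum('ti,tj->ij', Y, m_post) / T  # avg y_t E[x_t]^T
    return S_yy, S_yx
```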
Finally, the sufficient statistics for the initial cumulants, the mean and covariance of the first state, require no sums at all, since they rely only on the first time step:
We emphasize here again that, just as in the HMM, the posterior cumulants, the smoothed means and covariances, depend on the observations for all time: at least in theory, one must run the filter all the way to the end of the sequence, and then the smoother all the way back, before computing them.††margin:
implementation note: The second of these can yield a low-rank covariance matrix if only a single trajectory has been observed, and care is sometimes required to avoid inverting a singular matrix in the Kalman filter.
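One generic safeguard for the low-rank case is to fall back to a small diagonal "ridge" before inverting; this is a common numerical device, not something prescribed by the algorithm itself, and the function name and jitter scale here are assumptions:

```python
import numpy as np

def safe_solve(S, B, jitter=1e-8):
    """Solve S X = B, adding diagonal jitter if S is singular.

    S: (n, n) a (possibly low-rank) covariance estimate
    B: (n, k) right-hand side
    """
    n = S.shape[0]
    try:
        return np.linalg.solve(S, B)
    except np.linalg.LinAlgError:
        # Regularize: a small multiple of the identity restores full rank.
        return np.linalg.solve(S + jitter * np.eye(n), B)
```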