7.4 Linear-Gaussian state-space models
Exactly analogously to the extension of GMMs to HMMs, we extend factor analysis through time (or space) into a dynamical system. Now the state is a vector of continuous values, assumed to be normally distributed about a linear function of its predecessor. We derived the inference algorithm for this model in Section 2.2.2, which we will need for the E step. Still, as in the previous cases, we forbear to specify the averaging distribution, so that the M step can apply equally well to a fully observed model. The cross entropy for the linear-Gaussian state-space model is then written
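As a concrete reference for the generative process just described, here is a minimal sketch of sampling from such a model. The symbol names (transition matrix A, emission matrix C, noise covariances Q and R, initial cumulants mu0 and S0) are generic choices for illustration, not necessarily the text's notation.

```python
import numpy as np

def simulate_lgssm(A, C, Q, R, mu0, S0, T, rng):
    """Draw one sequence from a linear-Gaussian state-space model:
        x_1 ~ N(mu0, S0),   x_t = A x_{t-1} + w_t,   w_t ~ N(0, Q),
        y_t = C x_t + v_t,  v_t ~ N(0, R).
    """
    K, M = A.shape[0], C.shape[0]
    X = np.empty((T, K))
    X[0] = rng.multivariate_normal(mu0, S0)
    for t in range(1, T):
        # state: normally distributed about a linear function of its predecessor
        X[t] = A @ X[t - 1] + rng.multivariate_normal(np.zeros(K), Q)
    # emissions: normally distributed about a linear function of the state
    Y = X @ C.T + rng.multivariate_normal(np.zeros(M), R, size=T)
    return X, Y
```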
The M step.
All three summands are Gaussian, so we consider in detail the differentiation of only the first.
Optimizing first with respect to
where in the last line we interpret the average to be under samples from within, as well as across, sequences.††margin:
implementation note: Since each sequence contributes only
where again in the penultimate line we changed the interpretation of the average to be within as well as across sequences.
The final simplification was carried out with the equation for
Precisely analogous derivations lead to formulae for the other cumulants. Here are all of the equations:
(7.4)
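Since the update equations are each a regression-like function of the bracketed sufficient statistics, they can be sketched compactly in code. The matrix names and dictionary keys below are our own illustrative notation, and the formulas are the standard closed-form M-step updates for a linear-Gaussian state-space model, not necessarily term-for-term the text's Eq. 7.4.

```python
import numpy as np

def m_step(stats):
    """M step, given the sufficient statistics (averages of vectors
    and of outer products).  Illustrative key names:
      'xx'      <x_t x_t'>         averaged over t = 2..T
      'xx_prev' <x_{t-1} x_{t-1}'> averaged over t = 2..T
      'x_lag'   <x_t x_{t-1}'>     averaged over t = 2..T
      'xx_all'  <x_t x_t'>         averaged over all t
      'yx'      <y_t x_t'>,  'yy'  <y_t y_t'>  averaged over all t
      'x1', 'x1x1'  first-step mean and outer product (across sequences)
    """
    A = stats['x_lag'] @ np.linalg.inv(stats['xx_prev'])   # transitions
    Q = stats['xx'] - A @ stats['x_lag'].T                 # transition noise
    C = stats['yx'] @ np.linalg.inv(stats['xx_all'])       # emissions
    R = stats['yy'] - C @ stats['yx'].T                    # emission noise
    mu0 = stats['x1']                                      # initial mean
    S0 = stats['x1x1'] - np.outer(mu0, mu0)                # initial covariance
    return A, Q, C, R, mu0, S0
```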
The E step.
In Eq. 7.4, as in the preceding examples, the bracketed quantities (including the brackets) are the sufficient statistics.
They are all averages of either vectors or outer products of vectors, reflecting the quadratic structure inherent in normal distributions.
When the states are observed, all these quantities are computed as sample averages under the data distribution.
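In the fully observed case, those sample averages might be computed as follows. This is a sketch; the array layout (time as the leading axis) and the key names are our own.

```python
import numpy as np

def sample_stats(X, Y):
    """Sufficient statistics as sample averages, for fully observed
    states.  X is (T, K), one K-dimensional state per time step;
    Y is (T, M), the corresponding observations."""
    T = X.shape[0]
    return {
        'xx':      X[1:].T @ X[1:]   / (T - 1),  # <x_t x_t'>,         t >= 2
        'xx_prev': X[:-1].T @ X[:-1] / (T - 1),  # <x_{t-1} x_{t-1}'>, t >= 2
        'x_lag':   X[1:].T @ X[:-1]  / (T - 1),  # <x_t x_{t-1}'>,     t >= 2
        'xx_all':  X.T @ X / T,                  # <x_t x_t'> over all t
        'yx':      Y.T @ X / T,                  # <y_t x_t'>
        'yy':      Y.T @ Y / T,                  # <y_t y_t'>
        'x1':      X[0],                         # first-step mean
        'x1x1':    np.outer(X[0], X[0]),         # first-step outer product
    }
```

Note that the lagged statistics run over t >= 2 only, and so are normalized by T - 1 rather than T.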
In the context of the EM algorithm, where the states are unobserved, the relevant averaging distributions
Once again, computing the sufficient statistics belongs to the E step. So in this case, the E step requires first running the RTS smoother—more specifically, the Kalman filter followed by the RTS smoother—and then computing the sufficient statistics under the resulting posterior distributions. To make this extremely concrete, we note that having the smoother distribution in hand means having a mean (vector) and covariance matrix at every time step, since the distribution is normal. To compute expected outer products, then, we have to combine these together.
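For a Gaussian posterior, this combination is simply "covariance plus outer product of means," via the identity E[ab'] = Cov(a, b) + E[a]E[b]'. A sketch, assuming the smoother returns stacked means and (cross-)covariances:

```python
import numpy as np

def expected_outer_products(means, covs, lag_covs):
    """means: (T, K) posterior means; covs: (T, K, K) posterior
    covariances; lag_covs: (T-1, K, K) posterior cross-covariances
    Cov(x_t, x_{t-1}) for t = 2..T, as produced by the smoother's
    backward pass.  Applies E[a b'] = Cov(a, b) + E[a] E[b]'."""
    # E[x_t x_t'] at every step
    Exx = covs + np.einsum('ti,tj->tij', means, means)
    # E[x_t x_{t-1}'] for t = 2..T
    Exx_lag = lag_covs + np.einsum('ti,tj->tij', means[1:], means[:-1])
    return Exx, Exx_lag
```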
For
Notice that all sums start at the second index, and (consequently) are normalized by
For
We also need average outer products involving the observations, but these can be computed directly rather than via the posterior expectation and covariance:
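A sketch of that direct computation, assuming the observations and the smoother means are stacked with time as the leading axis:

```python
import numpy as np

def observation_stats(Y, means):
    """Sufficient statistics involving the observations.  Because the
    observations are given, no covariance correction is needed: only
    the posterior means of the states enter.
    Y: (T, M) observations; means: (T, K) smoother means."""
    T = Y.shape[0]
    Syy = Y.T @ Y / T      # <y_t y_t'>: a plain sample average
    Syx = Y.T @ means / T  # <y_t x_t'> under the posterior
    return Syy, Syx
```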
Finally, the sufficient statistics for the initial cumulants,
We emphasize here again that, just as in the HMM,
the posterior cumulants