2.2 Dynamical models
Consider the graphical model in Fig. 2.4. Here we finally encounter some interesting statistical-independence structure among the variables. One interpretation is that we have dropped the i.i.d. assumption: the plates that we saw in Figs. 2.1 and 2.3 have been replaced with explicit representation of the source-emission replicates, because now “adjacent” sources are dependent on each other. The joint distribution still factorizes, but in a more complicated way, with individual factors linking together pairs of random variables:
$$p(\boldsymbol{x}_{1:T}, \boldsymbol{y}_{1:T}) = p(\boldsymbol{x}_1)\, p(\boldsymbol{y}_1 | \boldsymbol{x}_1) \prod_{t=2}^{T} p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, p(\boldsymbol{y}_t | \boldsymbol{x}_t). \tag{2.35}$$
(A simple remark on the notation: there is no conditional distribution for the first state; it is assigned an unconditional prior distribution, $p(\boldsymbol{x}_1)$.)
Alternatively, we can interpret Fig. 2.4 as showing structure internal to each of the i.i.d. samples of Figs. 2.1 and 2.3: under this reading, each “sample” is an entire sequence of states and emissions, and the sequences themselves remain independent and identically distributed.
Note the tree structure in Fig. 2.4. It implies that inference—although substantially more complicated than that for the models encountered previously, which was essentially a single application of Bayes’s rule—is still only a straightforward application of the sum-product algorithm. Making that connection precise is, however, left as an exercise for the reader. Here we derive, and show the connections between, the classical inference algorithms for the two main models for which this graph structure holds: the “forward-backward” algorithm for the hidden Markov model; and the Kalman filter and Rauch-Tung-Striebel smoother for the state-space model (SSM).
We begin without specializing to either of these models.
The general strategy will be to try to derive recursive procedures.
More specifically, we will compute the filter distribution, $p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t})$ (where $\boldsymbol{y}_{1:t} := (\boldsymbol{y}_1, \ldots, \boldsymbol{y}_t)$), recursing forward through the sequence; and the smoother distribution, $p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T})$, with $T$ the sequence length, recursing backward through it.
These are not the only possible forward and backward recursions, and indeed the classical version of the forward-backward algorithm for HMMs computes variations on both of these quantities.
We start with the filter and smoother, however, to establish the correspondence with Kalman filtering and RTS smoothing in the case of the SSM.
Furthermore, the filter distribution may be useful per se, as (e.g.) for temporal data (as opposed to, say, spatial data) processed in real time, where the future observations required by the smoother are not yet available.
Filtering.
We proceed by induction.
In particular, we assume that we start with the filter distribution at step $t-1$, $p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})$, and show how to advance it one step, to the filter distribution at step $t$.
To derive the recursion, we work with the two graphical-model fragments shown in Fig. 2.5A and Fig. 2.5B. The distributions we have chosen to parameterize these fragments are those associated with the graphical model in Fig. 2.4; that is, they are either factors in the parameterization of the joint distribution in Eq. 2.35, or derivable (we hope) from it. In any case, they provide perfectly legitimate parameterizations for the directed graphical models in Fig. 2.5, although why we have chosen precisely these may not yet be clear.
In particular, the filtering process can be thought of as repeated application of two stages, corresponding to operations on these two model fragments: (a) the time update (marginalization in Fig. 2.5A) and (b) the measurement update (“Bayes inversion” in Fig. 2.5B). Thus the time update is:
TIME UPDATE
Marginalize:
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1}) = \int p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1}, \boldsymbol{y}_{1:t-1})\, p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})\, \mathrm{d}\boldsymbol{x}_{t-1} = \int p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})\, \mathrm{d}\boldsymbol{x}_{t-1}. \tag{2.36}$$
The second equality follows from properties of the graph in Fig. 2.4: given the previous state, the current state is independent of all preceding observations.
Thus we have transformed a (filter) distribution over the previous state, $p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})$, into a (prediction) distribution over the current state, $p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})$.
In the measurement update, we will add another measurement to the filter distribution; that is, we will transform a distribution over $\boldsymbol{x}_t$ conditioned on $\boldsymbol{y}_{1:t-1}$ into one conditioned on $\boldsymbol{y}_{1:t}$.
MEASUREMENT UPDATE
Bayes-invert:
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t}) = \frac{p(\boldsymbol{y}_t | \boldsymbol{x}_t, \boldsymbol{y}_{1:t-1})\, p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})}{p(\boldsymbol{y}_t | \boldsymbol{y}_{1:t-1})} = \frac{p(\boldsymbol{y}_t | \boldsymbol{x}_t)\, p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})}{p(\boldsymbol{y}_t | \boldsymbol{y}_{1:t-1})}. \tag{2.37}$$
Here the second equality again follows from independence statements asserted by the graph in Fig. 2.4: given the current state, the current observation is independent of all past observations.
Thus Eqs. 2.36 and 2.37 together advance the filter distribution from step $t-1$ to step $t$, completing the induction.
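Schematically, then, the filter is nothing more than an alternation of these two operations. Here is a minimal sketch in Python (all names are ours and purely illustrative), with the two half-updates as abstract callables, to be instantiated differently for each of the models below:

```python
def filter_forward(prior, observations, time_update, measurement_update):
    """Generic filter: alternate measurement (Eq. 2.37) and time (Eq. 2.36) updates.

    `prior` is p(x_1); `time_update` marginalizes out the previous state;
    `measurement_update` conditions on the newest observation.
    """
    filters = []
    predicted = prior                    # at step 1, the prior serves as the "prediction"
    for y in observations:
        filtered = measurement_update(predicted, y)  # Bayes inversion, Eq. 2.37
        filters.append(filtered)
        predicted = time_update(filtered)            # marginalization, Eq. 2.36
    return filters
```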
Smoothing.
Again we break the recursion into two stages, corresponding to Fig. 2.5A and Fig. 2.5C, although here we reverse the order of “Bayes inversion” and marginalization.
This time we assume that we start with a smoother distribution at step $t$, $p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T})$, and show how to step backward to a smoother distribution at step $t-1$.
In the first stage, we return to the graphical model fragment used in the time update, Fig. 2.5A, only this time we compute a posterior distribution rather than merely marginalizing. Of course, this “inversion” requires computing the marginal along the way, so we may be able to reuse here some of the computations from the time update. We revisit this point below. Faute de mieux, we call this stage “future conditioning”:
FUTURE CONDITIONING
Bayes-invert:
$$p(\boldsymbol{x}_{t-1} | \boldsymbol{x}_t, \boldsymbol{y}_{1:T}) = p(\boldsymbol{x}_{t-1} | \boldsymbol{x}_t, \boldsymbol{y}_{1:t-1}) = \frac{p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1}, \boldsymbol{y}_{1:t-1})\, p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})}{p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})} = \frac{p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})}{p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})}. \tag{2.38}$$
The final equality follows, as in the time update, from the fact that, given the previous state, the current state is independent of all preceding observations. (The first equality holds for the complementary reason: given the current state, the previous state is independent of all future observations.)
The second and final stage, which we call the “backward step,” marginalizes out the current state, $\boldsymbol{x}_t$, against the smoother distribution at step $t$:
BACKWARD STEP
Marginalize:
$$p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:T}) = \int p(\boldsymbol{x}_{t-1} | \boldsymbol{x}_t, \boldsymbol{y}_{1:T})\, p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T})\, \mathrm{d}\boldsymbol{x}_t, \tag{2.39}$$
yielding the smoother distribution at the previous step ($t-1$) and completing the recursion.
One interesting fact about the smoother is that it does not invoke the emission densities, $p(\boldsymbol{y}_t | \boldsymbol{x}_t)$: the observations enter only implicitly, through the filter distributions.
So much for the recursion; what about the initialization?
The answer is trivial: at step 1, there is no preceding state to marginalize out, so the “prediction” is just the prior, $p(\boldsymbol{x}_1)$, and the filter recursion begins directly with a measurement update; and at step $T$, the smoother distribution simply is the filter distribution, $p(\boldsymbol{x}_T | \boldsymbol{y}_{1:T})$, which initializes the backward recursion.
Marginalizing out the states.
The joint distribution of the observations alone, $p(\boldsymbol{y}_{1:T})$, is obtained by marginalizing the states out of Eq. 2.35; but it can also be built up sequentially, via the chain rule of probability:
$$p(\boldsymbol{y}_{1:T}) = \prod_{t=1}^{T} p(\boldsymbol{y}_t | \boldsymbol{y}_{1:t-1}).$$
And in fact, we are already required to calculate the factors on the right-hand side for the measurement updates, Eq. 2.37:
$$p(\boldsymbol{y}_t | \boldsymbol{y}_{1:t-1}) = \int p(\boldsymbol{y}_t | \boldsymbol{x}_t)\, p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})\, \mathrm{d}\boldsymbol{x}_t.$$
We saw that each measurement update is an “inversion” of Fig. 2.5B, which requires along the way computing the marginal distribution over the current observation, $p(\boldsymbol{y}_t | \boldsymbol{y}_{1:t-1})$—precisely the normalizer in Eq. 2.37. So the data marginal comes essentially for free with filtering: we simply accumulate the normalizers as we go.
Inference over sequences.
This observation suggests an interesting symmetry. Marginalizing and Bayes inverting in all three of the graphical-model fragments in Fig. 2.5 yields six total operations, five of which we have now made use of: The time update and future conditioning correspond, respectively, to marginalization and inversion in Fig. 2.5A. The recursion for the data marginal and the measurement update correspond respectively, as we have just seen, to marginalization and inversion in Fig. 2.5B. On the remaining graph fragment, Fig. 2.5C, the backward step corresponds to marginalization. That leaves inversion on this fragment—what does it compute?
Now consider the following recursion for computing the posterior over an entire sequence, starting again from the chain rule of probability and then applying conditional-independence statements asserted by the graph:
$$p(\boldsymbol{x}_{1:T} | \boldsymbol{y}_{1:T}) = p(\boldsymbol{x}_1 | \boldsymbol{y}_{1:T}) \prod_{t=2}^{T} p(\boldsymbol{x}_t | \boldsymbol{x}_{1:t-1}, \boldsymbol{y}_{1:T}) = p(\boldsymbol{x}_1 | \boldsymbol{y}_{1:T}) \prod_{t=2}^{T} p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1}, \boldsymbol{y}_{1:T}).$$
Each factor in the final product is precisely the “Bayes inversion” of the model fragment in Fig. 2.5C: the answer to our question.
So we see that each of the six combinations of marginalization and Bayes inversion on each of the three unique graphical-model fragments of Fig. 2.5 is a useful computation in its own right.
2.2.1 The hidden Markov model and the α-γ algorithm
We have not as yet specified the conditional distributions parameterizing the graph in Fig. 2.4 or Eq. 2.35. Let us specify that the states are discrete and therefore categorically distributed:
$$p(x_1 = k) = \pi_k, \qquad p(x_t = k \,|\, x_{t-1} = j) = A_{kj},$$
where the vector $\boldsymbol{\pi}$ holds the prior probabilities of the $K$ possible states, and the matrix $\mathbf{A}$ the state-transition probabilities (column $j$ being the distribution over successors of state $j$). The emission densities, $p(\boldsymbol{y}_t | x_t)$, may take any form.
The most important consequence of this choice for the states of the “backbone” is that the filter and smoother distributions are both again categorical distributions, no matter what the form of the emission densities. This is a result of the “pseudo-conjugacy” of the categorical distribution (see Section 2.1.1 above).
Filtering and smoothing.
Now for concreteness, let us consider in turn each of the four half-updates, referring to the corresponding graphical model, as they apply to the HMM. Since the filter distribution is categorical, the joint distribution represented by Fig. 2.5A is a(nother) mixture model. Marginalizing out the class identities in this model—the time update—therefore amounts to an application of Eq. 2.10. Similarly, the joint distribution of the model in Fig. 2.5B is another mixture model. The measurement update amounts to applying Bayes’ theorem to this mixture model—Eq. 2.6.
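To make the pair of half-updates concrete, here is a minimal numpy sketch (function and variable names are ours, not fixed by anything above): `A[k, j]` stores $p(x_t = k \,|\, x_{t-1} = j)$, and `likelihoods[t, k]` stores $p(\boldsymbol{y}_t \,|\, x_t = k)$, leaving the form of the emission density open. The sketch also accumulates the normalizers, which, as noted above, assemble the data marginal for free:

```python
import numpy as np

def hmm_filter(pi, A, likelihoods):
    """Categorical filtering: alternate the measurement (Eq. 2.37) and time
    (Eq. 2.36) updates. Returns (T, K) filter distributions and log p(y_1:T).
    """
    T, K = likelihoods.shape
    filters = np.zeros((T, K))
    log_marginal = 0.0
    predicted = pi                                 # prior doubles as the first prediction
    for t in range(T):
        unnormalized = likelihoods[t] * predicted  # numerator of Eq. 2.37
        normalizer = unnormalized.sum()            # p(y_t | y_1:t-1)
        filters[t] = unnormalized / normalizer     # measurement update
        log_marginal += np.log(normalizer)         # accumulate the data marginal
        predicted = A @ filters[t]                 # time update, Eq. 2.36
    return filters, log_marginal
```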
The α-γ algorithm.
What if we are not actually interested in the filter distribution, that is, if all we want at the end of the day are the smoother distributions, $p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T})$?
Consider, then, calculating only the numerator in Eq. 2.37:
$$\begin{aligned}
&p(\boldsymbol{y}_t | \boldsymbol{x}_t)\, p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1}) \\
&\quad= p(\boldsymbol{y}_t | \boldsymbol{x}_t) \sum_{\boldsymbol{x}_{t-1}} p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1}) \\
&\quad\propto p(\boldsymbol{y}_t | \boldsymbol{x}_t) \sum_{\boldsymbol{x}_{t-1}} p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, p(\boldsymbol{x}_{t-1}, \boldsymbol{y}_{1:t-1}),
\end{aligned}$$
where on the second line we have substituted in the time update, Eq. 2.36, and on the third multiplied through by $p(\boldsymbol{y}_{1:t-1})$, a constant with respect to the states. Multiplying the first line by the same constant turns it into $p(\boldsymbol{y}_t | \boldsymbol{x}_t)\, p(\boldsymbol{x}_t, \boldsymbol{y}_{1:t-1}) = p(\boldsymbol{x}_t, \boldsymbol{y}_{1:t})$, so the whole computation can be written as the recursion
$$\alpha_t(\boldsymbol{x}_t) = p(\boldsymbol{y}_t | \boldsymbol{x}_t) \sum_{\boldsymbol{x}_{t-1}} p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, \alpha_{t-1}(\boldsymbol{x}_{t-1}), \tag{2.41}$$
with the aid of the definition $\alpha_t(\boldsymbol{x}_t) := p(\boldsymbol{x}_t, \boldsymbol{y}_{1:t})$.
This is the well-known α recursion: a forward pass that propagates unnormalized filter distributions, dispensing with the normalizers $p(\boldsymbol{y}_t | \boldsymbol{y}_{1:t-1})$ altogether.
It remains to show that the smoother algorithm can be adjusted to work with $\alpha_t(\boldsymbol{x}_t)$ in place of the filter distribution. To that end, we define the smoother distribution $\gamma_t(\boldsymbol{x}_t) := p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T})$ and combine the “future conditioning” and “backward step” into a single recursion. We break it into two pieces to show where the smoother distribution over pairs of states is calculated:
$$\begin{aligned}
p(\boldsymbol{x}_{t-1}, \boldsymbol{x}_t | \boldsymbol{y}_{1:T})
&= \frac{p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:t-1})}{p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1})}\, \gamma_t(\boldsymbol{x}_t) \\
&= \frac{p(\boldsymbol{x}_t | \boldsymbol{x}_{t-1})\, \alpha_{t-1}(\boldsymbol{x}_{t-1})\, p(\boldsymbol{y}_t | \boldsymbol{x}_t)}{\alpha_t(\boldsymbol{x}_t)}\, \gamma_t(\boldsymbol{x}_t); \\
\gamma_{t-1}(\boldsymbol{x}_{t-1}) &= \sum_{\boldsymbol{x}_t} p(\boldsymbol{x}_{t-1}, \boldsymbol{x}_t | \boldsymbol{y}_{1:T}).
\end{aligned}$$
On the second line we have simply substituted in the formula for the alpha recursion, Eq. 2.41.
This avoids recalculating the normalizer, as noted above—but at the price of reintroducing a direct dependency on the observations through the emission probabilities, $p(\boldsymbol{y}_t | \boldsymbol{x}_t)$; recall that the smoother of Eqs. 2.38 and 2.39 required no such access.
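Correspondingly, a sketch of the backward pass. For numerical comfort it works with the normalized filter distributions returned by `hmm_filter` rather than with raw α’s; mathematically it implements the same future-conditioning and backward steps, Eqs. 2.38 and 2.39:

```python
import numpy as np

def hmm_smooth(filters, A):
    """Backward recursion for the categorical case, Eqs. 2.38-2.39.

    filters : (T, K) output of hmm_filter
    A       : (K, K) transitions, A[k, j] = p(x_t = k | x_{t-1} = j)
    """
    T, K = filters.shape
    smoothers = np.zeros((T, K))
    smoothers[-1] = filters[-1]               # initialization: smoother = filter at T
    for t in range(T - 1, 0, -1):
        predicted = A @ filters[t - 1]        # p(x_t | y_1:t-1)
        # Future conditioning (Eq. 2.38): reversed[j, k] = p(x_{t-1}=j | x_t=k, y_1:t-1)
        reversed_kernel = (A * filters[t - 1]).T / predicted[None, :]
        # Backward step (Eq. 2.39): marginalize x_t against the smoother at step t
        smoothers[t - 1] = reversed_kernel @ smoothers[t]
    return smoothers
```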
Computational complexity of HMM filtering and smoothing.
Each time update and each future-conditioning step couples every pair of states, so the forward and backward passes each cost $O(K^2)$ operations per step, or $O(K^2 T)$ for a length-$T$ sequence with $K$ states; the measurement updates add only $O(KT)$. Storage is likewise modest: the filter distributions (or the $\alpha_t$), $O(KT)$ numbers in all, must be retained for the backward pass.
2.2.2 State-space models, the Kalman filter, and the RTS smoother
If the graphical model of the previous section is parameterized with a jointly Gaussian distribution over all the variables, a different, but equally tractable, and equally popular, model results: the state-space model. More specifically, the model can be interpreted as a discrete-time, linear, time-invariant (LTI) system, with additive Gaussian noise on the state transitions and observations:
$$\boldsymbol{x}_t = \mathbf{A}\boldsymbol{x}_{t-1} + \mathbf{B}\boldsymbol{u}_t + \boldsymbol{w}_t, \quad \boldsymbol{w}_t \sim \mathcal{N}(\boldsymbol{0}, \mathbf{\Sigma}_x); \qquad \boldsymbol{y}_t = \mathbf{C}\boldsymbol{x}_t + \boldsymbol{v}_t, \quad \boldsymbol{v}_t \sim \mathcal{N}(\boldsymbol{0}, \mathbf{\Sigma}_y).$$
To connect with the historical literature, we have allowed the time evolution of the latent state to depend linearly on some inputs or controls, $\boldsymbol{u}_t$, which are assumed to be observed; they can simply be omitted when absent.
As in the hidden Markov model, inference in the state-space model requires a forward sweep, followed by a backward sweep, through ordered samples (space or time).
Since we have presented the four half-updates, Eqs. 2.36 through 2.39, without committing to any particular distributions, the same four equations define filtering and smoothing in the SSM.
Differences arise only when we make precise how these steps are to be implemented. That is, in the case of the SSM, applying these four equations amounts to a series of marginalizations and “Bayes inversions” with jointly Gaussian random variables. Fortunately, as we have seen in Section 2.1.2, marginalization and Bayes inversion for jointly Gaussian random variables yield elegant closed-form expressions. Moreover, since the set of jointly Gaussian random variables is closed under these operations, all random variables of interest remain Gaussian throughout. Thus keeping track of filter and smoother distributions amounts (merely) to keeping track of mean vectors and covariance matrices; and inference amounts to recursive application of certain matrix-vector operations: for marginal cumulants, Eq. 2.26; for posterior cumulants, Eq. 2.27; and for the cross-covariance matrix, Eq. 2.19. We discuss their computational complexity below. These algorithms go by the famous names of the Kalman filter and the Rauch-Tung-Striebel smoother. (The term “Kalman smoother” is sometimes encountered in the literature, but is technically a misattribution.)
Filtering.
Recall again that the goal is to derive the “filtering” equation, i.e. a recursion for the posterior distribution over the current state given the observations up to the current step. We give names to its mean and covariance:
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t}) =: \mathcal{N}\big(\hat{\boldsymbol{x}}_{t|t},\; \mathbf{P}_{t|t}\big), \tag{2.45}$$
and likewise to the posterior mean and covariance given the observations only up to the previous step:
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1}) =: \mathcal{N}\big(\hat{\boldsymbol{x}}_{t|t-1},\; \mathbf{P}_{t|t-1}\big). \tag{2.46}$$
We hope that these are the only parameters we ever need to keep track of. To show that they are, we proceed by induction.
We begin with the induction step, which amounts to a measurement update followed by a time update.
Assume that the one-step prediction distribution at step $t$, Eq. 2.46, is indeed normal, with known cumulants $\hat{\boldsymbol{x}}_{t|t-1}$ and $\mathbf{P}_{t|t-1}$. Bayes inversion of the linear-Gaussian emission against this prior, Eq. 2.27, then yields the filter cumulants:
Measurement Update
$$\begin{aligned}
\mathbf{K}_t &:= \mathbf{P}_{t|t-1}\mathbf{C}^\top\big(\mathbf{C}\mathbf{P}_{t|t-1}\mathbf{C}^\top + \mathbf{\Sigma}_y\big)^{-1}, \\
\hat{\boldsymbol{x}}_{t|t} &= \hat{\boldsymbol{x}}_{t|t-1} + \mathbf{K}_t\big(\boldsymbol{y}_t - \mathbf{C}\hat{\boldsymbol{x}}_{t|t-1}\big), \\
\mathbf{P}_{t|t} &= \mathbf{P}_{t|t-1} - \mathbf{K}_t\mathbf{C}\mathbf{P}_{t|t-1}.
\end{aligned} \tag{2.47}$$
Let us interpret the cumulants.
The update to our existing estimate of the current hidden state, $\hat{\boldsymbol{x}}_{t|t-1}$, is proportional to the innovation, $\boldsymbol{y}_t - \mathbf{C}\hat{\boldsymbol{x}}_{t|t-1}$: the difference between the observation we actually made and the observation we expected to make. The “constant” of proportionality, the Kalman gain $\mathbf{K}_t$, determines how heavily this prediction error is weighted.
We expect the covariance, for its part, to shrink when we add new information.
The more precisely the emission reflects the hidden state (that is, the smaller the emission noise $\mathbf{\Sigma}_y$ relative to our prediction uncertainty $\mathbf{P}_{t|t-1}$), the larger the Kalman gain, and the more the covariance shrinks in Eq. 2.47.
We turn to the time update, which moves our prediction from step $t$ to step $t+1$, marginalizing out the current state against the state dynamics and applying Eq. 2.26 for the marginal cumulants:
Time Update
$$\begin{aligned}
\hat{\boldsymbol{x}}_{t+1|t} &= \mathbf{A}\hat{\boldsymbol{x}}_{t|t} + \mathbf{B}\boldsymbol{u}_{t+1}, \\
\mathbf{P}_{t+1|t} &= \mathbf{A}\mathbf{P}_{t|t}\mathbf{A}^\top + \mathbf{\Sigma}_x.
\end{aligned} \tag{2.48}$$
Thus, our best estimate of the next state (that is, before seeing the next observation) is just our current estimate, passed through the deterministic portion of the state dynamics.
We cannot quite say, however, that the uncertainty grows, even though we are not incorporating any new observations and the state transition is noisy: the covariance is not merely inflated by the state noise, $\mathbf{\Sigma}_x$; it is first reshaped by the dynamics, as $\mathbf{A}\mathbf{P}_{t|t}\mathbf{A}^\top$. If the underlying dynamical system is strictly stable, i.e. all its eigenvalues are within the unit circle, then the covariance tends to contract under the dynamics, and this contraction can outweigh the inflation by $\mathbf{\Sigma}_x$.
To complete the induction, we need to show that the one-step prediction distribution at the first step is normal. It is, as long as we define the prior distribution over the initial state to be normal,
$$p(\boldsymbol{x}_1) =: \mathcal{N}\big(\hat{\boldsymbol{x}}_{1|0},\; \mathbf{P}_{1|0}\big), \tag{2.49}$$
so that Eq. 2.47 gives the measurement update for all time steps. Now the time update, Eq. 2.48, and the measurement update, Eq. 2.47, completely define the filter.
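Collected into code, Eqs. 2.47-2.49 yield the complete filter. A numpy sketch under the notation above (argument names are ours); the one-step predictions are returned as well, since, as will become clear below, the smoother reuses them:

```python
import numpy as np

def kalman_filter(y, u, A, B, C, Sigma_x, Sigma_y, x_init, P_init):
    """Kalman filter: alternate Eqs. 2.47 (measurement) and 2.48 (time) updates.

    y : (T, Ny) observations;  u : (T, Nu) inputs (zeros if there are none)
    x_init, P_init : cumulants of the normal prior over the first state, Eq. 2.49
    Returns filter cumulants and the one-step predictions (reused by the smoother).
    """
    T, Nx = len(y), len(x_init)
    x_filt, P_filt = np.zeros((T, Nx)), np.zeros((T, Nx, Nx))
    x_pred, P_pred = np.zeros((T, Nx)), np.zeros((T, Nx, Nx))
    x_pred[0], P_pred[0] = x_init, P_init
    for t in range(T):
        # Measurement update, Eq. 2.47
        K = P_pred[t] @ C.T @ np.linalg.inv(C @ P_pred[t] @ C.T + Sigma_y)
        x_filt[t] = x_pred[t] + K @ (y[t] - C @ x_pred[t])   # innovation, weighted by K
        P_filt[t] = P_pred[t] - K @ C @ P_pred[t]
        # Time update, Eq. 2.48
        if t + 1 < T:
            x_pred[t + 1] = A @ x_filt[t] + B @ u[t + 1]
            P_pred[t + 1] = A @ P_filt[t] @ A.T + Sigma_x
    return x_filt, P_filt, x_pred, P_pred
```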
Smoothing.
Recall the goal: to get, for all $t$, the cumulants of the smoother distribution,
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T}) =: \mathcal{N}\big(\hat{\boldsymbol{x}}_{t|T},\; \mathbf{P}_{t|T}\big). \tag{2.50}$$
In deriving our algorithm, it will also be helpful to keep track of the “future-conditioned smoother,” $p(\boldsymbol{x}_t | \boldsymbol{x}_{t+1}, \boldsymbol{y}_{1:T}) = p(\boldsymbol{x}_t | \boldsymbol{x}_{t+1}, \boldsymbol{y}_{1:t})$, and in particular its cumulants,
$$\tilde{\boldsymbol{x}}_t := \mathbb{E}\big[\boldsymbol{x}_t \mid \boldsymbol{x}_{t+1}, \boldsymbol{y}_{1:t}\big], \tag{2.51}$$
$$\tilde{\mathbf{P}}_t := \operatorname{Cov}\big[\boldsymbol{x}_t \mid \boldsymbol{x}_{t+1}, \boldsymbol{y}_{1:t}\big]. \tag{2.52}$$
Notice that the “future-conditioned” mean is a random variable, because it depends on the random variable $\boldsymbol{x}_{t+1}$; the future-conditioned covariance, in contrast, does not, by the usual properties of Gaussian conditioning.
Since the Kalman filter leaves us with the cumulants of the final filter distribution, $\hat{\boldsymbol{x}}_{T|T}$ and $\mathbf{P}_{T|T}$, which are also the cumulants of the final smoother distribution, we work backward from the end of the sequence.
Beginning with “future conditioning,” Eq. 2.38, we “Bayes invert” the filter distribution, $p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t})$, against the state-transition density, applying Eq. 2.27 for the posterior cumulants:
Future Conditioning
$$\begin{aligned}
\mathbf{J}_t &:= \mathbf{P}_{t|t}\mathbf{A}^\top\big(\mathbf{A}\mathbf{P}_{t|t}\mathbf{A}^\top + \mathbf{\Sigma}_x\big)^{-1} &&= \mathbf{P}_{t|t}\mathbf{A}^\top\,\mathbf{P}_{t+1|t}^{-1}, \\
\tilde{\boldsymbol{x}}_t &= \hat{\boldsymbol{x}}_{t|t} + \mathbf{J}_t\big(\boldsymbol{x}_{t+1} - \mathbf{A}\hat{\boldsymbol{x}}_{t|t} - \mathbf{B}\boldsymbol{u}_{t+1}\big) &&= \hat{\boldsymbol{x}}_{t|t} + \mathbf{J}_t\big(\boldsymbol{x}_{t+1} - \hat{\boldsymbol{x}}_{t+1|t}\big), \\
\tilde{\mathbf{P}}_t &= \mathbf{P}_{t|t} - \mathbf{J}_t\mathbf{A}\mathbf{P}_{t|t} &&= \mathbf{P}_{t|t} - \mathbf{J}_t\mathbf{P}_{t+1|t}\mathbf{J}_t^\top.
\end{aligned}$$
For each quantity, we have provided an alternative expression involving the cumulants of the time-updated filter distribution, Eq. 2.48. The fact that this is possible reflects the fact that future conditioning is an inversion of the same model in which the time update is a marginalization, Fig. 2.5A; and inversion requires the computation of the marginal distribution. Since running the smoother requires first running the filter, we have necessarily already computed these cumulants at this point, although we have not necessarily saved them: the smoother requires the filter only at the final time step. But we can save computation here at the price of those storage costs.
We conclude with the “backward step” of the smoother, Eq. 2.39, marginalizing out $\boldsymbol{x}_{t+1}$ against the smoother distribution at step $t+1$ and applying Eq. 2.26 for the marginal cumulants:
Backward Step
$$\begin{aligned}
\hat{\boldsymbol{x}}_{t|T} &= \hat{\boldsymbol{x}}_{t|t} + \mathbf{J}_t\big(\hat{\boldsymbol{x}}_{t+1|T} - \hat{\boldsymbol{x}}_{t+1|t}\big), \\
\mathbf{P}_{t|T} &= \mathbf{P}_{t|t} + \mathbf{J}_t\big(\mathbf{P}_{t+1|T} - \mathbf{P}_{t+1|t}\big)\mathbf{J}_t^\top.
\end{aligned}$$
Let us try to interpret these equations. Notice that they relate the means and the covariances of four different posterior distributions. Essentially, we are updating the smoother (stepping backwards in time) by updating a smoother-filter disparity:
$$\hat{\boldsymbol{x}}_{t|T} - \hat{\boldsymbol{x}}_{t|t} = \mathbf{J}_t\big(\hat{\boldsymbol{x}}_{t+1|T} - \hat{\boldsymbol{x}}_{t+1|t}\big), \qquad \mathbf{P}_{t|T} - \mathbf{P}_{t|t} = \mathbf{J}_t\big(\mathbf{P}_{t+1|T} - \mathbf{P}_{t+1|t}\big)\mathbf{J}_t^\top.$$
Both disparities measure how much an estimate changes when we do without the future observations, from step $t+1$ onward: on the right, the estimate of the state at step $t+1$; on the left, the estimate of the state at step $t$.
We emphasize that both filter estimates, $\hat{\boldsymbol{x}}_{t|t}$ and $\hat{\boldsymbol{x}}_{t+1|t}$ (along with their covariances), were already computed on the forward pass; the backward pass supplies only the smoother quantities at step $t+1$.
Apparently, then, the gain $\mathbf{J}_t$ translates a disparity at step $t+1$ into a disparity at step $t$, propagating information from the future backward through the sequence.
Similarly, taking into account the future observations shifts, or rather shrinks, the standard deviations of the posterior covariances of the states: since the smoother covariance at step $t+1$ can never exceed the predicted covariance there, the backward step only subtracts (in the positive-semidefinite sense) from the filter covariance, so that $\mathbf{P}_{t|T} \preceq \mathbf{P}_{t|t}$.
Finally, to completely characterize the posterior distribution over the states, we also require the cross covariance between adjacent states, $\operatorname{Cov}\big[\boldsymbol{x}_t, \boldsymbol{x}_{t+1} \mid \boldsymbol{y}_{1:T}\big]$.
Once more, we note that the relevant marginal distributions are normal, so their joint is as well, and we can apply Eq. 2.19:
$$\operatorname{Cov}\big[\boldsymbol{x}_t, \boldsymbol{x}_{t+1} \mid \boldsymbol{y}_{1:T}\big] = \mathbf{J}_t\,\mathbf{P}_{t+1|T}.$$
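The smoother is then a single backward pass over the stored filter quantities. A sketch, continuing the conventions of `kalman_filter`:

```python
import numpy as np

def rts_smooth(A, x_filt, P_filt, x_pred, P_pred):
    """Rauch-Tung-Striebel smoother: one backward pass over the filter's output.

    Takes the one-step predictions saved by kalman_filter, per the discussion of
    future conditioning. Returns smoother cumulants and adjacent cross-covariances.
    """
    T, Nx = x_filt.shape
    x_smooth, P_smooth = x_filt.copy(), P_filt.copy()       # at step T: smoother = filter
    P_cross = np.zeros((T - 1, Nx, Nx))
    for t in range(T - 2, -1, -1):
        J = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t + 1])  # gain, from future conditioning
        # Backward step: propagate the smoother-filter disparities backward
        x_smooth[t] = x_filt[t] + J @ (x_smooth[t + 1] - x_pred[t + 1])
        P_smooth[t] = P_filt[t] + J @ (P_smooth[t + 1] - P_pred[t + 1]) @ J.T
        P_cross[t] = J @ P_smooth[t + 1]    # Cov[x_t, x_{t+1} | y_1:T], via Eq. 2.19
    return x_smooth, P_smooth, P_cross
```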
Implementation and computational complexity.
Let us consider first a detail of the implementation of the smoother.
We noted above that the “future-conditioned” mean, $\tilde{\boldsymbol{x}}_t$, is a random variable, depending as it does on $\boldsymbol{x}_{t+1}$; accordingly, it is never computed explicitly. All the backward step actually requires is the gain, $\mathbf{J}_t$, which can either be recomputed from stored filter covariances or saved during the forward pass, trading storage for computation (as noted in the discussion of future conditioning above).
Next, consider the case where the observation space is (much) higher-dimensional than the latent space. This occurs, for example, in neural decoders, where the firing rates of hundreds of neurons are observed, but are thought to represent only some low-dimensional dynamical system. When implementing filter and smoother algorithms for such data, we would have the matrix inversions occur in the latent space rather than the observation space; the matrix-inversion lemma supplies the requisite rewriting of the Kalman gain.
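For instance, assuming the emission-noise precision (`Sigma_y_inv` below) is cheap to obtain (say, because the noise covariance is diagonal), the matrix-inversion lemma confines the cubic-cost inversions to the latent dimension. A hypothetical sketch of the resulting measurement update:

```python
import numpy as np

def measurement_update_latent(x_pred, P_pred, y, C, Sigma_y_inv):
    """Measurement update with the inversions in the (small) latent space.

    Equivalent to Eq. 2.47 by the matrix-inversion lemma: given Sigma_y_inv, the
    emission-noise precision, the only cubic-cost inversions are Nx x Nx,
    independent of the observation dimension Ny >> Nx.
    """
    CtRi = C.T @ Sigma_y_inv                                   # Nx x Ny
    P_filt = np.linalg.inv(np.linalg.inv(P_pred) + CtRi @ C)   # posterior covariance
    K = P_filt @ CtRi                                          # gain, Woodbury form
    x_filt = x_pred + K @ (y - C @ x_pred)
    return x_filt, P_filt
```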
Sampling-based approaches
The tractability—and elegance—of the Kalman filter and RTS smoother result from the adoption of linear-Gaussian dynamics and observations. What do we do if these assumptions are inappropriate? In some cases, it may still be possible to derive closed-form update equations; but there is no reason in general why the two key operations, marginalization and especially “Bayes inversion,” should have such solutions. In particular, if the prior distribution in each inversion is not conjugate to the emission likelihood, i.e., if it does not under Bayes’s rule yield a posterior in the same family, then recursive update equations are unlikely to be derivable.
An alternative is to filter and smooth approximately with samples, a technique known as particle filtering and smoothing.
We review here a simple version that corresponds closely to the exact inference algorithms just discussed (we turn to the broader picture at the end of this section).
It can be summarized picturesquely in terms of one of the original applications of the Kalman filter, tracking airplanes via radar.
Essentially, Kalman used the fact that airplanes do not jump around to reject (filter out) noise in the radar readings, computing the (exact) posterior distribution over the dynamical state of the airplane given all the observations (the smoother), or all until the present time (the filter)—assuming the linear-Gaussian assumptions hold.
In the sampling-based approach, by contrast, one generates 10,000 (imaginary) airplanes, and propagates forward those that are most consistent with the observations at that time step.
Then the process repeats.
More precisely, one draws $N$ particles from the prior distribution over the initial state; at each step, each particle is weighted by the probability it assigns to the current observation, and the population is then resampled according to these weights and propagated through the state-transition distribution. The updates below make this precise.
This description of the procedure may sound very different from those given above for the exact inference algorithms, but in fact the particle filter differs in only one of the four half-updates from an exact inference algorithm—filtering and smoothing in the HMM. The similarity to the HMM arises from the fact that sample representations of distributions can be interpreted as categorical distributions, with one category for each sample or particle. Indeed, our filter and smoother distributions will assign a “weight” to each particle, which corresponds to the probability of that “category.” Thus we can write:
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t}) \approx \sum_{n=1}^{N} w_t^{(n)}\, \delta\big(\boldsymbol{x}_t - \boldsymbol{x}_t^{(n)}\big), \qquad p(\boldsymbol{x}_t | \boldsymbol{y}_{1:T}) \approx \sum_{n=1}^{N} v_t^{(n)}\, \delta\big(\boldsymbol{x}_t - \boldsymbol{x}_t^{(n)}\big). \tag{2.56}$$
Some care must be taken with this interpretation, however, because the numerical values associated with each “side of the die” (the particle locations $\boldsymbol{x}_t^{(n)}$) change from step to step, and from run to run of the algorithm, unlike the fixed support of a true categorical distribution.
The time update, then—marginalization in Fig. 2.5A—is executed by sampling from the joint distribution in the graphical-model fragment, and retaining only the samples of the current state, $\boldsymbol{x}_t$ (discarding those of $\boldsymbol{x}_{t-1}$):
Time Update
$$\boldsymbol{x}_t^{(n)} \sim \sum_{m=1}^{N} w_{t-1}^{(m)}\, p\big(\boldsymbol{x}_t \,\big|\, \boldsymbol{x}_{t-1}^{(m)}\big), \quad n = 1, \ldots, N; \qquad p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t-1}) \approx \frac{1}{N} \sum_{n=1}^{N} \delta\big(\boldsymbol{x}_t - \boldsymbol{x}_t^{(n)}\big). \tag{2.58}$$
In practice we do this by ancestral sampling: for each of the $N$ new particles, first draw a parent according to the weights $w_{t-1}^{(m)}$, and then draw the particle itself from the transition distribution conditioned on that parent.
Notice that this is identical to the filter distribution in Eq. 2.56 except that the weights are the same for all our particles.
This is the only half-update in which we sample. In all the others, we simply follow the inference procedure for an HMM. This is possible because, as we have just seen for the filter, the posterior distributions are categorical throughout. Thus in the measurement update, we invert the mixture model in Fig. 2.5B by applying Eq. 2.6 to Eq. 2.58:
Measurement Update
$$p(\boldsymbol{x}_t | \boldsymbol{y}_{1:t}) \approx \sum_{n=1}^{N} \frac{p\big(\boldsymbol{y}_t \,\big|\, \boldsymbol{x}_t^{(n)}\big)}{\sum_{m=1}^{N} p\big(\boldsymbol{y}_t \,\big|\, \boldsymbol{x}_t^{(m)}\big)}\, \delta\big(\boldsymbol{x}_t - \boldsymbol{x}_t^{(n)}\big) =: \sum_{n=1}^{N} w_t^{(n)}\, \delta\big(\boldsymbol{x}_t - \boldsymbol{x}_t^{(n)}\big),$$
where the definition in the final line guarantees consistency with Eq. 2.56 for all time.
To keep the derivation clean, we simply assert (here and below) the support of this distribution to be the set of particles generated in the preceding time update, rather than appending a delta distribution for each particle and summing over all of them.
The weights in this distribution are quite intuitive: they measure how consistent each particle is with the current observation, $\boldsymbol{y}_t$, as scored by the emission density and normalized across the population of particles.
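Assembled, these two half-updates give the classic “bootstrap” particle filter. In this sketch the model enters through three assumed callables, `sample_prior`, `sample_transition`, and `likelihood` (all names ours):

```python
import numpy as np

def particle_filter(y, sample_prior, sample_transition, likelihood, N=10_000, seed=0):
    """Bootstrap particle filter: Eq. 2.58 (time update) plus the weights above.

    sample_prior(N)          -> (N, Nx) draws from p(x_1)
    sample_transition(parts) -> (N, Nx) row-wise draws from p(x_t | x_{t-1})
    likelihood(y_t, parts)   -> (N,) values of p(y_t | x_t^(n))
    Returns the particles and filter weights of Eq. 2.56 for every step.
    """
    rng = np.random.default_rng(seed)
    particles, weights = [], []
    parts = sample_prior(N)                 # at step 1, the prior is the "prediction"
    for y_t in y:
        w = likelihood(y_t, parts)          # measurement update: weight each particle
        w = w / w.sum()                     #   by its consistency with y_t
        particles.append(parts)
        weights.append(w)
        parents = parts[rng.choice(N, size=N, p=w)]   # resample parents by weight, then
        parts = sample_transition(parents)            #   propagate: time update, Eq. 2.58
    return particles, weights
```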
Our smoother will be identical to the HMM smoother. In the “future-conditioning” step, we invert the mixture model in Fig. 2.5A, again applying Eq. 2.6 but now to the filtering distribution in Eq. 2.56:
Future Conditioning
$$p\big(\boldsymbol{x}_{t-1} = \boldsymbol{x}_{t-1}^{(m)} \,\big|\, \boldsymbol{x}_t, \boldsymbol{y}_{1:t-1}\big) = \frac{p\big(\boldsymbol{x}_t \,\big|\, \boldsymbol{x}_{t-1}^{(m)}\big)\, w_{t-1}^{(m)}}{\sum_{l=1}^{N} p\big(\boldsymbol{x}_t \,\big|\, \boldsymbol{x}_{t-1}^{(l)}\big)\, w_{t-1}^{(l)}}. \tag{2.60}$$
We note that this can be interpreted as another categorical distribution, although we forbear assigning a symbol to the class probabilities. Finally, in the “backward step,” we marginalize the mixture model in Fig. 2.5C with Eq. 2.10, treating this future-conditioned distribution (Eq. 2.60) as the emission, and the previous smoother density (see Eq. 2.56) as the prior distribution:
Backward Step
$$p(\boldsymbol{x}_{t-1} | \boldsymbol{y}_{1:T}) \approx \sum_{m=1}^{N} \Bigg[ \sum_{n=1}^{N} v_t^{(n)}\, \frac{p\big(\boldsymbol{x}_t^{(n)} \,\big|\, \boldsymbol{x}_{t-1}^{(m)}\big)\, w_{t-1}^{(m)}}{\sum_{l=1}^{N} p\big(\boldsymbol{x}_t^{(n)} \,\big|\, \boldsymbol{x}_{t-1}^{(l)}\big)\, w_{t-1}^{(l)}} \Bigg] \delta\big(\boldsymbol{x}_{t-1} - \boldsymbol{x}_{t-1}^{(m)}\big) =: \sum_{m=1}^{N} v_{t-1}^{(m)}\, \delta\big(\boldsymbol{x}_{t-1} - \boldsymbol{x}_{t-1}^{(m)}\big).$$
The definition in the final line guarantees that the smoother distribution satisfies Eq. 2.56 for all time, as long as it is initialized properly.
And indeed, it is also obvious from Eq. 2.56 how to initialize: at the final step, the smoother and filter distributions coincide, so we simply set $v_T^{(n)} = w_T^{(n)}$.
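Finally, a sketch of the corresponding backward pass, which reweights the stored particles without moving them; note that building the future-conditioning table makes each backward step cost $O(N^2)$:

```python
import numpy as np

def particle_smooth(particles, weights, transition_density):
    """Backward reweighting of stored particles, per the backward step above.

    transition_density(x_t, parts) -> (N,) values of p(x_t | x_{t-1}^(m)) for all m
    Returns the smoother weights v of Eq. 2.56; particle locations are unchanged.
    """
    T = len(particles)
    v = [None] * T
    v[-1] = weights[-1]                       # initialization: v_T = w_T
    for t in range(T - 1, 0, -1):
        v_prev = np.zeros_like(weights[t - 1])
        for n in range(len(particles[t])):    # future conditioning, Eq. 2.60 ...
            f = transition_density(particles[t][n], particles[t - 1]) * weights[t - 1]
            v_prev += v[t][n] * f / f.sum()   # ... then marginalize: the backward step
        v[t - 1] = v_prev
    return v
```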
A more general approach to particle filtering, sequential importance (re)sampling, allows the particles to be proposed from distributions other than the state-transition density and reweighted accordingly; the algorithm sketched above is the special case known as the bootstrap filter [Doucet2012].