2.2 Dynamical models
Consider the graphical model in Fig. LABEL:fig:HMM. Here we finally encounter some interesting statistical-independence structure among the variables. One interpretation is that we have dropped the i.i.d. assumption: the plates that we saw in Figs. 2.1 and 2.3 have been replaced with explicit representation of the source-emission replicates, because now "adjacent" sources are dependent on each other. The joint distribution still factorizes, but in a more complicated way, with individual factors linking together pairs of random variables:

$$p(x_1, \ldots, x_T,\, y_1, \ldots, y_T) = \prod_{t=1}^{T} p(x_t \mid x_{t-1})\, p(y_t \mid x_t). \tag{2.35}$$
(A few simple remarks on the notation: (1) There is no $x_0$, but we define $p(x_1 \mid x_0) \doteq p(x_1)$, the initial prior distribution. This allows for a more compact expression of the joint. (2) The joint has been written evaluated at instantiations of $y_1, \ldots, y_T$ to reflect the assumption that they have been observed. (3) The dependence on parameters has been suppressed to reduce clutter.)
Alternatively, we can interpret Fig. LABEL:fig:HMM as showing structure internal to each of the source and emission variables, in which case the entire graph could be wrapped in a plate, and Eq. 2.35 would represent just one sample sequence of many. Within each sequence, we assume that the data were "emitted" by some process with Markov dynamics. We will prefer the second interpretation because it allows for multiple, i.i.d. observation sequences, a scenario typically encountered in practice.
Note the tree structure in Fig. LABEL:fig:HMM. It implies that inference—although substantially more complicated than that for the models encountered previously, which was essentially a single application of Bayes's rule—is still only a straightforward application of the sum-product algorithm. Making that connection precise is, however, left as an exercise for the reader. Here we derive, and show the connections between, the classical inference algorithms for the two main models for which this graph structure holds: the "forward-backward" algorithm for the hidden Markov model; and the Kalman filter and Rauch-Tung-Striebel smoother for the state-space model (SSM).
We begin without specializing to either of these models. The general strategy will be to try to derive recursive procedures. More specifically, we will compute the filter distribution, $p(x_t \mid y_{1:t})$, where $y_{1:t} \doteq (y_1, \ldots, y_t)$, i.e. the distribution over the hidden state given all the observations up to and including step $t$, in a forward pass. Then in a backward pass starting from sample $T$, we will compute the smoother distributions, $p(x_t \mid y_{1:T})$ and $p(x_t, x_{t+1} \mid y_{1:T})$: the distributions over hidden states given all the observations sans phrase. We derive the smoother for pairs of adjacent states as well as single states in anticipation of their usefulness in learning. In particular, it is intuitively clear that it would be impossible to learn state-transition probabilities with only distributions over single states.
These are not the only possible forward and backward recursions, and indeed the classical version of the forward-backward algorithm for HMMs computes variations on both of these quantities. We start with the filter and smoother, however, to establish the correspondence with Kalman filtering and RTS smoothing in the case of the SSM. Furthermore, the filter distribution may be useful per se, as (e.g.) for temporal data ($t$ indexing the time of arrival) that need to be processed on-line. Finally, for compactness, all marginalizations in the development below are written as sums (as for the HMM) rather than integrals (as for the SSM).
Filtering.
We proceed by induction. In particular, we assume that we start with the filter distribution at step $t-1$, $p(x_{t-1} \mid y_{1:t-1})$, and show how to update it to $p(x_t \mid y_{1:t})$. We will handle the base case at the end.
To derive the recursion, we work with the two graphical-model fragments shown in Fig. 2.5A and Fig. 2.5B. The distributions we have chosen to parameterize these fragments are those associated with the graphical model in Fig. LABEL:fig:HMM; that is, they are either factors in the parameterization of the joint distribution in Eq. 2.35, or derivable (we hope) from it. In any case, they provide perfectly legitimate parameterizations for the directed graphical models in Fig. 2.5, although why we have chosen precisely these may not yet be clear.
In particular, the filtering process can be thought of as repeated application of two stages, corresponding to operations on these two model fragments: (a) the time update (marginalization in Fig. 2.5A) and (b) the measurement update (“Bayes inversion” in Fig. 2.5B). Thus the time update is:
Time Update (Marginalize):
$$p(x_t \mid y_{1:t-1}) = \sum_{x_{t-1}} p(x_t \mid x_{t-1}, y_{1:t-1})\, p(x_{t-1} \mid y_{1:t-1}) = \sum_{x_{t-1}} p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1}). \tag{2.36}$$
The second equality follows from properties of the graph in Fig. LABEL:fig:HMM: given the previous state, the current state is independent of all preceding observations. Thus we have transformed a (filter) distribution over the previous state, $p(x_{t-1} \mid y_{1:t-1})$, into a distribution over the current state, $p(x_t \mid y_{1:t-1})$ (in both cases conditioned on observations up to step $t-1$)—hence the "time update."
In the measurement update, we will add another measurement to the filter distribution; that is, we will transform a distribution over $x_t$ by adding the current observation, $y_t$, to the set of conditioning variables. In particular, using Bayes's rule on the graph in Fig. 2.5B to compute the posterior over $x_t$:
Measurement Update (Bayes-Invert):
$$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t, y_{1:t-1})\, p(x_t \mid y_{1:t-1})}{\sum_{x_t'} p(y_t \mid x_t', y_{1:t-1})\, p(x_t' \mid y_{1:t-1})} = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{\sum_{x_t'} p(y_t \mid x_t')\, p(x_t' \mid y_{1:t-1})}. \tag{2.37}$$
Here the second equality again follows from independence statements asserted by the graph in Fig. LABEL:fig:HMM: given the current state, the current observation is independent of all past observations.
Thus Eqs. 2.36 and 2.37 together transform the filter distribution at step $t-1$ into the filter distribution at step $t$. The question is how difficult the operations in these equations are to carry out. We specialize them to the HMM and the SSM below. All that remains is the base case, then, and it is indeed simple: at step $t=1$, the measurement update is initialized with $p(x_1)$, the prior distribution over the initial state.
Smoothing.
Again we break the recursion into two stages, corresponding to Fig. 2.5A and Fig. 2.5C, although here we reverse the order of "Bayes inversion" and marginalization. This time we assume that we start with the smoother distribution at step $t$ and show how to get the smoother distribution at step $t-1$, i.e. a backwards recursion.
In the first stage, we return to the graphical model fragment used in the time update, Fig. 2.5A, only this time we compute a posterior distribution rather than merely marginalizing. Of course, this "inversion" requires computing the marginal along the way, so we may be able to reuse here some of the computations from the time update. We revisit this point below. Faute de mieux, we call this stage "future conditioning":
Future Conditioning (Bayes-Invert):
$$p(x_{t-1} \mid x_t, y_{1:T}) = p(x_{t-1} \mid x_t, y_{1:t-1}) = \frac{p(x_t \mid x_{t-1}, y_{1:t-1})\, p(x_{t-1} \mid y_{1:t-1})}{p(x_t \mid y_{1:t-1})} = \frac{p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})}{p(x_t \mid y_{1:t-1})}. \tag{2.38}$$
The final equality follows, as in the time update, from the fact that $x_t$ is independent of all past observations, conditioned on $x_{t-1}$. The first equality follows from the mirror-image independence statement: conditioned on the next state, the current state is independent of all future observations.
The second and final stage, which we call the "backward step," marginalizes $x_t$ out of the graph fragment in Fig. 2.5C:
Backward Step (Marginalize):
\begin{align*}
p(x_{t-1}, x_t \mid y_{1:T}) &= p(x_{t-1} \mid x_t, y_{1:T})\, p(x_t \mid y_{1:T}),\\
p(x_{t-1} \mid y_{1:T}) &= \sum_{x_t} p(x_{t-1}, x_t \mid y_{1:T}), \tag{2.39}
\end{align*}
yielding the smoother distributions at the previous step ($t-1$). Notice that the distribution over pairs of states is simply the joint distribution in the penultimate line, prior to marginalization.
One interesting fact about the smoother is that it does not invoke the emission probabilities $p(y_t \mid x_t)$; that is, it does not need to (re-)access the data! Instead, all relevant information about the observations is contained in the filter distribution.
So much for the recursion; what about the initialization? The answer is trivial: at step $T$, the smoother distribution is identical to the filter distribution, $p(x_T \mid y_{1:T})$.
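To fix ideas, here is a minimal sketch of these recursions in Python/NumPy for a discrete state space (the setting of the HMM below). The function and variable names (`filter_smoother`, `trans`, `liks`, and so on) are our own, not the text's; the transition and emission probabilities are assumed given as arrays.

```python
import numpy as np

def filter_smoother(prior, trans, liks):
    """Filtering (Eqs. 2.36-2.37) and smoothing (Eqs. 2.38-2.39) for discrete states.

    prior: (K,), p(x_1); trans: (K, K), trans[i, j] = p(x_t = i | x_{t-1} = j);
    liks: (T, K), liks[t, i] = p(y_t | x_t = i), the emissions evaluated at the data.
    """
    T, K = liks.shape
    filt = np.zeros((T, K))      # p(x_t | y_{1:t})
    pred = np.zeros((T, K))      # p(x_t | y_{1:t-1})
    norms = np.zeros(T)          # p(y_t | y_{1:t-1})
    pred[0] = prior              # base case: the prior initializes the measurement update
    for t in range(T):
        if t > 0:
            pred[t] = trans @ filt[t - 1]            # time update (marginalize), Eq. 2.36
        joint = liks[t] * pred[t]                    # numerator of Bayes's rule
        norms[t] = joint.sum()
        filt[t] = joint / norms[t]                   # measurement update (invert), Eq. 2.37

    smooth = np.zeros((T, K))                        # p(x_t | y_{1:T})
    pairs = np.zeros((T - 1, K, K))                  # p(x_t, x_{t+1} | y_{1:T})
    smooth[-1] = filt[-1]                            # initialization: smoother = filter at T
    for t in range(T - 2, -1, -1):
        fc = trans * filt[t]                         # rows: x_{t+1}; columns: x_t
        fc /= fc.sum(axis=1, keepdims=True)          # future conditioning (invert), Eq. 2.38
        pairs[t] = (fc * smooth[t + 1][:, None]).T   # pair smoother, Eq. 2.39
        smooth[t] = pairs[t].sum(axis=1)             # backward step (marginalize), Eq. 2.39
    return filt, smooth, pairs, norms
```

Note that the backward pass touches `trans` and `filt` but never `liks`: as claimed above, the smoother never re-accesses the data.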
Marginalizing out the states.
The marginal distribution of the observations, $p(y_{1:T})$, can also be computed with a recursion. By the chain rule of probability,

$$p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}). \tag{2.40}$$
And in fact, we are already required to calculate the factors on the right-hand side for the measurement updates, Eq. 2.37: We saw that each measurement update is an "inversion" of Fig. 2.5B, which requires along the way computing the marginal distribution over $y_t$ in this graph fragment. It turns out that these marginals are precisely the factors in Eq. 2.40:

$$p(y_t \mid y_{1:t-1}) = \sum_{x_t} p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1}).$$
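In the sketch above, these factors are exactly the normalizers collected in `norms`, so the (log) marginal likelihood of the data falls out of the filter for free:

```python
# log p(y_1, ..., y_T), accumulated from the measurement-update normalizers (Eq. 2.40)
log_marginal = np.log(norms).sum()
```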
Inference over sequences.
This observation suggests an interesting symmetry. Marginalizing and Bayes inverting in all three of the graphical-model fragments in Fig. 2.5 yields six total operations, five of which we have now made use of: The time update and future conditioning correspond, respectively, to marginalization and inversion in Fig. 2.5A. The recursion for the data marginal and the measurement update correspond respectively, as we have just seen, to marginalization and inversion in Fig. 2.5B. On the remaining graph fragment, Fig. 2.5C, the backward step corresponds to marginalization. That leaves inversion on this fragment—what does it compute?
Now consider the following recursion for computing the posterior over an entire sequence, starting again from the chain rule of probability and then applying conditional-independence statements asserted by the graph:

$$p(x_{1:T} \mid y_{1:T}) = p(x_1 \mid y_{1:T}) \prod_{t=2}^{T} p(x_t \mid x_{1:t-1}, y_{1:T}) = p(x_1 \mid y_{1:T}) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, y_{1:T}).$$

The factors $p(x_t \mid x_{t-1}, y_{1:T})$ are precisely the "Bayes inversions" of the fragment in Fig. 2.5C.
So we see that each of the six combinations of marginalization and Bayes inversion on each of the three unique graphical-model fragments of Fig. 2.5 is a useful computation in its own right.
2.2.1 The hidden Markov model and the $\alpha$-$\gamma$ algorithm
We have not as yet specified the conditional distributions parameterizing the graph in Fig. LABEL:fig:HMM or Eq. 2.35. Let us specify that the states are discrete and therefore categorically distributed:

$$p(x_1 = i) = \pi_i,\qquad p(x_t = i \mid x_{t-1} = j) = A_{ij},$$
where the vector $\pi$ and the matrix $A$ are parameters. Note that we have assumed a homogeneous Markov chain, i.e., $A$ is independent of $t$.
The most important consequence of this choice for the states of the “backbone” is that the filter and smoother distributions are both again categorical distributions, no matter what the form of the emission densities. This is a result of the “pseudo-conjugacy” of the categorical distribution (see Section 2.1.1 above).
Filtering and smoothing.
Now for concreteness, let us consider in turn each of the four half-updates, referring to the corresponding graphical model, as they apply to the HMM. Since the filter distribution is categorical, the joint distribution represented by Fig. 2.5A is a(nother) mixture model. Marginalizing out the class identities in this model—the time update—therefore amounts to an application of Eq. 2.10. Similarly, the joint distribution of the model in Fig. 2.5B is another mixture model. The measurement update amounts to applying Bayes's theorem to this mixture model—Eq. 2.6. In the same way, the future-conditioning step is an inversion of the mixture model in Fig. 2.5A (Eq. 2.6 again), and the backward step marginalizes the mixture model in Fig. 2.5C (Eq. 2.10).
The $\alpha$-$\gamma$ algorithm.
What if we are not actually interested in the filter distribution, that is, if all we want at the end of the day are the smoother distributions, $p(x_t \mid y_{1:T})$ and $p(x_t, x_{t+1} \mid y_{1:T})$ (at all steps $t$)? Then it can be wasteful and unnecessary to keep renormalizing in the measurement update, Eq. 2.37. We shall see below that it is not wasteful for Gaussian random variables, for which the posterior cumulants can be calculated without any explicit calculation of the normalizer. But for discrete random variables, we are nearly doubling the number of operations performed. On the other hand, normalization has agreeable numerical consequences, preventing the numerical underflow that would result from repeated multiplication of numbers less than one. Therefore, we present the following mostly to connect with the historical literature.
Consider, then, calculating only the numerator in Eq. 2.37:

\begin{align*}
p(x_t \mid y_{1:t}) &\propto p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})\\
&= p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\\
\implies\quad p(x_t, y_{1:t}) &= p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\, p(x_{t-1}, y_{1:t-1}),
\end{align*}
where on the second line we have substituted in the time update, Eq. 2.36, and on the third multiplied through by $p(y_{1:t-1})$. This can be written more compactly still as
$\alpha$-Recursion:
$$\alpha_t(x_t) = p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\, \alpha_{t-1}(x_{t-1}), \tag{2.41}$$
with the aid of the definition
$$\alpha_t(x_t) \doteq p(x_t, y_{1:t}).$$
This is the well-known forward recursion.
It remains to show that the smoother algorithm can be adjusted to work with $\alpha_t$ rather than the filter distribution. This is trivial: simply multiplying the numerator and denominator in the final line of the "future conditioning" stage, Eq. 2.38, by $p(y_{1:t-1})$ converts the filter densities into $\alpha$'s. That is to say, the smoother recursion does not need to be altered; it works the same if the filter densities are replaced by $\alpha$'s. To make this explicit and invoke the classical terminology, we assign the Greek letter $\gamma$ to the smoother distribution,

$$\gamma_t(x_t) \doteq p(x_t \mid y_{1:T}),$$
and combine the “future conditioning” and “backward step” into a single recursion. We break it into two pieces to show where the smoother distribution over pairs of states is calculated:
$\gamma$-Recursion:
\begin{align*}
p(x_{t-1}, x_t \mid y_{1:T}) &= \frac{p(x_t \mid x_{t-1})\, \alpha_{t-1}(x_{t-1})}{\sum_{x_{t-1}'} p(x_t \mid x_{t-1}')\, \alpha_{t-1}(x_{t-1}')}\, \gamma_t(x_t)\\
\gamma_{t-1}(x_{t-1}) \,=\, \sum_{x_t} p(x_{t-1}, x_t \mid y_{1:T}) &= \alpha_{t-1}(x_{t-1}) \sum_{x_t} \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1})}{\alpha_t(x_t)}\, \gamma_t(x_t).
\end{align*}
On the second line we have simply substituted in the formula for the alpha recursion, Eq. 2.41. This avoids recalculating the normalizer, as noted above—but at the price of reintroducing a direct dependency on the observations through $p(y_t \mid x_t)$.
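In the same NumPy idiom as before (and with the same caveat that the names are ours), the two recursions can be written as follows. Note that without per-step normalization the $\alpha$'s underflow for all but short sequences, which is why practical implementations normalize or work with logarithms, as remarked above.

```python
import numpy as np

def alpha_gamma(prior, trans, liks):
    """Unnormalized forward (alpha) recursion, Eq. 2.41, and the gamma smoother."""
    T, K = liks.shape
    alpha = np.zeros((T, K))                 # alpha_t(x_t) = p(x_t, y_{1:t})
    alpha[0] = liks[0] * prior
    for t in range(1, T):
        alpha[t] = liks[t] * (trans @ alpha[t - 1])   # alpha recursion, Eq. 2.41

    gamma = np.zeros((T, K))                 # gamma_t(x_t) = p(x_t | y_{1:T})
    gamma[-1] = alpha[-1] / alpha[-1].sum()  # one normalization, at the final step
    for t in range(T - 2, -1, -1):
        fc = trans * alpha[t]                # future conditioning with alphas, Eq. 2.38
        fc /= fc.sum(axis=1, keepdims=True)  # the p(y_{1:t-1}) factors cancel here
        gamma[t] = fc.T @ gamma[t + 1]       # backward step
    return alpha, gamma
```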
Computational complexity of HMM filtering and smoothing.
For a chain with $K$ states, each time update (Eq. 2.36) is a matrix-vector multiplication costing $\mathcal{O}(K^2)$ operations, and each measurement update (Eq. 2.37) costs $\mathcal{O}(K)$; the backward recursions likewise cost $\mathcal{O}(K^2)$ per step. Filtering and smoothing a sequence of length $T$ therefore require $\mathcal{O}(TK^2)$ operations in total.
2.2.2 State-space models, the Kalman filter, and the RTS smoother
If the graphical model of the previous section is parameterized with a jointly Gaussian distribution over all the variables, a different, but equally tractable, and equally popular, model results: the state-space model. More specifically, the model can be interpreted as a discrete-time, linear, time-invariant (LTI) system, with additive Gaussian noise on the state transitions and observations:

\begin{align*}
p(x_1) &= \mathcal{N}(\mu_0,\ \Sigma_0),\\
p(x_t \mid x_{t-1}) &= \mathcal{N}(A x_{t-1} + B u_t,\ Q),\\
p(y_t \mid x_t) &= \mathcal{N}(C x_t,\ R).
\end{align*}
To connect with the historical literature, we have allowed the time evolution of the latent state to depend on some inputs or controls, $u_t$. But notice that the controls are not random variables, since they are assumed to be observed (in theory, we issue them). We could therefore treat $B u_t$ as a single vector of (time-varying) "parameters." In this derivation, however, we will be explicit and maintain separate identities for these quantities throughout.
As in the hidden Markov model, inference in the state-space model requires a forward sweep, followed by a backward sweep, through ordered samples (space or time). Since we have presented the $\alpha$-$\gamma$ version of the forward-backward algorithm for the HMM—i.e., forward filtering followed by backward smoothing—the SSM algorithms are identical at this level of description: Eqs. 2.36, 2.37, 2.38, and 2.39.
Differences arise only when we make precise how these steps are to be implemented. That is, in the case of the SSM, applying these four equations amounts to a series of marginalizations and "Bayes inversions" with jointly Gaussian random variables. Fortunately, as we have seen in Section 2.1.2, marginalization and Bayes inversion for jointly Gaussian random variables yield elegant closed-form expressions. Moreover, since the set of jointly Gaussian random variables is closed under these operations, all random variables of interest remain Gaussian throughout. Thus keeping track of filter and smoother distributions amounts (merely) to keeping track of mean vectors and covariance matrices; and inference amounts to recursive application of certain matrix-vector operations: for marginal cumulants, Eq. 2.26; for posterior cumulants, Eq. 2.27; and for the cross-covariance matrix, Eq. 2.19. We discuss their computational complexity below. These algorithms go by the famous names of the Kalman filter and the Rauch-Tung-Striebel smoother. (The term "Kalman smoother" is sometimes encountered in the literature, but is technically a misattribution.)
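Since everything below reduces to these two operations, it may help to have them written out once and for all. The following is a sketch (with names of our own choosing) of Eqs. 2.26 and 2.27 for the generic model $x \sim \mathcal{N}(\mu, \Sigma)$, $y \mid x \sim \mathcal{N}(Cx + c, R)$:

```python
import numpy as np

def gaussian_marginalize(mu, Sigma, C, c, R):
    """Marginal of y when x ~ N(mu, Sigma) and y | x ~ N(Cx + c, R) (cf. Eq. 2.26)."""
    m = C @ mu + c                         # marginal mean
    S = C @ Sigma @ C.T + R                # marginal covariance
    return m, S

def gaussian_invert(mu, Sigma, C, c, R, y):
    """Posterior over x given y (cf. Eq. 2.27)."""
    m, S = gaussian_marginalize(mu, Sigma, C, c, R)
    K = Sigma @ C.T @ np.linalg.inv(S)     # gain (cf. Eq. 2.15): relative precision
    post_mean = mu + K @ (y - m)           # shift by the weighted prediction error
    post_cov = Sigma - K @ C @ Sigma       # shrink the prior covariance
    return post_mean, post_cov
```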
Filtering.
Recall again that the goal is to derive the "filtering" equation, $p(x_t \mid y_{1:t})$. Here we translate the time update, Eq. 2.36, and the measurement update, Eq. 2.37, into operations on Gaussian random variables. We have lately noted that this requires keeping track only of posterior mean vectors and covariance matrices, so to facilitate that process we begin by assigning shorthand symbols to the filter cumulants,
$$\hat{x}_{t|t} \doteq \mathbb{E}\!\left[x_t \mid y_{1:t}\right],\qquad P_{t|t} \doteq \mathrm{Cov}\!\left[x_t \mid y_{1:t}\right], \tag{2.45}$$
and likewise to the posterior mean and covariance given the observations only up to the previous step:
$$\hat{x}_{t|t-1} \doteq \mathbb{E}\!\left[x_t \mid y_{1:t-1}\right],\qquad P_{t|t-1} \doteq \mathrm{Cov}\!\left[x_t \mid y_{1:t-1}\right]. \tag{2.46}$$
We hope that these are the only parameters we ever need to keep track of. To show that they are, we proceed by induction.
We begin with the induction step, which amounts to a measurement update followed by a time update. Assume that the one-step prediction distribution at step $t$ is a normal distribution over $x_t$, $p(x_t \mid y_{1:t-1}) = \mathcal{N}(\hat{x}_{t|t-1}, P_{t|t-1})$. Then, since the emission is likewise a normal distribution, whose mean is an affine function of $x_t$, we can simply apply our equations for the posterior cumulants, Eq. 2.27:
Measurement Update:
\begin{align*}
K_t &\doteq P_{t|t-1} C^{\mathrm{T}}\!\left(C P_{t|t-1} C^{\mathrm{T}} + R\right)^{-1},\\
\hat{x}_{t|t} &= \hat{x}_{t|t-1} + K_t\!\left(y_t - C \hat{x}_{t|t-1}\right),\\
P_{t|t} &= P_{t|t-1} - K_t C P_{t|t-1}. \tag{2.47}
\end{align*}
Let us interpret the cumulants. The update to our existing estimate of the current hidden state, $\hat{x}_{t|t-1}$, is a product of a precision, $K_t$, and an accuracy, $y_t - C\hat{x}_{t|t-1}$. In particular, the more accurately $\hat{x}_{t|t-1}$ predicts the current observation ($C\hat{x}_{t|t-1} \approx y_t$), the less it needs to be updated. On the other hand, we need not be troubled by an inaccurate prediction if the observation is not particularly informative, to wit, when the emission is imprecise relative to the original distribution, $\mathcal{N}(\hat{x}_{t|t-1}, P_{t|t-1})$. This relative precision is measured by the Kalman gain, $K_t$, as we saw in Eq. 2.15.
We expect the covariance, for its part, to shrink when we add new information. The more precisely the emission reflects the hidden state (the smaller $R$), the larger the fraction of the existing covariance ($K_t C P_{t|t-1}$) we expect to remove.
We turn to the time update, which moves our prediction from step $t$ to step $t+1$, without considering any new evidence. It does so by considering the state transition, in particular multiplying the filter distribution by $p(x_{t+1} \mid x_t)$ and marginalizing out $x_t$ (Eq. 2.36). The state transitions are again normally distributed about a mean that is an affine function of $x_t$, which as we have just seen is itself normally distributed under the filter distribution. Therefore we can compute the marginal distribution over $x_{t+1}$ simply by applying Eq. 2.26:
Time Update:
\begin{align*}
\hat{x}_{t+1|t} &= A \hat{x}_{t|t} + B u_{t+1},\\
P_{t+1|t} &= A P_{t|t} A^{\mathrm{T}} + Q. \tag{2.48}
\end{align*}
Thus, our best estimate of the next state ($\hat{x}_{t+1|t}$) is simply our estimate of the current state ($\hat{x}_{t|t}$) passed through the state-transition function. The uncertainty (covariance) compounds: it is our uncertainty about the previous state, as transformed by the state-transition matrix ($A P_{t|t} A^{\mathrm{T}}$), plus the state-transition uncertainty ($Q$).
We cannot quite say, however, that the uncertainty grows, even though we are not incorporating any new observations and the state transition is noisy: If the underlying dynamical system is strictly stable, i.e. all its eigenvalues are within the unit circle, then the transformed covariance $A P_{t|t} A^{\mathrm{T}}$ will be smaller than $P_{t|t}$; and it could in theory be sufficiently smaller even to account for the additional transition noise ($Q$). Such systems, however, are heading rapidly toward an equilibrium point, and are (perhaps) less common targets for tracking than marginally stable systems, with eigenvalues on the unit circle. For such systems, the uncertainty increases with the time update.
To complete the induction, we need to show that the one-step prediction distribution at step 1 is a normal distribution over $x_1$. And indeed, setting this distribution to be the prior distribution over $x_1$, $\mathcal{N}(\mu_0, \Sigma_0)$, achieves that goal. For compactness, then, we define
$$\hat{x}_{1|0} \doteq \mu_0,\qquad P_{1|0} \doteq \Sigma_0, \tag{2.49}$$
so that Eq. 2.47 gives the measurement update for all time steps. Now the time update, Eq. 2.48, and the measurement update, Eq. 2.47, completely define the filter.
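Assembled into a complete forward pass, a sketch of the filter might look as follows, reusing the two Gaussian helpers above (all function and variable names are again our own):

```python
def kalman_filter(ys, us, A, B, C, Q, R, mu0, Sigma0):
    """Kalman filter: time update (Eq. 2.48) and measurement update (Eq. 2.47)."""
    T, Ny = ys.shape
    Nx = mu0.shape[0]
    xf = np.zeros((T, Nx)); Pf = np.zeros((T, Nx, Nx))   # filter cumulants, Eq. 2.45
    xp = np.zeros((T, Nx)); Pp = np.zeros((T, Nx, Nx))   # one-step predictions, Eq. 2.46
    for t in range(T):
        if t == 0:
            xp[0], Pp[0] = mu0, Sigma0                   # initialization, Eq. 2.49
        else:                                            # time update, Eq. 2.48
            xp[t], Pp[t] = gaussian_marginalize(xf[t-1], Pf[t-1], A, B @ us[t], Q)
        # measurement update, Eq. 2.47
        xf[t], Pf[t] = gaussian_invert(xp[t], Pp[t], C, np.zeros(Ny), R, ys[t])
    return xf, Pf, xp, Pp
```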
Smoothing.
Recall the goal: to get, for all $t$, $p(x_t \mid y_{1:T})$, where we are conditioning on all the observations. Once again we shall show that this distribution is normal, and thus we need only keep track of its first two cumulants:
$$\hat{x}_{t|T} \doteq \mathbb{E}\!\left[x_t \mid y_{1:T}\right],\qquad P_{t|T} \doteq \mathrm{Cov}\!\left[x_t \mid y_{1:T}\right]. \tag{2.50}$$
In deriving our algorithm, it will also be helpful to keep track of the "future-conditioned smoother," $p(x_t \mid x_{t+1}, y_{1:T})$, so we give names to its cumulants, as well:
$$\hat{x}_{t|x_{t+1}} \doteq \mathbb{E}\!\left[x_t \mid x_{t+1}, y_{1:T}\right], \tag{2.51}$$
$$P_{t|x_{t+1}} \doteq \mathrm{Cov}\!\left[x_t \mid x_{t+1}, y_{1:T}\right]. \tag{2.52}$$
Notice that the "future-conditioned" mean is a random variable, because it depends on the random variable $x_{t+1}$. (The future-conditioned covariance, although it also seems to depend on a random variable, will turn out to be non-random.) This raises a minor question of how to represent and store this mean, an issue which did not arise for the HMM. We shall revisit this after the derivation of the smoother.
Since the Kalman filter leaves us with $p(x_T \mid y_{1:T}) = \mathcal{N}(\hat{x}_{T|T}, P_{T|T})$ at the last step, we start there and recurse backwards. This filter distribution is normal, which takes care of the base case of the induction. The induction step, then, assumes that at step $t+1$ we have $p(x_{t+1} \mid y_{1:T}) = \mathcal{N}(\hat{x}_{t+1|T}, P_{t+1|T})$, and concludes by showing that we can get $p(x_t \mid y_{1:T}) = \mathcal{N}(\hat{x}_{t|T}, P_{t|T})$ from it.
Beginning with "future conditioning," Eq. 2.38, we "Bayes invert" the filter distribution, $p(x_t \mid y_{1:t})$, and the state-transition distribution, $p(x_{t+1} \mid x_t)$. Once again, $x_t$ is normally distributed under the filter distribution (as we showed in the derivation of the Kalman filter), and $x_{t+1}$ is normally distributed about an affine function of $x_t$. So we apply Eq. 2.27 and find:
Future Conditioning:
\begin{align*}
J_t &\doteq P_{t|t} A^{\mathrm{T}}\!\left(A P_{t|t} A^{\mathrm{T}} + Q\right)^{-1} = P_{t|t} A^{\mathrm{T}} P_{t+1|t}^{-1},\\
\hat{x}_{t|x_{t+1}} &= \hat{x}_{t|t} + J_t\!\left(x_{t+1} - A\hat{x}_{t|t} - Bu_{t+1}\right) = \hat{x}_{t|t} + J_t\!\left(x_{t+1} - \hat{x}_{t+1|t}\right),\\
P_{t|x_{t+1}} &= P_{t|t} - J_t A P_{t|t} = P_{t|t} - J_t P_{t+1|t} J_t^{\mathrm{T}}. \tag{2.53}
\end{align*}
For each quantity, we have provided an alternative expression involving the cumulants of the time-updated filter distribution, Eq. 2.48. The fact that this is possible reflects the fact that future conditioning is an inversion of the same model in which the time update is a marginalization, Fig. 2.5A; and inversion requires the computation of the marginal distribution. Since running the smoother requires first running the filter, we have necessarily already computed these cumulants at this point, although we have not necessarily saved them (for its initialization, the smoother requires the filter only at the final time step). But we can save computation here at the price of those storage costs.
We conclude with the "backward step" of the smoother, Eq. 2.39, marginalizing out $x_{t+1}$. We assume that at step $t+1$, $x_{t+1}$ is normally distributed under the smoother distribution; and we have just shown that $x_t$ is normally distributed about an affine function of $x_{t+1}$. Therefore we can apply Eq. 2.26 to compute the marginal cumulants:
Backward Step:
\begin{align*}
\hat{x}_{t|T} &= \hat{x}_{t|t} + J_t\!\left(\hat{x}_{t+1|T} - \hat{x}_{t+1|t}\right),\\
P_{t|T} &= P_{t|t} + J_t\!\left(P_{t+1|T} - P_{t+1|t}\right) J_t^{\mathrm{T}}. \tag{2.54}
\end{align*}
Let us try to interpret these equations. Notice that they relate the means and the covariances of four different posterior distributions. Essentially, we are updating the smoother (stepping backwards in time) by updating a smoother-filter disparity:

$$\hat{x}_{t|T} - \hat{x}_{t|t} = J_t\!\left(\hat{x}_{t+1|T} - \hat{x}_{t+1|t}\right).$$
Both disparities measure how much an estimate changes when we do without the future observations, $y_{t+1}, \ldots, y_T$. The less these observations matter in estimating $x_{t+1}$ (i.e., the smaller the right-hand disparity), the less they matter in estimating $x_t$ (the smaller the left-hand disparity).
We emphasize that both filter estimates, $\hat{x}_{t|t}$ and $\hat{x}_{t+1|t}$, are made on the basis of observations only up to time $t$; that is, they differ only by a time update, i.e. by taking into account the state transition. Therefore, if the state transition is particularly noisy, relative to the filter's current precision, we shouldn't take $\hat{x}_{t+1|t}$ so seriously, and accordingly the right-hand disparity should not propagate into much of a left-hand disparity. The "gain," $J_t$, should be small. On the other hand, if the state transition is particularly precise, relative to the filter's current precision, then the right-hand disparity probably reflects a problem with the filter (at this point in time), and accordingly should propagate into the left-side disparity. The gain should be large.
Apparently, then, the gain should encode the precision of the state-transition distribution, relative to the filter distribution at time $t$. From its definition in Eq. 2.53 and the original definition of the gain, Eq. 2.15, we see that this is exactly what $J_t$ does. It should not be surprising, then, that this gain does not depend on the future observations—indeed, it can be computed in the forward pass (Eq. 2.53)!—even though it is used to update the smoother. The gain doesn't relate the filter and smoother cumulants; it relates their disparities at two adjacent time steps. And these disparities are related to each other only insofar as time-updating the filter doesn't much damage its precision.
Similarly, taking into account the future observations shrinks the posterior covariances of $x_t$ and $x_{t+1}$ by proportional amounts.
Finally, to characterize the posterior distribution over pairs of adjacent states completely, we also require the cross covariance. Once more, we note that the relevant marginal distributions are normal, so their joint is as well, and we can apply Eq. 2.19:

$$\mathrm{Cov}\!\left[x_t, x_{t+1} \mid y_{1:T}\right] = J_t P_{t+1|T}. \tag{2.55}$$
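In code, then, the backward pass amounts to computing the gain and propagating the disparities (a sketch continuing the conventions of the filter above):

```python
def rts_smoother(xf, Pf, xp, Pp, A):
    """RTS smoother: future conditioning (Eq. 2.53) and backward step (Eq. 2.54)."""
    T, Nx = xf.shape
    xs = xf.copy(); Ps = Pf.copy()        # initialization: smoother = filter at step T
    cross = np.zeros((T - 1, Nx, Nx))     # Cov[x_t, x_{t+1} | y_{1:T}], Eq. 2.55
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])       # gain, Eq. 2.53
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])      # mean disparity, Eq. 2.54
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T  # covariance disparity
        cross[t] = J @ Ps[t + 1]                         # cross covariance, Eq. 2.19
    return xs, Ps, cross
```

Note that `J` depends only on quantities from the forward pass, as observed above.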
Implementation and computational complexity.
Let us consider first a detail of the implementation of the smoother. We noted above that the "future-conditioned" mean, $\hat{x}_{t|x_{t+1}}$, is a random variable. How do we represent it? We could in theory store the numerical values of the parameters that effect the affine transformation of $x_{t+1}$ in Eq. 2.53 (along with, at least implicitly, the transformation itself). But in fact most of the future-conditioning step has been absorbed into the backward step, Eq. 2.54, and in practice the former amounts merely to computation of the gain, $J_t$. Breaking the backward step into two pieces was for us merely a conceptual step, a way to exhibit the parallelism of the filter and smoother.
Next, consider the case where the observation space is (much) higher-dimensional than the latent space. This occurs, for example, in neural decoders, where the firing rates of hundreds of neurons are observed, but are thought to represent only some low-dimensional dynamical system. When implementing filter and smoother algorithms for such data, we would have the matrix inversions occur in the latent space……
Sampling-based approaches
The tractability—and elegance—of the Kalman filter and RTS smoother result from the adoption of linear-Gaussian dynamics and observations. What do we do if these assumptions are inappropriate? In some cases, it may still be possible to derive closed-form update equations; but there is no reason in general why the two key operations, marginalization and especially "Bayes inversion," should have such solutions. In particular, if the prior distribution in each inversion is not conjugate to the emission likelihood, i.e., if it does not under Bayes's rule yield a posterior in the same family, then recursive update equations are unlikely to be derivable.
An alternative is to filter and smooth approximately with samples, a technique known as particle filtering and smoothing. We review here a simple version that corresponds closely to the exact inference algorithms just discussed (we turn to the broader picture at the end of this section). It can be summarized picturesquely in terms of one of the original applications of the Kalman filter, tracking airplanes via radar. Essentially, Kalman used the fact that airplanes do not jump around to reject (filter out) noise in the radar readings, computing the (exact) posterior distribution over the dynamical state of the airplane given all the observations (the smoother), or all until the present time (the filter)—assuming the linear-Gaussian assumptions hold. In the sampling-based approach, by contrast, one generates 10,000 (imaginary) airplanes, and propagates forward those that are most consistent with the observations at that time step. Then the process repeats. More precisely, one draws $N$ samples or "particles" from the prior distribution over the initial state; weights each particle by its "likelihood" under the observation at the initial time; then steps forward in time by drawing particles from the mixture model whose mixture weights are the (normalized) emission likelihoods of the particles, and whose emission is the state-transition distribution. The smoother, for its part, works backwards through time reweighting these same samples by taking into account the future observations.
This description of the procedure may sound very different from those given above for the exact inference algorithms, but in fact the particle filter differs in only one of the four half-updates from an exact inference algorithm—filtering and smoothing in the HMM. The similarity to the HMM arises from the fact that sample representations of distributions can be interpreted as categorical distributions, with one category for each sample or particle. Indeed, our filter and smoother distributions will assign a "weight" to each particle, which corresponds to the probability of that "category." Thus we can write:

$$p\!\left(x_t = x_t^{(n)} \mid y_{1:t}\right) \approx w_{t|t}^{(n)},\qquad p\!\left(x_t = x_t^{(n)} \mid y_{1:T}\right) \approx w_{t|T}^{(n)},\qquad n \in \{1, \ldots, N\}. \tag{2.56}$$
Some care must be taken with this interpretation, however, because the numerical values associated with each "side of the $N$-sided die" will change with every time step. That is, we shall not simply "grid up" the state space and then ask at each time step for the probability of each of the points of this grid. This would be an inefficient sampling of the space (unless the filter distribution really is roughly uniform throughout!). Instead, we shall draw a new set of numerical values at each step of the filter. (The smoother, on the other hand, will re-use these samples.)
The time update, then—marginalization in Fig. 2.5A—is executed by sampling from the joint distribution in the graphical-model fragment, and retaining only the samples of $x_t$. Since the source distribution is categorical, the joint is a mixture model. Therefore at time $t$, we pick one of the particles from the previous step, $x_{t-1}^{(n)}$, where the probability of picking particle $n$ is given by the corresponding "mixture weight," $w_{t-1|t-1}^{(n)}$; then we use the transition probability conditioned on this sample to draw $x_t^{(m)}$:
Time Update:
$$n \sim \mathrm{Cat}\!\left(w_{t-1|t-1}^{(1)}, \ldots, w_{t-1|t-1}^{(N)}\right),\qquad x_t^{(m)} \sim p\!\left(x_t \mid x_{t-1} = x_{t-1}^{(n)}\right). \tag{2.57}$$
In practice we do this $N$ times, so that the number of particles doesn't change from step to step, but in theory any number of samples could be chosen (perhaps based on, say, a current estimate of the variance of the distribution). Thus,

$$p\!\left(x_t = x_t^{(m)} \mid y_{1:t-1}\right) \approx \frac{1}{N},\qquad m \in \{1, \ldots, N\}. \tag{2.58}$$
Notice that this is identical in form to the filter distribution in Eq. 2.56, except that here the weights are the same for all our particles.
This is the only half-update in which we sample. In all the others, we simply follow the inference procedure for an HMM. This is possible because, as we have just seen for the filter, the posterior distributions are categorical throughout. Thus in the measurement update, we invert the mixture model in Fig. 2.5B by applying Eq. 2.6 to Eq. 2.58:
Measurement Update:
$$p\!\left(x_t = x_t^{(m)} \mid y_{1:t}\right) = \frac{p\!\left(y_t \mid x_t^{(m)}\right)(1/N)}{\sum_{m'=1}^{N} p\!\left(y_t \mid x_t^{(m')}\right)(1/N)} \doteq w_{t|t}^{(m)}, \tag{2.59}$$
where the definition of $w_{t|t}^{(m)}$ in the final equality guarantees consistency with Eq. 2.56 for all time. To keep the derivation clean, we simply assert (here and below) the support of this distribution to be the set of particles generated in the preceding time update, rather than appending a delta distribution for each particle and summing over all of them. The weights in this distribution are quite intuitive: they measure how consistent each particle is with the current observation, $y_t$ (and then normalize). It only remains to initialize the recursion, but this is simple: the first $N$ samples (at the "first time update") are simply drawn from the prior distribution over the initial state.
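A sketch of this filter in the same idiom as before (the names are ours; `sample_prior` and `sample_trans` draw from the prior and transition densities, and `lik` evaluates the emission density at the data, all assumed given):

```python
import numpy as np

def particle_filter(ys, sample_prior, sample_trans, lik, N, rng):
    """Bootstrap particle filter: time update (Eqs. 2.57-2.58), measurement update (Eq. 2.59)."""
    T = len(ys)
    particles, weights = [], []
    for t in range(T):
        if t == 0:
            x = sample_prior(N, rng)                       # "first time update": the prior
        else:
            idx = rng.choice(N, size=N, p=weights[t - 1])  # pick ancestors by filter weight
            x = sample_trans(particles[t - 1][idx], rng)   # propagate through the dynamics
        w = lik(ys[t], x)                                  # weight by the emission likelihood
        weights.append(w / w.sum())                        # normalize: Eq. 2.59
        particles.append(x)
    return particles, weights

# usage: rng = np.random.default_rng(0), with model-specific samplers and likelihood
```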
Our smoother will be identical to the HMM smoother. In the “future-conditioning” step, we invert the mixture model in Fig. 2.5A, again applying Eq. 2.6 but now to the filtering distribution in Eq. 2.56:
Future Conditioning:
$$p\!\left(x_t = x_t^{(m)} \mid x_{t+1}, y_{1:T}\right) = \frac{p\!\left(x_{t+1} \mid x_t^{(m)}\right) w_{t|t}^{(m)}}{\sum_{m'=1}^{N} p\!\left(x_{t+1} \mid x_t^{(m')}\right) w_{t|t}^{(m')}}. \tag{2.60}$$
We note that this can be interpreted as another categorical distribution, although we forbear assigning a symbol to the class probabilities. Finally, in the “backward step,” we marginalize the mixture model in Fig. 2.5C with Eq. 2.10, treating this future-conditioned distribution (Eq. 2.60) as the emission, and the previous smoother density (see Eq. 2.56) as the prior distribution:
Backward Step:
$$p\!\left(x_t = x_t^{(m)} \mid y_{1:T}\right) = \sum_{n=1}^{N} \frac{p\!\left(x_{t+1}^{(n)} \mid x_t^{(m)}\right) w_{t|t}^{(m)}}{\sum_{m'=1}^{N} p\!\left(x_{t+1}^{(n)} \mid x_t^{(m')}\right) w_{t|t}^{(m')}}\, w_{t+1|T}^{(n)} \doteq w_{t|T}^{(m)}. \tag{2.61}$$
The definition of $w_{t|T}^{(m)}$ in the final equality guarantees that the smoother distribution satisfies Eq. 2.56 for all time, as long as it is initialized properly. And indeed, it is obvious from Eq. 2.56 that the initialization is just the filter weights at the final step, $w_{T|T}^{(n)}$. Notice that this is the same set of particles used to represent the filter distribution! but, moving backward from time $T$, the weights on those particles ($w_{t|T}^{(m)}$) will (typically) depart from the weights of the filter ($w_{t|t}^{(m)}$).
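The backward reweighting reuses the stored particles. A sketch, continuing the names above, with `trans_pdf(x_next, x)` evaluating the transition density elementwise over broadcast arrays (here for scalar states, so each entry of `particles` is an array of shape `(N,)`):

```python
def particle_smoother(particles, weights, trans_pdf):
    """Reweight the filter particles backwards in time (Eqs. 2.60-2.61)."""
    T = len(particles)
    sw = [None] * T
    sw[-1] = weights[-1]                        # initialization: smoother = filter at T
    for t in range(T - 2, -1, -1):
        # P[n, m] = p(x_{t+1}^(n) | x_t^(m)) w_t^(m), normalized over m: Eq. 2.60
        P = trans_pdf(particles[t + 1][:, None], particles[t][None, :]) * weights[t]
        P /= P.sum(axis=1, keepdims=True)
        sw[t] = P.T @ sw[t + 1]                 # backward step, Eq. 2.61
    return sw
```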
A more general approach to particle filtering [Doucet2012]….