7.4 Linear-Gaussian state-space models

Exactly analogously to the extension of GMMs to HMMs, we extend factor analysis through time (or space) into a dynamical system. Now the state is a vector of continuous values, assumed to be normally distributed about a linear function of its predecessor. We derived the inference algorithm for this model in Section 2.2.2, which we will need for the E step. Still, as in the previous cases, we forbear to specify the averaging distribution, so that the M step can apply equally well to a fully observed model. The cross entropy for the linear-Gaussian state-space model is then written

\[
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \prod_{t=1}^{T} \hat{p}\left(\bm{\check{X}}_{t} \,\middle|\, \bm{\check{X}}_{t-1};\bm{\theta}\right)\, \hat{p}\left(\bm{Y}_{t} \,\middle|\, \bm{\check{X}}_{t};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\sum_{t=2}^{T} \log\mathcal{N}\!\left(\mathbf{A}\bm{\check{X}}_{t-1},\;\mathbf{\Sigma}_{\hat{x}}\right)
 - \sum_{t=1}^{T} \log\mathcal{N}\!\left(\mathbf{C}\bm{\check{X}}_{t},\;\mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)
 - \log\mathcal{N}\!\left(\bm{\mu}_{1},\;\mathbf{\Sigma}_{1}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}.
\end{split}
\]
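To make the generative model concrete, here is a minimal sampling sketch in Python/NumPy. The function name and calling convention are our own invention for illustration; the arguments mirror the symbols above ($\mathbf{A}$, $\mathbf{C}$, $\mathbf{\Sigma}_{\hat{x}}$, $\mathbf{\Sigma}_{\hat{y}|\hat{x}}$, $\bm{\mu}_1$, $\mathbf{\Sigma}_1$).

\begin{verbatim}
import numpy as np

def sample_lgssm(A, C, Sigma_x, Sigma_y, mu1, Sigma1, T, rng=None):
    """Draw one trajectory from the linear-Gaussian state-space model:
       x_1 ~ N(mu1, Sigma1),  x_t = A x_{t-1} + w_t,  w_t ~ N(0, Sigma_x),
       y_t = C x_t + v_t,     v_t ~ N(0, Sigma_y)."""
    rng = np.random.default_rng() if rng is None else rng
    K, N = A.shape[0], C.shape[0]        # state and observation dimensions
    X = np.empty((T, K))
    Y = np.empty((T, N))
    X[0] = rng.multivariate_normal(mu1, Sigma1)
    for t in range(1, T):
        X[t] = rng.multivariate_normal(A @ X[t-1], Sigma_x)
    for t in range(T):
        Y[t] = rng.multivariate_normal(C @ X[t], Sigma_y)
    return X, Y
\end{verbatim}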

The M step.

All three summands are Gaussians, so we consider in detail the differentiation of only one, the first. Optimizing first with respect to $\mathbf{A}$, we find again the familiar normal equations:

\[
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{A}}
&= \left\langle -\sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{A}} \log\mathcal{N}\!\left(\mathbf{A}\bm{\check{X}}_{t-1},\;\mathbf{\Sigma}_{\hat{x}}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle \sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{A}} \frac{1}{2}\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \mathbf{\Sigma}_{\hat{x}}^{-1} \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle \sum_{t=2}^{T} \mathbf{\Sigma}_{\hat{x}}^{-1} \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right) \bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
\implies \mathbf{A}
&= \left(\sum_{t=2}^{T} \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}\right) \left(\sum_{t=2}^{T} \left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}\right)^{-1}\\
&= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}^{-1},
\end{split}
\]

where in the last line we interpret the average to be under samples from within, as well as across, sequences, as the sketch below makes explicit. margin: implementation note: Since each sequence contributes only $T-1$ samples, one must remember to subtract $N_{\text{sequences}}$ from the total number of available samples before normalizing.
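When the states are actually observed, these averages are ordinary sample averages, and the accounting in the margin note can be made explicit in code. A minimal sketch (the interface, a list of per-sequence state arrays, is hypothetical):

\begin{verbatim}
import numpy as np

def transition_sample_statistics(state_sequences):
    """Pooled sample averages <X_t X_{t-1}^T> and <X_{t-1} X_{t-1}^T>
    from fully observed state sequences (a list of (T_i, K) arrays)."""
    K = state_sequences[0].shape[1]
    S_lag = np.zeros((K, K))    # running sum of X_t X_{t-1}^T
    S_prev = np.zeros((K, K))   # running sum of X_{t-1} X_{t-1}^T
    n_pairs = 0
    for X in state_sequences:
        S_lag += X[1:].T @ X[:-1]
        S_prev += X[:-1].T @ X[:-1]
        # each length-T_i sequence contributes only T_i - 1 pairs, i.e. the
        # total count is (sum of lengths) minus the number of sequences
        n_pairs += X.shape[0] - 1
    return S_lag / n_pairs, S_prev / n_pairs
\end{verbatim}

The normal equations above then give $\mathbf{A}$ as the first of these statistics times the inverse of the second.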

Turning to the covariance matrix,

\[
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{\Sigma}_{\hat{x}}^{-1}}
&= \left\langle -\sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{\Sigma}_{\hat{x}}^{-1}} \log\mathcal{N}\!\left(\mathbf{A}\bm{\check{X}}_{t-1},\;\mathbf{\Sigma}_{\hat{x}}\right) \right\rangle_{\bm{\check{X}}}\\
&= \left\langle -\sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{\Sigma}_{\hat{x}}^{-1}} \left[\frac{1}{2}\log\left\lvert\mathbf{\Sigma}_{\hat{x}}^{-1}\right\rvert - \frac{1}{2}\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \mathbf{\Sigma}_{\hat{x}}^{-1} \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\right] \right\rangle_{\bm{\check{X}}}\\
&= \left\langle -\sum_{t=2}^{T} \left[\mathbf{\Sigma}_{\hat{x}} - \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}}\right] \right\rangle_{\bm{\check{X}}}\\
\implies \mathbf{\Sigma}_{\hat{x}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \right\rangle_{\bm{\check{X}}}\\
&= \left\langle \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \right\rangle_{\bm{\check{X}}}\\
&= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}} - \mathbf{A}\left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}},
\end{split}
\]

where again, in the penultimate line, we changed the interpretation of the average to be within as well as across sequences. The final simplification follows from the equation for $\mathbf{A}$, exactly as in factor analysis (see above).

Precisely analogous derivations yield formulae for the other cumulants. Here are all of the equations:

\begin{align}
\mathbf{A} &= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}^{-1}
& \mathbf{\Sigma}_{\hat{x}} &= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}} - \mathbf{A}\left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}} \tag{7.4}\\
\mathbf{C} &= \left\langle \bm{Y}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}^{-1}
& \mathbf{\Sigma}_{\hat{y}|\hat{x}} &= \left\langle \bm{Y}_{t}\bm{Y}_{t}^{\text{T}} \right\rangle_{\bm{Y}} - \mathbf{C}\left\langle \bm{\check{X}}_{t}\bm{Y}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \notag\\
\bm{\mu}_{1} &= \left\langle \bm{\check{X}}_{1} \right\rangle_{\bm{\check{X}},\bm{Y}}
& \mathbf{\Sigma}_{1} &= \left\langle \bm{\check{X}}_{1}\bm{\check{X}}_{1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} - \bm{\mu}_{1}\bm{\mu}_{1}^{\text{T}}. \notag
\end{align}
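Eq. 7.4 translates directly into a few lines of linear algebra. In the following sketch the statistics are assumed to have been collected already (by the E step below, or as sample averages when the states are observed); the dictionary keys are invented names for the bracketed quantities.

\begin{verbatim}
import numpy as np

def m_step(stats):
    """M step for the linear-Gaussian state-space model (Eq. 7.4).
    stats holds the bracketed averages:
      'xx_lag'  = <X_t X_{t-1}^T>,   'xx_prev' = <X_{t-1} X_{t-1}^T>,
      'xx_curr' = <X_t X_t^T> (t = 2,...,T),
      'xx_all'  = <X_t X_t^T> (t = 1,...,T),
      'yx'      = <Y_t X_t^T>,       'yy'      = <Y_t Y_t^T>,
      'x1'      = <X_1>,             'x1x1'    = <X_1 X_1^T>."""
    # note <X_{t-1} X_t^T> = <X_t X_{t-1}^T>^T and <X_t Y_t^T> = <Y_t X_t^T>^T
    A = np.linalg.solve(stats['xx_prev'].T, stats['xx_lag'].T).T
    Sigma_x = stats['xx_curr'] - A @ stats['xx_lag'].T
    C = np.linalg.solve(stats['xx_all'].T, stats['yx'].T).T
    Sigma_y = stats['yy'] - C @ stats['yx'].T
    mu1 = stats['x1']
    Sigma1 = stats['x1x1'] - np.outer(mu1, mu1)
    return A, Sigma_x, C, Sigma_y, mu1, Sigma1
\end{verbatim}

Alternating this function with an E step that supplies the statistics (see below) implements EM for the model.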

The E step.

In Eq. 7.4, as in the preceding examples, the bracketed quantities (including the brackets) are the sufficient statistics. They are all averages of either vectors or outer products of vectors, reflecting the quadratic structure inherent in normal distributions. When the states are observed, all of these quantities are computed as sample averages under the data distribution. In the context of the EM algorithm, where the states are unobserved, the relevant averaging distributions $\check{p}$ are the RTS smoothing distributions, multiplied by the data distribution:

\[
\begin{split}
\check{p}\left(\bm{\check{x}}_{t}, \bm{y}_{1},\ldots,\bm{y}_{T}\right)
&\rightarrow \hat{p}\left(\bm{\hat{x}}_{t} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T};\theta^{\text{old}}\right) p\left(\bm{y}\right)\\
\check{p}\left(\bm{\check{x}}_{t}, \bm{\check{x}}_{t-1}, \bm{y}_{1},\ldots,\bm{y}_{T}\right)
&\rightarrow \hat{p}\left(\bm{\hat{x}}_{t}, \bm{\hat{x}}_{t-1} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T};\theta^{\text{old}}\right) p\left(\bm{y}\right).
\end{split}
\]

Once again, computing the sufficient statistics belongs to the E step. Having the smoothing distribution in hand means having a mean vector and a covariance matrix at every time step, since the distribution is normal; to compute the expected outer products, we have to combine these together. So in this case, the E step requires first running the RTS smoother (more precisely, the Kalman filter forward and then the RTS smoother backward), and only then assembling the sufficient statistics from the resulting posterior cumulants.
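For concreteness, here is a bare-bones sketch of the two passes under the parameterization above: no numerical safeguards, a single sequence, and an interface of our own devising rather than that of any particular library. The backward pass also returns the lag-one smoothed cross-covariances, $\text{Cov}[\bm{\check{X}}_{t+1}, \bm{\check{X}}_{t} | \bm{Y}]$, which the statistics below require.

\begin{verbatim}
import numpy as np

def kalman_filter(Y, A, C, Sigma_x, Sigma_y, mu1, Sigma1):
    """Forward pass: filtered means/covariances and one-step predictions."""
    T, K = Y.shape[0], A.shape[0]
    m_pred = np.empty((T, K)); P_pred = np.empty((T, K, K))
    m_filt = np.empty((T, K)); P_filt = np.empty((T, K, K))
    for t in range(T):
        if t == 0:
            m_pred[t], P_pred[t] = mu1, Sigma1
        else:
            m_pred[t] = A @ m_filt[t-1]
            P_pred[t] = A @ P_filt[t-1] @ A.T + Sigma_x
        S = C @ P_pred[t] @ C.T + Sigma_y                 # innovation covariance
        G = np.linalg.solve(S.T, (P_pred[t] @ C.T).T).T   # Kalman gain
        m_filt[t] = m_pred[t] + G @ (Y[t] - C @ m_pred[t])
        P_filt[t] = P_pred[t] - G @ C @ P_pred[t]
    return m_filt, P_filt, m_pred, P_pred

def rts_smoother(A, m_filt, P_filt, m_pred, P_pred):
    """Backward pass: smoothed means/covariances plus lag-one cross-covariances."""
    T = m_filt.shape[0]
    m_s, P_s = m_filt.copy(), P_filt.copy()
    P_lag = np.empty((T - 1,) + P_filt.shape[1:])
    for t in range(T - 2, -1, -1):
        J = np.linalg.solve(P_pred[t+1].T, (P_filt[t] @ A.T).T).T   # smoother gain
        m_s[t] = m_filt[t] + J @ (m_s[t+1] - m_pred[t+1])
        P_s[t] = P_filt[t] + J @ (P_s[t+1] - P_pred[t+1]) @ J.T
        P_lag[t] = P_s[t+1] @ J.T        # Cov[x[t+1], x[t] | Y] (0-based indexing)
    return m_s, P_s, P_lag
\end{verbatim}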

For $\mathbf{A}$ and $\mathbf{\Sigma}_{\hat{x}}$ we need

\[
\begin{split}
\left\langle \mathbb{E}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \text{Cov}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}},\\
\left\langle \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \text{Cov}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t}, \bm{\check{X}}_{t-1} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}},\\
\left\langle \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \text{Cov}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}.
\end{split}
\]

Notice that all the sums start at the second index and are (consequently) normalized by $T-1$. That is why the first and third statistics are not identical. If the sequences are long enough, including all $T$ terms in the first and third statistics would probably make little difference; but there is little reason to do so, since both averages can be computed at very little additional cost.
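Assuming the smoother returns, for each sequence, posterior means, covariances, and lag-one cross-covariances (as in the sketch above), these three statistics can be assembled as follows; the pooling over sequences mirrors the earlier margin note.

\begin{verbatim}
import numpy as np

def transition_suff_stats(smoothed):
    """E-step statistics for A and Sigma_x, pooled over sequences.
    smoothed: list of (m_s, P_s, P_lag) triples from the smoother sketch."""
    K = smoothed[0][0].shape[1]
    S_prev = np.zeros((K, K))   # sum of E[X_{t-1} X_{t-1}^T | Y]
    S_lag  = np.zeros((K, K))   # sum of E[X_t X_{t-1}^T | Y]
    S_curr = np.zeros((K, K))   # sum of E[X_t X_t^T | Y],  t = 2,...,T
    n_pairs = 0
    for m_s, P_s, P_lag in smoothed:
        S_prev += P_s[:-1].sum(0) + m_s[:-1].T @ m_s[:-1]
        S_lag  += P_lag.sum(0)    + m_s[1:].T  @ m_s[:-1]
        S_curr += P_s[1:].sum(0)  + m_s[1:].T  @ m_s[1:]
        n_pairs += m_s.shape[0] - 1
    return S_lag / n_pairs, S_prev / n_pairs, S_curr / n_pairs
\end{verbatim}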

For $\mathbf{C}$ and $\mathbf{\Sigma}_{\hat{y}|\hat{x}}$, in contrast, we collect statistics for all time. Thus, even though we again need a statistic we have called $\left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}}$, in this case the average is over all $t$:

\[
\left\langle \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
= \frac{1}{T}\sum_{t=1}^{T} \left\langle \text{Cov}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}.
\]

We also need average outer products involving the observations, but these require no posterior covariances: the cross term uses only the posterior means, and the purely observational term needs no posterior quantities at all:

\[
\frac{1}{T}\sum_{t=1}^{T} \left\langle \bm{Y}_{t}\,\mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}},
\qquad
\frac{1}{T}\sum_{t=1}^{T} \left\langle \bm{Y}_{t}\bm{Y}_{t}^{\text{T}} \right\rangle_{\bm{Y}}.
\]
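A matching sketch assembling the all-$t$ second moment from the previous display together with these two observation statistics (the latter two indeed use only the posterior means):

\begin{verbatim}
import numpy as np

def emission_suff_stats(smoothed, observation_sequences):
    """E-step statistics for C and Sigma_{y|x}, averaged over all t.
    smoothed: list of (m_s, P_s, P_lag) triples;
    observation_sequences: matching list of (T_i, N) arrays."""
    K = smoothed[0][0].shape[1]
    N = observation_sequences[0].shape[1]
    S_xx = np.zeros((K, K))   # sum of E[X_t X_t^T | Y], all t
    S_yx = np.zeros((N, K))   # sum of Y_t E[X_t^T | Y]
    S_yy = np.zeros((N, N))   # sum of Y_t Y_t^T
    n = 0
    for (m_s, P_s, _), Y in zip(smoothed, observation_sequences):
        S_xx += P_s.sum(0) + m_s.T @ m_s
        S_yx += Y.T @ m_s
        S_yy += Y.T @ Y
        n += Y.shape[0]
    return S_xx / n, S_yx / n, S_yy / n
\end{verbatim}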

Finally, the sufficient statistics for the initial cumulants, $\bm{\mu}_{1}$ and $\mathbf{\Sigma}_{1}$, require no sums over time at all, since they rely only on the first time step:

\[
\left\langle \mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1}\bm{\check{X}}_{1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
= \left\langle \text{Cov}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}.
\]
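And a final small sketch for the initial-state statistics, pooling the first-step posterior cumulants over sequences:

\begin{verbatim}
import numpy as np

def initial_suff_stats(smoothed):
    """E-step statistics for mu_1 and Sigma_1: first-time-step posterior
    cumulants, averaged over sequences (list of (m_s, P_s, P_lag) triples)."""
    m1 = np.stack([m_s[0] for m_s, _, _ in smoothed])   # (N_sequences, K)
    P1 = np.stack([P_s[0] for _, P_s, _ in smoothed])   # (N_sequences, K, K)
    x1 = m1.mean(0)                                        # <X_1>
    x1x1 = (P1 + np.einsum('ni,nj->nij', m1, m1)).mean(0)  # <X_1 X_1^T>
    return x1, x1x1
\end{verbatim}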

We emphasize here again that, just as in the HMM, the posterior cumulants $\mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T}\right]$ and $\text{Cov}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T}\right]$ depend on the observations for all time: at least in theory, one must run the filter all the way to the end of the sequence, and then the smoother all the way back, before computing them. margin: implementation note: The second of these can yield a low-rank covariance matrix if only a single trajectory has been observed, and care is sometimes required to avoid inverting a singular matrix in the Kalman filter.