7.4 Linear-Gaussian state-space models

Exactly analogously to the extension of GMMs to HMMs, we extend factor analysis through time (or space) into a dynamical system. Now the state is a vector of continuous values, assumed to be normally distributed about a linear function of its predecessor. We derived the inference algorithm for this model in Section 2.2.2, which we will need for the E step. Still, as in the previous cases, we forbear to specify the averaging distribution, so that the M step can apply equally well to a fully observed model. The cross entropy for the linear-Gaussian state-space model is then written

\[
\begin{split}
\text{H}_{(p\check{p})\hat{p}}\left[\bm{\check{X}},\bm{Y};\bm{\theta}\right]
&\approx \left\langle -\log \hat{p}\left(\bm{\check{X}},\bm{Y};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\log \prod_{t=1}^{T} \hat{p}\left(\bm{\check{X}}_{t} \,\middle|\, \bm{\check{X}}_{t-1};\bm{\theta}\right)\, \hat{p}\left(\bm{Y}_{t} \,\middle|\, \bm{\check{X}}_{t};\bm{\theta}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle -\sum_{t=2}^{T} \log\mathcal{N}\!\left(\mathbf{A}\bm{\check{X}}_{t-1},\;\mathbf{\Sigma}_{\hat{x}}\right)
 - \sum_{t=1}^{T} \log\mathcal{N}\!\left(\mathbf{C}\bm{\check{X}}_{t},\;\mathbf{\Sigma}_{\hat{y}|\hat{x}}\right)
 - \log\mathcal{N}\!\left(\bm{\mu}_{1},\;\mathbf{\Sigma}_{1}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}.
\end{split}
\]
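To make the generative model concrete, here is a minimal sampling sketch in Python/NumPy. The function name and calling convention are our own invention for illustration; the arguments mirror the symbols above ($\mathbf{A}$, $\mathbf{C}$, $\mathbf{\Sigma}_{\hat{x}}$, $\mathbf{\Sigma}_{\hat{y}|\hat{x}}$, $\bm{\mu}_1$, $\mathbf{\Sigma}_1$).

\begin{verbatim}
import numpy as np

def sample_lgssm(A, C, Sigma_x, Sigma_y, mu1, Sigma1, T, rng=None):
    """Draw one trajectory from the linear-Gaussian state-space model:
       x_1 ~ N(mu1, Sigma1),  x_t = A x_{t-1} + w_t,  w_t ~ N(0, Sigma_x),
       y_t = C x_t + v_t,     v_t ~ N(0, Sigma_y)."""
    rng = np.random.default_rng() if rng is None else rng
    K, N = A.shape[0], C.shape[0]        # state and observation dimensions
    X = np.empty((T, K))
    Y = np.empty((T, N))
    X[0] = rng.multivariate_normal(mu1, Sigma1)
    for t in range(1, T):
        X[t] = rng.multivariate_normal(A @ X[t-1], Sigma_x)
    for t in range(T):
        Y[t] = rng.multivariate_normal(C @ X[t], Sigma_y)
    return X, Y
\end{verbatim}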

The M step.

All three summands are Gaussians, so we consider in detail the differentiation of only one, the first. Optimizing first with respect to $\mathbf{A}$, we find again the familiar normal equations:

\[
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{A}}
&= \left\langle -\sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{A}} \log\mathcal{N}\!\left(\mathbf{A}\bm{\check{X}}_{t-1},\;\mathbf{\Sigma}_{\hat{x}}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle \sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{A}} \frac{1}{2}\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \mathbf{\Sigma}_{\hat{x}}^{-1} \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right) \right\rangle_{\bm{\check{X}},\bm{Y}}\\
&= \left\langle \sum_{t=2}^{T} \mathbf{\Sigma}_{\hat{x}}^{-1} \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right) \bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}\\
\implies \mathbf{A}
&= \left(\sum_{t=2}^{T} \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}\right) \left(\sum_{t=2}^{T} \left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}\right)^{-1}\\
&= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}^{-1},
\end{split}
\]

where in the last line we interpret the average to be under samples from within, as well as across, sequences, as the sketch below makes explicit. margin: implementation note: Since each sequence contributes only $T-1$ samples, one must remember to subtract $N_{\text{sequences}}$ from the total number of available samples before normalizing.
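When the states are actually observed, these averages are ordinary sample averages, and the accounting in the margin note can be made explicit in code. A minimal sketch (the interface, a list of per-sequence state arrays, is hypothetical):

\begin{verbatim}
import numpy as np

def transition_sample_statistics(state_sequences):
    """Pooled sample averages <X_t X_{t-1}^T> and <X_{t-1} X_{t-1}^T>
    from fully observed state sequences (a list of (T_i, K) arrays)."""
    K = state_sequences[0].shape[1]
    S_lag = np.zeros((K, K))    # running sum of X_t X_{t-1}^T
    S_prev = np.zeros((K, K))   # running sum of X_{t-1} X_{t-1}^T
    n_pairs = 0
    for X in state_sequences:
        S_lag += X[1:].T @ X[:-1]
        S_prev += X[:-1].T @ X[:-1]
        # each length-T_i sequence contributes only T_i - 1 pairs, i.e. the
        # total count is (sum of lengths) minus the number of sequences
        n_pairs += X.shape[0] - 1
    return S_lag / n_pairs, S_prev / n_pairs
\end{verbatim}

The normal equations above then give $\mathbf{A}$ as the first of these statistics times the inverse of the second.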

Turning to the covariance matrix,

\[
\begin{split}
0 \stackrel{\text{set}}{=} \frac{\mathrm{d}\text{H}}{\mathrm{d}\mathbf{\Sigma}_{\hat{x}}^{-1}}
&= \left\langle -\sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{\Sigma}_{\hat{x}}^{-1}} \log\mathcal{N}\!\left(\mathbf{A}\bm{\check{X}}_{t-1},\;\mathbf{\Sigma}_{\hat{x}}\right) \right\rangle_{\bm{\check{X}}}\\
&= \left\langle -\sum_{t=2}^{T} \frac{\mathrm{d}}{\mathrm{d}\mathbf{\Sigma}_{\hat{x}}^{-1}} \left[\frac{1}{2}\log\left\lvert\mathbf{\Sigma}_{\hat{x}}^{-1}\right\rvert - \frac{1}{2}\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \mathbf{\Sigma}_{\hat{x}}^{-1} \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\right] \right\rangle_{\bm{\check{X}}}\\
&= \left\langle -\sum_{t=2}^{T} \left[\mathbf{\Sigma}_{\hat{x}} - \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}}\right] \right\rangle_{\bm{\check{X}}}\\
\implies \mathbf{\Sigma}_{\hat{x}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \right\rangle_{\bm{\check{X}}}\\
&= \left\langle \left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)\left(\bm{\check{X}}_{t} - \mathbf{A}\bm{\check{X}}_{t-1}\right)^{\text{T}} \right\rangle_{\bm{\check{X}}}\\
&= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}} - \mathbf{A}\left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}},
\end{split}
\]

where again, in the penultimate line, we changed the interpretation of the average to be within as well as across sequences. The final simplification follows from the equation for $\mathbf{A}$, exactly as in factor analysis (see above).

Precisely analogous derivations yield formulae for the other cumulants. Here are all of the equations:

\begin{align}
\mathbf{A} &= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}^{-1}
& \mathbf{\Sigma}_{\hat{x}} &= \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}} - \mathbf{A}\left\langle \bm{\check{X}}_{t-1}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}} \tag{7.4}\\
\mathbf{C} &= \left\langle \bm{Y}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}}^{-1}
& \mathbf{\Sigma}_{\hat{y}|\hat{x}} &= \left\langle \bm{Y}_{t}\bm{Y}_{t}^{\text{T}} \right\rangle_{\bm{Y}} - \mathbf{C}\left\langle \bm{\check{X}}_{t}\bm{Y}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} \notag\\
\bm{\mu}_{1} &= \left\langle \bm{\check{X}}_{1} \right\rangle_{\bm{\check{X}},\bm{Y}}
& \mathbf{\Sigma}_{1} &= \left\langle \bm{\check{X}}_{1}\bm{\check{X}}_{1}^{\text{T}} \right\rangle_{\bm{\check{X}},\bm{Y}} - \bm{\mu}_{1}\bm{\mu}_{1}^{\text{T}}. \notag
\end{align}
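Eq. 7.4 translates directly into a few lines of linear algebra. In the following sketch the statistics are assumed to have been collected already (by the E step below, or as sample averages when the states are observed); the dictionary keys are invented names for the bracketed quantities.

\begin{verbatim}
import numpy as np

def m_step(stats):
    """M step for the linear-Gaussian state-space model (Eq. 7.4).
    stats holds the bracketed averages:
      'xx_lag'  = <X_t X_{t-1}^T>,   'xx_prev' = <X_{t-1} X_{t-1}^T>,
      'xx_curr' = <X_t X_t^T> (t = 2,...,T),
      'xx_all'  = <X_t X_t^T> (t = 1,...,T),
      'yx'      = <Y_t X_t^T>,       'yy'      = <Y_t Y_t^T>,
      'x1'      = <X_1>,             'x1x1'    = <X_1 X_1^T>."""
    # note <X_{t-1} X_t^T> = <X_t X_{t-1}^T>^T and <X_t Y_t^T> = <Y_t X_t^T>^T
    A = np.linalg.solve(stats['xx_prev'].T, stats['xx_lag'].T).T
    Sigma_x = stats['xx_curr'] - A @ stats['xx_lag'].T
    C = np.linalg.solve(stats['xx_all'].T, stats['yx'].T).T
    Sigma_y = stats['yy'] - C @ stats['yx'].T
    mu1 = stats['x1']
    Sigma1 = stats['x1x1'] - np.outer(mu1, mu1)
    return A, Sigma_x, C, Sigma_y, mu1, Sigma1
\end{verbatim}

Alternating this function with an E step that supplies the statistics (see below) implements EM for the model.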

The E step.

In Eq. 7.4, as in the preceding examples, the bracketed quantities (including the brackets) are the sufficient statistics. They are all averages of either vectors or outer products of vectors, reflecting the quadratic structure inherent in normal distributions. When the states are observed, all of these quantities are computed as sample averages under the data distribution. In the context of the EM algorithm, where the states are unobserved, the relevant averaging distributions $\check{p}$ are the RTS smoothing distributions, multiplied by the data distribution:

\[
\begin{split}
\check{p}\left(\bm{\check{x}}_{t}, \bm{y}_{1},\ldots,\bm{y}_{T}\right)
&\rightarrow \hat{p}\left(\bm{\hat{x}}_{t} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T};\theta^{\text{old}}\right) p\left(\bm{y}\right)\\
\check{p}\left(\bm{\check{x}}_{t}, \bm{\check{x}}_{t-1}, \bm{y}_{1},\ldots,\bm{y}_{T}\right)
&\rightarrow \hat{p}\left(\bm{\hat{x}}_{t}, \bm{\hat{x}}_{t-1} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T};\theta^{\text{old}}\right) p\left(\bm{y}\right).
\end{split}
\]

Once again, computing the sufficient statistics belongs to the E step. Having the smoothing distribution in hand means having a mean vector and a covariance matrix at every time step, since the distribution is normal; to compute the expected outer products, we have to combine these together. So in this case, the E step requires first running the RTS smoother (more precisely, the Kalman filter forward and then the RTS smoother backward), and only then assembling the sufficient statistics from the resulting posterior cumulants.
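For concreteness, here is a bare-bones sketch of the two passes under the parameterization above: no numerical safeguards, a single sequence, and an interface of our own devising rather than that of any particular library. The backward pass also returns the lag-one smoothed cross-covariances, $\text{Cov}[\bm{\check{X}}_{t+1}, \bm{\check{X}}_{t} | \bm{Y}]$, which the statistics below require.

\begin{verbatim}
import numpy as np

def kalman_filter(Y, A, C, Sigma_x, Sigma_y, mu1, Sigma1):
    """Forward pass: filtered means/covariances and one-step predictions."""
    T, K = Y.shape[0], A.shape[0]
    m_pred = np.empty((T, K)); P_pred = np.empty((T, K, K))
    m_filt = np.empty((T, K)); P_filt = np.empty((T, K, K))
    for t in range(T):
        if t == 0:
            m_pred[t], P_pred[t] = mu1, Sigma1
        else:
            m_pred[t] = A @ m_filt[t-1]
            P_pred[t] = A @ P_filt[t-1] @ A.T + Sigma_x
        S = C @ P_pred[t] @ C.T + Sigma_y                 # innovation covariance
        G = np.linalg.solve(S.T, (P_pred[t] @ C.T).T).T   # Kalman gain
        m_filt[t] = m_pred[t] + G @ (Y[t] - C @ m_pred[t])
        P_filt[t] = P_pred[t] - G @ C @ P_pred[t]
    return m_filt, P_filt, m_pred, P_pred

def rts_smoother(A, m_filt, P_filt, m_pred, P_pred):
    """Backward pass: smoothed means/covariances plus lag-one cross-covariances."""
    T = m_filt.shape[0]
    m_s, P_s = m_filt.copy(), P_filt.copy()
    P_lag = np.empty((T - 1,) + P_filt.shape[1:])
    for t in range(T - 2, -1, -1):
        J = np.linalg.solve(P_pred[t+1].T, (P_filt[t] @ A.T).T).T   # smoother gain
        m_s[t] = m_filt[t] + J @ (m_s[t+1] - m_pred[t+1])
        P_s[t] = P_filt[t] + J @ (P_s[t+1] - P_pred[t+1]) @ J.T
        P_lag[t] = P_s[t+1] @ J.T        # Cov[x[t+1], x[t] | Y] (0-based indexing)
    return m_s, P_s, P_lag
\end{verbatim}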

For $\mathbf{A}$ and $\mathbf{\Sigma}_{\hat{x}}$ we need

\[
\begin{split}
\left\langle \mathbb{E}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1}\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \text{Cov}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{t-1}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}},\\
\left\langle \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t}\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \text{Cov}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t}, \bm{\check{X}}_{t-1} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}|\bm{Y}}\!\left[\bm{\check{X}}_{t-1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}},\\
\left\langle \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
&= \frac{1}{T-1}\sum_{t=2}^{T} \left\langle \text{Cov}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}.
\end{split}
\]

Notice that all the sums start at the second index and are (consequently) normalized by $T-1$. That is why the first and third statistics are not identical. If the sequences are long enough, including all $T$ terms in the first and third statistics would probably make little difference; but there is little reason to do so, since both averages can be computed at very little additional cost.
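Assuming the smoother returns, for each sequence, posterior means, covariances, and lag-one cross-covariances (as in the sketch above), these three statistics can be assembled as follows; the pooling over sequences mirrors the earlier margin note.

\begin{verbatim}
import numpy as np

def transition_suff_stats(smoothed):
    """E-step statistics for A and Sigma_x, pooled over sequences.
    smoothed: list of (m_s, P_s, P_lag) triples from the smoother sketch."""
    K = smoothed[0][0].shape[1]
    S_prev = np.zeros((K, K))   # sum of E[X_{t-1} X_{t-1}^T | Y]
    S_lag  = np.zeros((K, K))   # sum of E[X_t X_{t-1}^T | Y]
    S_curr = np.zeros((K, K))   # sum of E[X_t X_t^T | Y],  t = 2,...,T
    n_pairs = 0
    for m_s, P_s, P_lag in smoothed:
        S_prev += P_s[:-1].sum(0) + m_s[:-1].T @ m_s[:-1]
        S_lag  += P_lag.sum(0)    + m_s[1:].T  @ m_s[:-1]
        S_curr += P_s[1:].sum(0)  + m_s[1:].T  @ m_s[1:]
        n_pairs += m_s.shape[0] - 1
    return S_lag / n_pairs, S_prev / n_pairs, S_curr / n_pairs
\end{verbatim}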

For $\mathbf{C}$ and $\mathbf{\Sigma}_{\hat{y}|\hat{x}}$, in contrast, we collect statistics for all time. Thus, even though we again need a statistic we have called $\left\langle \bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \right\rangle_{\bm{\check{X}}}$, in this case the average is over all $t$:

\[
\left\langle \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
= \frac{1}{T}\sum_{t=1}^{T} \left\langle \text{Cov}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}.
\]

We also need average outer products involving the observations, but these require no posterior covariances: the cross term uses only the posterior means, and the purely observational term needs no posterior quantities at all:

\[
\frac{1}{T}\sum_{t=1}^{T} \left\langle \bm{Y}_{t}\,\mathbb{E}_{\bm{\check{X}}_{t}|\bm{Y}}\!\left[\bm{\check{X}}_{t}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}},
\qquad
\frac{1}{T}\sum_{t=1}^{T} \left\langle \bm{Y}_{t}\bm{Y}_{t}^{\text{T}} \right\rangle_{\bm{Y}}.
\]
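A matching sketch assembling the all-$t$ second moment from the previous display together with these two observation statistics (the latter two indeed use only the posterior means):

\begin{verbatim}
import numpy as np

def emission_suff_stats(smoothed, observation_sequences):
    """E-step statistics for C and Sigma_{y|x}, averaged over all t.
    smoothed: list of (m_s, P_s, P_lag) triples;
    observation_sequences: matching list of (T_i, N) arrays."""
    K = smoothed[0][0].shape[1]
    N = observation_sequences[0].shape[1]
    S_xx = np.zeros((K, K))   # sum of E[X_t X_t^T | Y], all t
    S_yx = np.zeros((N, K))   # sum of Y_t E[X_t^T | Y]
    S_yy = np.zeros((N, N))   # sum of Y_t Y_t^T
    n = 0
    for (m_s, P_s, _), Y in zip(smoothed, observation_sequences):
        S_xx += P_s.sum(0) + m_s.T @ m_s
        S_yx += Y.T @ m_s
        S_yy += Y.T @ Y
        n += Y.shape[0]
    return S_xx / n, S_yx / n, S_yy / n
\end{verbatim}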

Finally, the sufficient statistics for the initial cumulants, $\bm{\mu}_{1}$ and $\mathbf{\Sigma}_{1}$, require no sums over time at all, since they rely only on the first time step:

\[
\left\langle \mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1}\bm{\check{X}}_{1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}
= \left\langle \text{Cov}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{Y}\right] + \mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{Y}\right] \mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1}^{\text{T}} \,\middle|\, \bm{Y}\right] \right\rangle_{\bm{Y}}.
\]
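And a final small sketch for the initial-state statistics, pooling the first-step posterior cumulants over sequences:

\begin{verbatim}
import numpy as np

def initial_suff_stats(smoothed):
    """E-step statistics for mu_1 and Sigma_1: first-time-step posterior
    cumulants, averaged over sequences (list of (m_s, P_s, P_lag) triples)."""
    m1 = np.stack([m_s[0] for m_s, _, _ in smoothed])   # (N_sequences, K)
    P1 = np.stack([P_s[0] for _, P_s, _ in smoothed])   # (N_sequences, K, K)
    x1 = m1.mean(0)                                        # <X_1>
    x1x1 = (P1 + np.einsum('ni,nj->nij', m1, m1)).mean(0)  # <X_1 X_1^T>
    return x1, x1x1
\end{verbatim}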

We emphasize here again that, just as in the HMM, the posterior cumulants $\mathbb{E}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T}\right]$ and $\text{Cov}_{\bm{\check{X}}_{1}|\bm{Y}}\!\left[\bm{\check{X}}_{1} \,\middle|\, \bm{y}_{1},\ldots,\bm{y}_{T}\right]$ depend on the observations for all time: at least in theory, one must run the filter all the way to the end of the sequence, and then the smoother all the way back, before computing them. margin: implementation note: The second of these can yield a low-rank covariance matrix if only a single trajectory has been observed, and care is sometimes required to avoid inverting a singular matrix in the Kalman filter.