A.1 Leave-one-out cross validation for linear regression in one step

As we have seen (Section LABEL:sec:), in vanilla linear regression, given a (fat) data matrix $\mathbf{X}$ whose columns are samples and whose rows are variables or ``features,'' we solve $\bm{y} = \mathbf{X}^{\text{T}}\bm{w}$ for the regression coefficients via the normal equations,

\begin{equation}
    \hat{\bm{w}} = \left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y}.
    \tag{A.1}
\end{equation}

Our predicted outputs are then:

\begin{equation}
    \bm{\hat{y}} = \mathbf{H}\bm{y},
    \tag{A.2}
\end{equation}

where we have defined the so-called hat matrix:

\begin{equation}
    \mathbf{H} \coloneqq \mathbf{X}^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}.
    \tag{A.3}
\end{equation}

If we define the “residual matrix”

\begin{equation}
    \mathbf{M} \coloneqq \mathbf{I} - \mathbf{H},
    \tag{A.4}
\end{equation}

then the residuals are likewise expressed succinctly:

\begin{equation}
\begin{split}
    \bm{e} &\coloneqq \bm{y} - \bm{\hat{y}} \\
           &= \bm{y} - \mathbf{H}\bm{y} \\
           &= \mathbf{M}\bm{y}.
\end{split}
\tag{A.5}
\end{equation}
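To make these definitions concrete, here is a minimal NumPy sketch (the data, dimensions, and variable names are arbitrary illustrations, not anything fixed by the text) that builds $\hat{\bm{w}}$, $\mathbf{H}$, and $\mathbf{M}$ from a random fat data matrix and checks Eqs. A.1 through A.5 against one another.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 20                             # fewer features than samples ("fat" X)
X = rng.standard_normal((D, N))          # columns are samples, rows are features
y = rng.standard_normal(N)

w_hat = np.linalg.solve(X @ X.T, X @ y)  # Eq. A.1: normal equations
H = X.T @ np.linalg.solve(X @ X.T, X)    # Eq. A.3: hat matrix
M = np.eye(N) - H                        # Eq. A.4: residual matrix

y_hat = H @ y                            # Eq. A.2: predictions
e = M @ y                                # Eq. A.5: residuals
assert np.allclose(y_hat, X.T @ w_hat)   # H y reproduces X^T w_hat
assert np.allclose(e, y - y_hat)
\end{verbatim}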

The leave-one-out residuals.

Now, in each step of the leave-one-out procedure, one column of $\mathbf{X}$ and the corresponding element of $\bm{y}$ are omitted. If we call the missing column $\bm{x}_n$, this omission evidently turns $\mathbf{X}\bm{y}$, the second factor in the normal equations (Eq. A.1), into $(\mathbf{X}\bm{y} - \bm{x}_n y_n)$. The first factor, $\mathbf{X}\mathbf{X}^{\text{T}}$, can be expressed as a sum of outer products, $\sum_n \bm{x}_n\bm{x}_n^{\text{T}}$, so losing $\bm{x}_n$ simply omits one outer product: $\mathbf{X}\mathbf{X}^{\text{T}} - \bm{x}_n\bm{x}_n^{\text{T}}$. Inverting this quantity with the Sherman-Morrison formula, Eq. B.15, and bringing it together with the second factor, gives

\begin{equation*}
    \hat{\bm{w}}_{\text{LOO},n} = \left( \left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1} + \frac{\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\bm{x}_n\bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}}{1 - \bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\bm{x}_n} \right) \left(\mathbf{X}\bm{y} - \bm{x}_n y_n\right)
\end{equation*}

for the regression coefficients at the $n^{\text{th}}$ step.

Let us write the prediction at the $n^{\text{th}}$ step for the held-out datum, namely ${\hat{y}}_{\text{LOO},n} \coloneqq \bm{x}_n^{\text{T}}\hat{\bm{w}}_{\text{LOO},n}$. It is easily seen that $\bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\bm{x}_n \eqqcolon h_{nn}$ is the $n^{\text{th}}$ diagonal element of the hat matrix, Eq. A.3, which allows us to simplify the expression for the prediction:

\begin{equation*}
\begin{split}
    {\hat{y}}_{\text{LOO},n} = \bm{x}_n^{\text{T}}\hat{\bm{w}}_{\text{LOO},n}
    &= \left( \bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1} + \frac{h_{nn}\,\bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}}{1 - h_{nn}} \right) \left(\mathbf{X}\bm{y} - \bm{x}_n y_n\right) \\
    &= \frac{1}{1 - h_{nn}}\,\bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\left(\mathbf{X}\bm{y} - \bm{x}_n y_n\right) \\
    &= \frac{1}{1 - h_{nn}}\left( \bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y} - h_{nn} y_n \right) \\
    &= \frac{1}{m_{nn}}\left( \bm{x}_n^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y} - h_{nn} y_n \right),
\end{split}
\end{equation*}
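As a sanity check on this derivation, the following sketch (again with hypothetical random data and an arbitrarily chosen index $n$) compares the one-step expression on the last line above with the prediction obtained by actually refitting the regression with column $n$ deleted.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
D, N, n = 3, 20, 5                        # hold out sample n = 5
X = rng.standard_normal((D, N))
y = rng.standard_normal(N)

G_inv = np.linalg.inv(X @ X.T)            # (X X^T)^{-1}
x_n, y_n = X[:, n], y[n]
h_nn = x_n @ G_inv @ x_n                  # n-th diagonal entry of the hat matrix

# One-step prediction from the last line of the derivation above
y_loo_n = (x_n @ G_inv @ (X @ y) - h_nn * y_n) / (1.0 - h_nn)

# Naive check: refit with column n deleted, then predict the held-out datum
X_rest, y_rest = np.delete(X, n, axis=1), np.delete(y, n)
w_rest = np.linalg.solve(X_rest @ X_rest.T, X_rest @ y_rest)
assert np.allclose(y_loo_n, x_n @ w_rest)
\end{verbatim}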

where on the final line we use the fact, from Eq. A.4, that the diagonal entries of the residual matrix are $m_{nn} = 1 - h_{nn}$. Collecting together the predictions for the held-out data at all the steps, we have

\begin{equation*}
\begin{split}
    \begin{pmatrix} {\hat{y}}_{\text{LOO},1} \\ {\hat{y}}_{\text{LOO},2} \\ \vdots \\ {\hat{y}}_{\text{LOO},N} \end{pmatrix}
    &= \begin{pmatrix}
        \frac{1}{m_{11}}\left( \bm{x}_1^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y} - h_{11} y_1 \right) \\
        \frac{1}{m_{22}}\left( \bm{x}_2^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y} - h_{22} y_2 \right) \\
        \vdots \\
        \frac{1}{m_{NN}}\left( \bm{x}_N^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y} - h_{NN} y_N \right)
    \end{pmatrix} \\
    &= \begin{pmatrix}
        \frac{1}{m_{11}} & 0 & \cdots & 0 \\
        0 & \frac{1}{m_{22}} & \cdots & 0 \\
        \vdots & \vdots & \ddots & \vdots \\
        0 & 0 & \cdots & \frac{1}{m_{NN}}
    \end{pmatrix}
    \left( \mathbf{X}^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}\bm{y}
    - \begin{pmatrix}
        h_{11} & 0 & \cdots & 0 \\
        0 & h_{22} & \cdots & 0 \\
        \vdots & \vdots & \ddots & \vdots \\
        0 & 0 & \cdots & h_{NN}
    \end{pmatrix} \bm{y} \right).
\end{split}
\end{equation*}

We can clean up this equation by giving names to the diagonal matrices constructed from the diagonals of $\mathbf{M}$ and $\mathbf{H}$; that is,

\begin{equation*}
    \mathbf{M}_{\text{d}} \coloneqq \begin{pmatrix}
        m_{11} & 0 & \cdots & 0 \\
        0 & m_{22} & \cdots & 0 \\
        \vdots & \vdots & \ddots & \vdots \\
        0 & 0 & \cdots & m_{NN}
    \end{pmatrix}, \qquad
    \mathbf{H}_{\text{d}} \coloneqq \begin{pmatrix}
        h_{11} & 0 & \cdots & 0 \\
        0 & h_{22} & \cdots & 0 \\
        \vdots & \vdots & \ddots & \vdots \\
        0 & 0 & \cdots & h_{NN}
    \end{pmatrix},
\end{equation*}

in which case

\begin{equation}
    \bm{\hat{y}}_{\text{LOO}} = \mathbf{M}_{\text{d}}^{-1}\left( \mathbf{H}\bm{y} - \mathbf{H}_{\text{d}}\bm{y} \right) = \mathbf{M}_{\text{d}}^{-1}\left( \bm{\hat{y}} - \mathbf{H}_{\text{d}}\bm{y} \right).
    \tag{A.6}
\end{equation}

From the leave-one-out predictions it is simple to calculate the leave-one-out residuals:

\begin{equation}
\begin{split}
    \bm{e}_{\text{LOO}} &= \bm{y} - \bm{\hat{y}}_{\text{LOO}} \\
    &= \bm{y} - \mathbf{M}_{\text{d}}^{-1}\left( \bm{\hat{y}} - \mathbf{H}_{\text{d}}\bm{y} \right) \\
    &= \bm{y} - \mathbf{M}_{\text{d}}^{-1}\left( \bm{\hat{y}} - \left(\mathbf{I} - \mathbf{M}_{\text{d}}\right)\bm{y} \right) \\
    &= \mathbf{M}_{\text{d}}^{-1}\left( \bm{y} - \bm{\hat{y}} \right) \\
    &= \mathbf{M}_{\text{d}}^{-1}\bm{e}.
\end{split}
\tag{A.7}
\end{equation}
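The payoff of Eq. A.7 is that the entire vector of leave-one-out residuals can be computed from a single fit. The sketch below (hypothetical random data; NumPy assumed) does exactly that, dividing the ordinary residuals elementwise by $1 - h_{nn}$, and verifies the result against the naive $N$-step loop.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 20
X = rng.standard_normal((D, N))
y = rng.standard_normal(N)

H = X.T @ np.linalg.solve(X @ X.T, X)    # hat matrix, Eq. A.3
e = y - H @ y                            # ordinary residuals, Eq. A.5
e_loo = e / (1.0 - np.diag(H))           # Eq. A.7: M_d^{-1} e, elementwise

# Naive leave-one-out loop for comparison
e_naive = np.empty(N)
for n in range(N):
    X_n, y_n = np.delete(X, n, axis=1), np.delete(y, n)
    w_n = np.linalg.solve(X_n @ X_n.T, X_n @ y_n)
    e_naive[n] = y[n] - X[:, n] @ w_n
assert np.allclose(e_loo, e_naive)
\end{verbatim}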

Interpretation of the leave-one-out residuals.

Eqs. A.6 and A.7 are pleasingly elegant, and yield intuitive interpretations. To begin with, note that in a standard linear regression, the element $h_{nn}$ of the hat matrix $\mathbf{H}$ tells us how much weight the observation $y_n$ gets in predicting itself under the normal equations.\footnote{Put this way, the claim may sound strange: shouldn't $y_n$ get a weight of 1, and all other samples be weighted with zeros? Remember: the model requires that the vector of samples be ``compressed'' into the space of the inputs, $\bm{x}$, before expanding back out into predictions. As long as there are fewer dimensions than samples (as is typically the case in regression), the hat matrix cannot be the identity matrix: $\mathbf{X}^{\text{T}}$ will be ``tall,'' in which case (see Eq. A.3) there does not exist a matrix $\mathbf{P}$ such that $\mathbf{X}^{\text{T}}\mathbf{P} = \mathbf{I}$.} Intuitively, we expect this weight to be between 0 and 1, and indeed we prove this below. Consequently, although perhaps less intuitively, the diagonal elements of the residual matrix, $m_{nn}$, are also between 0 and 1 (since $m_{nn} = 1 - h_{nn}$). They tell us what fraction of $y_n$ turns into error in the prediction of itself.

Eq. A.6 tells us how to transform the standard predictions ($\bm{\hat{y}}$) into leave-one-out predictions ($\bm{\hat{y}}_{\text{LOO}}$). First, we remove from each prediction ${\hat{y}}_n$ the contribution from $y_n$ (to which by assumption we have no access!), namely $h_{nn} y_n$. The vector of all contributions to be removed is $\mathbf{H}_{\text{d}}\bm{y}$, so the complete adjustment is $\bm{\hat{y}} - \mathbf{H}_{\text{d}}\bm{y}$. Since $0 \leq h_{nn} \leq 1$ for all $n$, this shrinks $\bm{\hat{y}}$, and therefore we will have to scale it back up to the right size. This is accomplished by the pre-multiplication with $\mathbf{M}_{\text{d}}^{-1}$, which scales the $n^{\text{th}}$ entry of $\bm{\hat{y}} - \mathbf{H}_{\text{d}}\bm{y}$ by $1/m_{nn}$. Since $0 \leq m_{nn} \leq 1$, this scaling will indeed increase the magnitude of the prediction. And the amount of scaling applied is sensible. If, for example, the original (non-cross-validated) prediction of $y_n$ relied heavily on $y_n$ itself, i.e.\ if $h_{nn}$ is close to 1, then the corresponding $m_{nn}$ would be close to zero, and the prediction would be scaled up considerably to make up for the missing $y_n$ in the inputs. At the other extreme, if the original prediction of $y_n$ relied mostly on the other samples $y_{m \neq n}$, i.e.\ if $h_{nn}$ is close to zero, then the corresponding $m_{nn}$ would be close to 1: very little scaling would be applied to ${\hat{y}}_n - h_{nn} y_n$, which is already close to ${\hat{y}}_n$.

The diagonal elements of the hat and residual matrices.

It is easily verified that $\mathbf{H}^{\text{T}}\mathbf{H} = \mathbf{H}$, and therefore

\begin{equation*}
    \bm{v}^{\text{T}}\mathbf{H}\bm{v} = \bm{v}^{\text{T}}\mathbf{H}^{\text{T}}\mathbf{H}\bm{v} = (\mathbf{H}\bm{v})^{\text{T}}(\mathbf{H}\bm{v}) \geq 0.
\end{equation*}

Letting $\bm{v}$ be a ``one-hot'' vector, so that the quadratic form $\bm{v}^{\text{T}}\mathbf{H}\bm{v}$ just picks out a diagonal element, we see that $h_{nn} \geq 0$. Furthermore, over all vectors $\bm{v}$ of length 1, the quadratic form $\bm{v}^{\text{T}}\mathbf{H}\bm{v}$ is maximal when $\bm{v}$ is the leading eigenvector of $\mathbf{H}$, in which case it equals the largest eigenvalue, $\lambda_{\text{max}}$. Since one-hot vectors have length 1, $\bm{v}^{\text{T}}\mathbf{H}\bm{v} \leq \lambda_{\text{max}}$ for any one-hot vector (with equality only when the leading eigenvector is precisely this one-hot vector). It only remains to show that $\lambda_{\text{max}}$ is no more than 1, which follows from the fact that $\mathbf{H}$ is idempotent, i.e.\ $\mathbf{H}\mathbf{H} = \mathbf{H}$. For any eigenvector $\bm{u}$ with eigenvalue $\lambda$, we have

\begin{equation}
    \lambda\bm{u} = \mathbf{H}\bm{u} = \mathbf{H}\mathbf{H}\bm{u} = \mathbf{H}\lambda\bm{u} = \lambda^2\bm{u},
    \tag{A.8}
\end{equation}

which can only be satisfied (for non-trivial eigenvectors) when $\lambda$ is 0 or 1. Therefore, $0 \leq h_{nn} \leq 1$.

Since the diagonal elements of the residual matrix, $m_{nn}$, are just 1 minus the diagonal elements of the hat matrix, we see that $0 \leq m_{nn} \leq 1$ as well.
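These properties are also easy to confirm numerically; the following sketch (random data, dimensions chosen arbitrarily) checks that $\mathbf{H}$ is symmetric and idempotent, and that the diagonals of $\mathbf{H}$ and $\mathbf{M}$ lie in $[0,1]$, up to floating-point tolerance.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 20
X = rng.standard_normal((D, N))

H = X.T @ np.linalg.solve(X @ X.T, X)              # hat matrix
assert np.allclose(H, H.T)                         # H is symmetric, so H^T H = H H
assert np.allclose(H @ H, H)                       # idempotent: HH = H

h, tol = np.diag(H), 1e-12
assert np.all((h > -tol) & (h < 1 + tol))          # 0 <= h_nn <= 1
assert np.all((1 - h > -tol) & (1 - h < 1 + tol))  # 0 <= m_nn <= 1
\end{verbatim}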

$K$-fold cross validation.

With a little effort, the analysis can be extended to more general folded cross validation, in which the folds can consist of more than one sample, although they must still be disjoint. The left-out samples from fold $k$ then form a matrix which we will call $\mathbf{X}_k$, and which for concreteness we imagine (without loss of generality) to comprise consecutive columns of the data matrix $\mathbf{X}$. Note that the derivation will not assume that $\mathbf{X}_k$ contains the same number of columns as block $\mathbf{X}_j$ for any other $j$, although usually one lets all folds except the last contain an equal number of samples. Now, to invert a matrix involving the matrix $\mathbf{X}_k$ rather than the vector $\bm{x}_n$, we will need the full-blown Woodbury inversion lemma (Eq. B.14):

\begin{equation*}
\begin{split}
    \bm{\hat{y}}_{\text{LOFO},k}
    &= \mathbf{X}_k^{\text{T}}\left( \mathbf{X}\mathbf{X}^{\text{T}} - \mathbf{X}_k\mathbf{X}_k^{\text{T}} \right)^{-1}\left( \mathbf{X}\bm{y} - \mathbf{X}_k\bm{y}_k \right) \\
    &= \mathbf{X}_k^{\text{T}}\left[ \left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1} + \left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}_k \left( \mathbf{I} - \mathbf{X}_k^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}_k \right)^{-1} \mathbf{X}_k^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1} \right]\left( \mathbf{X}\bm{y} - \mathbf{X}_k\bm{y}_k \right) \\
    &= \left[ \mathbf{I} + \mathbf{H}_k\left( \mathbf{I} - \mathbf{H}_k \right)^{-1} \right]\mathbf{X}_k^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\left( \mathbf{X}\bm{y} - \mathbf{X}_k\bm{y}_k \right),
\end{split}
\end{equation*}

where we have given a name to the $k^{\text{th}}$ block on the diagonal of the hat matrix:

\begin{equation*}
    \mathbf{H}_k \coloneqq \mathbf{X}_k^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\mathbf{X}_k.
\end{equation*}

Now, it is easily verified that the bracketed quantity on the final line is merely $(\mathbf{I} - \mathbf{H}_k)^{-1}$ (since multiplying it by $\mathbf{I} - \mathbf{H}_k$ yields the identity), so it follows that the $k^{\text{th}}$ leave-one-fold-out predictions are

\begin{equation*}
\begin{split}
    \bm{\hat{y}}_{\text{LOFO},k}
    &= (\mathbf{I} - \mathbf{H}_k)^{-1}\mathbf{X}_k^{\text{T}}\left(\mathbf{X}\mathbf{X}^{\text{T}}\right)^{-1}\left( \mathbf{X}\bm{y} - \mathbf{X}_k\bm{y}_k \right) \\
    &= (\mathbf{I} - \mathbf{H}_k)^{-1}\left( \mathbf{X}_k^{\text{T}}\hat{\bm{w}} - \mathbf{H}_k\bm{y}_k \right) \\
    &= (\mathbf{I} - \mathbf{H}_k)^{-1}\left( \bm{\hat{y}}_k - \mathbf{H}_k\bm{y}_k \right).
\end{split}
\end{equation*}
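The sketch below (hypothetical random data; one fold of consecutive columns, as assumed in the derivation) verifies both the claim that the bracketed quantity equals $(\mathbf{I} - \mathbf{H}_k)^{-1}$ and the final expression for $\bm{\hat{y}}_{\text{LOFO},k}$, by comparing against a direct refit with the fold deleted.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
D, N = 3, 24
X = rng.standard_normal((D, N))
y = rng.standard_normal(N)
idx = np.arange(6, 12)                   # fold k: consecutive columns

G_inv = np.linalg.inv(X @ X.T)           # (X X^T)^{-1}
X_k, y_k = X[:, idx], y[idx]
H_k = X_k.T @ G_inv @ X_k                # k-th diagonal block of H
I_k = np.eye(len(idx))

# The bracketed quantity is indeed (I - H_k)^{-1}
assert np.allclose(I_k + H_k @ np.linalg.inv(I_k - H_k),
                   np.linalg.inv(I_k - H_k))

# One-step predictions for the held-out fold
y_hat_k = X_k.T @ (G_inv @ (X @ y))      # ordinary predictions on fold k
y_lofo_k = np.linalg.solve(I_k - H_k, y_hat_k - H_k @ y_k)

# Naive check: refit without the fold and predict it
X_rest, y_rest = np.delete(X, idx, axis=1), np.delete(y, idx)
w_rest = np.linalg.solve(X_rest @ X_rest.T, X_rest @ y_rest)
assert np.allclose(y_lofo_k, X_k.T @ w_rest)
\end{verbatim}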

Correspondingly, the $k^{\text{th}}$ leave-one-fold-out residuals are

\begin{equation*}
\begin{split}
    \bm{e}_{\text{LOFO},k} = \bm{y}_k - \bm{\hat{y}}_{\text{LOFO},k}
    &= \bm{y}_k - (\mathbf{I} - \mathbf{H}_k)^{-1}\left( \bm{\hat{y}}_k - \mathbf{H}_k\bm{y}_k \right) \\
    &= (\mathbf{I} - \mathbf{H}_k)^{-1}\left( (\mathbf{I} - \mathbf{H}_k)\bm{y}_k - \bm{\hat{y}}_k + \mathbf{H}_k\bm{y}_k \right) \\
    &= (\mathbf{I} - \mathbf{H}_k)^{-1}\left( \bm{y}_k - \bm{\hat{y}}_k \right) \\
    &= \mathbf{M}_k^{-1}\bm{e}_k,
\end{split}
\end{equation*}

where we have named the $k^{\text{th}}$ block $\mathbf{I} - \mathbf{H}_k \eqqcolon \mathbf{M}_k$, analogously to the leave-one-out residual matrix. Now if we define $\mathbf{M}_{\text{bd}}$ to be the block-diagonal matrix with the blocks $\mathbf{M}_k$ on the diagonal, we can write all the residuals at once as

\begin{equation}
    \bm{e}_{\text{LOFO}} = \mathbf{M}_{\text{bd}}^{-1}\bm{e},
    \tag{A.9}
\end{equation}

since a block diagonal matrix can be inverted by inverting its blocks.

Eq. A.9 looks reassuringly like Eq. A.7, but the resemblance is somewhat misleading: in practice we would not (typically) invert $\mathbf{M}_{\text{bd}}$, but would rather invert each block individually. Thus, although it is possible to compute these residuals in ``one step,'' one typically takes $K$ steps, like the naïve calculation. But Eq. A.9 will still be cheaper than the naïve calculation, which inverts (much) larger matrices at each step.
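A brief sketch of this procedure (hypothetical data and fold sizes, NumPy assumed): compute the ordinary residuals once, then, fold by fold, solve against the corresponding diagonal block of $\mathbf{M}$, exactly as Eq. A.9 prescribes. The naive refit appears only to confirm agreement; it is the expensive calculation the block formula avoids.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
D, N = 3, 24
X = rng.standard_normal((D, N))
y = rng.standard_normal(N)
folds = np.array_split(np.arange(N), 4)           # K = 4 disjoint folds

H = X.T @ np.linalg.solve(X @ X.T, X)             # hat matrix, computed once
e = y - H @ y                                     # ordinary residuals

e_lofo = np.empty(N)
for idx in folds:
    M_k = np.eye(len(idx)) - H[np.ix_(idx, idx)]  # k-th diagonal block of M
    e_lofo[idx] = np.linalg.solve(M_k, e[idx])    # invert one small block per fold

    # Naive check for this fold: refit without it and form its residuals
    X_rest, y_rest = np.delete(X, idx, axis=1), np.delete(y, idx)
    w_rest = np.linalg.solve(X_rest @ X_rest.T, X_rest @ y_rest)
    assert np.allclose(e_lofo[idx], y[idx] - X[:, idx].T @ w_rest)
\end{verbatim}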