A.1 Leave-one-out cross validation for linear regression in one step
As we have seen (Section LABEL:sec:), in vanilla linear regression, given a (fat) data matrix $X \in \mathbb{R}^{d \times N}$ whose columns $\mathbf{x}_n$ are samples and whose rows are variables or “features,” together with a vector of observed outputs $\mathbf{y} \in \mathbb{R}^{N}$, we solve for the regression coefficients $\mathbf{w}$ via the normal equations,
\[
\mathbf{w} = \left(X X^{\mathrm{T}}\right)^{-1} X \mathbf{y}. \tag{A.1}
\]
Our predicted outputs are then:
\[
\hat{\mathbf{y}} = X^{\mathrm{T}} \mathbf{w} = X^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1} X \mathbf{y} = H \mathbf{y}, \tag{A.2}
\]
where we have defined the so-called hat matrix:
\[
H := X^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1} X. \tag{A.3}
\]
If we define the “residual matrix”
\[
R := I - H, \tag{A.4}
\]
then the residuals are likewise expressed succinctly:
\[
\mathbf{e} := \mathbf{y} - \hat{\mathbf{y}} = \left(I - H\right) \mathbf{y} = R \mathbf{y}. \tag{A.5}
\]
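As a sanity check, these definitions can be verified numerically. The following minimal NumPy sketch uses synthetic data; the dimensions, seed, and variable names are invented for illustration and simply mirror the symbols above.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 20                       # features x samples: a "fat" data matrix
X = rng.standard_normal((d, N))    # columns are samples
y = rng.standard_normal(N)         # observed outputs

G = X @ X.T                        # the Gram matrix X X^T
w = np.linalg.solve(G, X @ y)      # normal equations (Eq. A.1)
y_hat = X.T @ w                    # predictions (Eq. A.2)
H = X.T @ np.linalg.solve(G, X)    # hat matrix (Eq. A.3)
R = np.eye(N) - H                  # residual matrix (Eq. A.4)
e = y - y_hat                      # residuals

assert np.allclose(y_hat, H @ y)   # y_hat = H y
assert np.allclose(e, R @ y)       # e = R y (Eq. A.5)
\end{verbatim}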
The leave-one-out residuals.
Now, in each step of the leave-one-out procedure, one column of $X$ and the corresponding element of $\mathbf{y}$ are omitted. If we call the missing column $\mathbf{x}_i$ and the missing output $y_i$, this omission evidently turns $X\mathbf{y}$, the second factor in the normal equations (Eq. A.1), into $X\mathbf{y} - \mathbf{x}_i y_i$. The first term, $X X^{\mathrm{T}}$, can be expressed as a sum of outer products, $\sum_{n} \mathbf{x}_n \mathbf{x}_n^{\mathrm{T}}$, so losing $\mathbf{x}_i$ simply omits one outer product: $X X^{\mathrm{T}} - \mathbf{x}_i \mathbf{x}_i^{\mathrm{T}}$. Inverting this quantity with the Sherman-Morrison formula, Eq. B.15, and bringing it together with the second term, gives
\[
\mathbf{w}_{-i} = \left[\left(X X^{\mathrm{T}}\right)^{-1} + \frac{\left(X X^{\mathrm{T}}\right)^{-1} \mathbf{x}_i \mathbf{x}_i^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1}}{1 - \mathbf{x}_i^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1} \mathbf{x}_i}\right] \left(X \mathbf{y} - \mathbf{x}_i y_i\right)
\]
for the regression coefficients $\mathbf{w}_{-i}$ at the $i$th step.
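The Sherman-Morrison downdate is easy to check numerically: the closed form for $\mathbf{w}_{-i}$ above should match simply refitting the regression without sample $i$. The sketch below uses the same kind of invented synthetic data as before.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, N, i = 5, 20, 3                                # i: the held-out sample
X = rng.standard_normal((d, N))
y = rng.standard_normal(N)

G_inv = np.linalg.inv(X @ X.T)
x_i, y_i = X[:, i], y[i]

# Sherman-Morrison downdate of the inverse Gram matrix, then w_{-i}.
num = G_inv @ np.outer(x_i, x_i) @ G_inv
den = 1.0 - x_i @ G_inv @ x_i
w_loo = (G_inv + num / den) @ (X @ y - x_i * y_i)

# Naive alternative: refit the regression without column i.
X_rest, y_rest = np.delete(X, i, axis=1), np.delete(y, i)
w_naive = np.linalg.solve(X_rest @ X_rest.T, X_rest @ y_rest)

assert np.allclose(w_loo, w_naive)
\end{verbatim}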
Let us write the prediction at the $i$th step for the held-out datum, namely $\hat{y}_{-i} := \mathbf{x}_i^{\mathrm{T}} \mathbf{w}_{-i}$. It is easily seen that $\mathbf{x}_i^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1} \mathbf{x}_i$ is the $i$th diagonal element of the hat matrix, Eq. A.3, which allows us to simplify the expression for the prediction:
\begin{align*}
\hat{y}_{-i} &= \mathbf{x}_i^{\mathrm{T}} \left[\left(X X^{\mathrm{T}}\right)^{-1} + \frac{\left(X X^{\mathrm{T}}\right)^{-1} \mathbf{x}_i \mathbf{x}_i^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1}}{1 - H_{ii}}\right] \left(X \mathbf{y} - \mathbf{x}_i y_i\right)\\
&= \left(\hat{y}_i - H_{ii} y_i\right) + \frac{H_{ii} \left(\hat{y}_i - H_{ii} y_i\right)}{1 - H_{ii}}\\
&= \frac{\hat{y}_i - H_{ii} y_i}{1 - H_{ii}} = \frac{\hat{y}_i - H_{ii} y_i}{R_{ii}},
\end{align*}
where on the final line we use the fact, from Eq. A.4, that the diagonal entries of the residual matrix are $R_{ii} = 1 - H_{ii}$. Collecting together the predictions for the held-out data at all $N$ steps, we have
\[
\left(\hat{\mathbf{y}}_{\text{loo}}\right)_i = \frac{\hat{y}_i - H_{ii} y_i}{R_{ii}}, \qquad i = 1, \ldots, N.
\]
We can clean up this equation by giving names to the diagonal matrices constructed from the diagonals of $H$ and $R$; that is,
\[
\tilde{H} := \operatorname{diag}\left(H_{11}, \ldots, H_{NN}\right), \qquad \tilde{R} := \operatorname{diag}\left(R_{11}, \ldots, R_{NN}\right) = I - \tilde{H},
\]
in which case
\[
\hat{\mathbf{y}}_{\text{loo}} = \tilde{R}^{-1} \left(\hat{\mathbf{y}} - \tilde{H} \mathbf{y}\right). \tag{A.6}
\]
From the leave-one-out predictions it is simple to calculate the leave-one-out residuals:
\[
\mathbf{e}_{\text{loo}} := \mathbf{y} - \hat{\mathbf{y}}_{\text{loo}} = \tilde{R}^{-1} \left(\tilde{R} \mathbf{y} - \hat{\mathbf{y}} + \tilde{H} \mathbf{y}\right) = \tilde{R}^{-1} \left(\mathbf{y} - \hat{\mathbf{y}}\right) = \tilde{R}^{-1} \mathbf{e}. \tag{A.7}
\]
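Eq. A.7 is also easy to verify numerically: the one-step residuals, obtained by rescaling the ordinary residuals by $1/R_{ii}$, agree with those produced by the naïve $N$-step procedure. The sketch below again uses invented synthetic data.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
d, N = 5, 20
X = rng.standard_normal((d, N))
y = rng.standard_normal(N)

H = X.T @ np.linalg.solve(X @ X.T, X)
e = y - H @ y                                 # ordinary residuals
e_loo_onestep = e / (1.0 - np.diag(H))        # Eq. A.7, elementwise

e_loo_naive = np.empty(N)                     # the N-step calculation
for i in range(N):
    X_i = np.delete(X, i, axis=1)
    y_i = np.delete(y, i)
    w_i = np.linalg.solve(X_i @ X_i.T, X_i @ y_i)
    e_loo_naive[i] = y[i] - X[:, i] @ w_i

assert np.allclose(e_loo_onestep, e_loo_naive)
\end{verbatim}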
Interpretation of the leave-one-out residuals.
Eqs. A.6 and A.7 are pleasingly elegant, and yield intuitive interpretations. To begin with, note that in a standard linear regression, the $i$th diagonal element of the hat matrix, $H_{ii}$, tells us how much weight the observation $y_i$ gets in predicting itself under the normal equations.\footnote{Put this way, the claim may sound strange—shouldn’t $y_i$ get a weight of 1, and all other samples be weighted with zeros? Remember: the model requires that the vector of samples $\mathbf{y}$ be “compressed” into the space of the inputs, $\mathbb{R}^d$, before expanding back out into predictions. As long as there are fewer dimensions than samples (as is typically the case in regression), the hat matrix cannot be the identity matrix: $X^{\mathrm{T}}$ will be “tall,” in which case (see Eq. A.3) there does not exist a matrix $A$ such that $X^{\mathrm{T}} A = I$.} Intuitively, we expect this weight to be between 0 and 1, and indeed we prove this below. Consequently, although perhaps less intuitively, the diagonal elements of the residual matrix, $R_{ii}$, are also between 0 and 1 (since $R_{ii} = 1 - H_{ii}$). They tell us what fraction of $y_i$ turns into error in the prediction of $y_i$ itself.
Eq. A.6 tells us how to transform the standard predictions ($\hat{\mathbf{y}} = H\mathbf{y}$) into leave-one-out predictions ($\hat{\mathbf{y}}_{\text{loo}}$). First, we remove from each prediction $\hat{y}_i$ the contribution from $y_i$ (to which by assumption we have no access!), namely $H_{ii} y_i$. The vector of all contributions to be removed is $\tilde{H}\mathbf{y}$, so the complete adjustment is $\hat{\mathbf{y}} - \tilde{H}\mathbf{y}$. Since $0 \le H_{ii} \le 1$ for all $i$, this shrinks $\hat{y}_i$, and therefore we will have to scale it back up to the right size. This is accomplished by the pre-multiplication with $\tilde{R}^{-1}$, which scales the $i$th entry of $\hat{\mathbf{y}} - \tilde{H}\mathbf{y}$ by $1/R_{ii}$. Since $0 \le R_{ii} \le 1$, this scaling will indeed increase the magnitude of the prediction. And the amount of scaling applied is sensible: If, for example, the original (non-cross-validated) prediction of $y_i$ relied heavily on $y_i$ itself—that is, if $H_{ii}$ is close to 1—then the corresponding $R_{ii}$ would be close to zero, which makes sense: the prediction would be scaled up considerably to make up for the missing $y_i$ from the inputs. At the other extreme, if the original prediction of $y_i$ relied mostly on the other samples $y_{j \ne i}$—that is, if $H_{ii}$ is close to zero—then the corresponding $R_{ii}$ would be close to 1: very little scaling would be applied to $\hat{y}_i - H_{ii} y_i$, which is already close to $\hat{y}_i$.
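For concreteness, consider the two extremes with some invented numbers, used purely for illustration:
\[
H_{ii} = 0.9:\quad \hat{y}_{-i} = \frac{\hat{y}_i - 0.9\, y_i}{0.1}; \qquad\qquad H_{ii} = 0.05:\quad \hat{y}_{-i} = \frac{\hat{y}_i - 0.05\, y_i}{0.95} \approx \hat{y}_i.
\]
In the first case, the small remainder left after subtracting $y_i$'s contribution is scaled up tenfold; in the second, the prediction is barely altered.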
The diagonal elements of the hat and residual matrices.
It is easily verified that $\left(X X^{\mathrm{T}}\right)^{-1}$ is positive definite, and therefore
\[
\mathbf{v}^{\mathrm{T}} H \mathbf{v} = \left(X\mathbf{v}\right)^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1} \left(X\mathbf{v}\right) \ge 0 \quad \text{for any vector } \mathbf{v}.
\]
Letting $\mathbf{v}$ be a “one-hot” vector (a single 1 in the $i$th position and zeros elsewhere), so that the first term just picks out a diagonal element, we see that $H_{ii} \ge 0$. Furthermore, for all vectors $\mathbf{v}$ of length 1, the quadratic form $\mathbf{v}^{\mathrm{T}} H \mathbf{v}$ is maximal when $\mathbf{v}$ is the leading eigenvector, $\mathbf{v}_{\max}$, in which case $\mathbf{v}^{\mathrm{T}} H \mathbf{v} = \lambda_{\max}$. Since one-hot vectors have length 1, $H_{ii} \le \lambda_{\max}$ for any one-hot vector (with equality only when the leading eigenvector is precisely this one-hot vector). It only remains to show that $\lambda_{\max}$ is no more than 1, which follows from the fact that $H$ is idempotent, i.e. $HH = H$. For any eigenvector $\mathbf{v}$ with eigenvalue $\lambda$, we have
\[
\lambda \mathbf{v} = H \mathbf{v} = H H \mathbf{v} = \lambda H \mathbf{v} = \lambda^2 \mathbf{v},
\]
which can only be satisfied when $\lambda$ is 0 or 1 (assuming non-trivial eigenvectors). Therefore, $\lambda_{\max} \le 1$, and hence $0 \le H_{ii} \le 1$.
Since the diagonal elements of the residual matrix, $R_{ii} = 1 - H_{ii}$, are just 1 minus the diagonal elements of the hat matrix, we see that $0 \le R_{ii} \le 1$ as well.
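These claims (idempotency, eigenvalues of 0 or 1, and diagonals of $H$ and $R$ in $[0,1]$) can also be checked numerically; the following sketch again uses invented synthetic data with full-rank $X$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 20
X = rng.standard_normal((d, N))                       # full rank w.p. 1

H = X.T @ np.linalg.solve(X @ X.T, X)
R = np.eye(N) - H

assert np.allclose(H @ H, H)                          # H is idempotent
eigvals = np.sort(np.linalg.eigvalsh(H))              # H is symmetric
assert np.allclose(eigvals[-d:], 1.0)                 # d eigenvalues equal 1 ...
assert np.allclose(eigvals[:-d], 0.0)                 # ... and the rest equal 0
assert np.all((np.diag(H) >= 0) & (np.diag(H) <= 1))  # 0 <= H_ii <= 1
assert np.all((np.diag(R) >= 0) & (np.diag(R) <= 1))  # 0 <= R_ii <= 1
\end{verbatim}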
$K$-fold cross validation.
With a little effort, the analysis can be extended to more general $K$-fold cross validation, in which the folds can consist of more than one sample, although they must still be disjoint. The left-out samples from fold $k$ then form a matrix which we will call $X_k$, and which for concreteness we imagine (without loss of generality) to comprise consecutive columns of the data matrix $X$. Note that the derivation will not assume that $X_k$ contains the same number of columns as block $X_j$ for any other $j$, although usually one lets all folds except the last contain an equal number of samples. Now, to invert a matrix involving this matrix $X_k$ rather than the vector $\mathbf{x}_i$, we will need the full-blown Woodbury inversion lemma (Eq. B.14):
\[
\left(X X^{\mathrm{T}} - X_k X_k^{\mathrm{T}}\right)^{-1} = \left(X X^{\mathrm{T}}\right)^{-1} + \left(X X^{\mathrm{T}}\right)^{-1} X_k \left(I - H_{kk}\right)^{-1} X_k^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1},
\]
where we have given a name to the $k$th block on the diagonal of the hat matrix:
\[
H_{kk} := X_k^{\mathrm{T}} \left(X X^{\mathrm{T}}\right)^{-1} X_k.
\]
Writing $\mathbf{y}_k$ for the held-out outputs and $\hat{\mathbf{y}}_k$ for the corresponding block of the standard predictions, the prediction for the held-out fold becomes
\begin{align*}
\hat{\mathbf{y}}_{-k} &= X_k^{\mathrm{T}} \left(X X^{\mathrm{T}} - X_k X_k^{\mathrm{T}}\right)^{-1} \left(X \mathbf{y} - X_k \mathbf{y}_k\right)\\
&= \left(\hat{\mathbf{y}}_k - H_{kk} \mathbf{y}_k\right) + H_{kk} \left(I - H_{kk}\right)^{-1} \left(\hat{\mathbf{y}}_k - H_{kk} \mathbf{y}_k\right)\\
&= \left[I + H_{kk} \left(I - H_{kk}\right)^{-1}\right] \left(\hat{\mathbf{y}}_k - H_{kk} \mathbf{y}_k\right).
\end{align*}
Now, it is easily verified that the bracketed quantity on the final line is merely $\left(I - H_{kk}\right)^{-1}$ (since multiplying it by $I - H_{kk}$ yields the identity), so it follows that the leave-one-fold-out predictions are
\[
\hat{\mathbf{y}}_{-k} = \left(I - H_{kk}\right)^{-1} \left(\hat{\mathbf{y}}_k - H_{kk} \mathbf{y}_k\right). \tag{A.8}
\]
Correspondingly, the leave-one-fold-out residuals are
\[
\mathbf{e}_{-k} := \mathbf{y}_k - \hat{\mathbf{y}}_{-k} = \left(I - H_{kk}\right)^{-1} \left(\mathbf{y}_k - \hat{\mathbf{y}}_k\right) = R_{kk}^{-1} \mathbf{e}_k,
\]
where we have named the block $R_{kk} := I - H_{kk}$, analogously to the leave-one-out residual matrix, and $\mathbf{e}_k$ is the fold-$k$ block of the ordinary residuals. Now if we define $\bar{R}$ to be the block-diagonal matrix with the blocks $R_{kk}$ on the diagonal, we can write all the residuals at once as
\[
\mathbf{e}_{\text{cv}} = \bar{R}^{-1} \mathbf{e}, \tag{A.9}
\]
since a block diagonal matrix can be inverted by inverting its blocks.
Eq. A.9 looks reassuringly like Eq. A.7, but the resemblance is somewhat misleading: In practice we would not (typically) invert $\bar{R}$ all at once, but would rather invert each block $R_{kk}$ individually. Thus, although it is possible to compute these residuals in “one step,” one typically takes $K$ steps, like the naïve calculation. But Eq. A.9 will still be cheaper than the naïve calculation, which at each step inverts a (much) larger matrix: the $d \times d$ downdated matrix $X X^{\mathrm{T}} - X_k X_k^{\mathrm{T}}$, rather than a block $R_{kk}$ whose size is only the number of samples in the fold.
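The following sketch illustrates this recipe on invented synthetic data with made-up fold sizes: the ordinary residuals are corrected fold by fold using only the small blocks $R_{kk}$, and the result matches the naïve per-fold refit.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
d, N, K = 5, 20, 4
X = rng.standard_normal((d, N))
y = rng.standard_normal(N)
folds = np.array_split(np.arange(N), K)       # consecutive columns per fold

H = X.T @ np.linalg.solve(X @ X.T, X)
e = y - H @ y                                 # ordinary residuals

e_cv = np.empty(N)                            # Eq. A.9, block by block
for idx in folds:
    R_kk = np.eye(len(idx)) - H[np.ix_(idx, idx)]
    e_cv[idx] = np.linalg.solve(R_kk, e[idx])

e_cv_naive = np.empty(N)                      # naive refit for each fold
for idx in folds:
    keep = np.setdiff1d(np.arange(N), idx)
    w_k = np.linalg.solve(X[:, keep] @ X[:, keep].T, X[:, keep] @ y[keep])
    e_cv_naive[idx] = y[idx] - X[:, idx].T @ w_k

assert np.allclose(e_cv, e_cv_naive)
\end{verbatim}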