$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. 
\frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} 
\def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Bias and variance tradeoff in prediction

\(\,\)

Goals

  • Derive estimates of mean squared error that can be used for prediction
    • Bias-variance tradeoff and mean squared error
    • Briefly discuss AIC, BIC, adjusted \(R^2\)
    • Cross-validation
    • Exact leave-one-out CV for linear regression

Choosing which regressors to include

Let us assume that we’re operating under homoskedastic assumptions with \(\xv_n \in \rdom{P}\), so that \[ \begin{aligned} \expect{\y_n \vert \xv_n} ={}& \betav^\trans \xv_n \Rightarrow \\ \y_n ={}& \betav^\trans \xv_n + \res_n \textrm{ where}\\ \res_n :={}& \y_n - \expect{\y_n \vert \xv_n}. \end{aligned} \]

Suppose we are trying to decide which regressors to include in our regression. Specifically, suppose that we have ordered our regressors somehow, and we want to select the first \(K\) regressors, and want to choose \(K\). In the linear model, limiting to \(K\) regressors is equivalent to defining

\[ \betav_K := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \] Let \(\X_K = \X_{(1:N)(1:K)}\) denote the \(N\times K\) matrix consisting of the first \(K\) columns of \(\X\).

How should we choose \(K\)? Just as we chose \(\betavhat\) to minimize a notion of “loss” or “error”, we want to use the same notion of “loss” to try to choose \(K\). As we shall see, some extra complications arise.

Empirical versus observed loss

Early in the class we motivated OLS by minimizing the sum of squared errors through the observed data: \[ \betavhat := \argmin{\betav} \losshat(\betav) \quad\textrm{where}\quad \losshat(\betav) := \meann \left(\y_n - \betav^\trans \xv_n\right)^2. \] Recall that this is a proxy for the “ideal” estimator \[ \betavstar := \argmin{\betav} \loss(\betav) \quad\textrm{where}\quad \loss(\betav) := \expect{\left(\y - \betav^\trans \xv \right)^2}, \] since, for any particular \(\betav\), \(\losshat(\betav) \approx \loss(\betav)\) by the LLN. (Note: to make this rigorous, you need to show that the two are close uniformly in \(\betav\), but this stronger version of the LLN is beyond the scope of the present class.) Recall also that \(\betavstar\) is the true value of \(\betav\) under the LE assumption.
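The LLN approximation \(\losshat(\betav) \approx \loss(\betav)\) for a fixed \(\betav\) can be seen in a minimal simulation sketch; the regressor distribution, coefficient values, and seed below are all illustrative assumptions, not from the text:

```python
import numpy as np

# Sketch: for a FIXED (not estimated) candidate beta, the empirical loss
# Lhat(beta) approaches the population loss L(beta) as N grows.
rng = np.random.default_rng(0)
P = 3
beta_true = np.array([1.0, -2.0, 0.5])   # assumed true coefficients
beta_test = np.array([0.5, -1.0, 0.0])   # any fixed candidate beta
sigma = 1.0

def empirical_loss(beta, N):
    """Draw a fresh dataset of size N and compute Lhat(beta)."""
    X = rng.normal(size=(N, P))
    y = X @ beta_true + sigma * rng.normal(size=N)
    return np.mean((y - X @ beta) ** 2)

# Population loss: with E[x x^T] = I (standard normal regressors),
# L(beta) = E[(y - x^T beta)^2] = sigma^2 + ||beta_true - beta||^2.
pop_loss = sigma ** 2 + np.sum((beta_true - beta_test) ** 2)

loss_small = empirical_loss(beta_test, 100)
loss_large = empirical_loss(beta_test, 100_000)
```

With large \(N\), `loss_large` should sit close to `pop_loss`, while `loss_small` fluctuates more.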

The bias-variance tradeoff

Let us imagine that we could actually compute \(\loss(\betav)\), and let \(\betavhat_K := (\X_K^\trans \X_K)^{-1}\X_K^\trans\Y\) denote the estimator using only the first \(K\) regressors. Let us directly compute how \(\loss(\betavhat_K)\) depends on \(K\). Note that \[ \expect{\betavhat_K} =: \bv_K \ne \betavstar, \] since we assume that we may have an incorrectly specified model when using only \(K\) regressors. That is, \(\betavhat_K\) is biased as an estimator of \(\betavstar\), both because it sets to zero components of \(\betavstar\) which might not be zero, and because it may be biased even for the non-zero components, to the extent that there is omitted variable bias due to correlation between the omitted and included regressors.

Suppose that \(\y_\new\) and \(\xv_\new\) are new datapoints that were not used in the regression, and so are independent of \(\betavhat_K\). Then

\[ \begin{aligned} \loss(\betavhat_K) :={}& \expect{(\y_\new - \betavhat_K^\trans \xv_\new)^2} \\={}& \expect{\left(\y_\new - \bv_K^\trans \xv_\new + \bv_K^\trans \xv_\new - \betavhat_K^\trans \xv_\new \right)^2} \\ ={}& \expect{\left(\y_\new - \bv_K^\trans \xv_\new \right)^2} + \expect{\left(\bv_K^\trans \xv_\new - \betavhat_K^\trans \xv_\new \right)^2} + \\& 2 \expect{\left(\y_\new - \bv_K^\trans \xv_\new\right) \xv_\new^\trans} \expect{\bv_K - \betavhat_K} \\={}& \expect{\left(\y_\new - \bv_K^\trans \xv_\new \right)^2} + \expect{\left( \left(\bv_K - \betavhat_K\right)^\trans \xv_\new \right)^2} \end{aligned} \]

because the training and test data are independent by assumption, so the cross term factors, and it then vanishes since \(\expect{\betavhat_K} = \bv_K\). Now, the first term can be further decomposed using the fact that \(\y_\new = \betavstar^\trans \xv_\new + \res_\new\):

\[ \begin{aligned} \expect{\left(\y_\new - \bv_K^\trans \xv_\new \right)^2} ={}& \expect{\left(\betavstar^\trans \xv_\new + \res_\new - \bv_K^\trans \xv_\new \right)^2} \\={}& \expect{\res_\new^2} + \expect{\left(\left(\betavstar - \bv_K\right)^\trans \xv_\new \right)^2} + 2 \expect{\res_\new \left(\betavstar - \bv_K\right)^\trans \xv_\new} \\={}& \expect{\res_\new^2} + \expect{\left(\left(\betavstar - \bv_K\right)^\trans \xv_\new \right)^2}, \end{aligned} \]

where the final line follows from \(\expect{\xv_\new \res_\new} = \zerov\), independently of the training data. We have now decomposed the MSE into three terms, each with a distinct interpretation:

\[ \begin{aligned} \loss(\betavhat_K) ={}& \expect{\res_\new^2} &+& \expect{\left(\left(\betavstar - \bv_K\right)^\trans \xv_\new \right)^2} &+& \expect{\left( \left(\bv_K - \betavhat_K\right)^\trans \xv_\new \right)^2} \\={}& \expect{\res_\new^2} &+& \trace{ \left(\betavstar - \bv_K\right) \left(\betavstar - \bv_K\right)^\trans \expect{\xv_\new \xv_\new^\trans} } &+& \trace{ \cov{\betavhat_K} \expect{\xv_\new \xv_\new^\trans} }\\ =& \textrm{Irreducible error} &+& \textrm{Squared bias of }\betavhat_K &+&\textrm{Variance of }\betavhat_K. \end{aligned} \]

As \(K\) gets larger, these terms can be expected to behave differently:

  • The irreducible error will not change.
  • The bias will tend to decrease.
  • The variance will tend to increase.
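These three behaviors can be illustrated with a small Monte Carlo sketch. The decaying true coefficients, standard normal regressors, sample sizes, and seed below are assumptions chosen for illustration:

```python
import numpy as np

# Sketch: for each K, fit OLS on the first K regressors many times, then
# approximate the squared-bias and variance terms of the decomposition.
rng = np.random.default_rng(1)
N, P, sigma, reps = 50, 8, 1.0, 400
beta_true = 1.0 / (1.0 + np.arange(P))   # decaying true coefficients

def fit_first_K(K):
    """Return the P-vector betahat_K (zeros beyond K) for one fresh dataset."""
    X = rng.normal(size=(N, P))
    y = X @ beta_true + sigma * rng.normal(size=N)
    bhat = np.zeros(P)
    if K > 0:
        bhat[:K] = np.linalg.lstsq(X[:, :K], y, rcond=None)[0]
    return bhat

sq_bias = np.zeros(P + 1)
variance = np.zeros(P + 1)
for K in range(P + 1):
    draws = np.array([fit_first_K(K) for _ in range(reps)])
    b_K = draws.mean(axis=0)             # approximates E[betahat_K] = b_K
    # With E[x x^T] = I, the squared-bias term is ||beta_true - b_K||^2
    # and the variance term is trace(Cov(betahat_K)).
    sq_bias[K] = np.sum((beta_true - b_K) ** 2)
    variance[K] = np.sum(draws.var(axis=0))
```

As expected, `sq_bias` shrinks and `variance` grows as `K` increases, while the irreducible error \(\sigma^2\) is untouched by the choice of `K`.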

In particular, note that if \(\cov{\betavhat_K} = \sigma^2 (\X_K^\trans \X_K)^{-1}\), then

\[ \begin{aligned} \trace{ \cov{\betavhat_K} \expect{\xv_\new \xv_\new^\trans} } \approx{}& \trace{ \begin{pmatrix} \frac{\sigma^2}{N} (\expect{\xv_{K} \xv_{K}^\trans})^{-1} & \zerov_{K(P-K)} \\ \zerov_{(P-K)K} & \zerov_{(P-K)(P-K)} \end{pmatrix} \begin{pmatrix} \expect{\xv_K \xv_K^\trans} & \expect{\xv_K \xv_{P-K}^\trans} \\ \expect{\xv_{P-K} \xv_K^\trans} & \expect{\xv_{P-K} \xv_{P-K}^\trans} \end{pmatrix} } \\={}& \trace{ \frac{\sigma^2}{N} (\expect{\xv_{K} \xv_{K}^\trans})^{-1} \expect{\xv_K \xv_K^\trans} } \\={}& \trace{\frac{\sigma^2}{N} \id_K } \\={}& \frac{K \sigma^2}{N}. \end{aligned} \] Only the diagonal blocks of the product contribute to the trace, so the cross-covariances \(\expect{\xv_K \xv_{P-K}^\trans}\) drop out.
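A quick numerical check of the \(K \sigma^2 / N\) approximation, under the assumed simplification \(\expect{\xv_\new \xv_\new^\trans} = \id\) (standard normal regressors; sizes and seed are illustrative):

```python
import numpy as np

# Sketch: trace(Cov(betahat_K) E[x x^T]) should be close to K sigma^2 / N.
rng = np.random.default_rng(2)
N, K, sigma = 200, 3, 2.0
X = rng.normal(size=(N, K))

# Cov(betahat_K) under homoskedasticity.
cov_betahat = sigma ** 2 * np.linalg.inv(X.T @ X)

# Variance term of the loss decomposition, using E[x x^T] = I_K.
variance_term = np.trace(cov_betahat @ np.eye(K))
prediction = K * sigma ** 2 / N
```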

Further, notice that at \(K=0\) the variance is zero, and the squared-bias term is \(\trace{\betavstar \betavstar^\trans \expect{\xv_\new \xv_\new^\trans}}\), since \(\bv_0 = \zerov\). Consequently, we might expect a “sweet spot” where the bias has decreased some but the variance has not yet increased too much.

The observed loss is biased downwards

We have seen that \(\betavhat\) is a good estimate of \(\betavstar\), in the sense that \(\expect{\betavhat} = \betavstar\) and \(\betavhat \rightarrow \betavstar\) as \(N \rightarrow \infty\). But is \(\losshat(\betavhat)\) a good estimate of \(\loss(\betavstar)\)?

Note that we cannot use the LLN because \(\betavhat\) depends on all the data, so the \(\y_n - \betavhat^\trans \xv_n\) are not independent. However, using results we’ve already shown, there are several ways to see that \(\losshat(\betavhat)\) is downwardly biased as an estimator of \(\loss(\betavstar)\).

Unbiasedness of the residual variance estimator

\[ \begin{aligned} \loss(\betavstar) ={}& \expect{(\y_n - \xv_n^\trans \betavstar)^2} = \sigma^2 \quad\textrm{and}\\ \losshat(\betavhat) ={}& \meann (\y_n - \betavhat^\trans \xv_n)^2 = \meann \reshat_n^2 = \frac{N - P}{N} \sigmahat^2. \end{aligned} \]

Since \(\expect{\sigmahat^2} = \sigma^2\), we have

\[ \expect{\losshat(\betavhat)} = \frac{N - P}{N} \expect{\sigmahat^2} = \frac{N - P}{N} \sigma^2 < \sigma^2 = \loss(\betavstar). \]

The larger \(P\) is, the greater the bias. In fact, when \(P = N\) (and \(\X\) is invertible), \(\losshat(\betavhat) = 0\), no matter what \(\sigma^2\) is.
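A short simulation sketch of this downward bias (the sizes, noise scale, and seed are illustrative assumptions): averaged over many datasets, the in-sample loss concentrates around \(\frac{N-P}{N}\sigma^2\) rather than \(\sigma^2\).

```python
import numpy as np

# Sketch: E[Lhat(betahat)] = (N - P)/N * sigma^2, below the true sigma^2.
rng = np.random.default_rng(3)
N, P, sigma, reps = 40, 10, 1.5, 2000
losses = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(N, P))
    beta_true = rng.normal(size=P)
    y = X @ beta_true + sigma * rng.normal(size=N)
    betahat = np.linalg.lstsq(X, y, rcond=None)[0]
    losses[r] = np.mean((y - X @ betahat) ** 2)   # in-sample loss

mean_loss = losses.mean()
expected = (N - P) / N * sigma ** 2   # the biased-down target
```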

Equivalence to \(R^2\)

Equivalently, \(\losshat(\betavhat) = \frac{1}{N} ESS = \frac{1}{N} TSS(1 - R^2)\), and we have already seen that we cannot use \(R^2\) to choose which regressors to include, since adding regressors never decreases \(R^2\).

Aligning with residuals

We can also show by direct computation that \(\losshat(\betavhat) \le \losshat(\betavstar)\): \[ \begin{aligned} \losshat(\betavhat) ={}& \meann (\y_n - \xv_n^\trans \betavhat)^2 = \frac{1}{N} (\Y - \X\betavhat)^\trans (\Y - \X\betavhat) \\={}& \frac{1}{N} (\Y - \X \betavstar + \X (\betavstar - \betavhat))^\trans (\Y - \X \betavstar + \X(\betavstar - \betavhat)) \\={}& \frac{1}{N} \norm{\Y - \X \betavstar}^2 + \frac{1}{N} (\betavstar - \betavhat)^\trans \X^\trans \X (\betavstar - \betavhat) + \frac{2}{N} (\Y - \X \betavstar)^\trans \X (\betavstar - \betavhat) \\={}& \frac{1}{N} \norm{\resv}^2 + \frac{1}{N} \resv^\trans \X (\X^\trans \X)^{-1} \X^\trans \X (\X^\trans \X)^{-1} \X^\trans \resv - \frac{2}{N} \resv^\trans \X (\X^\trans \X)^{-1} \X^\trans \resv \\={}& \frac{1}{N} \norm{\resv}^2 - \frac{1}{N} \norm{(\X^\trans \X)^{-1/2} \X^\trans \resv}^2 \\\le{}& \frac{1}{N} \norm{\resv}^2 \\={}& \losshat(\betavstar) \approx \loss(\betavstar), \end{aligned} \] where the final line follows from the LLN (because \(\betavstar\) does not depend on the data).

Note that the under-estimation is precisely to the extent that the columns of \(\X\) are able to align with the true unobserved residual, \(\resv\).
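Since the decomposition above is an exact algebraic identity, it can be checked numerically. The setup below is an arbitrary illustrative example (sizes and seed are assumptions):

```python
import numpy as np

# Sketch: verify Lhat(betahat) = ||eps||^2/N - ||proj(eps)||^2/N exactly,
# where proj(eps) projects the true residual onto the column span of X.
rng = np.random.default_rng(4)
N, P = 30, 4
X = rng.normal(size=(N, P))
beta_star = rng.normal(size=P)
eps = rng.normal(size=N)                 # true (unobserved) residuals
y = X @ beta_star + eps

betahat = np.linalg.lstsq(X, y, rcond=None)[0]
loss_hat = np.mean((y - X @ betahat) ** 2)

# Projection of eps onto col(X); its squared norm is the alignment term.
proj = X @ np.linalg.solve(X.T @ X, X.T @ eps)
identity_rhs = np.mean(eps ** 2) - np.mean(proj ** 2)
```

The two quantities agree to floating-point precision, and `loss_hat` never exceeds the average squared true residual.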

Cross-validation and model selection criteria

We might first think to use our F-test to test the null hypothesis that the last \(P - K\) elements of \(\betav\) are zero. However, the problem is that these hypothesis tests are not independent of one another across different values of \(K\). (Why?) Choosing the \(K\) that minimizes the \(F\) statistic uses the data twice in a way that invalidates the hypothesis test.

Instead, we can form estimates of the MSE using held-out data. Note that the in-sample error underestimates the true MSE, simply because we choose \(\betavhat\) to minimize it: \[ \meann (\y_n - \xv_n^\trans \betavhat)^2 \le \meann (\y_n - \xv_n^\trans \betav)^2 \quad \textrm{for any fixed } \betav, \] including \(\betav = \betavstar\).

There are a number of approximate criteria that try to correct the MSE for overfitting (AIC, BIC, adjusted \(R^2\)), but they hold only approximately and asymptotically. Instead, one can approximate the MSE directly by holding out some data.

In particular, if we leave out a single datapoint, and denote by \(\betavhat_{-n}\) the fit without the \(n\)th datapoint, then the squared error on the left-out point is an approximately unbiased estimate of the MSE, since \[ \begin{aligned} \textrm{MSE}_{-n} ={}& (\y_n - \xv_n^\trans \betavhat_{-n})^2 \\ \expect{\textrm{MSE}_{-n} \vert \y_n, \xv_n} ={}& \y_n^2 - 2 \y_n \xv_n^\trans \expect{\betavhat_{-n}} + \xv_n^\trans \expect{\betavhat_{-n} \betavhat_{-n}^\trans} \xv_n \\ \expect{\textrm{MSE}_{-n}} ={}& \expect{\y_n^2} - 2 \expect{\y_n \xv_n^\trans} \expect{\betavhat_{-n}} + \trace{\expect{\betavhat_{-n} \betavhat_{-n}^\trans} \expect{\xv_n \xv_n^\trans} } \\={}& \expect{\y_\new^2} - 2 \expect{\y_\new \xv_\new^\trans} \expect{\betavhat_{-n}} + \trace{\expect{\betavhat_{-n} \betavhat_{-n}^\trans} \expect{\xv_\new \xv_\new^\trans} } \\={}& \expect{(\y_\new - \xv_\new^\trans \betavhat_{-n})^2}, \end{aligned} \]

where the final line follows from the fact that \((\xv_n, \y_n)\) and \((\xv_\new, \y_\new)\) have the same distribution. The only bias arises because \(\betavhat_{-n}\) has a slightly different distribution than \(\betavhat\) on the full data (in particular, it has a slightly larger variance, since it uses only \(N - 1\) datapoints). It follows that we can estimate the MSE using

\[ \widehat{\textrm{MSE}}_{\textrm{LOO}} := \meann \textrm{MSE}_{-n}, \]

the “leave-one-out” cross-validation estimator.
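A sketch of the estimator: the brute-force loop below matches the definition directly, and, as a cross-check, it is compared against the standard hat-matrix shortcut \(\textrm{MSE}_{-n} = \left(\reshat_n / (1 - h_{nn})\right)^2\), a known exact identity for linear regression that is not derived in this section. All simulation settings are illustrative assumptions.

```python
import numpy as np

# Sketch of leave-one-out CV: refit without each datapoint and average
# the held-out squared errors.
rng = np.random.default_rng(5)
N, P, sigma = 25, 3, 1.0
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + sigma * rng.normal(size=N)

mse_loo = np.empty(N)
for n in range(N):
    mask = np.arange(N) != n
    bhat_minus_n = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    mse_loo[n] = (y[n] - X[n] @ bhat_minus_n) ** 2
mse_hat_loo = mse_loo.mean()

# Exact shortcut from a single full-data fit, using the hat-matrix
# diagonal h_nn = x_n^T (X^T X)^{-1} x_n.
betahat = np.linalg.lstsq(X, y, rcond=None)[0]
h = np.einsum("ni,ij,nj->n", X, np.linalg.inv(X.T @ X), X)
mse_hat_shortcut = np.mean(((y - X @ betahat) / (1 - h)) ** 2)
```

The shortcut avoids \(N\) refits, which matters when \(N\) is large.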