Variable selection and the F-test
Goals
- Derive estimates of mean squared error that can be used for prediction
- Bias-variance tradeoff and mean squared error
- Briefly discuss AIC, BIC, adjusted \(R^2\)
- Cross-validation
- Exact leave-one-out CV for linear regression
An ordering of regressors
Let’s consider scalar \(\x_n\), and IID pairs \((\x_n, \y_n)\) (the “machine learning” assumptions). Recall our general prediction problem,
\[ f^*(\cdot) := \argmin{f} \expect{(\y_\new - f(\x_\new))^2}, \]
with solution \(f^*(\x_\new) = \expect{\y_\new \vert \x_\new}\). In general, this conditional expectation need not be a linear function of any particular \(\x_\new\) (in fact, in reality, it surely isn’t). We motivated the least squares solution by finding the best linear approximation of the form \(\beta \x_n\).
But that approximation took the regressor \(\x_n\) as given. If we allow ourselves to additionally consider transformations of \(\x_n\), then we can form arbitrarily good linear approximations to generic functions with series expansions. For example, for sufficiently regular \(\expect{\y_\new \vert \x_\new}\), we could form the Taylor series,
\[ \expect{\y_\new \vert \x_\new} = \sum_{k=0}^\infty \frac{1}{k!} \beta_k \x_\new^k \approx \sum_{k=0}^K \frac{1}{k!} \beta_k \x_\new^k. \]
We could then form the new \((K+1)\)-dimensional regressor
\[ \zv_n := \begin{pmatrix} 1 \\ \x_n \\ \x_n^2 \\ \vdots \\ \x_n^K \end{pmatrix} \]
and so have \(\expect{\y_\new \vert \x_\new} \approx \betav^\trans \zv_n\), and find \(\betav\) by ordinary least squares. (Note that, in practice, the matrix \(\Z^\trans \Z\) will be very badly conditioned, so you would actually want to use Legendre or Hermite polynomials. But this is beyond the scope of the present discussion, which is only to motivate variable selection in contexts of increasing complexity.)
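As a concrete (if hypothetical) illustration, the following minimal numpy sketch forms the polynomial regressor \(\zv_n\) and fits it by least squares; the helper name `make_poly_features` and the simulated data are ours, chosen only for illustration.

```python
import numpy as np

def make_poly_features(x, K):
    """One row per observation, with columns (1, x, x^2, ..., x^K)."""
    return np.column_stack([x ** k for k in range(K + 1)])

# Simulated example: a smooth nonlinear conditional mean plus noise.
rng = np.random.default_rng(0)
N, K = 200, 5
x = rng.uniform(-1, 1, size=N)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

Z = make_poly_features(x, K)
# lstsq solves the least-squares problem via an SVD rather than forming
# Z^T Z explicitly, which mitigates (but does not remove) the conditioning
# issue mentioned above.
betahat, *_ = np.linalg.lstsq(Z, y, rcond=None)
yhat = Z @ betahat  # fitted values for the training points
```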
Alternatively, for \(\x_n\) in a bounded set (such as \([-1,1]\)), we can form the Fourier series approximation
\[ \expect{\y_\new \vert \x_\new} = \beta_0 + \sum_{k=1}^\infty (\beta_{2k - 1} \sin(\pi k \x_n) + \beta_{2k} \cos(\pi k \x_n)) \approx \beta_0 + \sum_{k=1}^K (\beta_{2k - 1} \sin(\pi k \x_n) + \beta_{2k} \cos(\pi k \x_n)) = \zv_n^\trans \betav \]
with
\[ \zv_n := \begin{pmatrix} 1 \\ \sin(\pi \x_n) \\ \cos(\pi \x_n) \\ \sin(2\pi \x_n) \\ \cos(2\pi \x_n) \\ \vdots \\ \sin(K \pi \x_n) \\ \cos(K \pi \x_n) \\ \end{pmatrix}. \]
This is a general technique called “basis expansions.” Though further discussion is beyond the scope of the course, it’s worth knowing that it exists as a well-studied subject in machine learning (see, e.g., Chapter 5 of The Elements of Statistical Learning).
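For the Fourier expansion, a similar sketch (again with a helper name and simulated data of our own choosing) builds the design matrix with \(2K + 1\) columns:

```python
import numpy as np

def make_fourier_features(x, K):
    """Columns: 1, sin(pi x), cos(pi x), ..., sin(K pi x), cos(K pi x)."""
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols.append(np.sin(np.pi * k * x))
        cols.append(np.cos(np.pi * k * x))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.exp(x) + 0.3 * rng.standard_normal(200)

Z = make_fourier_features(x, K=3)              # 2 * 3 + 1 = 7 columns
betahat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```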
Alternatively, in a more general regression problem, you might have a natural ordering of regressors. Most naively, for example, you could sort your \(P\) regressors by their marginal correlation with \(\Y\), and attempt to choose which \(K \le P\) regressors to include in a predictive model.
In each case, the problem then becomes choosing the truncation level \(K\). To do so, we will attempt to estimate, and then minimize, the mean squared error. Interestingly, it is not always best to choose the maximum possible \(K\).
The bias-variance tradeoff
Let us now assume that \(\xv_n \in \rdom{P}\) is a vector again, possibly formed via basis expansions, and ordered so that we are trying to decide which of the \(\xv_{n(1:K)}\) regressors to choose. Let us assume that \(P\) is very large (practically infinite) so that we can safely write \[ \begin{aligned} \expect{\y_n \vert \xv_n} ={}& \betav^\trans \xv_n \Rightarrow \\ \y_n ={}& \betav^\trans \xv_n + \res_n \textrm{ where}\\ \res_n :={}& \y_n - \expect{\y_n \vert \xv_n}. \end{aligned} \]
The preceding expression is just a definition of the residual, but note that it satisfies \(\expect{\res_n} = 0\) and \(\expect{\xv_n \res_n} = \zerov\) automatically, by construction rather than by assumption. The only assumption needed is that \(\xv_n\) is expressive enough that \(\expect{\y_n \vert \xv_n} = \betav^\trans \xv_n\).
In the linear model, limiting to \(K\) regressors is equivalent to defining
\[ \betav_K := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \]
and choosing which \(\betav_K\) gives us the best mean squared error. Let \(\X_K = \X_{(1:N)(1:K)}\) denote the \(N\times K\) matrix consisting of the first \(K\) columns of \(\X\).
What is the mean squared error? Let \(\betavhat_K := (\X_K^\trans \X_K)^{-1}\X_K^\trans\Y\) denote the estimator using only the first \(K\) regressors, padded with \(P - K\) trailing zeros (like \(\betav_K\)) so that the product \(\betahat_K^\trans \xv_\new\) makes sense. Then the mean squared error is
\[ \begin{aligned} \textrm{MSE} :={}& \expect{(\y_\new - \betahat_K^\trans \xv_\new)^2} \\={}& \expect{\left(\y_\new - \expect{\betahat_K}^\trans \xv_\new + \expect{\betahat_K}^\trans \xv_\new - \betahat_K^\trans \xv_\new \right)^2} \\ ={}& \expect{\left(\y_\new - \expect{\betahat_K}^\trans \xv_\new \right)^2} + \expect{\left(\left(\expect{\betahat_K} - \betahat_K\right)^\trans \xv_\new \right)^2} + \\& 2 \expect{\left(\y_\new - \expect{\betahat_K}^\trans \xv_\new\right) \xv_\new^\trans} \expect{\expect{\betahat_K} - \betahat_K} \\={}& \expect{\left(\y_\new - \expect{\betahat_K}^\trans \xv_\new \right)^2} + \expect{\left( \left(\expect{\betahat_K} - \betahat_K\right)^\trans \xv_\new \right)^2} \end{aligned} \]
because the training and test data are independent by assumption, so the cross term factors as written, and \(\expect{\expect{\betahat_K} - \betahat_K} = \zerov\). Now, the first term can be further decomposed using the fact that \(\y_\new = \beta^\trans \xv_\new + \res_\new\):
\[ \begin{aligned} \expect{\left(\y_\new - \expect{\betahat_K}^\trans \xv_\new \right)^2} ={}& \expect{\left(\beta^\trans \xv_\new + \res_\new - \expect{\betahat_K}^\trans \xv_\new \right)^2} \\={}& \expect{\res_\new^2} + \expect{\left(\left(\beta - \expect{\betahat_K}\right)^\trans \xv_\new \right)^2} + 2 \expect{\res_\new \left(\beta - \expect{\betahat_K}\right)^\trans \xv_\new} \\={}& \expect{\res_\new^2} + \expect{\left(\left(\beta - \expect{\betahat_K}\right)^\trans \xv_\new \right)^2}, \end{aligned} \]
where the final line follows from \(\expect{\xv_\new \res_\new} = \zerov\), together with the fact that \(\beta - \expect{\betahat_K}\) is a fixed vector that does not depend on the training data. We have now decomposed the MSE into three terms, each with a distinct interpretation:
\[ \begin{aligned} \textrm{MSE} ={}& \expect{\res_\new^2} &+& \expect{\left(\left(\beta - \expect{\betahat_K}\right)^\trans \xv_\new \right)^2} &+& \expect{\left( \left(\expect{\betahat_K} - \betahat_K\right)^\trans \xv_\new \right)^2} \\={}& \expect{\res_\new^2} &+& \trace{ \left(\beta - \expect{\betahat_K}\right) \left(\beta - \expect{\betahat_K}\right)^\trans \expect{\xv_\new \xv_\new^\trans} } &+& \trace{ \cov{\betahat_K} \expect{\xv_\new \xv_\new^\trans} }\\ ={}& \textrm{Irreducible error} &+& \textrm{Squared bias of }\betahat_K &+&\textrm{Variance of }\betahat_K. \end{aligned} \]
As \(K\) gets larger, these terms can be expected to behave differently:
- The irreducible error will not change
- The bias will tend to decrease
- The variance will tend to increase.
In particular, note that if \(\cov{\betahat_K} = \sigma^2 (\X_K^\trans \X_K)^{-1}\) (as under homoskedastic errors), and we use the large-\(N\) approximation \((\X_K^\trans \X_K)^{-1} \approx \frac{1}{N} (\expect{\xv_{K} \xv_{K}^\trans})^{-1}\), then
\[ \begin{aligned} \trace{ \cov{\betahat_K} \expect{\xv_\new \xv_\new^\trans} } \approx{}& \trace{ \begin{pmatrix} \frac{\sigma^2}{N} (\expect{\xv_{K} \xv_{K}^\trans})^{-1} & \zerov_{K(P-K)} \\ \zerov_{(P-K)K} & \zerov_{(P-K)(P-K)} \end{pmatrix} \begin{pmatrix} \expect{\xv_K \xv_K^\trans} & \expect{\xv_K \xv_{P-K}^\trans} \\ \expect{\xv_{P-K} \xv_K^\trans} & \expect{\xv_{P-K} \xv_{P-K}^\trans} \end{pmatrix} } \\={}& \trace{ \begin{pmatrix} \frac{\sigma^2}{N} \id_K & \frac{\sigma^2}{N} (\expect{\xv_{K} \xv_{K}^\trans})^{-1} \expect{\xv_K \xv_{P-K}^\trans} \\ \zerov_{(P-K)K} & \zerov_{(P-K)(P-K)} \end{pmatrix} } \\={}& \frac{K \sigma^2}{N}. \end{aligned} \]
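A quick Monte Carlo sanity check of the \(K\sigma^2/N\) formula, under an idealized setup of our own construction (standard normal regressors so that \(\expect{\xv_\new \xv_\new^\trans} = \id_P\), and \(\betav = \zerov\) so that only the variance term is in play):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K, sigma = 100, 10, 4, 1.0
beta = np.zeros(P)          # no signal, so the variance term is isolated
reps = 2000

fits = np.zeros((reps, K))
for r in range(reps):
    X = rng.standard_normal((N, P))            # E[x x^T] = identity
    y = X @ beta + sigma * rng.standard_normal(N)
    fits[r] = np.linalg.lstsq(X[:, :K], y, rcond=None)[0]

# Variance term: trace(Cov(betahat_K) E[x x^T]) = trace(Cov(betahat_K)) here.
var_term = np.trace(np.cov(fits, rowvar=False))
print(var_term, K * sigma**2 / N)   # roughly agree; the empirical value is slightly larger
```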
Further, notice that at \(K=0\) the variance term is zero, and the squared bias term is simply \(\trace{\betav \betav^\trans \expect{\xv_\new \xv_\new^\trans}} = \expect{(\betav^\trans \xv_\new)^2}\). Consequently, we might expect a “sweet spot” where the bias has decreased some but the variance has not yet increased too much.
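To see the sweet spot numerically, a sketch along the following lines (the polynomial data-generating process is ours, not part of the notes) fits models of increasing \(K\) and evaluates them on a large held-out set that stands in for the true MSE:

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_features(x, K):
    return np.column_stack([x ** k for k in range(K + 1)])

def simulate(n, rng):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal(n)
    return x, y

x_train, y_train = simulate(20, rng)
x_test, y_test = simulate(10_000, rng)   # large test set approximates the true MSE

for K in range(0, 16):
    betahat, *_ = np.linalg.lstsq(poly_features(x_train, K), y_train, rcond=None)
    test_mse = np.mean((y_test - poly_features(x_test, K) @ betahat) ** 2)
    print(K, round(test_mse, 3))   # typically falls at first, then rises for large K
```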
Cross-validation and model selection criteria
We might first think to use our F-test to test the null hypothesis that the last \(P - K\) elements of \(\beta\) are zero. However, the problem is that these hypothesis tests, taken over different values of \(K\), are not independent of one another. (Why?) Choosing the \(K\) that minimizes the \(F\) statistic uses the data twice, in a way that invalidates the hypothesis test.
Instead, we can form estimates of the MSE using held-out data. Note that the in-sample error underestimates the true MSE, simply because we find \(\betahat\) to minimize it:
\[ \meann (\y_n - \xv_n^\trans \betavhat)^2 \le \meann (\y_n - \xv_n^\trans \betav)^2. \]
In fact,
\[ \begin{aligned} \meann (\y_n - \xv_n^\trans \betavhat)^2 ={}& \frac{1}{N} (\Y - \X\betavhat)^\trans (\Y - \X\betavhat) \\={}& \frac{1}{N} (\Y - \X \betav + \X (\betav - \betavhat))^\trans (\Y - \X \betav + \X(\betav - \betavhat)) \\={}& \frac{1}{N} \norm{\Y - \X \betav}^2 + \frac{1}{N} (\betav - \betavhat)^\trans \X^\trans \X (\betav - \betavhat) + \frac{1}{N} 2 (\Y - \X \betav)^\trans \X (\betav - \betavhat) \\={}& \frac{1}{N} \norm{\resv}^2 + \frac{1}{N} \resv^\trans \X (\X^\trans \X)^{-1} \X^\trans \X (\X^\trans \X)^{-1} \X^\trans \resv - 2\frac{1}{N} \resv^\trans \X (\X^\trans \X)^{-1} \X^\trans \resv \\={}& \frac{1}{N} \norm{\resv}^2 - \frac{1}{N} \norm{(\X^\trans \X)^{-1/2} \X^\trans \resv}^2 \\\le{}& \frac{1}{N} \norm{\resv}^2 \approx \textrm{MSE}. \end{aligned} \]
Note that the under-estimation is precisely to the extent that the columns of \(\X\) are able to align with the true unobserved residual, \(\resv\).
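A direct check of this underestimation, in a simulated linear model of our own construction: by the displayed inequality, the in-sample error at \(\betavhat\) is never larger than the in-sample error at the true \(\betav\), which in turn is roughly the irreducible error.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, sigma = 50, 10, 1.0
beta = rng.standard_normal(P)

X = rng.standard_normal((N, P))
y = X @ beta + sigma * rng.standard_normal(N)
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)

in_sample = np.mean((y - X @ betahat) ** 2)   # left-hand side of the inequality
at_truth  = np.mean((y - X @ beta) ** 2)      # right-hand side, roughly sigma^2
print(in_sample, at_truth)                    # in_sample is systematically smaller
```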
There are a number of criteria that attempt to correct the in-sample error for overfitting (AIC, BIC, adjusted \(R^2\)), but their justifications hold only approximately and asymptotically. Instead, one can approximate the MSE directly by holding out some data.
In particular, if we leave out a single datapoint, and denote by \(\betavhat_{-n}\) the fit without the \(n\)–th datapoint, then the error on the left-out point is approximately unbiased for MSE, since \[ \begin{aligned} \textrm{MSE}_{-n} ={}& (\y_n - \xv_n^\trans \betavhat_{-n})^2 \\ \expect{\textrm{MSE}_{-n} \vert \y_n, \xv_n} ={}& \y_n^2 - 2 \y_n \xv_n^\trans \expect{\betavhat_{-n}} + \xv_n^\trans \expect{\betavhat_{-n} \betavhat_{-n}^\trans} \xv_n \\ \expect{\textrm{MSE}_{-n}} ={}& \expect{\y_n^2} - 2 \expect{\y_n \xv_n^\trans} \expect{\betavhat_{-n}} + \trace{\expect{\betavhat_{-n} \betavhat_{-n}^\trans} \expect{\xv_n \xv_n^\trans} } \\={}& \expect{\y_\new^2} - 2 \expect{\y_\new \xv_\new^\trans} \expect{\betavhat_{-n}} + \trace{\expect{\betavhat_{-n} \betavhat_{-n}^\trans} \expect{\xv_\new \xv_\new^\trans} } \\={}& \expect{(\y_\new - \xv_\new^\trans \betavhat_{-n})^2}, \end{aligned} \]
where the final line follows from the fact that \((\x_n,\y_n)\) and \((\x_\new,\y_\new)\) have the same distribution, and that \(\betavhat_{-n}\) is independent of both pairs. The only bias occurs from the fact that \(\betavhat_{-n}\) has a different distribution from \(\betavhat\) on the full data (in particular, it has a slightly larger variance). It follows that we can estimate the MSE using
\[ \widehat{\textrm{MSE}}_{\textrm{LOO}} := \meann \textrm{MSE}_{-n}, \]
the “leave-one-out” cross-validation estimator.
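A naive implementation of this estimator simply refits the regression \(N\) times, once per left-out point (the exact, refit-free leave-one-out formula for linear regression listed in the goals avoids this). The sketch below, including the helper name, the simulated data, and the use of LOO to choose \(K\) among nested models, is our own illustration.

```python
import numpy as np

def loo_mse(X, y):
    """Leave-one-out CV estimate of the MSE by refitting N times."""
    N = X.shape[0]
    errs = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        betahat_minus_n, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[n] = (y[n] - X[n] @ betahat_minus_n) ** 2
    return errs.mean()

# Example: use LOO CV to pick the truncation level K among nested models.
rng = np.random.default_rng(3)
N, P = 60, 12
X_full = rng.standard_normal((N, P))
beta = np.concatenate([np.array([2.0, -1.0, 0.5]), np.zeros(P - 3)])
y = X_full @ beta + rng.standard_normal(N)

for K in range(1, P + 1):
    print(K, round(loo_mse(X_full[:, :K], y), 3))   # pick the K with the smallest value
```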