Variable selection and the F-test
Goals
- Derive tests for multiple coefficients at once under normality
- The F-test and F-distribution
- As a special case test whether a group of variables should be included in the model
- Review \(R^2\), discuss its limitations, and see how the F-test approximately corrects them
Testing multiple coefficients at once
So far we have devised tests for one coefficient at a time — or, more precisely, for one linear combination of the coefficients at a time. We derived tests under normality, and asserted that they made sense under homoskedasticity for large \(N\).
Suppose we want to test multiple coefficients at a time. A particular use case we will work up to is asking whether our regression did anything at all. In particular, we might ask whether the fit \(\yhat = \X \betahat\) is any better than the fit \(\yhat = 0\) that did not run a regression at all.
(Note that, by the FWL theorem, we can equivalently compare \(\yhat = \X \betahat\) with \(\yhat = \ybar\), simply by centering everything first.)
The F-test
Formally, we’ll be testing the null hypothesis that
\[ \textrm{Null hypothesis }H_0: \A \beta = \av \]
where \(\A\) is a \(Q \times P\) matrix, and \(\av\) is a \(Q\)–vector. As a special case, we will test our “do nothing” hypothesis by taking \(\A = \id\) and \(\av=\zerov\).
Our trick will be the same as always: transform into known distributions. Specifically, under \(H_0\) and the normal assumptions, we know that
\[ \A \betahat \sim \gauss{\A \beta, \A (\X^\trans \X)^{-1} \A^\trans \sigma^2} = \gauss{\av, \A (\X^\trans \X)^{-1} \A^\trans \sigma^2}. \]
It follows that
\[ \zv := \sigma^{-1} \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1/2} \left(\A \betahat - \av \right) \sim \gauss{\zerov, \id}, \]
which is a \(Q\)–dimensional standard normal. Now, to test \(H_0\), we need a single test statistic, but \(\zv\) is a \(Q\)–vector. We can get a single number by taking its squared norm, which has a \(\chisq{Q}\) distribution:
\[ \zeta := \zv^\trans \zv = \sigma^{-2} \left(\A \betahat - \av \right)^\trans \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1} \left(\A \betahat - \av \right) \sim \chisq{Q}. \]
If \(\zeta\) is large, it’s evidence against \(H_0\). However, we still don’t know \(\sigma^2\). So we can do our usual trick of normalizing by \(\sigmahat^2\), which is independent of \(\zeta\) under normality, giving our final computable statistic:
\[ \phi := \frac{1}{Q} \frac{\left(\A \betahat - \av \right)^\trans \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1} \left(\A \betahat - \av \right)}{\sigmahat^2} = \frac{\s_1 / Q}{\s_2 / (N-P)} \sim \fdist{Q}{N-P}, \]
where \(\s_1 := \zeta \sim \chisq{Q}\) and \(\s_2 := (N-P) \sigmahat^2 / \sigma^2 \sim \chisq{N-P}\) are independent chi–squared statistics. This quantity is called an “F-statistic,” its distribution is known as the “F-distribution,” and its quantiles (and other properties) are computable in R using qf (etc).
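To make the recipe concrete, here is a minimal R sketch (not from the original notes; the simulated data and variable names are illustrative) that computes the F-statistic for a hypothesis \(\A \beta = \av\) by hand and compares it with the \(\fdist{Q}{N-P}\) distribution via pf and qf.

```r
# Illustrative sketch: the F-statistic for H0: A beta = c, computed by hand
# on simulated data.  All names (n, p, a_mat, etc.) are made up for this example.
set.seed(42)
n <- 100; p <- 3
x <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(1, 0.5, 0)
y <- x %*% beta_true + rnorm(n)

betahat <- solve(t(x) %*% x, t(x) %*% y)
res <- y - x %*% betahat
sigmahat2 <- sum(res^2) / (n - p)

# H0: beta_2 = 0 and beta_3 = 0, so A is 2 x 3 and c = (0, 0)
a_mat <- rbind(c(0, 1, 0),
               c(0, 0, 1))
c_vec <- c(0, 0)
q <- nrow(a_mat)

d <- a_mat %*% betahat - c_vec
middle <- solve(a_mat %*% solve(t(x) %*% x) %*% t(a_mat))
f_stat <- as.numeric(t(d) %*% middle %*% d) / (q * sigmahat2)

# Compare with the F(q, n - p) distribution
pf(f_stat, df1 = q, df2 = n - p, lower.tail = FALSE)  # p-value
qf(0.95, df1 = q, df2 = n - p)                        # 5% critical value
```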
Let \(\s_1 \sim \chisq{K_1}\) and \(\s_2 \sim \chisq{K_2}\), independently of one another. Then the distribution of \[ \phi := \frac{\s_1 / K_1}{\s_2 / K_2} \]
is called an “F distribution with \(K_1\) and \(K_2\) degrees of freedom.” We write \(\phi \sim \fdist{K_1}{K_2}\). As long as \(K_2 > 2\), \(\expect{\phi} = \frac{K_2}{K_2 - 2}\).
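As a quick, purely illustrative check of this mean formula (the constants below are arbitrary), one can simulate the ratio of independent scaled chi-squared draws and compare with R’s built-in F-distribution functions.

```r
# Illustrative simulation: the F distribution as a ratio of scaled,
# independent chi-squared draws; its sample mean is close to K2 / (K2 - 2).
set.seed(123)
k1 <- 5; k2 <- 30
s1 <- rchisq(1e5, df = k1)
s2 <- rchisq(1e5, df = k2)
phi <- (s1 / k1) / (s2 / k2)

mean(phi)               # approximately k2 / (k2 - 2) = 30 / 28
mean(rf(1e5, k1, k2))   # the same, using R's built-in rf
```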
For large \(K_2\), show that the mean of the F distribution is approximately \(1\).
For large \(K_2\), show that the variance of the F distribution is approximately \(2 / K_1\). (Hint: use the definition of a \(\chisq{K_1}\) as the sum of squares of independent standard normals.)
Given our particular \(H_0\), argue why it only makes sense to reject for large values of our F–statistic.
Total regression as a special case
What do we get when we apply the F-test to \(\beta = \zerov\)? The statistic is
\[ \begin{aligned} \zeta :={}& \sigma^{-2} \betahat^\trans \X^\trans \X \betahat \\ \phi :={}& \frac{\frac{1}{P}\betahat^\trans \X^\trans \X \betahat}{\sigmahat^2} \sim \fdist{P}{N-P}. \end{aligned} \]
(Note that this would commonly be done after centering, so that by the FWL theorem we are not trying to evaluate whether a constant should be included. See, for example, the output of the lm function in R.)
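As an illustration (the simulated data and names below are assumptions, not from the notes), this formula reproduces the overall F-statistic reported by summary(lm(...)); since lm also fits an intercept, its residual degrees of freedom are \(N - P - 1\) when \(P\) counts only the centered regressors.

```r
# Illustrative check: the overall F-statistic computed from the formula above,
# after centering, matches the one reported by summary(lm(...)).
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.3 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# By hand, using centered y and regressors (P = 2; lm's intercept accounts
# for the extra degree of freedom, hence n - p - 1 below)
yc <- y - mean(y)
xc <- cbind(x1 - mean(x1), x2 - mean(x2))
p <- ncol(xc)
betahat <- solve(t(xc) %*% xc, t(xc) %*% yc)
yhat <- xc %*% betahat
ess <- sum(yhat^2)
rss <- sum((yc - yhat)^2)

c(by_hand = ((n - p - 1) / p) * ess / rss,
  from_lm = unname(summary(fit)$fstatistic["value"]))
```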
This can be re-written in terms of some interpretable quantities. Let’s define
\[ \begin{aligned} TSS :={}& \Y^\trans \Y = \sumn \y_n^2 & \textrm{"total sum of squares"}\\ ESS :={}& \Yhat^\trans \Yhat = \sumn \yhat_n^2 & \textrm{"explained sum of squares"}\\ RSS :={}& \resvhat^\trans \resvhat = \sumn \reshat_n^2 & \textrm{"residual sum of squares"}. \end{aligned} \]
Prove that \(TSS = RSS + ESS\).
A commonly used measure of “fit” is
\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}. \]
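For concreteness, here is a small illustrative check (simulated data; the names are mine) that the two expressions for \(R^2\) agree with each other and with the value reported by summary(lm(...)).

```r
# Illustrative sketch: R^2 computed from the sums of squares, two ways,
# compared with summary(lm(...))$r.squared.
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
tss <- sum((y - mean(y))^2)    # total sum of squares (after centering)
rss <- sum(residuals(fit)^2)   # residual sum of squares
ess <- tss - rss               # explained sum of squares

c(one_minus_rss = 1 - rss / tss,
  ess_over_tss  = ess / tss,
  from_lm       = summary(fit)$r.squared)
```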
- What is \(R^2\) when we include no regressors?
- What is \(R^2\) when we include \(N\) linearly independent regressors?
- Can \(R^2\) ever decrease when we add a regressor?
- Can \(R^2\) ever stay the same when we add a regressor?
We might look at all our regressors and choose the set that gives the highest \(R^2\). But this does not make sense, since it would always choose to include every linearly independent regressor.
Our F-test statistic behaves better, though. Note that if we are testing the null that \(\betav = \zerov\), then we can write the test statistic in terms of \(R^2\) and \(P / N\):
\[ \phi = \frac{N-P}{P} \frac{\Yhat^\trans \Yhat}{\resvhat^\trans \resvhat} = \frac{N-P}{P} \frac{ESS}{RSS} = \frac{1 - P / N}{P / N} \frac{R^2}{1 - R^2} = \frac{\frac{1}{P/N} - 1} {\frac{1}{R^2} - 1}. \]
As \(P\), the number of included regressors, goes from \(0\) to \(N\), both \(P/N\) and \(R^2\) increase monotonically from \(0\) to \(1\). Therefore both \(\frac{1}{P/N} - 1\) and \(\frac{1}{R^2} - 1\) decrease monotonically from \(\infty\) to \(0\). In particular,
- Increasing \(P\) increases \(P/N\) and decreases \(\frac{1}{P/N} - 1\), which on its own pushes \(\phi\) down, and
- Increasing \(P\) increases \(R^2\) and decreases \(\frac{1}{R^2} - 1\), which on its own pushes \(\phi\) up.
But they may do so at different rates! Adding a regressor increases \(P/N\) by exactly \(1/N\), but it may increase \(R^2\) by a lot (if the regressor is highly explanatory) or hardly at all (if it does little to explain the residuals from the previous fit). Roughly speaking, adding a “good” regressor increases the F-test statistic and adding a “bad” regressor decreases it.
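To see this numerically, here is an illustrative simulation (an assumed example, not from the notes) in which adding an informative regressor gives a large overall F-statistic while adding a pure-noise regressor lowers it.

```r
# Illustrative simulation: a "good" regressor raises the overall F-statistic,
# while a "bad" (pure noise) regressor typically lowers it.
set.seed(3)
n <- 100
x_good <- rnorm(n)
x_bad <- rnorm(n)              # unrelated to y
y <- 1 + 0.8 * x_good + rnorm(n)

f_of <- function(fit) unname(summary(fit)$fstatistic["value"])

f_of(lm(y ~ x_good))           # large
f_of(lm(y ~ x_good + x_bad))   # typically smaller: x_bad barely raises R^2
```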
Forward stepwise regression
One might be tempted to perform the following procedure:
- Start with no regressors
- One at a time, add the regressor that increases the F-statistic the most
- Stop when the F-statistic does not increase with the next regressor
Although sensible, this procedure no longer has any coverage guarantees, since the set of included variables is chosen using the same data that produces the F-statistics, so the statistics are not independent of the selection. In order to choose where to stop, we therefore need a different criterion: an estimate of the predictive error, which we will talk about next time.
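Here is a rough R sketch of the procedure (illustrative only: the simulated data, names, and the exact stopping rule are assumptions, and as noted above the resulting F-statistics should not be given their nominal interpretation).

```r
# Rough sketch of forward stepwise selection using the overall F-statistic.
set.seed(4)
n <- 100
x_all <- matrix(rnorm(n * 5), n, 5)
colnames(x_all) <- paste0("x", 1:5)
y <- 1 + 0.8 * x_all[, 1] + 0.4 * x_all[, 2] + rnorm(n)
dat <- data.frame(y, x_all)

# Overall F-statistic of the model with the given regressors
f_of <- function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = dat)
  unname(summary(fit)$fstatistic["value"])
}

included <- character(0)
candidates <- colnames(x_all)
best_f <- -Inf
repeat {
  # F-statistic for each one-variable extension of the current model
  f_vals <- sapply(candidates, function(v) f_of(c(included, v)))
  if (length(f_vals) == 0 || max(f_vals) <= best_f) break  # no improvement: stop
  best_f <- max(f_vals)
  best <- names(which.max(f_vals))
  included <- c(included, best)
  candidates <- setdiff(candidates, best)
}
included  # selected regressors, in the order they were added
```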