Variable selection and the F-test
Goals
- Derive tests for multiple coefficients at once under normality
- The F-test and F-distribution
- As a special case test whether a group of variables should be included in the model
- Introduce \(R^2\), discuss its limitations, and show how the F-test approximately corrects for them
Testing multiple coefficients at once
So far we have devised tests for one coefficient at a time, or, more precisely, for one linear combination of the coefficients at a time. We derived these tests under normality, and asserted that they made sense under homoskedasticity for large \(N\).
Suppose we want to test multiple coefficients at a time. A particular use case we will work up to is asking whether our regression did anything at all. In particular, we might ask whether the fit \(\yhat = \X \betahat\) is any better than the fit \(\yhat = 0\) that did not run a regression at all.
(Note that, by the FWL theorem, we can equivalently compare \(\yhat = \X \betahat\) with \(\yhat = \ybar\), simply by centering everything first.)
The F-test
Formally, we’ll be testing the null hypothesis that
\[ \textrm{Null hypothesis }H_0: \A \beta = \av \]
where \(\A\) is a \(Q \times P\) matrix, and \(\av\) is a \(Q\)–vector. As a special case, we will test our “do nothing” hypothesis by taking \(\A = \id\) and \(\av=\zerov\).
Our trick will be the same as always: transform into known distributions. Specifically, under \(H_0\) and the normal assumptions, we know that
\[ \A \betahat \sim \gauss{\A \beta, \A (\X^\trans \X)^{-1} \A^\trans \sigma^2} = \gauss{\av, \A (\X^\trans \X)^{-1} \A^\trans \sigma^2}. \]
It follows that
\[ \zv := \sigma^{-1} \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1/2} \left(\A \betahat - \av \right) \sim \gauss{\zerov, \id}, \]
which is a \(Q\)–dimensional standard normal. Now, to test \(H_0\), we need a single test statistic, and this is a \(Q\)–vector. We can get a single number by taking the squared norm, which has a \(\chisq{Q}\) distribution:
\[ \zeta := \zv^\trans \zv = \sigma^{-2} \left(\A \betahat - \av \right)^\trans \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1} \left(\A \betahat - \av \right) \sim \chisq{Q}. \]
If \(\zeta\) is large, it’s evidence against \(H_0\). However, we still don’t know \(\sigma^2\). So we can do our usual trick of normalizing by \(\sigmahat^2\), which is independent of \(\zeta\) under normality, giving our final computable statistic:
\[ \phi := \frac{1}{Q} \frac{\left(\A \betahat - \av \right)^\trans \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1} \left(\A \betahat - \av \right)}{\sigmahat^2} = \frac{\s_1 / Q}{\s_2 / (N-P)} \sim \fdist{Q}{N-P}. \]
where \(\s_1 \sim \chisq{Q}\) and \(\s_2 \sim \chisq{N-P}\) are independent chi–squared statistics. This quantity is called an “F-statistic,” its distribution is known as the “F-distribution,” and its quantiles (and other properties) are computable in R using qf (etc).
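To make this concrete, here is a minimal R sketch of the computation on simulated data; the data, the particular hypothesis, and all variable names are our own illustrative choices, not part of any package.

```r
# A minimal sketch of the F-test for H0: A beta = c on simulated data.
set.seed(42)
n <- 100; p <- 3; q <- 2
x <- matrix(rnorm(n * p), n, p)
beta_true <- c(0.5, 0, 0)
y <- x %*% beta_true + rnorm(n)

# OLS fit and the variance estimate sigmahat^2 = RSS / (n - p).
xtx_inv <- solve(t(x) %*% x)
betahat <- xtx_inv %*% t(x) %*% y
sigmahat2 <- sum((y - x %*% betahat)^2) / (n - p)

# H0: beta_2 = 0 and beta_3 = 0, written as A beta = c.
a_mat <- rbind(c(0, 1, 0),
               c(0, 0, 1))
c_vec <- c(0, 0)

diff <- a_mat %*% betahat - c_vec
phi <- drop(t(diff) %*% solve(a_mat %*% xtx_inv %*% t(a_mat)) %*% diff) / (q * sigmahat2)

phi
qf(0.95, df1 = q, df2 = n - p)                      # 5% rejection threshold
pf(phi, df1 = q, df2 = n - p, lower.tail = FALSE)   # p-value
```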
Let \(\s_1 \sim \chisq{K_1}\) and \(\s_2 \sim \chisq{K_2}\), independently of one another. Then the distribution of \[ \phi := \frac{\s_1 / K_1}{\s_2 / K_2} \]
is called an “F distribution with \(K_1\) and \(K_2\) degrees of freedom.” We write \(\phi \sim \fdist{K_1}{K_2}\). As long as \(K_2 > 2\), \(\expect{\phi} = \frac{K_2}{K_2 - 2}\).
For large \(K_2\), show that the mean of the F distribution is approximately \(1\).
For large \(K_2\), show that the variance of the F distribution is approximately \(2 / K_1\). (Hint: use the definition of a \(\chisq{K_1}\) as the sum of squares of independent standard normals.)
Given our particular \(H_0\), argue why it only makes sense to reject for large values of our F–statistic.
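These limiting values are easy to check by simulation. The following R sketch, with arbitrary choices of \(K_1\) and \(K_2\), compares the sample mean and variance of F draws to the approximations above; it is a sanity check, not a proof.

```r
# Numerical sanity check of the large-K2 behavior of the F distribution.
set.seed(1)
k1 <- 5
k2 <- 1000
draws <- rf(1e5, df1 = k1, df2 = k2)
mean(draws)   # close to 1 (exactly k2 / (k2 - 2) = 1.002)
var(draws)    # close to 2 / k1 = 0.4
```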
Total regression as a special case
What do we get when we apply the F-test to \(\beta = \zerov\)? The statistic is
\[ \begin{aligned} \zeta :={}& \sigma^{-2} \betahat^\trans \X^\trans \X \betahat \\ \phi :={}& \frac{\frac{1}{P}\betahat^\trans \X^\trans \X \betahat}{\sigmahat^2} \sim \fdist{P}{N-P}. \end{aligned} \]
(Note that this would commonly be done after centering, so that by the FWL theorem we are not trying to evaluate whether a constant should be included. See, for example, the output of the lm function in R.)
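To see the correspondence with lm, the sketch below computes this statistic by hand on centered, simulated data and compares it to the F-statistic that summary reports; the data and variable names are our own.

```r
# The "did the regression do anything" F-statistic, by hand and via lm.
set.seed(2)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.3 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
summary(fit)$fstatistic   # value, numdf, dendf

# By hand, after centering (so the constant is projected out, per FWL).
x <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
yc <- y - mean(y)
p <- ncol(x)
betahat <- solve(t(x) %*% x, t(x) %*% yc)
yhat <- x %*% betahat
rss <- sum((yc - yhat)^2)
ess <- sum(yhat^2)
# lm's residual df is n - 1 - p because it also estimates the intercept
# that we centered away.
phi <- (ess / p) / (rss / (n - 1 - p))
phi
```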
This can be re-written in terms of some interpretable quantities. Let’s define
\[ \begin{aligned} TSS :={}& \Y^\trans \Y = \sumn \y_n^2 & \textrm{"total sum of squares"}\\ ESS :={}& \Yhat^\trans \Yhat = \sumn \yhat_n^2 & \textrm{"explained sum of squares"}\\ RSS :={}& \resvhat^\trans \resvhat = \sumn \reshat_n^2 & \textrm{"residual sum of squares"}. \end{aligned} \]
Prove that \(TSS = RSS + ESS\).
A commonly used measure of “fit” is
\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}. \]
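For example, the following R sketch (simulated data, our own names) computes \(R^2\) from the residual and total sums of squares and checks it against the value lm reports; consistent with the centering remark above, the total sum of squares is taken around \(\ybar\).

```r
# R^2 from the sums of squares, matching lm's reported value.
set.seed(3)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2 + 0.5 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
yc <- y - mean(y)                  # center, per the note above
tss <- sum(yc^2)
rss <- sum(residuals(fit)^2)
1 - rss / tss
summary(fit)$r.squared             # same number
```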
- What is \(R^2\) when we include no regressors?
- What is \(R^2\) when we include \(N\) linearly independent regressors?
- Can \(R^2\) ever decrease when we add a regressor?
- Can \(R^2\) ever stay the same when we add a regressor?
We might look at all our regressors and choose the set that gives the highest \(R^2\). But this does not make sense, since it would always choose to include every linearly independent regressor.
Our F-test statistic behaves better, though.
\[ \phi = \frac{N-P}{P} \frac{\Yhat^\trans \Yhat}{\resvhat^\trans \resvhat} = \frac{N-P}{P} \frac{ESS}{RSS} = \frac{N-P}{P} \left(\frac{TSS}{RSS} - 1\right) \]
Although it does not count as a formal hypothesis test, it makes sense informally that “good” regressors have large \(\phi\), i.e., small \(1 / \phi\). We can re-write
\[ \frac{1}{\phi} = \frac{P}{N - P}\frac{RSS}{ESS} = \frac{P}{N - P}\frac{TSS / TSS - ESS / TSS}{ESS / TSS} = \frac{P}{N - P}\frac{1 - R^2}{R^2} \approx \frac{P}{N} (1 - R^2) \quad\textrm{for }N \gg P\textrm{ and }R^2 \approx 1. \]
To make \(\phi\) large, we want to make \(P(1 - R^2)\) small. When you add a regressor, \(P \rightarrow P + 1\), and \(1 - R^2\) decreases to the degree that the added regressor helps explain the current residuals. If \(1 - R^2\) doesn’t decrease at all, that is, if the added regressor doesn’t improve the fit, then \(P\) increases, \(1 / \phi\) increases, and \(\phi\) decreases, making the F-test less likely to reject.
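The following R sketch illustrates this trade-off on simulated data of our own construction: a regressor of pure noise nudges \(R^2\) up, while the overall F-statistic typically goes down.

```r
# Adding a pure-noise regressor: R^2 does not decrease,
# but the overall F-statistic usually drops.
set.seed(4)
n <- 50
x1 <- rnorm(n)
x_noise <- rnorm(n)          # unrelated to y by construction
y <- 0.4 * x1 + rnorm(n)

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x_noise)

c(summary(fit1)$r.squared, summary(fit2)$r.squared)       # R^2 creeps up
c(summary(fit1)$fstatistic["value"],
  summary(fit2)$fstatistic["value"])                      # F typically falls
```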
Next time, we’ll study better ways to make the decision of what regressors to include.