Variable selection and the F-test
Goals
- Derive tests for multiple coefficients at once under normality
- The F-test and F-distribution
- As a special case test whether a group of variables should be included in the model
- Review \(R^2\), discuss its limitations, and see how the F-test approximately corrects them
Testing multiple coefficients at once
So far we have devised tests for one coefficient at a time — or, more precisely, for one linear combination of the coefficients at a time. We derived tests under normality, and asserted that they made sense under homoskedasticity for large \(N\).
Suppose we want to test multiple coefficients at a time. A particular use case we will work up to is asking whether our regression did anything at all. In particular, we might ask whether the fit \(\yhat = \X \betahat\) is any better than the fit \(\yhat = 0\) that did not run a regression at all.
(Note that, by the FWL theorem, we can equivalently compare \(\yhat = \X \betahat\) with \(\yhat = \ybar\), simply by centering everything first.)
The F-test
Formally, we’ll be testing the null hypothesis that
\[ \textrm{Null hypothesis }H_0: \A \beta = \av \]
where \(\A\) is a \(Q \times P\) matrix, and \(\av\) is a \(Q\)–vector. As a special case, we will test our “do nothing” hypothesis by taking \(\A = \id\) and \(\av=\zerov\).
Our trick will be the same as always: transform into known distributions. Specifically, under \(H_0\) and the normal assumptions, we know that
\[ \A \betahat \sim \gauss{\A \beta, \A (\X^\trans \X)^{-1} \A^\trans \sigma^2} = \gauss{\av, \A (\X^\trans \X)^{-1} \A^\trans \sigma^2}. \]
It follows that
\[ \zv := \sigma^{-1} \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1/2} \left(\A \betahat - \av \right) \sim \gauss{\zerov, \id}, \]
which is a \(Q\)–dimensional standard normal. Now, to test \(H_0\), we need a single test statistic, but \(\zv\) is a \(Q\)–vector. We can get a single number by taking its squared norm, which has a \(\chisq{Q}\) distribution:
\[ \zeta := \zv^\trans \zv = \sigma^{-2} \left(\A \betahat - \av \right)^\trans \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1} \left(\A \betahat - \av \right) \sim \chisq{Q}. \]
If \(\zeta\) is large, it’s evidence against \(H_0\). However, we still don’t know \(\sigma^2\). So we can do our usual trick of normalizing by \(\sigmahat^2\), which is independent of \(\zeta\) under normality, giving our final computable statistic:
\[ \phi := \frac{1}{Q} \frac{\left(\A \betahat - \av \right)^\trans \left(\A (\X^\trans \X)^{-1} \A^\trans \right)^{-1} \left(\A \betahat - \av \right)}{\sigmahat^2} = \frac{\s_1 / Q}{\s_2 / (N-P)} \sim \fdist{Q}{N-P}, \]
where \(\s_1 := \zeta \sim \chisq{Q}\) and \(\s_2 := (N-P) \sigmahat^2 / \sigma^2 \sim \chisq{N-P}\) are independent chi–squared statistics. This quantity is called an “F-statistic,” its distribution is known as the “F-distribution,” and its quantiles (and other properties) are computable in R using qf (etc).
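To make the recipe concrete, here is a minimal R sketch (not from the original notes; the simulated data and variable names are illustrative) that computes the F-statistic for a hypothesis \(\A \beta = \av\) by hand and compares it with the \(\fdist{Q}{N-P}\) distribution via pf and qf.

```r
# Illustrative sketch: the F-statistic for H0: A beta = c, computed by hand
# on simulated data.  All names (n, p, a_mat, etc.) are made up for this example.
set.seed(42)
n <- 100; p <- 3
x <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(1, 0.5, 0)
y <- x %*% beta_true + rnorm(n)

betahat <- solve(t(x) %*% x, t(x) %*% y)
res <- y - x %*% betahat
sigmahat2 <- sum(res^2) / (n - p)

# H0: beta_2 = 0 and beta_3 = 0, so A is 2 x 3 and c = (0, 0)
a_mat <- rbind(c(0, 1, 0),
               c(0, 0, 1))
c_vec <- c(0, 0)
q <- nrow(a_mat)

d <- a_mat %*% betahat - c_vec
middle <- solve(a_mat %*% solve(t(x) %*% x) %*% t(a_mat))
f_stat <- as.numeric(t(d) %*% middle %*% d) / (q * sigmahat2)

# Compare with the F(q, n - p) distribution
pf(f_stat, df1 = q, df2 = n - p, lower.tail = FALSE)  # p-value
qf(0.95, df1 = q, df2 = n - p)                        # 5% critical value
```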
Let \(\s_1 \sim \chisq{K_1}\) and \(\s_2 \sim \chisq{K_2}\), independently of one another. Then the distribution of \[ \phi := \frac{\s_1 / K_1}{\s_2 / K_2} \]
is called an “F distribution with \(K_1\) and \(K_2\) degrees of freedom.” We write \(\phi \sim \fdist{K_1}{K_2}\). As long as \(K_2 > 2\), \(\expect{\phi} = \frac{K_2}{K_2 - 2}\).
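As a quick, purely illustrative check of this mean formula (the constants below are arbitrary), one can simulate the ratio of independent scaled chi-squared draws and compare with R’s built-in F-distribution functions.

```r
# Illustrative simulation: the F distribution as a ratio of scaled,
# independent chi-squared draws; its sample mean is close to K2 / (K2 - 2).
set.seed(123)
k1 <- 5; k2 <- 30
s1 <- rchisq(1e5, df = k1)
s2 <- rchisq(1e5, df = k2)
phi <- (s1 / k1) / (s2 / k2)

mean(phi)               # approximately k2 / (k2 - 2) = 30 / 28
mean(rf(1e5, k1, k2))   # the same, using R's built-in rf
```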
For large \(K_2\), show that the mean of the F distribution is approximately \(1\).
For large \(K_2\), show that the variance of the F distribution is approximately \(2 / K_1\). (Hint: use the definition of a \(\chisq{K_1}\) as the sum of squares of independent standard normals.)
Given our particular \(H_0\), argue why it only makes sense to reject for large values of our F–statistic.
Total regression as a special case
What do we get when we apply the F-test to \(\beta = \zerov\)? The statistic is
\[ \begin{aligned} \zeta :={}& \sigma^{-2} \betahat^\trans \X^\trans \X \betahat \\ \phi :={}& \frac{\frac{1}{P}\betahat^\trans \X^\trans \X \betahat}{\sigmahat^2} \sim \fdist{P}{N-P}. \end{aligned} \]
(Note that this would commonly be done after centering, so that by the FWL theorem we are not trying to evaluate whether a constant should be included. See, for example, the output of the lm function in R.)
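As an illustration (the simulated data and names below are assumptions, not from the notes), this formula reproduces the overall F-statistic reported by summary(lm(...)); since lm also fits an intercept, its residual degrees of freedom are \(N - P - 1\) when \(P\) counts only the centered regressors.

```r
# Illustrative check: the overall F-statistic computed from the formula above,
# after centering, matches the one reported by summary(lm(...)).
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.3 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# By hand, using centered y and regressors (P = 2; lm's intercept accounts
# for the extra degree of freedom, hence n - p - 1 below)
yc <- y - mean(y)
xc <- cbind(x1 - mean(x1), x2 - mean(x2))
p <- ncol(xc)
betahat <- solve(t(xc) %*% xc, t(xc) %*% yc)
yhat <- xc %*% betahat
ess <- sum(yhat^2)
rss <- sum((yc - yhat)^2)

c(by_hand = ((n - p - 1) / p) * ess / rss,
  from_lm = unname(summary(fit)$fstatistic["value"]))
```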
This can be re-written in terms of some interpretable quantities. Let’s define
\[ \begin{aligned} TSS :={}& \Y^\trans \Y = \sumn \y_n^2 & \textrm{"total sum of squares"}\\ ESS :={}& \Yhat^\trans \Yhat = \sumn \yhat_n^2 & \textrm{"explained sum of squares"}\\ RSS :={}& \resvhat^\trans \resvhat = \sumn \reshat_n^2 & \textrm{"residual sum of squares"}. \end{aligned} \]
Prove that \(TSS = RSS + ESS\).
A commonly used measure of “fit” is
\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}. \]
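For concreteness, here is a small illustrative check (simulated data; the names are mine) that the two expressions for \(R^2\) agree with each other and with the value reported by summary(lm(...)).

```r
# Illustrative sketch: R^2 computed from the sums of squares, two ways,
# compared with summary(lm(...))$r.squared.
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
tss <- sum((y - mean(y))^2)    # total sum of squares (after centering)
rss <- sum(residuals(fit)^2)   # residual sum of squares
ess <- tss - rss               # explained sum of squares

c(one_minus_rss = 1 - rss / tss,
  ess_over_tss  = ess / tss,
  from_lm       = summary(fit)$r.squared)
```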
- What is \(R^2\) when we include no regressors?
- What is \(R^2\) when we include \(N\) linearly independent regressors?
- Can \(R^2\) ever decrease when we add a regressor?
- Can \(R^2\) ever stay the same when we add a regressor?
We might look at all our regressors and choose the set that gives the highest \(R^2\). But this does not make sense, since it would always choose to include every linearly independent regressor.
Our F-test statistic behaves better, though. Note that if we are testing the null that \(\betav = \zerov\), then we can write the test statistic in terms of \(R^2\) and \(P / N\):
\[ \phi = \frac{N-P}{P} \frac{\Yhat^\trans \Yhat}{\resvhat^\trans \resvhat} = \frac{N-P}{P} \frac{ESS}{RSS} = \frac{1 - P / N}{P / N} \frac{R^2}{1 - R^2} = \frac{\frac{1}{P/N} - 1} {\frac{1}{R^2} - 1}. \]
As \(P\), the number of included regressors, goes from \(0\) to \(N\), both \(P/N\) and \(R^2\) increase monotonically from \(0\) to \(1\). Therefore both \(\frac{1}{P/N} - 1\) and \(\frac{1}{R^2} - 1\) decrease monotonically from \(\infty\) to \(0\). In particular,
- Increasing \(P\) increases \(P/N\) and decreases \(\frac{1}{P/N} - 1\), which on its own pushes \(\phi\) down, and
- Increasing \(P\) increases \(R^2\) and decreases \(\frac{1}{R^2} - 1\), which on its own pushes \(\phi\) up.
But they may do so at different rates! Adding a regressor increases \(P/N\) by exactly \(1/N\), but it may increase \(R^2\) by a lot (if the regressor is highly explanatory) or hardly at all (if it does little to explain the residuals from the previous fit). Roughly speaking, adding a “good” regressor increases the F-test statistic and adding a “bad” regressor decreases it.
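To see this numerically, here is an illustrative simulation (an assumed example, not from the notes) in which adding an informative regressor gives a large overall F-statistic while adding a pure-noise regressor lowers it.

```r
# Illustrative simulation: a "good" regressor raises the overall F-statistic,
# while a "bad" (pure noise) regressor typically lowers it.
set.seed(3)
n <- 100
x_good <- rnorm(n)
x_bad <- rnorm(n)              # unrelated to y
y <- 1 + 0.8 * x_good + rnorm(n)

f_of <- function(fit) unname(summary(fit)$fstatistic["value"])

f_of(lm(y ~ x_good))           # large
f_of(lm(y ~ x_good + x_bad))   # typically smaller: x_bad barely raises R^2
```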
Forward stepwise regression
One might be tempted to perform the following procedure:
- Start with no regressors
- One at a time, add the regressor that increases the F-statistic the most
- Stop when the F-statistic does not increase with the next regressor
Although sensible, this procedure no longer has any coverage guarantees, since the set of included variables is chosen using the same data that produces the F-statistics, so the statistics are not independent of the selection. In order to choose where to stop, we therefore need a different criterion: an estimate of the predictive error, which we will talk about next time.
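Here is a rough R sketch of the procedure (illustrative only: the simulated data, names, and the exact stopping rule are assumptions, and as noted above the resulting F-statistics should not be given their nominal interpretation).

```r
# Rough sketch of forward stepwise selection using the overall F-statistic.
set.seed(4)
n <- 100
x_all <- matrix(rnorm(n * 5), n, 5)
colnames(x_all) <- paste0("x", 1:5)
y <- 1 + 0.8 * x_all[, 1] + 0.4 * x_all[, 2] + rnorm(n)
dat <- data.frame(y, x_all)

# Overall F-statistic of the model with the given regressors
f_of <- function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = dat)
  unname(summary(fit)$fstatistic["value"])
}

included <- character(0)
candidates <- colnames(x_all)
best_f <- -Inf
repeat {
  # F-statistic for each one-variable extension of the current model
  f_vals <- sapply(candidates, function(v) f_of(c(included, v)))
  if (length(f_vals) == 0 || max(f_vals) <= best_f) break  # no improvement: stop
  best_f <- max(f_vals)
  best <- names(which.max(f_vals))
  included <- c(included, best)
  candidates <- setdiff(candidates, best)
}
included  # selected regressors, in the order they were added
```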