Variable selection and the F-test
Goals
- Derive tests for multiple coefficients at once under normality
- The F-test and F-distribution
- As a special case test whether a group of variables should be included in the model
- Review \(R^2\), discuss its limitations, and show how the F-test approximately corrects them
Testing multiple coefficients at once
So far we have devised tests for one coefficient at a time, or, more precisely, for one linear combination of coefficients at a time. We derived these tests under normality, and asserted that they remain sensible under homoskedasticity for large \(N\).
Suppose instead that we want to test multiple coefficients at once. A particular use case we will work up to is asking whether our regression did anything at all: is the fit \(\hat{y}= \boldsymbol{X}\hat{\boldsymbol{\beta}}\) any better than the trivial fit \(\hat{y}= 0\) that uses no regressors at all?
The F-test
Formally, we’ll be testing the null hypothesis that
\[ \textrm{Null hypothesis }H_0: \boldsymbol{A}{\boldsymbol{\beta}^{*}}= \boldsymbol{a} \]
where \(\boldsymbol{A}\) is a known \(Q \times P\) matrix, and \(\boldsymbol{a}\) is a known \(Q\)-vector. As a special case, we will test our “do nothing” hypothesis by taking \(\boldsymbol{A}= \boldsymbol{I}\) and \(\boldsymbol{a}=\boldsymbol{0}\).
Our trick will be the same as always: transform into known distributions. Specifically, under \(H_0\) and the normal assumptions, one can check that
\[ \zeta := \sigma^{-2} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)^\intercal\left(\boldsymbol{A}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{A}^\intercal\right)^{-1} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right) \sim \chi^2_{Q}. \] Note that \(\zeta\) is a single number that summarizes the evidence against the vector-valued hypothesis \(\boldsymbol{A}{\boldsymbol{\beta}^{*}}= \boldsymbol{a}\): it is a “modified” squared norm of \(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\), scaled by the inverse of the covariance of \(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\). Intuitively, if \(\zeta\) is large, it’s evidence against \(H_0\). However, we still don’t know \(\sigma^2\), so we do our usual trick of normalizing by \(\hat{\sigma}^2\), which is independent of \(\zeta\) under normality, giving our final computable statistic:
\[ \phi := \frac{1}{Q} \frac{\left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)^\intercal\left(\boldsymbol{A}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{A}^\intercal\right)^{-1} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)}{\hat{\sigma}^2} = \frac{s_1 / Q}{s_2 / (N-P)} \sim \mathrm{FDist}_{Q,N-P}. \]
where \(s_1 := \zeta \sim \chi^2_{Q}\) and \(s_2 := (N-P)\hat{\sigma}^2 / \sigma^2 \sim \chi^2_{N-P}\) are independent chi-squared statistics (note that the unknown \(\sigma^2\) cancels in the ratio). This quantity is called an “F-statistic,” and its distribution is known as the “F-distribution”; its quantiles (and other properties) are computable in R using qf (and friends).
Let \(s_1 \sim \chi^2_{K_1}\) and \(s_2 \sim \chi^2_{K_2}\), independently of one another. Then the distribution of \[ \phi := \frac{s_1 / K_1}{s_2 / K_2} \]
is called an “F distribution with \(K_1\) and \(K_2\) degrees of freedom.” We write \(\phi \sim \mathrm{FDist}_{K_1,K_2}\). As long as \(K_2 > 2\), \(\mathbb{E}\left[\phi\right] = \frac{K_2}{K_2 - 2}\).
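For instance, here is a small R sketch (the values of Q, N, and P are made up for illustration) showing how to query the F-distribution with qf and how to check the mean formula by simulation:

```r
# Illustrative sketch: quantiles of the F-distribution and a simulation
# check of E[phi] = K2 / (K2 - 2). The values of Q, N, P are made up.
Q <- 3; N <- 100; P <- 5

# Rejection threshold for a level-0.05 F-test with Q and N - P degrees of freedom
qf(0.95, df1 = Q, df2 = N - P)

# Build phi from two independent chi-squared draws; same distribution as rf()
s1 <- rchisq(1e5, df = Q)
s2 <- rchisq(1e5, df = N - P)
phi <- (s1 / Q) / (s2 / (N - P))

mean(phi)                    # simulation estimate of E[phi]
(N - P) / (N - P - 2)        # theoretical mean K2 / (K2 - 2)
```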
Total regression as a special case
What do we get when we apply the F-test to \({\boldsymbol{\beta}^{*}}= \boldsymbol{0}\), that is, to \(\boldsymbol{A}= \boldsymbol{I}\) and \(\boldsymbol{a}= \boldsymbol{0}\)? The statistics become
\[ \begin{aligned} \zeta :={}& \sigma^{-2} \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}\\ \phi :={}& \frac{\frac{1}{P}\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}}{\hat{\sigma}^2} \sim \mathrm{FDist}_{P,N-P}. \end{aligned} \]
(Note that this would commonly be done after centering, so that by the FWL theorem we are not trying to evaluate whether a constant should be included. See, for example, the output of the lm function in R.)
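As a sanity check, here is a minimal R sketch (simulated data, a no-intercept fit on centered variables; all names are illustrative) that computes \(\phi\) from the formula above and compares it with the overall F-statistic reported by lm:

```r
# Sketch: compute phi = (beta' X'X beta / P) / sigma2_hat by hand and compare
# with the overall F-statistic that summary() reports for a centered,
# intercept-free lm fit. Simulated data; names are illustrative.
set.seed(42)
N <- 100; P <- 3
X <- scale(matrix(rnorm(N * P), N, P), center = TRUE, scale = FALSE)
y <- drop(X %*% c(1, 0.5, 0)) + rnorm(N)
y <- y - mean(y)

fit        <- lm(y ~ X - 1)            # data are centered, so no intercept
beta_hat   <- coef(fit)
sigma2_hat <- sum(resid(fit)^2) / (N - P)
phi        <- as.numeric(t(beta_hat) %*% crossprod(X) %*% beta_hat) / (P * sigma2_hat)

phi
summary(fit)$fstatistic                 # value should match phi, with df P and N - P
pf(phi, df1 = P, df2 = N - P, lower.tail = FALSE)   # p-value for H0: beta* = 0
```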
This can be re-written in terms of some interpretable quantities. Let’s define
\[ \begin{aligned} TSS :={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}= \sum_{n=1}^Ny_n^2 & \textrm{"total sum of squares"}\\ ESS :={}& \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}= \sum_{n=1}^N\hat{y}_n^2 & \textrm{"explained sum of squares"}\\ RSS :={}& \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}= \sum_{n=1}^N\hat{\varepsilon}_n^2 & \textrm{"residual sum of squares"}. \end{aligned} \]
Recall that a commonly used measure of “fit” is
\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}. \]
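For concreteness, here is a short self-contained R check (simulated data, illustrative names) that the decomposition \(TSS = ESS + RSS\) holds for a least-squares fit and that the two expressions for \(R^2\) agree:

```r
# Sketch: TSS = ESS + RSS and the two forms of R^2, using simulated centered
# data and a no-intercept fit. All names are illustrative.
set.seed(1)
N <- 50
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 2 * x1 + rnorm(N)
y  <- y - mean(y); x1 <- x1 - mean(x1); x2 <- x2 - mean(x2)

fit <- lm(y ~ x1 + x2 - 1)
TSS <- sum(y^2)
ESS <- sum(fitted(fit)^2)
RSS <- sum(resid(fit)^2)

all.equal(TSS, ESS + RSS)                 # residuals are orthogonal to the fit
c(1 - RSS / TSS, ESS / TSS)               # two equivalent expressions for R^2
```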
We might look at all our regressors and choose the set that gives the highest \(R^2\). But this does not make sense: adding a regressor can never decrease \(R^2\), so this rule would always choose to include every linearly independent regressor.
Our F-test statistic behaves better, though. Note that if we are testing the null that \({\boldsymbol{\beta}^{*}}= \boldsymbol{0}\), then we can write the test statistic in terms of \(R^2\) and \(P / N\):
\[ \phi = \frac{N-P}{P} \frac{\hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}}{\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}} = \frac{N-P}{P} \frac{ESS}{RSS} = \frac{1 - P / N}{P / N} \frac{R^2}{1 - R^2} = \frac{\frac{1}{P/N} - 1} {\frac{1}{R^2} - 1}. \]
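Here is a quick numeric check, in the same spirit as the sketches above (simulated data, illustrative names), that these rewrites are all the same number:

```r
# Numeric check (simulated data) that the algebraic rewrites of phi agree.
set.seed(11)
N <- 80; P <- 4
X <- scale(matrix(rnorm(N * P), N, P), center = TRUE, scale = FALSE)
y <- drop(X %*% rnorm(P)) + rnorm(N); y <- y - mean(y)

fit <- lm(y ~ X - 1)
ESS <- sum(fitted(fit)^2); RSS <- sum(resid(fit)^2)
R2  <- ESS / (ESS + RSS)

c((N - P) / P * ESS / RSS,                      # (N-P)/P * ESS/RSS
  (1 - P / N) / (P / N) * R2 / (1 - R2),        # in terms of P/N and R^2
  (1 / (P / N) - 1) / (1 / R2 - 1))             # the reciprocal form
```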
As \(P\), the number of included regressors, goes from \(0\) to \(N\), both \(P/N\) and \(R^2\) increase monotonically from \(0\) to \(1\). Therefore both \(\frac{1}{P/N} - 1\) and \(\frac{1}{R^2} - 1\) decrease monotonically from \(\infty\) to \(0\). In particular,
- Increasing \(P\) increases \(P/N\), which decreases \(\frac{1}{P/N} - 1\) and so, holding \(R^2\) fixed, decreases \(\phi\); and
- Increasing \(P\) increases \(R^2\), which decreases \(\frac{1}{R^2} - 1\) and so, holding \(P/N\) fixed, increases \(\phi\).
But they may do so at different rates! Adding a regressor increases \(P/N\) by \(1/N\), but it may increase \(R^2\) by more than that (if it is a very highly explanatory regressor) or less than that (if it is not very good at explaining the residuals from the previous fit). In other words, adding a “good” regressor will increase the F-test statistic and adding a “bad” regressor will decrease the F-test statistic.
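To see this numerically, here is a small R sketch (simulated data; which regressor is relevant and which is pure noise is baked into the simulation) comparing the overall F-statistic as regressors are added:

```r
# Sketch: adding a relevant regressor typically raises the overall F-statistic,
# while adding pure noise typically lowers it. Simulated, centered data.
set.seed(7)
N  <- 200
x1 <- rnorm(N); x2 <- rnorm(N); noise <- rnorm(N)
y  <- x1 + 0.8 * x2 + rnorm(N)
y  <- y - mean(y)
x1 <- x1 - mean(x1); x2 <- x2 - mean(x2); noise <- noise - mean(noise)

f_stat <- function(fit) unname(summary(fit)$fstatistic[1])

f_stat(lm(y ~ x1 - 1))                 # baseline
f_stat(lm(y ~ x1 + x2 - 1))            # "good" regressor: F typically increases
f_stat(lm(y ~ x1 + x2 + noise - 1))    # "bad" regressor: F typically decreases
```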
Forward stepwise regression
One might be tempted to perform the following procedure:
- Start with no regressors
- One at a time, add the regressor that increases the F-statistic the most
- Stop when the F-statistic does not increase when the next regressor is added
Although sensible, this procedure no longer has any coverage guarantees, since the selected set of variables is not independent of the F-statistics themselves. So in order to choose where to stop, we need a different criterion: an estimate of the predictive error, which we will discuss in our machine learning unit.
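For concreteness, here is a rough R sketch of the greedy procedure described above (simulated data, illustrative names); as just noted, it does not preserve coverage guarantees and is shown only to make the algorithm precise.

```r
# Sketch of greedy forward selection on centered data: at each step, add the
# regressor that maximizes the overall F-statistic; stop when no addition
# increases it. Simulated data; names are illustrative.
set.seed(3)
N <- 200
X <- matrix(rnorm(N * 6), N, 6, dimnames = list(NULL, paste0("x", 1:6)))
y <- X[, 1] + 0.5 * X[, 2] + rnorm(N)
X <- scale(X, center = TRUE, scale = FALSE); y <- y - mean(y)

f_stat <- function(cols) {
  fit <- lm(y ~ X[, cols, drop = FALSE] - 1)
  unname(summary(fit)$fstatistic[1])
}

selected <- character(0)
best_f   <- -Inf
repeat {
  candidates <- setdiff(colnames(X), selected)
  if (length(candidates) == 0) break
  fs <- sapply(candidates, function(v) f_stat(c(selected, v)))
  if (max(fs) <= best_f) break           # stop: F-statistic no longer increases
  best_f   <- max(fs)
  selected <- c(selected, names(which.max(fs)))
}
selected                                 # the greedily chosen regressors
```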