Variable selection and the F-test
Goals
- Derive tests for multiple coefficients at once under normality
- The F-test and F-distribution
- As a special case, test whether a group of variables should be included in the model
- Review \(R^2\), discuss its limitations, and see how the F-test approximately corrects them
Testing multiple coefficients at once
So far we have devised tests for one coefficient at a time, or, more precisely, for one linear combination of the coefficients at a time. We derived tests under normality, and asserted that they made sense under homoskedasticity for large \(N\).
Suppose we want to test multiple coefficients at a time. A particular use case we will work up to is asking whether our regression did anything at all. In particular, we might ask whether the fit \(\hat{\boldsymbol{Y}}= \boldsymbol{X}\hat{\boldsymbol{\beta}}\) is any better than the fit \(\hat{\boldsymbol{Y}}= \boldsymbol{0}\), which corresponds to not running a regression at all.
The F-test
Formally, we’ll be testing the null hypothesis that
\[ \textrm{Null hypothesis }H_0: \boldsymbol{A}{\boldsymbol{\beta}^{*}}= \boldsymbol{a} \]
where \(\boldsymbol{A}\) is a \(Q \times P\) matrix, and \(\boldsymbol{a}\) is a \(Q\)-vector. As a special case, we will test our “do nothing” hypothesis by taking \(\boldsymbol{A}= \boldsymbol{I}\) and \(\boldsymbol{a}=\boldsymbol{0}\).
Our trick will be the same as always: transform into known distributions. Specifically, under \(H_0\) and the normal assumptions, one can check that
\[ \zeta := \sigma^{-2} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)^\intercal\left(\boldsymbol{A}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{A}^\intercal\right)^{-1} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right) \sim \chi^2_{Q}. \] Note that \(\zeta\) is a single number that summarizes the evidence against the vector-valued hypothesis \(\boldsymbol{A}{\boldsymbol{\beta}^{*}}= \boldsymbol{a}\): it is a “modified” squared norm of \(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\), scaled by the inverse of the covariance of \(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\). Intuitively, if \(\zeta\) is large, it is evidence against \(H_0\). However, we still don’t know \(\sigma^2\), so we do our usual trick of normalizing by \(\hat{\sigma}^2\), which is independent of \(\zeta\) under normality, giving our final computable statistic:
\[ \phi := \frac{1}{Q} \frac{\left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)^\intercal\left(\boldsymbol{A}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{A}^\intercal\right)^{-1} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)}{\hat{\sigma}^2} = \frac{s_1 / Q}{s_2 / (N-P)} \sim \mathrm{FDist}_{Q,N-P}, \]
where \(s_1\) and \(s_2\) are independent chi-squared statistics with \(Q\) and \(N-P\) degrees of freedom, respectively. This quantity is called an “F-statistic,” and its distribution is known as the “F-distribution”; its quantiles (and other properties) are computable in R using qf (and related functions).
Let \(s_1 \sim \chi^2_{K_1}\) and \(s_2 \sim \chi^2_{K_2}\), independently of one another. Then the distribution of \[ \phi := \frac{s_1 / K_1}{s_2 / K_2} \]
is called an “F distribution with \(K_1\) and \(K_2\) degrees of freedom.” We write \(\phi \sim \mathrm{FDist}_{K_1,K_2}\). As long as \(K_2 > 2\), \(\mathbb{E}\left[\phi\right] = \frac{K_2}{K_2 - 2}\).
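As a quick sanity check (a simulation sketch of ours, not part of the notes), we can draw independent chi-squared variables and verify that their scaled ratio behaves the way qf says it should:

```r
# Simulation sketch: the scaled ratio of independent chi-squared variables
# should match the F distribution's mean and quantiles.
set.seed(42)
k1 <- 3; k2 <- 10
s1 <- rchisq(1e5, df = k1)
s2 <- rchisq(1e5, df = k2)
phi <- (s1 / k1) / (s2 / k2)
mean(phi)                 # should be close to k2 / (k2 - 2) = 1.25
k2 / (k2 - 2)
quantile(phi, 0.95)       # empirical 95% quantile ...
qf(0.95, df1 = k1, df2 = k2)  # ... versus the theoretical one
```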
The “F-statistic” for the test of the null hypothesis \(H_0: \boldsymbol{A}{\boldsymbol{\beta}^{*}}= \boldsymbol{a}\), where \(\boldsymbol{A}\) is a \(Q \times P\) matrix, is given by \[ \phi := \frac{ \frac{1}{Q} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right)^\intercal \left(\boldsymbol{A}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{A}^\intercal\right)^{-1} \left(\boldsymbol{A}\hat{\boldsymbol{\beta}}- \boldsymbol{a}\right) }{ \hat{\sigma}^2 }. \] An “F-test” compares the F-statistic to the quantiles of an F distribution with \(Q\) and \(N-P\) degrees of freedom.
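To make this concrete, here is a minimal sketch (ours, not from the notes) of computing the F-statistic and its p-value directly from the formula above; `X`, `y`, `A`, and `a` are assumed inputs, and the p-value comes from R's pf:

```r
# Sketch: F-test of H0: A beta = a.
# Assumptions: X is the N x P design matrix (including any intercept column),
# y is the N-vector of responses, A is Q x P, and a is a Q-vector.
f_test <- function(X, y, A, a) {
  N <- nrow(X); P <- ncol(X); Q <- nrow(A)
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # OLS estimate
  resid <- y - X %*% beta_hat
  sigma2_hat <- sum(resid^2) / (N - P)            # unbiased variance estimate
  diff <- A %*% beta_hat - a
  middle <- solve(A %*% solve(t(X) %*% X) %*% t(A))
  phi <- as.numeric(t(diff) %*% middle %*% diff) / (Q * sigma2_hat)
  p_value <- pf(phi, df1 = Q, df2 = N - P, lower.tail = FALSE)
  list(f_stat = phi, p_value = p_value)
}
```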
Total regression as a special case
What do we get when we apply the F-test to \({\boldsymbol{\beta}^{*}}= \boldsymbol{0}\)? The F-statistic is
\[ \begin{aligned} \phi :={}& \frac{\frac{1}{P}\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}}{\hat{\sigma}^2} \sim \mathrm{FDist}_{P,N-P}. \end{aligned} \]
(Note that this would commonly be done after centering, so that by the FWL theorem we are not trying to evaluate whether a constant should be included. See, for example, the output of the lm function in R.)
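As an illustration (simulated data with made-up variable names), the F-statistic reported by summary(lm(...)) is exactly this test that all the non-intercept coefficients are zero:

```r
# Illustrative sketch with simulated data. For a fit with an intercept,
# the F-statistic reported by summary(lm) tests the hypothesis that all
# non-intercept coefficients are zero.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)
summary(fit)$fstatistic   # F value, numerator df, denominator df
```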
This can be re-written in terms of some interpretable quantities. Let’s define
\[ \begin{aligned} TSS :={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}= \sum_{n=1}^Ny_n^2 & \textrm{"total sum of squares"}\\ ESS :={}& \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}= \sum_{n=1}^N\hat{y}_n^2 & \textrm{"explained sum of squares"}\\ RSS :={}& \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}= \sum_{n=1}^N\hat{\varepsilon}_n^2 & \textrm{"residual sum of squares"}. \end{aligned} \]
Recall that a commonly used measure of “fit” is
\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}. \]
We might look at all our regressors and choose the set that gives the highest \(R^2\). But this does not make sense: \(R^2\) can never decrease when a regressor is added, so this rule would always choose to include every linearly independent regressor.
Our F test statistic behaves better, though. Note that if we are testing the null that \(\boldsymbol{\beta}= \boldsymbol{0}\), then we can write the test statistic in terms of \(R^2\) and \(P / N\):
\[ \phi = \frac{N-P}{P} \frac{\hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}}{\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}} = \frac{N-P}{P} \frac{ESS}{RSS} = \frac{1 - P / N}{P / N} \frac{R^2}{1 - R^2} = \frac{\frac{1}{P/N} - 1} {\frac{1}{R^2} - 1}. \]
As \(P\), the number of included regressors, goes from \(0\) to \(N\), both \(P/N\) and \(R^2\) increase monotonically from \(0\) to \(1\). Therefore both \(\frac{1}{P/N} - 1\) and \(\frac{1}{R^2} - 1\) decrease monotonically from \(\infty\) to \(0\). In particular,
- Increasing \(P\) increases \(P/N\), decreases \(\frac{1}{P/N} - 1\), and decreases \(\phi\), and
- Increasing \(P\) increases \(R^2\), decreases \(\frac{1}{R^2} - 1\), and increases \(\phi\).
But they may do so at different rates! Adding a regressor increases \(P/N\) by \(1/N\), but it may increase \(R^2\) by more than that (if it is a very highly explanatory regressor) or less than that (if it is not very good at explaining the residuals from the previous fit). In other words, adding a “good” regressor will increase the F-test statistic and adding a “bad” regressor will decrease the F-test statistic.
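As a numerical check (again with simulated data and made-up names), we can verify the identity above using the \(R^2\) and the degrees of freedom reported by summary(lm), which play the roles of \(P\) and \(N-P\) in the centered version of the formula:

```r
# Numerical check of phi = ((N - P) / P) * R^2 / (1 - R^2), using the
# numerator and denominator degrees of freedom reported by summary(lm).
set.seed(3)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2 * x1 + rnorm(n)
s <- summary(lm(y ~ x1 + x2))
r2 <- s$r.squared
phi_from_r2 <- (s$fstatistic["dendf"] / s$fstatistic["numdf"]) * r2 / (1 - r2)
unname(c(phi_from_r2, s$fstatistic["value"]))   # the two values should agree
```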
Testing that some subset of the coefficients are zero
Another common use of the F test is to check whether some subset of the coefficients are zero. That is, suppose that we are regressing \(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\theta}+ \boldsymbol{z}_n^\intercal\boldsymbol{\gamma}\), for \(\boldsymbol{\theta}\in \mathbb{R}^{P_1}\) and \(\boldsymbol{\gamma}\in \mathbb{R}^{P_2}\), and we want to test the hypothesis \(H_0: \boldsymbol{\gamma}= \boldsymbol{0}\). We can express this in our framework by defining \[ \boldsymbol{\beta}= \begin{pmatrix} \boldsymbol{\theta}\\ \boldsymbol{\gamma} \end{pmatrix} \quad \boldsymbol{A}= \begin{pmatrix} \boldsymbol{0}_{P_2 \times P_1} & \boldsymbol{I}_{P_2} \end{pmatrix} \quad H_0: \boldsymbol{A}\boldsymbol{\beta}= \boldsymbol{\gamma}= \boldsymbol{0}_{P_2}. \] We can then form the test as given above using the OLS estimator from the full regression including both \(\boldsymbol{x}_n\) and \(\boldsymbol{z}_n\) as regressors.
It happens that this test statistic can be re-arranged into a nice, simple form. If we write \(\mathrm{RSS}_{x}\) for the residual sum of squares from regressing on \(\boldsymbol{x}_n\) alone, and \(\mathrm{RSS}_{xz}\) for the residual sum of squares from regressing on both \(\boldsymbol{x}_n\) and \(\boldsymbol{z}_n\), then the F-test statistic is equal to \[ \phi = \frac{N - P_1 - P_2}{P_2} \frac{\mathrm{RSS}_{x} - \mathrm{RSS}_{xz}}{\mathrm{RSS}_{xz}}, \] which is distributed under the null as an \(\mathrm{FDist}_{P_2,N - P_1 - P_2}\) variable, as expected. Necessarily \(\mathrm{RSS}_{x} \ge \mathrm{RSS}_{xz}\), and so we reject \(H_0\) precisely when the RSS becomes much smaller upon including \(\boldsymbol{z}_n\) as a regressor.
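In R, this nested comparison can be computed either from the RSS formula directly or with the built-in anova() comparison of nested lm fits; the following is an illustrative sketch with simulated data and made-up names:

```r
# Sketch: nested F-test via the RSS formula and via anova().
set.seed(2)
n <- 100
x <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit_x  <- lm(y ~ x)                # smaller model: x only
fit_xz <- lm(y ~ x + z1 + z2)      # larger model: x and z's
rss_x  <- sum(residuals(fit_x)^2)
rss_xz <- sum(residuals(fit_xz)^2)
p2  <- 2                           # number of z regressors
df2 <- df.residual(fit_xz)         # N - P1 - P2
phi <- (df2 / p2) * (rss_x - rss_xz) / rss_xz
pf(phi, df1 = p2, df2 = df2, lower.tail = FALSE)  # p-value from the formula
anova(fit_x, fit_xz)               # same F statistic and p-value
```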
The easiest way to prove this equivalence is via the FWL theorem, which we will learn in unit four.
Forward stepwise regression
One might be tempted to perform the following procedure:
- Start with no regressors
- One at a time, add the regressor that increases the F-statistic the most
- Stop when the F-statistic does not increase with the next regressor
Although this seems sensible, the procedure no longer has any coverage guarantees, since the set of selected variables is not independent of the F-statistics themselves. So, in order to choose where to stop, we need a different criterion: an estimate of the predictive error, which we will talk about in our machine learning unit.
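For completeness, here is a rough sketch of the procedure just described (illustrative only; the names `df`, `y`, and `candidates` are ours, and, as noted, the resulting F-statistics no longer have their nominal distribution):

```r
# Sketch of forward stepwise selection by the overall F-statistic.
# Assumptions: `df` is a data frame with response column `y`, and
# `candidates` is a character vector of regressor column names.
forward_by_f <- function(df, candidates) {
  selected <- character(0)
  best_f <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Overall F-statistic (all non-intercept coefficients = 0) for each
    # candidate addition to the current model.
    fs <- sapply(remaining, function(v) {
      fml <- reformulate(c(selected, v), response = "y")
      unname(summary(lm(fml, data = df))$fstatistic["value"])
    })
    if (max(fs) <= best_f) break   # stop: no addition increases the F-statistic
    best_f <- max(fs)
    selected <- c(selected, remaining[which.max(fs)])
  }
  selected
}

# Example use (hypothetical data frame):
# forward_by_f(df, c("x1", "x2", "x3"))
```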