Interpreting the coefficients: The FWL theorem
Goals
- Interpret the OLS estimates and standard errors for multiple linear regression
- The Frisch–Waugh–Lovell (FWL) theorem (section 7 of Prof. Ding’s lecture notes)
- The role of regressor covariance in the OLS standard errors
The FWL theorem
The FWL theorem gives an expression for sub-vectors of \(\betavhat\). Specifically, let’s partition our regressors into two sets:
\(\y_n \sim \xv_n^\trans \beta = \av_{n}^\trans \betav_a + \bv_{n}^\trans \betav_b\),
where \(\betav^\trans = (\betav_a^\trans, \betav_b^\trans)\) and \(\xv_n^\trans = (\av_n^\trans, \bv_n^\trans)\). We can similarly partition our regressors matrix into two parts \[ \X = (\X_a \, \X_b). \]
A particular example to keep in mind is where
\[ \begin{aligned} \xv_n^\trans =& (\x_{n1}, \ldots, \x_{n(P-1)}, 1) \\ \bv_n =& (1) \\ \av_n^\trans =& (\x_{n1}, \ldots, \x_{n(P-1)}) \\ \end{aligned} \]
Let us ask: what is the effect on \(\betavhat_a\) of including \(\bv_n\) as a regressor?
The answer is given by the FWL theorem. Recall that
\[ \resvhat = \Y - \X\betavhat = \Y - \X_a \betavhat_a - \X_b \betavhat_b, \]
and that \(\X^\trans \resvhat = \zerov\), so \(\X_a^\trans \resvhat = \zerov\) and \(\X_b^\trans \resvhat = \zerov\). Recall also the definition of the projection matrix perpendicular to the span of \(\X_b\):
\[ \proj{\X_b^\perp} := \id{} - \X_b (\X_b^\trans \X_b)^{-1} \X_b^\trans. \]
Applying \(\proj{\X_b^\perp}\) to both sides of \(\resvhat = \Y - \X\betavhat\) gives
\[ \proj{\X_b^\perp} \resvhat = \resvhat = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a - \proj{\X_b^\perp} \X_b \betavhat_b = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a. \]
Verify that \(\proj{\X_b^\perp} \resvhat = \resvhat\) and \(\proj{\X_b^\perp} \X_b \betavhat_b = \zerov\).
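Below is a quick numerical sanity check of these two identities (a sketch, not a proof); the data, dimensions, and variable names are illustrative and not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X_a = rng.normal(size=(N, 3))                              # "a" regressors
X_b = np.column_stack([np.ones(N), rng.normal(size=N)])    # "b" regressors (incl. constant)
X = np.column_stack([X_a, X_b])
Y = X @ rng.normal(size=X.shape[1]) + rng.normal(size=N)

# Full OLS fit and its residuals
betahat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ betahat

# Projection onto the orthogonal complement of the span of X_b
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)

print(np.allclose(P_perp @ resid, resid))   # P resvhat = resvhat
print(np.allclose(P_perp @ X_b, 0))         # P X_b = 0, hence P X_b betahat_b = 0
```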
Now applying \(\X_a^\trans\) to both sides of the preceding expression, and using the fact that \(\proj{\X_b^\perp}\) is symmetric and idempotent (so that \(\X_a^\trans \proj{\X_b^\perp} = (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp}\)), gives
\[ \begin{aligned} \X_a^\trans \resvhat ={}& \zerov = \X_a^\trans \proj{\X_b^\perp} \Y - \X_a^\trans \proj{\X_b^\perp} \X_a \betavhat_a \quad \Rightarrow \\ (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \X_a \betavhat_a ={}& (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \Y \end{aligned} \]
If we assume that \(\X\) is full-rank, then \(\proj{\X_b^\perp} \X_a\) must be full-rank as well, since otherwise some linear combination of the columns of \(\X_a\) would lie in the span of \(\X_b\), contradicting the full rank of \(\X\). Therefore we can invert to get
\[ \betavhat_a = \left((\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \X_a \right)^{-1} (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \Y. \]
This is exactly the same as the linear regression
\[ \tilde{\Y} \sim \tilde{\X_a} \betav_a \quad\textrm{where } \tilde{\X_a} := \proj{\X_b^\perp} \X_a \textrm{ and } \tilde{\Y} := \proj{\X_b^\perp} \Y. \]
That is, the OLS coefficient on \(\X_a\) is the same as what we get by projecting both the responses and the regressors onto the space orthogonal to the span of \(\X_b\) and then running an ordinary regression.
This means the value of \(\betavhat_a\) depends only on the variation in \(\Y\) and \(\X_a\) that cannot be explained by \(\X_b\).
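The following NumPy sketch checks this equivalence numerically on simulated data: the coefficients on \(\X_a\) from the full regression match those from the regression of \(\proj{\X_b^\perp}\Y\) on \(\proj{\X_b^\perp}\X_a\). Names and dimensions are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X_a = rng.normal(size=(N, 3))
X_b = np.column_stack([np.ones(N), rng.normal(size=N)])
X = np.column_stack([X_a, X_b])
Y = X @ rng.normal(size=X.shape[1]) + rng.normal(size=N)

# Coefficients on X_a from the full regression Y ~ X_a, X_b
betahat, *_ = np.linalg.lstsq(X, Y, rcond=None)
betahat_a_full = betahat[:X_a.shape[1]]

# FWL: project Y and X_a orthogonally to the span of X_b, then regress
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
betahat_a_fwl, *_ = np.linalg.lstsq(P_perp @ X_a, P_perp @ Y, rcond=None)

print(np.allclose(betahat_a_full, betahat_a_fwl))  # the two agree
```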
The special case of a constant regressor
Suppose we want to regress \(\y_n \sim \beta_0 + \betav^\trans \xv_n\). We’d like to know what \(\betavhat\) is and, in particular, what the effect of including a constant is.
We can answer this with the FWL theorem by taking \(\bv_n = (1)\) and \(\av_n = \xv_n\). Then \(\betavhat\) will be the same as in the regression
\[ \tilde{\Y} \sim \tilde{\X_a} \betav \]
where \(\tilde{\Y} = \proj{\X_b^\perp} \Y\), \(\tilde{\X_a} = \proj{\X_b^\perp} \X_a\), and \(\X_a\) is the matrix whose \(n\)–th row is \(\xv_n^\trans\).
This special case is worth working through explicitly. Since \(\bv_n\) is simply the constant regressor \(1\), we have \(\X_b = \onev\), and
\[ \proj{\X_b^\perp} = \id{} - \onev (\onev^\trans \onev)^{-1} \onev^\trans = \id{} - \frac{1}{N} \onev \onev^\trans. \]
since \(\onev^\trans \onev = \sumn 1 \cdot 1 = N\).
Applying \(\proj{\X_b^\perp}\) to \(\Y\), we can compute step by step:
\[ \begin{aligned} \onev^\trans \Y =& \sumn 1 \cdot \y_n = \sumn \y_n\\ \frac{1}{N} \onev^\trans \Y =& \meann 1 \cdot \y_n = \meann \y_n = \ybar\\ \onev \frac{1}{N} \onev^\trans \Y =& \onev \ybar = \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} \\ \proj{\X_b^\perp} \Y = \left(\id - \onev \frac{1}{N} \onev^\trans \right) \Y =& \Y - \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} = \begin{pmatrix} \y_1 - \ybar \\ \y_2 - \ybar \\ \vdots \\ \y_N - \ybar \end{pmatrix} \end{aligned} \]
The projection matrix \(\proj{\X_b^\perp}\) thus simply centers a vector at its sample mean.
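A quick numerical check of this claim (an illustrative NumPy sketch): applying \(\id{} - \frac{1}{N} \onev \onev^\trans\) to a vector is the same as subtracting its sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
y = rng.normal(size=N)

# The projection perpendicular to the constant regressor
ones = np.ones(N)
P_perp_const = np.eye(N) - np.outer(ones, ones) / N

print(np.allclose(P_perp_const @ y, y - y.mean()))  # projecting == centering
```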
Similarly,
\[ \tilde{\X_a} := \proj{\X_b^\perp} \X_a = \X_a - \frac{1}{N} \onev \onev^\trans \X_a = \X_a - \onev \xbar^\trans \\ \textrm{ where } \xbar^\trans := \begin{pmatrix} \meann \x_{n1} & \ldots & \meann \x_{n(P-1)} \end{pmatrix}, \]
so that the \(n\)–th row of \(\proj{\X_b^\perp} \X_a\) is simply \(\xv_n^\trans - \xbar^\trans\), and each regressor is centered. So
\[ \betavhat_a = (\tilde{\X_a}^\trans \tilde{\X_a})^{-1} \tilde{\X_a}^\trans \tilde{\Y}, \]
the OLS estimator where both the regressors and responses have been centered at their sample means. In this case, by the LLN,
\[ \frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a} \rightarrow \cov{\xv_n}, \]
in contrast to the general case, where
\[ \frac{1}{N} \X^\trans \X \rightarrow \expect{\xv_n \xv_n^\trans} = \cov{\xv_n} + \expect{\xv_n}\expect{\xv_n^\trans}. \]
For this reason, when thinking about the sampling behavior of OLS coefficients where a constant is included in the regression, it’s enough to think about the covariance of the regressors, rather than the outer product.
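The following sketch (with illustrative simulated data) checks both points: centering \(\Y\) and the regressors reproduces the slope coefficients from the regression that includes an intercept, and \(\frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a}\) equals the plug-in (\(1/N\)-normalized) sample covariance of the regressors, which converges to \(\cov{\xv_n}\) by the LLN.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P_minus_1 = 500, 3
X_a = rng.normal(size=(N, P_minus_1)) @ rng.normal(size=(P_minus_1, P_minus_1))
Y = 1.5 + X_a @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=N)

# Regression with an explicit intercept column
X_full = np.column_stack([X_a, np.ones(N)])
betahat_full, *_ = np.linalg.lstsq(X_full, Y, rcond=None)

# Centered regression with no intercept
X_c = X_a - X_a.mean(axis=0)
Y_c = Y - Y.mean()
betahat_centered, *_ = np.linalg.lstsq(X_c, Y_c, rcond=None)
print(np.allclose(betahat_centered, betahat_full[:P_minus_1]))  # slopes agree

# (1/N) Xtilde^T Xtilde equals the biased sample covariance of the regressors
print(np.allclose(X_c.T @ X_c / N, np.cov(X_a, rowvar=False, bias=True)))
```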
Derive our simple least squares estimator of \(\betav\) using the FWL theorem.
Covariances with the FWL theorem
Though it may not be obvious, estimating the standard errors of \(\betavhat_a\) from the residuals of \(\tilde{\Y} \sim \tilde{\X_a}\betav_a\) is equivalent to estimating them from the residuals of \(\Y \sim \X_a \betav_a + \X_b \betav_b\), whether the homoskedastic or sandwich covariance estimators are used. See section 7.3 of Prof. Ding’s book for a proof, which uses the Schur complement representation of sub-matrices of \((\X^\trans \X)^{-1}\).
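The sketch below illustrates the sandwich case numerically: the residuals from the partialled-out regression coincide with the full-regression residuals, and the \(\X_a\) block of the sandwich estimator agrees whether it is computed from the full or the partialled-out regression. It uses the plain HC0 form with no degrees-of-freedom correction, and all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
X_a = rng.normal(size=(N, 2))
X_b = np.column_stack([np.ones(N), rng.normal(size=N)])
X = np.column_stack([X_a, X_b])
# Heteroskedastic noise, so the sandwich estimator is the relevant one
Y = X @ rng.normal(size=X.shape[1]) + np.abs(X_a[:, 0]) * rng.normal(size=N)

# Full regression and its residuals
betahat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid_full = Y - X @ betahat

# Partialled-out (FWL) regression and its residuals
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
X_a_tilde, Y_tilde = P_perp @ X_a, P_perp @ Y
betahat_a, *_ = np.linalg.lstsq(X_a_tilde, Y_tilde, rcond=None)
resid_partial = Y_tilde - X_a_tilde @ betahat_a
print(np.allclose(resid_full, resid_partial))   # identical residuals

def hc0(Xmat, e):
    """Sandwich (HC0) covariance estimate with no degrees-of-freedom correction."""
    bread = np.linalg.inv(Xmat.T @ Xmat)
    meat = Xmat.T @ (e[:, None] ** 2 * Xmat)
    return bread @ meat @ bread

k = X_a.shape[1]
V_full = hc0(X, resid_full)[:k, :k]        # X_a block of the full-regression sandwich
V_partial = hc0(X_a_tilde, resid_partial)  # sandwich from the partialled-out regression
print(np.allclose(V_full, V_partial))      # the X_a block agrees
```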
The role of regressor covariance
For simplicity, let’s consider a simple regression where \(\y_n \sim \beta_0 + \beta_1 \x_n\), with \(\var{\x_n} = \sigma_x^2\) and \(\var{\res_n} = \sigma_\res^2\). By the FWL theorem, we can estimate
\[ \betahat_1 = \frac{\meann (\y_n - \ybar) (\x_n - \xbar)}{\meann (\x_n - \xbar)^2} \quad\textrm{with}\quad N \var{\betahat_1} \approx \frac{\meann (\y_n - \ybar - \betahat_1 (\x_n - \xbar))^2}{\meann (\x_n - \xbar)^2} \rightarrow \frac{\sigma_\res^2}{\sigma_x^2}. \]
The variance of the (rescaled) regression coefficient thus depends on the ratio of the residual variance to the regressor variance, where both are computed after projecting out the remaining regressors (here, the constant).
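A small Monte Carlo sketch of this limit (the simulation settings are illustrative, not from the notes): the simulated value of \(N \var{\betahat_1}\) should be close to \(\sigma_\res^2 / \sigma_x^2\).

```python
import numpy as np

rng = np.random.default_rng(4)
N, n_sims = 200, 2000
sigma_x, sigma_eps, beta0, beta1 = 2.0, 1.5, 1.0, 0.7

betahat1 = np.empty(n_sims)
for s in range(n_sims):
    x = rng.normal(scale=sigma_x, size=N)
    y = beta0 + beta1 * x + rng.normal(scale=sigma_eps, size=N)
    # FWL form of the slope: center both variables, then take the ratio
    betahat1[s] = np.mean((y - y.mean()) * (x - x.mean())) / np.mean((x - x.mean()) ** 2)

print(N * betahat1.var())             # Monte Carlo estimate of N * var(betahat_1)
print(sigma_eps ** 2 / sigma_x ** 2)  # theoretical limit sigma_eps^2 / sigma_x^2
```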