The FWL theorem and coefficient interpretation.
Goals
- Discuss the FWL theorem and some uses and consequences
- Inclusion of a constant
- Linear regression as the marginal association
- The FWL theorem for visualization
Simple regression
Recall our formulas for simple linear regression. If \(\y_n \sim \beta_1 + \beta_2 \x_n\), then
\[ \begin{align*} \betahat_1 = \overline{y} - \betahat_2 \overline{x} \quad\textrm{and}\quad \betahat_2 ={} \frac{\sumn \left( \y_n - \ybar \right) \left(\x_n - \xbar \right)} {\sumn \left( \x_n - \xbar \right)^2}. \end{align*} \]
Interestingly, note that if we define \(\y'_n = \y_n - \ybar\) and \(\x'_n = \x_n - \xbar\), the “de–meaned” or “centered” versions of the response and regressor, then we could also have computed the regression \(\y'_n \sim \gamma_2 \x'_n\) and gotten the same answer:
\[ \gammahat_2 = \frac{\sumn \y'_n \x'_n}{\sumn \x'_n \x'_n} = \frac{\sumn \left( \y_n - \ybar \right) \left(\x_n - \xbar \right)} {\sumn \left( \x_n - \xbar \right)^2} = \betahat_2. \]
Is this a coincidence? We’ll see today that it’s not.
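Before we do, here is a quick numerical sanity check of the claim (a minimal numpy sketch; the simulated data and variable names are illustrative, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# OLS with an intercept: regress y on (1, x).
X = np.column_stack([np.ones(N), x])
beta1_hat, beta2_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# OLS with no intercept on the centered ("de-meaned") variables.
gamma2_hat = np.linalg.lstsq((x - x.mean())[:, None], y - y.mean(), rcond=None)[0][0]

print(np.isclose(beta2_hat, gamma2_hat))                        # True: the slopes agree
print(np.isclose(beta1_hat, y.mean() - beta2_hat * x.mean()))   # True: intercept formula
```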
The FWL theorem
The Frisch–Waugh–Lovell (FWL) theorem gives an expression for sub-vectors of \(\betavhat\). Specifically, let’s partition our regressors into two sets:
\(\y_n \sim \xv_n^\trans \betav = \av_{n}^\trans \betav_a + \bv_{n}^\trans \betav_b\),
where \(\betav^\trans = (\betav_a^\trans, \betav_b^\trans)\) and \(\xv_n^\trans = (\av_n^\trans, \bv_n^\trans)\). We can similarly partition our regressor matrix into two blocks \[ \X = (\X_a \, \X_b). \]
A particular example to keep in mind is where
\[ \begin{aligned} \xv_n =& (\x_{n1}, \ldots, \x_{n(P-1)}, 1)^\trans \\ \bv_n =& (1) \\ \av_n =& (\x_{n1}, \ldots, \x_{n(P-1)})^\trans \end{aligned} \]
Let us ask: what is the effect on \(\betavhat_a\) of including \(\bv_n\) as a regressor?
The answer is given by the FWL theorem. Recall that
\[ \resvhat = \Y - \X\betavhat = \Y - \X_a \betavhat_a - \X_b \betavhat_b, \]
and that \(\X^\trans \resvhat = \zerov\), so \(\X_a^\trans \resvhat = \zerov\) and \(\X_b^\trans \resvhat = \zerov\). Recall also the definition of the projection matrix perpendicular to the span of \(\X_b\):
\[ \proj{\X_b^\perp} := \id{} - \X_b (\X_b^\trans \X_b)^{-1} \X_b^\trans. \]
Applying \(\proj{\X_b^\perp}\) to both sides of \(\resvhat = \Y - \X\betavhat\) gives
\[ \proj{\X_b^\perp} \resvhat = \resvhat = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a - \proj{\X_b^\perp} \X_b \betavhat_b = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a. \]
Verify that \(\proj{\X_b^\perp} \resvhat = \resvhat\) and \(\proj{\X_b^\perp} \X_b \betavhat_b = \zerov\).
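These two facts are also easy to check numerically. Below is a small numpy sketch on simulated data (the names `X_a`, `X_b`, and the dimensions are illustrative); it confirms that \(\proj{\X_b^\perp}\) leaves the residuals unchanged and annihilates anything in the span of \(\X_b\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Pa, Pb = 50, 3, 2
X_a = rng.normal(size=(N, Pa))
X_b = rng.normal(size=(N, Pb))
X = np.hstack([X_a, X_b])
y = rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Projection onto the orthogonal complement of the span of X_b.
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)

print(np.allclose(P_perp @ resid, resid))   # residuals are unchanged
print(np.allclose(P_perp @ X_b, 0.0))       # X_b (hence X_b beta_hat_b) is annihilated
```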
Now applying \((\proj{\X_b^\perp} \X_a)^\trans\) to both sides of the preceding expression, and using the fact that \(\proj{\X_b^\perp}\) is symmetric and idempotent, gives
\[ \begin{aligned} \X_a^\trans \resvhat ={}& \zerov = \X_a^\trans \proj{\X_b^\perp} \Y - \X_a^\trans \proj{\X_b^\perp} \X_a \betavhat_a \quad \Rightarrow \\ \left(\proj{\X_b^\perp} \X_a \right)^\trans \proj{\X_b^\perp} \X_a \betavhat_a ={}& \left(\proj{\X_b^\perp} \X_a \right)^\trans \proj{\X_b^\perp} \Y \end{aligned} \]
If we assume that \(\X\) is full-rank, then \(\proj{\X_b^\perp} \X_a\) must be full-rank as well, since otherwise some nonzero linear combination of the columns of \(\X_a\) would lie in the span of the columns of \(\X_b\), contradicting the full rank of \(\X\). Therefore we can invert to get
\[ \betavhat_a = \left((\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \X_a \right)^{-1} \left(\proj{\X_b^\perp} \X_a \right)^\trans \proj{\X_b^\perp} \Y. \]
This is exactly the same as the linear regression
\[ \tilde{\Y} \sim \tilde{\X_a} \betav_a \quad\textrm{where } \tilde{\X_a} := \proj{\X_b^\perp} \X_a \textrm{ and } \tilde{\Y} := \proj{\X_b^\perp} \Y. \]
That is, the OLS coefficients on \(\X_a\) are what we get by projecting both the response and the \(\X_a\) regressors onto the space orthogonal to the span of \(\X_b\), and then running an ordinary regression of the projected response on the projected regressors.
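The following numpy sketch checks this equivalence on simulated data (names and dimensions are illustrative): the sub-vector \(\betavhat_a\) from the full regression matches the coefficients from regressing the projected response on the projected \(\X_a\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, Pa, Pb = 200, 3, 2
X_a = rng.normal(size=(N, Pa))
X_b = rng.normal(size=(N, Pb))
X = np.hstack([X_a, X_b])
y = X @ rng.normal(size=Pa + Pb) + rng.normal(size=N)

# Full regression: the first Pa entries of beta_hat are beta_hat_a.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hat_a = beta_hat[:Pa]

# FWL: project y and X_a off the span of X_b, then regress.
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
X_a_tilde = P_perp @ X_a
y_tilde = P_perp @ y
beta_hat_a_fwl = np.linalg.lstsq(X_a_tilde, y_tilde, rcond=None)[0]

print(np.allclose(beta_hat_a, beta_hat_a_fwl))  # True
```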
See section 7.3 of Prof. Ding’s book for a more rigorous proof, which uses the Schur complement representation of sub-matrices of \((\X^\trans \X)^{-1}\).
The special case of a constant regressor
Suppose we want to regress \(\y_n \sim \beta_0 + \betav^\trans \xv_n\). We’d like to know what \(\betavhat\) is and, in particular, what the effect of including a constant is.
We can answer this with the FWL theorem by taking \(\bv_n = (1)\) and \(\av_n = \xv_n\). Then \(\betavhat\) will be the same as in the regression
\[ \tilde{\Y} \sim \tilde{\X_a} \betav \]
where \(\tilde{\Y} = \proj{\X_b^\perp} \Y\) and \(\tilde{\X_a} = \proj{\X_b^\perp} \X_a\).
Working this case out explicitly is useful for intuition. Since \(\bv_n = (1)\) is simply the constant regressor, \(\X_b = \onev\), and
\[ \proj{\X_b^\perp} = \id{} - \onev (\onev^\trans \onev)^{-1} \onev^\trans = \id{} - \frac{1}{N} \onev \onev^\trans, \]
since \(\onev^\trans \onev = \sumn 1 \cdot 1 = N\).
We can then compute:
\[ \begin{aligned} \onev^\trans \Y =& \sumn 1 \cdot \y_n = \sumn \y_n\\ \frac{1}{N} \onev^\trans \Y =& \meann 1 \cdot \y_n = \meann \y_n = \ybar\\ \onev \frac{1}{N} \onev^\trans \Y =& \onev \ybar = \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} \\ \proj{\X_b^\perp} \Y = \left(\id - \onev \frac{1}{N} \onev^\trans \right) \Y =& \Y - \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} = \begin{pmatrix} y_1 - \ybar \\ y_2 - \ybar \\ \vdots \\ y_N - \ybar \end{pmatrix} \end{aligned} \]
The projection matrix \(\proj{\X_b^\perp}\) thus simply centers a vector at its sample mean.
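A two-line numpy check of this fact (a sketch; `y` is any simulated vector):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
y = rng.normal(size=N)
ones = np.ones(N)

# I - (1/N) 1 1^T applied to y is just y minus its sample mean.
P_perp = np.eye(N) - np.outer(ones, ones) / N
print(np.allclose(P_perp @ y, y - y.mean()))  # True
```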
Similarly,
\[ \tilde{\X_a} := \proj{\X_b^\perp} \X_a = \X_a - \frac{1}{N} \onev \onev^\trans \X_a = \X_a - \onev \xbar^\trans \\ \textrm{ where } \xbar^\trans := \begin{pmatrix} \meann \x_{n1} & \ldots & \meann \x_{n(P-1)} \end{pmatrix}, \]
so that the \(n\)–th row of \(\proj{\X_b^\perp} \X_a\) is simply \(\xv_n^\trans - \xbar^\trans\), and each regressor is centered. So
\[ \betavhat_a = (\tilde{\X_a}^\trans \tilde{\X_a})^{-1} \tilde{\X_a}^\trans \tilde{\Y}, \]
the OLS estimator where both the regressors and responses have been centered at their sample means. In this case, by the LLN,
\[ \frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a} \rightarrow \cov{\xv_n}, \]
in contrast to the general case, where
\[ \frac{1}{N} \X^\trans \X \rightarrow \expect{\xv_n \xv_n^\trans} = \cov{\xv_n} + \expect{\xv_n}\expect{\xv_n^\trans}. \]
For this reason, when thinking about the sampling behavior of OLS coefficients where a constant is included in the regression, it’s enough to think about the covariance of the regressors, rather than the outer product.
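Both points (centering reproduces the with-intercept coefficients, and the centered Gram matrix is the sample covariance of the regressors) can be checked numerically. A minimal numpy sketch on simulated data, with illustrative names and coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 10_000, 3
X_a = rng.normal(size=(N, P)) @ rng.normal(size=(P, P))  # correlated regressors
y = 0.5 + X_a @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# With-intercept regression: keep only the coefficients on X_a.
X = np.column_stack([np.ones(N), X_a])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0][1:]

# Centered regression with no intercept gives the same coefficients.
X_a_tilde = X_a - X_a.mean(axis=0)
y_tilde = y - y.mean()
beta_hat_centered = np.linalg.lstsq(X_a_tilde, y_tilde, rcond=None)[0]
print(np.allclose(beta_hat, beta_hat_centered))  # True

# (1/N) X_a_tilde^T X_a_tilde is the (uncorrected) sample covariance of the regressors.
print(np.allclose(X_a_tilde.T @ X_a_tilde / N,
                  np.cov(X_a, rowvar=False, bias=True)))  # True
```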
Derive our simple least squares estimators \(\betahat_1\) and \(\betahat_2\) using the FWL theorem.