The FWL theorem and coefficient interpretation.

Goals

  • Discuss the FWL theorem and some uses and consequences
    • Inclusion of a constant
    • Linear regression as the marginal association
    • The FWL theorem for visualization

Simple regression

Recall our formulas for simple linear regression. If \(y_n \sim \beta_1 + \beta_2 x_n\), then

\[ \begin{align*} \hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x} \quad\textrm{and}\quad \hat{\beta}_2 ={} \frac{\sum_{n=1}^N\left( y_n - \bar{y}\right) \left(x_n - \bar{x}\right)} {\sum_{n=1}^N\left( x_n - \bar{x}\right)^2}. \end{align*} \]

Interestingly, note that if we define \(y'_n = y_n - \bar{y}\) and \(x'_n = x_n - \bar{x}\), the “de–meaned” or “centered” versions of the response and regressor, then we could also have computed the regression \(y'_n \sim \gamma_2 x'_n\) and gotten the same answer:

\[ \hat{\gamma}_2 = \frac{\sum_{n=1}^Ny'_n x'_n}{\sum_{n=1}^Nx'_n x'_n} = \frac{\sum_{n=1}^N\left( y_n - \bar{y}\right) \left(x_n - \bar{x}\right)} {\sum_{n=1}^N\left( x_n - \bar{x}\right)^2} = \hat{\beta}_2. \]

Is this a coincidence? We’ll see today that it’s not.
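Before moving on, the equivalence is easy to check numerically. The following is a minimal sketch with simulated data (all variable names and the simulated coefficients are illustrative, not from the lecture):

```python
# Check: the slope from regressing centered y on centered x (no intercept)
# matches the slope beta_2 from the full simple regression y ~ beta_1 + beta_2 x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Full regression on (1, x) via least squares.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Regression of centered y on centered x, with no intercept.
xc, yc = x - x.mean(), y - y.mean()
gamma2 = (xc @ yc) / (xc @ xc)

assert np.isclose(beta[1], gamma2)
```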

Correlated regressors

Take \(\boldsymbol{X}= (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_P)\), where \(\boldsymbol{x}_1 = \boldsymbol{1}\), so that we are regressing on \(P-1\) regressors and a constant. If the regressors are all orthogonal to one another (\(\boldsymbol{x}_k^\intercal\boldsymbol{x}_j = 0\) for \(k \ne j\)), then we know that

\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \boldsymbol{x}_1^\intercal\boldsymbol{x}_1 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \boldsymbol{x}_P^\intercal\boldsymbol{x}_P \\ \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{x}_1^\intercal\boldsymbol{Y}\\ \vdots \\ \boldsymbol{x}_P^\intercal\boldsymbol{Y}\\ \end{pmatrix} = \begin{pmatrix} \frac{\boldsymbol{x}_1^\intercal\boldsymbol{Y}}{\boldsymbol{x}_1^\intercal\boldsymbol{x}_1} \\ \vdots \\ \frac{\boldsymbol{x}_P^\intercal\boldsymbol{Y}}{\boldsymbol{x}_P^\intercal\boldsymbol{x}_P} \\ \end{pmatrix}. \]
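This reduction to per-column ratios can be verified directly. A sketch with simulated data (the names and the QR construction of orthogonal columns are illustrative):

```python
# With mutually orthogonal columns, multivariate OLS reduces to the
# per-column ratios x_k'Y / (x_k'x_k).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
Q, _ = np.linalg.qr(A)  # Q has orthonormal (hence mutually orthogonal) columns
Y = rng.normal(size=50)

beta_ols = np.linalg.lstsq(Q, Y, rcond=None)[0]
beta_ratio = (Q.T @ Y) / np.sum(Q**2, axis=0)  # x_k'Y / x_k'x_k for each k

assert np.allclose(beta_ols, beta_ratio)
```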

Exercise

What is the limiting behavior of \(\frac{1}{N}\boldsymbol{X}^\intercal\boldsymbol{X}\) when the components of \(\boldsymbol{x}_n\) are independent of one another? What if they are independent and \(\mathbb{E}\left[\boldsymbol{x}_n\right] = 0\), except for a constant \(\boldsymbol{x}_{n1} = 1\)?

However, typically the regressors are not orthogonal to one another. When they are not, we can ask

  • How can we interpret the coefficients?
  • How does the relation between the regressors affect the \(\hat{\boldsymbol{\beta}}\) covariance matrix?

The FWL theorem

The FWL theorem gives an expression for sub-vectors of \(\hat{\boldsymbol{\beta}}\). Specifically, let’s partition our regressors into two sets:

\(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}= \boldsymbol{a}_{n}^\intercal\boldsymbol{\beta}_a + \boldsymbol{b}_{n}^\intercal\boldsymbol{\beta}_b\),

where \(\boldsymbol{\beta}^\intercal= (\boldsymbol{\beta}_a^\intercal, \boldsymbol{\beta}_b^\intercal)\) and \(\boldsymbol{x}_n^\intercal= (\boldsymbol{a}_n^\intercal, \boldsymbol{b}_n^\intercal)\). We can similarly partition our regressor matrix into two blocks: \[ \boldsymbol{X}= (\boldsymbol{X}_a \, \boldsymbol{X}_b). \]

A particular example to keep in mind is where

\[ \begin{aligned} \boldsymbol{x}_n^\intercal={}& (x_{n1}, \ldots, x_{n(P-1)}, 1) \\ \boldsymbol{a}_n^\intercal={}& (x_{n1}, \ldots, x_{n(P-1)}) \\ \boldsymbol{b}_n ={}& (1). \\ \end{aligned} \]

Let us ask: what is the effect on \(\boldsymbol{\beta}_a\) of including \(\boldsymbol{b}_n\) as a regressor?
The answer is given by the FWL theorem. Recall that

\[ \hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{Y}- \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a - \boldsymbol{X}_b \hat{\boldsymbol{\beta}}_b, \]

and that \(\boldsymbol{X}^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\), so \(\boldsymbol{X}_a^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\) and \(\boldsymbol{X}_b^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\). Recall also the definition of the projection matrix perpendicular to the span of \(\boldsymbol{X}_b\):

\[ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} := \boldsymbol{I}{} - \boldsymbol{X}_b (\boldsymbol{X}_b^\intercal\boldsymbol{X}_b)^{-1} \boldsymbol{X}_b^\intercal. \]

Applying \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}}\) to both sides of \(\hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}\) gives

\[ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \hat{\boldsymbol{\varepsilon}}= \hat{\boldsymbol{\varepsilon}}= \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}- \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a - \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_b \hat{\boldsymbol{\beta}}_b = \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}- \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a. \]

Exercise

Verify that \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \hat{\boldsymbol{\varepsilon}}= \hat{\boldsymbol{\varepsilon}}\) and \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_b \hat{\boldsymbol{\beta}}_b = \boldsymbol{0}\).
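The two identities can also be checked numerically. A sketch with simulated data (the dimensions and names are illustrative):

```python
# Numeric check of the exercise's identities: for the annihilator
# P = I - X_b (X_b'X_b)^{-1} X_b', we have P eps_hat = eps_hat and
# P X_b beta_b_hat = 0.
import numpy as np

rng = np.random.default_rng(2)
N = 60
Xa = rng.normal(size=(N, 2))
Xb = np.column_stack([np.ones(N), rng.normal(size=N)])
X = np.column_stack([Xa, Xb])
Y = rng.normal(size=N)

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
eps = Y - X @ beta
P = np.eye(N) - Xb @ np.linalg.solve(Xb.T @ Xb, Xb.T)

assert np.allclose(P @ eps, eps)            # residuals are orthogonal to X_b
assert np.allclose(P @ Xb @ beta[2:], 0.0)  # P annihilates the span of X_b
```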

Now applying \(\boldsymbol{X}_a^\intercal\) to both sides of the preceding expression, recalling that \(\boldsymbol{X}_a^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\) and that \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}}\) is symmetric and idempotent, gives

\[ \begin{aligned} \boldsymbol{0}= \boldsymbol{X}_a^\intercal\hat{\boldsymbol{\varepsilon}}={}& \boldsymbol{X}_a^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}- \boldsymbol{X}_a^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a \quad \Rightarrow \\ \left(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^\intercal \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a ={}& \left(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \end{aligned} \]

If we assume that \(\boldsymbol{X}\) is full-rank, then \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a\) must be full-rank as well, since otherwise some linear combination of the columns of \(\boldsymbol{X}_a\) would lie in the span of the columns of \(\boldsymbol{X}_b\), contradicting the full rank of \(\boldsymbol{X}\). Therefore we can invert to get

\[ \hat{\boldsymbol{\beta}}_a = \left((\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a)^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^{-1} \left(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \]

This is exactly the same as the linear regression

\[ \tilde{\boldsymbol{Y}} \sim \tilde{\boldsymbol{X}_a} \boldsymbol{\beta}_a \quad\textrm{where } \tilde{\boldsymbol{X}_a} := \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \textrm{ and } \tilde{\boldsymbol{Y}} := \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \]

That is, the OLS coefficient on \(\boldsymbol{X}_a\) is the same as projecting all the responses and regressors to a space orthogonal to \(\boldsymbol{X}_b\), and running ordinary regression.
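This equivalence is easy to verify numerically. A minimal sketch with simulated data (the partition sizes and names are illustrative):

```python
# FWL check: the coefficients on X_a from the full regression equal those
# from regressing P Y on P X_a, where P projects orthogonally to span(X_b).
import numpy as np

rng = np.random.default_rng(3)
N = 80
Xa = rng.normal(size=(N, 2))
Xb = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = rng.normal(size=N)

# Full regression on X = (X_a, X_b).
X = np.column_stack([Xa, Xb])
beta_full = np.linalg.lstsq(X, Y, rcond=None)[0]

# Partial out X_b, then regress the residualized Y on the residualized X_a.
P = np.eye(N) - Xb @ np.linalg.solve(Xb.T @ Xb, Xb.T)
beta_fwl = np.linalg.lstsq(P @ Xa, P @ Y, rcond=None)[0]

assert np.allclose(beta_full[:2], beta_fwl)
```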

See section 7.3 of Prof. Ding’s book for a more rigorous proof, which uses the Schur representation of sub-matrices of \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\).

The special case of a constant regressor

Suppose we want to regress \(y_n \sim \beta_0 + \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\). We’d like to know what \(\hat{\boldsymbol{\beta}}\) is and, in particular, what the effect of including a constant is.

We can answer this with the FWL theorem by taking \(\boldsymbol{b}_n = (1)\) and \(\boldsymbol{a}_n = \boldsymbol{x}_n\). Then \(\hat{\boldsymbol{\beta}}\) will be the same as in the regression

\[ \tilde{\boldsymbol{Y}} \sim \tilde{\boldsymbol{X}}_a \boldsymbol{\beta} \]

where \(\tilde{\boldsymbol{Y}} = \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}\) and \(\tilde{\boldsymbol{X}}_a = \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a\), with \(\boldsymbol{X}_a\) the matrix whose rows are the \(\boldsymbol{x}_n^\intercal\).

A particular special case is useful for intuition. Take \(\boldsymbol{b}_n = (1)\), the constant regressor alone. Then \(\boldsymbol{X}_b = \boldsymbol{1}\), and since \(\boldsymbol{1}^\intercal\boldsymbol{1}= \sum_{n=1}^N1 \cdot 1 = N\),

\[ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} = \boldsymbol{I}{} - \boldsymbol{1}(\boldsymbol{1}^\intercal\boldsymbol{1})^{-1} \boldsymbol{1}^\intercal= \boldsymbol{I}{} - \frac{1}{N} \boldsymbol{1}\boldsymbol{1}^\intercal. \]

Computing term by term,

\[ \begin{aligned} \boldsymbol{1}^\intercal\boldsymbol{Y}=& \sum_{n=1}^N1 \cdot y_n = \sum_{n=1}^Ny_n\\ \frac{1}{N} \boldsymbol{1}^\intercal\boldsymbol{Y}=& \frac{1}{N} \sum_{n=1}^N1 \cdot y_n = \frac{1}{N} \sum_{n=1}^Ny_n = \bar{y}\\ \boldsymbol{1}\frac{1}{N} \boldsymbol{1}^\intercal\boldsymbol{Y}=& \boldsymbol{1}\bar{y}= \begin{pmatrix} \bar{y}\\ \vdots \\ \bar{y} \end{pmatrix} \\ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}= \left(\boldsymbol{I}- \boldsymbol{1}\frac{1}{N} \boldsymbol{1}^\intercal\right) \boldsymbol{Y}=& \boldsymbol{Y}- \begin{pmatrix} \bar{y}\\ \vdots \\ \bar{y} \end{pmatrix} = \begin{pmatrix} y_1 - \bar{y}\\ y_2 - \bar{y}\\ \vdots \\ y_N - \bar{y} \end{pmatrix} \end{aligned} \]

The projection matrix \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}}\) thus simply centers a vector at its sample mean.
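The centering claim can be confirmed directly. A small sketch (the vector and its length are illustrative):

```python
# The annihilator of the constant, I - (1/N) 1 1', subtracts the sample mean.
import numpy as np

rng = np.random.default_rng(4)
N = 10
Y = rng.normal(size=N)

ones = np.ones((N, 1))
P = np.eye(N) - ones @ ones.T / N  # I - (1/N) 1 1'

assert np.allclose(P @ Y, Y - Y.mean())
```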

Similarly,

\[ \tilde{\boldsymbol{X}_a} := \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a = \boldsymbol{X}_a - \frac{1}{N} \boldsymbol{1}\boldsymbol{1}^\intercal\boldsymbol{X}_a = \boldsymbol{X}_a - \boldsymbol{1}\bar{x}^\intercal\\ \textrm{ where } \bar{x}^\intercal:= \begin{pmatrix} \frac{1}{N} \sum_{n=1}^Nx_{n1} & \ldots & \frac{1}{N} \sum_{n=1}^Nx_{n(P-1)} \end{pmatrix}, \]

so that the \(n\)–th row of \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a\) is simply \(\boldsymbol{x}_n^\intercal- \bar{x}^\intercal\), and each regressor is centered. So

\[ \hat{\boldsymbol{\beta}}_a = (\tilde{\boldsymbol{X}_a}^\intercal\tilde{\boldsymbol{X}_a})^{-1}\tilde{\boldsymbol{X}_a}^\intercal\tilde{\boldsymbol{Y}}, \]

the OLS estimator where both the regressors and responses have been centered at their sample means. In this case, by the LLN,

\[ \frac{1}{N} \tilde{\boldsymbol{X}_a}^\intercal\tilde{\boldsymbol{X}_a} \rightarrow \mathrm{Cov}\left(\boldsymbol{x}_n\right), \]

in contrast to the general case, where

\[ \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\rightarrow \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] = \mathrm{Cov}\left(\boldsymbol{x}_n\right) + \mathbb{E}\left[\boldsymbol{x}_n\right]\mathbb{E}\left[\boldsymbol{x}_n^\intercal\right]. \]

For this reason, when thinking about the sampling behavior of OLS coefficients where a constant is included in the regression, it’s enough to think about the covariance of the regressors, rather than the outer product.
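The two limits can be illustrated with simulated draws. A sketch, assuming iid Gaussian regressors with a nonzero mean (the mean vector, sample size, and tolerances are illustrative):

```python
# With centering, (1/N) X'X approaches Cov(x_n); without it, E[x_n x_n'],
# which differs by the outer product of the means.
import numpy as np

rng = np.random.default_rng(5)
N = 200_000
mean = np.array([1.0, -2.0])
X = rng.normal(size=(N, 2)) + mean  # E[x_n] = mean, Cov(x_n) = I

Xc = X - X.mean(axis=0)
assert np.allclose(Xc.T @ Xc / N, np.eye(2), atol=0.05)
assert np.allclose(X.T @ X / N, np.eye(2) + np.outer(mean, mean), atol=0.05)
```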

Exercise

Derive our simple least squares estimators \(\hat{\beta}_1\) and \(\hat{\beta}_2\) using the FWL theorem.