The FWL theorem and coefficient interpretation.

Goals

  • Discuss the FWL theorem and some uses and consequences
    • Inclusion of a constant
    • Linear regression as the marginal association
    • The FWL theorem for visualization

Simple regression

Recall our formulas for simple linear regression. If \(y_n \sim \beta_1 + \beta_2 x_n\), then

\[ \begin{align*} \hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x} \quad\textrm{and}\quad \hat{\beta}_2 ={} \frac{\sum_{n=1}^N\left( y_n - \bar{y}\right) \left(x_n - \bar{x}\right)} {\sum_{n=1}^N\left( x_n - \bar{x}\right)^2}. \end{align*} \]

Interestingly, note that if we define \(y'_n = y_n - \bar{y}\) and \(x'_n = x_n - \bar{x}\), the “de–meaned” or “centered” versions of the response and regressor, then we could also have computed the regression \(y'_n \sim \gamma_2 x'_n\) and gotten the same answer:

\[ \hat{\gamma}_2 = \frac{\sum_{n=1}^Ny'_n x'_n}{\sum_{n=1}^Nx'_n x'_n} = \frac{\sum_{n=1}^N\left( y_n - \bar{y}\right) \left(x_n - \bar{x}\right)} {\sum_{n=1}^N\left( x_n - \bar{x}\right)^2} = \hat{\beta}_2. \]

Is this a coincidence? We’ll see today that it’s not.
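Before moving on, the equivalence is easy to check numerically. The following is a minimal sketch with simulated data (all variable names and the simulated coefficients are illustrative, not from the lecture):

```python
# Check: the slope from regressing centered y on centered x (no intercept)
# matches the slope beta_2 from the full simple regression y ~ beta_1 + beta_2 x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Full regression on (1, x) via least squares.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Regression of centered y on centered x, with no intercept.
xc, yc = x - x.mean(), y - y.mean()
gamma2 = (xc @ yc) / (xc @ xc)

assert np.isclose(beta[1], gamma2)
```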

Correlated regressors

Take \(\boldsymbol{X}= (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_P)\), where \(\boldsymbol{x}_1 = \boldsymbol{1}\), so that we are regressing on \(P-1\) regressors and a constant. If the regressors are all orthogonal to one another (\(\boldsymbol{x}_k^\intercal\boldsymbol{x}_j = 0\) for \(k \ne j\)), then we know that

\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \boldsymbol{x}_1^\intercal\boldsymbol{x}_1 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \boldsymbol{x}_P^\intercal\boldsymbol{x}_P \\ \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{x}_1^\intercal\boldsymbol{Y}\\ \vdots \\ \boldsymbol{x}_P^\intercal\boldsymbol{Y}\\ \end{pmatrix} = \begin{pmatrix} \frac{\boldsymbol{x}_1^\intercal\boldsymbol{Y}}{\boldsymbol{x}_1^\intercal\boldsymbol{x}_1} \\ \vdots \\ \frac{\boldsymbol{x}_P^\intercal\boldsymbol{Y}}{\boldsymbol{x}_P^\intercal\boldsymbol{x}_P} \\ \end{pmatrix}. \]
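This reduction to per-column ratios can be verified directly. A sketch with simulated data (the names and the QR construction of orthogonal columns are illustrative):

```python
# With mutually orthogonal columns, multivariate OLS reduces to the
# per-column ratios x_k'Y / (x_k'x_k).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
Q, _ = np.linalg.qr(A)  # Q has orthonormal (hence mutually orthogonal) columns
Y = rng.normal(size=50)

beta_ols = np.linalg.lstsq(Q, Y, rcond=None)[0]
beta_ratio = (Q.T @ Y) / np.sum(Q**2, axis=0)  # x_k'Y / x_k'x_k for each k

assert np.allclose(beta_ols, beta_ratio)
```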

Exercise

What is the limiting behavior of \(\frac{1}{N}\boldsymbol{X}^\intercal\boldsymbol{X}\) when the components of \(\boldsymbol{x}_n\) are independent of one another? What if they are independent and \(\mathbb{E}\left[\boldsymbol{x}_n\right] = 0\), except for a constant \(\boldsymbol{x}_{n1} = 1\)?

However, typically the regressors are not orthogonal to one another. When they are not, we can ask

  • How can we interpret the coefficients?
  • How does the relation between the regressors affect the \(\hat{\boldsymbol{\beta}}\) covariance matrix?

The FWL theorem

The FWL theorem gives an expression for sub-vectors of \(\hat{\boldsymbol{\beta}}\). Specifically, let’s partition our regressors into two sets:

\(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}= \boldsymbol{a}_{n}^\intercal\boldsymbol{\beta}_a + \boldsymbol{b}_{n}^\intercal\boldsymbol{\beta}_b\),

where \(\boldsymbol{\beta}^\intercal= (\boldsymbol{\beta}_a^\intercal, \boldsymbol{\beta}_b^\intercal)\) and \(\boldsymbol{x}_n^\intercal= (\boldsymbol{a}_n^\intercal, \boldsymbol{b}_n^\intercal)\). We can similarly partition our regressor matrix into two blocks: \[ \boldsymbol{X}= (\boldsymbol{X}_a \, \boldsymbol{X}_b). \]

A particular example to keep in mind is where

\[ \begin{aligned} \boldsymbol{x}_n^\intercal={}& (x_{n1}, \ldots, x_{n(P-1)}, 1) \\ \boldsymbol{a}_n^\intercal={}& (x_{n1}, \ldots, x_{n(P-1)}) \\ \boldsymbol{b}_n ={}& (1). \\ \end{aligned} \]

Let us ask: what is the effect on \(\boldsymbol{\beta}_a\) of including \(\boldsymbol{b}_n\) as a regressor?
The answer is given by the FWL theorem. Recall that

\[ \hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{Y}- \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a - \boldsymbol{X}_b \hat{\boldsymbol{\beta}}_b, \]

and that \(\boldsymbol{X}^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\), so \(\boldsymbol{X}_a^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\) and \(\boldsymbol{X}_b^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\). Recall also the definition of the projection matrix perpendicular to the span of \(\boldsymbol{X}_b\):

\[ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} := \boldsymbol{I}{} - \boldsymbol{X}_b (\boldsymbol{X}_b^\intercal\boldsymbol{X}_b)^{-1} \boldsymbol{X}_b^\intercal. \]

Applying \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}}\) to both sides of \(\hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}\) gives

\[ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \hat{\boldsymbol{\varepsilon}}= \hat{\boldsymbol{\varepsilon}}= \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}- \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a - \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_b \hat{\boldsymbol{\beta}}_b = \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}- \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a. \]

Exercise

Verify that \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \hat{\boldsymbol{\varepsilon}}= \hat{\boldsymbol{\varepsilon}}\) and \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_b \hat{\boldsymbol{\beta}}_b = \boldsymbol{0}\).
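The two identities can also be checked numerically. A sketch with simulated data (the dimensions and names are illustrative):

```python
# Numeric check of the exercise's identities: for the annihilator
# P = I - X_b (X_b'X_b)^{-1} X_b', we have P eps_hat = eps_hat and
# P X_b beta_b_hat = 0.
import numpy as np

rng = np.random.default_rng(2)
N = 60
Xa = rng.normal(size=(N, 2))
Xb = np.column_stack([np.ones(N), rng.normal(size=N)])
X = np.column_stack([Xa, Xb])
Y = rng.normal(size=N)

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
eps = Y - X @ beta
P = np.eye(N) - Xb @ np.linalg.solve(Xb.T @ Xb, Xb.T)

assert np.allclose(P @ eps, eps)            # residuals are orthogonal to X_b
assert np.allclose(P @ Xb @ beta[2:], 0.0)  # P annihilates the span of X_b
```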

Now applying \(\boldsymbol{X}_a^\intercal\) to both sides of the preceding expression, recalling that \(\boldsymbol{X}_a^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\) and that \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}}\) is symmetric and idempotent, gives

\[ \begin{aligned} \boldsymbol{0}= \boldsymbol{X}_a^\intercal\hat{\boldsymbol{\varepsilon}}={}& \boldsymbol{X}_a^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}- \boldsymbol{X}_a^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a \quad \Rightarrow \\ \left(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^\intercal \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \hat{\boldsymbol{\beta}}_a ={}& \left(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \end{aligned} \]

If we assume that \(\boldsymbol{X}\) is full-rank, then \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a\) must be full-rank as well, since otherwise some linear combination of the columns of \(\boldsymbol{X}_a\) would lie in the span of the columns of \(\boldsymbol{X}_b\), contradicting the full rank of \(\boldsymbol{X}\). Therefore we can invert to get

\[ \hat{\boldsymbol{\beta}}_a = \left((\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a)^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^{-1} \left(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \right)^\intercal\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \]

This is exactly the same as the linear regression

\[ \tilde{\boldsymbol{Y}} \sim \tilde{\boldsymbol{X}_a} \boldsymbol{\beta}_a \quad\textrm{where } \tilde{\boldsymbol{X}_a} := \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a \textrm{ and } \tilde{\boldsymbol{Y}} := \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \]

That is, the OLS coefficient on \(\boldsymbol{X}_a\) is the same as projecting all the responses and regressors to a space orthogonal to \(\boldsymbol{X}_b\), and running ordinary regression.
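This equivalence is easy to verify numerically. A minimal sketch with simulated data (the partition sizes and names are illustrative):

```python
# FWL check: the coefficients on X_a from the full regression equal those
# from regressing P Y on P X_a, where P projects orthogonally to span(X_b).
import numpy as np

rng = np.random.default_rng(3)
N = 80
Xa = rng.normal(size=(N, 2))
Xb = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = rng.normal(size=N)

# Full regression on X = (X_a, X_b).
X = np.column_stack([Xa, Xb])
beta_full = np.linalg.lstsq(X, Y, rcond=None)[0]

# Partial out X_b, then regress the residualized Y on the residualized X_a.
P = np.eye(N) - Xb @ np.linalg.solve(Xb.T @ Xb, Xb.T)
beta_fwl = np.linalg.lstsq(P @ Xa, P @ Y, rcond=None)[0]

assert np.allclose(beta_full[:2], beta_fwl)
```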

See section 7.3 of Prof. Ding’s book for a more rigorous proof, which uses the Schur representation of sub-matrices of \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\).

The special case of a constant regressor

Suppose we want to regress \(y_n \sim \beta_0 + \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\). We’d like to know what \(\hat{\boldsymbol{\beta}}\) is and, in particular, what the effect of including a constant is.

We can answer this with the FWL theorem by taking \(\boldsymbol{b}_n = (1)\) and \(\boldsymbol{a}_n = \boldsymbol{x}_n\). Then \(\hat{\boldsymbol{\beta}}\) will be the same as in the regression

\[ \tilde{\boldsymbol{Y}} \sim \tilde{\boldsymbol{X}}_a \boldsymbol{\beta} \]

where \(\tilde{\boldsymbol{Y}} = \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}\) and \(\tilde{\boldsymbol{X}}_a = \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a\), with \(\boldsymbol{X}_a\) the matrix whose rows are the \(\boldsymbol{x}_n^\intercal\).

A particular special case is useful for intuition. Take \(\boldsymbol{b}_n = (1)\), the constant regressor alone. Then \(\boldsymbol{X}_b = \boldsymbol{1}\), and since \(\boldsymbol{1}^\intercal\boldsymbol{1}= \sum_{n=1}^N1 \cdot 1 = N\),

\[ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} = \boldsymbol{I}{} - \boldsymbol{1}(\boldsymbol{1}^\intercal\boldsymbol{1})^{-1} \boldsymbol{1}^\intercal= \boldsymbol{I}{} - \frac{1}{N} \boldsymbol{1}\boldsymbol{1}^\intercal. \]

Computing term by term,

\[ \begin{aligned} \boldsymbol{1}^\intercal\boldsymbol{Y}=& \sum_{n=1}^N1 \cdot y_n = \sum_{n=1}^Ny_n\\ \frac{1}{N} \boldsymbol{1}^\intercal\boldsymbol{Y}=& \frac{1}{N} \sum_{n=1}^N1 \cdot y_n = \frac{1}{N} \sum_{n=1}^Ny_n = \bar{y}\\ \boldsymbol{1}\frac{1}{N} \boldsymbol{1}^\intercal\boldsymbol{Y}=& \boldsymbol{1}\bar{y}= \begin{pmatrix} \bar{y}\\ \vdots \\ \bar{y} \end{pmatrix} \\ \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{Y}= \left(\boldsymbol{I}- \boldsymbol{1}\frac{1}{N} \boldsymbol{1}^\intercal\right) \boldsymbol{Y}=& \boldsymbol{Y}- \begin{pmatrix} \bar{y}\\ \vdots \\ \bar{y} \end{pmatrix} = \begin{pmatrix} y_1 - \bar{y}\\ y_2 - \bar{y}\\ \vdots \\ y_N - \bar{y} \end{pmatrix} \end{aligned} \]

The projection matrix \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}}\) thus simply centers a vector at its sample mean.
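The centering claim can be confirmed directly. A small sketch (the vector and its length are illustrative):

```python
# The annihilator of the constant, I - (1/N) 1 1', subtracts the sample mean.
import numpy as np

rng = np.random.default_rng(4)
N = 10
Y = rng.normal(size=N)

ones = np.ones((N, 1))
P = np.eye(N) - ones @ ones.T / N  # I - (1/N) 1 1'

assert np.allclose(P @ Y, Y - Y.mean())
```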

Similarly,

\[ \tilde{\boldsymbol{X}_a} := \underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a = \boldsymbol{X}_a - \frac{1}{N} \boldsymbol{1}\boldsymbol{1}^\intercal\boldsymbol{X}_a = \boldsymbol{X}_a - \boldsymbol{1}\bar{x}^\intercal\\ \textrm{ where } \bar{x}^\intercal:= \begin{pmatrix} \frac{1}{N} \sum_{n=1}^Nx_{n1} & \ldots & \frac{1}{N} \sum_{n=1}^Nx_{n(P-1)} \end{pmatrix}, \]

so that the \(n\)–th row of \(\underset{\boldsymbol{X}_b^\perp}{\boldsymbol{P}} \boldsymbol{X}_a\) is simply \(\boldsymbol{x}_n^\intercal- \bar{x}^\intercal\), and each regressor is centered. So

\[ \hat{\boldsymbol{\beta}}_a = (\tilde{\boldsymbol{X}_a}^\intercal\tilde{\boldsymbol{X}_a})^{-1}\tilde{\boldsymbol{X}_a}^\intercal\tilde{\boldsymbol{Y}}, \]

the OLS estimator where both the regressors and responses have been centered at their sample means. In this case, by the LLN,

\[ \frac{1}{N} \tilde{\boldsymbol{X}_a}^\intercal\tilde{\boldsymbol{X}_a} \rightarrow \mathrm{Cov}\left(\boldsymbol{x}_n\right), \]

in contrast to the general case, where

\[ \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\rightarrow \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] = \mathrm{Cov}\left(\boldsymbol{x}_n\right) + \mathbb{E}\left[\boldsymbol{x}_n\right]\mathbb{E}\left[\boldsymbol{x}_n^\intercal\right]. \]

For this reason, when thinking about the sampling behavior of OLS coefficients where a constant is included in the regression, it’s enough to think about the covariance of the regressors, rather than the outer product.
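The two limits can be illustrated with simulated draws. A sketch, assuming iid Gaussian regressors with a nonzero mean (the mean vector, sample size, and tolerances are illustrative):

```python
# With centering, (1/N) X'X approaches Cov(x_n); without it, E[x_n x_n'],
# which differs by the outer product of the means.
import numpy as np

rng = np.random.default_rng(5)
N = 200_000
mean = np.array([1.0, -2.0])
X = rng.normal(size=(N, 2)) + mean  # E[x_n] = mean, Cov(x_n) = I

Xc = X - X.mean(axis=0)
assert np.allclose(Xc.T @ Xc / N, np.eye(2), atol=0.05)
assert np.allclose(X.T @ X / N, np.eye(2) + np.outer(mean, mean), atol=0.05)
```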

Exercise

Derive our simple least squares estimators \(\hat{\beta}_1\) and \(\hat{\beta}_2\) using the FWL theorem.