$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} 
\def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

The Frisch–Waugh–Lovell (FWL) theorem and coefficient interpretation.

Goals

  • Discuss the FWL theorem and some uses and consequences
    • Inclusion of a constant
    • Linear regression as the marginal association
    • The FWL theorem for visualization

Simple regression

Recall our formulas for simple linear regression. If \(\y_n \sim \beta_1 + \beta_2 \x_n\), then

\[ \begin{align*} \betahat_1 = \ybar - \betahat_2 \xbar \quad\textrm{and}\quad \betahat_2 ={} \frac{\sumn \left( \y_n - \ybar \right) \left(\x_n - \xbar \right)} {\sumn \left( \x_n - \xbar \right)^2}. \end{align*} \]

Interestingly, if we define \(\y'_n = \y_n - \ybar\) and \(\x'_n = \x_n - \xbar\), the “de-meaned” or “centered” versions of the response and regressor, then we could also have computed the constant-free regression \(\y'_n \sim \gamma_2 \x'_n\) and gotten the same slope:

\[ \gammahat_2 = \frac{\sumn \y'_n \x'_n}{\sumn \x'_n \x'_n} = \frac{\sumn \left( \y_n - \ybar \right) \left(\x_n - \xbar \right)} {\sumn \left( \x_n - \xbar \right)^2} = \betahat_2. \]

Is this a coincidence? We’ll see today that it’s not.
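To make this concrete, here is a minimal numerical sketch (not part of the original notes; it assumes numpy is available, and the simulated data and seed are purely illustrative). It checks that the slope from the regression \(\y_n \sim \beta_1 + \beta_2 \x_n\) matches the slope from regressing the centered response on the centered regressor with no constant.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 2.0 - 3.0 * x + rng.normal(size=N)  # illustrative simulated data

# Regression y ~ 1 + x, fit by least squares.
X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # [intercept, slope]

# Regression of centered y on centered x, with no constant.
gamma2_hat = np.linalg.lstsq((x - x.mean())[:, None], y - y.mean(), rcond=None)[0][0]

# The slopes agree, and the intercept satisfies its closed-form expression.
assert np.allclose(beta_hat[1], gamma2_hat)
assert np.allclose(beta_hat[0], y.mean() - beta_hat[1] * x.mean())
```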

Correlated regressors

Take \(\X = (\xv_1, \ldots, \xv_P)\), where \(\xv_1 = \onev\), so that we are regressing on \(P-1\) regressors and a constant. If the regressors are all orthogonal to one another (\(\xv_k^\trans \xv_j = 0\) for \(k \ne j\)), then we know that

\[ \betahat = (\X^\trans\X)^{-1} \X^\trans \Y = \begin{pmatrix} \xv_1^\trans \xv_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \xv_P^\trans \xv_P \\ \end{pmatrix}^{-1} \begin{pmatrix} \xv_1^\trans \Y \\ \vdots \\ \xv_P^\trans \Y \\ \end{pmatrix} = \begin{pmatrix} \frac{\xv_1^\trans \Y}{\xv_1^\trans \xv_1} \\ \vdots \\ \frac{\xv_P^\trans \Y}{\xv_P^\trans \xv_P} \\ \end{pmatrix}. \]
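As a quick sanity check (a sketch, again assuming numpy; the simulated design is illustrative), one can build a design whose columns are mutually orthogonal, including a constant column, and verify that each OLS coefficient is just the one-dimensional projection \(\xv_k^\trans \Y / \xv_k^\trans \xv_k\).

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 4

# Build a design with mutually orthogonal columns, the first being the constant.
ones = np.ones(N)
Z = rng.normal(size=(N, P - 1))
Z = Z - Z.mean(axis=0)          # centering makes each column orthogonal to ones
Q, _ = np.linalg.qr(Z)          # QR makes the remaining columns mutually orthogonal
X = np.column_stack([ones, Q])
Y = rng.normal(size=N)

# X^T X is diagonal, so each coefficient is a one-dimensional projection.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
per_column = (X.T @ Y) / np.sum(X ** 2, axis=0)  # x_k^T Y / x_k^T x_k
assert np.allclose(beta_hat, per_column)
```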

Exercise

What is the limiting behavior of \(\frac{1}{N}\X^\trans \X\) when the components of \(\xv_n\) are independent of one another? What if they are independent and \(\expect{\xv_n} = \zerov\), except for a constant component \(\x_{n1} = 1\)?

However, typically the regressors are not orthogonal to one another. When they are not, we can ask

  • How can we interpret the coefficients?
  • How does the relation between the regressors affect the \(\betavhat\) covariance matrix?

The FWL theorem

The FWL theorem gives an expression for sub-vectors of \(\betavhat\). Specifically, let’s partition our regressors into two sets:

\(\y_n \sim \xv_n^\trans \betav = \av_{n}^\trans \betav_a + \bv_{n}^\trans \betav_b\),

where \(\betav^\trans = (\betav_a^\trans, \betav_b^\trans)\) and \(\xv_n^\trans = (\av_n^\trans, \bv_n^\trans)\). We can similarly partition the regressor matrix into two blocks: \[ \X = (\X_a \, \X_b). \]

A particular example to keep in mind is where

\[ \begin{aligned} \xv_n^\trans ={}& (\x_{n1}, \ldots, \x_{n(P-1)}, 1) \\ \av_n^\trans ={}& (\x_{n1}, \ldots, \x_{n(P-1)}) \\ \bv_n ={}& (1). \\ \end{aligned} \]

Let us ask: what is the effect on \(\betavhat_a\) of including \(\bv_n\) as a regressor?
The answer is given by the FWL theorem. Recall that

\[ \resvhat = \Y - \X\betavhat = \Y - \X_a \betavhat_a - \X_b \betavhat_b, \]

and that \(\X^\trans \resvhat = \zerov\), so \(\X_a^\trans \resvhat = \zerov\) and \(\X_b^\trans \resvhat = \zerov\). Recall also the definition of the projection matrix perpendicular to the span of \(\X_b\):

\[ \proj{\X_b^\perp} := \id{} - \X_b (\X_b^\trans \X_b)^{-1} \X_b^\trans. \]

Applying \(\proj{\X_b^\perp}\) to both sides of \(\resvhat = \Y - \X\betavhat\) gives

\[ \proj{\X_b^\perp} \resvhat = \resvhat = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a - \proj{\X_b^\perp} \X_b \betavhat_b = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a. \]

Exercise

Verify that \(\proj{\X_b^\perp} \resvhat = \resvhat\) and \(\proj{\X_b^\perp} \X_b \betavhat_b = \zerov\).

Now applying \(\X_a^\trans\) to both sides of the preceding expression, recalling that \(\X_a^\trans \resvhat = \zerov\) and that \(\proj{\X_b^\perp}\) is symmetric and idempotent, gives

\[ \begin{aligned} \X_a^\trans \resvhat ={}& \zerov = \X_a^\trans \proj{\X_b^\perp} \Y - \X_a^\trans \proj{\X_b^\perp} \X_a \betavhat_a \quad \Rightarrow \\ \left(\proj{\X_b^\perp} \X_a \right)^\trans \proj{\X_b^\perp} \X_a \betavhat_a ={}& \left(\proj{\X_b^\perp} \X_a \right)^\trans \proj{\X_b^\perp} \Y \end{aligned} \]

If we assume that \(\X\) is full-rank, then \(\proj{\X_b^\perp} \X_a\) must be full-rank as well: otherwise some nonzero linear combination of the columns of \(\X_a\) would lie in the span of the columns of \(\X_b\), contradicting the full rank of \(\X\). Therefore we can invert to get

\[ \betavhat_a = \left((\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \X_a \right)^{-1} \left(\proj{\X_b^\perp} \X_a \right)^\trans \proj{\X_b^\perp} \Y. \]

This is exactly the same as the linear regression

\[ \tilde{\Y} \sim \tilde{\X_a} \betav_a \quad\textrm{where } \tilde{\X_a} := \proj{\X_b^\perp} \X_a \textrm{ and } \tilde{\Y} := \proj{\X_b^\perp} \Y. \]

That is, the OLS coefficient on \(\X_a\) is the same as projecting all the responses and regressors to a space orthogonal to \(\X_b\), and running ordinary regression.
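The following is a small numerical sketch of the FWL theorem (not from the original notes; numpy and the simulated design are assumed purely for illustration). It checks that the coefficients on \(\X_a\) from the full regression equal the coefficients from regressing \(\proj{\X_b^\perp} \Y\) on \(\proj{\X_b^\perp} \X_a\).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X_a = rng.normal(size=(N, 3))
X_b = np.column_stack([np.ones(N), rng.normal(size=N)])  # includes a constant
X = np.column_stack([X_a, X_b])
Y = X @ rng.normal(size=5) + rng.normal(size=N)

# Coefficients on X_a from the full regression.
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
beta_a_full = beta_hat[:3]

# FWL: project Y and X_a orthogonally to the span of X_b, then run OLS.
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
beta_a_fwl = np.linalg.lstsq(P_perp @ X_a, P_perp @ Y, rcond=None)[0]

assert np.allclose(beta_a_full, beta_a_fwl)
```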

See section 7.3 of Prof. Ding’s book for a more rigorous proof, which uses the Schur complement representation of sub-matrices of \((\X^\trans \X)^{-1}\).

The special case of a constant regressor

Suppose we want to regress \(\y_n \sim \beta_0 + \betav^\trans \xv_n\). We’d like to know what \(\betavhat\) is and, in particular, what the effect of including a constant is.

We can answer this with the FWL theorem by taking \(\bv_n = (1)\) and \(\av_n = \xv_n\). Then \(\betavhat\) will be the same as in the regression

\[ \tilde{\Y} \sim \tilde{\X_a} \betav \]

where \(\tilde{\Y} = \proj{\X_b^\perp} \Y\) and \(\tilde{\X_a} = \proj{\X_b^\perp} \X_a\).

This special case is useful for intuition. Since \(\bv_n\) is simply the constant regressor, \(1\), we have \(\X_b = \onev\), and

\[ \proj{\X_b^\perp} = \id{} - \onev (\onev^\trans \onev)^{-1} \onev^\trans = \id{} - \frac{1}{N} \onev \onev^\trans. \]

Here we used \(\onev^\trans \onev = \sumn 1 \cdot 1 = N\).

Then, for any \(\Y\), we can compute

\[ \begin{aligned} \onev^\trans \Y =& \sumn 1 \cdot \y_n = \sumn \y_n\\ \frac{1}{N} \onev^\trans \Y =& \meann 1 \cdot \y_n = \meann \y_n = \ybar\\ \onev \frac{1}{N} \onev^\trans \Y =& \onev \ybar = \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} \\ \proj{\X_b^\perp} \Y = \left(\id - \onev \frac{1}{N} \onev^\trans \right) \Y =& \Y - \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} = \begin{pmatrix} y_1 - \ybar \\ y_2 - \ybar \\ \vdots \\ y_N - \ybar \end{pmatrix} \end{aligned} \]

The projection matrix \(\proj{\X_b^\perp}\) thus simply centers a vector at its sample mean.
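A tiny numerical sketch of this fact (assuming numpy; the vector is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
y = rng.normal(size=N)

# The projection orthogonal to the constant vector subtracts the sample mean.
P_perp = np.eye(N) - np.ones((N, N)) / N   # I - (1/N) 1 1^T
assert np.allclose(P_perp @ y, y - y.mean())
```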

Similarly,

\[ \tilde{\X_a} := \proj{\X_b^\perp} \X_a = \X_a - \frac{1}{N} \onev \onev^\trans \X_a = \X_a - \onev \xbar^\trans \\ \textrm{ where } \xbar^\trans := \begin{pmatrix} \meann \x_{n1} & \ldots & \meann \x_{n(P-1)} \end{pmatrix}, \]

so that the \(n\)–th row of \(\proj{\X_b^\perp} \X_a\) is simply \(\xv_n^\trans - \xbar^\trans\), and each regressor is centered. So

\[ \betavhat_a = (\tilde{\X_a}^\trans \tilde{\X_a})^{-1} \tilde{\X_a}^\trans \tilde{\Y}, \]

the OLS estimator where both the regressors and responses have been centered at their sample means. In this case, by the LLN,

\[ \frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a} \rightarrow \cov{\xv_n}, \]

in contrast to the general case, where

\[ \frac{1}{N} \X^\trans \X \rightarrow \expect{\xv_n \xv_n^\trans} = \cov{\xv_n} + \expect{\xv_n}\expect{\xv_n^\trans}. \]

For this reason, when thinking about the sampling behavior of OLS coefficients where a constant is included in the regression, it’s enough to think about the covariance of the regressors, rather than the outer product.
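The sketch below (again assuming numpy, with an illustrative simulated design) checks both claims: centering the responses and regressors and running OLS with no constant reproduces the slope coefficients from the regression with an intercept, and \(\frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a}\) is exactly the (biased) sample covariance matrix of the regressors.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
X_a = rng.normal(size=(N, 3)) + np.array([1.0, -2.0, 0.5])   # nonzero means
Y = 1.0 + X_a @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=N)

# Slope coefficients from the regression with an intercept...
full = np.linalg.lstsq(np.column_stack([np.ones(N), X_a]), Y, rcond=None)[0]

# ...equal those from the fully centered, constant-free regression.
X_tilde = X_a - X_a.mean(axis=0)
Y_tilde = Y - Y.mean()
centered = np.linalg.lstsq(X_tilde, Y_tilde, rcond=None)[0]
assert np.allclose(full[1:], centered)

# (1/N) X~^T X~ is exactly the biased sample covariance of the regressors,
# which converges to Cov(x_n) by the law of large numbers.
assert np.allclose(X_tilde.T @ X_tilde / N, np.cov(X_a, rowvar=False, bias=True))
```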

Exercise

Derive our simple least squares estimator \(\betahat_1\) using the FWL theorem.