$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Interpreting the coefficients: The FWL theorem


Goals

  • Interpret the OLS errors and estimates for multiple linear regression
    • The Frisch–Waugh–Lovell (FWL) theorem (section 7 of Prof. Ding’s lecture notes)
    • The role of regressor covariance in the OLS standard errors

Correlated regressors

Take \(\X = (\xv_1, \ldots, \xv_P)\), where \(\xv_1 = \onev\), so that we are regressing on \(P-1\) regressors and a constant. If the regressors are all orthogonal to one another (\(\xv_k^\trans \xv_j = 0\) for \(k \ne j\)), then we know that

\[ \betavhat = (\X^\trans\X)^{-1} \X^\trans \Y = \begin{pmatrix} \xv_1^\trans \xv_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \xv_P^\trans \xv_P \end{pmatrix}^{-1} \begin{pmatrix} \xv_1^\trans \Y \\ \vdots \\ \xv_P^\trans \Y \end{pmatrix} = \begin{pmatrix} \frac{\xv_1^\trans \Y}{\xv_1^\trans \xv_1} \\ \vdots \\ \frac{\xv_P^\trans \Y}{\xv_P^\trans \xv_P} \end{pmatrix}. \]
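
To make this concrete, here is a minimal numpy sketch (my own illustration, not part of the original notes; the design matrix below is made up): when the columns are orthogonal, the full OLS solution coincides with the column-by-column formula above.

```python
import numpy as np

# Check that, with orthogonal columns, each OLS coefficient reduces to
# x_k' y / (x_k' x_k).  (Illustrative sketch with a made-up design.)
rng = np.random.default_rng(0)
N = 100

# Build an orthogonal design: center random columns (so they are orthogonal to
# the constant) and orthonormalize them with a QR decomposition.
raw = rng.normal(size=(N, 3))
q, _ = np.linalg.qr(raw - raw.mean(axis=0))
X = np.column_stack([np.ones(N), q])

y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

beta_full = np.linalg.solve(X.T @ X, X.T @ y)
beta_columnwise = np.array([x @ y / (x @ x) for x in X.T])
print(np.allclose(beta_full, beta_columnwise))  # True
```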

Exercise

What is the limiting behavior of \(\frac{1}{N}\X^\trans \X\) when the components of \(\xv_n\) are independent of one another? What if they are independent and \(\expect{\xv_n} = \zerov\), except for a constant component \(\x_{n1} = 1\)?

However, typically the regressors are not orthogonal to one another. When they are not, we can ask

  • How can we interpret the coefficients?
  • How does the relation between the regressors affect the \(\betavhat\) covariance matrix?

For simplicity, the remainder of this lecture will assume homoskedastic errors and random regressors.

The FWL theorem

The FWL theorem gives an expression for sub-vectors of \(\betavhat\). Specifically, let’s partition our regressors into two sets:

\(\y_n \sim \xv_n^\trans \betav = \av_{n}^\trans \betav_a + \bv_{n}^\trans \betav_b\),

where \(\betav^\trans = (\betav_a^\trans, \betav_b^\trans)\) and \(\xv_n^\trans = (\av_n^\trans, \bv_n^\trans)\). We can similarly partition our regressor matrix into two blocks \[ \X = (\X_a \, \X_b). \]

A particular example to keep in mind is where

\[ \begin{aligned} \xv_n^\trans ={}& (\x_{n1}, \ldots, \x_{n(P-1)}, 1) \\ \bv_n ={}& (1) \\ \av_n^\trans ={}& (\x_{n1}, \ldots, \x_{n(P-1)}), \\ \end{aligned} \]

so that \(\bv_n\) contains only the constant regressor.

Let us ask: what is the effect on \(\betavhat_a\) of including \(\bv_n\) as a regressor? The answer is given by the FWL theorem. Recall that

\[ \resvhat = \Y - \X\betavhat = \Y - \X_a \betavhat_a - \X_b \betavhat_b, \]

and that \(\X^\trans \resvhat = \zerov\), so \(\X_a^\trans \resvhat = \zerov\) and \(\X_b^\trans \resvhat = \zerov\). Recall also the definition of the projection matrix onto the orthogonal complement of the span of \(\X_b\):

\[ \proj{\X_b^\perp} := \id{} - \X_b (\X_b^\trans \X_b)^{-1} \X_b^\trans. \]

Applying \(\proj{\X_b^\perp}\) to both sides of \(\resvhat = \Y - \X\betavhat\) gives

\[ \proj{\X_b^\perp} \resvhat = \resvhat = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a - \proj{\X_b^\perp} \X_b \betavhat_b = \proj{\X_b^\perp} \Y - \proj{\X_b^\perp} \X_a \betavhat_a. \]

Exercise

Verify that \(\proj{\X_b^\perp} \resvhat = \resvhat\) and \(\proj{\X_b^\perp} \X_b \betavhat_b = \zerov\).

Now applying \((\proj{\X_b^\perp} \X_a)^\trans\) to both sides of the preceding expression gives

\[ \begin{aligned} \X_a^\trans \resvhat ={}& \zerov = \X_a^\trans \proj{\X_b^\perp} \Y - \X_a^\trans \proj{\X_b^\perp} \X_a \betavhat_a \quad \Rightarrow \\ (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \X_a \betavhat_a ={}& (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \Y \end{aligned} \]

If we assume that \(\X\) is full-rank, then \(\proj{\X_b^\perp} \X_a\) must be full-rank as well: otherwise, some nonzero linear combination of the columns of \(\X_a\) would lie in the span of the columns of \(\X_b\), contradicting the assumption that \(\X\) is full-rank. Therefore we can invert to get

\[ \betavhat_a = \left((\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \X_a \right)^{-1} (\proj{\X_b^\perp} \X_a)^\trans \proj{\X_b^\perp} \Y. \]

This is exactly the OLS estimator from the linear regression

\[ \tilde{\Y} \sim \tilde{\X_a} \betav_a \quad\textrm{where } \tilde{\X_a} := \proj{\X_b^\perp} \X_a \textrm{ and } \tilde{\Y} := \proj{\X_b^\perp} \Y. \]

That is, the OLS coefficient on \(\X_a\) is the same as the one obtained by projecting both the responses and the regressors onto the space orthogonal to the span of \(\X_b\) and then running an ordinary regression.

This means the value of \(\betavhat_a\) depends on which other regressors \(\X_b\) are included in the regression: \(\betavhat_a\) is the coefficient on only the part of \(\X_a\) that is orthogonal to the span of \(\X_b\).
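
The following is a small numpy sketch (my own illustration, with a made-up design) verifying the FWL theorem numerically: the \(\X_a\) block of the full OLS fit matches the OLS fit of \(\proj{\X_b^\perp}\Y\) on \(\proj{\X_b^\perp}\X_a\).

```python
import numpy as np

# Numerical check of the FWL theorem: the X_a coefficients from the full
# regression equal the coefficients from regressing the projected response on
# the projected X_a.
rng = np.random.default_rng(1)
N, P_a, P_b = 200, 3, 2

X_a = rng.normal(size=(N, P_a))
X_b = np.column_stack([np.ones(N), rng.normal(size=(N, P_b - 1))])
X = np.column_stack([X_a, X_b])
y = X @ rng.normal(size=P_a + P_b) + rng.normal(size=N)

# Coefficients on X_a from the full regression.
beta_a_full = np.linalg.solve(X.T @ X, X.T @ y)[:P_a]

# FWL: project orthogonally to the span of X_b, then run OLS.
P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
X_a_tilde, y_tilde = P_perp @ X_a, P_perp @ y
beta_a_fwl = np.linalg.solve(X_a_tilde.T @ X_a_tilde, X_a_tilde.T @ y_tilde)

print(np.allclose(beta_a_full, beta_a_fwl))  # True
```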

The special case of a constant regressor

Suppose we want to regress \(\y_n \sim \beta_0 + \betav^\trans \xv_n\). We’d like to know what \(\betavhat\) is and, in particular, what the effect of including a constant is.

We can answer this with the FWL theorem by taking \(\bv_n = (1)\) and \(\av_n = \xv_n\), so that \(\X_a\) has rows \(\xv_n^\trans\) and \(\X_b = \onev\). Then \(\betavhat\) will be the same as in the regression

\[ \tilde{\Y} \sim \tilde{\X_a} \betav \]

where \(\tilde{\Y} = \proj{\X_b^\perp} \Y\) and \(\tilde{\X_a} = \proj{\X_b^\perp} \X_a\).

Working out this projection explicitly is useful for intuition. Since \(\bv_n = (1)\) is simply the constant regressor, \(\X_b = \onev\), and

\[ \proj{\X_b^\perp} = \id{} - \onev (\onev^\trans \onev)^{-1} \onev^\trans = \id{} - \frac{1}{N} \onev \onev^\trans, \]

since \(\onev^\trans \onev = \sumn 1 \cdot 1 = N\). Applying this projection to \(\Y\), we compute

\[ \begin{aligned} \onev^\trans \Y =& \sumn 1 \cdot \y_n = \sumn \y_n\\ \frac{1}{N} \onev^\trans \Y =& \meann 1 \cdot \y_n = \meann \y_n = \ybar\\ \onev \frac{1}{N} \onev^\trans \Y =& \onev \ybar = \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} \\ \proj{\X_b^\perp} \Y = \left(\id - \onev \frac{1}{N} \onev^\trans \right) \Y =& \Y - \begin{pmatrix} \ybar \\ \vdots \\ \ybar \end{pmatrix} = \begin{pmatrix} y_1 - \ybar \\ y_2 - \ybar \\ \vdots \\ y_N - \ybar \end{pmatrix} \end{aligned} \]

The projection matrix \(\proj{\X_b^\perp}\) thus simply centers a vector at its sample mean.
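
As a tiny concrete example (made up for illustration), take \(N = 3\):

\[ \Y = \begin{pmatrix} 1 \\ 2 \\ 6 \end{pmatrix}, \qquad \ybar = 3, \qquad \proj{\X_b^\perp} \Y = \Y - \ybar \onev = \begin{pmatrix} -2 \\ -1 \\ 3 \end{pmatrix}, \]

which indeed has sample mean zero.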

Similarly,

\[ \tilde{\X_a} := \proj{\X_b^\perp} \X_a = \X_a - \frac{1}{N} \onev \onev^\trans \X_a = \X_a - \onev \xbar^\trans \\ \textrm{ where } \xbar^\trans := \begin{pmatrix} \meann \x_{n1} & \ldots & \meann \x_{n(P-1)} \end{pmatrix}, \]

so that the \(n\)–th row of \(\proj{\X_b^\perp} \X_a\) is simply \(\xv_n^\trans - \xbar^\trans\), and each regressor is centered. So

\[ \betavhat_a = (\tilde{\X_a}^\trans \tilde{\X_a})^{-1} \tilde{\X_a}^\trans \tilde{\Y}, \]

the OLS estimator where both the regressors and responses have been centered at their sample means. In this case, by the LLN,

\[ \frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a} \rightarrow \cov{\xv_n}, \]

in contrast to the general case, where

\[ \frac{1}{N} \X^\trans \X \rightarrow \expect{\xv_n \xv_n^\trans} = \cov{\xv_n} + \expect{\xv_n}\expect{\xv_n^\trans}. \]

For this reason, when thinking about the sampling behavior of OLS coefficients where a constant is included in the regression, it’s enough to think about the covariance of the regressors, rather than the outer product.
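
As a check on both points (an illustrative numpy sketch, not from the notes): the non-intercept coefficients are unchanged when we center \(\Y\) and the regressors and drop the constant, and \(\frac{1}{N} \tilde{\X_a}^\trans \tilde{\X_a}\) is exactly the (divide-by-\(N\)) sample covariance of the regressors.

```python
import numpy as np

# Including a constant is equivalent to centering y and the regressors.
rng = np.random.default_rng(2)
N = 500
X_a = rng.normal(loc=2.0, scale=1.5, size=(N, 3))   # regressors with nonzero means
y = 1.0 + X_a @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=N)

# (1) Regression with an explicit constant column: keep the non-intercept part.
X = np.column_stack([X_a, np.ones(N)])
beta_with_const = np.linalg.solve(X.T @ X, X.T @ y)[:3]

# (2) Regression of centered y on centered regressors, with no constant.
X_c = X_a - X_a.mean(axis=0)
y_c = y - y.mean()
beta_centered = np.linalg.solve(X_c.T @ X_c, X_c.T @ y_c)

print(np.allclose(beta_with_const, beta_centered))  # True

# (1/N) X_c' X_c is the sample covariance of the regressors (divided by N).
print(np.allclose(X_c.T @ X_c / N, np.cov(X_a, rowvar=False, bias=True)))  # True
```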

Exercise

Derive the simple least squares estimator \(\betahat_1\) from the regression \(\y_n \sim \beta_0 + \beta_1 \x_n\) using the FWL theorem.

Covariances with the FWL theorem

Though it may not be obvious, estimating standard errors using the residuals from \(\tilde{\Y} \sim \tilde{\X_a}\betav_a\) versus \(\Y \sim \X_a \betav_a + \X_b \betav_b\) gives equivalent results, whether the homoskedastic or the heteroskedasticity-robust (sandwich) covariance estimators are used. See section 7.3 of Prof. Ding’s book for a proof, which uses the Schur complement representation of sub-matrices of \((\X^\trans \X)^{-1}\).
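
To see part of why this works, here is a small numpy check (my own sketch, with a made-up design) of the Schur complement fact: the \(\X_a\) block of \((\X^\trans \X)^{-1}\) equals \((\tilde{\X_a}^\trans \tilde{\X_a})^{-1}\), the matrix that appears in the covariance estimate from the projected regression.

```python
import numpy as np

# Schur complement check: the X_a block of (X'X)^{-1} equals
# (X_a' P_perp X_a)^{-1} from the projected regression.
rng = np.random.default_rng(3)
N, P_a, P_b = 150, 3, 2
X_a = rng.normal(size=(N, P_a))
X_b = np.column_stack([np.ones(N), rng.normal(size=(N, P_b - 1))])
X = np.column_stack([X_a, X_b])

block_a = np.linalg.inv(X.T @ X)[:P_a, :P_a]

P_perp = np.eye(N) - X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
X_a_tilde = P_perp @ X_a

print(np.allclose(block_a, np.linalg.inv(X_a_tilde.T @ X_a_tilde)))  # True
```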

The role of regressor covariance

For simplicity, let’s consider a simple regression where \(\y_n \sim \beta_0 + \beta_1 \x_n\), with \(\var{\x_n} = \sigma_x^2\) and \(\var{\res_n} = \sigma_\res^2\). By the FWL theorem, we can estimate

\[ \betahat_1 = \frac{\meann (\y_n - \ybar) (\x_n - \xbar)}{\meann (\x_n - \xbar)^2} \quad\textrm{with}\quad N \cov{\betahat_1} \approx \frac{\meann (\y_n - \ybar - \betahat_1 (\x_n - \xbar))^2}{\meann (\x_n - \xbar)^2} \rightarrow \frac{\sigma_\res^2}{\sigma_x^2}. \]

The variance of the (rescaled) regression coefficient thus depends on the ratio of the residual variance to the regressor variance, both measured after projecting out the other regressors (here, the constant).
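
A simple Monte Carlo sketch (my own, with made-up parameter values) illustrates this limit: the simulated sampling variance of \(\betahat_1\), rescaled by \(N\), is close to \(\sigma_\res^2 / \sigma_x^2\).

```python
import numpy as np

# Monte Carlo check: N * Var(betahat_1) should be close to sigma_eps^2 / sigma_x^2.
rng = np.random.default_rng(4)
N, n_sims = 200, 2000
sigma_x, sigma_eps, beta1 = 1.5, 0.8, 2.0

slopes = np.empty(n_sims)
for s in range(n_sims):
    x = rng.normal(scale=sigma_x, size=N)
    y = 1.0 + beta1 * x + rng.normal(scale=sigma_eps, size=N)
    x_c, y_c = x - x.mean(), y - y.mean()
    slopes[s] = np.sum(x_c * y_c) / np.sum(x_c ** 2)

print(N * slopes.var())               # roughly ...
print(sigma_eps ** 2 / sigma_x ** 2)  # ... 0.284
```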