Omitted variables
Goals
- Understand the different consequences of omitted variables for inference and prediction problems
- Apply the FWL theorem to understand how omitted variable bias occurs:
  - Remove a constant to change second moments to covariances
  - Omitted variables that are uncorrelated with the regressors have no effect
  - Therefore, in theory, all omitted variable bias could be dealt with via nonlinear transformations!
  - Omitted variables that are uncorrelated with the response have no effect
Setup
In this lecture, we’re interested in the consequences of running a misspecified regression. That is, we suppose that the true generative process is \[ y_n = \beta_0 + {\beta^{*}}x_n + \gamma^*z_n + \varepsilon_n \quad\textrm{where}\quad\mathbb{E}\left[\varepsilon_n\right] = 0 \textrm{, independently of }x_n, z_n, \] but we estimate \({\beta^{*}}\) via the misspecified regression \(y\sim x\beta\) without \(z\). For simplicity, in this lecture I’ll focus on scalar \(x\) and \(z\), though the ideas readily generalize.
There are various reasons why this might happen. One is that the true \(z_n\) is simply not observed. For example, we might imagine that \(y_n\) is a test score, \(x_n\) is the indicator for a study program whose effectiveness we want to measure, and \(z_n\) is a student’s “true” ability, which would be very difficult or expensive to measure (if it can indeed even be measured at all). A different reason is that we think \(z\) is observed, but we can’t calculate it. An example of this might be the Taylor series expansion \[ y= \sum_{k=0}^\infty \beta_k x^k + \varepsilon, \] which is true as long as \(\mathbb{E}\left[y \vert x\right]\) is sufficiently smooth. Of course we cannot directly estimate this infinite-dimensional regression, so we might instead just regress \(y\sim \beta_0 + \beta_1 x+ \beta_2 x^2\), in which case the higher-order terms are missing regressors. We will show below, using the FWL theorem, that these two cases are actually not so different from one another.
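A quick simulation illustrates the Taylor-series case. Here the true mean function contains an \(x^3\) term that we omit from a quadratic fit; the coefficients and the data-generating process below are made up for illustration. For standard normal \(x\), \(\mathrm{Cov}(x, x^3) = \mathbb{E}[x^4] = 3\), so the omitted cubic term shifts the estimated linear coefficient by \(0.3 \times 3 = 0.9\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(size=N)
# True mean function includes an x^3 term that we will omit below.
y = 1.0 + 2.0 * x + 0.5 * x**2 + 0.3 * x**3 + rng.normal(size=N)

# Fit the misspecified quadratic regression y ~ 1 + x + x^2.
X = np.column_stack([np.ones(N), x, x**2])
betahat = np.linalg.lstsq(X, y, rcond=None)[0]

# The omitted x^3 is correlated with x (Cov(x, x^3) = 3 for standard
# normal x), so the linear coefficient is biased:
# betahat[1] is near 2.0 + 0.3 * 3 = 2.9, not 2.0.
# The quadratic coefficient is unbiased, since E[x^5] = 0.
print(betahat)
```

Note that only the coefficient on \(x\) is distorted: \(x^3\) is orthogonal to both the constant and \(x^2\) under a symmetric regressor distribution, so the omitted term only "leaks" into the regressors it is correlated with.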
The role of the constant
Note that, by the FWL theorem, the role of \(\beta_0\) can be accounted for by simply centering everything. That is, we replace \(x_n\) with \(x_n \leftarrow x_n - \bar{x}\), \(y_n \leftarrow y_n - \overline{y}\), and \(z_n \leftarrow z_n - \overline{z}\). Without further comment, we’ll assume that we’ve done this.
When this is done, note that quantities like \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}\) become covariances. When \(N\) is large, \(x\) being uncorrelated with \(y\) thus implies that \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}\approx 0\), and “uncorrelatedness” becomes equivalent (up to the LLN) to orthogonality.
I’ll note that sometimes we talk about \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\) as a “regressor covariance” matrix, and when we do so it’s precisely with this application of the FWL theorem in mind.
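A minimal numerical check of this point (the particular means and coefficients are arbitrary): after centering, \(\frac{1}{N}\boldsymbol{X}^\intercal\boldsymbol{Y}\) agrees with the sample covariance of \(x\) and \(y\), up to the \(N\) versus \(N-1\) normalization convention.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(loc=5.0, size=N)   # deliberately not mean zero
y = 2.0 * x + rng.normal(size=N)

# Center everything, as the FWL theorem licenses us to do.
xc, yc = x - x.mean(), y - y.mean()

# (1/N) X^T Y on centered data is the sample covariance of x and y,
# here approximately Cov(x, y) = 2 * Var(x) = 2.
print(xc @ yc / N)
print(np.cov(x, y)[0, 1])   # same quantity, with the 1/(N-1) convention
```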
A running example
Following the above, a running example I’ll use is where
- \(y_n\) is academic performance
- \(x_n\) is an indicator for receiving tutoring, and
- \(z_n\) is natural ability.
Suppose, in an extreme case, that \({\beta^{*}}= 0\) and \(\gamma^*= 1\). That means that tutoring has no effect on academic performance, and that only natural ability matters.
Suppose we don’t observe \(z_n\), and just run the regression \(y\sim 1 + x\beta\) (recall \(x\) is a scalar).
\[ \hat{\beta}= \frac{\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{Y}}{\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}} = \frac{\frac{1}{N} \boldsymbol{X}^\intercal(\boldsymbol{X}{\beta^{*}}+ \boldsymbol{Z}\gamma^*+ \boldsymbol{\varepsilon})}{\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}} \approx {\beta^{*}}+ \frac{\mathrm{Cov}\left(x, z\right)}{\mathrm{Var}\left(x\right)} \gamma^*+ \frac{\mathrm{Cov}\left(x, \varepsilon\right)}{\mathrm{Var}\left(x\right)} = {\beta^{*}}+ \frac{\mathrm{Cov}\left(x, z\right)}{\mathrm{Var}\left(x\right)} \gamma^*. \]
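We can verify this formula by simulation. In the sketch below the numbers are made up: ability \(z\) is standard normal, tutoring \(x\) is built to have \(\mathrm{Cov}(x, z) = 0.8\) and \(\mathrm{Var}(x) = 1\), so with \({\beta^{*}}= 0\) and \(\gamma^*= 1\) we expect \(\hat{\beta}\approx 0.8\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
beta_star, gamma_star = 0.0, 1.0   # tutoring has no effect; ability matters

z = rng.normal(size=N)                    # ability
x = 0.8 * z + 0.6 * rng.normal(size=N)    # tutoring correlated with ability
y = beta_star * x + gamma_star * z + rng.normal(size=N)

# Regress y on x alone, after centering (the FWL shortcut for a constant).
xc, yc = x - x.mean(), y - y.mean()
betahat = (xc @ yc) / (xc @ xc)

# Compare with the omitted variable bias formula:
# beta* + Cov(x, z) / Var(x) * gamma*, which is 0 + 0.8 / 1.0 * 1 = 0.8.
predicted = beta_star + np.cov(x, z)[0, 1] / np.var(x) * gamma_star
print(betahat, predicted)
```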
For inference
We see that, if \(x\) is correlated with \(z\) and \(\gamma^*\ne 0\), we incorrectly estimate \(\hat{\beta}\ne 0 = {\beta^{*}}\). We note that, for this to happen, we require both that
- \(x\) and \(z\) are correlated
- \(z\) and \(y\) are correlated (i.e., \(\gamma^*\ne 0\)).
In our example, this means that ability is both correlated with tutoring and with academic outcomes. This double correlation leads to \(x\) being correlated with \(y\) despite having no causal relationship, and is an example of correlation not implying causation.
For prediction
Suppose that, instead of estimating \({\beta^{*}}\), we just wanted to predict which students will do well in school. And suppose that only talented students get tutoring. Then the fact that we mis-estimate \({\beta^{*}}\) isn’t a problem. Our prediction on a new datapoint is
\[ \hat{y}_\mathrm{new}= \hat{\beta}x_\mathrm{new}\approx \frac{\mathrm{Cov}\left(x, z\right)}{\mathrm{Var}\left(x\right)} \gamma^*x_\mathrm{new}. \]
And so the covariance between our prediction and the new datapoint is
\[ \begin{aligned} \mathrm{Cov}\left(y_\mathrm{new}, \hat{y}_\mathrm{new}\right) ={}& \mathrm{Cov}\left(\gamma^*z_\mathrm{new}+ \varepsilon_\mathrm{new}, \hat{\beta}x_\mathrm{new}\right) \\\approx{}& {\gamma^*}^2 \mathrm{Cov}\left(z_\mathrm{new}, x_\mathrm{new}\right) \frac{\mathrm{Cov}\left(x, z\right)}{\mathrm{Var}\left(x\right)} \\={}& {\gamma^*}^2 \frac{\mathrm{Cov}\left(x, z\right)^2}{\mathrm{Var}\left(x\right)}. \end{aligned} \] Noting that \[ \mathrm{Var}\left(y_\mathrm{new}\right) = {\gamma^*}^2 \mathrm{Var}\left(z\right) + \sigma^2 \quad\textrm{and}\quad \mathrm{Var}\left(\hat{y}_\mathrm{new}\right) \approx {\gamma^*}^2 \frac{\mathrm{Cov}\left(x, z\right)^2}{\mathrm{Var}\left(x\right)}, \] we get \[ \begin{aligned} \mathrm{Corr}\left(y_\mathrm{new}, \hat{y}_\mathrm{new}\right) ={}& \frac{\mathrm{Cov}\left(y_\mathrm{new}, \hat{y}_\mathrm{new}\right)}{\sqrt{\mathrm{Var}\left(y_\mathrm{new}\right)\mathrm{Var}\left(\hat{y}_\mathrm{new}\right)}} \\\approx{}& \frac{\mathrm{Cov}\left(x, z\right)^2}{\mathrm{Var}\left(x\right)}\frac{\sqrt{\mathrm{Var}\left(x\right)}}{\left|\mathrm{Cov}\left(x, z\right)\right|}\frac{1}{\sqrt{\mathrm{Var}\left(z\right) + \frac{\sigma^2}{{\gamma^*}^2}}} \\={}& \frac{\left|\mathrm{Cov}\left(x, z\right)\right|}{\sqrt{\mathrm{Var}\left(x\right)} \sqrt{\mathrm{Var}\left(z\right) + \frac{\sigma^2}{{\gamma^*}^2}}}. \end{aligned} \] Here, we see that the prediction can still be highly correlated with the true value \(y\): the correlation is high precisely to the extent that \(\sigma / \left|\gamma^*\right|\) is small and \(x\) is strongly correlated with \(z\).
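The closed-form correlation can be checked by simulation. Continuing with made-up numbers: \(\mathrm{Cov}(x, z) = 0.8\), \(\mathrm{Var}(x) = \mathrm{Var}(z) = 1\), and \(\sigma = \gamma^*= 1\), so the formula predicts \(\mathrm{Corr}(y_\mathrm{new}, \hat{y}_\mathrm{new}) \approx 0.8 / \sqrt{2} \approx 0.57\) even though \({\beta^{*}}= 0\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
gamma_star, sigma = 1.0, 1.0

def draw(n):
    """Draw (x, y) from the same population for train and test."""
    z = rng.normal(size=n)
    x = 0.8 * z + 0.6 * rng.normal(size=n)   # Cov(x, z) = 0.8, Var(x) = 1
    y = gamma_star * z + sigma * rng.normal(size=n)
    return x, y

# Fit the misspecified regression on training data.
x, y = draw(N)
betahat = np.cov(x, y)[0, 1] / np.var(x)

# Predict on fresh data from the same population.
x_new, y_new = draw(N)
yhat_new = betahat * x_new

corr = np.corrcoef(y_new, yhat_new)[0, 1]
formula = 0.8 / (np.sqrt(1.0) * np.sqrt(1.0 + sigma**2 / gamma_star**2))
print(corr, formula)
```

The key assumption the simulation makes explicit is that `draw` is the same for training and test data; the next paragraph discusses what happens when it is not.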
In other words, for prediction, it doesn’t matter that correlation is not causation, since correlation is good enough. Of course, we have to assume that our test population is the same as the training population. If only the best students get tutoring in the neighborhood where we gathered data, but this is not true of the test population (for example, because struggling students there receive discounted after-school tutoring), then our predictions are no longer useful.
Unobservables only matter through their conditional expectation
Suppose we’re worried about \(z_n\) being unobservable. Note that we can always write \[ z= z- \mathbb{E}\left[z\vert x\right] + \mathbb{E}\left[z\vert x\right] = \eta + \phi(x). \] We don’t necessarily know what \(\phi(x)\) is, and if we can’t observe \(z\) then we can’t estimate it, but we know it exists for essentially any random variable \(z\). Then we have that \[ \mathrm{Cov}\left(z, x\right) = \mathrm{Cov}\left(\eta, x\right) + \mathrm{Cov}\left(\phi(x), x\right) = \mathrm{Cov}\left(\phi(x), x\right). \] Thus it doesn’t matter that we’ve omitted the regressor \(\eta\); all that matters for the estimation of \(\beta\) is the omitted conditional expectation \(\phi(x)\). Of course, this is not helpful, since we don’t necessarily know what \(\phi(x)\) is — and we cannot control for all possible functions \(\phi(x)\) without producing regressors that are collinear with \(x\)! However, this does show that, in a certain sense, the unobservability of \(z\) only matters due to our inability to compute \(\mathbb{E}\left[z\vert x\right]\).
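The decomposition is easy to see numerically. In the sketch below, the choice \(\phi(x) = \tanh(x)\) is a hypothetical conditional expectation, picked only for illustration; the point is that \(\eta\) is independent of \(x\), so it contributes nothing to \(\mathrm{Cov}(z, x)\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(size=N)

# A hypothetical nonlinear conditional expectation phi(x) = E[z | x].
phi = np.tanh(x)
eta = rng.normal(size=N)   # independent of x, so Cov(eta, x) = 0
z = phi + eta

# Cov(z, x) is driven entirely by the conditional expectation phi(x):
# the two covariances below agree up to sampling noise.
print(np.cov(z, x)[0, 1], np.cov(phi, x)[0, 1])
```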