$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Omitted variables in inference and prediction

\(\,\)

\[ \def\xcov{\mybold{\Sigma}_{xx}} \def\xzcov{\mybold{\Sigma}_{xz}} \def\zxcov{\mybold{\Sigma}_{zx}} \def\deltav{\mybold{\delta}} \]

Goals

  • Look at consequences of omitted variables
    • For inference
    • For prediction

Setup

For today, we will assume that

\[ \y_n = \xv_n^\trans \betav + \zv_n^\trans \gammav + \res_n. \]

We will take both \(\xv_n\) and \(\zv_n\) to be random, independent across rows but possibly dependent on one another, and assume that \(\res_n\) are mean zero and independent of both \(\xv_n\) and \(\zv_n\).

However, we will regress \(\y_n \sim \xv_n^\trans \betav\) alone. That is, the true \(\y_n\) was generated with some variables \(\zv_n\) that were omitted from the regression. What goes wrong when you omit variables in this way? The answer will turn out to depend both on how \(\zv_n\) and \(\xv_n\) are related, and on whether we are interested in prediction or in inference.

Limiting behavior with omitted variables

We will focus on the limiting behavior of \(\betavhat\). By assumption,

\[ \begin{aligned} \betavhat ={}& (\X^\trans \X)^{-1} \X^\trans \Y \\={}& (\X^\trans \X)^{-1} \X^\trans \left(\X \betav + \Z \gammav + \resv \right) \\={}& \betav + (\X^\trans \X)^{-1} \X^\trans \Z \gammav + (\X^\trans \X)^{-1} \X^\trans \resv \\\rightarrow{}& \betav + \xcov^{-1} \xzcov \gammav, \end{aligned} \]

where \(\xcov = \expect{\xv_n \xv_n^\trans}\) and \(\xzcov = \expect{\xv_n \zv_n^\trans}\). We’ll also write \(\zxcov = \xzcov^\trans\). If \(\xv_n\) and \(\zv_n\) are mean–zero, or if we have removed a constant with the FWL theorem, then these quantities are covariances, and it can be useful to think of them that way.
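This limit is easy to check numerically. The following is a minimal sketch in numpy; the dimensions, joint covariance, coefficients, and noise scale are made-up values for illustration. We draw correlated \(\xv_n\) and \(\zv_n\), regress \(\Y\) on \(\X\) alone, and compare \(\betavhat\) with \(\betav + \xcov^{-1} \xzcov \gammav\).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000                       # large N so the estimate is near its limit
    beta = np.array([1.0, -2.0])      # coefficients on the included x (made up)
    gamma = np.array([3.0])           # coefficient on the omitted z (made up)

    # Draw (x_n, z_n) jointly Gaussian with a nonzero cross-covariance Sigma_xz.
    joint_cov = np.array([[1.0, 0.3, 0.5],
                          [0.3, 1.0, 0.2],
                          [0.5, 0.2, 1.0]])
    xz = rng.multivariate_normal(np.zeros(3), joint_cov, size=N)
    X, Z = xz[:, :2], xz[:, 2:]
    eps = rng.normal(scale=0.5, size=N)
    Y = X @ beta + Z @ gamma + eps

    # Regress Y on X alone, omitting Z.
    betahat = np.linalg.lstsq(X, Y, rcond=None)[0]

    # The claimed limit beta + Sigma_xx^{-1} Sigma_xz gamma, from the true covariances.
    Sigma_xx = joint_cov[:2, :2]
    Sigma_xz = joint_cov[:2, 2:]
    limit = beta + np.linalg.solve(Sigma_xx, Sigma_xz @ gamma)

    print(betahat)    # close to `limit`, not to `beta`
    print(limit)

With a large \(N\), the fitted coefficients land close to the limit rather than close to \(\betav\).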

Consequences for inference

In an inference problem, we want to know the value of \(\betav\). As a simple example, suppose we have observational data, randomly selected from some population. In the population we might imagine \(\x_n\), which is whether a patient took an expensive drug, and \(\z_n\), which is the patient’s wealth. Since the drug is expensive, wealth is correlated with taking it, so \(\xzcov \ne \zerov\). Suppose we then measure the patient’s health, \(\y_n\), after some time.

One might imagine at least three different models of the world, each with different meanings:

\[ \begin{aligned} \y_n ={}& \gamma \z_n + \res_n & \beta = 0 && \textrm{Wealth determines health, drug doesn't matter} \\ \y_n ={}& \beta \x_n + \res_n & \gamma = 0 && \textrm{Drug determines health, wealth doesn't matter} \\ \y_n ={}& \beta \x_n + \gamma \z_n + \res_n & \gamma, \beta \ne 0 && \textrm{Both drug and wealth determine health} \\ \end{aligned} \]

These different models have very different real-world implications for the efficacy of the drug. Of course none of these are true — they are stylized representations of complex social processes. But these stylized models can help illustrate mathematically the consequences of linear regression, especially with omitted variables.

Now suppose we don’t actually observe \(\z_n\): we regress only on \(\x_n\) and take our estimate, \(\betahat\), as a measure of the efficacy of the drug. No matter which model we’re in, we want to estimate \(\beta\). By the result of the previous section, we have

\[ \betahat \rightarrow \beta + \xcov^{-1} \xzcov \gamma. \]

When is \(\betahat\) a good estimate of \(\beta\)?

  • If \(\xzcov = \zerov\), then \(\betahat \rightarrow \beta\). That is, if wealth is uncorrelated with drug–taking, then omitting wealth does not bias our estimate.
  • If \(\gamma = 0\), then \(\betahat \rightarrow \beta\). That is, if wealth has no effect on health, then omitting wealth does not bias our estimate.
  • If both \(\gamma \ne 0\) and \(\xzcov \ne 0\), then \(\betahat\) is not estimating \(\beta\) — it is estimating a possibly complex combination of \(\gamma\) and \(\beta\).

This is simply a mathematical statement of the fact that correlation does not imply causation — \(\betahat \ne 0\) indicates correlation between health and the drug, but does not indicate a causal relationship if there may be unobserved variables correlated with both.

In every study, there are omitted variables, since you can’t observe everything. In order to perform inference, you have to make a case that those omissions do not matter — either because \(\xzcov \approx \zerov\) or \(\gammav \approx \zerov\).
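The three stylized cases above are easy to reproduce in a small simulation. The sketch below uses made-up quantities: wealth \(\z_n\) is standard normal, drug-taking \(\x_n\) is a binary indicator that is more likely for wealthier patients (so \(\xzcov \ne \zerov\)), and health \(\y_n\) is generated from each of the three models in turn. Only in the second case does the regression on \(\x_n\) alone recover \(\beta\).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    wealth = rng.normal(size=N)                      # z_n: patient wealth
    drug = (wealth + rng.normal(size=N) > 0) * 1.0   # x_n: drug-taking, more likely when wealthy
    noise = rng.normal(size=N)

    def slope_on(x, y):
        # OLS of y on (1, x); return the coefficient on x.
        design = np.column_stack([np.ones_like(x), x])
        return np.linalg.lstsq(design, y, rcond=None)[0][1]

    # "Wealth determines health, drug doesn't matter": beta = 0, gamma = 1.
    print(slope_on(drug, 1.0 * wealth + noise))              # clearly nonzero despite beta = 0

    # "Drug determines health, wealth doesn't matter": beta = 1, gamma = 0.
    print(slope_on(drug, 1.0 * drug + noise))                # close to the true beta = 1

    # "Both drug and wealth determine health": beta = 1, gamma = 1.
    print(slope_on(drug, 1.0 * drug + 1.0 * wealth + noise)) # a blend of beta and gamma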

Consequences for prediction

Take the previous setting, but suppose we only want to predict a patient’s health level after some time (e.g. for insurance purposes), and do not care about any causal relationships like the efficacy of the drug. Again, suppose we observe only \(\xv_n\). What predictive error do we make when using \(\yhat_\new = \betavhat^\trans \xv_\new\)?

The error we commit is

\[ \begin{aligned} \y_\new - \yhat_\new ={}& (\betav - \betavhat)^\trans \xv_\new + \gammav^\trans \zv_\new + \res_\new \\\rightarrow{}& (\betav - \betav - \xcov^{-1} \xzcov \gammav)^\trans \xv_\new + \gammav^\trans \zv_\new + \res_\new \\={}& - \gammav^\trans \zxcov \xcov^{-1} \xv_\new + \gammav^\trans \zv_\new + \res_\new \\={}& \gammav^\trans \left( \zv_\new - \zxcov \xcov^{-1} \xv_\new \right) + \res_\new. \end{aligned} \]

For a particular \(\xv_\new\), the expected error (i.e., the conditional bias) is thus

\[ \expect{\y_\new - \yhat_\new \vert \xv_\new, \X, \Y} \rightarrow \gammav^\trans \left( \expect{\zv_\new | \xv_\new} - \zxcov \xcov^{-1} \xv_\new \right). \] Note that inside the expectation \(\X\) and \(\Y\) (and hence \(\betavhat\)) are held fixed; the arrow denotes a limit in probability over \(\X\) and \(\Y\) as \(N \rightarrow \infty\).

Of course, this bias is zero if \(\gammav = \zerov\). But note that:

  • The conditional bias can still be nonzero even if \(\zxcov = \zerov\), as long as \(\expect{\zv_\new | \xv_\new} \ne \zerov\) (see the sketch after this list).
  • Conversely, the conditional bias can be zero even if \(\zxcov \ne \zerov\) as long as \(\expect{\zv_\new | \xv_\new} = \zxcov \xcov^{-1} \xv_\new\).
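Here is a minimal sketch of the first bullet, under the made-up assumption that \(\x_n\) is a scalar standard normal and \(\z_n = \x_n^2 - 1\), so that \(\cov{\x_n, \z_n} = 0\) but \(\expect{\z_n \vert \x_n} \ne 0\).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    beta, gamma = 1.0, 2.0

    x = rng.normal(size=N)
    z = x**2 - 1                        # mean zero and uncorrelated with x, but E[z | x] = x^2 - 1
    y = beta * x + gamma * z + rng.normal(scale=0.5, size=N)

    # Regress y on (1, x) alone.
    design = np.column_stack([np.ones(N), x])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    print(coef[1])                      # close to beta: no inferential bias, since Sigma_xz = 0

    # But the prediction at x_new = 2 is conditionally biased by roughly
    # gamma * (E[z | x_new] - 0) = gamma * (x_new**2 - 1) = 6.
    x_new = 2.0
    y_hat = coef[0] + coef[1] * x_new
    expected_y = beta * x_new + gamma * (x_new**2 - 1)
    print(expected_y - y_hat)           # approximately 6, not 0

The slope estimate is consistent for \(\beta\), yet the prediction at \(\x_\new = 2\) is off by roughly \(\gamma (\x_\new^2 - 1) = 6\).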

In the inference setting, the bias in estimating \(\betav\) was determined entirely by \(\xzcov\). As we can see, the prediction bias is a little more complicated: it depends on the conditional expectation \(\expect{\zv_\new | \xv_\new}\), not just on the covariance.

Going back to our example, our prediction doesn’t suffer at all when omitting wealth as long as wealth is perfectly linearly related to \(\xv_n\). In fact, if \(\zv_n = \A \xv_n\) for some matrix \(\A\), then \(\betav^\trans \xv_n + \gammav^\trans \zv_n = (\betav + \A^\trans \gammav)^\trans \xv_n\). Then \(\betav\) and \(\gammav\) are not even separately identifiable, as \(\Z\) and \(\X\) are collinear, and including \(\Z\) does not improve the predictive power.

However, we will make prediction errors if \(\y_\new\) contains some dependence on \(\zv_\new\) that is not linear in \(\xv_\new\). The next section makes this precise by splitting \(\zv_n\) into a part that is linear in \(\xv_n\) and a remainder \(\deltav_n\); if \(\expect{\deltav_n \vert \xv_n} = \zerov\), then \(\deltav_n\) can be thought of as part of the mean–zero residual. Excluding \(\zv_n\) might degrade our predictive performance, since its variance will count towards the residual variance, but its exclusion will not incur bias.
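Conversely, here is a sketch of the perfectly collinear case \(\zv_n = \A \xv_n\), with a made-up matrix \(\A\) and made-up coefficients: the regression on \(\X\) alone converges to \(\betav + \A^\trans \gammav\) rather than \(\betav\), but nothing is lost for prediction because the residuals contain only the noise.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    beta = np.array([1.0, -2.0])
    gamma = np.array([3.0, 0.5])
    A = np.array([[0.5, 1.0],
                  [2.0, -1.0]])          # z_n = A x_n exactly

    X = rng.normal(size=(N, 2))
    Z = X @ A.T
    eps = rng.normal(scale=0.5, size=N)
    Y = X @ beta + Z @ gamma + eps

    # The regression on X alone targets beta + A' gamma, not beta ...
    betahat = np.linalg.lstsq(X, Y, rcond=None)[0]
    print(betahat)
    print(beta + A.T @ gamma)

    # ... but nothing is lost for prediction: the residuals are just the noise.
    resid = Y - X @ betahat
    print(np.std(resid))                 # approximately 0.5, the noise scale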

A unifying decomposition

We can understand both the above phenomena in a single decomposition. Write

\[ \begin{aligned} \zv_n ={}& \zv_n - \zxcov \xcov^{-1} \xv_n + \zxcov \xcov^{-1} \xv_n =: \deltav_n + \hat\zv_n\quad\textrm{where} \\ \hat\zv_n :={}& \zxcov \xcov^{-1} \xv_n \quad\textrm{and} \\ \deltav_n :={}& \zv_n - \zxcov \xcov^{-1} \xv_n. \end{aligned} \]

Note that

\[ \expect{\deltav_n \xv_n^\trans } = \expect{\zv_n \xv_n^\trans} - \expect{\zxcov \xcov^{-1} \xv_n \xv_n^\trans} = \zxcov -\zxcov = \zerov, \]

so \(\deltav_n\) is the part of \(\zv_n\) that is “uncorrelated” with \(\xv_n\) (meaning, its cross second moment with \(\xv_n\) is zero). Conversely, \(\hat\zv_n\) is a linear approximation of \(\zv_n\) using \(\xv_n\). In fact, \(\hat\zv_n\) is the limit as \(N\rightarrow\infty\) of a least-squares regression of the components of \(\zv_n\) on \(\xv_n\), and \(\deltav_n\) is the limit of the residual, so \(\hat\zv_n\) is the “best linear approximation” in the sense that, for each component \(k\),

\[ \hat\z_{nk} = \alphav_k^\trans \xv_n, \quad\textrm{where}\quad \alphav_k := \argmin{\alphav} \expect{\left(\z_{nk} - \alphav^\trans \xv_n\right)^2}. \]

Exercise

Prove the preceding statement. Analogously, prove that if you regress \(\z_{nk} \sim \alphav^\trans \xv_n\), then \(\hat\alphav^\trans \xv_n \rightarrow \hat\z_{nk}\) and the estimated residual \(\hat{\delta}_{nk} \rightarrow \delta_{nk}\) as \(N\rightarrow\infty\).
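As a sanity check (not a proof), here is a small numpy sketch with a made-up joint covariance for a two-dimensional \(\xv_n\) and a scalar \(\z_n\): it verifies that \(\deltav_n\) has (sample) zero second moment with \(\xv_n\), and that the least-squares coefficients from regressing \(\z_n\) on \(\xv_n\) approach \(\xcov^{-1} \xzcov\).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000

    # Correlated x_n (two-dimensional) and a scalar z_n with a known joint covariance.
    joint_cov = np.array([[1.0, 0.3, 0.5],
                          [0.3, 1.0, 0.2],
                          [0.5, 0.2, 1.0]])
    xz = rng.multivariate_normal(np.zeros(3), joint_cov, size=N)
    X, z = xz[:, :2], xz[:, 2]

    Sigma_xx = joint_cov[:2, :2]
    Sigma_xz = joint_cov[:2, 2]          # E[x_n z_n]

    # Population best linear approximation z_hat and its remainder delta.
    z_hat = X @ np.linalg.solve(Sigma_xx, Sigma_xz)
    delta = z - z_hat

    # delta has (approximately) zero sample second moment with x_n ...
    print(X.T @ delta / N)

    # ... and least-squares regression of z on x recovers the same coefficients.
    alpha_hat = np.linalg.lstsq(X, z, rcond=None)[0]
    print(alpha_hat)
    print(np.linalg.solve(Sigma_xx, Sigma_xz))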

Given this decomposition, we can write

\[ \begin{aligned} \y_n ={}& \betav^\trans \xv_n + \gammav^\trans \zv_n + \res_n \\={}& \betav^\trans \xv_n + \gammav^\trans (\hat\zv_n + \deltav_n) + \res_n \\={}& \left(\betav + \xcov^{-1} \xzcov \gammav \right)^\trans \xv_n + \gammav^\trans \deltav_n + \res_n. \end{aligned} \]

Because \(\expect{\xv_n \deltav_n^\trans} = \zerov\), we can immediately read our two forms of omitted variable bias off the preceding expression:

  • Inference: \(\betavhat \rightarrow \betav + \xcov^{-1} \xzcov \gammav\). The inferential bias is caused by \(\hat\zv_n\) and is independent of \(\deltav_n\).
  • Prediction: \(\expect{\y_\new - \yhat_\new \vert \xv_\new,\X,\Y} \rightarrow \gammav^\trans \expect{\deltav_\new \vert \xv_\new}\). The prediction bias is caused by \(\deltav_n\) and is independent of \(\hat\zv_n\).

Another way to see this is:

  • For inference, \(\deltav_n\) can be considered part of the residual since it is uncorrelated with \(\xv_n\), and
  • For prediction, \(\hat\zv_n\) can be considered part of the signal, since it is an exact linear function of \(\xv_n\).
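As a final numerical illustration of this summary, the sketch below uses a made-up \(\zv_n\) that is both correlated with \(\xv_n\) (so inference for \(\betav\) is biased) and nonlinearly dependent on it (so predictions are conditionally biased), and reads both conclusions off the same decomposition.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    beta = np.array([1.0, -2.0])
    gamma = np.array([3.0])

    # z_n = 0.5 x_{n1} + (x_{n2}^2 - 1): correlated with x_n through the first
    # term (inferential bias) and nonlinearly dependent through the second
    # (prediction bias).
    X = rng.normal(size=(N, 2))
    Z = 0.5 * X[:, :1] + (X[:, 1:]**2 - 1)
    eps = rng.normal(scale=0.5, size=N)
    Y = X @ beta + Z @ gamma + eps

    # For this construction Sigma_xx = I and Sigma_xz = [0.5, 0]'.
    Sigma_xz = np.array([[0.5], [0.0]])
    combined = beta + Sigma_xz @ gamma       # beta + Sigma_xx^{-1} Sigma_xz gamma

    # Inference: regressing Y on X alone targets `combined`, not beta.
    betahat = np.linalg.lstsq(X, Y, rcond=None)[0]
    print(betahat)
    print(combined)

    # Prediction: the leftover is exactly gamma' delta_n + eps_n, which is
    # uncorrelated with x_n but has E[delta_n | x_n] = x_{n2}^2 - 1 != 0.
    delta = Z - X @ Sigma_xz                 # z_hat = Sigma_zx Sigma_xx^{-1} x_n = 0.5 x_{n1}
    leftover = Y - X @ combined
    print(np.allclose(leftover, delta @ gamma + eps))   # True
    print(X.T @ leftover / N)                            # near zero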