$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} 
\def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Stochastic assumptions on the residual

Goals

  • Introduce statistical assumptions to OLS
  • Hierarchies of assumptions
      • Independent mean zero residuals
      • IID residuals
      • IID Normal residuals
    • Bias in OLS
    • Variance in OLS
    • Fixed versus random regressors
    • Meaning and criticism of assumptions

Births dataset

Recall our births dataset, which contains observational data on many aspects of a pregnancy, including whether the mother was a smoker. Let’s suppose we’re interested in seeing whether there is evidence that smoking negatively affects a baby’s birth weight.

The histograms are suggestive:

[Figure: density histograms of birth weight, by mother’s smoking habit.]

But we might reasonably want to try to control for some of the many additional variables using regression. For instance, we might run the regression:

lm_inter <- lm(weight ~ mage + habit + (whitemom + sex + visits + gained)^2, births_df)
summary(lm_inter)$coefficients["habitsmoker", "Estimate"]
[1] -0.6169764

That’s a pretty large and negative estimated effect. But we might want to ask a number of critical questions:

  • Could such a large effect be due to chance, especially given how many regressors we controlled for?
  • Can we quantify mathematically what goes wrong when we fail to include regressors we should?
  • What happens if some of our regressors are observed with errors?
  • How does the amount of data affect the accuracy of our results?

To answer these questions, we need some stochastic assumptions.

From sample means to regression

Suppose we just look at the sample averages for smokers and non–smokers: \(\ybar_{s}\) and \(\ybar_{x}\), respectively. We talked a lot at the beginning about when we might assume that our \(\y_n\) are IID samples from some population of smokers and non–smokers. In that case, note that the \(n\)–th smoker’s weight observation can be written as

\[ \y_{sn} = \expect{\y_{sn}} + (\y_{sn} - \expect{\y_{sn}}) = \mu_s + \res_{sn}, \]

where \(\mu_s = \expect{\y_{sn}}\) is what we want to know, and \(\expect{\res_{sn}} = 0\) by definition! In this case, if \(\z_n\) encodes smoking, we can write

\[ \y_n = \beta_1 + \beta_2 \z_n + \res_{n} \]

with no additional assumptions — there automatically exist \(\beta_1\) and \(\beta_2\) such that this is true, namely

\[ \beta_1 = \mu_x \quad\textrm{and}\quad \beta_2 = \mu_s - \mu_x. \]

Furthermore, we know that \(\betavhat\) is unbiased for these. (Exercise: show this.)

However, the situation becomes more complicated if we want to control for the mother’s age (for example). In that case, if we take \(\x_n\) to be the mother’s age and write

\[ \y_n = \beta_1 + \beta_2 \z_n + \beta_3 \x_n + \res_n, \]

then it may not be true that \(\expect{\res_n} = 0\) from IID sampling alone. For example, suppose that, actually, \(\y_n = \gamma \x_n^2 + \eta_n\), for \(\expect{\eta_n} = 0\). Then we can plug in and see that (taking the regressors as fixed, and not random):

\[ \expect{\res_n} = \expect{\y_n - (\beta_1 + \beta_2 \z_n + \beta_3 \x_n)} = \expect{\gamma \x_n^2 - (\beta_1 + \beta_2 \z_n + \beta_3 \x_n)} \ne 0, \]

in general.
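To make this concrete, here is a minimal simulated sketch (made-up numbers, not the births data): the truth is quadratic in \(\x_n\), we fit a model that is linear in \(\x_n\), and the fitted residuals have a conditional mean that clearly depends on \(\x_n\):

# Minimal sketch with simulated data: the true model is quadratic in x,
# but we fit a model that is linear in x (and includes z).
set.seed(42)
n <- 5000
x <- runif(n, 0, 4)
z <- rbinom(n, 1, 0.5)
y <- 1.5 * x^2 + rnorm(n)   # true model: quadratic in x, no z effect
fit <- lm(y ~ z + x)

# Average residual within bins of x.  Under a correctly specified model these
# would all be near zero; here they show a clear systematic pattern.
tapply(residuals(fit), cut(x, breaks = 4), mean)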

What we’ve just done is an example of a sequence of steps we’ll be following a lot:

  • Tentatively assume some “true” model (often for the sake of argument), and then
  • Examine the behavior of our regression under that true model, typically by plugging the true model into our formula for \(\betavhat\) or \(\yhat_n\).

We can do this in order to:

  • Gain intuition about linear regression that we hope extends to other models,
  • Falsify the assumed model,
  • Explore patterns in the data,
  • Attempt to assess the prediction accuracy,
  • … and so on.

Typically our assumptions may be falsifiable, but they are often not verifiable. They have to be critically examined, and are a central part of any argument made using regression.

Assumptions on the “true” model

In the next six weeks, we will explore a hierarchy of assumptions. Typically, the more you assume, the more precise you can be about the behavior of linear regression!

IID data

The loosest assumption is that \((\xv_n, \y_n)\) are IID pairs.

IID data assumption

Assume that the pairs \((\xv_n, \y_n)\) are IID, and that both \(\expect{\y_n \vert \xv_n}\) and \(\var{\y_n \vert \xv_n}\) are finite.

This is similar to a “machine learning” style assumption, where we don’t assume much about the relationship between \(\x_n\) and \(\y_n\), but want to predict \(\expect{\y_n \vert \xv_n}\) with some flexible learning algorithm — which could be linear regression.

Linear expectation

The next strongest assumption is this:

Linear expectation (LE) assumption

Under the IID data assumption, assume additionally that there exists a \(\betav\) such that, for all \(\xv_n\), \(\expect{\y_n \vert \xv_n} = \betav^\trans \xv_n\). Equivalently, we can write \(\expect{\Y \vert \X} = \X \betav\).

This is a strong assumption! It is unlikely to be actually true in most practical settings. However, something like this is necessary if we’re to evaluate how well OLS recovers a “true” \(\betav\).

Under this assumption, we can always define

\[ \res_n = \y_n - \betav^\trans \xv_n = \y_n - \expect{\y_n \vert \xv_n}, \]

which must satisfy

\[ \expect{\res_n \vert \xv_n} = 0 \quad\Rightarrow\quad \expect{\res_n} = \expect{\expect{\res_n \vert \xv_n}} = 0. \]

This is equivalent to saying that there exists a \(\betav\) such that \(\y_n = \betav^\trans \xv_n + \res_n\) with \(\expect{\res_n \vert \xv_n} = 0\).

Since this assumption allows \(\var{\res_n \vert \xv_n}\) to vary with \(\xv_n\), the residuals are sometimes called “heteroskedastic,” meaning “different randomness,” in contrast with the next assumption.

Unbiasedness under the LE assumption

Amazingly, the LE assumption is enough to prove the unbiasedness of OLS under correct specification. Here is a proof:

\[ \begin{aligned} \betavhat ={}& (\X^\trans \X)^{-1} \X^\trans \Y \\={}& (\X^\trans \X)^{-1} \X^\trans (\X \betav + \resv) \\={}& (\X^\trans \X)^{-1} \X^\trans \X \betav + (\X^\trans \X)^{-1} \X^\trans \resv \\={}& \betav + (\X^\trans \X)^{-1} \X^\trans \resv \quad\Rightarrow\\ \expect{\betavhat \vert \X} ={}& \betav + (\X^\trans \X)^{-1} \X^\trans \expect{\resv \vert \X} ={} \betav \quad\Rightarrow\\ \expect{\betavhat} ={}& \expect{\expect{\betavhat \vert \X}} = \betav. \end{aligned} \]
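If you find the algebra unconvincing, a small simulation (made-up numbers) illustrates the same point: with a fixed design and residuals whose conditional mean is zero but whose variance depends on \(\x_n\), averaging \(\betavhat\) over many draws recovers \(\betav\):

# Minimal sketch with simulated data: fixed design, heteroskedastic residuals
# with conditional mean zero.  The average of betahat over draws is close to beta.
set.seed(42)
n <- 200
X <- cbind(1, runif(n, 1, 5))
beta <- c(2, -0.5)

betahats <- replicate(2000, {
  eps <- rnorm(n, mean = 0, sd = 0.5 * X[, 2])  # LE holds; homoskedasticity does not
  y <- X %*% beta + eps
  coef(lm(y ~ X[, 2]))
})
rowMeans(betahats)  # approximately c(2, -0.5)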

Omitted variables under the LE assumption

The LE assumption also helps make it clear what happens when you fail to include variables which you should have included.

Suppose that the true model is \(\Y = \X \betav + \Z \gammav + \resv\) under the LE assumption, but we run the regression \(\Y \sim \Z \alphav\). How different is \(\alphavhat\) from the true \(\gammav\)?

We can repeat the steps above to get

\[ \begin{aligned} \alphavhat ={}& (\Z^\trans \Z)^{-1} \Z^\trans \Y \\={}& (\Z^\trans \Z)^{-1} \Z^\trans (\X \betav + \Z \gammav + \resv) \\={}& \gammav + (\Z^\trans \Z)^{-1} \Z^\trans \X \betav + (\Z^\trans \Z)^{-1} \Z^\trans \resv \quad\Rightarrow\\ \expect{\alphavhat \vert \X, \Z} ={}& \gammav + (\Z^\trans \Z)^{-1} \Z^\trans \X \betav. \end{aligned} \]

We see that \(\alphavhat\) is biased as an estimator of \(\gammav\) unless \(\Z^\trans \X \betav = \zerov\). This term can be zero in two ways:

  1. If \(\betav = 0\). This happens when \(\X\) actually has no effect on \(\Y\) in the true model, and so can be safely omitted.
  2. If \(\Z^\trans \X = \zerov\). This happens when \(\X\) is orthogonal to \(\Z\).

One way to think about the \(\Z^\trans \X = \zerov\) case is as follows: you could incorporate \(\X \betav\) into a new residual, \(\etav = \resv + \X \betav\), in the model \(\Y = \Z \alphav + \etav\). It is no longer the case that \(\expect{\etav | \Z}\) is zero, but it is uncorrelated with \(\Z\), and so has no effect on the regression. This is an instance where it’s more useful to analyze OLS using the assumption that errors are uncorrelated rather than independent.
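Here is a minimal simulated sketch (made-up numbers) of the omitted-variable calculation: the true model contains both \(\Z\) and \(\X\), we regress on \(\Z\) alone, and the bias appears only when \(\Z\) and \(\X\) are correlated:

# Minimal sketch with simulated data: omitted-variable bias.
set.seed(42)
n <- 100000
gamma <- 1.0   # true coefficient on z
beta  <- 2.0   # true coefficient on the omitted regressor x

run_sim <- function(rho) {
  z <- rnorm(n)
  x <- rho * z + sqrt(1 - rho^2) * rnorm(n)  # rho controls how correlated x is with z
  y <- gamma * z + beta * x + rnorm(n)
  coef(lm(y ~ z - 1))["z"]                   # regress y on z alone, no intercept
}

run_sim(rho = 0.0)  # x roughly orthogonal to z: estimate is close to gamma = 1
run_sim(rho = 0.7)  # x correlated with z: estimate is badly biased (about 1 + 0.7 * 2)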

TODO: I expanded on this in lecture but did not write it up here. In particular, write out how highly correlated omitted variables are good for prediction but bad for inference.

Homoskedastic residuals

The next assumption is more common in econometrics, and it simply assumes that the residuals all have the same variance.

Homoskedastic residuals assumption

Under the linear expectation assumption, assume additionally that \(\var{\res_n \vert \xv_n} = \sigma_\res^2\) for all \(\xv_n\); that is, the residual variance is constant.

Such residuals are called “homoskedastic” for “same randomness.”

Variance under homoskedastic residuals

The homoskedastic residuals assumption leads to a particularly simple form for the covariance of \(\betavhat\), which is more complicated under heteroskedasticity. Using our result above,

\[ \begin{aligned} \cov{\betavhat \vert \X} ={}& \expect{\left( \betavhat - \betav\right) \left( \betavhat - \betav\right)^\trans \vert \X} \\={}& \expect{\left( (\X^\trans \X)^{-1} \X^\trans \resv \right) \left( (\X^\trans \X)^{-1} \X^\trans \resv \right)^\trans \vert \X} \\={}& (\X^\trans \X)^{-1} \X^\trans \expect{\resv \resv^\trans \vert \X} \X (\X^\trans \X)^{-1} \\={}& (\X^\trans \X)^{-1} \X^\trans \sigma^2_\res \id \X (\X^\trans \X)^{-1} \\={}& \sigma^2_\res (\X^\trans \X)^{-1} \X^\trans \id \X (\X^\trans \X)^{-1} \\={}& \sigma^2_\res (\X^\trans \X)^{-1}. \end{aligned} \]
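As a sanity check (simulated data, made-up numbers), we can hold \(\X\) fixed across many draws of homoskedastic residuals and compare the empirical covariance of \(\betavhat\) with \(\sigma^2_\res (\X^\trans \X)^{-1}\):

# Minimal sketch with simulated data: X held fixed across replications,
# homoskedastic residuals.  The sampling covariance of betahat should match
# sigma^2 * solve(t(X) %*% X).
set.seed(42)
n <- 100
sigma <- 1.5
X <- cbind(1, runif(n, 0, 10))
beta <- c(1, 0.3)

betahats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  as.vector(solve(t(X) %*% X, t(X) %*% y))   # the OLS estimate
})

cov(t(betahats))              # empirical covariance across draws
sigma^2 * solve(t(X) %*% X)   # theoretical covariance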

Unlike the expectation, it is no longer the case that \(\cov{\betavhat}\) has a simple form marginally over \(\X\), because, in general,

\[ \expect{(\X^\trans \X)^{-1}} \ne \expect{\X^\trans \X}^{-1}. \]

A simple special case is for univariate \(\x_n\), for which it is hopefully familiar that

\[ \expect{\frac{1}{\x_n^2}} \ne \frac{1}{\expect{\x_n^2}} \]

unless \(\var{\x_n} = 0\). This is one of the “mathematical conveniences” that motivate treating \(\xv_n\) as fixed rather than random when analyzing OLS — see below for more discussion.
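A quick numeric check (with an arbitrary, made-up distribution for \(\x_n\)) that the two quantities really do differ:

# E[1/x^2] is not 1/E[x^2] unless x is constant.
set.seed(42)
x <- runif(1e6, 1, 3)
mean(1 / x^2)    # approximately E[1/x^2]  (about 0.33 here)
1 / mean(x^2)    # approximately 1/E[x^2]  (about 0.23 here)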

Gaussian residuals

Finally, we come to the most classic assumption: fully Gaussian residuals.

Gaussian residuals assumption

Under the linear expectation assumption, assume additionally that \(\res_n \sim \gauss{0, \sigma^2}\).

This assumption is, of course, the least likely to hold in practice. However, it’s also the easiest to analyze — we’ll be able to derive closed–form expressions for the behavior of many OLS test statistics. This is also the assumption that is made implicitly by a lot of standard statistical software, including the lm function from R, so it is important to understand its implications.

Normality under Gaussian residuals

Under the Gaussian assumption, since \(\y_n = \betav^\trans \xv_n + \res_n\), \(\y_n\) is also Gaussian conditional on \(\xv_n\). Moreover, \(\betavhat\) is Gaussian conditional on \(\X\), since it is itself a linear combination of Gaussian errors:

\[ \betavhat = \betav + (\X^\trans \X)^{-1} \X^\trans \resv \quad\Rightarrow\quad \betavhat \vert \X \sim \gauss{\betav, \sigma^2 (\X^\trans \X)^{-1}}. \]
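Again as a simulated sketch (made-up numbers): with \(\X\) fixed and Gaussian residuals, the sampling distribution of a single coefficient of \(\betavhat\) should match the corresponding Gaussian:

# Minimal sketch with simulated data: the slope estimate, over repeated draws of
# Gaussian residuals with X fixed, should be Gaussian with the theoretical variance.
set.seed(42)
n <- 50
sigma <- 2
X <- cbind(1, runif(n, 0, 5))
beta <- c(1, -0.4)

slope_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  coef(lm(y ~ X[, 2]))[2]
})

theory_sd <- sigma * sqrt(solve(t(X) %*% X)[2, 2])
c(mean = mean(slope_hats), sd = sd(slope_hats), theory_sd = theory_sd)
qqnorm(slope_hats)  # should look close to a straight line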

Independent or uncorrelated residuals?

Note that under all these assumptions, the residuals are assumed to be independent of one another. That’s not strictly necessary — if you closely examine the proofs to come, you will see that often pairwise uncorrelated residuals will be enough. However, for simplicity, I’ll stay with the assumption of independent residuals in this class; it will hopefully be clear enough where that can be weakened if necessary.

Fixed or random regressors?

Above, it’s clear that \(\y_n\) is considered random (and, correspondingly, so are the residuals \(\res_n\)). What about \(\xv_n\) for the linear expectation assumption onwards? Is it random or fixed?

Depending on the setting, it might be reasonable to model \(\xv_n\) as either random or fixed. For example, if \(\xv_n\) are part of a systematic design, such as an evenly spaced set of weights for which you will measure the deflection of a spring, then it makes sense to think of \(\xv_n\) as fixed. However, if your data are samples from some larger population, such as data about mothers’ smoking habits and baby birth weight, then it might make sense to think of \(\xv_n\) as random along with \(\y_n\).

However, mathematically speaking,

  • It is usually easier to think of \(\xv_n\) as fixed, and
  • It would rarely affect our substantive conclusions, since the variance of the residuals dominates the variability of the regressors. (This may not be obvious, even later in the course, but I assure you it is true.)

For these reasons, I will make the following assumption for simplicity:

Fixed regressor assumption

Under the linear expectation, homoskedastic, and Gaussian assumptions, I will assume that the regressors \(\X\) are fixed unless I say explicitly otherwise. In that case, by conditioning, I simply mean “for that value of \(\xv_n\),” as in \(\expect{\y_n \vert \xv_n} = \betav^\trans \xv_n\).

Assumptions in the Kleiber example

Recall the Kleiber example, in which the observations were animals, together with their body weights and metabolism rates. Which of these assumptions make sense?

Arguably, none of them:

  • This is not a “random sample from all animals,” even if you could make that sentence make sense
  • There are multiple observations for the same type of animal
  • There is no reason to believe a linear model is well specified
  • There is no reason to believe the errors are normal or even homoskedastic

The best you might hope for is to argue that the IID assumption characterizes something like “how much would my conclusion have changed if I had collected the same dataset again but with different randomly chosen animals of the same species?” But this is not very much like the real question, which is “how much might the slope of the regression line differ from 2/3 due only to the idiosyncrasies of this particular dataset?”