---
title: "Stochastic assumptions on the residual"
format:
html:
code-fold: false
code-tools: true
---
::: {.content-visible when-format="html"}
{{< include /macros.tex >}}
:::
# Goals
- Introduce statistical assumptions to OLS
- Hierarchies of assumptions
- Independent mean zero residuals
- IID residuals
- IID Normal residuals
- Bias in OLS
- Variance in OLS
- Fixed versus random regressors
- Meaning and criticism of assumptions
# Births dataset
```{r}
#| echo: false
#| output: false
library(tidyverse)
library(gridExtra)
library(quarto)
root_dir <- quarto::find_project_root()
births_df <- read.csv(file.path(root_dir, "datasets/births/births14.csv"))
```
Recall our births dataset, which contains observational data on
many aspects of a pregnancy, including whether the mother was a smoker. Let's
suppose we're interested in seeing whether there is evidence that smoking
affects a baby's birth weight negatively.
The histograms are suggestive:
```{r}
#| echo: false
births_df %>%
  filter(!is.na(habit)) %>%
  ggplot() +
  geom_histogram(aes(x=weight, fill=habit, y=after_stat(density), color=habit),
                 alpha=0.2, position="identity", bins=40)
```
But we might reasonably want to try to control for some of the many additional
variables using regression. For instance, we might run the regression:
```{r}
lm_inter <- lm(weight ~ mage + habit + (whitemom + sex + visits + gained)^2, births_df)
summary(lm_inter)$coefficients["habitsmoker", "Estimate"]
```
That's a pretty large and negative estimated effect. But we might want to ask
a number of critical questions:
- Could such a large effect be due to chance, especially given how many regressors we controlled for?
- Can we quantify mathematically what goes wrong when we fail to include regressors we should?
- What happens if some of our regressors are observed with errors?
- How does the amount of data affect the accuracy of our results?
To answer these questions, we need some stochastic assumptions.
# Assumptions on the "true" model
In the next six weeks, we will explore a hierarchy of assumptions. Typically,
the more you assume, the more precise you can be about the behavior of linear regression!
The word "assumptions," though common, is not really the right word. Typically
an "assumption" is something you accept as true. Our assumptions will in fact
often be unreasonable approximations to reality. A
better word might be "conceit" (meaning
["an elaborate or strained metaphor"](https://www.merriam-webster.com/dictionary/conceit)). But
some starting point is necessary for statistical analysis, and we can proceed as if they were true, while
constantly (a) thinking about how things might change if our assumptions
were violated in plausible ways and (b) proactively looking for positive evidence that
our assumptions are actually not true.
To my mind, the substitution of an oversimplified but tractable
proxy for a prohibitively complex reality is not qualitatively
different from what we do with language all the time, especially in the
context of persuasive rhetoric. The real danger comes when, out of
ignorance, neglect, or a desire to fool yourself or others, the
proxy is taken too seriously.
### IID data
The loosest assumption is that $(\xv_n, \y_n)$ are IID pairs.
::: {.callout-note title="IID data assumption"}
Assume that the pairs $(\xv_n, \y_n)$ are IID, and that
both $\expect{\y_n \vert \xv_n}$ and $\var{\y_n \vert \xv_n}$ are finite.
:::
This is similar to a "machine learning" style assumption, where we don't
assume much about the relationship between $\x_n$ and $\y_n$, but
want to predict $\expect{\y_n \vert \xv_n}$ with some flexible learning
algorithm --- which could be linear regression.
### Linear expectation
The next strongest assumption is this:
::: {.callout-note title="Linear expectation (LE) assumption"}
Under the IID data assumption,
assume additionally that there exists a $\betav$ such that, for all $\xv_n$,
$\expect{\y_n \vert \xv_n} = \betav^\trans \xv_n$. Equivalently, we
can write $\expect{\Y \vert \X} = \X \betav$.
:::
This is a strong assumption! It is unlikely to be actually true in most
practical settings. However, something like this is necessary if we're to evaluate
how well OLS recovers a "true" $\betav$.
Under this assumption, we can always define
$$
\res_n = \y_n - \betav^\trans \xv_n = \y_n - \expect{\y_n \vert \xv_n},
$$
which must satisfy
$$
\expect{\res_n \vert \xv_n} = 0
\quad\Rightarrow\quad
\expect{\res_n} = \expect{\expect{\res_n \vert \xv_n}} = 0.
$$
This is equivalent to saying that there exists $\betav$ such that
$\y_n = \betav^\trans \xv_n + \res_n$ with $\expect{\res_n \vert \xv_n} = 0$.
Since this assumption allows $\var{\res_n \vert \xv_n}$ to vary with $\xv_n$,
the residuals are sometimes called "heteroskedastic," meaning "different randomness,"
in contrast with the next assumption.
#### Unbiasedness under the LE assumption
Amazingly, the LE assumption is enough to prove the unbiasedness of
OLS under correct specification. Here is a proof:
$$
\begin{aligned}
\betavhat ={}& (\X^\trans \X)^{-1} \X^\trans \Y
\\={}& (\X^\trans \X)^{-1} \X^\trans (\X \betav + \resv)
\\={}& (\X^\trans \X)^{-1} \X^\trans \X \betav + (\X^\trans \X)^{-1} \X^\trans \resv
\\={}& \betav + (\X^\trans \X)^{-1} \X^\trans \resv
\quad\Rightarrow\\
\expect{\betavhat \vert \X} ={}&
\betav + (\X^\trans \X)^{-1} \X^\trans \expect{\resv \vert \X}
={} \betav
\quad\Rightarrow\\
\expect{\betavhat} ={}& \expect{\expect{\betavhat \vert \X}} = \betav.
\end{aligned}
$$
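As a sanity check, here is a small simulation sketch; the design and variable names are mine and purely illustrative. Even with heteroskedastic residuals, which the LE assumption allows, the average of $\betavhat$ over repeated datasets should land on the true $\betav$.

```{r}
# Illustrative simulation: heteroskedastic residuals, but the LE assumption
# holds, so the OLS estimate should be unbiased.
set.seed(42)
n_obs <- 200
beta_true <- c(1, -0.5)
x_sim <- cbind(1, runif(n_obs, 0, 2))                    # fixed design with intercept
betahat_draws <- replicate(2000, {
  eps <- rnorm(n_obs, mean=0, sd=0.5 + x_sim[, 2])       # residual sd varies with x
  y_sim <- x_sim %*% beta_true + eps
  drop(solve(t(x_sim) %*% x_sim, t(x_sim) %*% y_sim))    # OLS estimate
})
rowMeans(betahat_draws)   # should be close to beta_true = (1, -0.5)
```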
#### Omitted variables under the LE assumption
The LE assumption also helps make it clear what happens when you *fail* to include
variables which you should have included.
Suppose that the true model is $\Y = \X \betav + \Z \gammav + \resv$
under the LE assumption, but we run the regression $\Y \sim \Z \alphav$. How
different is $\alphavhat$ from the true $\gammav$?
We can repeat the steps above to get
$$
\begin{aligned}
\alphavhat ={}& (\Z^\trans \Z)^{-1} \Z^\trans \Y
\\={}& (\Z^\trans \Z)^{-1} \Z^\trans (\X \betav + \Z \gammav + \resv)
\\={}& \gammav + (\Z^\trans \Z)^{-1} \Z^\trans \X \betav +
(\Z^\trans \Z)^{-1} \Z^\trans \resv
\quad\Rightarrow\\
\expect{\alphavhat \vert \X, \Z} ={}&
\gammav + (\Z^\trans \Z)^{-1} \Z^\trans \X \betav.
\end{aligned}
$$
We see that $\alphavhat$ is biased as an
estimator of $\gammav$ unless $\Z^\trans \X \betav = \zerov$. This term
can be zero in two ways:
1. If $\betav = 0$. This happens when $\X$ actually has no effect on $\Y$
in the true model, and so can be safely omitted.
1. If $\Z^\trans \X = \zerov$. This happens when $\X$ is orthogonal to
$\Z$.
One way to think about the $\Z^\trans \X = \zerov$ case is as follows: you could
incorporate $\X \betav$ into a new residual, $\etav = \resv + \X \betav$, in the model
$\Y = \Z \alphav + \etav$. It is no longer the case that $\expect{\etav \vert \Z}$ is
zero, but $\etav$ is *uncorrelated with* $\Z$, and so has no effect on the regression. This
is an instance where it's more useful to analyze OLS using the assumption that errors are
uncorrelated rather than independent.
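To see the omitted-variable result in action, here is a small simulation sketch. The design, coefficients, and variable names below are mine and purely illustrative (not from the births data): $\X$ is correlated with $\Z$, so the regression of $\Y$ on $\Z$ alone should be biased by $(\Z^\trans \Z)^{-1} \Z^\trans \X \betav$ on average.

```{r}
# Illustrative omitted-variable simulation (all names and values made up).
set.seed(42)
n_obs <- 500
z_sim <- cbind(1, rnorm(n_obs))
x_sim <- matrix(0.8 * z_sim[, 2] + rnorm(n_obs, sd=0.6))  # correlated with z
beta_true <- 2
gamma_true <- c(1, -1)
alphahat_draws <- replicate(2000, {
  eps <- rnorm(n_obs)
  y_sim <- x_sim %*% beta_true + z_sim %*% gamma_true + eps
  drop(solve(t(z_sim) %*% z_sim, t(z_sim) %*% y_sim))   # regress y on z only
})
rowMeans(alphahat_draws)                                 # biased away from gamma_true
# Predicted conditional expectation: gamma + (Z'Z)^{-1} Z'X beta
gamma_true + drop(solve(t(z_sim) %*% z_sim, t(z_sim) %*% x_sim %*% beta_true))
```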
### Homoskedastic residuals
The next assumption is more common in econometrics, and it simply assumes
that the residuals all have the same variance.
::: {.callout-note title="Homoskedastic residuals assumption"}
Under the linear expectation assumption, assume additionally that
$\var{\res_n \vert \xv_n} = \sigma_\res^2$ for all $\xv_n$; that is,
the residual variance is constant.
:::
Such residuals are called "homoskedastic" for "same randomness."
#### Variance under homoskedastic residuals
The homoskedastic residuals assumption leads to a particularly simple
form for the *covariance* of $\betavhat$, which is more complicated
under heteroskedasticity. Using our result above,
$$
\begin{aligned}
\cov{\betavhat \vert \X} ={}&
\expect{\left( \betavhat - \betav\right) \left( \betavhat - \betav\right)^\trans \vert \X}
\\={}&
\expect{\left( (\X^\trans \X)^{-1} \X^\trans \resv \right)
\left( (\X^\trans \X)^{-1} \X^\trans \resv \right)^\trans \vert \X}
\\={}&
(\X^\trans \X)^{-1} \X^\trans
\expect{\resv \resv^\trans \vert \X}
\X (\X^\trans \X)^{-1}
\\={}&
(\X^\trans \X)^{-1} \X^\trans
\sigma^2_\res \id
\X (\X^\trans \X)^{-1}
\\={}&
\sigma^2_\res (\X^\trans \X)^{-1} \X^\trans
\id
\X (\X^\trans \X)^{-1}
\\={}&
\sigma^2_\res (\X^\trans \X)^{-1}.
\end{aligned}
$$
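As a quick numerical check, here is a sketch with made-up data (not the births regression): `vcov` applied to an `lm` fit returns the formula above with $\sigma^2_\res$ replaced by the usual estimate of the residual variance, so the two should agree up to floating point.

```{r}
# Quick check on illustrative data: vcov(fit) equals sigmahat^2 * (X'X)^{-1},
# where sigmahat^2 = RSS / (n - p).
set.seed(42)
n_obs <- 100
x_design <- cbind(1, rnorm(n_obs), runif(n_obs))
y_sim <- drop(x_design %*% c(1, 2, -1)) + rnorm(n_obs, sd=0.7)
fit <- lm(y_sim ~ x_design - 1)                  # x_design already has an intercept column
sigmahat2 <- sum(residuals(fit)^2) / df.residual(fit)
max(abs(vcov(fit) - sigmahat2 * solve(t(x_design) %*% x_design)))  # essentially zero
```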
Unlike the expectation, it is no longer the case that $\cov{\betavhat}$ has a simple
form marginally over $\X$, because, in general,
$$
\expect{(\X^\trans \X)^{-1}} \ne \expect{\X^\trans \X}^{-1}.
$$
A simple special case is for univariate $\x_n$, for which it is hopefully familiar
that
$$
\expect{\frac{1}{\x_n^2}} \ne \frac{1}{\expect{\x_n^2}}
$$
unless $\var{\x_n} = 0$. This is one of the "mathematical conveniences" that motivate
treating $\xv_n$ as fixed rather than random when analyzing OLS --- see below for
more discussion.
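A tiny numerical illustration of this point, using an arbitrary distribution of my own choosing for $\x_n$:

```{r}
# E[1/x^2] versus 1/E[x^2] for x uniform on (1, 2): the two differ.
set.seed(42)
x_draws <- runif(1e6, 1, 2)
c(mean(1 / x_draws^2), 1 / mean(x_draws^2))
```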
### Gaussian residuals
Finally, we come to the most classic assumption: fully Gaussian residuals.
::: {.callout-note title="Gaussian residuals assumption"}
Under the linear expectation assumption, assume additionally that
$\res_n \sim \gauss{0, \sigma^2}$.
:::
This assumption is, of course, the least likely to hold in practice. However,
it's also the easiest to analyze --- we'll be able to derive closed-form
expressions for the behavior of many OLS test statistics. This is also the
assumption that is made implicitly by a lot of standard statistical software,
including the `lm` function from `R`, so it is important to understand its
implications.
#### Normality under Gaussian residuals
Under the Gaussian assumption, since $\y_n = \betav^\trans \xv_n + \res_n$,
$\y_n$ is also Gaussian conditional on $\xv_n$. Even more,
$\betavhat$ is Gaussian conditional on $\X$, since it is itself a linear
combination of Gaussian errors:
$$
\betavhat = \betav + (\X^\trans \X)^{-1} \X^\trans \resv
\quad\Rightarrow\quad
\betavhat \sim \gauss{\betav, \sigma^2 (\X^\trans \X)^{-1}}.
$$
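Here is a small simulation sketch of this fact, again with an illustrative design of my own rather than the births data: across repeated datasets with Gaussian residuals, the simulated slope estimates should look Gaussian, with variance matching $\sigma^2 (\X^\trans \X)^{-1}$.

```{r}
# Illustrative check: under Gaussian residuals, the slope estimate itself
# looks Gaussian across simulated datasets.
set.seed(42)
n_obs <- 50
x_fixed <- cbind(1, runif(n_obs))
sigma_eps <- 0.5
slope_draws <- replicate(2000, {
  y_sim <- drop(x_fixed %*% c(1, 2)) + rnorm(n_obs, sd=sigma_eps)
  coef(lm(y_sim ~ x_fixed[, 2]))[2]
})
qqnorm(slope_draws); qqline(slope_draws)   # should follow the line closely
# Empirical variance versus the (2,2) entry of sigma^2 (X'X)^{-1}:
c(var(slope_draws), sigma_eps^2 * solve(t(x_fixed) %*% x_fixed)[2, 2])
```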
## Independent or uncorrelated residuals?
Note that under all these assumptions, the residuals are assumed to be independent
of one another. That's not strictly necessary --- if you closely examine the
proofs to come, you will see that often pairwise uncorrelated residuals will be
enough. However, for simplicity, I'll stay with the assumption of independent
residuals in this class; it will hopefully be clear enough where that
can be weakened if necessary.
## Fixed or random regressors?
Above, it's clear that $\y_n$ is considered random (and, correspondingly,
so are the residuals $\res_n$). What about $\xv_n$ for the linear
expectation assumption onwards? Is it random or fixed?
Depending on the setting, it might be reasonable to model $\xv_n$
as either random *or* fixed. For example, if $\xv_n$ are part of a
systematic design, such as an evenly spaced set of weights for which you
will measure the deflection
of a spring, then it makes sense to think of $\xv_n$ as fixed. However,
if your data are samples from some larger population, such as data about
mothers' smoking habits and baby birth weight, then it might make sense
to think of $\xv_n$ as random along with $\y_n$.
However, mathematically speaking,
- It is usually easier to think of $\xv_n$ as fixed, and
- It would rarely affect our substantive conclusions, since the variance of
the residuals dominates the variability of the regressors. (This
may not be obvious, even later in the course, but I assure you it is true.)
For these reasons, I will make the following assumption for simplicity:
::: {.callout-note title="Fixed regressor assumption"}
Under the linear expectation, homoskedastic, and Gaussian assumptions,
I will assume that the regressors $\X$ are fixed unless I say explicitly
otherwise. In that case, by conditioning, I simply mean "for that value
of $\xv_n$," as in $\expect{\y_n \vert \xv_n} = \betav^\trans \xv_n$.
:::
# Assumptions in the Kleiber example
```{r}
#| echo: false
#| output: false
kleiber_df <-
read.csv(file.path(root_dir, "datasets/kleiber/kleiber.csv"))
```
```{r}
#| echo: false
ggplot(kleiber_df) +
geom_point(aes(x=Weight_kg, y=Metabol_kcal_per_day, color=Animal), size=3) +
scale_x_log10() + scale_y_log10()
```
Recall the Kleiber example, in which observations were animals,
weights, and metabolism rates. Which of these assumptions make sense?
Arguably, none of them:
- This is not a "random sample from all animals," even if you could
make that sentence make sense
- There are multiple observations for the same type of animal
- There is no reason to believe a linear model is well specified
- There is no reason to believe the errors are normal or even homoskedastic
The best you might hope for is to argue that the IID assumption characterizes
something like "how much would my conclusion have changed if I had collected
the same dataset again but with different randomly chosen animals of the same
species?" But this is not very much like the real question, which is
"how much might the slope of the regression line differ from 2/3 due only to the
idiosyncrasies of this particular dataset?"
<!--
Note to future instructors --- this needs more time and more detail to
be useful to the students.
-->