$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

The Gaussian assumption


Goals

  • Apply the LLN to OLS and find that, at first, the result is not very informative
  • Propose a hierarchy of structural assumptions to make the result more interpretable
  • Propose a Gaussian assumption (to start), and show what we need to estimate to make predictions

Setup

We’ve developed some linear algebra tools and some probability tools. Now let’s put them together to understand linear regression.

Suppose we have a prediction problem. We’re going to use the training data to fit \(\betavhat\), and predict a new value using

\[ \yhat_\new = \xv_\new^\trans \betavhat, \]

hoping that \(\yhat_\new \approx \y_\new\), where we do not get to see \(\y_\new\) (at least, not until after we’ve made our prediction).

For now, we will assume that \((\xv_n, \y_n)\) for both the training data and test points are drawn IID from some unknown distribution.
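
As a minimal sketch of this setup in numpy: the particular `beta_true`, noise scale, and dimensions below are illustrative assumptions made only for this example, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative training data; beta_true and the noise scale are assumptions
# made only for this sketch.
N, P = 200, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, P))
Y = X @ beta_true + rng.normal(scale=0.3, size=N)

# Fit betahat by least squares (lstsq is numerically preferable to forming
# (X^T X)^{-1} explicitly).
betahat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict at a new regressor; the corresponding y_new is not observed.
x_new = rng.normal(size=P)
yhat_new = x_new @ betahat
```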

We will be interested in two key classes of question:

  1. Given our fit, how much error do we expect to make in prediction?
  2. How might our fit have been different had we gotten different (random) training data?

Roughly speaking, these two classes of question will correspond to two different notions of what is random:

  1. Given our estimate \(\betavhat\), what is the typical variability of \(\y_\new - \yhat_\new\)?
  2. What is the variability of \(\betavhat\) under new samples of the training data \(\X\) and \(\Y\)?

Recall that \(\betavhat\) is a function of the training data. So, in turn, the prediction \(\yhat_\new = \xv_\new^\trans \betavhat\) is a function of the training data and the new regressor. The prediction error \(\y_\new - \yhat_\new\) is a function of the training data, the new regressor, and the new, unseen response.

So for the first question, it makes sense to think of the training data as fixed, and only the new response as random. For the second question, it makes sense to think of the training data as random. One might combine the two to ask

  1. What is my uncertainty about the value of \(\y_\new\), given \(\xv_\new\)?

Taking random variability as a proxy for epistemic uncertainty, it might make sense to think about the randomness in both the training data and the new data point’s response. (We will be very careful about what precisely this means.)

Finally, we might also ask, before even seeing the regressor,

  1. What is my uncertainty about the value of the next \(\y_\new\) I’ll see?

…for which you’ll want to take into account the variability in the regressors.

The large-sample behavior of the regression coefficient

The key ingredient to (approximately) answering all of these questions will in fact be the large-sample behavior of \(\betavhat\). Recall that

\[ \betavhat = (\X^\trans\X)^{-1} \X^\trans\Y = (\frac{1}{N} \X^\trans\X)^{-1} \frac{1}{N} \X^\trans\Y. \]

We already have the tools we need to describe the large \(N\) behavior of \(\betavhat\), but without further assumptions it won’t (at first) seem very useful. Assume that \(\cov{\xv_n} = \Xcov\) is finite, positive definite, and the same for all \(\xv_n\). Then, by the LLN for matrices,

\[ \frac{1}{N} \X^\trans\X = \meann \xv_n \xv_n^\trans \rightarrow \Xcov. \]

Furthermore, because matrix inversion \(\A \mapsto \A^{-1}\) is continuous at \(\Xcov\) (since \(\Xcov\) is positive definite), the continuous mapping theorem gives

\[ \left( \frac{1}{N} \X^\trans\X \right)^{-1} \rightarrow \Xcov^{-1}. \]
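
Here is a quick numerical check of this matrix LLN. It is only a sketch, with an assumed \(\Xcov\) and mean-zero Gaussian regressors (so that \(\expect{\xv_n \xv_n^\trans}\) and \(\cov{\xv_n}\) coincide).

```python
import numpy as np

rng = np.random.default_rng(1)

# An assumed population covariance for this sketch; any positive definite
# matrix would do.
Sigma_x = np.array([[1.0, 0.3],
                    [0.3, 2.0]])

for N in [100, 10_000, 1_000_000]:
    # Mean-zero regressors, so E[x x^T] equals the covariance Sigma_x.
    X = rng.multivariate_normal(np.zeros(2), Sigma_x, size=N)
    second_moment = X.T @ X / N
    print(N, np.max(np.abs(second_moment - Sigma_x)))
# The entrywise error shrinks roughly like 1 / sqrt(N).
```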

Exercise

What do you think is the large-sample behavior of \((\frac{1}{N} \X^\trans\X)^{-1}\) if \(\cov{\xv_n}\) is degenerate, i.e., has a non-empty nullspace?

Next, note that \(\xv_n \y_n\) is itself a random vector in \(\rdom{P}\) (since \(\y_n\) is a scalar), and that \(\y_n\) is potentially not independent of \(\xv_n\). (In fact, since we’re doing regression, it had better not be independent of \(\xv_n\).)

So if we assume that \(\cov{\xv_n \y_n}\) is finite and the same for all \(n\), then by the LLN for vectors,

\[ \frac{1}{N} \X^\trans \Y = \meann \xv_n \y_n \rightarrow \expect{\xv_1 \y_1}. \]

(Here I have used \(\xv_1 \y_1\) to emphasize that the observations are IID.) It follows that

\[ \betavhat \rightarrow \Xcov^{-1} \expect{\xv_1 \y_1}. \]

In this form, the result doesn’t look very useful. To make it useful, it will help to make some assumptions about the relationship between \(\y_n\) and \(\xv_n\).
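
Before adding any such assumptions, it is at least easy to verify this limit numerically. The sketch below assumes a data-generating process for which the population quantities are known in closed form: mean-zero regressors with \(\Xcov = \id\) and \(\y_n = \xv_n^\trans \betav + \res_n\) with independent noise, so that \(\Xcov^{-1} \expect{\xv_1 \y_1} = \betav\).

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([2.0, -1.0, 0.0, 3.0])  # assumed for this sketch

for N in [100, 10_000, 1_000_000]:
    X = rng.normal(size=(N, len(beta_true)))   # mean zero, Sigma_X = I
    Y = X @ beta_true + rng.normal(size=N)     # so E[x1 y1] = beta_true
    betahat = np.linalg.solve(X.T @ X, X.T @ Y)
    # betahat should approach Sigma_X^{-1} E[x1 y1] = beta_true as N grows.
    print(N, np.max(np.abs(betahat - beta_true)))
```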

A hierarchy of assumptions

In the coming weeks, we will (roughly) study the following hierarchy of assumptions.

  • Gaussian assumption: Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\res_n \sim \gauss{0, \sigma^2}\), and \(\res_n\) is independent of \(\xv_n\).
  • Homoskedastic assumption: Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\expect{\res_n} = 0\), \(\var{\res_n} = \sigma^2 < \infty\), and \(\res_n\) is independent of \(\xv_n\).
  • Heteroskedastic assumption: Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\expect{\res_n | \xv_n} = 0\) and \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\).
  • Machine learning assumption: Assume that \(\y_n = f(\xv_n) + \res_n\) for some \(f\), where \(\expect{\res_n | \xv_n} = 0\) and \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\).
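
To make the differences concrete, here is one way to simulate a dataset under each assumption; the particular \(\betav\), noise scales, and nonlinear \(f\) below are illustrative choices, not part of the definitions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 2
beta = np.array([1.0, -0.5])   # assumed coefficient vector
X = rng.normal(size=(N, P))

# Gaussian assumption: iid N(0, sigma^2) residuals, independent of x_n.
y_gauss = X @ beta + rng.normal(scale=1.0, size=N)

# Homoskedastic assumption: mean-zero residuals with a common variance,
# but not necessarily Gaussian (here, uniform noise scaled to variance 1).
y_homo = X @ beta + rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=N)

# Heteroskedastic assumption: the residual variance may depend on x_n.
sigma_n = 0.5 + np.abs(X[:, 0])
y_hetero = X @ beta + sigma_n * rng.normal(size=N)


def f(x):
    # An arbitrary nonlinear regression function, assumed for illustration.
    return np.sin(x[:, 0]) + x[:, 1] ** 2


# "Machine learning" assumption: y_n = f(x_n) + noise for some unknown f.
y_ml = f(X) + rng.normal(scale=0.3, size=N)
```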

Broadly speaking,

  • The earlier assumptions are more concrete and so, in a sense, easier to analyze.
  • The later assumptions tend to be more abstract and so, in a sense, harder to analyze.
  • The earlier assumptions are less realistic for many cases, especially prediction problems.
  • The later assumptions are more realistic, especially the last one, which, as we will see, is little more than a tautology for IID samples (other than the finite variance condition).
  • The earlier assumptions were studied earlier in the history of statistics, and are most common in introductory textbooks.
  • The later assumptions were studied relatively more recently (though are still very classical), and are less common in introductory textbooks.

I plan to start at the top of this list and work down.

My goal for this class is for you to feel comfortable choosing assumptions appropriate to the task at hand, without thinking of one set of assumptions as “linear regression” per se.

For each set of assumptions, the core technical tools that we have been developing — linear algebra and probability — will all be the same, just applied in different ways.

The Gaussian assumption

Let’s begin by assessing the prediction error under the Gaussian assumption. Specifically, for the moment, we’ll assume:

Gaussian assumption, fixed regressors

Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\res_n \sim \gauss{0, \sigma^2}\), the \(\res_n\) are IID, and the regressors \(\xv_n\) are fixed (not random).

This is about the most restrictive assumption you can imagine making. It’s also the one under which you can get some very clear results.

In this model, the residuals are random but the regressors are fixed.
Additionally, the vector \(\betav\) is viewed as fixed (but unknown). That means that \(\Y = \X \betav + \resv\) is random; specifically, \[ \Y \sim \gauss{\X \betav, \sigma^2 \id{}_N}. \]

Furthermore, if we observe a new data point, \(\y_\new = \xv_\new^\trans \betav + \res_\new\), then \(\res_\new\) is random, and \(\y_\new\) in turn is random, but \(\xv_\new\) and \(\betav\) are not.

However, since we have \[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y, \]

our estimator \(\betavhat\) is random, unless we condition on (fix) the training data.
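
The randomness of \(\betavhat\) in this model is easy to see by simulation: fix \(\X\) once, redraw \(\Y = \X\betav + \resv\) many times, and recompute \(\betavhat\) each time. This is only a sketch; the particular \(\betav\), \(\sigma\), and design below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, sigma = 100, 2, 0.5
beta = np.array([1.0, 2.0])     # assumed "true" coefficients
X = rng.normal(size=(N, P))     # fixed regressors, drawn once and held fixed

betahats = []
for _ in range(2000):
    # Only the residuals (and hence Y) are redrawn; X and beta stay fixed.
    Y = X @ beta + sigma * rng.normal(size=N)
    betahats.append(np.linalg.solve(X.T @ X, X.T @ Y))
betahats = np.array(betahats)

print(betahats.mean(axis=0))    # close to beta
print(np.cov(betahats.T))       # the sampling covariance of betahat
```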

Prediction error with the Gaussian assumption

Under our Gaussian assumption with fixed regressors, we can write

\[ \begin{aligned} \y_\new ={}& \xv_\new^\trans \betav + \res_\new\\ \yhat_\new ={}& \xv_\new^\trans \betavhat \Rightarrow \\ \y_\new - \yhat_\new ={}& \xv_\new^\trans (\betav - \betavhat) + \res_\new. \end{aligned} \]

From this we can see that the error \(\y_\new - \yhat_\new\) is normally distributed. In particular, if we condition on (fix) the training data, and so fix \(\betavhat\), then

\[ \y_\new - \yhat_\new | \X, \Y \sim \gauss{\xv_\new^\trans (\betav - \betavhat), \sigma^2}. \]
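
A simulation sketch of this conditional statement: fit once on a single training set (so \(\betavhat\) is fixed), fix \(\xv_\new\), and redraw only \(\res_\new\). The values of \(\betav\), \(\sigma\), and \(\xv_\new\) below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, P, sigma = 100, 2, 0.7
beta = np.array([1.0, -1.0])                  # assumed truth
X = rng.normal(size=(N, P))                   # one fixed training set
Y = X @ beta + sigma * rng.normal(size=N)
betahat = np.linalg.solve(X.T @ X, X.T @ Y)   # fixed given (X, Y)

x_new = np.array([0.5, 2.0])
errors = []
for _ in range(5000):
    y_new = x_new @ beta + sigma * rng.normal()   # only eps_new is redrawn
    errors.append(y_new - x_new @ betahat)
errors = np.array(errors)

# The mean should be near x_new @ (beta - betahat) and the std near sigma.
print(errors.mean(), x_new @ (beta - betahat))
print(errors.std(), sigma)
```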

Unfortunately, we know neither \(\betav\) nor \(\res_\new\). In fact, we don’t even know \(\sigma^2\), the variance of \(\res_\new\).

However, under our assumptions, we can show that \(\betavhat \rightarrow \betav\). That will mean that \(\xv_\new^\trans (\betav - \betavhat) \approx 0\) for large \(N\). Moreover, this fact will allow us to form an estimate of \(\sigma\) that is accurate for large \(N\).

We’ll prove \(\betavhat \rightarrow \betav\) a few different ways. In each case, the key will be to use the formula

\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y, \]

together with

\[ \Y = \X \betav + \resv, \]

where the Gaussian assumption means that

\[ \resv \sim \gauss{\zerov, \sigma^2 \id{}_N}, \]

which is an \(N\)–dimensional Gaussian random vector. Note that this is equivalent to

\[ \Y \sim \gauss{\X\betav, \sigma^2 \id{}_N}. \]
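
As a small worked step, substituting the second display into the first gives a decomposition that makes the role of the noise explicit:

\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \left( \X \betav + \resv \right) = \betav + (\X^\trans \X)^{-1} \X^\trans \resv, \]

so, with the regressors fixed, all of the randomness in \(\betavhat\) comes from the Gaussian vector \(\resv\).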