$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Implications of Gaussianity (and deviations from it)

\(\,\)

Goals

  • Leave the normal assumption behind
    • Derive limiting distributions of \(\betahat\) using the CLT
    • Show implications for the predictive distribution
    • Derive an assumption-free version of what OLS is estimating

Leaving the normal assumption

Up to now, we’ve been assuming that

  1. \(\y_n = \betav^\trans \xv_n + \res_n\) for some \(\betav\)
  2. The regressors \(\xv_n\) are fixed, \(\X\) is full-rank, and \(\meann \xv_n \xv_n^\trans \rightarrow \Xcov\) for positive definite \(\Xcov\)
  3. The residuals are distributed \(\res_n \sim \gauss{0, \sigma^2}\) IID

Under these assumptions, we were able to derive closed-form, finite-sample distributions for \(\betahat\) and \(\sigmahat^2\). We also showed that the behavior of these closed-form distributions matched what you’d expect from the LLN as \(N\) gets large.

Unfortunately, the normal assumption is often unreasonable in practice. So today we will modify the final assumption to

  3. The residuals \(\res_n\) are IID, independent of \(\xv_n\), with \(\expect{\res_n} = 0\) and \(\var{\res_n} = \sigma^2 < \infty\).

That is, we no longer assume that we know the full distribution of the \(\res_n\), but rather that the mean is zero and the variance finite and constant.
What changes under this more realistic assumption?
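
To make this concrete, here is a minimal simulation sketch (not from the notes) of the modified assumptions: the design is held fixed and the residuals are IID with mean zero and constant variance, but they are drawn from a centered exponential rather than a normal distribution. The dimensions, seed, and residual distribution are illustrative choices.

```python
import numpy as np

# Simulate the modified assumptions: a fixed, full-rank design and IID residuals
# with mean zero and constant variance, but drawn from a (non-normal) centered
# exponential distribution.
rng = np.random.default_rng(seed=0)
N, P = 500, 3
X = rng.uniform(-1.0, 1.0, size=(N, P))          # treated as fixed once drawn
beta = np.array([1.0, -2.0, 0.5])

eps = rng.exponential(scale=1.0, size=N) - 1.0   # mean 0, variance 1, skewed

y = X @ beta + eps
betahat = np.linalg.solve(X.T @ X, X.T @ y)      # OLS: (X'X)^{-1} X'y
print(betahat)    # close to beta despite the non-normal residuals
```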

Distribution of the OLS coefficients

We still have that

\[ \betahat = (\X^\trans \X)^{-1} \X^\trans \Y = (\X^\trans \X)^{-1} \X^\trans (\X \beta + \resv) = \beta + (\X^\trans \X)^{-1} \X^\trans \resv. \]

That means that \(\expect{\betahat} = \beta\), so our estimator is still unbiased. But the term \((\X^\trans \X)^{-1} \X^\trans \resv\) is no longer normal. It remains the case that

\[ \cov{(\X^\trans \X)^{-1} \X^\trans \resv} = (\X^\trans \X)^{-1} \X^\trans \expect{\resv\resv^\trans} \X (\X^\trans \X)^{-1} = \sigma^2 (\X^\trans \X)^{-1} \rightarrow \zerov. \]

This means that \(\betahat \rightarrow \beta\), as we expect from our earlier LLN proof of the consistency of \(\betahat\):

\[ \begin{aligned} \betahat - \beta ={}& (\X^\trans \X)^{-1} \X^\trans \resv \\ ={}& (\frac{1}{N} \X^\trans \X)^{-1} \frac{1}{N} \X^\trans \resv \\ ={}& (\meann \xv_n \xv_n^\trans)^{-1} \meann \xv_n \res_n. \end{aligned} \]

Now, \((\meann \xv_n \xv_n^\trans)^{-1} \rightarrow \Xcov^{-1}\) by the LLN and the continuous mapping theorem, and

\[ \meann \xv_n \res_n \rightarrow \expect{\xv_n \res_n} = \xv_n \expect{\res_n} = \zerov, \]

simply using the fact that \(\xv_n\) and \(\res_n\) are independent (\(\xv_n\) is still fixed) and \(\expect{\res_n} = 0\).

Although we don’t know the finite-sample distribution of \(\betahat - \beta\), the LLN points to a way to approximate the asymptotic distribution of \(\betahat - \beta\) via the CLT. Specifically, note that the terms \(\xv_n \res_n\) are independent but not identically distributed, with \(\expect{\xv_n \res_n} = \zerov\) and \(\cov{\xv_n \res_n} = \sigma^2 \xv_n \xv_n^\trans\). Noting that

\[ \meann \cov{\xv_n \res_n} = \sigma^2 \meann \xv_n \xv_n^\trans \rightarrow \sigma^2 \Xcov, \]

by the multivariate CLT, \[ \frac{1}{\sqrt{N}} \sumn \xv_n \res_n \rightarrow \gauss{0, \sigma^2 \Xcov}. \]

Thus, by the continuous mapping theorem,

\[ \sqrt{N}(\betahat - \beta) = (\meann \xv_n \xv_n^\trans)^{-1} \frac{1}{\sqrt{N}} \sumn \xv_n \res_n \rightarrow \Xcov^{-1} \RV{z} \quad\textrm{where}\quad \RV{\z} \sim \gauss{0, \sigma^2 \Xcov}. \]

Now, by properties of the multivariate normal,

\[ \Xcov^{-1} \RV{z} \sim \gauss{0, \sigma^2 \Xcov^{-1} \Xcov \Xcov^{-1}} = \gauss{0, \sigma^2 \Xcov^{-1}}, \]

so

\[ \sqrt{N}(\betahat - \beta) \rightarrow \gauss{0, \sigma^2 \Xcov^{-1}}. \]
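
As a sanity check, a small Monte Carlo sketch can compare the empirical covariance of \(\sqrt{N}(\betahat - \beta)\) with \(\sigma^2 \Xcov^{-1}\) under decidedly non-normal residuals. The Rademacher (\(\pm 1\)) residuals, dimensions, and seed below are my own illustrative choices, with \(\Xcov\) approximated by \(\frac{1}{N} \X^\trans \X\) for the fixed design.

```python
import numpy as np

# Monte Carlo check of sqrt(N) (betahat - beta) -> N(0, sigma^2 Sigma_X^{-1}),
# using Rademacher (+/-1) residuals so that sigma^2 = 1 but the errors are non-normal.
rng = np.random.default_rng(seed=1)
N, P, n_sims = 400, 2, 2000
X = rng.normal(size=(N, P))               # drawn once and then treated as fixed
beta = np.array([0.5, -1.0])
Sigma_X_hat = (X.T @ X) / N               # finite-N stand-in for Sigma_X
target_cov = np.linalg.inv(Sigma_X_hat)   # sigma^2 Sigma_X^{-1} with sigma^2 = 1

draws = np.empty((n_sims, P))
for s in range(n_sims):
    eps = rng.choice([-1.0, 1.0], size=N)
    betahat = np.linalg.solve(X.T @ X, X.T @ (X @ beta + eps))
    draws[s] = np.sqrt(N) * (betahat - beta)

print(np.cov(draws.T))    # empirical covariance of sqrt(N) (betahat - beta)
print(target_cov)         # should be close to the line above
```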

Plug-in estimators for the variance

Of course, in practice, we do not observe the terms in the variance \(\sigma^2 \Xcov^{-1}\). A natural solution is to plug in their consistent estimators,

\[ \begin{aligned} \sigmahat^2 \rightarrow \sigma^2 \quad\textrm{and}\quad \frac{1}{N} \X^\trans \X \rightarrow \Xcov. \end{aligned} \]

We thus say that

\[ \betahat - \beta \sim \gauss{0, \frac{1}{N} \sigmahat^2 \left( \frac{1}{N} \X^\trans \X \right)^{-1}} \quad\textrm{approximately for large }N. \]

Recall that, under normality, we had

\[ \betahat - \beta \sim \gauss{0, \frac{1}{N} \sigma^2 \left(\frac{1}{N} \X^\trans \X\right)^{-1}} \quad\textrm{exactly, under the normal assumption, for all }N. \]

We see that the CLT gives the same distribution for large \(N\); the only difference is that the normal distribution is now justified for large \(N\) by the CLT rather than by exact normality of the residuals. In this sense, the normal assumption is not essential for approximating the sampling distribution of \(\betahat - \beta\).
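
The practical consequence is that the usual plug-in intervals \(\betahat_p \pm z_{0.975} \sqrt{\sigmahat^2 [(\X^\trans \X)^{-1}]_{pp}}\) should have roughly nominal coverage for large \(N\) even without normal residuals. Below is a simulation sketch of that coverage check; the uniform residuals, sizes, and seed are illustrative choices, not from the notes.

```python
import numpy as np
from scipy import stats

# Coverage check for the plug-in normal intervals
#   betahat_p +/- z_{0.975} * sqrt(sigmahat^2 [(X'X)^{-1}]_{pp}),
# using uniform residuals (mean 0, variance 1) instead of normal ones.
rng = np.random.default_rng(seed=2)
N, P, n_sims = 300, 2, 2000
X = rng.normal(size=(N, P))
beta = np.array([1.0, 2.0])
z = stats.norm.ppf(0.975)

covered = np.zeros(P)
for _ in range(n_sims):
    eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=N)
    y = X @ beta + eps
    XtX_inv = np.linalg.inv(X.T @ X)
    betahat = XtX_inv @ X.T @ y
    sigmahat2 = np.mean((y - X @ betahat) ** 2)
    se = np.sqrt(sigmahat2 * np.diag(XtX_inv))
    covered += np.abs(betahat - beta) <= z * se

print(covered / n_sims)   # each entry should be close to the nominal 0.95
```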

Using the limiting distribution in the predictive distribution

Unfortunately, the normal assumption plays a much more important role in the predictive distribution. To see this, we can write as usual

\[ \y_\new - \yhat_\new = (\beta - \betahat)^\trans \xv_\new + \res_\new. \]

We can say that \((\beta - \betahat)^\trans \xv_\new\) is approximately normal for large \(N\) using the CLT. However, since the distribution of \(\res_\new\) is unknown, the distribution of \(\y_\new - \yhat_\new\) is unknown, even for large \(N\).

As a simple example, we could take

\[ \res_n = \begin{cases} 1 & \textrm{with probability }1/2\\ -1 & \textrm{with probability }1/2\\ \end{cases}. \]

These residuals satisfy the modified assumptions but are very non-normal, and normal prediction intervals will in general have poor coverage.
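
A quick simulation sketch makes the miscalibration concrete: with \(\pm 1\) residuals, a nominal 50% normal prediction interval has half-width of about \(0.674\,\sigmahat < 1\), so it almost never covers \(\y_\new\). The design, sizes, and seed below are illustrative choices.

```python
import numpy as np
from scipy import stats

# With +/-1 residuals, sigma^2 = 1 but a nominal 50% normal prediction interval has
# half-width z_{0.75} * sigmahat (about 0.674), which is smaller than |res_new| = 1,
# so the interval almost never contains y_new.
rng = np.random.default_rng(seed=3)
N, n_new = 1000, 5000
x = rng.normal(size=(N, 1))
beta = np.array([2.0])
y = x @ beta + rng.choice([-1.0, 1.0], size=N)

betahat = np.linalg.solve(x.T @ x, x.T @ y)
sigmahat = np.sqrt(np.mean((y - x @ betahat) ** 2))

x_new = rng.normal(size=(n_new, 1))
y_new = x_new @ beta + rng.choice([-1.0, 1.0], size=n_new)
half_width = stats.norm.ppf(0.75) * sigmahat

coverage = np.mean(np.abs(y_new - x_new @ betahat) <= half_width)
print(coverage)    # close to 0, nowhere near the nominal 0.50
```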

There are good methods for producing well-calibrated predictive intervals even in the case of severe non-normality, using only the assumption that \((\xv_n, \y_n)\) are IID. For interested students, I recommend starting with A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification by Anastasios N. Angelopoulos and Stephen Bates. If we have time, we will cover conformal inference towards the end of the course.
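
For a flavor of what such a distribution-free method looks like, here is a minimal split-conformal sketch (a preview, not material we have covered): fit OLS on half of the data, and use an empirical quantile of the absolute residuals on the held-out half as the interval half-width. The data-generating choices are illustrative.

```python
import numpy as np

# Split-conformal sketch: fit OLS on one half of the data and use the absolute
# residuals on the held-out half to choose the interval half-width. The resulting
# interval has coverage at least 1 - alpha assuming only that (x_n, y_n) are IID.
rng = np.random.default_rng(seed=4)
N, alpha = 2000, 0.2
x = rng.normal(size=(N, 1))
y = x @ np.array([2.0]) + rng.choice([-1.0, 1.0], size=N)   # the +/-1 residuals again

fit, cal = slice(0, N // 2), slice(N // 2, N)
betahat = np.linalg.solve(x[fit].T @ x[fit], x[fit].T @ y[fit])
scores = np.abs(y[cal] - x[cal] @ betahat)                   # calibration scores
n_cal = len(scores)
qhat = np.quantile(scores, np.ceil((1 - alpha) * (n_cal + 1)) / n_cal)

x_new = np.array([[0.3]])
print(x_new @ betahat - qhat, x_new @ betahat + qhat)        # conformal interval for y_new
```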

Limiting distribution of the variance estimator (bonus content)

We can apply the same limiting distribution trick to \(\sigmahat^2\) as well, though it is not particularly useful. Recall that

\[ \begin{aligned} \sigmahat^2 ={}& \meann \reshat_n^2 \\ ={}& \meann \left((\beta - \betahat)^\trans \xv_n + \res_n \right)^2 \\ ={}& \meann \res_n^2 + 2 (\beta - \betahat)^\trans \meann \res_n \xv_n + (\beta - \betahat)^\trans \meann \xv_n\xv_n^\trans (\beta - \betahat). \end{aligned} \]

We thus can write

\[ \begin{aligned} \sqrt{N} \sigmahat^2 ={}& \frac{1}{\sqrt{N}} \sumn \res_n^2 + 2 (\beta - \betahat)^\trans \frac{1}{\sqrt{N}} \sumn \res_n \xv_n + \sqrt{N}(\beta - \betahat)^\trans \meann \xv_n\xv_n^\trans (\beta - \betahat). \end{aligned} \]

The final two terms vanish as \(N\rightarrow\infty\): in each, the CLT shows that one factor (\(\frac{1}{\sqrt{N}} \sumn \res_n \xv_n\) or \(\sqrt{N}(\beta - \betahat)\)) converges in distribution, while the remaining factor involving \(\beta - \betahat \rightarrow \zerov\) drives the product to zero. Subtracting \(\sqrt{N} \sigma^2\) from both sides, only the centered first term survives, and applying the CLT to it gives

\[ \frac{1}{\sqrt{N}} \sumn (\res_n^2 - \sigma^2) \rightarrow \gauss{0, \var{\res_n^2}}, \]

assuming that \(\var{\res_n^2} < \infty\). It follows that

\[ \sqrt{N} \left( \sigmahat^2 - \sigma^2\right) \rightarrow \gauss{0, \v_\sigma} \quad\textrm{ where }\quad \v_\sigma := \var{\res_n^2}. \]
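
A small Monte Carlo sketch can check this limit: with residuals uniform on \((-\sqrt{3}, \sqrt{3})\) we have \(\sigma^2 = 1\) and \(\var{\res_n^2} = 0.8\), so the simulated variance of \(\sqrt{N}(\sigmahat^2 - \sigma^2)\) should be close to 0.8. The design, sizes, and seed are illustrative choices.

```python
import numpy as np

# Monte Carlo check of sqrt(N) (sigmahat^2 - sigma^2) -> N(0, Var(res^2)).
# With residuals uniform on (-sqrt(3), sqrt(3)): sigma^2 = 1 and Var(res^2) = 0.8.
rng = np.random.default_rng(seed=5)
N, P, n_sims = 500, 2, 5000
X = rng.normal(size=(N, P))
beta = np.array([1.0, -1.0])

stat_draws = np.empty(n_sims)
for s in range(n_sims):
    eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=N)
    y = X @ beta + eps
    betahat = np.linalg.solve(X.T @ X, X.T @ y)
    sigmahat2 = np.mean((y - X @ betahat) ** 2)
    stat_draws[s] = np.sqrt(N) * (sigmahat2 - 1.0)

print(np.var(stat_draws))   # should be close to Var(res^2) = 0.8
```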

However, this is not very useful because of how \(\sigmahat\) is used. Consider, for example, the problem of constructing intervals for the regression coefficients. Let

\[ \Xcovhat := \meann \xv_n\xv_n^\trans, \]

and consider the standardized quantity \[ \frac{1}{\sigmahat} \Xcovhat^{1/2} \sqrt{N}(\betahat - \beta) \approx \frac{1}{\sigma} \Xcov^{1/2} \sqrt{N} (\betahat - \beta) \rightarrow \gauss{\zerov, \id}. \]

Now,

\[ \frac{1}{\sigmahat} = \frac{1}{\sqrt{\sigmahat^2 - \sigma^2 + \sigma^2}} = \frac{1}{\sigma} \frac{1}{\sqrt{\left(\frac{\sigmahat^2}{\sigma^2} - 1 \right) + 1}}. \]

By the CLT, we know that \(\sqrt{N}\left(\frac{\sigmahat^2}{\sigma^2} - 1 \right)\) converges to a normal random variable. It follows that \(\left(\frac{\sigmahat^2}{\sigma^2} - 1 \right)\) is small for large \(N\), roughly as small as \(1 / \sqrt{N}\). How does its variability affect the variability of the preceding term? By series expanding the function \(\frac{1}{\sqrt{1 + z}} = (1 + z)^{-1/2}\) around \(z = 0\), we see that

\[ \frac{1}{\sqrt{1 + z}} \approx 1 - \frac{1}{2} (1 + 0)^{-3/2} (z - 0) = 1 - \frac{1}{2} z \quad\textrm{for small }z, \]

so, taking \(z = \frac{\sigmahat^2}{\sigma^2} - 1\),

\[ \frac{1}{\sigmahat} \approx \frac{1}{\sigma} \left( 1 - \frac{1}{2}\left(\frac{\sigmahat^2}{\sigma^2} - 1 \right) \right) \approx \frac{1}{\sigma} \left( 1 + C / \sqrt{N} \right), \]

where \(C\) is a random quantity of order one.

The variability induced by the randomness in \(\sigmahat\) is thus an order \(1/\sqrt{N}\) smaller than that induced by \(\betahat - \beta\), essentially because \(\sigmahat\) converges to the nonzero constant \(\sigma\).
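
This order-of-magnitude claim can be checked numerically. The sketch below (illustrative choices throughout, with a Cholesky factor standing in for \(\Xcovhat^{1/2}\)) measures how much the standardized statistic changes when \(\sigmahat\) is plugged in for \(\sigma\); the average change shrinks roughly like \(1 / \sqrt{N}\).

```python
import numpy as np

# Measure how much the standardized statistic changes when sigmahat is plugged in
# for sigma. The average change should shrink roughly like 1 / sqrt(N).
rng = np.random.default_rng(seed=6)
beta = np.array([1.0, -1.0])
P, n_sims = 2, 2000

for N in (100, 400, 1600):
    X = rng.normal(size=(N, P))                      # fixed design for this N
    sqrt_Xcovhat = np.linalg.cholesky((X.T @ X) / N) # a matrix square root of Xcovhat
    diffs = np.empty(n_sims)
    for s in range(n_sims):
        eps = rng.exponential(1.0, size=N) - 1.0     # sigma = 1, non-normal
        y = X @ beta + eps
        betahat = np.linalg.solve(X.T @ X, X.T @ y)
        sigmahat = np.sqrt(np.mean((y - X @ betahat) ** 2))
        t = sqrt_Xcovhat @ (np.sqrt(N) * (betahat - beta))
        diffs[s] = np.abs(t[0] / sigmahat - t[0])    # effect of sigmahat vs sigma = 1
    print(N, diffs.mean())   # roughly halves each time N quadruples
```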

A similar argument shows why, for large \(N\), the difference between \(\Xcovhat\) and \(\Xcov\) is negligible. It is perhaps for this reason that, other than the use of Student-t intervals motivated by the normal assumption, the variability of \(\sigmahat\) is not typically incorporated into standard error calculations.