Test statistics without normality
Goals
- Leave the normal assumption behind
- Derive limiting distributions of \(\betahat\) using the CLT under homoskedasticity
- Derive limiting distributions of \(\betahat\) using the CLT under heteroskedasticity
- Derive an assumption-free version of what OLS is estimating
Leaving the normal assumption
Up to now, we’ve been assuming that \(\res_n \sim \gauss{0, \sigma^2}\). Under these assumptions, we were able to derive closed-form, finite-sample distributions for \(\betahat\) and \(\sigmahat^2\). Unfortunately, the normal assumption is often unreasonable in practice. So today we use asymptotics to derive confidence intervals for \(\betavhat\) for large \(N\) under the homoskedastic, heteroskedastic, and machine learning assumptions.
To do so, we will need one additional assumption:
\[ \frac{1}{N} \X^\trans \X \rightarrow \Xcov \textrm{ as }N \rightarrow \infty \textrm{ where }\Xcov\textrm{ is invertible.} \]
This may happen for deterministic \(\xv_n\), in which case this is just an assumption about a real–valued limit, or it may occur for random and IID \(\xv_n\), in which case \(\Xcov = \cov{\xv_n}\) by the LLN.
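As a quick sanity check, here is a small simulation sketch (the specific covariance matrix, sample sizes, and seed are made up for illustration) showing \(\frac{1}{N} \X^\trans \X\) approaching a fixed, invertible limit when the \(\xv_n\) are IID:

```r
# Illustrative sketch (made-up values): for IID rows with E[x_n x_n'] = Sigma_x,
# the matrix (1/N) X'X settles down to Sigma_x as N grows.
set.seed(42)
Sigma_x <- matrix(c(1.0, 0.3, 0.3, 2.0), 2, 2)       # assumed population second-moment matrix
for (N in c(100, 1000, 10000)) {
  X <- matrix(rnorm(N * 2), N, 2) %*% chol(Sigma_x)  # rows x_n satisfy E[x_n x_n'] = Sigma_x
  print(crossprod(X) / N)                            # (1/N) X'X; close to Sigma_x for large N
}
```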
The homoskedastic assumption
We still have that
\[ \betahat = \beta + (\X^\trans \X)^{-1} \X^\trans \resv. \]
That means that \(\expect{\betahat | \X} = \beta\), so our estimator is still unbiased given \(\X\). But the term \((\X^\trans \X)^{-1} \X^\trans \resv\) is no longer normal. It also remains the case that
\[ \begin{aligned} \cov{(\X^\trans \X)^{-1} \X^\trans \resv | \X} ={}& (\X^\trans \X)^{-1} \X^\trans \expect{\resv\resv^\trans | \X} \X (\X^\trans \X)^{-1} \\={}& \sigma^2 (\X^\trans \X)^{-1} \\={}& \frac{1}{N} \sigma^2 \left(\frac{1}{N} \X^\trans \X \right)^{-1} \\\rightarrow{}& \zerov. \end{aligned} \]
Since the conditional mean of \(\betahat\) is \(\beta\) and its conditional covariance shrinks to zero, it follows that \(\betahat \rightarrow \beta\); in other words, \(\betahat\) is consistent under the homoskedastic assumptions.
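To make this concrete, here is a brief simulation sketch (the data-generating process is invented for illustration) in which the residuals are uniform rather than normal, yet \(\betahat\) still concentrates around \(\beta\) as \(N\) grows:

```r
# Illustrative sketch (made-up design): consistency of OLS without normal residuals.
set.seed(1)
beta <- c(2, -1)
for (N in c(100, 1000, 100000)) {
  X <- cbind(1, rnorm(N))                        # intercept plus one regressor
  eps <- runif(N, -1, 1)                         # homoskedastic but decidedly non-normal
  y <- X %*% beta + eps
  betahat <- solve(crossprod(X), crossprod(X, y))
  print(drop(betahat))                           # approaches (2, -1) as N grows
}
```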
The LLN for consistency
Another way to show the same thing is by applying the LLN directly to \(\betahat\):
\[ \begin{aligned} \betahat - \beta ={}& (\X^\trans \X)^{-1} \X^\trans \resv \\ ={}& \left(\frac{1}{N} \X^\trans \X \right)^{-1} \frac{1}{N} \X^\trans \resv \\ ={}& \left(\meann \xv_n \xv_n^\trans \right)^{-1} \meann \xv_n \res_n. \end{aligned} \]
Now, \((\meann \xv_n \xv_n^\trans)^{-1} \rightarrow \Xcov^{-1}\) by the LLN and the continuous mapping theorem. Furthermore, by the LLN,
\[ \begin{aligned} \meann \xv_n \res_n \rightarrow{}& \expect{\xv_n \res_n} \\={}& \expect{\xv_n \expect{\res_n | \xv_n}} \\={}& \expect{\xv_n \cdot 0} \\={}& \zerov, \end{aligned} \]
simply using the fact that, by assumption, \(\expect{\res_n | \xv_n} = 0\). Combining these two limits gives \(\betahat - \beta \rightarrow \Xcov^{-1} \zerov = \zerov\), so again \(\betahat\) is consistent.
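As a numerical check of this algebra, the following sketch (an arbitrary simulated design, with the true residuals kept around) verifies that \(\betahat - \beta\) equals \(\left(\meann \xv_n \xv_n^\trans \right)^{-1} \meann \xv_n \res_n\):

```r
# Illustrative sketch: the identity betahat - beta = (mean x x')^{-1} (mean x eps).
set.seed(2)
N <- 500
beta <- c(1, 0.5)
X <- cbind(1, rnorm(N))
eps <- rnorm(N)
y <- X %*% beta + eps
betahat <- solve(crossprod(X), crossprod(X, y))
lhs <- drop(betahat) - beta
rhs <- drop(solve(crossprod(X) / N, colMeans(X * eps)))
all.equal(lhs, rhs)   # TRUE up to floating-point error
```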
The CLT for normality
Although we don’t know the finite-sample distribution of \(\betahat - \beta\), the LLN points to a way of approximating the distribution of \(\betahat - \beta\) for large \(N\) via the CLT. Specifically, note that the \(\xv_n \res_n\) may not be IID, but
\[ \expect{\xv_n \res_n} = 0 \quad\textrm{and}\quad \cov{\xv_n \res_n} = \expect{\xv_n \xv_n^\trans} \sigma^2. \]
Here, \(\expect{\xv_n \xv_n^\trans}\) would be the same for all \(n\) if \(\xv_n\) are IID, or it might be different for each \(n\) if the \(\xv_n\) are fixed or non–IID. But either way, by assumption, \[ \meann \cov{\xv_n \res_n} = \sigma^2 \meann \expect{\xv_n \xv_n^\trans} \rightarrow \sigma^2 \Xcov. \]
Therefore, by the multivariate CLT, \[ \frac{1}{\sqrt{N}} \sumn \xv_n \res_n \rightarrow \gauss{0, \sigma^2 \Xcov}. \]
Thus, by the continuous mapping theorem,
\[ \sqrt{N}(\betahat - \beta) = \left(\meann \xv_n \xv_n^\trans \right)^{-1} \frac{1}{\sqrt{N}} \sumn \xv_n \res_n \rightarrow \Xcov^{-1} \RV{z} \quad\textrm{where}\quad \RV{z} \sim \gauss{0, \sigma^2 \Xcov}. \]
Now, by properties of the multivariate normal,
\[ \Xcov^{-1} \RV{z} \sim \gauss{0, \sigma^2 \Xcov^{-1} \Xcov \Xcov^{-1}} = \gauss{0, \sigma^2 \Xcov^{-1}}, \]
so
\[ \sqrt{N}(\betahat - \beta) \rightarrow \gauss{0, \sigma^2 \Xcov^{-1}}. \]
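One way to see the CLT at work is to repeat the regression many times on simulated data with skewed residuals and look at the draws of \(\sqrt{N}(\betahat_p - \beta_p)\). The sketch below uses an invented design in which \(\Xcov\) is the identity and \(\sigma^2 = 1\), so the slope’s limiting variance should be about 1:

```r
# Illustrative sketch: sampling distribution of sqrt(N) * (slope estimate - true slope).
set.seed(3)
N <- 500
reps <- 2000
beta <- c(1, -2)
draws <- replicate(reps, {
  x <- rnorm(N)                        # with the intercept, Xcov = diag(1, 1)
  eps <- rexp(N) - 1                   # mean 0, variance 1, skewed (non-normal)
  y <- beta[1] + beta[2] * x + eps
  fit <- lm(y ~ x)
  sqrt(N) * (coef(fit)[2] - beta[2])
})
c(mean(draws), var(draws))             # roughly 0 and sigma^2 * (Xcov^{-1})[2, 2] = 1
qqnorm(draws)                          # nearly a straight line: approximately normal
```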
Plug-in estimators for the variance
Of course, in practice, we do not observe the terms in the variance \(\sigma^2 \Xcov^{-1}\). A natural solution is to plug in their consistent estimators,
\[ \begin{aligned} \sigmahat^2 \rightarrow \sigma^2 \quad\textrm{and}\quad \Xcovhat := \frac{1}{N} \X^\trans \X \rightarrow \Xcov. \end{aligned} \]
We thus say that
\[ \betahat - \beta \sim \gauss{0, \frac{1}{N} \sigmahat^2 \left( \frac{1}{N} \X^\trans \X \right)^{-1}} \quad\textrm{approximately for large }N. \]
Recall that, under normality, we had
\[ \betahat - \beta \sim \gauss{0, \frac{1}{N} \sigma^2 \left(\frac{1}{N} \X^\trans \X\right)^{-1}} \quad\textrm{exactly, under the normal assumption, for all }N. \]
We see that the CLT gives the matching distribution for large \(N\) — the difference is that the Normal distribution is justified for large \(N\) by the CLT rather than by exact normality. In this sense, the normal assumptions are not essential for approximating the sampling distribution of \(\betahat - \beta\).
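As a small check in R (a sketch with hypothetical simulated data), the plug-in covariance \(\sigmahat^2 (\X^\trans \X)^{-1}\), with \(\sigmahat^2\) taken to be the usual degrees-of-freedom-corrected estimate, is exactly the matrix that `vcov()` reports for an `lm` fit, so the standard errors printed by `lm` are justified for large \(N\) by the CLT alone:

```r
# Illustrative sketch: the homoskedastic plug-in covariance matches vcov() from lm().
set.seed(4)
N <- 200
x <- runif(N)
y <- 1 + 2 * x + (rchisq(N, df = 3) - 3)                  # mean-zero but non-normal errors
fit <- lm(y ~ x)
X <- model.matrix(fit)
sigmahat2 <- sum(resid(fit)^2) / (N - ncol(X))            # usual unbiased variance estimate
V_plugin <- sigmahat2 * solve(crossprod(X))               # sigmahat^2 (X'X)^{-1}
all.equal(V_plugin, vcov(fit), check.attributes = FALSE)  # TRUE
```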
The heteroskedastic assumption
Under the heteroskedastic assumption, we keep \(\y_n = \xv_n^\trans \betav + \res_n\), with \(\expect{\res_n \vert \xv_n} = 0\), but the variance of \(\res_n\) might depend on \(\xv_n\): \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\). Importantly, the residuals are no longer independent of \(\xv_n\) when \(\xv_n\) is random.
However, perhaps surprisingly, we can still study the limiting distribution of \(\betahat - \beta\), which will be particularly useful for inference. Recall that the first step of deriving the limiting distribution of \(\betahat - \beta\) using the CLT was to write
\[ \begin{aligned} \sqrt{N} (\betahat - \beta) = \Xcovhat^{-1} \cltn \xv_n \res_n. \end{aligned} \]
We then applied the CLT to \(\cltn \xv_n \res_n\), using the fact that \(\cov{\xv_n \res_n} = \sigma^2 \Xcov\) under the homoskedastic assumptions. Under heteroskedastic assumptions, we now have
\[ \cov{\xv_n \res_n} = \expect{\xv_n \xv_n^\trans \res_n^2} = \expect{\xv_n \xv_n^\trans \var{\res_n | \xv_n}} = \expect{\xv_n \xv_n^\trans \sigma_n^2}. \]
Again, if \(\xv_n\) is random and IID, then the preceding quantity is a single fixed matrix, while if \(\xv_n\) is non–random, then it may be different for each \(n\). But, in either case, by the LLN, \[ \meann \xv_n \xv_n^\trans \res_n^2 \rightarrow \cov{\xv_n \res_n}, \] as long as we can apply the LLN to each entry of the corresponding matrix. Furthermore, although we won’t prove it here, the plug–in estimate is also consistent as \(N \rightarrow \infty\): \[ \hat{S} := \meann \xv_n \xv_n^\trans \reshat_n^2 \rightarrow \cov{\xv_n \res_n}. \]
By the continuous mapping theorem, we thus see that, under the heteroskedastic assumptions, \[ \begin{aligned} \sqrt{N} (\betahat - \beta) \rightarrow{}& \gauss{\zerov, \Covsand} \quad\textrm{where}\\ \Covsand :={}& \Xcov^{-1} \cov{\xv_n \res_n} \Xcov^{-1} \quad\textrm{and}\\ \Covsandhat :={}& \Xcovhat^{-1} \hat{S} \Xcovhat^{-1} \rightarrow \Covsand. \end{aligned} \]
The matrix \(\Covsand\) is often called the “sandwich covariance matrix” and \(\Covsandhat\) the “sandwich covariance estimator,” since the \(\Xcov\) terms are like the “bread” and the \(\cov{\xv_n \res_n}\) term is like the “meat”. It’s also known as the “Huber-White covariance estimator” (after some of its first proponents), or the “heteroskedasticity-robust” covariance estimator, since it will give correct asymptotic intervals for the OLS coefficients even in the presence of heteroskedasticity.
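To make the bread-and-meat structure concrete, here is a sketch that assembles \(\Covsandhat\) by hand on made-up heteroskedastic data (the design and all variable names are invented for illustration):

```r
# Illustrative sketch: hand-rolled sandwich covariance estimate.
set.seed(5)
N <- 1000
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.5 + 2 * x)       # residual variance depends on x
fit <- lm(y ~ x)
X <- model.matrix(fit)
reshat <- resid(fit)
Xcov_hat <- crossprod(X) / N                      # the "bread": (1/N) X'X
S_hat <- crossprod(X * reshat) / N                # the "meat": (1/N) sum_n reshat_n^2 x_n x_n'
Covsand_hat <- solve(Xcov_hat) %*% S_hat %*% solve(Xcov_hat)
Covsand_hat / N                                   # approximate covariance of betahat
```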
Sandwich covariance estimators can be computed in R using the sandwich package.
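For example (a minimal sketch on hypothetical simulated data), `vcovHC()` with `type = "HC0"` computes the same matrix as the hand-rolled \(\frac{1}{N}\Covsandhat\) above, and `coeftest()` from the lmtest package uses it to produce heteroskedasticity-robust standard errors and tests:

```r
# Illustrative sketch: the sandwich package vs. the hand-rolled estimate.
library(sandwich)
library(lmtest)
set.seed(6)
N <- 500
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.5 + 2 * x)
fit <- lm(y ~ x)
X <- model.matrix(fit)
V_hand <- solve(crossprod(X)) %*% crossprod(X * resid(fit)) %*% solve(crossprod(X))
all.equal(V_hand, vcovHC(fit, type = "HC0"), check.attributes = FALSE)  # TRUE
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))   # robust standard errors and tests
```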
It is often — but not necessarily — the case that sandwich covariance estimates are larger than the “standard” homoskedastic estimates. You therefore pay a price for robustness. It is also the case that sandwich covariance matrices can be more unstable for small \(N\).
Exercise: when \(P = 1\) (so \(\xv_n\) is a scalar), prove that it is possible for \(\Covsandhat < \sigmahat^2 \Xcovhat^{-1}\).
Checking for heteroskedasticity
For low–dimensional problems, it is common to plot \(\xv_n\) versus \(\reshat_n\) to look for evidence of heteroskedasticity. But when \(\xv_n\) is high–dimensional, this might be difficult to visualize. In that case, what should you plot?
Note that an alternative formulation of the CLT for \(\betahat_p\), a particular component of \(\betavhat\), is
\[ \begin{aligned} \sqrt{N}(\betahat_p - \beta_p) ={}& \sqrt{N} \ev_p^\trans \left( \X^\trans \X \right)^{-1} \X^\trans \resv \\={}& \ev_p^\trans \left(\frac{1}{N} \X^\trans \X \right)^{-1} \cltn \xv_n \res_n \\\approx{}& \ev_p^\trans \Xcov^{-1} \cltn \xv_n \res_n \\={}& \cltn \ev_p^\trans \Xcov^{-1} \xv_n \res_n \\={}& \cltn \r_{n} \res_n \quad\textrm{ where } \r_{n} := \ev_p^\trans \Xcov^{-1} \xv_n \in \mathbb{R}. \end{aligned} \]
Ideally, to check for heteroskedasticity in the \(p\)–th component, we would then plot \(\res_n\) versus the scalar \(\r_n\). Neither is observable, but both can be approximated with \(\reshat_n\) and
\[ \hat{r}_n := \ev_p^\trans \left(\frac{1}{N} \X^\trans \X \right)^{-1} \xv_n. \]
(I’m not sure whether this is standard practice, but it seems like a good idea.)
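Here is a sketch of what that could look like in R (the simulated design, the chosen component \(p\), and all variable names are made up for illustration):

```r
# Illustrative sketch: plot reshat_n against r_hat_n = e_p' ((1/N) X'X)^{-1} x_n.
set.seed(7)
N <- 500
x1 <- runif(N)
x2 <- rnorm(N)
y <- 1 + x1 - x2 + rnorm(N, sd = 0.3 + 2 * x1)     # heteroskedastic in x1
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
p <- 2                                             # inspect the coefficient on x1
r_hat <- drop(X %*% solve(crossprod(X) / N)[, p])  # r_hat_n for each observation
plot(r_hat, resid(fit), xlab = "r_hat_n", ylab = "residual")
```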