Test statistics without normality
Goals
- Leave the normal assumption behind
- Derive limiting distributions of \(\betahat\) using the CLT under homoskedasticity
- Derive limiting distributions of \(\betahat\) using the CLT under heteroskedasticity
- Derive an assumption-free version of what OLS is estimating
Leaving the normal assumption
Up to now, we’ve been assuming that \(\res_n \sim \gauss{0, \sigma^2}\). Under these assumptions, we were able to derive closed-form, finite-sample distributions for \(\betahat\) and \(\sigmahat^2\). Unfortunately, the normal assumption is often unreasonable in practice. So today we use asymptotics to derive confidence intervals for \(\betavhat\) for large \(N\) under the homoskedastic, heteroskedastic, and machine learning assumptions.
To do so, we will need one additional assumption:
\[ \frac{1}{N} \X^\trans \X \rightarrow \Xcov \textrm{ as }N \rightarrow \infty \textrm{ where }\Xcov\textrm{ is invertible.} \]
This may happen for deterministic \(\xv_n\), in which case this is just an assumption about a real–valued limit, or it may occur for random and IID \(\xv_n\), in which case \(\Xcov = \cov{\xv_n}\) by the LLN.
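As a quick sanity check, here is a small simulation sketch (the specific covariance matrix, sample sizes, and seed are made up for illustration) showing \(\frac{1}{N} \X^\trans \X\) approaching a fixed, invertible limit when the \(\xv_n\) are IID:

```r
# Illustrative sketch (made-up values): for IID rows with E[x_n x_n'] = Sigma_x,
# the matrix (1/N) X'X settles down to Sigma_x as N grows.
set.seed(42)
Sigma_x <- matrix(c(1.0, 0.3, 0.3, 2.0), 2, 2)       # assumed population second-moment matrix
for (N in c(100, 1000, 10000)) {
  X <- matrix(rnorm(N * 2), N, 2) %*% chol(Sigma_x)  # rows x_n satisfy E[x_n x_n'] = Sigma_x
  print(crossprod(X) / N)                            # (1/N) X'X; close to Sigma_x for large N
}
```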
The homoskedastic assumption
We still have that
\[ \betahat = \beta + (\X^\trans \X)^{-1} \X^\trans \resv. \]
That means that \(\expect{\betahat | \X} = \beta\), so our estimator is still unbiased given \(\X\). But the term \((\X^\trans \X)^{-1} \X^\trans \resv\) is no longer normal. It also remains the case that
\[ \begin{aligned} \cov{(\X^\trans \X)^{-1} \X^\trans \resv | \X} ={}& (\X^\trans \X)^{-1} \X^\trans \expect{\resv\resv^\trans | \X} \X (\X^\trans \X)^{-1} \\={}& \sigma^2 (\X^\trans \X)^{-1} \\={}& \frac{1}{N} \sigma^2 \left(\frac{1}{N} \X^\trans \X \right)^{-1} \\\rightarrow{}& \zerov. \end{aligned} \]
Since the conditional mean of \(\betahat\) is \(\beta\) and its conditional covariance shrinks to zero, it follows that \(\betahat \rightarrow \beta\); in other words, \(\betahat\) is consistent under the homoskedastic assumptions.
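To make this concrete, here is a brief simulation sketch (the data-generating process is invented for illustration) in which the residuals are uniform rather than normal, yet \(\betahat\) still concentrates around \(\beta\) as \(N\) grows:

```r
# Illustrative sketch (made-up design): consistency of OLS without normal residuals.
set.seed(1)
beta <- c(2, -1)
for (N in c(100, 1000, 100000)) {
  X <- cbind(1, rnorm(N))                        # intercept plus one regressor
  eps <- runif(N, -1, 1)                         # homoskedastic but decidedly non-normal
  y <- X %*% beta + eps
  betahat <- solve(crossprod(X), crossprod(X, y))
  print(drop(betahat))                           # approaches (2, -1) as N grows
}
```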
The LLN for consistency
Another way to show the same thing is by applying the LLN directly to \(\betahat\):
\[ \begin{aligned} \betahat - \beta ={}& (\X^\trans \X)^{-1} \X^\trans \resv \\ ={}& \left(\frac{1}{N} \X^\trans \X \right)^{-1} \frac{1}{N} \X^\trans \resv \\ ={}& \left(\meann \xv_n \xv_n^\trans \right)^{-1} \meann \xv_n \res_n. \end{aligned} \]
Now, \((\meann \xv_n \xv_n^\trans)^{-1} \rightarrow \Xcov^{-1}\) by the LLN and the continuous mapping theorem. Furthermore, by the LLN,
\[ \begin{aligned} \meann \xv_n \res_n \rightarrow{}& \expect{\xv_n \res_n} \\={}& \expect{\xv_n \expect{\res_n | \xv_n}} \\={}& \expect{\xv_n \cdot 0} \\={}& \zerov, \end{aligned} \]
simply using the fact that, by assumption, \(\expect{\res_n | \xv_n} = 0\). Combining these two limits gives \(\betahat - \beta \rightarrow \Xcov^{-1} \zerov = \zerov\), so again \(\betahat\) is consistent.
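As a numerical check of this algebra, the following sketch (an arbitrary simulated design, with the true residuals kept around) verifies that \(\betahat - \beta\) equals \(\left(\meann \xv_n \xv_n^\trans \right)^{-1} \meann \xv_n \res_n\):

```r
# Illustrative sketch: the identity betahat - beta = (mean x x')^{-1} (mean x eps).
set.seed(2)
N <- 500
beta <- c(1, 0.5)
X <- cbind(1, rnorm(N))
eps <- rnorm(N)
y <- X %*% beta + eps
betahat <- solve(crossprod(X), crossprod(X, y))
lhs <- drop(betahat) - beta
rhs <- drop(solve(crossprod(X) / N, colMeans(X * eps)))
all.equal(lhs, rhs)   # TRUE up to floating-point error
```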
The CLT for normality
Although we don’t know the finite-sample distribution of \(\betahat - \beta\), the LLN points to a way of approximating the distribution of \(\betahat - \beta\) for large \(N\) via the CLT. Specifically, note that the \(\xv_n \res_n\) may not be IID, but
\[ \expect{\xv_n \res_n} = 0 \quad\textrm{and}\quad \cov{\xv_n \res_n} = \expect{\xv_n \xv_n^\trans} \sigma^2. \]
Here, \(\expect{\xv_n \xv_n^\trans}\) would be the same for all \(n\) if \(\xv_n\) are IID, or it might be different for each \(n\) if the \(\xv_n\) are fixed or non–IID. But either way, by assumption, \[ \meann \cov{\xv_n \res_n} = \sigma^2 \meann \expect{\xv_n \xv_n^\trans} \rightarrow \sigma^2 \Xcov. \]
Therefore, by the multivariate CLT, \[ \frac{1}{\sqrt{N}} \sumn \xv_n \res_n \rightarrow \gauss{0, \sigma^2 \Xcov}. \]
Thus, by the continuous mapping theorem,
\[ \sqrt{N}(\betahat - \beta) = \left(\meann \xv_n \xv_n^\trans \right)^{-1} \frac{1}{\sqrt{N}} \sumn \xv_n \res_n \rightarrow \Xcov^{-1} \RV{z} \quad\textrm{where}\quad \RV{z} \sim \gauss{0, \sigma^2 \Xcov}. \]
Now, by properties of the multivariate normal,
\[ \Xcov^{-1} \RV{z} \sim \gauss{0, \sigma^2 \Xcov^{-1} \Xcov \Xcov^{-1}} = \gauss{0, \sigma^2 \Xcov^{-1}}, \]
so
\[ \sqrt{N}(\betahat - \beta) \rightarrow \gauss{0, \sigma^2 \Xcov^{-1}}. \]
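One way to see the CLT at work is to repeat the regression many times on simulated data with skewed residuals and look at the draws of \(\sqrt{N}(\betahat_p - \beta_p)\). The sketch below uses an invented design in which \(\Xcov\) is the identity and \(\sigma^2 = 1\), so the slope’s limiting variance should be about 1:

```r
# Illustrative sketch: sampling distribution of sqrt(N) * (slope estimate - true slope).
set.seed(3)
N <- 500
reps <- 2000
beta <- c(1, -2)
draws <- replicate(reps, {
  x <- rnorm(N)                        # with the intercept, Xcov = diag(1, 1)
  eps <- rexp(N) - 1                   # mean 0, variance 1, skewed (non-normal)
  y <- beta[1] + beta[2] * x + eps
  fit <- lm(y ~ x)
  sqrt(N) * (coef(fit)[2] - beta[2])
})
c(mean(draws), var(draws))             # roughly 0 and sigma^2 * (Xcov^{-1})[2, 2] = 1
qqnorm(draws)                          # nearly a straight line: approximately normal
```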
Plug-in estimators for the variance
Of course, in practice, we do not observe the terms in the variance \(\sigma^2 \Xcov^{-1}\). A natural solution is to plug in their consistent estimators,
\[ \begin{aligned} \sigmahat^2 \rightarrow \sigma^2 \quad\textrm{and}\quad \Xcovhat := \frac{1}{N} \X^\trans \X \rightarrow \Xcov. \end{aligned} \]
We thus say that
\[ \betahat - \beta \sim \gauss{0, \frac{1}{N} \sigmahat^2 \left( \frac{1}{N} \X^\trans \X \right)^{-1}} \quad\textrm{approximately for large }N. \]
Recall that, under normality, we had
\[ \betahat - \beta \sim \gauss{0, \frac{1}{N} \sigma^2 \left(\frac{1}{N} \X^\trans \X\right)^{-1}} \quad\textrm{exactly, under the normal assumption, for all }N. \]
We see that the CLT gives the matching distribution for large \(N\) — the difference is that the Normal distribution is justified for large \(N\) by the CLT rather than by exact normality. In this sense, the normal assumptions are not essential for approximating the sampling distribution of \(\betahat - \beta\).
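As a small check in R (a sketch with hypothetical simulated data), the plug-in covariance \(\sigmahat^2 (\X^\trans \X)^{-1}\), with \(\sigmahat^2\) taken to be the usual degrees-of-freedom-corrected estimate, is exactly the matrix that `vcov()` reports for an `lm` fit, so the standard errors printed by `lm` are justified for large \(N\) by the CLT alone:

```r
# Illustrative sketch: the homoskedastic plug-in covariance matches vcov() from lm().
set.seed(4)
N <- 200
x <- runif(N)
y <- 1 + 2 * x + (rchisq(N, df = 3) - 3)                  # mean-zero but non-normal errors
fit <- lm(y ~ x)
X <- model.matrix(fit)
sigmahat2 <- sum(resid(fit)^2) / (N - ncol(X))            # usual unbiased variance estimate
V_plugin <- sigmahat2 * solve(crossprod(X))               # sigmahat^2 (X'X)^{-1}
all.equal(V_plugin, vcov(fit), check.attributes = FALSE)  # TRUE
```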
The heteroskedastic assumption
Under the heteroskedastic assumption, we keep \(\y_n = \xv_n^\trans \betav + \res_n\), with \(\expect{\res_n \vert \xv_n} = 0\), but the variance of \(\res_n\) might depend on \(\xv_n\): \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\). Importantly, the residuals are no longer independent of \(\xv_n\) when \(\xv_n\) is random.
However, perhaps surprisingly, we can still study the limiting distribution of \(\betahat - \beta\), which will be particularly useful for inference. Recall that the first step of deriving the limiting distribution of \(\betahat - \beta\) using the CLT was to write
\[ \begin{aligned} \sqrt{N} (\betahat - \beta) = \Xcovhat^{-1} \cltn \xv_n \res_n. \end{aligned} \]
We then applied the CLT to \(\cltn \xv_n \res_n\), using the fact that \(\cov{\xv_n \res_n} = \sigma^2 \Xcov\) under the homoskedastic assumptions. Under heteroskedastic assumptions, we now have
\[ \cov{\xv_n \res_n} = \expect{\xv_n \xv_n^\trans \res_n^2} = \expect{\xv_n \xv_n^\trans \var{\res_n | \xv_n}} = \expect{\xv_n \xv_n^\trans \sigma_n^2}. \]
Again, if \(\xv_n\) is random and IID, then the preceding quantity is a single fixed matrix, while if \(\xv_n\) is non–random, then it may be different for each \(n\). But, in either case, by the LLN, \[ \meann \xv_n \xv_n^\trans \res_n^2 \rightarrow \cov{\xv_n \res_n}, \] as long as we can apply the LLN to each entry of the corresponding matrix. Furthermore, although we won’t prove it here, the plug–in estimate is also consistent as \(N \rightarrow \infty\): \[ \hat{S} := \meann \xv_n \xv_n^\trans \reshat_n^2 \rightarrow \cov{\xv_n \res_n}. \]
By the continuous mapping theorem, we thus see that, under the heteroskedastic assumptions, \[ \begin{aligned} \sqrt{N} (\betahat - \beta) \rightarrow{}& \gauss{\zerov, \Covsand} \quad\textrm{where}\\ \Covsand :={}& \Xcov^{-1} \cov{\xv_n \res_n} \Xcov^{-1} \quad\textrm{and}\\ \Covsandhat :={}& \Xcovhat^{-1} \hat{S} \Xcovhat^{-1} \rightarrow \Covsand. \end{aligned} \]
The matrix \(\Covsand\) is often called the “sandwich covariance matrix” and \(\Covsandhat\) the “sandwich covariance estimator,” since the \(\Xcov\) terms are like the “bread” and the \(\cov{\xv_n \res_n}\) term is like the “meat”. It’s also known as the “Huber-White covariance estimator” (after some of its first proponents), or the “heteroskedasticity-robust” covariance estimator, since it will give correct asymptotic intervals for the OLS coefficients even in the presence of heteroskedasticity.
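To make the bread-and-meat structure concrete, here is a sketch that assembles \(\Covsandhat\) by hand on made-up heteroskedastic data (the design and all variable names are invented for illustration):

```r
# Illustrative sketch: hand-rolled sandwich covariance estimate.
set.seed(5)
N <- 1000
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.5 + 2 * x)       # residual variance depends on x
fit <- lm(y ~ x)
X <- model.matrix(fit)
reshat <- resid(fit)
Xcov_hat <- crossprod(X) / N                      # the "bread": (1/N) X'X
S_hat <- crossprod(X * reshat) / N                # the "meat": (1/N) sum_n reshat_n^2 x_n x_n'
Covsand_hat <- solve(Xcov_hat) %*% S_hat %*% solve(Xcov_hat)
Covsand_hat / N                                   # approximate covariance of betahat
```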
Sandwich covariance estimators can be computed in R using the sandwich package.
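For example (a minimal sketch on hypothetical simulated data), `vcovHC()` with `type = "HC0"` computes the same matrix as the hand-rolled \(\frac{1}{N}\Covsandhat\) above, and `coeftest()` from the lmtest package uses it to produce heteroskedasticity-robust standard errors and tests:

```r
# Illustrative sketch: the sandwich package vs. the hand-rolled estimate.
library(sandwich)
library(lmtest)
set.seed(6)
N <- 500
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.5 + 2 * x)
fit <- lm(y ~ x)
X <- model.matrix(fit)
V_hand <- solve(crossprod(X)) %*% crossprod(X * resid(fit)) %*% solve(crossprod(X))
all.equal(V_hand, vcovHC(fit, type = "HC0"), check.attributes = FALSE)  # TRUE
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))   # robust standard errors and tests
```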
It is often — but not necessarily — the case that sandwich covariance estimates are larger than the “standard” homoskedastic estimates. You therefore pay a price for robustness. It is also the case that sandwich covariance matrices can be more unstable for small \(N\).
Exercise: when \(P = 1\) (so \(\xv_n\) is a scalar), prove that it is possible for \(\Covsandhat < \sigmahat^2 \Xcovhat^{-1}\).
Checking for heteroskedasticity
For low–dimensional problems, it is common to plot \(\xv_n\) versus \(\reshat_n\) to look for evidence of heteroskedasticity. But when \(\xv_n\) is high–dimensional, this might be difficult to visualize. In that case, what should you plot?
Note that an alternative formulation of the CLT for \(\betahat_p\), a particular component of \(\betavhat\), is
\[ \begin{aligned} \sqrt{N}(\betahat_p - \beta_p) ={}& \sqrt{N} \ev_p^\trans \left( \X^\trans \X \right)^{-1} \X^\trans \resv \\={}& \ev_p^\trans \left(\frac{1}{N} \X^\trans \X \right)^{-1} \cltn \xv_n \res_n \\\approx{}& \ev_p^\trans \Xcov^{-1} \cltn \xv_n \res_n \\={}& \cltn \ev_p^\trans \Xcov^{-1} \xv_n \res_n \\={}& \cltn \r_{n} \res_n \quad\textrm{ where } \r_{n} := \ev_p^\trans \Xcov^{-1} \xv_n \in \mathbb{R}. \end{aligned} \]
Ideally, to check for heteroskedasticity in the \(p\)–th component, we would then plot \(\res_n\) versus the scalar \(\r_n\). Neither is observable, but both can be approximated with \(\reshat_n\) and
\[ \hat{r}_n := \ev_p^\trans \left(\frac{1}{N} \X^\trans \X \right)^{-1} \xv_n. \]
(I’m not sure whether this is standard practice, but it seems like a good idea.)
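Here is a sketch of what that could look like in R (the simulated design, the chosen component \(p\), and all variable names are made up for illustration):

```r
# Illustrative sketch: plot reshat_n against r_hat_n = e_p' ((1/N) X'X)^{-1} x_n.
set.seed(7)
N <- 500
x1 <- runif(N)
x2 <- rnorm(N)
y <- 1 + x1 - x2 + rnorm(N, sd = 0.3 + 2 * x1)     # heteroskedastic in x1
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
p <- 2                                             # inspect the coefficient on x1
r_hat <- drop(X %*% solve(crossprod(X) / N)[, p])  # r_hat_n for each observation
plot(r_hat, resid(fit), xlab = "r_hat_n", ylab = "residual")
```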