$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Machine learning assumptions and the sandwich covariance matrix


Goals

  • Study what changes for the prediction model under the heteroskedastic and machine learning assumptions
    • The sandwich covariance matrix
    • The loss function perspective on prediction

A hierarchy of assumptions

Here is the hierarchy of assumptions we introduced earlier.

  • Gaussian assumption: Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\res_n \sim \gauss{0, \sigma^2}\), and \(\res_n\) is independent of \(\xv_n\).
  • Homoskedastic assumption: Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\expect{\res_n} = 0\), \(\var{\res_n} = \sigma^2 < \infty\), and \(\res_n\) is independent of \(\xv_n\).
  • Heteroskedastic assumption: Assume that \(\y_n = \xv_n^\trans \betav + \res_n\) for some \(\betav\), where \(\expect{\res_n | \xv_n} = 0\) and \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\).
  • Machine learning assumption: Assume that \(\y_n = f(\xv_n) + \res_n\) for some \(f\), where \(\expect{\res_n | \xv_n} = 0\) and \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\). (A small simulation sketch of all four assumptions follows this list.)
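
To make the hierarchy concrete, here is a small simulation sketch in R; the particular \(\betav\), noise scales, and \(f\) are arbitrary choices for illustration only.

    set.seed(42)
    N <- 500
    x <- cbind(1, runif(N, -2, 2))   # N x 2 design matrix with an intercept
    beta <- c(1, 2)                  # an arbitrary coefficient vector

    # Gaussian assumption: residuals are Gaussian and independent of x.
    y_gauss <- x %*% beta + rnorm(N, sd = 1)

    # Homoskedastic assumption: mean-zero, constant-variance residuals,
    # independent of x, but not necessarily Gaussian (here, centered uniform).
    y_homo <- x %*% beta + runif(N, -2, 2)

    # Heteroskedastic assumption: the conditional variance depends on x.
    y_hetero <- x %*% beta + rnorm(N, sd = 0.5 + abs(x[, 2]))

    # Machine learning assumption: E[y | x] = f(x) need not be linear in x.
    f <- function(x2) sin(2 * x2)
    y_ml <- f(x[, 2]) + rnorm(N, sd = 0.5 + abs(x[, 2]))
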
Definition

To save space, from now on I’ll write \(\Xcovhat := \meann \xv_n\xv_n^\trans = \frac{1}{N} \X^\trans \X\). In this class, we’ll typically be operating under the assumption that \(\Xcovhat \rightarrow \Xcov\) as \(N \rightarrow \infty\).

Heteroskedastic assumption

Under the heteroskedastic assumption, we keep \(\y_n = \xv_n^\trans \betav + \res_n\) with \(\expect{\res_n | \xv_n} = 0\), but the variance of \(\res_n\) may depend on \(\xv_n\): \(\var{\res_n \vert \xv_n} = \sigma_n^2 < \infty\). Importantly, the residuals are no longer assumed to be independent of \(\xv_n\). Furthermore, even if we knew \(\betav\), the distribution of \(\y_\new - \betav^\trans \xv_\new\) now depends on \(\xv_\new\). Estimating this distribution is complicated. For example, even if you assume that \(\res_\new\) is normally distributed, you would need to model and estimate the dependence \(\xv_\new \mapsto \var{\res_\new | \xv_\new}\). This is beyond the scope of this class.

However, perhaps surprisingly, we can still study the limiting distribution of \(\betahat - \beta\), which will be particularly useful for inference. Recall that the first step of deriving the limiting distribution of \(\betahat - \beta\) using the CLT was to write

\[ \begin{aligned} \sqrt{N} (\betahat - \beta) = \Xcovhat^{-1} \cltn \xv_n \res_n. \end{aligned} \]

We then applied the CLT to \(\cltn \xv_n \res_n\), using the fact that \(\cov{\xv_n \res_n} = \sigma^2 \Xcov\) under the homoskedastic assumptions. Under heteroskedastic assumptions, we now have

\[ \cov{\xv_n \res_n} = \expect{\xv_n \xv_n^\trans \res_n^2} = \expect{\xv_n \xv_n^\trans \var{\res_n | \xv_n}}, \]

which is complicated. However, it can be consistently estimated by the simple expression

\[ \meann \xv_n \xv_n^\trans \res_n^2 \rightarrow \cov{\xv_n \res_n}, \]

as long as we can apply the LLN to each entry of the corresponding matrix.
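
As a quick numerical sanity check (not a substitute for the exercise below), the following R sketch assumes a particular conditional standard deviation, \(\sigma(\x) = 0.5 + |\x|\), and compares \(\meann \xv_n \xv_n^\trans \res_n^2\) to a direct Monte Carlo estimate of \(\expect{\xv_n \xv_n^\trans \var{\res_n | \xv_n}}\).

    set.seed(1)
    N <- 2e5
    x <- cbind(1, runif(N, -2, 2))
    sd_fun <- function(x2) 0.5 + abs(x2)     # assumed conditional sd of the residual
    eps <- rnorm(N, sd = sd_fun(x[, 2]))

    # LLN estimate: (1/N) sum_n x_n x_n' eps_n^2
    meat_lln <- crossprod(eps * x) / N

    # Monte Carlo estimate of E[x x' Var(eps | x)] using the known variance function
    meat_mc <- crossprod(sd_fun(x[, 2]) * x) / N

    print(meat_lln)
    print(meat_mc)   # the two matrices should be close for large N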

Exercise

State precisely an assumption on the moments of \(\xv_n\) and \(\res_n\) that allows the application of the LLN in the previous expression.

By the continuous mapping theorem, we thus see that, under the heteroskedastic assumptions, \[ \begin{aligned} \sqrt{N} (\betahat - \beta) \rightarrow{}& \gauss{\zerov, \Covsand} \quad\textrm{where}\\ \Covsand :={}& \Xcov^{-1} \cov{\xv_n \res_n} \Xcov^{-1} \quad\textrm{and}\\ \Covsandhat :={}& \Xcovhat^{-1} \left( \meann \xv_n \xv_n^\trans \res_n^2 \right) \Xcovhat^{-1} \rightarrow \Covsand. \end{aligned} \]

The matrix \(\Covsand\) is often called the “sandwich covariance matrix” and \(\Covsandhat\) the “sandwich covariance estimator,” since the outer \(\Xcov^{-1}\) terms are like the “bread” and the \(\cov{\xv_n \res_n}\) term is like the “meat”. (In practice, the unobserved \(\res_n\) is replaced by the fitted residual \(\reshat_n := \y_n - \betavhat^\trans \xv_n\), which does not change the limit.) It’s also known as the “Huber-White covariance estimator” (after some of its first promoters), or the “heteroskedasticity-robust” covariance estimator, since it gives correct asymptotic intervals for the OLS coefficients even in the presence of heteroskedasticity.
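
As an illustration, here is a minimal R sketch (under an assumed heteroskedastic data-generating process) that assembles \(\Covsandhat\) from its “bread” and “meat” pieces using the fitted residuals, and compares it to the homoskedastic estimate \(\sigmahat^2 \Xcovhat^{-1}\).

    set.seed(2)
    N <- 1000
    X <- cbind(1, runif(N, -2, 2))
    beta <- c(1, 2)
    y <- X %*% beta + rnorm(N, sd = 0.5 + abs(X[, 2]))   # heteroskedastic residuals

    betahat <- solve(crossprod(X), crossprod(X, y))      # OLS coefficients
    ehat <- as.numeric(y - X %*% betahat)                # fitted residuals

    bread <- solve(crossprod(X) / N)                     # Xcovhat^{-1}
    meat <- crossprod(ehat * X) / N                      # (1/N) sum_n x_n x_n' ehat_n^2
    cov_sand_hat <- bread %*% meat %*% bread             # sandwich covariance estimate

    cov_homo_hat <- mean(ehat^2) * bread                 # homoskedastic estimate

    print(cov_sand_hat)
    print(cov_homo_hat)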

Sandwich covariance estimators can be computed in R using the sandwich package.
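
For example, the following sketch (assuming the sandwich package is installed) checks that vcovHC with type = "HC0" matches the manual computation above up to a factor of \(1/N\), since vcovHC estimates the covariance of \(\betahat\) itself rather than of \(\sqrt{N}(\betahat - \beta)\).

    library(sandwich)

    set.seed(3)
    N <- 1000
    x2 <- runif(N, -2, 2)
    y <- 1 + 2 * x2 + rnorm(N, sd = 0.5 + abs(x2))   # heteroskedastic residuals
    fit <- lm(y ~ x2)

    X <- model.matrix(fit)
    ehat <- residuals(fit)
    bread <- solve(crossprod(X) / N)
    meat <- crossprod(ehat * X) / N
    cov_sand_hat <- bread %*% meat %*% bread          # estimates Cov of sqrt(N) (betahat - beta)

    # vcovHC with type = "HC0" estimates Cov(betahat), i.e. cov_sand_hat / N.
    print(all.equal(unname(vcovHC(fit, type = "HC0")), unname(cov_sand_hat / N)))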

It is often — but not necessarily — the case that sandwich covariance estimates are larger than the “standard” homoskedastic estimates. You therefore pay a price for robustness. It is also the case that sandwich covariance matrices can be more unstable for small \(N\).

Exercise

When \(P = 1\) (so \(\xv_n\) is a scalar), show that it is possible to have \(\Covsandhat < \sigmahat^2 \Xcovhat^{-1}\).

Machine learning assumption

What can we say about the OLS coefficients when we do not assume that \(\y_n = \betav^\trans \xv_n + \res_n\) for any \(\betav\)? In that case, there is no true coefficient to compare \(\betahat\) to. However, we can still try to find the best linear predictor in terms of expected squared loss.

Suppose we simply want to make good predictions. That is, given some \(\xv_\new\), we want to find a best guess \(f(\xv_\new)\) for the value of \(\y_\new\): specifically, we want to find the function that minimizes the expected squared “loss”

\[ \hat{f}(\cdot) := \argmin{f} \expect{\left( \y_\new - f(\xv_\new) \right)^2}. \]

This is a functional optimization problem over an infinite dimensional space! You can solve this using some fancy math, but you can also prove directly that the best choice is

\[ \hat{f}(\xv_\new) = \expect{\y_\new | \xv_\new}. \]
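
As a numerical illustration (not a proof; that is the exercise below), here is a quick Monte Carlo sketch with an assumed data-generating process in which \(\expect{\y | \x} = \sin(2\x)\); the conditional mean achieves a smaller average squared loss than two other predictors.

    set.seed(4)
    N <- 1e6
    x <- runif(N, -2, 2)
    y <- sin(2 * x) + rnorm(N, sd = 0.5)   # here E[y | x] = sin(2 x)

    print(mean((y - sin(2 * x))^2))   # loss of the conditional mean (about 0.25)
    print(mean((y - 2 * x)^2))        # loss of an arbitrary linear predictor (larger)
    print(mean((y - mean(y))^2))      # loss of the best constant predictor (larger)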

Exercise

Prove that any other \(f(\cdot)\) results in larger loss.

Of course, we don’t know the functional form of \(\expect{\y_\new | \xv_\new}\). But suppose we approximate it with

\[ \expect{\y_\new | \xv_\new} \approx \betav^\trans \xv_\new. \]

Note that we are not assuming this is true, so this is different from our “correct specification” assumption before that \(\expect{\y_\new | \xv_\new} = \betav^\trans \xv_\new\). Rather, we are finding the best approximation to the unknown \(\expect{\y_\new | \xv_\new}\) amongst the class of functions of the form \(\betav^\trans \xv_\new\).

Under this approximation, the problem becomes \[ \begin{aligned} \betastar :={}& \argmin{\beta} \expect{\left( \y_\new - \betav^\trans\xv_\new \right)^2} \\ ={}& \argmin{\beta} \left(\expect{\y_\new^2} - 2 \betav^\trans \expect{\y_\new \xv_\new} + \betav^\trans \expect{\xv_\new \xv_\new^\trans} \betav \right). \end{aligned} \]

Differentiating with respect to \(\beta\) and solving gives

\[ \betastar = \expect{\xv_\new \xv_\new^\trans} ^{-1} \expect{\y_\new \xv_\new}. \]
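
In more detail, setting the gradient of the quadratic objective to zero gives the first-order condition

\[ \frac{\partial}{\partial \betav} \left( \expect{\y_\new^2} - 2 \betav^\trans \expect{\y_\new \xv_\new} + \betav^\trans \expect{\xv_\new \xv_\new^\trans} \betav \right) = -2 \expect{\y_\new \xv_\new} + 2 \expect{\xv_\new \xv_\new^\trans} \betav = \zerov, \]

whose solution is \(\betastar\) above, provided \(\expect{\xv_\new \xv_\new^\trans}\) is invertible.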

This might look familiar. In fact, recall our first “useless” application of the LLN to \(\betahat\):

\[ \betahat = (\X^\trans \X)^{-1} \X^\trans \Y \rightarrow \expect{\xv_\new \xv_\new^\trans} ^{-1} \expect{\y_\new \xv_\new} = \betastar. \]

Finally, we have an interpretation — our \(\betahat\) converges to the minimizer of the (intractable) expected loss. Why is this? We can’t compute the expected loss, since we don’t know the joint distribution of \(\xv_\new\) and \(\y_\new\). However, if we have a training set, we can approximate this expectation:

\[ \betahat := \argmin{\beta} \meann \left( \y_n - \betav^\trans\xv_n \right)^2 \approx \argmin{\beta} \expect{\left( \y_\new - \betav^\trans\xv_\new \right)^2} = \betastar. \]

Here, we are not positing any true model — we are simply assuming that the draws \((\y_n, \xv_n)\) are IID, and using an approximate loss. Still, we can analyze \(\betahat\)’s asymptotic behavior, using the fact that it minimizes the empirical loss.
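
The following R sketch illustrates this under an assumed nonlinear truth, \(\expect{\y | \x} = \sin(2\x)\): the OLS coefficients computed from larger and larger training sets approach a Monte Carlo approximation of \(\betastar = \expect{\xv \xv^\trans}^{-1} \expect{\y \xv}\), even though no linear model is correct.

    set.seed(5)

    # OLS fit to a dataset of size N drawn from a nonlinear truth.
    sim_betahat <- function(N) {
      x <- runif(N, -2, 2)
      y <- sin(2 * x) + rnorm(N, sd = 0.5)
      X <- cbind(1, x)
      solve(crossprod(X), crossprod(X, y))
    }

    # Monte Carlo approximation of betastar = E[x x']^{-1} E[y x].
    M <- 1e6
    x <- runif(M, -2, 2)
    y <- sin(2 * x) + rnorm(M, sd = 0.5)
    X <- cbind(1, x)
    beta_star <- solve(crossprod(X) / M, crossprod(X, y) / M)

    print(t(sim_betahat(100)))      # noisy
    print(t(sim_betahat(100000)))   # close to beta_star
    print(t(beta_star))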

Define \(G(\beta)\) to be \(-\frac{1}{2}\) times the gradient of the loss function (the constant is just for convenience and does not affect where the gradient vanishes): \[ G(\beta) := -\frac{1}{2} \frac{\partial}{\partial \beta} \meann \left( \y_n - \betav^\trans\xv_n \right)^2 = \meann \left( \y_n - \betav^\trans\xv_n \right) \xv_n. \]

Exercise

Prove that \(G(\betahat) = \zerov\) and \(\expect{G(\betastar)} = \zerov\).

Noting that \(G(\beta)\) is linear in \(\beta\) (so its second derivative with respect to \(\beta\) is zero, and the expansion below is exact), we can Taylor expand \(G(\cdot)\) around \(\betahat\), giving

\[ G(\betastar) = G(\betahat) + \frac{\partial G}{\partial \beta^\trans} (\betahat) (\betastar - \betahat). \]

Exercise

Derive the previous equation without using a Taylor series, starting from \[ \begin{aligned} \meann \left( \y_n - \betavhat^\trans\xv_n \right) \xv_n ={}& \zerov \Rightarrow \\ \meann \left( \y_n - \betavhat^\trans\xv_n \right) \xv_n - \meann \left( \y_n - \betastar^\trans\xv_n \right) \xv_n ={}& - \meann \left( \y_n - \betastar^\trans\xv_n \right) \xv_n, \end{aligned} \] and collecting terms. The Taylor series version is more general, since the analysis here can also be applied to non-linear gradients as long as \(\betavhat - \betastar\) is small and the second derivative is bounded, so that the first-order series expansion becomes accurate to leading order as \(N\) grows.

Using the fact that \(G(\betahat) = \zerov\), we can solve

\[ \sqrt{N} (\betahat - \betastar) = - \left( \frac{\partial G}{\partial \beta^\trans} (\betahat) \right)^{-1} \sqrt{N} G(\betastar). \]

Plugging in our particular loss gradient,

\[ - \frac{\partial G}{\partial \beta^\trans} = \meann \xv_n \xv_n^\trans \quad\textrm{and}\quad \sqrt{N} G(\betastar) = \cltn (\y_n - \betastar^\trans \xv_n) \xv_n. \]

Because \(\expect{(\y_n - \betastar^\trans \xv_n) \xv_n} = \zerov\), we can apply the CLT to get

\[ \sqrt{N} G(\betastar) \rightarrow \gauss{\zerov, \cov{(\y_n - \betastar^\trans \xv_n) \xv_n}}. \]
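
Putting the pieces together (and using \(\Xcovhat \rightarrow \Xcov\)), we obtain the analogue of the sandwich result without assuming any linear model:

\[ \begin{aligned} \sqrt{N} (\betahat - \betastar) ={}& \left( \meann \xv_n \xv_n^\trans \right)^{-1} \cltn (\y_n - \betastar^\trans \xv_n) \xv_n \\\rightarrow{}& \gauss{\zerov, \Xcov^{-1} \cov{(\y_n - \betastar^\trans \xv_n) \xv_n} \Xcov^{-1}}. \end{aligned} \]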

Finally, using the fact that \(\betahat \rightarrow \betastar\), we find that we can consistently estimate this covariance by plugging the fitted residuals into the “meat”:

\[ \begin{aligned} \meann (\y_n - \betahat^\trans \xv_n)^2 \xv_n \xv_n^\trans ={}& \meann \left( \y_n - \betastar^\trans \xv_n + (\betastar - \betahat)^\trans \xv_n \right)^2 \xv_n \xv_n^\trans \\={}& \meann (\y_n - \betastar^\trans \xv_n)^2 \xv_n \xv_n^\trans + 2 \meann (\y_n - \betastar^\trans \xv_n) \left( (\betastar - \betahat)^\trans \xv_n \right) \xv_n \xv_n^\trans + \meann \left( (\betastar - \betahat)^\trans \xv_n \right)^2 \xv_n \xv_n^\trans \\\rightarrow{}& \cov{(\y_n - \betastar^\trans \xv_n) \xv_n}, \end{aligned} \]

since the last two terms vanish in the limit as \(\betahat \rightarrow \betastar\), given moment conditions allowing the LLN.

If we call \(\reshat_n := \y_n - \betahat^\trans \xv_n\), this says that \(\meann \reshat_n^2 \xv_n \xv_n^\trans\) consistently estimates the “meat”, so the resulting estimator of the limiting covariance of \(\sqrt{N}(\betahat - \betastar)\) is precisely the sandwich covariance estimator.

Note that we derived this result without any assumptions about the relationship between \(\y_n\) and \(\xv_n\), other than IID sampling and moment conditions allowing us to apply the LLN and CLT.
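
As a closing illustration, the R sketch below (under an assumed nonlinear, heteroskedastic data-generating process) compares the sandwich standard error for the slope with the empirical standard deviation of the slope estimate across many simulated datasets; the two should roughly agree, while the homoskedastic standard error need not.

    set.seed(6)
    N <- 500
    n_sims <- 2000

    one_fit <- function() {
      x <- runif(N, -2, 2)
      y <- sin(2 * x) + rnorm(N, sd = 0.5 + abs(x))   # nonlinear mean, heteroskedastic noise
      X <- cbind(1, x)
      betahat <- solve(crossprod(X), crossprod(X, y))
      ehat <- as.numeric(y - X %*% betahat)
      bread <- solve(crossprod(X) / N)
      se_sand <- sqrt(diag(bread %*% (crossprod(ehat * X) / N) %*% bread) / N)
      se_homo <- sqrt(diag(mean(ehat^2) * bread) / N)
      c(slope = betahat[2], se_sand = unname(se_sand[2]), se_homo = unname(se_homo[2]))
    }

    sims <- replicate(n_sims, one_fit())
    print(sd(sims["slope", ]))        # empirical sd of the slope across simulations
    print(mean(sims["se_sand", ]))    # average sandwich standard error (should be close)
    print(mean(sims["se_homo", ]))    # average homoskedastic standard error (need not be)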