Test statistics without normality

Goals

  • Derive statistical measures under only independence assumptions
    • Derive an assumption-free version of what OLS is estimating
    • Find the sandwich variance estimator (again)

Machine learning assumption

What can we say about the OLS coefficients when we do not assume that \(y_n = \beta^\intercal\boldsymbol{x}_n + \varepsilon_n\) for any \(\beta\)? In that case we have no model to compare to. However, we can still try to find the best linear estimator in terms of squared loss.

Suppose we simply want to predict \(y\) from \(\boldsymbol{x}\) as well as possible. That is, given some \(\boldsymbol{x}_\mathrm{new}\), we want to find a best guess \(f(\boldsymbol{x}_\mathrm{new})\) for the value of \(y_\mathrm{new}\): specifically, we want to find the function that minimizes the “loss”

\[ \hat{f}(\cdot) := \underset{f}{\mathrm{argmin}}\, \mathbb{E}\left[\left( y_\mathrm{new}- f(\boldsymbol{x}_\mathrm{new}) \right)^2\right]. \]

This is a functional optimization problem over an infinite-dimensional space! You can solve this using some fancy math, but you can also prove directly that the best choice is

\[ \hat{f}(\boldsymbol{x}_\mathrm{new}) = \mathbb{E}\left[y_\mathrm{new}| \boldsymbol{x}_\mathrm{new}\right]. \]

Exercise

Prove that any other \(f(\cdot)\) results in larger loss.
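As a quick numerical illustration (not a substitute for the proof the exercise asks for), the sketch below simulates a made-up joint distribution in which \(\mathbb{E}\left[y| x\right] = \sin(x)\) by construction, and compares the average squared loss of the conditional mean against two other predictors. The distribution, noise level, and competitor functions are all illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up joint distribution in which, by construction, E[y | x] = sin(x).
N = 200_000
x = rng.uniform(-3, 3, size=N)
y = np.sin(x) + rng.normal(scale=0.5, size=N)

# Average squared loss of the conditional mean versus two arbitrary competitors.
for name, fhat in [("E[y|x] = sin(x)", np.sin(x)),
                   ("linear: 0.3 * x", 0.3 * x),
                   ("constant: 0    ", np.zeros(N))]:
    print(name, np.mean((y - fhat) ** 2))
# The conditional mean attains the smallest loss (about 0.25, the noise variance).
```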

Of course, we don’t know the functional form of \(\mathbb{E}\left[y_\mathrm{new}| \boldsymbol{x}_\mathrm{new}\right]\). But suppose we approximate it with

\[ \mathbb{E}\left[y_\mathrm{new}| \boldsymbol{x}_\mathrm{new}\right] \approx \boldsymbol{\beta}^\intercal\boldsymbol{x}_\mathrm{new}. \]

Note that we are not assuming this is true, so this is different from our “correct specification” assumption from before, namely that \(\mathbb{E}\left[y_\mathrm{new}| \boldsymbol{x}_\mathrm{new}\right] = \boldsymbol{\beta}^\intercal\boldsymbol{x}_\mathrm{new}\). Rather, we are finding the best approximation to the unknown \(\mathbb{E}\left[y_\mathrm{new}| \boldsymbol{x}_\mathrm{new}\right]\) amongst the class of functions of the form \(\boldsymbol{\beta}^\intercal\boldsymbol{x}_\mathrm{new}\).

Under this approximation, the problem becomes \[ \begin{aligned} {\beta^{*}}:={}& \underset{\beta}{\mathrm{argmin}}\, \mathbb{E}\left[\left( y_\mathrm{new}- \boldsymbol{\beta}^\intercal\boldsymbol{x}_\mathrm{new}\right)^2\right] \\ ={}& \underset{\beta}{\mathrm{argmin}}\, \left(\mathbb{E}\left[y_\mathrm{new}^2\right] - 2 \boldsymbol{\beta}^\intercal\mathbb{E}\left[y_\mathrm{new}\boldsymbol{x}_\mathrm{new}\right] + \boldsymbol{\beta}^\intercal\mathbb{E}\left[\boldsymbol{x}_\mathrm{new}\boldsymbol{x}_\mathrm{new}^\intercal\right] \boldsymbol{\beta}\right). \end{aligned} \]

Differentiating with respect to \(\beta\) and solving gives

\[ {\beta^{*}}= \mathbb{E}\left[\boldsymbol{x}_\mathrm{new}\boldsymbol{x}_\mathrm{new}^\intercal\right] ^{-1} \mathbb{E}\left[y_\mathrm{new}\boldsymbol{x}_\mathrm{new}\right]. \]
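Since \({\beta^{*}}\) depends only on the two population moments \(\mathbb{E}\left[\boldsymbol{x}_\mathrm{new}\boldsymbol{x}_\mathrm{new}^\intercal\right]\) and \(\mathbb{E}\left[y_\mathrm{new}\boldsymbol{x}_\mathrm{new}\right]\), we can approximate it to high accuracy by Monte Carlo for any joint distribution we can simulate from. Here is a minimal sketch, assuming a made-up distribution (an intercept plus a standard normal regressor, with nonlinear conditional mean \(e^z\)); every numerical choice in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up joint distribution: x = (1, z) with E[y | z] = exp(z),
# which is not linear in x, so no "correctly specified" linear model exists.
M = 1_000_000                      # huge Monte Carlo sample standing in for the population
z = rng.normal(size=M)
x = np.column_stack([np.ones(M), z])
y = np.exp(z) + rng.normal(size=M)

Exx = x.T @ x / M                  # approximates E[x x^T]
Eyx = x.T @ y / M                  # approximates E[y x]
beta_star = np.linalg.solve(Exx, Eyx)
print(beta_star)                   # coefficients of the best linear approximation
```

For this particular toy distribution one can check analytically (for example, with Stein’s lemma) that \({\beta^{*}}= (e^{1/2}, e^{1/2})^\intercal \approx (1.65, 1.65)^\intercal\), even though \(\mathbb{E}\left[y| \boldsymbol{x}\right]\) itself is not linear; the same toy distribution is reused in the sketches below.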

The formula for \({\beta^{*}}\) might look familiar. In fact, recall our first “useless” application of the LLN to \(\hat{\beta}\):

\[ \hat{\beta}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\rightarrow \mathbb{E}\left[\boldsymbol{x}_\mathrm{new}\boldsymbol{x}_\mathrm{new}^\intercal\right] ^{-1} \mathbb{E}\left[y_\mathrm{new}\boldsymbol{x}_\mathrm{new}\right] = {\beta^{*}}. \]

Finally, we have an interpretation: our \(\hat{\beta}\) converges to the minimizer of the (intractable) expected loss. Why is this? We can’t compute the expected loss, since we don’t know the joint distribution of \(\boldsymbol{x}_\mathrm{new}\) and \(y_\mathrm{new}\). However, if we have a training set, we can approximate this expectation:

\[ \hat{\beta}:= \underset{\beta}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\left( y_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n \right)^2 \approx \underset{\beta}{\mathrm{argmin}}\, \mathbb{E}\left[\left( y_\mathrm{new}- \boldsymbol{\beta}^\intercal\boldsymbol{x}_\mathrm{new}\right)^2\right] = {\beta^{*}}. \]

Here, we are not positing any true model — we are simply assuming that the draws \((y_n, \boldsymbol{x}_n)\) are IID, and using an approximate loss. Still, we can analyze \(\hat{\beta}\)’s asymptotic behavior, using the fact that it minimizes the empirical loss.
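As a concrete check of this convergence, the sketch below (reusing the illustrative data-generating process from the previous snippet) fits OLS on increasingly large training sets and watches \(\hat{\beta}\) approach \({\beta^{*}}\); the sample sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n, rng):
    """Draw n IID pairs (x_n, y_n) from the toy distribution used above."""
    z = rng.normal(size=n)
    x = np.column_stack([np.ones(n), z])
    y = np.exp(z) + rng.normal(size=n)
    return x, y

beta_star = np.array([np.exp(0.5), np.exp(0.5)])     # exact beta* for this toy distribution

for n in [100, 10_000, 1_000_000]:
    x, y = draw(n, rng)
    beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]  # minimizes the empirical squared loss
    print(n, beta_hat, np.abs(beta_hat - beta_star).max())
# The error shrinks (roughly like 1 / sqrt(n)) as the training set grows.
```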

Define \(G(\beta)\) to be the gradient of the loss function, rescaled by the harmless constant \(-\tfrac{1}{2}\) (which changes neither where the gradient vanishes nor any of the asymptotics below): \[ G(\beta) := \frac{1}{N} \sum_{n=1}^N\left( y_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n \right) \boldsymbol{x}_n = -\frac{1}{2}\, \frac{\partial}{\partial \beta} \frac{1}{N} \sum_{n=1}^N\left( y_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n \right)^2. \]

Exercise

Prove that \(G(\hat{\beta}) = \boldsymbol{0}\) and \(\mathbb{E}\left[G({\beta^{*}})\right] = \boldsymbol{0}\).
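The first claim is just the OLS normal equations, and is easy to confirm numerically; a minimal sketch, again with the illustrative toy distribution (inlined so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
z = rng.normal(size=n)
x = np.column_stack([np.ones(n), z])
y = np.exp(z) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
G_hat = (y - x @ beta_hat) @ x / n      # G(beta_hat) = (1/N) sum_n residual_n * x_n
print(G_hat)                            # numerically zero: the OLS normal equations
```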

Noting that \(G(\beta)\) is linear in \(\beta\) (so the second derivative with respect to \(\beta\) is zero), we can Taylor expand \(G(\cdot)\) around \(\hat{\beta}\), giving

\[ G({\beta^{*}}) = G(\hat{\beta}) + \frac{\partial G}{\partial \beta^\intercal} (\hat{\beta}) ({\beta^{*}}- \hat{\beta}). \]

Exercise

Derive the previous equation without using a Taylor series, starting from \[ \begin{aligned} \frac{1}{N} \sum_{n=1}^N\left( y_n - \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \right) \boldsymbol{x}_n ={}& \boldsymbol{0}\Rightarrow \\ \frac{1}{N} \sum_{n=1}^N\left( y_n - \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \right) \boldsymbol{x}_n - \frac{1}{N} \sum_{n=1}^N\left( y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n \right) \boldsymbol{x}_n ={}& - \frac{1}{N} \sum_{n=1}^N\left( y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n \right) \boldsymbol{x}_n, \end{aligned} \] and collecting terms. The Taylor series version is more general: the same analysis applies to non-linear gradients as long as \(\hat{\boldsymbol{\beta}}- {\beta^{*}}\) is small and the second derivative is bounded, so that the first-order expansion is accurate to leading order as \(N\) grows.

Using the fact that \(G(\hat{\beta}) = \boldsymbol{0}\), we can solve for \(\sqrt{N}(\hat{\beta}- {\beta^{*}})\):

\[ \sqrt{N} (\hat{\beta}- {\beta^{*}}) = - \left( \frac{\partial G}{\partial \beta^\intercal} (\hat{\beta}) \right)^{-1} \sqrt{N} G({\beta^{*}}). \]

Plugging in our particular \(G\),

\[ - \frac{\partial G}{\partial \beta^\intercal} = \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal \quad\textrm{and}\quad \sqrt{N} G({\beta^{*}}) = \frac{1}{\sqrt{N}} \sum_{n=1}^N(y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) \boldsymbol{x}_n. \]
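Because \(G\) is exactly linear in \(\beta\), combining the previous two displays gives \(\sqrt{N}(\hat{\beta}- {\beta^{*}}) = \left(\frac{1}{N}\sum_{n=1}^N \boldsymbol{x}_n\boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{\sqrt{N}}\sum_{n=1}^N (y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n)\boldsymbol{x}_n\), and this identity is not merely asymptotic: it holds exactly in finite samples (for any fixed reference vector, in fact). The sketch below confirms it numerically with the same illustrative toy distribution; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(size=n)
x = np.column_stack([np.ones(n), z])
y = np.exp(z) + rng.normal(size=n)

beta_star = np.array([np.exp(0.5), np.exp(0.5)])     # exact beta* for the toy distribution
beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]

lhs = np.sqrt(n) * (beta_hat - beta_star)
xx_mean = x.T @ x / n                                # (1/N) sum_n x_n x_n^T
score = (y - x @ beta_star) @ x / np.sqrt(n)         # sqrt(N) G(beta*)
rhs = np.linalg.solve(xx_mean, score)
print(np.allclose(lhs, rhs))                         # True: the identity is exact here
```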

Because \(\mathbb{E}\left[(y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) \boldsymbol{x}_n\right] = \boldsymbol{0}\), we can apply the CLT to get

\[ \sqrt{N} G({\beta^{*}}) \rightarrow \mathcal{N}\left(\boldsymbol{0}, \mathrm{Cov}\left((y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) \boldsymbol{x}_n\right)\right). \]
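A short simulation can illustrate this CLT: across many replications of the toy data-generating process, \(\sqrt{N} G({\beta^{*}})\) is approximately mean zero, and its Monte Carlo covariance matches a direct estimate of \(\mathrm{Cov}\left((y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) \boldsymbol{x}_n\right)\). A minimal sketch, with arbitrary replication counts and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
beta_star = np.array([np.exp(0.5), np.exp(0.5)])   # exact beta* for the toy distribution
n, reps = 2_000, 2_000

scores = np.empty((reps, 2))
for r in range(reps):
    z = rng.normal(size=n)
    x = np.column_stack([np.ones(n), z])
    y = np.exp(z) + rng.normal(size=n)
    scores[r] = (y - x @ beta_star) @ x / np.sqrt(n)   # sqrt(N) G(beta*)

print(scores.mean(axis=0))                   # approximately (0, 0)
print(np.cov(scores, rowvar=False))          # Monte Carlo covariance of sqrt(N) G(beta*)

# Direct estimate of Cov((y - beta*'x) x) from one large sample:
M = 500_000
z = rng.normal(size=M)
x = np.column_stack([np.ones(M), z])
y = np.exp(z) + rng.normal(size=M)
v = (y - x @ beta_star)[:, None] * x
print(np.cov(v, rowvar=False))               # close to the covariance above
```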

Finally, using the fact that \(\hat{\beta}\rightarrow {\beta^{*}}\), we can consistently estimate this covariance by plugging the fitted residuals into the sample second moment. Writing \(y_n - \hat{\beta}^\intercal\boldsymbol{x}_n = (y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) + ({\beta^{*}}- \hat{\beta})^\intercal\boldsymbol{x}_n\), we get

\[ \begin{aligned} \frac{1}{N} \sum_{n=1}^N(y_n - \hat{\beta}^\intercal\boldsymbol{x}_n)^2 \boldsymbol{x}_n \boldsymbol{x}_n^\intercal ={}& \frac{1}{N} \sum_{n=1}^N\left( y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n + ({\beta^{*}}- \hat{\beta})^\intercal\boldsymbol{x}_n \right)^2 \boldsymbol{x}_n \boldsymbol{x}_n^\intercal \\={}& \frac{1}{N} \sum_{n=1}^N(y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n)^2 \boldsymbol{x}_n \boldsymbol{x}_n^\intercal + \textrm{(terms that vanish as } \hat{\beta}\rightarrow {\beta^{*}}\textrm{)} \\\rightarrow{}& \mathbb{E}\left[(y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n)^2 \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] = \mathrm{Cov}\left((y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) \boldsymbol{x}_n\right), \end{aligned} \] where the final equality uses \(\mathbb{E}\left[(y_n - {\beta^{*}}^\intercal\boldsymbol{x}_n) \boldsymbol{x}_n\right] = \boldsymbol{0}\), so that the covariance is just a second moment.

If we call \(\hat{\varepsilon}_n := y_n - \hat{\beta}^\intercal\boldsymbol{x}_n\), then combining this with the estimate \(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) of \(-\frac{\partial G}{\partial \beta^\intercal}\) gives, as the estimator of the limiting covariance of \(\sqrt{N}( \hat{\beta}- {\beta^{*}})\), \[ \left( \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \left( \frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n^2 \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right) \left( \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1}, \] which is precisely the sandwich covariance.
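To tie everything together, here is a minimal simulation sketch (still using the made-up, deliberately misspecified toy distribution from the earlier snippets): it computes the sandwich estimate above from a single dataset and compares it to the Monte Carlo covariance of \(\sqrt{N}(\hat{\beta}- {\beta^{*}})\) across many simulated datasets. The sample sizes, seeds, and data-generating process are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
beta_star = np.array([np.exp(0.5), np.exp(0.5)])    # exact beta* for the toy distribution

def draw(n, rng):
    z = rng.normal(size=n)
    x = np.column_stack([np.ones(n), z])
    y = np.exp(z) + rng.normal(size=n)              # E[y | x] is nonlinear: misspecified
    return x, y

# Sandwich estimate of the limiting covariance, from a single dataset.
n = 2_000
x, y = draw(n, rng)
beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
eps_hat = y - x @ beta_hat                          # fitted residuals
bread = np.linalg.inv(x.T @ x / n)                  # ((1/N) sum x_n x_n^T)^{-1}
meat = (eps_hat[:, None] * x).T @ (eps_hat[:, None] * x) / n
print(bread @ meat @ bread)                         # sandwich covariance estimate

# Monte Carlo covariance of sqrt(N) (beta_hat - beta*) across many datasets.
reps = 2_000
errors = np.empty((reps, 2))
for r in range(reps):
    xr, yr = draw(n, rng)
    errors[r] = np.sqrt(n) * (np.linalg.lstsq(xr, yr, rcond=None)[0] - beta_star)
print(np.cov(errors, rowvar=False))                 # approximately matches the sandwich
```

The agreement is only approximate at finite sample sizes, but it improves as \(n\) and the number of replications grow.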

Note that we derived this result without any assumptions on the relationship between \(y_n\) and \(\boldsymbol{x}_n\), other than moment conditions allowing us to apply the LLN and CLT.