Test statistics without normality
Goals
- Leave the normal assumption behind
- Derive limiting distributions of \(\hat{\boldsymbol{\beta}}\) using the CLT under homoskedasticity
- Derive limiting distributions of \(\hat{\boldsymbol{\beta}}\) using the CLT under heteroskedasticity
- Derive an assumption-free version of what OLS is estimating
Leaving the normal assumption
Up to now, we’ve been assuming that \(\varepsilon_n \sim \mathcal{N}\left(0, \sigma^2\right)\). Under these assumptions, we were able to derive closed-form, finite-sample distributions for \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\). Unfortunately, the normal assumption is often unreasonable in practice. So today we use asymptotics to derive confidence intervals for \(\hat{\boldsymbol{\beta}}\) for large \(N\) under the homoskedastic, heteroskedastic, and machine learning assumptions.
To do so, we will need one additional assumption:
\[ \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\rightarrow \boldsymbol{M}_{\boldsymbol{X}} \textrm{ as }N \rightarrow \infty \textrm{ where }\boldsymbol{M}_{\boldsymbol{X}}\textrm{ is invertible.} \]
This may happen for deterministic \(\boldsymbol{x}_n\), in which case this is just an assumption about a deterministic matrix limit, or it may occur for random and IID \(\boldsymbol{x}_n\), in which case \(\boldsymbol{M}_{\boldsymbol{X}}= \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right]\) by the LLN.
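For intuition, here is a minimal numpy sketch of the IID case. The Gaussian design and its second-moment matrix below are purely hypothetical; the point is only that \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\) settles down to \(\mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right]\) as \(N\) grows.

```python
import numpy as np

# Hypothetical IID design: mean-zero Gaussian x_n, so E[x_n x_n^T] equals
# the covariance matrix M_true used to generate the rows.
rng = np.random.default_rng(0)
P = 2
M_true = np.array([[1.0, 0.3], [0.3, 1.0]])

N = 200_000
X = rng.multivariate_normal(np.zeros(P), M_true, size=N)
M_hat = X.T @ X / N  # (1/N) X^T X, the sample second-moment matrix

# Entrywise gap between the sample and population second moments;
# it shrinks at the usual 1/sqrt(N) rate.
err = np.abs(M_hat - M_true).max()
```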
The homoskedastic assumption
Assume only that there exists a \({\boldsymbol{\beta}^{*}}\) so that \(\varepsilon_n := y_n - {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n\) is mean zero, independent of \(\boldsymbol{x}_n\), and has constant variance \(\sigma^2\). We still have that
\[ \hat{\boldsymbol{\beta}}= {\boldsymbol{\beta}^{*}}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \]
That means that \(\mathbb{E}\left[\hat{\boldsymbol{\beta}}| \boldsymbol{X}\right] = {\boldsymbol{\beta}^{*}}\), so our estimator is still unbiased given \(\boldsymbol{X}\). But the term \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\) is no longer normal. It also remains the case that
\[ \begin{aligned} \mathrm{Cov}\left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}| \boldsymbol{X}\right) ={}& (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right] \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\={}& \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\={}& \frac{1}{N} \sigma^2 \left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \\\rightarrow{}& \boldsymbol{0}. \end{aligned} \]
This means that \(\hat{\boldsymbol{\beta}}\rightarrow {\boldsymbol{\beta}^{*}}\) — in other words, \(\hat{\boldsymbol{\beta}}\) is consistent under the homoskedastic assumptions.
The LLN for consistency
Another way to show the same thing is to apply the LLN directly to \(\hat{\boldsymbol{\beta}}\):
\[ \begin{aligned} \hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}={}& (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\\ ={}& \left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\\ ={}& \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n. \end{aligned} \]
Now, \((\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \rightarrow \boldsymbol{M}_{\boldsymbol{X}}^{-1}\) by the LLN and the continuous mapping theorem. Furthermore,
\[ \begin{aligned} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n \rightarrow{}& \mathbb{E}\left[\boldsymbol{x}_n \varepsilon_n\right] \\={}& \mathbb{E}\left[\boldsymbol{x}_n \mathbb{E}\left[\varepsilon_n | \boldsymbol{x}_n\right]\right] \\={}& \mathbb{E}\left[\boldsymbol{x}_n \cdot 0\right] \\={}& \boldsymbol{0}, \end{aligned} \]
simply using the fact that, by assumption, \(\mathbb{E}\left[\varepsilon_n | \boldsymbol{x}_n\right] = 0\). Together, these two limits give \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\rightarrow \boldsymbol{0}\), confirming consistency.
The CLT for normality
Although we don’t know the finite-sample distribution of \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\), the LLN points to a way to approximate the asymptotic distribution of \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\) via the CLT. Specifically, note that the \(\boldsymbol{x}_n \varepsilon_n\) need not be identically distributed, but
\[ \mathbb{E}\left[\boldsymbol{x}_n \varepsilon_n\right] = 0 \quad\textrm{and}\quad \mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right) = \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] \sigma^2. \]
Here, \(\mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right]\) would be the same for all \(n\) if \(\boldsymbol{x}_n\) are IID, or it might be different for each \(n\) if the \(\boldsymbol{x}_n\) are fixed or non–IID. But either way, by assumption, \[ \frac{1}{N} \sum_{n=1}^N\mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right) = \sigma^2 \frac{1}{N} \sum_{n=1}^N\mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] \rightarrow \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}. \]
Therefore, by the multivariate CLT, \[ \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n \rightarrow \mathcal{N}\left(0, \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}\right). \]
Thus, by Slutsky’s theorem,
\[ \sqrt{N}(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) = \left(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n \rightarrow \boldsymbol{M}_{\boldsymbol{X}}^{-1} {z} \quad\textrm{where}\quad {z} \sim \mathcal{N}\left(0, \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}\right). \]
Now, by properties of the multivariate normal,
\[ \boldsymbol{M}_{\boldsymbol{X}}^{-1} {z} \sim \mathcal{N}\left(0, \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{M}_{\boldsymbol{X}}\boldsymbol{M}_{\boldsymbol{X}}^{-1}\right) = \mathcal{N}\left(0, \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1}\right), \]
so
\[ \sqrt{N}(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) \rightarrow \mathcal{N}\left(0, \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1}\right). \]
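We can check this limit by simulation. The following numpy sketch uses a hypothetical design and deliberately non-normal (centered exponential) errors, and compares the Monte Carlo covariance of \(\sqrt{N}(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}})\) to the predicted \(\sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1}\).

```python
import numpy as np

# Monte Carlo sketch: sqrt(N)(beta_hat - beta*) should have covariance close
# to sigma^2 M_X^{-1} even though the errors are not normal.
rng = np.random.default_rng(1)
P, N, reps = 2, 500, 2000
beta_star = np.array([1.0, -2.0])
M_X = np.eye(P)     # E[x_n x_n^T] for standard normal regressors
sigma2 = 4.0        # variance of a centered exponential with scale 2

draws = np.empty((reps, P))
for r in range(reps):
    X = rng.standard_normal((N, P))
    eps = rng.exponential(2.0, size=N) - 2.0  # mean zero, variance 4, skewed
    y = X @ beta_star + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    draws[r] = np.sqrt(N) * (beta_hat - beta_star)

cov_mc = np.cov(draws.T)              # Monte Carlo covariance across replications
target = sigma2 * np.linalg.inv(M_X)  # the CLT prediction sigma^2 M_X^{-1}
```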
Plug-in estimators for the variance
Of course, in practice, we do not observe the terms in the variance \(\sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1}\). A natural solution is to plug in their consistent estimators,
\[ \begin{aligned} \hat{\sigma}^2 \rightarrow \sigma^2 \quad\textrm{and}\quad \hat{\boldsymbol{M}}_{\boldsymbol{X}}:= \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\rightarrow \boldsymbol{M}_{\boldsymbol{X}}. \end{aligned} \]
We thus say that
\[ \hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\sim \mathcal{N}\left(0, \frac{1}{N} \hat{\sigma}^2 \left( \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}\right) \quad\textrm{approximately for large }N. \]
Recall that, under normality, we had
\[ \hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\sim \mathcal{N}\left(0, \frac{1}{N} \sigma^2 \left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}\right) \quad\textrm{exactly, under the normal assumption, for all }N. \]
We see that the CLT gives the matching distribution for large \(N\) — the difference is that the Normal distribution is justified for large \(N\) by the CLT rather than by exact normality. In this sense, the normal assumptions are not essential for approximating the sampling distribution of \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\).
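In code, the plug-in covariance and the resulting approximate confidence intervals are a few lines of numpy. This is a sketch on hypothetical simulated data; in practice \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are observed.

```python
import numpy as np

# Plug-in (homoskedastic) covariance estimate and approximate 95% intervals.
rng = np.random.default_rng(2)
N, P = 1000, 3
X = rng.standard_normal((N, P))
beta_star = np.array([0.5, 0.0, -1.0])
y = X @ beta_star + rng.standard_normal(N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - P)        # consistent estimate of sigma^2

# Estimated Cov(beta_hat) = sigma2_hat * (X^T X)^{-1}
cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_hat))              # standard errors

# Approximate 95% confidence intervals, justified by the CLT for large N
ci = np.stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se], axis=1)
```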
The heteroskedastic assumption
Under the heteroskedastic assumption, we keep \(y_n = \boldsymbol{x}_n^\intercal{\boldsymbol{\beta}^{*}}+ \varepsilon_n\), with \(\mathbb{E}\left[\varepsilon_n \vert \boldsymbol{x}_n\right] = 0\), but the variance of \(\varepsilon_n\) may depend on \(\boldsymbol{x}_n\): \(\mathrm{Var}\left(\varepsilon_n \vert \boldsymbol{x}_n\right) = \sigma_n^2 < \infty\). Importantly, the residuals are no longer independent of \(\boldsymbol{x}_n\) when \(\boldsymbol{x}_n\) is random.
However, perhaps surprisingly, we can still study the limiting distribution of \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\), which will be particularly useful for inference. Recall that the first step of deriving the limiting distribution of \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\) using the CLT was to write
\[ \begin{aligned} \sqrt{N} (\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) = \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n. \end{aligned} \]
We then applied the CLT to \(\frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n\), using the fact that \(\mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right) = \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}\) under the homoskedastic assumptions. Under heteroskedastic assumptions, we now have
\[ \mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right) = \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\varepsilon_n^2\right] = \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\mathrm{Var}\left(\varepsilon_n | \boldsymbol{x}_n\right)\right] = \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\sigma_n^2\right]. \]
Again, if \(\boldsymbol{x}_n\) is random and IID, then the preceding quantity is a single fixed matrix, the same for all \(n\); if \(\boldsymbol{x}_n\) is non–random, it may differ for each \(n\). But, in either case, by the LLN, \[ \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\varepsilon_n^2 \rightarrow \mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right), \] as long as we can apply the LLN to each entry of the corresponding matrix. Furthermore, although we won’t prove it here, the plug–in estimate is also consistent as \(N \rightarrow \infty\): \[ \hat{S} := \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\hat{\varepsilon}_n^2 \rightarrow \mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right). \]
By Slutsky’s theorem, we thus see that, under the heteroskedastic assumptions, \[ \begin{aligned} \sqrt{N} (\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) \rightarrow{}& \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{\Sigma}_{\mathrm{sand}}\right) \quad\textrm{where}\\ \boldsymbol{\Sigma}_{\mathrm{sand}}:={}& \boldsymbol{M}_{\boldsymbol{X}}^{-1} \mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right) \boldsymbol{M}_{\boldsymbol{X}}^{-1} \quad\textrm{and}\\ \hat{\boldsymbol{\Sigma}}_{\mathrm{sand}}:={}& \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \hat{S} \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \rightarrow \boldsymbol{\Sigma}_{\mathrm{sand}}. \end{aligned} \]
The matrix \(\boldsymbol{\Sigma}_{\mathrm{sand}}\) is often called the “sandwich covariance matrix” and \(\hat{\boldsymbol{\Sigma}}_{\mathrm{sand}}\) the “sandwich covariance estimator,” since the \(\boldsymbol{M}_{\boldsymbol{X}}\) term is like the “bread” and the \(\mathrm{Cov}\left(\boldsymbol{x}_n \varepsilon_n\right)\) term is like the “meat”. It’s also known as the “Huber–White covariance estimator” (after some of its first promoters), or the “heteroskedasticity-robust” covariance estimator, since it gives correct asymptotic intervals for the OLS coefficients even in the presence of heteroskedasticity.
Sandwich covariance estimators can be computed in R using the sandwich package.
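The computation is also short enough to do by hand. Here is a numpy sketch; the heteroskedastic data-generating process below (error standard deviation growing with the regressor) is hypothetical.

```python
import numpy as np

# Sketch of the sandwich estimator. In R, the sandwich package does this for you.
rng = np.random.default_rng(3)
N = 5000
x = rng.uniform(1.0, 3.0, size=N)
X = np.column_stack([np.ones(N), x])       # intercept + slope
eps = rng.standard_normal(N) * x           # sd of eps grows with x: heteroskedastic
y = X @ np.array([1.0, 2.0]) + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

M_hat = X.T @ X / N                        # the "bread": (1/N) X^T X
S_hat = (X * resid[:, None] ** 2).T @ X / N  # the "meat": (1/N) sum x_n x_n^T eps_hat_n^2
M_inv = np.linalg.inv(M_hat)
Sigma_sand = M_inv @ S_hat @ M_inv         # covariance of sqrt(N)(beta_hat - beta*)

# Standard errors for beta_hat itself use Sigma_sand / N
se_sand = np.sqrt(np.diag(Sigma_sand / N))
```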
It is often — but not necessarily — the case that sandwich covariance estimates are larger than the “standard” homoskedastic estimates. You therefore pay a price for robustness. It is also the case that sandwich covariance matrices can be more unstable for small \(N\).
Exercise: when \(P = 1\) (so \(\boldsymbol{x}_n\) is a scalar), prove that it is possible for \(\hat{\boldsymbol{\Sigma}}_{\mathrm{sand}}< \hat{\sigma}^2 \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1}\).
Checking for heteroskedasticity
For low–dimensional problems, it is common to plot \(\boldsymbol{x}_n\) versus \(\hat{\varepsilon}_n\) to look for evidence of heteroskedasticity. But when \(\boldsymbol{x}_n\) is high–dimensional, this might be difficult to visualize. In that case, what should you plot?
Note that an alternative formulation of the CLT for \(\hat{\boldsymbol{\beta}}_p\), a particular component of \(\hat{\boldsymbol{\beta}}\), is
\[ \begin{aligned} \sqrt{N}(\hat{\boldsymbol{\beta}}_p - {\boldsymbol{\beta}^{*}}_p) ={}& \sqrt{N} \boldsymbol{e}_p^\intercal\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon} \\={}& \boldsymbol{e}_p^\intercal\left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n \\\approx{}& \boldsymbol{e}_p^\intercal\boldsymbol{M}_{\boldsymbol{X}}^{-1} \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n \\={}& \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{e}_p^\intercal\boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{x}_n \varepsilon_n. \\={}& \frac{1}{\sqrt{N}} \sum_{n=1}^Nr_{n} \varepsilon_n \quad\textrm{ where } r_{n} := \boldsymbol{e}_p^\intercal\boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{x}_n \in \mathbb{R}. \end{aligned} \]
Ideally, to check for heteroskedasticity in the \(p\)–th component, we would then plot \(\varepsilon_n\) versus the scalar \(r_n\). Neither is observable, but both can be approximated with \(\hat{\varepsilon}_n\) and
\[ \hat{r}_n := \boldsymbol{e}_p^\intercal\left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{x}_n. \]
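As a sketch on a hypothetical simulated design (where the error variance depends on the \(p\)-th regressor), \(\hat{r}_n\) and \(\hat{\varepsilon}_n\) can be computed as follows, ready for a scatterplot:

```python
import numpy as np

# Compute r_hat_n for a chosen component p, paired with the fitted residuals.
rng = np.random.default_rng(4)
N, P, p = 400, 3, 1                  # check the p-th coefficient (0-indexed)
X = rng.standard_normal((N, P))
# Hypothetical DGP whose error variance depends on the p-th regressor
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(N) * (1 + X[:, p] ** 2)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
eps_hat = y - X @ beta_hat           # fitted residuals

M_hat_inv = np.linalg.inv(X.T @ X / N)
e_p = np.zeros(P)
e_p[p] = 1.0
r_hat = X @ M_hat_inv @ e_p          # r_hat_n = e_p^T (X^T X / N)^{-1} x_n

# A scatterplot of (r_hat, eps_hat) would now reveal variance changing with
# r_hat, e.g. plt.scatter(r_hat, eps_hat) with matplotlib.
```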
The machine learning assumption
If we only assume that \((y_n, \boldsymbol{x}_n)\) are IID, then there is no true \({\boldsymbol{\beta}^{*}}\) in our assumptions. What can we form hypothesis tests about?
As we have shown above, we can still define \[ {\boldsymbol{\beta}^{*}}= \lim_{N\rightarrow \infty} \hat{\boldsymbol{\beta}}= \boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{M}_{xy}, \quad\textrm{where}\quad \boldsymbol{M}_{xy} := \mathbb{E}\left[\boldsymbol{x}_n y_n\right]. \] If we then define \(\varepsilon_n = y_n - \boldsymbol{x}_n^\intercal{\boldsymbol{\beta}^{*}}\), note that, marginally, \[ \begin{aligned} \mathbb{E}\left[\boldsymbol{x}_n \varepsilon_n\right] ={}& \mathbb{E}\left[\boldsymbol{x}_n \left( y_n - \boldsymbol{x}_n^\intercal{\boldsymbol{\beta}^{*}}\right)\right] \\={}& \mathbb{E}\left[y_n \boldsymbol{x}_n - \boldsymbol{x}_n \boldsymbol{x}_n^\intercal{\boldsymbol{\beta}^{*}}\right] \\={}& \mathbb{E}\left[y_n \boldsymbol{x}_n\right] - \mathbb{E}\left[\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right] \boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{M}_{xy} \\={}& \boldsymbol{M}_{xy} - \boldsymbol{M}_{xy} = \boldsymbol{0}. \end{aligned} \] So even though we haven’t assumed a “true” model, we can still define residuals which are uncorrelated with \(\boldsymbol{x}_n\) as the difference between \(y_n\) and the limiting fit. Note that these residuals are not independent of \(\boldsymbol{x}_n\)! In fact,
\[ \mathbb{E}\left[\varepsilon_n \vert \boldsymbol{x}_n\right] = \mathbb{E}\left[y_n \vert \boldsymbol{x}_n\right] - \boldsymbol{x}_n^\intercal{\boldsymbol{\beta}^{*}}. \] Since we have not assumed anything about the relationship between \(y_n\) and \(\boldsymbol{x}_n\), there is every reason to think that this difference depends on \(\boldsymbol{x}_n\). In particular, there is no reason to think that it would be zero for all \(\boldsymbol{x}_n\), as we assumed above.
However, using only uncorrelatedness, we can write \[ \begin{aligned} \sqrt{N} \left( \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n y_n - {\boldsymbol{\beta}^{*}}\right) ={}& \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \sqrt{N} \left( \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n y_n - \hat{\boldsymbol{M}}_{\boldsymbol{X}}{\boldsymbol{\beta}^{*}}\right) \\={}& \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \sqrt{N} \left( \frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \left( y_n - \boldsymbol{x}_n^\intercal{\boldsymbol{\beta}^{*}}\right) \right) \\={}& \hat{\boldsymbol{M}}_{\boldsymbol{X}}^{-1} \frac{1}{\sqrt{N}} \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n. \end{aligned} \] It follows that we can apply the rest of the asymptotics for the heteroskedastic assumption, but with the understanding that we are testing not the “true” value of \({\boldsymbol{\beta}^{*}}\) but simply the limiting value of \(\hat{\boldsymbol{\beta}}\) if we had an infinite amount of data.
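As a concrete sketch (the sinusoidal data-generating process below is hypothetical): even with no linear “true” model anywhere in sight, \(\hat{\boldsymbol{\beta}}\) converges to the population projection \(\boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{M}_{xy}\), which has a closed form for this design.

```python
import numpy as np

# Hypothetical DGP with no linear "true" model: y = sin(2x) + noise, x ~ U(-1, 1).
rng = np.random.default_rng(6)
N = 100_000
x = rng.uniform(-1.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])
y = np.sin(2.0 * x) + 0.2 * rng.standard_normal(N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Population projection for this design: M_X = diag(1, 1/3),
# M_xy = (E[sin 2x], E[x sin 2x]) = (0, sin(2)/4 - cos(2)/2), so the
# limiting intercept is 0 and the limiting slope is 3 (sin(2)/4 - cos(2)/2).
beta_star = np.array([0.0, 3.0 * (np.sin(2.0) / 4.0 - np.cos(2.0) / 2.0)])
gap = np.abs(beta_hat - beta_star).max()
```

Here \({\boldsymbol{\beta}^{*}}\) is purely the limiting value of the OLS fit, not a parameter of any assumed model.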
Non-IID data
Finally, the sandwich covariance has another use, which is for non-IID data. This gives rise to a “grouped” covariance matrix, but the idea is in fact essentially the same as the sandwich covariance.
Suppose we think that our data are not IID, but in fact correlated within “groups,” though independent across groups. A classic example is the microcredit data, where we might imagine that the income residuals of families from the same town may be correlated with one another even after our regression adjustment, but independent of the income residuals in a different town.
We can represent this formally as follows. Let \(g=1,\ldots,G\) denote the groups, and let \(g_n\) denote the group membership of observation \(n\). For example, if observation \(100\) is from Oaxaca, we might have \(g_{100} = \textrm{Oaxaca}\). Note that I have assumed that each observation belongs to exactly one group.
How can we find the sampling distribution of \(\hat{\boldsymbol{\beta}}\) when the data looks like this? If we ignore the fact that observations are correlated within groups, in general we will compute the wrong sampling variance (typically underestimating it). The trick is to re-write our CLT term as \[ \sum_{n=1}^N\boldsymbol{x}_n \varepsilon_n = \sum_{g=1}^G \sum_{n: g_n = g} \boldsymbol{x}_n \varepsilon_n = \sum_{g=1}^G s_g, \quad\textrm{where}\quad s_g := \sum_{n: g_n = g} \boldsymbol{x}_n \varepsilon_n. \] We have simply re-arranged our sum so that the observations are grouped according to \(g\), and summed within groups before summing across groups. Now the \(s_g\) are independent of one another and mean zero, so we can apply a CLT to get \[ \frac{1}{\sqrt{G}} \sum_{g=1}^G s_g \rightarrow \mathcal{N}\left(0, \mathrm{Cov}\left(s_g\right)\right). \]
Also, note that \[ \frac{1}{G} \boldsymbol{X}^\intercal\boldsymbol{X}= \frac{1}{G} \sum_{g=1}^G \sum_{n: g_n = g} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal \] converges, by the LLN, to the expected sum of \(\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) within a group, a matrix which we can call \(\boldsymbol{M}_{\boldsymbol{X}}'\). On average, we expect \(\boldsymbol{M}_{\boldsymbol{X}}'\) to be larger than \(\boldsymbol{M}_{\boldsymbol{X}}\) by a factor of the limiting value of \(N / G\), the average number of observations per group.
Putting this together, we get that
\[ \sqrt{G} (\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) = \left(\frac{1}{G} \sum_{g=1}^G \sum_{n: g_n = g} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\right)^{-1} \frac{1}{\sqrt{G}} \sum_{g=1}^G s_g \rightarrow \mathcal{N}\left(\boldsymbol{0}, \left( \boldsymbol{M}_{\boldsymbol{X}}' \right)^{-1} \mathrm{Cov}\left(s_g\right) \left( \boldsymbol{M}_{\boldsymbol{X}}' \right)^{-1}\right). \]
This is precisely the sandwich covariance, except where the group sums are independent rather than the individual observations!
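A sketch of the grouped computation in numpy, under a hypothetical within-group correlation structure (a shared group-level noise term added to each observation’s error):

```python
import numpy as np

# Grouped ("clustered") sandwich covariance, assuming each observation
# carries a group label g_n.
rng = np.random.default_rng(5)
G, n_per = 200, 10
N = G * n_per
groups = np.repeat(np.arange(G), n_per)        # g_n for each observation

X = np.column_stack([np.ones(N), rng.standard_normal(N)])
group_effect = rng.standard_normal(G)[groups]  # shared within-group noise
eps = group_effect + rng.standard_normal(N)    # correlated within groups
y = X @ np.array([1.0, 2.0]) + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# s_g = sum_{n: g_n = g} x_n eps_hat_n, accumulated per group
P = X.shape[1]
s = np.zeros((G, P))
np.add.at(s, groups, X * resid[:, None])

M_prime_hat = X.T @ X / G                      # (1/G) X^T X, the grouped "bread"
S_hat = s.T @ s / G                            # (1/G) sum_g s_g s_g^T, the "meat"
M_inv = np.linalg.inv(M_prime_hat)
Sigma_clustered = M_inv @ S_hat @ M_inv        # cov of sqrt(G)(beta_hat - beta*)

se_clustered = np.sqrt(np.diag(Sigma_clustered / G))
```

Note that the sums within groups replace the individual terms \(\boldsymbol{x}_n \hat{\varepsilon}_n\) of the ordinary sandwich estimator, but the bread-meat-bread structure is unchanged.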