Test statistics under normality

Goals

  • Derive confidence intervals for the coefficients under the normal assumption
    • For a fixed variance (normal intervals)
    • For an unknown variance (student-t intervals)
  • Introduce the chi squared and student t distributions

Gaussian residuals

In order to understand the sampling behavior of OLS, we must impose some (tentative) stochastic assumptions on how the data are generated. We begin with the most classic assumption: fully Gaussian, homoskedastic residuals. This assumption has the disadvantage of being typically unreasonable, but the advantage of being highly tractable. It is also the assumption behind most of the standard test statistics — including those reported by lm in R — so it is important to understand if you want to read other people's statistical analyses.

Gaussian residuals assumption

Assume that the data \((y_n, \boldsymbol{x}_n)\) are IID, and assume additionally that there exists a \({\boldsymbol{\beta}^{*}}\) such that, for all \(\boldsymbol{x}_n\), \(p(y_n \vert \boldsymbol{x}_n) = \mathcal{N}\left({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n, \sigma^2\right)\).

Note that we can equivalently write the Gaussian residuals assumption as \(y_n = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n + \varepsilon_n\), where \(\varepsilon_n \sim \mathcal{N}\left(0, \sigma^2\right)\) independently of \(\boldsymbol{x}_n\).

And yet another equivalent way to write this expression is the matrix expression \[ \boldsymbol{Y}= \boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{\varepsilon}, \] where the entries of \(\boldsymbol{\varepsilon}\) are IID \(\mathcal{N}\left(0, \sigma^2\right)\) random variables. If we assume that \(\boldsymbol{X}\) is full-rank, this matrix expression allows us to write \[ \begin{aligned} \hat{\boldsymbol{\beta}}={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\\={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \left(\boldsymbol{X}^\intercal\boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\right) \\={}& {\boldsymbol{\beta}^{*}}+ \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \end{aligned} \]
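To make this concrete, here is a small R sketch (with a made-up \(N\), \(P\), \({\boldsymbol{\beta}^{*}}\), and \(\sigma\)) that simulates data satisfying the Gaussian residuals assumption and computes \(\hat{\boldsymbol{\beta}}\) both from the closed form above and with lm:

```r
set.seed(42)
N <- 500; P <- 3
beta_true <- c(1, -2, 0.5)        # a made-up beta* for illustration
sigma <- 1.5
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))   # design with an intercept column
eps <- rnorm(N, mean = 0, sd = sigma)                 # IID N(0, sigma^2) residuals
Y <- X %*% beta_true + eps

# Closed-form OLS: betahat = (X^T X)^{-1} X^T Y
betahat <- solve(t(X) %*% X, t(X) %*% Y)

# The same fit via lm ("- 1" because X already contains the intercept column)
fit <- lm(Y ~ X - 1)
cbind(betahat, coef(fit))   # the two columns should agree
```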

Let us now try to understand the distribution of \(\hat{\boldsymbol{\beta}}\) when \(\boldsymbol{X}\) is fixed.

Exercise

In ordinary language, what does it mean to assume that \(\boldsymbol{X}\) is fixed when we study the sampling behavior of \(\hat{\boldsymbol{\beta}}\)?

We can see that \(\hat{\boldsymbol{\beta}}\) is a random vector, where the randomness is entirely determined by the randomness of the normal random variable \(\boldsymbol{\varepsilon}\). Its expectation is given by

\[ \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = {\boldsymbol{\beta}^{*}}+ \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right] = {\boldsymbol{\beta}^{*}}. \] And, recalling the definition of covariance for random vectors, it follows that its covariance is given by \[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) ={}& \mathbb{E}\left[(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) (\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}})^\intercal\,\middle\vert\, \boldsymbol{X}\right] \\={}& \mathbb{E}\left[ \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\intercal\boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \,\middle\vert\, \boldsymbol{X}\right] \\={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\intercal\right] \boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}. \end{aligned} \] Now, we know a lot about \(\mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right]\). In particular, its \(n\)–th diagonal entry is \(\mathbb{E}\left[\varepsilon_n^2\right] = \sigma^2\), and its off-diagonal entries (\(n \ne m\)) are \(\mathbb{E}\left[\varepsilon_n \varepsilon_m\right] = 0\). So in fact \(\mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right] = \sigma^2 \boldsymbol{I}_N\), and \[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) ={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal \left(\sigma^2 \boldsymbol{I}_N \right) \boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \\={}& \sigma^2 \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal \boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \\={}& \sigma^2\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}. \end{aligned} \]
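As a quick sanity check of this covariance formula (a hedged sketch with made-up values, not part of the derivation), one can hold \(\boldsymbol{X}\) fixed, redraw only \(\boldsymbol{\varepsilon}\) many times, and compare the Monte Carlo covariance of \(\hat{\boldsymbol{\beta}}\) to \(\sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\):

```r
set.seed(1)
N <- 200; sigma <- 1.5
beta_true <- c(1, -2, 0.5)                   # made-up beta*
X <- cbind(1, matrix(rnorm(N * 2), N, 2))    # a fixed design matrix
n_sims <- 5000
betahat_draws <- replicate(n_sims, {
  eps_sim <- rnorm(N, sd = sigma)            # redraw only the residuals
  Y_sim <- X %*% beta_true + eps_sim
  drop(solve(t(X) %*% X, t(X) %*% Y_sim))    # closed-form OLS estimate
})
cov(t(betahat_draws))                        # Monte Carlo covariance of betahat
sigma^2 * solve(t(X) %*% X)                  # theoretical sigma^2 (X^T X)^{-1}
```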

Now, the mean and covariance of a random variable are not typically enough to fully characterize its distribution. But we additionally know that \(\hat{\boldsymbol{\beta}}\) is a linear combination of IID normal random variables, and such random variables are given the special name of “multivariate normal” random variables.

Multivariate normal random variables

Suppose that \(\boldsymbol{Z}\) is a vector of IID standard normal random variables (mean zero and variance one). We say that \(\boldsymbol{A}\) is a “multivariate normal” random variable (MVN) if, for some vector \(\boldsymbol{\mu}\) and some matrix \(\boldsymbol{S}\), \[ \boldsymbol{A}= \boldsymbol{\mu} + \boldsymbol{S}\boldsymbol{Z}. \] Several properties follow:

  • \(\mathbb{E}\left[\boldsymbol{A}\right] = \boldsymbol{\mu}\)
  • \(\mathrm{Cov}\left(\boldsymbol{A}\right) = \mathbb{E}\left[(\boldsymbol{A}- \boldsymbol{\mu}) (\boldsymbol{A}- \boldsymbol{\mu})^\intercal\right] = \boldsymbol{S}\boldsymbol{S}^\intercal\)
  • If \(\boldsymbol{v}\) is a vector, then \(\boldsymbol{v}^\intercal\boldsymbol{A}\sim \mathcal{N}\left(\boldsymbol{v}^\intercal\boldsymbol{\mu}, \boldsymbol{v}^\intercal\boldsymbol{S}\boldsymbol{S}^\intercal\boldsymbol{v}\right)\).

We write \(\boldsymbol{A}\sim \mathcal{N}\left(\boldsymbol{\mu}, \Sigma\right)\) where \(\Sigma = \boldsymbol{S}\boldsymbol{S}^\intercal\).

One can define the MVN distribution in many ways, but the above will do the trick most easily for this class.

Note that, for a given covariance \(\Sigma\), there will be many different \(\boldsymbol{S}\) that give the same distribution. In particular, \(\boldsymbol{S}\boldsymbol{U}\) gives the same \(\Sigma\) for any orthogonal matrix \(\boldsymbol{U}\), which is the same as noting that \(\boldsymbol{U}\boldsymbol{Z}\) is also an IID standard normal vector when \(\boldsymbol{U}\) is orthogonal.
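Here is a minimal sketch of sampling from this definition in R, taking \(\boldsymbol{S}\) to be a Cholesky factor of a made-up \(\Sigma\) (any \(\boldsymbol{S}\) with \(\boldsymbol{S}\boldsymbol{S}^\intercal= \Sigma\) would do, per the remark above):

```r
set.seed(2)
# Made-up mean and covariance for illustration
mu <- c(1, 2)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
S <- t(chol(Sigma))                          # chol() is upper triangular, so S S^T = Sigma
n_draws <- 10000
Z <- matrix(rnorm(2 * n_draws), nrow = 2)    # each column is an IID standard normal vector
A <- mu + S %*% Z                            # each column is one MVN draw: A = mu + S Z
rowMeans(A)                                  # should be close to mu
cov(t(A))                                    # should be close to Sigma = S S^T
```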

Since we have that \(\boldsymbol{\varepsilon}= \sigma \boldsymbol{Z}\) for IID standard normal \(\boldsymbol{Z}\), it follows from the above that \[ p(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}) = \mathcal{N}\left({\boldsymbol{\beta}^{*}}, \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\right). \]

Properties

We can see from the above that \(\hat{\boldsymbol{\beta}}\) is unbiased, even unconditionally, since

\[ \mathbb{E}\left[\hat{\boldsymbol{\beta}}\right] = \mathbb{E}\left[\mathbb{E}\left[\hat{\boldsymbol{\beta}}| \boldsymbol{X}\right]\right] = {\boldsymbol{\beta}^{*}}. \] If we additionally have that \(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\rightarrow \boldsymbol{M}_{\boldsymbol{X}}\) for some invertible \(\boldsymbol{M}_{\boldsymbol{X}}\) (either deterministically or stochastically), then \[ \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = \frac{1}{N} \sigma^2 \left(\frac{1}{N}\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \approx \frac{1}{N} \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1} \rightarrow 0. \] This means that \(\hat{\boldsymbol{\beta}}\rightarrow {\boldsymbol{\beta}^{*}}\) — it is not only unbiased, but consistent.

Pivots for fixed variance

Suppose we want to form a confidence interval for coefficient \(p\) of \(\boldsymbol{\beta}\), i.e., for \(\beta_p\). How can we use the distribution of \(\hat{\boldsymbol{\beta}}\) to get a pivot depending on \(\beta_p\)?

First, take \(\boldsymbol{e}_p\) to be the vector with \(1\) in the \(p\)–th location and \(0\) elsewhere, and note that \(\beta_p = \boldsymbol{e}_p^\intercal\boldsymbol{\beta}\). Then, by normality,

\[ \begin{aligned} \hat{\beta}_p = \boldsymbol{e}_p^\intercal\hat{\boldsymbol{\beta}}\sim& \mathcal{N}\left(\boldsymbol{e}_p^\intercal\boldsymbol{\beta}, \sigma^2 \boldsymbol{e}_p^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{e}_p \right) = \mathcal{N}\left(\beta_p, \sigma^2 v_p \right), \textrm{ where }\\ v_p =& \left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \right)_{pp} \textrm{, the p-th diagonal entry of }(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}. \end{aligned} \]

Note that \(\left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \right)_{pp} \ne \left( (\boldsymbol{X}^\intercal\boldsymbol{X})_{pp} \right)^{-1}\)! You must invert first, then take the \(p\)–th diagonal.
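A quick numerical illustration of this warning, using an arbitrary made-up design matrix:

```r
set.seed(7)
X <- cbind(1, rnorm(10), rnorm(10))   # a small made-up design matrix
XtX <- t(X) %*% X
diag(solve(XtX))     # v_p: invert first, then take the diagonal
1 / diag(XtX)        # not the same thing (unless X^T X happens to be diagonal)
```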

If we knew \(\sigma^2\) then we could form the pivot

\[ \frac{\hat{\beta}_p - \beta_p}{\sigma \sqrt{v_p}} \sim \mathcal{N}\left(0, 1\right), \]

giving the corresponding two–sided confidence interval for \(\beta_p\),

\[ \hat{\beta}_p \pm q_\alpha \sigma \sqrt{v_p}, \]

where \(-q_\alpha\) is the \(\alpha / 2\) quantile of the standard normal distribution. The problem is that we typically don’t actually know \(\sigma^2\), and so we have to estimate it. And that estimate will have some sampling variability, which we must account for.
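As a sketch, a known-\(\sigma\) interval of this form might be computed in R as follows (all of the data and constants here are made up for illustration):

```r
set.seed(3)
N <- 100; sigma <- 2                          # sigma treated as known here
beta_true <- c(1, -0.5)                       # made-up beta*
X <- cbind(1, rnorm(N))
Y <- X %*% beta_true + rnorm(N, sd = sigma)
betahat <- solve(t(X) %*% X, t(X) %*% Y)

p <- 2                                        # which coefficient to cover
v_p <- diag(solve(t(X) %*% X))[p]
alpha <- 0.05
q_alpha <- qnorm(1 - alpha / 2)               # so that -q_alpha is the alpha/2 quantile
c(betahat[p] - q_alpha * sigma * sqrt(v_p),   # known-sigma two-sided interval for beta_p
  betahat[p] + q_alpha * sigma * sqrt(v_p))
```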

Estimating the residual variance

How can we estimate \(\sigma^2\)? If we actually observed the residuals, by the LLN we could use

\[ \frac{1}{N} \sum_{n=1}^N\varepsilon_n^2 \rightarrow \sigma^2. \]

But \(\varepsilon_n = y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), and we don’t know \(\boldsymbol{\beta}\). A reasonable substitute is the sample residuals \(\hat{\varepsilon}_n = y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}\), since

\[ \begin{aligned} \hat{\varepsilon}_n^2 =& \left(y_n - \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \right)^2 \\=& \left(y_n - \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n \right)^2 \\=& \left(\varepsilon_n + (\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}_n \right)^2, \end{aligned} \]

and we know that \(\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}\rightarrow 0\) as \(N \rightarrow \infty\). It turns out we can be even more precise than this under the normal assumption, using some insights from linear algebra, and defining a new distribution.

In fact, under normality, we can show that the (scaled) sum of squared sample residuals has a special distribution known as the “chi squared” distribution (I will write \(\mathcal{\chi}^2_{K}\)). One way to define it is as follows.

Notation

Let \(z_k \sim \mathcal{N}\left(0, 1\right)\), for \(k=1,\ldots,K\), denote IID standard normal random variables. The random variable \[ s:= \sum_{k=1}^K z_k^2 \] is called a chi-squared random variable with \(K\) degrees of freedom, written \(s\sim \mathcal{\chi}^2_{K}\).

Thus, just like the multivariate normal random variable, chi squared random variables are defined as transformations of standard normal random variables.
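As a quick sanity check of the definition (a sketch, with an arbitrary choice of \(K\)), we can compare sums of squared standard normals to R’s built-in chi squared functions:

```r
set.seed(5)
K <- 5
n_draws <- 100000
z <- matrix(rnorm(K * n_draws), nrow = K)     # K standard normals per draw
s_draws <- colSums(z^2)                       # sum of K squared normals
mean(s_draws)                                 # should be close to K, the degrees of freedom
mean(rchisq(n_draws, df = K))                 # R's own chi squared draws, for comparison
```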

We show below (but won’t prove in class) that, when \(\boldsymbol{X}\) is full rank, \[ p\left(\frac{1}{\sigma^2} \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}\,\middle\vert\, \boldsymbol{X}\right) = \mathcal{\chi}^2_{N - P}. \]

As you will show in your homework, if \(s\sim \mathcal{\chi}^2_{N - P}\), then \(\mathbb{E}\left[s\right] = N - P\). It follows that we can define \[ \hat{\sigma}^2 := \frac{1}{N-P} \sum_{n=1}^N\hat{\varepsilon}_n^2 \quad\Rightarrow\quad \mathbb{E}\left[\hat{\sigma}^2 \vert \boldsymbol{X}\right] = \sigma^2. \] We get an unbiased estimate by normalizing by \(N - P\) instead of \(N\).
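A small sketch checking this normalization on simulated data; as far as I know, the residual standard error reported by summary() for an lm fit uses this same \(N - P\) normalization, so the two quantities below should agree:

```r
set.seed(11)
N <- 50; P <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
Y <- X %*% c(1, 2, -1) + rnorm(N, sd = sigma)      # made-up beta* and sigma

fit <- lm(Y ~ X - 1)
sigma2_hat <- sum(resid(fit)^2) / (N - P)   # unbiased estimator: normalize by N - P
sqrt(sigma2_hat)
summary(fit)$sigma                          # lm's reported residual standard error
```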

Furthermore, we can show that \(\hat{\boldsymbol{\varepsilon}}\) — and so \(\hat{\sigma}^2\) — is independent of \(\hat{\boldsymbol{\beta}}\). Note that

\[ \hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{\varepsilon}- \boldsymbol{X}\left( {\boldsymbol{\beta}^{*}}+ \left(\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\right) = \left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \boldsymbol{\varepsilon}. \]

It follows that \(\hat{\boldsymbol{\varepsilon}}\) is MVN and mean zero. Furthermore,

\[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\varepsilon}}\vert \boldsymbol{X}\right) ={}& \mathrm{Cov}\left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}, \left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \boldsymbol{\varepsilon}\right) \\={}& (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal \mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right] \left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \\={}& \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \\={}& \sigma^2 \left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \\={}& \boldsymbol{0}. \end{aligned} \] So it follows that not only are \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\varepsilon}}\) normal, they are independent, since for jointly normal random variables, uncorrelatedness implies independence.

Further, since \(\hat{\sigma}^2\) is simply a deterministic function of \(\hat{\boldsymbol{\varepsilon}}\), it follows that \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent!
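A hedged simulation sketch of this independence: holding \(\boldsymbol{X}\) fixed and redrawing \(\boldsymbol{\varepsilon}\) many times, the sample correlation between \(\hat{\beta}_p\) and \(\hat{\sigma}^2\) should be close to zero (the design, \({\boldsymbol{\beta}^{*}}\), and \(\sigma\) below are made up):

```r
set.seed(21)
N <- 40; P <- 2; sigma <- 1
X <- cbind(1, rnorm(N))                     # fixed design
beta_true <- c(0.5, 1)                      # made-up beta*
sims <- replicate(5000, {
  Y <- X %*% beta_true + rnorm(N, sd = sigma)
  fit <- lm(Y ~ X - 1)
  c(betahat_2 = unname(coef(fit)[2]),
    sigma2_hat = sum(resid(fit)^2) / (N - P))
})
cor(sims["betahat_2", ], sims["sigma2_hat", ])   # should be close to zero
```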

T-tests (accounting for the variance of the variance estimate)

Recall that we have shown that

\[ \hat{\sigma}^2 := \frac{1}{N - P} \sum_{n=1}^N\hat{\varepsilon}_n^2 = \frac{\sigma^2}{N-P} s \quad\textrm{where}\quad s\sim \mathcal{\chi}^2_{N-P}, \]

and that, furthermore, \(\hat{\sigma}\) and \(\hat{\boldsymbol{\beta}}\) are independent of one another.

Let’s try to use this fact to get a probability distribution for \(\hat{\beta}_p\) that takes the variability of \(\hat{\sigma}\) into account, where we substitute \(\hat{\sigma}\) in for \(\sigma\) in the pivot

\[ \frac{\hat{\beta}_p - \beta_p}{\sigma \sqrt{v_p}}. \]

Write

\[ \begin{aligned} \frac{\hat{\beta}_p - \beta_p}{\hat{\sigma}\sqrt{v_p}} ={}& \frac{\hat{\beta}_p - \beta_p}{\sqrt{v_p}\sqrt{\frac{\sigma^2}{N-P} s}} \\={}& \left(\frac{\hat{\beta}_p - \beta_p}{\sigma\sqrt{v_p}}\right) \Big/ \sqrt{\frac{s}{N-P} }. \end{aligned} \]

As we showed before, \(\frac{\hat{\beta}_p - \beta_p}{\sigma\sqrt{v_p}} \sim \mathcal{N}\left(0,1\right)\). Furthermore, it is independent of \(s\). The denominator is approximately \(1\) for large \(N - P\), but has some sampling variability. Now, all we have to do is give the ratio of a standard normal to the square root of an independent chi squared (divided by its degrees of freedom) a name: we call it the “student t-distribution.”

Definition

Let \(z\sim \mathcal{N}\left(0,1\right)\), and \(s\sim \mathcal{\chi}^2_{N-P}\), independently of one another. Then the distribution of \[ t:= \frac{z}{\sqrt{s/ (N-P)}} \]

is called a “student-t distribution with \(N-P\) degrees of freedom.” We write \(t\sim \mathrm{StudentT}_{N - P}\). As long as \(N - P > 2\), \(\mathbb{E}\left[t\right] = 0\) and \(\mathrm{Var}\left(t\right) = (N - P) / (N - P - 2)\).

You can find quantiles of the student-t distributions using the R function qt(), just as you would use qnorm() for the normal distribution.
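For example, comparing the two quantile functions:

```r
alpha <- 0.05
qnorm(1 - alpha / 2)          # standard normal quantile, about 1.96
qt(1 - alpha / 2, df = 10)    # student-t quantile with 10 degrees of freedom (larger)
qt(1 - alpha / 2, df = 1000)  # with many degrees of freedom, nearly the normal quantile
```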

Given all this, we have shown that

\[ \mathbb{P}\left(\frac{\hat{\beta}_p - \beta_p}{\hat{\sigma}\sqrt{v_p}} \le z_\alpha\right) = \mathbb{P}\left(t\le z_\alpha\right) \quad\textrm{where}\quad t\sim \mathrm{StudentT}_{N - P}. \]

Using this formula, we can find exact confidence intervals for the coefficients under normality. Specifically, if we choose \(z_\alpha\) so that \(\mathbb{P}\left(t\le -z_\alpha\right) = \alpha / 2\) for the student-T distribution rather than the normal distribution, then the confidence interval

\[ \hat{\beta}_p \pm z_\alpha \hat{\sigma}\sqrt{v_p} \]

is a valid two–sided confidence interval for \(\beta_p\), despite the fact that we used \(\hat{\sigma}\) instead of the true \(\sigma\). The only difference is that we replaced the normal quantile, \(q_\alpha\), with the (slightly larger) student-T quantile, \(z_\alpha\).
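Putting the pieces together on simulated data (a sketch; the comparison to confint() is my own check, on the assumption that confint() for lm fits uses these same student-t intervals):

```r
set.seed(8)
N <- 30; sigma <- 1
X <- cbind(1, rnorm(N))
Y <- X %*% c(2, -1) + rnorm(N, sd = sigma)   # made-up beta* and sigma
fit <- lm(Y ~ X - 1)

P <- ncol(X); p <- 2; alpha <- 0.05
betahat   <- coef(fit)
v_p       <- diag(solve(t(X) %*% X))[p]
sigma_hat <- sqrt(sum(resid(fit)^2) / (N - P))
z_alpha   <- qt(1 - alpha / 2, df = N - P)   # student-t quantile

c(betahat[p] - z_alpha * sigma_hat * sqrt(v_p),   # t-based two-sided interval for beta_p
  betahat[p] + z_alpha * sigma_hat * sqrt(v_p))
confint(fit, level = 0.95)[p, ]                   # for comparison
```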

Exercise