Test statistics under normality

Goals

  • Derive confidence intervals for the coefficients under the normal assumption
      • For a fixed variance (normal intervals)
      • For an unknown variance (student-t intervals)
  • Introduce the chi squared and student t distributions

Gaussian residuals

In order to understand the sampling behavior of OLS, we must impose some (tentative) stochastic assumptions on how the data are generated. We begin with the most classic assumption: fully Gaussian, homoskedastic residuals. This assumption has the disadvantage of being typically unreasonable, but the advantage of being highly tractable. It is also the assumption behind most of the standard test statistics — including those reported by lm in R — so it is important to understand if you want to read other people's statistical analyses.

Gaussian residuals assumption

Assume that the data \((y_n, \boldsymbol{x}_n)\) are IID, and assume additionally that there exists a \({\boldsymbol{\beta}^{*}}\) such that, for all \(\boldsymbol{x}_n\), \(p(y_n \vert \boldsymbol{x}_n) = \mathcal{N}\left({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n, \sigma^2\right)\).

Note that we can equivalently write the Gaussian residuals assumption as \(y_n = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}_n + \varepsilon_n\), where \(\varepsilon_n \sim \mathcal{N}\left(0, \sigma^2\right)\) independently of \(\boldsymbol{x}_n\).

And yet another equivalent way to write this expression is the matrix expression \[ \boldsymbol{Y}= \boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{\varepsilon}, \] where the entries of \(\boldsymbol{\varepsilon}\) are IID \(\mathcal{N}\left(0, \sigma^2\right)\) random variables. If we assume that \(\boldsymbol{X}\) is full-rank, this matrix expression allows us to write \[ \begin{aligned} \hat{\boldsymbol{\beta}}={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\\={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \left(\boldsymbol{X}^\intercal\boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\right) \\={}& {\boldsymbol{\beta}^{*}}+ \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \end{aligned} \]
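The matrix identity above is easy to check numerically. Below is a minimal sketch in Python with numpy (the course uses R; all names and constants here are illustrative): we simulate the Gaussian-residuals model, compute the OLS estimate from the normal equations, and verify \(\hat{\boldsymbol{\beta}}= {\boldsymbol{\beta}^{*}}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\).

```python
import numpy as np

# Illustrative simulation of the Gaussian-residuals model.
rng = np.random.default_rng(0)
N, P = 100, 3
sigma = 0.5
beta_star = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(N, P))           # design matrix (held fixed)
eps = rng.normal(0.0, sigma, size=N)  # IID N(0, sigma^2) residuals
Y = X @ beta_star + eps

# OLS estimate via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Check the identity beta_hat = beta_star + (X^T X)^{-1} X^T eps.
correction = np.linalg.solve(X.T @ X, X.T @ eps)
assert np.allclose(beta_hat, beta_star + correction)
```

The identity holds exactly (up to floating point), since it is pure linear algebra and does not depend on the distribution of the residuals.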

Let us now try to understand the distribution of \(\hat{\boldsymbol{\beta}}\) when \(\boldsymbol{X}\) is fixed.

Exercise

In ordinary language, what does it mean to assume that \(\boldsymbol{X}\) is fixed when we study the sampling behavior of \(\hat{\boldsymbol{\beta}}\)?

We can see that \(\hat{\boldsymbol{\beta}}\) is a random vector, where the randomness is entirely determined by the randomness of the normal random variable \(\boldsymbol{\varepsilon}\). Its expectation is given by

\[ \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = {\boldsymbol{\beta}^{*}}+ \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right] = {\boldsymbol{\beta}^{*}}. \] And, recalling the definition of covariance for random vectors, it follows that its covariance is given by \[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) ={}& \mathbb{E}\left[(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}) (\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}})^\intercal\vert \boldsymbol{X}\right] \\={}& \mathbb{E}\left[ \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\intercal\boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \vert \boldsymbol{X}\right] \\={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^\intercal\right] \boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}. \end{aligned} \] Now, we know a lot about \(\mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right]\). In particular, its \(n\)–th diagonal entry is \(\mathbb{E}\left[\varepsilon_n^2\right] = \sigma^2\), and its off-diagonal entries (\(n \ne m\)) are \(\mathbb{E}\left[\varepsilon_n \varepsilon_m\right] = 0\).
So in fact \(\mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right] = \sigma^2 \boldsymbol{I}_N\), and \[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) ={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal \left(\sigma^2 \boldsymbol{I}_N \right) \boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \\={}& \sigma^2 \left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal \boldsymbol{X}\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \\={}& \sigma^2\left( \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}. \end{aligned} \]
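This covariance formula can be checked by simulation. The sketch below (Python/numpy; the design, seed, and replication count are all illustrative choices) holds \(\boldsymbol{X}\) fixed, redraws \(\boldsymbol{\varepsilon}\) many times, and compares the empirical covariance of \(\hat{\boldsymbol{\beta}}\) with \(\sigma^2(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\).

```python
import numpy as np

# Monte Carlo sanity check of Cov(beta_hat | X) = sigma^2 (X^T X)^{-1},
# holding the design X fixed across replications (illustrative sketch).
rng = np.random.default_rng(1)
N, P, sigma, reps = 20, 2, 1.0, 50_000

X = rng.normal(size=(N, P))
XtX_inv = np.linalg.inv(X.T @ X)

# With beta_star = 0 (without loss of generality), beta_hat = (X^T X)^{-1} X^T eps.
eps = rng.normal(0.0, sigma, size=(reps, N))
beta_hats = eps @ X @ XtX_inv            # row r is beta_hat for replication r

empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv
assert np.allclose(empirical_cov, theoretical_cov, rtol=0.1, atol=2e-3)
```

Taking \({\boldsymbol{\beta}^{*}}= 0\) loses no generality here, because the sampling error \(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\) depends only on \(\boldsymbol{\varepsilon}\).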

Now, the mean and covariance of a random variable are not typically enough to fully characterize its distribution. But we additionally know that \(\hat{\boldsymbol{\beta}}\) is a linear combination of IID normal random variables, and such random variables are given the special name of “multivariate normal” random variables.

Multivariate normal random variables

Suppose that \(\boldsymbol{Z}\) is a vector of IID standard normal random variables (mean zero and variance one). We say that \(\boldsymbol{A}\) is a “multivariate normal” random variable (MVN) if, for some vector \(\boldsymbol{\mu}\) and some matrix \(\boldsymbol{S}\), \[ \boldsymbol{A}= \boldsymbol{\mu} + \boldsymbol{S}\boldsymbol{Z}. \] Several properties follow:

  • \(\mathbb{E}\left[\boldsymbol{A}\right] = \boldsymbol{\mu}\)
  • \(\mathrm{Cov}\left(\boldsymbol{A}\right) = \mathbb{E}\left[(\boldsymbol{A}- \boldsymbol{\mu}) (\boldsymbol{A}- \boldsymbol{\mu})^\intercal\right] = \boldsymbol{S}\boldsymbol{S}^\intercal\)
  • If \(\boldsymbol{v}\) is a vector, then \(\boldsymbol{v}^\intercal\boldsymbol{A}\sim \mathcal{N}\left(\boldsymbol{v}^\intercal\boldsymbol{\mu}, \boldsymbol{v}^\intercal\boldsymbol{S}\boldsymbol{S}^\intercal\boldsymbol{v}\right)\).

We write \(\boldsymbol{A}\sim \mathcal{N}\left(\boldsymbol{\mu}, \Sigma\right)\) where \(\Sigma = \boldsymbol{S}\boldsymbol{S}^\intercal\).

One can define the MVN distribution in many ways, but the above will do the trick most easily for this class.

Note that, for a given covariance \(\Sigma\), there will always be many different \(\boldsymbol{S}\) that give the same distribution. In particular, \(\boldsymbol{S}\boldsymbol{U}\) gives the same \(\Sigma\) for any orthogonal matrix \(\boldsymbol{U}\), which is equivalent to noting that \(\boldsymbol{U}\boldsymbol{Z}\) is also an IID standard normal vector when \(\boldsymbol{U}\) is orthogonal. To see that any symmetric positive semi-definite matrix \(\Sigma\) has a square root, write the eigenvalue decomposition \[ \Sigma = \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal, \] where \(\boldsymbol{U}\) has orthonormal columns and \(\Lambda\) is diagonal with non-negative entries \(\lambda_{p}\). If we take \(\Lambda^{1/2}\) to be diagonal with entries \(\lambda_{p}^{1/2}\), then the matrix \(\Sigma^{1/2} := \boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal\) is a square root of \(\Sigma\):

\[ \Sigma^{1/2} (\Sigma^{1/2})^\intercal= \boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal\boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal= \boldsymbol{U}\Lambda^{1/2} \Lambda^{1/2} \boldsymbol{U}^\intercal= \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal= \Sigma. \]
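A short Python/numpy sketch of this construction (the matrix here is arbitrary and illustrative): build a symmetric PSD matrix, form \(\Sigma^{1/2} = \boldsymbol{U}\Lambda^{1/2}\boldsymbol{U}^\intercal\) from its eigendecomposition, verify the square-root property, and then use it to generate MVN draws as \(\boldsymbol{\mu} + \boldsymbol{S}\boldsymbol{Z}\).

```python
import numpy as np

# Eigendecomposition square root: Sigma^{1/2} = U Lambda^{1/2} U^T.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T                          # symmetric positive semi-definite

lam, U = np.linalg.eigh(Sigma)           # Sigma = U diag(lam) U^T
lam = np.clip(lam, 0.0, None)            # guard tiny negative values from roundoff
sqrt_Sigma = U @ np.diag(np.sqrt(lam)) @ U.T

assert np.allclose(sqrt_Sigma @ sqrt_Sigma.T, Sigma)

# Use the square root to draw MVN samples: A = mu + S Z.
mu = np.zeros(4)
Z = rng.normal(size=(4, 1000))
samples = mu[:, None] + sqrt_Sigma @ Z   # each column ~ N(mu, Sigma)
```

Note that `np.linalg.eigh` is the right routine here because \(\Sigma\) is symmetric; it returns real eigenvalues and orthonormal eigenvectors.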

Since we have that \(\boldsymbol{\varepsilon}= \sigma \boldsymbol{Z}\) for IID standard normal \(\boldsymbol{Z}\), it follows from the above that \[ p(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}) = \mathcal{N}\left({\boldsymbol{\beta}^{*}}, \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\right). \]

Properties

We can see from the above that \(\hat{\boldsymbol{\beta}}\) is unbiased, even unconditionally, since

\[ \mathbb{E}\left[\hat{\boldsymbol{\beta}}\right] = \mathbb{E}\left[\mathbb{E}\left[\hat{\boldsymbol{\beta}}| \boldsymbol{X}\right]\right] = {\boldsymbol{\beta}^{*}}. \] If we additionally have that \(\frac{1}{N} \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\rightarrow \boldsymbol{M}_{\boldsymbol{X}}\) for some invertible \(\boldsymbol{M}_{\boldsymbol{X}}\) (either deterministically or stochastically), then \[ \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = \frac{1}{N} \sigma^2 \left(\frac{1}{N}\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \approx \frac{1}{N} \sigma^2 \boldsymbol{M}_{\boldsymbol{X}}^{-1} \rightarrow 0. \] This means that \(\hat{\boldsymbol{\beta}}\rightarrow {\boldsymbol{\beta}^{*}}\) — it is not only unbiased, but consistent.

Pivots for fixed variance

Using the above, we see that for any \(\boldsymbol{v}\), \[ p(\boldsymbol{v}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}) = \mathcal{N}\left(\boldsymbol{v}^\intercal{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{v}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{v}\right). \] This means we can form a pivotal quantity for testing the hypothesis \(\boldsymbol{v}^\intercal{\boldsymbol{\beta}^{*}}= b\), assuming we know \(\sigma\). This is sometimes called a “z-test.”

Definition

The “z-test statistic” for the null \(H_0: \boldsymbol{v}^\intercal{\boldsymbol{\beta}^{*}}= b\) is given by \[ z:= \frac{\boldsymbol{v}^\intercal\hat{\boldsymbol{\beta}}- b}{\sigma \sqrt{\boldsymbol{v}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{v}}}. \] Here, \(\sigma\) and \(\boldsymbol{v}\) are taken as known.

Under normality, you can show that \(z\sim \mathcal{N}\left(0,1\right)\). This can be used to form a two–sided confidence interval for the quantity \(b\):

\[ C = \boldsymbol{v}^\intercal\hat{\boldsymbol{\beta}}\pm q_{\alpha/2} \sigma \sqrt{\boldsymbol{v}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{v}}, \] where \(-q_{\alpha/2}\) is the \(\alpha / 2\) quantile of the standard normal distribution.
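Here is a minimal Monte Carlo check of this interval in Python with numpy/scipy (the course examples use R; all constants below are illustrative): with \(\sigma\) known, the z-interval should cover the true value \(\boldsymbol{v}^\intercal{\boldsymbol{\beta}^{*}}\) about \(95\%\) of the time at \(\alpha = 0.05\).

```python
import numpy as np
from scipy.stats import norm

# Coverage check for the fixed-variance z-interval (constants illustrative).
rng = np.random.default_rng(3)
N, P, sigma, alpha, reps = 50, 2, 1.0, 0.05, 2000
beta_star = np.array([1.0, -1.0])
v = np.array([1.0, 0.0])                       # inspect the first coefficient

X = rng.normal(size=(N, P))
XtX_inv = np.linalg.inv(X.T @ X)
half_width = norm.ppf(1 - alpha / 2) * sigma * np.sqrt(v @ XtX_inv @ v)

covered = 0
for _ in range(reps):
    Y = X @ beta_star + rng.normal(0.0, sigma, size=N)
    beta_hat = XtX_inv @ X.T @ Y
    covered += abs(v @ beta_hat - v @ beta_star) <= half_width

coverage = covered / reps
assert 0.93 <= coverage <= 0.97                # nominal coverage is 0.95
```

`norm.ppf(1 - alpha / 2)` plays the role of \(q_{\alpha/2}\) above (in R this would be `qnorm`).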

Let’s write \(s_p^2 := \boldsymbol{v}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{v}\), which is the contribution of the regressors to the variance. It’s worth noting that \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\rightarrow \boldsymbol{M}_{\boldsymbol{X}}\) by the LLN, so by the continuous mapping theorem we expect that \[ N s_p^2 = N \boldsymbol{v}^\intercal\left(\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{v}= \boldsymbol{v}^\intercal\left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{v}\rightarrow \boldsymbol{v}^\intercal\boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{v}. \] From this it follows that \(s_p \approx \sqrt{\boldsymbol{v}^\intercal\boldsymbol{M}_{\boldsymbol{X}}^{-1} \boldsymbol{v}} / \sqrt{N}\) for large \(N\) — that is, \(s_p\) gets smaller as \(N\) grows, and the test statistic gets larger for any fixed value of \(\hat{\boldsymbol{\beta}}\).

How can we use the above trick to get a confidence interval or hypothesis test involving just one component, \(\beta_p\)? We can take \(\boldsymbol{e}_p\) to be the vector with \(1\) in the \(p\)-th location and \(0\) elsewhere, and note that \(\beta_p = \boldsymbol{e}_p^\intercal\boldsymbol{\beta}\). Then, by normality,

\[ \begin{aligned} \hat{\beta}_p = \boldsymbol{e}_p^\intercal\hat{\boldsymbol{\beta}}\sim& \mathcal{N}\left(\boldsymbol{e}_p^\intercal\boldsymbol{\beta}, \sigma^2 \boldsymbol{e}_p^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{e}_p \right) = \mathcal{N}\left(\beta_p, \sigma^2 s_p^2 \right), \textrm{ where }\\ s_p^2 =& \left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \right)_{pp} \textrm{, the p-th diagonal entry of }(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}. \end{aligned} \]

Note that \(\left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \right)_{pp} \ne \left( (\boldsymbol{X}^\intercal\boldsymbol{X})_{pp} \right)^{-1}\)! You must invert first, then take the \(p\)–th diagonal.

This gives the corresponding two–sided confidence interval for \(\beta_p\),

\[ \hat{\beta}_p \pm q_\alpha \sigma s_p, \]

where \(-q_\alpha\) is the \(\alpha / 2\) quantile of the standard normal distribution. The problem is we typically don’t actually know \(\sigma^2\), and so have to estimate it. And that estimate will have some sampling variability which we must account for.

Estimating the residual variance

How can we estimate \(\sigma^2\)? If we actually observed the residuals, by the LLN we could use

\[ \frac{1}{N} \sum_{n=1}^N\varepsilon_n^2 \rightarrow \sigma^2. \]

But \(\varepsilon_n = y_n - \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), and we don’t know \(\boldsymbol{\beta}\). A reasonable substitute is the sample residuals \(\hat{\varepsilon}_n = y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}\), since

\[ \begin{aligned} \hat{\varepsilon}_n^2 =& \left(y_n - \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n \right)^2 \\=& \left(y_n - \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n - \boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n \right)^2 \\=& \left(\varepsilon_n + (\boldsymbol{\beta}- \hat{\boldsymbol{\beta}})^\intercal\boldsymbol{x}_n \right)^2, \end{aligned} \]

and we know that \(\hat{\boldsymbol{\beta}}- \boldsymbol{\beta}\rightarrow 0\) as \(N \rightarrow \infty\). It turns out we can be even more precise than this under the normal assumption, using some insights from linear algebra, and defining a new distribution.

It turns out that, under normality, we can show that the residual estimator has a special distribution known as the “chi squared” distribution (I will write \(\mathcal{\chi}^2_{K}\)). One way to define it is as follows.

Notation

Let \(z_k \sim \mathcal{N}\left(0, 1\right)\) denote \(k=1,\ldots,K\) IID standard normal random variables. The random variable \[ s:= \sum_{k=1}^K z_k^2 \] is called a chi-squared random variable with \(K\) degrees of freedom, writing \(s\sim \mathcal{\chi}^2_{K}\).

Thus, just like the multivariate normal random variable, chi squared random variables are defined as transformations of standard normal random variables.

We show below (but won’t prove in class) that, when \(\boldsymbol{X}\) is full rank, then \[ p\left(\frac{1}{\sigma^2} \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}\,\middle\vert\, \boldsymbol{X}\right) = \mathcal{\chi}^2_{N - P}. \]

As you will show in your homework, if \(s\sim \mathcal{\chi}^2_{N - P}\), then \(\mathbb{E}\left[s\right] = N - P\). It follows that we can define \[ \hat{\sigma}^2 := \frac{1}{N-P} \sum_{n=1}^N\hat{\varepsilon}_n^2 \quad\Rightarrow\quad \mathbb{E}\left[\hat{\sigma}^2 \vert \boldsymbol{X}\right] = \sigma^2. \] We get an unbiased estimate by normalizing by \(N - P\) instead of \(N\).
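We can check both facts by simulation. The Python/numpy sketch below (dimensions and seed are illustrative) redraws \(\boldsymbol{\varepsilon}\) many times with \(\boldsymbol{X}\) fixed, forms \(\hat{\boldsymbol{\varepsilon}} = (\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{X}^\intercal)\boldsymbol{\varepsilon}\), and verifies that \(\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}/\sigma^2\) has mean \(N - P\), so that \(\hat{\sigma}^2\) is (approximately, in the simulation) unbiased.

```python
import numpy as np

# Check that sum(eps_hat^2) / sigma^2 has mean N - P, so that
# sigma_hat^2 = sum(eps_hat^2) / (N - P) is unbiased (illustrative sketch).
rng = np.random.default_rng(4)
N, P, sigma, reps = 20, 2, 1.5, 20_000

X = rng.normal(size=(N, P))
H = X @ np.linalg.inv(X.T @ X) @ X.T          # "hat" projection matrix
resid_maker = np.eye(N) - H                   # maps eps to eps_hat

eps = rng.normal(0.0, sigma, size=(reps, N))
eps_hat = eps @ resid_maker.T                 # residuals for each replication
scaled_rss = (eps_hat**2).sum(axis=1) / sigma**2

assert abs(scaled_rss.mean() - (N - P)) < 0.5          # E ~ N - P = 18
sigma2_hat = (eps_hat**2).sum(axis=1) / (N - P)
assert abs(sigma2_hat.mean() - sigma**2) < 0.1         # E ~ sigma^2 = 2.25
```

Normalizing the residual sum of squares by \(N\) instead of \(N - P\) would make the corresponding means come out too small, which is exactly the bias the \(N - P\) correction removes.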

Furthermore, we can show that \(\hat{\boldsymbol{\varepsilon}}\) — and so \(\hat{\sigma}^2\) — is independent of \(\hat{\boldsymbol{\beta}}\). Note that

\[ \hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{\varepsilon}- \boldsymbol{X}\left( {\boldsymbol{\beta}^{*}}+ \left(\boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\right) = \left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \boldsymbol{\varepsilon}. \]

It follows that \(\hat{\boldsymbol{\varepsilon}}\) is MVN and mean zero. Furthermore,

\[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\varepsilon}}\vert \boldsymbol{X}\right) ={}& \mathrm{Cov}\left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}, \left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \boldsymbol{\varepsilon}\right) \\={}& (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal \mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right] \left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \\={}& \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\left(\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \\={}& \sigma^2 \left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \\={}& \boldsymbol{0}. \end{aligned} \] So it follows that \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\varepsilon}}\) are not only normal but independent, since for jointly normal random variables, uncorrelatedness implies independence.

Further, since \(\hat{\sigma}^2\) is simply a deterministic function of \(\hat{\boldsymbol{\varepsilon}}\), it follows that \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent!
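This uncorrelatedness is also easy to see empirically. The sketch below (Python/numpy, with an illustrative one-regressor design) simulates many replications with \(\boldsymbol{X}\) fixed and checks that the correlation between \(\hat{\boldsymbol{\beta}}\) and each coordinate of \(\hat{\boldsymbol{\varepsilon}}\) is approximately zero.

```python
import numpy as np

# Empirical check that beta_hat and the fitted residuals are uncorrelated
# (for jointly normal variables this implies independence); sketch only.
rng = np.random.default_rng(5)
N, P, sigma, reps = 10, 1, 1.0, 20_000

X = rng.normal(size=(N, P))
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

eps = rng.normal(0.0, sigma, size=(reps, N))
beta_hat = eps @ X @ XtX_inv              # beta_star = 0 WLOG; shape (reps, 1)
eps_hat = eps @ (np.eye(N) - H).T         # shape (reps, N)

# Correlation between beta_hat and each residual coordinate should be ~ 0.
corr = np.array([np.corrcoef(beta_hat[:, 0], eps_hat[:, n])[0, 1]
                 for n in range(N)])
assert np.all(np.abs(corr) < 0.05)
```

Of course, zero correlation alone would not prove independence for general random variables; it is the joint normality established above that upgrades it.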

T-tests (accounting for the variance of the variance estimate)

Recall that we have shown that

\[ \hat{\sigma}^2 := \frac{1}{N - P} \sum_{n=1}^N\hat{\varepsilon}_n^2 = \frac{\sigma^2}{N-P} s \quad\textrm{where}\quad s\sim \mathcal{\chi}^2_{N-P}, \]

and that, furthermore, \(\hat{\sigma}\) and \(\hat{\beta}\) are independent of one another.

Let’s try to use this fact to get a probability distribution for \(\hat{\beta}_p\) that takes the variability of \(\hat{\sigma}\) into account, where we substitute \(\hat{\sigma}\) in for \(\sigma\) in the pivot

\[ \frac{\hat{\beta}_p - \beta_p}{s_p \sigma}. \]

Write

\[ \begin{aligned} \frac{\hat{\beta}_p - \beta_p}{s_p \hat{\sigma}} ={}& \frac{\hat{\beta}_p - \beta_p}{s_p \sqrt{\frac{\sigma^2}{N-P} s}} \\={}& \left(\frac{\hat{\beta}_p - \beta_p}{s_p \sigma}\right) \Big/ \sqrt{\frac{s}{N-P} }. \end{aligned} \]

As we showed before, \(\frac{\hat{\beta}_p - \beta_p}{s_p \sigma} \sim \mathcal{N}\left(0,1\right)\). Furthermore, it is independent of \(s\). The denominator is approximately \(1\) for large \(N - P\), but has some sampling variability. Now, all we have to do is give a name to the ratio of a standard normal to the square root of an independent, degrees-of-freedom–normalized chi squared: we call it a “student t-distribution.”

Definition

Let \(z\sim \mathcal{N}\left(0,1\right)\), and \(s\sim \mathcal{\chi}^2_{N-P}\), independently of one another. Then the distribution of \[ t:= \frac{z}{\sqrt{s/ (N-P)}} \]

is called a “student-t distribution with \(N-P\) degrees of freedom.” We write \(t\sim \mathrm{StudentT}_{N - P}\). As long as \(N - P > 2\), \(\mathbb{E}\left[t\right] = 0\) and \(\mathrm{Var}\left(t\right) = (N - P) / (N - P - 2)\).
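We can sanity-check this definition by simulation. The Python/numpy/scipy sketch below (with an arbitrary illustrative choice of degrees of freedom) builds \(t\) exactly as in the definition and compares its empirical mean and variance to \(0\) and \((N-P)/(N-P-2)\).

```python
import numpy as np
from scipy.stats import t as student_t

# Simulate t = z / sqrt(s / df) from its definition and compare moments
# with scipy's Student-t distribution (df is an illustrative choice).
rng = np.random.default_rng(6)
df, draws = 10, 200_000

z = rng.normal(size=draws)
s = rng.chisquare(df, size=draws)
t_samples = z / np.sqrt(s / df)

assert abs(t_samples.mean()) < 0.02                  # E[t] = 0
assert abs(t_samples.var() - df / (df - 2)) < 0.1    # Var[t] = df/(df-2) = 1.25
assert abs(t_samples.var() - student_t.var(df)) < 0.1
```

Note that the variance \(df/(df-2)\) exceeds \(1\) and shrinks toward \(1\) as the degrees of freedom grow, matching the intuition that the \(\sqrt{s/(N-P)}\) denominator adds extra spread for small \(N - P\).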

You can find quantiles of the student-t distribution using the R function qt(), just as you would use qnorm() for the normal distribution.

Definition

The “t-test statistic” for the null \(H_0: \boldsymbol{v}^\intercal{\boldsymbol{\beta}^{*}}= b\) is given by \[ t:= \frac{\boldsymbol{v}^\intercal\hat{\boldsymbol{\beta}}- b}{\hat{\sigma}\sqrt{\boldsymbol{v}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{v}}}. \] Here, \(\boldsymbol{v}\) is taken as known. A t-test of \(H_0\) compares the above statistic to the quantiles of the student-t distribution with \(N-P\) degrees of freedom.

Given all this, we have shown that

\[ \mathbb{P}\left(\frac{\hat{\beta}_p - \beta_p}{s_p \hat{\sigma}} \le z_\alpha\right) = \mathbb{P}\left(t\le z_\alpha\right) \quad\textrm{where}\quad t\sim \mathrm{StudentT}_{N - P}. \]

Using this formula, we can find exact intervals for \(\beta_p\) under normality. Specifically, if we choose \(z_\alpha\) so that \(\mathbb{P}\left(t\le -z_\alpha\right) = \alpha / 2\) for the student-t distribution rather than the normal distribution, then the confidence interval

\[ \hat{\beta}_p \pm s_p \hat{\sigma}z_\alpha \]

is a valid two–sided confidence interval for \(\beta_p\), despite the fact that we used \(\hat{\sigma}\) instead of the true \(\sigma\). The only difference is that we replaced the normal quantile, \(q_\alpha\), with the (slightly larger) student-t quantile, \(z_\alpha\).
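The "slightly larger" claim is easy to verify numerically. The sketch below (Python/scipy; in R you would compare qt() and qnorm()) checks that the student-t quantile strictly exceeds the normal quantile for several illustrative degrees of freedom, and shrinks toward it as the degrees of freedom grow.

```python
from scipy.stats import norm, t as student_t

# The student-t quantile exceeds the normal quantile, so the t-interval is
# slightly wider; the gap vanishes as N - P grows (illustrative dfs).
alpha = 0.05
q_norm = norm.ppf(1 - alpha / 2)

gaps = []
for df in (5, 30, 1000):
    q_t = student_t.ppf(1 - alpha / 2, df)
    assert q_t > q_norm               # t quantile is strictly larger
    gaps.append(q_t - q_norm)

# The excess width shrinks monotonically with the degrees of freedom.
assert gaps[0] > gaps[1] > gaps[2] > 0
```

For example, with \(5\) degrees of freedom the \(0.975\) t-quantile is roughly \(2.57\), versus roughly \(1.96\) for the normal, so small-sample t-intervals are noticeably wider.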