$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} 
\def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Test statistics under normality


Goals

  • Derive confidence intervals for the coefficients under the normal assumption
    • For a fixed variance (normal intervals)
    • For an unknown variance (student-t intervals)
  • Introduce the chi squared and student-t distributions

Pivots for fixed variance

In the last lecture, we computed confidence intervals for a sample mean by constructing a pivot — an invertible transformation of our data, depending on the unknown parameter, which has a fixed, known distribution. We then inverted the pivot to form confidence intervals. In this lecture, we’ll do the same thing for a coefficient in a linear regression under the “Normal Assumptions:”

\[ \begin{aligned} \y_n =& \betav^\trans \xv_n + \res_n & \res_n \iid \gauss{0, \sigma^2}. \end{aligned} \]

Recall that we showed that, under the normal assumptions,

\[ \betavhat \sim \gauss{\betav, \sigma^2 (\X^\trans \X)^{-1}}. \]

Suppose we want to form a confidence interval for coefficient \(p\) of \(\betav\), i.e., for \(\beta_p\). How can we use the distribution of \(\betavhat\) to get a pivot depending on \(\beta_p\)?

First, take \(\ev_p\) to be the vector with a \(1\) in the \(p\)-th location and \(0\) elsewhere, and note that \(\beta_p = \ev_p^\trans \betav\). Then, by normality,

\[ \begin{aligned} \betahat_p = \ev_p^\trans \betavhat \sim& \gauss{\ev_p^\trans \betav, \sigma^2 \ev_p^\trans (\X^\trans \X)^{-1} \ev_p } = \gauss{\beta_p, \sigma^2 \v_p }, \textrm{ where }\\ \v_p =& \left( (\X^\trans \X)^{-1} \right)_{pp} \textrm{, the p-th diagonal entry of }(\X^\trans\X)^{-1}. \end{aligned} \]

Note that \(\left( (\X^\trans \X)^{-1} \right)_{pp} \ne \left( (\X^\trans \X)_{pp} \right)^{-1}\)! You must invert first, then take the \(p\)–th diagonal.
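
As a quick numerical illustration (a sketch with simulated data, not part of the original derivation), the two orders of operations give different numbers:

```r
# Hypothetical simulated design matrix; names are illustrative.
set.seed(42)
N <- 100; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
xtx <- t(X) %*% X
p <- 2
v_p_correct <- solve(xtx)[p, p]   # ((X^T X)^{-1})_{pp}: invert first, then take the diagonal
v_p_wrong   <- 1 / xtx[p, p]      # ((X^T X)_{pp})^{-1}: not the same quantity!
c(correct = v_p_correct, wrong = v_p_wrong)
```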

If we knew \(\sigma^2\) then we could form the pivot

\[ \frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}} \sim \gauss{0, 1}, \]

giving the corresponding two–sided confidence interval for \(\beta_p\),

\[ \betahat_p \pm q_\alpha \sigma \sqrt{\v_p}, \]

where \(-q_\alpha\) is the \(\alpha / 2\) quantile of the standard normal distribution. The problem is we typically don’t actually know \(\sigma^2\), and so have to estimate it. And that estimate will have some sampling variability which we must account for.
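
Here is a minimal sketch in R of the known-\(\sigma\) interval, on simulated data with an illustrative true \(\beta\) and \(\sigma\); it uses the built-in qnorm() quantile function:

```r
# Known-variance (normal) interval for one coefficient; all settings are illustrative.
set.seed(1)
N <- 100; P <- 3; sigma <- 2
beta <- c(1, -0.5, 0.25)
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
Y <- X %*% beta + rnorm(N, sd = sigma)
betahat <- solve(t(X) %*% X, t(X) %*% Y)     # least squares estimate
p <- 2
v_p <- solve(t(X) %*% X)[p, p]
alpha <- 0.05
q_alpha <- qnorm(1 - alpha / 2)              # so that -q_alpha is the alpha/2 quantile
betahat[p] + c(-1, 1) * q_alpha * sigma * sqrt(v_p)
```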

Estimating the residual variance

How can we estimate \(\sigma^2\)? If we actually observed the residuals, by the LLN we could use

\[ \frac{1}{N} \sumn \res_n^2 \rightarrow \sigma^2. \]

But \(\res_n = \y_n - \xv_n^\trans \betav\), and we don’t know \(\betav\). A reasonable substitute is the sample residuals \(\reshat_n = \y_n - \xv_n^\trans \betavhat\), since

\[ \begin{aligned} \reshat_n^2 =& \left(\y_n - \betavhat^\trans \xv_n \right)^2 \\=& \left(\y_n - \betavhat^\trans \xv_n - \betav^\trans \xv_n + \betav^\trans \xv_n \right)^2 \\=& \left(\res_n + (\betav - \betavhat)^\trans \xv_n \right)^2, \end{aligned} \]

and we know that \(\betavhat - \betav \rightarrow 0\) as \(N \rightarrow \infty\). It turns out we can be even more precise than this under the normal assumption, using some insights from linear algebra, and defining a new distribution.

Residual variance as a chi squared variable

We’ll need a special random variable called a chi-squared random variable (which I will write \(\chisq{K}\)). One way to define it is as follows.

Notation

For \(k=1,\ldots,K\), let \(\z_k \sim \gauss{0, 1}\) denote IID standard normal random variables. The random variable \[ \s := \sum_{k=1}^K \z_k^2 \] is called a chi-squared random variable with \(K\) degrees of freedom, written \(\s \sim \chisq{K}\).

Thus, just like the multivariate normal random variable, chi squared random variables are defined as transformations of standard normal random variables.
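
A short simulation (illustrative only, not part of the notes) confirms the definition: sums of \(K\) squared standard normals behave like draws from \(\chisq{K}\), which R provides via rchisq() and qchisq():

```r
# Sum of K squared standard normals versus the chi-squared distribution with K df.
set.seed(2)
K <- 5
s_draws <- replicate(10000, sum(rnorm(K)^2))
c(mean = mean(s_draws), var = var(s_draws))  # should be close to K and 2 * K
quantile(s_draws, c(0.25, 0.5, 0.75))        # simulated quantiles...
qchisq(c(0.25, 0.5, 0.75), df = K)           # ...versus the theoretical ones
```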

Exercise

Let \(\s \sim \chisq{K}\). Prove that

  • \(\expect{\s} = K\)
  • \(\var{\s} = 2 K\) (hint: if \(\z \sim \gauss{0,\sigma^2}\), then \(\expect{z^4} = 3\sigma^4\))
  • If \(\a_n \sim \gauss{0,\sigma^2}\) IID for \(n = 1,\ldots,N\), then \(\frac{1}{\sigma^2} \sumn \a_n^2 \sim \chisq{N}\)
  • \(\frac{1}{K} \s \rightarrow 1\) as \(K \rightarrow \infty\)
  • \(\frac{1}{\sqrt{K}} (\s - K) \rightarrow \gauss{0, 2}\) as \(K \rightarrow \infty\)
  • Let \(\av \sim \gauss{\zerov, \id}\) where \(\av \in \rdom{K}\). Then \(\norm{\av}_2^2 \sim \chisq{K}\)
  • Let \(\av \sim \gauss{\zerov, \Sigmam}\) where \(\av \in \rdom{K}\). Then \(\av^\trans \Sigmam^{-1} \av \sim \chisq{K}\)

We will begin by writing the fitted residuals in matrix form. Writing the vector of fitted residuals as \(\resvhat := \Y - \X \betavhat\), we get \[ \begin{aligned} \resvhat ={}& \Y - \X (\X^\trans \X)^{-1} \X^\trans \Y \\={}& \left(\id - \X (\X^\trans \X)^{-1} \X^\trans \right) \Y, \end{aligned} \]

where we can recognize the projection matrix

\[ \proj{\X^\perp} = \id - \X (\X^\trans \X)^{-1} \X^\trans \]

that projects vectors to the orthogonal complement of the column space of \(\X\). Plugging in our model for \(\Y\),

\[ \resvhat = \proj{\X^\perp} (\X \betav + \resv) = \proj{\X^\perp} \resv, \]

since \(\proj{\X^\perp} \X = \zerov\). We thus have

\[ \resvhat^\trans \resvhat = \resv^\trans \proj{\X^\perp} \proj{\X^\perp} \resv = \resv^\trans \proj{\X^\perp} \resv, \]

since projecting onto a subspace twice is the same as projecting once.
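
As a numerical sanity check (a sketch on simulated data, not part of the original derivation), we can verify that multiplying \(\Y\) by this projection matrix reproduces the residuals returned by lm(), and that the matrix is idempotent:

```r
# Residuals as a projection of Y; settings are illustrative.
set.seed(3)
N <- 50; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
Y <- rnorm(N)
P_perp <- diag(N) - X %*% solve(t(X) %*% X) %*% t(X)
max(abs(P_perp %*% Y - resid(lm(Y ~ X - 1))))  # numerically zero
max(abs(P_perp %*% P_perp - P_perp))           # projecting twice = projecting once
```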

The problem with the quadratic form \(\resv^\trans \proj{\X^\perp} \resv\) is that \(\proj{\X^\perp}\) is complicated, and ties up the IID entries of \(\resv\) in a messy way. We can disentangle them by writing the eigenvalue decomposition of the square, symmetric matrix \(\proj{\X^\perp}\):

\[ \proj{\X^\perp} = \U \Lambda \U^\trans, \]

where \(\U\) is orthonormal and \(\Lambda\) has on the diagonal either \(1\) (for columns of \(\U\) perpendicular to the columns of \(\X\)) or \(0\) (for columns of \(\U\) in the span of the columns of \(\X\)).
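
We can check this claim numerically (again a sketch on simulated data): the eigenvalues of \(\proj{\X^\perp}\) are all \(0\) or \(1\), with exactly \(N - P\) ones:

```r
# Eigenvalues of the projection matrix onto the orthogonal complement of col(X).
set.seed(4)
N <- 20; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
P_perp <- diag(N) - X %*% solve(t(X) %*% X) %*% t(X)
lambda <- eigen(P_perp, symmetric = TRUE)$values
round(lambda, 10)   # N - P ones followed by P zeros
sum(lambda > 0.5)   # equals N - P = 17
```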

Plugging this decomposition into \(\resv^\trans \proj{\X^\perp} \resv\) gives

\[ \resvhat^\trans \resvhat = \resv^\trans \U \Lambda \U^\trans \resv = (\U^\trans \resv)^\trans \Lambda (\U^\trans \resv). \]

What is the distribution of \(\U^\trans \resv\)? Recall that \(\U\) is fixed — it is simply a (very complicated) function of the regressors, \(\X\). So \(\U^\trans \resv\) is a linear combination of normal random variables, and is itself normal. Its mean is \(\expect{\U^\trans \resv} = \U^\trans \expect{\resv} = \zerov\), and its covariance is

\[ \cov{\U^\trans \resv} = \expect{\U^\trans \resv \resv^\trans \U} = \U^\trans \expect{\resv \resv^\trans} \U = \U^\trans \sigma^2 \id \U = \sigma^2 \U^\trans \U = \sigma^2 \id, \]

where the final line is by the orthonormality of \(\U\) (which followed from the symmetry of \(\proj{\X^\perp}\)). It follows that

\[ \U^\trans \resv \sim \gauss{\zerov, \sigma^2 \id}. \]

So \(\U^\trans \resv\) has the same distribution as \(\resv\)! They are not the same random variable — the two are correlated in a complicated way by the matrix \(\U\) — but they have the same distribution.

Let’s write \(\resv' := \U^\trans \resv\) for this random variable, so that

\[ \resvhat^\trans \resvhat = \resv'^\trans \Lambda \resv' = \sumn {\res'_n}^2 \Lambda_{nn} = \sum_{n=1}^{N-P} {\res'_n}^2 = \sigma^2 \sum_{n=1}^{N-P} \left(\frac{\res'_n}{\sigma}\right)^2, \]

where without loss of generality we take the non-zero eigenvalues to be listed first, and recall that there are \(N - P\) of them, since the space perpendicular to the columns of \(\X\) has dimension \(N - P\).

By recognizing the definition of a chi-squared random variable, we can define an estimator of the residual variance,

\[ \sigmahat^2 = \frac{1}{N - P} \resvhat^\trans \resvhat \sim \frac{\sigma^2}{N - P} \s, \quad\textrm{where}\quad \s \sim \chisq{N-P}. \]

This distribution is true in finite sample — it is not asymptotic. But this beautiful result is only possible because of the normal assumption, which may not be very reasonable in practice.
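
A simulation sketch (illustrative settings, not from the notes) agrees with this finite-sample result: appropriately scaled, \(\sigmahat^2\) behaves like a \(\chisq{N-P}\) draw:

```r
# Sampling distribution of (N - P) * sigmahat^2 / sigma^2 under the normal assumptions.
set.seed(5)
N <- 30; P <- 3; sigma <- 1.5
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, 2, -1)
one_stat <- function() {
  Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
  sigmahat2 <- sum(resid(lm(Y ~ X - 1))^2) / (N - P)
  (N - P) * sigmahat2 / sigma^2
}
stats <- replicate(5000, one_stat())
c(mean = mean(stats), var = var(stats))  # about N - P = 27 and 2 * (N - P) = 54
quantile(stats, c(0.25, 0.5, 0.75))      # simulated quantiles...
qchisq(c(0.25, 0.5, 0.75), df = N - P)   # ...versus chi-squared quantiles
```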

From standard properties of the chi-squared distribution, some immediate results follow.

Variance

As expected, the variance of our estimator around the truth goes to zero as \(N \rightarrow \infty\), since

\[ \var{\sigmahat^2 - \sigma^2} = \sigma^4 \var{\frac{\s}{N - P} - 1} = \frac{\sigma^4}{(N - P)^2} \var{\s} = \frac{2(N - P)}{(N - P)^2}\sigma^4 \rightarrow 0. \]

This is essentially the chi-squared version of the proof we already gave using the LLN.

Bias

First, we can see that

\[ \expect{\sigmahat^2} = \sigma^2 \frac{N-P}{N - P} = \sigma^2. \]

In other words, \(\sigmahat^2\) is unbiased. Normalizing by \(N - P\) instead of \(N\), as we might naively have done, results in an unbiased estimator.
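
A quick simulation (illustrative, not from the notes) makes the contrast concrete: normalizing by \(N - P\) is unbiased, while normalizing by \(N\) underestimates \(\sigma^2\) by the factor \((N-P)/N\):

```r
# Bias of the two normalizations of the residual sum of squares.
set.seed(6)
N <- 20; P <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, -1, 0.5)
one_fit <- function() {
  Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
  rss <- sum(resid(lm(Y ~ X - 1))^2)
  c(unbiased = rss / (N - P), naive = rss / N)
}
rowMeans(replicate(10000, one_fit()))  # about sigma^2 = 4 and ((N - P) / N) * 4
```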

Independence

Recall that we’ve shown that

\[ \betavhat = \betav + (\X^\trans \X)^{-1} \X^\trans \resv \quad\textrm{and}\quad \resvhat = \proj{\X^\perp} \resv. \]

As a consequence, we have shown that both are multivariate normal. Furthermore,

\[ \cov{\betavhat, \resvhat} = \cov{(\X^\trans \X)^{-1} \X^\trans \resv, \proj{\X^\perp} \resv} = \expect{(\X^\trans \X)^{-1} \X^\trans \resv \resv^\trans \proj{\X^\perp} } = \sigma^2 (\X^\trans \X)^{-1} \X^\trans \proj{\X^\perp} = \zerov, \]

since \(\proj{\X^\perp} \X = \zerov\). So it follows that not only are \(\betavhat\) and \(\resvhat\) normal, they are independent, since for normal random variables, uncorrelatedness implies independence.

Further, since \(\sigmahat^2\) is simply a deterministic function of \(\resvhat\), it follows that \(\betavhat\) and \(\sigmahat^2\) are independent! This will be useful shortly when we derive confidence intervals for \(\beta_p\) under the normal assumptions.
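
A simulation sketch of this independence (on simulated data, with illustrative settings): across repeated draws of the noise, the sample correlation between a coefficient estimate and \(\sigmahat^2\) is near zero:

```r
# Uncorrelatedness of betahat and sigmahat^2 across repeated noise draws.
set.seed(7)
N <- 30; P <- 3; sigma <- 1
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, 0, 2)
one_draw <- function() {
  Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
  fit <- lm(Y ~ X - 1)
  c(coef(fit)[[2]], sum(resid(fit)^2) / (N - P))  # (betahat_2, sigmahat^2)
}
draws <- t(replicate(5000, one_draw()))
cor(draws[, 1], draws[, 2])  # close to zero
```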

Bonus content: Alternative (simpler) derivation of the bias of the variance estimator

If all we want is \(\expect{\sigmahat^2}\), we can do a trick to avoid the chi squared distribution and eigendecomposition of the projection matrix.

In particular, \(\resvhat^\trans \resvhat\) is a scalar, and the trace of a \(1 \times 1\) matrix is just that scalar, so

\[ \resv^\trans \proj{\X^\perp} \resv = \trace{\resv^\trans \proj{\X^\perp} \resv} = \trace{\proj{\X^\perp} \resv \resv^\trans }, \]

where we have used the general property of the trace that \(\trace{AB} = \trace{BA}\). Combining the above results, we see that

\[ \begin{aligned} \expect{\resvhat^\trans \resvhat} ={}& \expect{\trace{\proj{\X^\perp} \resv \resv^\trans }} \\={}& \trace{\proj{\X^\perp} \expect{\resv \resv^\trans }} \\={}& \trace{\proj{\X^\perp} \sigma^2 \id } & \textrm{(by A2)} \\={}& \sigma^2 \trace{\proj{\X^\perp} } & \textrm{(by linearity of the trace)}. \end{aligned} \]

Now, since \(\proj{\X^\perp}\) is a projection matrix onto an \(N - P\) dimensional space, \(\trace{\proj{\X^\perp}} = N - P\), so we get

\[ \expect{\resvhat^\trans \resvhat} = (N - P) \sigma^2. \]
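
This trace identity is easy to confirm numerically (a one-off check on a simulated \(\X\)):

```r
# The trace of the projection matrix equals N - P.
set.seed(8)
N <- 25; P <- 4
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
P_perp <- diag(N) - X %*% solve(t(X) %*% X) %*% t(X)
sum(diag(P_perp))  # equals N - P = 21, up to floating point error
```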

Using the residual variance estimator in the coefficient confidence interval (accounting for the variability of the variance estimator)

Recall that we have shown that

\[ \sigmahat^2 := \frac{1}{N - P} \sumn \reshat_n^2 \sim \frac{\sigma^2}{N-P} \s \quad\textrm{where}\quad \s \sim \chisq{N-P}, \]

and that, furthermore, \(\sigmahat\) and \(\betahat\) are independent of one another.

Let’s try to use this fact to get a probability distribution for \(\betahat_p\) that takes the variability of \(\sigmahat\) into account, where we substitute \(\sigmahat\) in for \(\sigma\) in the pivot

\[ \frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}}. \]

Write

\[ \begin{aligned} \frac{\betahat_p - \beta_p}{\sigmahat \sqrt{\v_p}} ={}& \frac{\betahat_p - \beta_p}{\sqrt{\frac{\sigma^2}{N-P} \s} \sqrt{\v_p}} \\={}& \left(\frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}}\right) \Big/ \sqrt{\frac{\s}{N-P} }. \end{aligned} \]

As we showed before, \(\frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}} \sim \gauss{0,1}\). Furthermore, it is independent of \(\s\). The denominator is approximately \(1\) for large \(N - P\), but has some sampling variability. The ratio of a standard normal to the square root of an independent chi squared divided by its degrees of freedom happens to follow a known distribution.

Definition

Let \(\z \sim \gauss{0,1}\), and \(\s \sim \chisq{N-P}\), independently of one another. Then the distribution of \[ \t := \frac{\z}{\sqrt{\s / (N-P)}} \]

is called a “student-t distribution with \(N-P\) degrees of freedom.” We write \(\t \sim \studentt{N - P}\). As long as \(N - P > 2\), \(\expect{\t} = 0\) and \(\var{\t} = (N - P) / (N - P - 2)\).
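
The following simulation (illustrative only) matches this definition against R's built-in qt() quantiles:

```r
# A standard normal divided by sqrt(chi-squared / df) behaves like a student-t draw.
set.seed(9)
df <- 7
t_draws <- rnorm(100000) / sqrt(rchisq(100000, df = df) / df)
c(mean = mean(t_draws), var = var(t_draws), theory_var = df / (df - 2))
quantile(t_draws, c(0.05, 0.95))  # simulated quantiles...
qt(c(0.05, 0.95), df = df)        # ...versus student-t quantiles
```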

Exercise

Let \(\t \sim \studentt{K}\) with \(K > 2\).

  • Prove that \(\expect{\t} = 0\).
  • Without using the above explicit formula, prove that \(\var{\t} > 1\).
    (Hint: use the fact that, for any positive non-constant random variable, \(\RV{x}\), \(\expect{1 / \RV{x}} > 1 / \expect{\RV{x}}\), by Jensen’s inequality.)

You can find quantiles of the student-t distribution using the R function qt(), just as you would use qnorm() for the normal distribution.
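
For example (with an illustrative degrees-of-freedom value, not from the notes):

```r
# Quantile for a 95% two-sided interval with N - P = 27 degrees of freedom.
qt(0.975, df = 27)   # about 2.05, slightly larger than qnorm(0.975) = 1.96
```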

Given all this, we have shown that

\[ \prob{\frac{\betahat_p - \beta_p}{\sigmahat \sqrt{\v_p}} \le \z_\alpha} = \prob{\t \le \z_\alpha} \quad\textrm{where}\quad \t \sim \studentt{N - P}. \]

Using this formula, we can find exact confidence intervals for the coefficients under normality. Specifically, if we choose \(\z_\alpha\) so that \(\prob{\t \le -\z_\alpha} = \alpha / 2\) for the student-t distribution rather than the normal distribution, then the confidence interval

\[ \betahat_p \pm \z_\alpha \sigmahat \sqrt{\v_p} \]

is a valid two–sided confidence interval for \(\beta_p\), despite the fact that we used \(\sigmahat\) instead of the true \(\sigma\). The only difference is that we replaced the normal quantile, \(q_\alpha\), with the (slightly larger) student-t quantile, \(\z_\alpha\).
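
Putting the pieces together, here is a sketch on simulated data (illustrative settings throughout). R's confint() applied to an lm() fit computes exactly these student-t intervals, so it provides a check on the hand-rolled formula:

```r
# Student-t confidence interval for one coefficient, by hand and via confint().
set.seed(10)
N <- 40; P <- 3; sigma <- 1.5
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(2, -1, 0.5)
Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
fit <- lm(Y ~ X - 1)
p <- 2
betahat_p <- coef(fit)[p]
v_p <- solve(t(X) %*% X)[p, p]
sigmahat <- sqrt(sum(resid(fit)^2) / (N - P))
alpha <- 0.05
z_alpha <- qt(1 - alpha / 2, df = N - P)
betahat_p + c(-1, 1) * z_alpha * sigmahat * sqrt(v_p)
confint(fit, level = 0.95)[p, ]  # matches the interval above
```

Using the normal quantile qnorm(0.975) in place of qt(0.975, df = N - P) would give a slightly narrower interval that under-covers in small samples.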

Exercise

Derive student-t intervals for \(\av^\trans \betav\), for a generic vector \(\av\).