$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} 
\def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Test statistics under normality


Goals

  • Derive confidence intervals for the coefficients under the normal assumption
    • For a fixed variance (normal intervals)
    • For an unknown variance (student-t intervals)
  • Introduce the chi squared and student-t distributions

Pivots for fixed variance

In the last lecture, we computed confidence intervals for a sample mean by constructing a pivot — an invertible transformation of our data, depending on the unknown parameter, which has a fixed, known distribution. We then inverted the pivot to form confidence intervals. In this lecture, we’ll do the same thing for a coefficient in a linear regression under the “Normal Assumptions:”

\[ \begin{aligned} \y_n =& \betav^\trans \xv_n + \res_n & \res_n \iid \gauss{0, \sigma^2}. \end{aligned} \]

Recall that we showed that, under the normal assumptions,

\[ \betavhat \sim \gauss{\betav, \sigma^2 (\X^\trans \X)^{-1}}. \]

Suppose we want to form a confidence interval for coefficient \(p\) of \(\betav\), i.e., for \(\beta_p\). How can we use the distribution of \(\betavhat\) to get a pivot depending on \(\beta_p\)?

First, take \(\ev_p\) to be the vector with a \(1\) in the \(p\)-th location and \(0\) elsewhere, and note that \(\beta_p = \ev_p^\trans \betav\). Then, by normality,

\[ \begin{aligned} \betahat_p = \ev_p^\trans \betavhat \sim& \gauss{\ev_p^\trans \betav, \sigma^2 \ev_p^\trans (\X^\trans \X)^{-1} \ev_p } = \gauss{\beta_p, \sigma^2 \v_p }, \textrm{ where }\\ \v_p =& \left( (\X^\trans \X)^{-1} \right)_{pp} \textrm{, the p-th diagonal entry of }(\X^\trans\X)^{-1}. \end{aligned} \]

Note that \(\left( (\X^\trans \X)^{-1} \right)_{pp} \ne \left( (\X^\trans \X)_{pp} \right)^{-1}\)! You must invert first, then take the \(p\)–th diagonal.
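
As a quick numerical illustration (a sketch with simulated data, not part of the original derivation), the two orders of operations give different numbers:

```r
# Hypothetical simulated design matrix; names are illustrative.
set.seed(42)
N <- 100; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
xtx <- t(X) %*% X
p <- 2
v_p_correct <- solve(xtx)[p, p]   # ((X^T X)^{-1})_{pp}: invert first, then take the diagonal
v_p_wrong   <- 1 / xtx[p, p]      # ((X^T X)_{pp})^{-1}: not the same quantity!
c(correct = v_p_correct, wrong = v_p_wrong)
```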

If we knew \(\sigma^2\) then we could form the pivot

\[ \frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}} \sim \gauss{0, 1}, \]

giving the corresponding two–sided confidence interval for \(\beta_p\),

\[ \betahat_p \pm q_\alpha \sigma \sqrt{\v_p}, \]

where \(-q_\alpha\) is the \(\alpha / 2\) quantile of the standard normal distribution. The problem is we typically don’t actually know \(\sigma^2\), and so have to estimate it. And that estimate will have some sampling variability which we must account for.
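
Here is a minimal sketch in R of the known-\(\sigma\) interval, on simulated data with an illustrative true \(\beta\) and \(\sigma\); it uses the built-in qnorm() quantile function:

```r
# Known-variance (normal) interval for one coefficient; all settings are illustrative.
set.seed(1)
N <- 100; P <- 3; sigma <- 2
beta <- c(1, -0.5, 0.25)
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
Y <- X %*% beta + rnorm(N, sd = sigma)
betahat <- solve(t(X) %*% X, t(X) %*% Y)     # least squares estimate
p <- 2
v_p <- solve(t(X) %*% X)[p, p]
alpha <- 0.05
q_alpha <- qnorm(1 - alpha / 2)              # so that -q_alpha is the alpha/2 quantile
betahat[p] + c(-1, 1) * q_alpha * sigma * sqrt(v_p)
```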

Estimating the residual variance

How can we estimate \(\sigma^2\)? If we actually observed the residuals, by the LLN we could use

\[ \frac{1}{N} \sumn \res_n^2 \rightarrow \sigma^2. \]

But \(\res_n = \y_n - \xv_n^\trans \betav\), and we don’t know \(\betav\). A reasonable substitute is the sample residuals \(\reshat_n = \y_n - \xv_n^\trans \betavhat\), since

\[ \begin{aligned} \reshat_n^2 =& \left(\y_n - \betavhat^\trans \xv_n \right)^2 \\=& \left(\y_n - \betavhat^\trans \xv_n - \betav^\trans \xv_n + \betav^\trans \xv_n \right)^2 \\=& \left(\res_n + (\betav - \betavhat)^\trans \xv_n \right)^2, \end{aligned} \]

and we know that \(\betavhat - \betav \rightarrow 0\) as \(N \rightarrow \infty\). It turns out we can be even more precise than this under the normal assumption, using some insights from linear algebra, and defining a new distribution.

Residual variance as a chi squared variable

We’ll need a special random variable called a chi-squared random variable (which I will write \(\chisq{K}\)). One way to define it is as follows.

Notation

For \(k=1,\ldots,K\), let \(\z_k \sim \gauss{0, 1}\) denote IID standard normal random variables. The random variable \[ \s := \sum_{k=1}^K \z_k^2 \] is called a chi-squared random variable with \(K\) degrees of freedom, written \(\s \sim \chisq{K}\).

Thus, just like the multivariate normal random variable, chi squared random variables are defined as transformations of standard normal random variables.
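
A short simulation (illustrative only, not part of the notes) confirms the definition: sums of \(K\) squared standard normals behave like draws from \(\chisq{K}\), which R provides via rchisq() and qchisq():

```r
# Sum of K squared standard normals versus the chi-squared distribution with K df.
set.seed(2)
K <- 5
s_draws <- replicate(10000, sum(rnorm(K)^2))
c(mean = mean(s_draws), var = var(s_draws))  # should be close to K and 2 * K
quantile(s_draws, c(0.25, 0.5, 0.75))        # simulated quantiles...
qchisq(c(0.25, 0.5, 0.75), df = K)           # ...versus the theoretical ones
```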

Exercise

Let \(\s \sim \chisq{K}\). Prove that

  • \(\expect{\s} = K\)
  • \(\var{\s} = 2 K\) (hint: if \(\z \sim \gauss{0,\sigma^2}\), then \(\expect{z^4} = 3\sigma^4\))
  • If \(\a_n \sim \gauss{0,\sigma^2}\) IID for \(n = 1,\ldots,N\), then \(\frac{1}{\sigma^2} \sumn \a_n^2 \sim \chisq{N}\)
  • \(\frac{1}{K} \s \rightarrow 1\) as \(K \rightarrow \infty\)
  • \(\frac{1}{\sqrt{K}} (\s - K) \rightarrow \gauss{0, 2}\) as \(K \rightarrow \infty\)
  • Let \(\av \sim \gauss{\zerov, \id}\) where \(\av \in \rdom{K}\). Then \(\norm{\av}_2^2 \sim \chisq{K}\)
  • Let \(\av \sim \gauss{\zerov, \Sigmam}\) where \(\av \in \rdom{K}\). Then \(\av^\trans \Sigmam^{-1} \av \sim \chisq{K}\)

We will begin by writing the fitted residuals in matrix form. Writing the vector of fitted residuals as \(\resvhat := \Y - \X \betavhat\), we get \[ \begin{aligned} \resvhat ={}& \Y - \X (\X^\trans \X)^{-1} \X^\trans \Y \\={}& \left(\id - \X (\X^\trans \X)^{-1} \X^\trans \right) \Y, \end{aligned} \]

where we can recognize the projection matrix

\[ \proj{\X^\perp} = \id - \X (\X^\trans \X)^{-1} \X^\trans \]

that projects vectors to the orthogonal complement of the column space of \(\X\). Plugging in our model for \(\Y\),

\[ \resvhat = \proj{\X^\perp} (\X \betav + \resv) = \proj{\X^\perp} \resv, \]

since \(\proj{\X^\perp} \X = \zerov\). We thus have

\[ \resvhat^\trans \resvhat = \resv^\trans \proj{\X^\perp} \proj{\X^\perp} \resv = \resv^\trans \proj{\X^\perp} \resv, \]

since projecting onto a subspace twice is the same as projecting once.
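
As a numerical sanity check (a sketch on simulated data, not part of the original derivation), we can verify that multiplying \(\Y\) by this projection matrix reproduces the residuals returned by lm(), and that the matrix is idempotent:

```r
# Residuals as a projection of Y; settings are illustrative.
set.seed(3)
N <- 50; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
Y <- rnorm(N)
P_perp <- diag(N) - X %*% solve(t(X) %*% X) %*% t(X)
max(abs(P_perp %*% Y - resid(lm(Y ~ X - 1))))  # numerically zero
max(abs(P_perp %*% P_perp - P_perp))           # projecting twice = projecting once
```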

The problem with the quadratic form \(\resv^\trans \proj{\X^\perp} \resv\) is that \(\proj{\X^\perp}\) is complicated, and ties up the IID entries of \(\resv\) in a messy way. We can disentangle them by writing the eigenvalue decomposition of the square, symmetric matrix \(\proj{\X^\perp}\):

\[ \proj{\X^\perp} = \U \Lambda \U^\trans, \]

where \(\U\) is orthonormal and \(\Lambda\) has on the diagonal either \(1\) (for columns of \(\U\) perpendicular to the columns of \(\X\)) or \(0\) (for columns of \(\U\) in the span of the columns of \(\X\)).
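
We can check this claim numerically (again a sketch on simulated data): the eigenvalues of \(\proj{\X^\perp}\) are all \(0\) or \(1\), with exactly \(N - P\) ones:

```r
# Eigenvalues of the projection matrix onto the orthogonal complement of col(X).
set.seed(4)
N <- 20; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
P_perp <- diag(N) - X %*% solve(t(X) %*% X) %*% t(X)
lambda <- eigen(P_perp, symmetric = TRUE)$values
round(lambda, 10)   # N - P ones followed by P zeros
sum(lambda > 0.5)   # equals N - P = 17
```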

Plugging this decomposition into \(\resv^\trans \proj{\X^\perp} \resv\) gives

\[ \resvhat^\trans \resvhat = \resv^\trans \U \Lambda \U^\trans \resv = (\U^\trans \resv)^\trans \Lambda (\U^\trans \resv). \]

What is the distribution of \(\U^\trans \resv\)? Recall that \(\U\) is fixed — it is simply a (very complicated) function of the regressors, \(\X\). So \(\U^\trans \resv\) is a linear combination of normal random variables, and is itself normal. Its mean is \(\expect{\U^\trans \resv} = \U^\trans \expect{\resv} = \zerov\), and its covariance is

\[ \cov{\U^\trans \resv} = \expect{\U^\trans \resv \resv^\trans \U} = \U^\trans \expect{\resv \resv^\trans} \U = \U^\trans \sigma^2 \id \U = \sigma^2 \U^\trans \U = \sigma^2 \id, \]

where the final line is by the orthonormality of \(\U\) (which followed from the symmetry of \(\proj{\X^\perp}\)). It follows that

\[ \U^\trans \resv \sim \gauss{\zerov, \sigma^2 \id}. \]

So \(\U^\trans \resv\) has the same distribution as \(\resv\)! They are not the same random variable — the two are correlated in a complicated way by the matrix \(\U\) — but they have the same distribution.

Let’s write \(\resv' := \U^\trans \resv\) for this random variable, so that

\[ \resvhat^\trans \resvhat = \resv'^\trans \Lambda \resv' = \sumn {\res'_n}^2 \Lambda_{nn} = \sum_{n=1}^{N-P} {\res'_n}^2 = \sigma^2 \sum_{n=1}^{N-P} \left(\frac{\res'_n}{\sigma}\right)^2, \]

where without loss of generality we take the non-zero eigenvalues to be listed first, and recall that there are \(N - P\) of them, since the space perpendicular to the columns of \(\X\) has dimension \(N - P\).

By recognizing the definition of a chi-squared random variable, we can define an estimator of the residual variance,

\[ \sigmahat^2 = \frac{1}{N - P} \resvhat^\trans \resvhat \sim \frac{\sigma^2}{N - P} \s, \quad\textrm{where}\quad \s \sim \chisq{N-P}. \]

This distribution is true in finite sample — it is not asymptotic. But this beautiful result is only possible because of the normal assumption, which may not be very reasonable in practice.
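
A simulation sketch (illustrative settings, not from the notes) agrees with this finite-sample result: appropriately scaled, \(\sigmahat^2\) behaves like a \(\chisq{N-P}\) draw:

```r
# Sampling distribution of (N - P) * sigmahat^2 / sigma^2 under the normal assumptions.
set.seed(5)
N <- 30; P <- 3; sigma <- 1.5
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, 2, -1)
one_stat <- function() {
  Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
  sigmahat2 <- sum(resid(lm(Y ~ X - 1))^2) / (N - P)
  (N - P) * sigmahat2 / sigma^2
}
stats <- replicate(5000, one_stat())
c(mean = mean(stats), var = var(stats))  # about N - P = 27 and 2 * (N - P) = 54
quantile(stats, c(0.25, 0.5, 0.75))      # simulated quantiles...
qchisq(c(0.25, 0.5, 0.75), df = N - P)   # ...versus chi-squared quantiles
```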

From standard properties of the chi-squared distribution, some immediate results follow.

Variance

As expected, the variance of our estimator around the truth goes to zero as \(N \rightarrow \infty\), since

\[ \var{\sigmahat^2 - \sigma^2} = \sigma^4 \var{\frac{\s}{N - P} - 1} = \frac{\sigma^4}{(N - P)^2} \var{\s} = \frac{2(N - P)}{(N - P)^2}\sigma^4 \rightarrow 0. \]

This is essentially the chi-squared version of the proof we already gave using the LLN.

Bias

First, we can see that

\[ \expect{\sigmahat^2} = \sigma^2 \frac{N-P}{N - P} = \sigma^2. \]

In other words, \(\sigmahat^2\) is unbiased. Normalizing by \(N - P\) instead of \(N\), as we might naively have done, results in an unbiased estimator.
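
A quick simulation (illustrative, not from the notes) makes the contrast concrete: normalizing by \(N - P\) is unbiased, while normalizing by \(N\) underestimates \(\sigma^2\) by the factor \((N-P)/N\):

```r
# Bias of the two normalizations of the residual sum of squares.
set.seed(6)
N <- 20; P <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, -1, 0.5)
one_fit <- function() {
  Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
  rss <- sum(resid(lm(Y ~ X - 1))^2)
  c(unbiased = rss / (N - P), naive = rss / N)
}
rowMeans(replicate(10000, one_fit()))  # about sigma^2 = 4 and ((N - P) / N) * 4
```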

Independence

Recall that we’ve shown that

\[ \betavhat = \betav + (\X^\trans \X)^{-1} \X^\trans \resv \quad\textrm{and}\quad \resvhat = \proj{\X^\perp} \resv. \]

As a consequence, we have shown that both are multivariate normal. Furthermore,

\[ \cov{\betavhat, \resvhat} = \cov{(\X^\trans \X)^{-1} \X^\trans \resv, \proj{\X^\perp} \resv} = \expect{(\X^\trans \X)^{-1} \X^\trans \resv \resv^\trans \proj{\X^\perp} } = \sigma^2 (\X^\trans \X)^{-1} \X^\trans \proj{\X^\perp} = \zerov, \]

since \(\proj{\X^\perp} \X = \zerov\). So it follows that not only are \(\betavhat\) and \(\resvhat\) normal, they are independent, since for normal random variables, uncorrelatedness implies independence.

Further, since \(\sigmahat^2\) is simply a deterministic function of \(\resvhat\), it follows that \(\betavhat\) and \(\sigmahat^2\) are independent! This will be useful shortly when we derive confidence intervals for \(\beta_p\) under the normal assumptions.
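
A simulation sketch of this independence (on simulated data, with illustrative settings): across repeated draws of the noise, the sample correlation between a coefficient estimate and \(\sigmahat^2\) is near zero:

```r
# Uncorrelatedness of betahat and sigmahat^2 across repeated noise draws.
set.seed(7)
N <- 30; P <- 3; sigma <- 1
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, 0, 2)
one_draw <- function() {
  Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
  fit <- lm(Y ~ X - 1)
  c(coef(fit)[[2]], sum(resid(fit)^2) / (N - P))  # (betahat_2, sigmahat^2)
}
draws <- t(replicate(5000, one_draw()))
cor(draws[, 1], draws[, 2])  # close to zero
```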

Bonus content: Alternative (simpler) derivation of the bias of the variance estimator

If all we want is \(\expect{\sigmahat^2}\), we can do a trick to avoid the chi squared distribution and eigendecomposition of the projection matrix.

In particular, \(\resvhat^\trans \resvhat\) is a scalar, and the trace of a \(1 \times 1\) matrix is just that scalar, so

\[ \resv^\trans \proj{\X^\perp} \resv = \trace{\resv^\trans \proj{\X^\perp} \resv} = \trace{\proj{\X^\perp} \resv \resv^\trans }, \]

where we have used the general property of the trace that \(\trace{AB} = \trace{BA}\). Combining the above results, we see that

\[ \begin{aligned} \expect{\resvhat^\trans \resvhat} ={}& \expect{\trace{\proj{\X^\perp} \resv \resv^\trans }} \\={}& \trace{\proj{\X^\perp} \expect{\resv \resv^\trans }} \\={}& \trace{\proj{\X^\perp} \sigma^2 \id } & \textrm{(by A2)} \\={}& \sigma^2 \trace{\proj{\X^\perp} } & \textrm{(by linearity of the trace)}. \end{aligned} \]

Now, since \(\proj{\X^\perp}\) is a projection matrix onto an \(N - P\) dimensional space, \(\trace{\proj{\X^\perp}} = N - P\), so we get

\[ \expect{\resvhat^\trans \resvhat} = (N - P) \sigma^2. \]
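
This trace identity is easy to confirm numerically (a one-off check on a simulated \(\X\)):

```r
# The trace of the projection matrix equals N - P.
set.seed(8)
N <- 25; P <- 4
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
P_perp <- diag(N) - X %*% solve(t(X) %*% X) %*% t(X)
sum(diag(P_perp))  # equals N - P = 21, up to floating point error
```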

Using the residual variance estimator in the coefficient confidence interval (accounting for the variability of the variance estimator)

Recall that we have shown that

\[ \sigmahat^2 := \frac{1}{N - P} \sumn \reshat_n^2 \sim \frac{\sigma^2}{N-P} \s \quad\textrm{where}\quad \s \sim \chisq{N-P}, \]

and that, furthermore, \(\sigmahat\) and \(\betahat\) are independent of one another.

Let’s try to use this fact to get a probability distribution for \(\betahat_p\) that takes the variability of \(\sigmahat\) into account, where we substitute \(\sigmahat\) in for \(\sigma\) in the pivot

\[ \frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}}. \]

Write

\[ \begin{aligned} \frac{\betahat_p - \beta_p}{\sigmahat \sqrt{\v_p}} ={}& \frac{\betahat_p - \beta_p}{\sqrt{\frac{\sigma^2}{N-P} \s} \sqrt{\v_p}} \\={}& \left(\frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}}\right) \Big/ \sqrt{\frac{\s}{N-P} }. \end{aligned} \]

As we showed before, \(\frac{\betahat_p - \beta_p}{\sigma \sqrt{\v_p}} \sim \gauss{0,1}\). Furthermore, it is independent of \(\s\). The denominator is approximately \(1\) for large \(N - P\), but has some sampling variability. The ratio of a standard normal to the square root of an independent chi squared divided by its degrees of freedom happens to follow a known distribution.

Definition

Let \(\z \sim \gauss{0,1}\), and \(\s \sim \chisq{N-P}\), independently of one another. Then the distribution of \[ \t := \frac{\z}{\sqrt{\s / (N-P)}} \]

is called a “student-t distribution with \(N-P\) degrees of freedom.” We write \(\t \sim \studentt{N - P}\). As long as \(N - P > 2\), \(\expect{\t} = 0\) and \(\var{\t} = (N - P) / (N - P - 2)\).
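
The following simulation (illustrative only) matches this definition against R's built-in qt() quantiles:

```r
# A standard normal divided by sqrt(chi-squared / df) behaves like a student-t draw.
set.seed(9)
df <- 7
t_draws <- rnorm(100000) / sqrt(rchisq(100000, df = df) / df)
c(mean = mean(t_draws), var = var(t_draws), theory_var = df / (df - 2))
quantile(t_draws, c(0.05, 0.95))  # simulated quantiles...
qt(c(0.05, 0.95), df = df)        # ...versus student-t quantiles
```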

Exercise

Let \(\t \sim \studentt{K}\) with \(K > 2\).

  • Prove that \(\expect{\t} = 0\).
  • Without using the above explicit formula, prove that \(\var{\t} > 1\).
    (Hint: use the fact that, for any positive non-constant random variable, \(\RV{x}\), \(\expect{1 / \RV{x}} > 1 / \expect{\RV{x}}\), by Jensen’s inequality.)

You can find quantiles of the student-t distribution using the R function qt(), just as you would use qnorm() for the normal distribution.
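
For example (with an illustrative degrees-of-freedom value, not from the notes):

```r
# Quantile for a 95% two-sided interval with N - P = 27 degrees of freedom.
qt(0.975, df = 27)   # about 2.05, slightly larger than qnorm(0.975) = 1.96
```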

Given all this, we have shown that

\[ \prob{\frac{\betahat_p - \beta_p}{\sigmahat \sqrt{\v_p}} \le \z_\alpha} = \prob{\t \le \z_\alpha} \quad\textrm{where}\quad \t \sim \studentt{N - P}. \]

Using this formula, we can find exact confidence intervals for the coefficients under normality. Specifically, if we choose \(\z_\alpha\) so that \(\prob{\t \le -\z_\alpha} = \alpha / 2\) for the student-t distribution rather than the normal distribution, then the confidence interval

\[ \betahat_p \pm \z_\alpha \sigmahat \sqrt{\v_p} \]

is a valid two–sided confidence interval for \(\beta_p\), despite the fact that we used \(\sigmahat\) instead of the true \(\sigma\). The only difference is that we replaced the normal quantile, \(q_\alpha\), with the (slightly larger) student-t quantile, \(\z_\alpha\).
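
Putting the pieces together, here is a sketch on simulated data (illustrative settings throughout). R's confint() applied to an lm() fit computes exactly these student-t intervals, so it provides a check on the hand-rolled formula:

```r
# Student-t confidence interval for one coefficient, by hand and via confint().
set.seed(10)
N <- 40; P <- 3; sigma <- 1.5
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(2, -1, 0.5)
Y <- as.vector(X %*% beta + rnorm(N, sd = sigma))
fit <- lm(Y ~ X - 1)
p <- 2
betahat_p <- coef(fit)[p]
v_p <- solve(t(X) %*% X)[p, p]
sigmahat <- sqrt(sum(resid(fit)^2) / (N - P))
alpha <- 0.05
z_alpha <- qt(1 - alpha / 2, df = N - P)
betahat_p + c(-1, 1) * z_alpha * sigmahat * sqrt(v_p)
confint(fit, level = 0.95)[p, ]  # matches the interval above
```

Using the normal quantile qnorm(0.975) in place of qt(0.975, df = N - P) would give a slightly narrower interval that under-covers in small samples.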

Exercise

Derive student-t intervals for \(\av^\trans \betav\), for a generic vector \(\av\).