$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Sampling variability of the coefficients

\(\,\)

Goals

  • Derive the exact sampling distribution of the coefficients under normality
  • Review properties of the variance estimator
  • Introduce the student t distribution and some of its basic properties

Setup

Up to now, we’ve been studying prediction intervals. We’ll shortly be turning to inference on the regression coefficients themselves. But first, we will wrap up one remaining topic: the exact sampling distribution of the regression coefficients under normality.

For this lecture, we’ll return to the normal assumptions:

  • \(\y_n = \betav^\trans \xv_n + \res_n\) for some \(\betav\)
  • \(\res_n \sim \gauss{0, \sigma^2}\), IID
  • The regressors are non-stochastic, \(\X\) is full rank, and \(\frac{1}{N} \X^\trans \X \rightarrow \Xcov\) for positive definite \(\Xcov\).

Recall that we have already proven, under the normal assumption, that

\[ \betavhat \sim \gauss{\betav, \sigma^2 (\X^\trans \X)^{-1}}. \]
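To make this concrete, here is a minimal simulation sketch (with a made-up design matrix, coefficient vector, and noise scale) that draws \(\betavhat\) repeatedly under the normal model and compares its empirical covariance to \(\sigma^2 (\X^\trans \X)^{-1}\).

```r
# A minimal sketch with made-up X, beta, and sigma: draw beta-hat repeatedly
# under the normal model and compare the empirical and theoretical covariance.
set.seed(42)
N <- 50; P <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))  # fixed (non-stochastic) regressors
beta <- c(1, -0.5, 2)                                # "true" coefficients (hypothetical)

n_sims <- 5000
betahats <- t(replicate(n_sims, {
  y <- X %*% beta + rnorm(N, sd = sigma)
  as.vector(solve(t(X) %*% X, t(X) %*% y))           # OLS estimate
}))

cov(betahats)                # empirical covariance across simulations
sigma^2 * solve(t(X) %*% X)  # theoretical covariance
```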

Up until now we have formed predictive intervals for an unknown \(\y_\new\). Let’s instead try to form a random interval \(I\) such that some particular entry of \(\betav\) is in the interval with some given probability, i.e., for some \(\alpha \in (0,1)\),

\[ \prob{\beta_k \in I} = 1 - \alpha. \]

First we need to relate \(\beta_k\) to the distribution of \(\betav\). Note that, if we take \(\vv\) to be a vector with \(0\) everywhere except for a \(1\) in the \(k\)–th location, then \(\beta_k = \vv^\trans \betav\). For any vector \(\av\), we have that

\[ \av^\trans \betavhat \sim \gauss{\av^\trans\betav, \sigma^2 \av^\trans(\X^\trans \X)^{-1}\av}. \]

In particular,

\[ \betahat_k \sim \gauss{\beta_k, \sigma^2 (\X^\trans \X)^{-1}_{kk}}. \]

Here, \((\X^\trans \X)^{-1}_{kk}\) is the \(k\)–th diagonal entry of the matrix \((\X^\trans \X)^{-1}\). For compactness, let’s write \(\v_k := \sqrt{(\X^\trans \X)^{-1}_{kk}}\), so that \(\betahat_k \sim \gauss{\beta_k, \sigma^2 \v_k^2}\), and note that \(\v_k\) is a constant under the present assumptions.
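For instance, reusing the simulated \(\X\) from the sketch above, \(\v_k\) can be computed directly:

```r
# Sketch (reusing X from above): v_k is the square root of the k-th diagonal
# entry of (X^T X)^{-1}.
XtX_inv <- solve(t(X) %*% X)
k <- 2
v_k <- sqrt(XtX_inv[k, k])
v_k
```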

Exercise

Prove that, when \(\X^\trans \X\) is diagonal, then

\[ (\X^\trans \X)^{-1}_{kk} = \frac{1}{(\X^\trans\X)_{kk}}. \]

Then show that this equality does not hold in general when \(\X^\trans \X\) is not diagonal.

If we knew \(\sigma\), we could write as usual

\[ \begin{aligned} \frac{\betahat_k - \beta_k}{\sigma \v_k} \sim{}& \gauss{0, 1} \Rightarrow\\ \prob{-a \le \frac{\betahat_k - \beta_k}{\sigma \v_k} \le a} ={}& \Phi(a) - \Phi(-a) \\ ={}& 1 - \Phi(-a) - \Phi(-a) \\ ={}& 1 - 2 \Phi(-a) = 1 - \alpha \Rightarrow \\ \Phi(-a) ={}& \frac{\alpha}{2} \Rightarrow \\ a ={}& - \Phi^{-1}\left(\frac{\alpha}{2}\right) \Rightarrow \\ I ={}& (\betahat_k - \sigma \v_k a, \betahat_k + \sigma \v_k a). \end{aligned} \]
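In R, the critical value \(a\) comes from qnorm(); for example, with \(\alpha = 0.05\) (a sketch, assuming \(\sigma\) were known):

```r
# Sketch: the two-sided normal critical value for alpha = 0.05.
alpha <- 0.05
a <- -qnorm(alpha / 2)   # same as qnorm(1 - alpha / 2); about 1.96
a
# With betahat_k, a known sigma, and v_k in hand, the interval would be
# c(betahat_k - sigma * v_k * a, betahat_k + sigma * v_k * a)
```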

However, we don’t know \(\sigma^2\). We could appeal to the fact that \(\sigmahat \rightarrow \sigma\) and simply plug in the estimate:

\[ \hat{I} ={} (\betahat_k - \sigmahat \v_k a, \betahat_k + \sigmahat \v_k a). \]

Intuitively, we may want to account for the variability in \(\sigmahat\) as well. In the normal case, we can do that in a precise way.

Accounting for the variability of the variance estimator

Recall that we have shown that

\[ \sigmahat^2 := \frac{1}{N - P} \sumn \reshat_n^2 = \frac{\sigma^2}{N-P} \s \quad\textrm{where}\quad \s \sim \chisq{N-P}, \]

and that, furthermore, \(\sigmahat\) and \(\betahat\) are independent of one another.
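As a quick check of this fact, the following sketch (reusing the simulated design from above) verifies that \((N - P)\,\sigmahat^2 / \sigma^2\) has the mean and variance of a \(\chisq{N-P}\) random variable.

```r
# Sketch (reusing X, beta, sigma, N, P from above): the scaled residual sum of
# squares should behave like a chi-squared draw with N - P degrees of freedom.
s_draws <- replicate(n_sims, {
  y <- X %*% beta + rnorm(N, sd = sigma)
  betahat <- solve(t(X) %*% X, t(X) %*% y)
  sum((y - X %*% betahat)^2) / sigma^2
})
mean(s_draws)   # should be close to N - P = 47
var(s_draws)    # should be close to 2 * (N - P) = 94
```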

Bad notation warning

Note: here, we normalize \(\sigmahat^2\) by \(N-P\) rather than \(N\).

Let’s try to use this fact to get a probability distribution for \(\betahat_k\) that takes the variability of \(\sigmahat\) into account. Write

\[ \begin{aligned} \frac{\betahat_k - \beta_k}{\v_k \sigmahat} ={}& \frac{\betahat_k - \beta_k}{\v_k \sqrt{\frac{\sigma^2}{N-P} \s}} \\={}& \left(\frac{\betahat_k - \beta_k}{\v_k \sigma}\right) \Big/ \sqrt{\frac{\s}{N-P} }. \end{aligned} \]

As we showed before, \(\frac{\betahat_k - \beta_k}{\v_k \sigma} \sim \gauss{0,1}\). Furthermore, it is independent of \(\s\). The denominator is approximately \(1\) for large \(N - P\), but has some sampling variability. The ratio of a standard normal to the square root of an independent chi squared divided by its degrees of freedom happens to be a known distribution.

Definition

Let \(\z \sim \gauss{0,1}\), and \(\s \sim \chisq{N-P}\), independently of one another. Then the distribution of \[ \t := \frac{\z}{\sqrt{\s / (N-P)}} \]

is called a "student-t distribution with \(N-P\) degrees of freedom." We write \(\t \sim \studentt{N - P}\). As long as \(N - P > 2\), \(\expect{\t} = 0\) and \(\var{\t} = (N - P) / (N - P - 2)\).
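Here is a small sketch that builds student t draws directly from this definition (with an arbitrary choice of degrees of freedom) and checks them against R's built-in t distribution.

```r
# Sketch: construct t draws from the definition and compare with R's
# built-in t distribution (degrees of freedom chosen arbitrarily).
df <- 10
z <- rnorm(1e5)
s <- rchisq(1e5, df = df)
t_draws <- z / sqrt(s / df)

mean(t_draws)              # close to 0
var(t_draws)               # close to df / (df - 2) = 1.25
quantile(t_draws, 0.975)   # close to qt(0.975, df = df)
qt(0.975, df = df)
```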

Exercise

Let \(\t \sim \studentt{K}\) with \(K > 2\).

  • Prove that \(\expect{\t} = 0\).
  • Without using the above explicit formula, prove that \(\var{\t} > 1\).
    (Hint: use the fact that, for any positive non-constant random variable, \(\RV{x}\), \(\expect{1 / \RV{x}} > 1 / \expect{\RV{x}}\), by Jensen’s inequality.)

You can find quantiles of the student-t distributions using the R function qt(), just as you would use qnorm() for the normal distribution.
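For example, t quantiles are wider than the corresponding normal quantiles, and converge to them as the degrees of freedom grow:

```r
# t quantiles exceed the normal quantile, but converge as df grows.
qnorm(0.975)                    # about 1.96
qt(0.975, df = c(5, 30, 1000))  # about 2.57, 2.04, 1.96
```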

Given all this, we have shown that

\[ \prob{\frac{\betahat_k - \beta_k}{\v_k \sigmahat} \le \z_\alpha} = \prob{\t \le \z_\alpha} \quad\textrm{where}\quad \t \sim \studentt{N - P}. \]

Using this formula, we can find exact confidence intervals for \(\beta_k\) under normality.
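Putting the pieces together, the sketch below (reusing the simulated X, beta, sigma, N, and P from earlier, with a hypothetical choice of k) builds the exact t interval for \(\beta_k\) by hand and compares it with R's confint() on the same fit.

```r
# Sketch (reusing X, beta, sigma, N, P from above): an exact t interval for
# beta_k assembled by hand, compared against confint() on the same lm() fit.
set.seed(1)
y <- drop(X %*% beta + rnorm(N, sd = sigma))
fit <- lm(y ~ X - 1)                       # X already contains an intercept column

k <- 2
betahat_k <- coef(fit)[k]
sigmahat  <- sqrt(sum(resid(fit)^2) / (N - P))
v_k       <- sqrt(solve(t(X) %*% X)[k, k])

alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = N - P)
c(betahat_k - sigmahat * v_k * t_crit,
  betahat_k + sigmahat * v_k * t_crit)

confint(fit, level = 1 - alpha)[k, ]       # should match the interval above
```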

Exercise

Derive student t intervals for \(\av^\trans \betav\), for a generic vector \(\av\).