$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Sampling variability of the coefficients


Goals

  • Incorporate training set randomness into our predictions
    • Show the training set variability with real data
    • Use the normal assumptions to derive the distribution under training sampling
    • Transform to a pivotal statistic to get intervals that take training variability into account

Setup

Up to now, we’ve been studying the predictive interval for \(\y_\new\) when the training data is fixed. Under our normal assumptions, we’ve shown that

\[ \y_\new - \yhat_\new = \xv_\new^\trans(\beta - \betahat) + \res_\new \sim \gauss{\xv_\new^\trans(\beta - \betahat), \sigma^2}. \]

We then argued that, for large \(N\), \(\betahat \approx \beta\) and \(\sigmahat \approx \sigma\), and plugged these approximations in to find an interval

\[ I_\new = \xv_\new^\trans \betahat \pm \z_\alpha \sigmahat \]

such that \(\prob{\y_\new \in I_\new} \approx 1 - \alpha\), where we choose \(\z_\alpha\) using the normal distribution and error rate \(\alpha\).
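For concreteness, here is a minimal R sketch of this plug-in interval on simulated data. The sample size, coefficients, noise level, and new point below are arbitrary illustration choices, not part of the derivation.

    # Simulate a training set and compute the plug-in interval.
    set.seed(1)
    N <- 200
    x1 <- rnorm(N)
    x2 <- rnorm(N)
    X <- cbind(1, x1, x2)                        # design matrix with intercept
    beta <- c(1, -2, 0.5)
    sigma <- 1.5
    y <- drop(X %*% beta) + rnorm(N, sd = sigma)
    fit <- lm(y ~ x1 + x2)

    x_new <- c(1, 0.3, -1.2)                     # includes the intercept entry
    y_hat_new <- sum(x_new * coef(fit))
    sigma_hat <- summary(fit)$sigma              # RSS normalized by N - P
    alpha <- 0.05
    z_alpha <- qnorm(1 - alpha / 2)
    y_hat_new + c(-1, 1) * z_alpha * sigma_hat   # the plug-in interval I_new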

Here, since we’re treating our training set as fixed, it would be more precise to write

\[ \y_\new - \yhat_\new \vert \X,\Y \sim \gauss{\xv_\new^\trans(\beta - \betahat), \sigma^2}, \]

since we’re conditioning on \(\X\) and \(\Y\).

Note that if we were to draw a new value for \(\Y\), we would get a new value for \(\betahat\) and \(\sigmahat\). Correspondingly, we would get a new interval, \(I_\new\). If we have drawn our training set randomly, then we might want to take the variability of the interval itself into account.
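To see this sampling variability concretely, we can redraw \(\Y\) from the same \(\X\) in the simulated example above and refit; each redraw gives a new \(\betahat\), \(\sigmahat\), and plug-in interval. This is a sketch continuing the simulated quantities defined above.

    # Redraw y from the same design several times and recompute the plug-in
    # interval (uses X, beta, sigma, x_new, and alpha from the sketch above).
    redraw_interval <- function() {
      y_redraw <- drop(X %*% beta) + rnorm(N, sd = sigma)
      refit <- lm(y_redraw ~ x1 + x2)
      sum(x_new * coef(refit)) +
        c(-1, 1) * qnorm(1 - alpha / 2) * summary(refit)$sigma
    }
    set.seed(2)
    t(replicate(5, redraw_interval()))   # five noticeably different intervals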

Coefficient variability in the prediction

Recall that we have already proven, under the normal assumption, that

\[ \betahat - \beta \sim \gauss{\zerov, \sigma^2 (\X^\trans \X)^{-1}}. \]

From this it follows that \[ \xv_\new^\trans (\betahat - \beta) \sim \gauss{0, \sigma^2 \xv_\new^\trans (\X^\trans \X)^{-1} \xv_\new}. \]

The randomness in the preceding expression comes entirely from the training residuals \(\resv\), which are independent of the new residual, \(\res_\new\). So the normal random variable \(\xv_\new^\trans(\betahat - \beta)\) and the normal random variable \(\res_\new\) are independent, and

\[ \y_\new - \yhat_\new \sim \gauss{0, \sigma^2 \left(\xv_\new^\trans (\X^\trans \X)^{-1} \xv_\new + 1\right)} \]

Contrast this expression, which is marginal over the training data, with the conditional version above.

  • Conditional on the training set, \(\y_\new - \yhat_\new\) has bias \(\xv_\new^\trans(\beta - \betahat)\). However, it is marginally unbiased.
  • The conditional variance is strictly smaller than the marginal variance.
  • Both the conditional bias and the extra marginal variance tend to be larger when \((\X^\trans \X)^{-1} \xv_\new\) is larger in magnitude.

For the last point, recall that \(\betahat - \beta = (\X^\trans \X)^{-1} \X^\trans \resv\), so that

\[ (\betahat - \beta)^\trans \xv_\new = \resv^\trans \X (\X^\trans \X)^{-1} \xv_\new. \]

Using the above expression, we can form a (wider) marginal interval using the same trick as before. Rather than conditioning on the training set and hoping the error is small, this interval should cover the prediction marginally over the variability in the training set.

To simplify our expressions, let’s define

\[ \v^2 := \xv_\new^\trans (\X^\trans \X)^{-1} \xv_\new + 1 \quad\textrm{so that}\quad \y_\new - \yhat_\new \sim \gauss{0, \v^2 \sigma^2}. \]
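In R, \(\v\) is easy to compute from the design matrix; this sketch continues the simulated example above.

    # The factor v from the simulated example: it is only slightly larger
    # than 1 here, and it shrinks toward 1 as N grows.
    XtX_inv <- solve(t(X) %*% X)
    v <- sqrt(drop(t(x_new) %*% XtX_inv %*% x_new) + 1)
    v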

Plugging in \(\sigmahat \approx \sigma\), we get

\[ \frac{\y_\new - \yhat_\new}{\v \sigmahat} \approx \frac{\y_\new - \yhat_\new}{\v \sigma} \sim \gauss{0,1}. \]

Using this expression, we can form intervals for \(\y_\new - \yhat_\new\) using quantiles of the standard normal. But doing so still requires us to plug in \(\sigmahat\) for the unknown \(\sigma\). How can we account for that variability?

Accounting for the variability in the residual variance estimate

Note to future instructor

Note: here, we are going to normalize \(\sigmahat\) with \(N-P\) rather than \(N\). It would be better to be clearer about this next time.

Before, we rearranged this expression to get a standard distribution by plugging in \(\sigmahat\) for \(\sigma\). But we now know that

\[ \sigmahat^2 = \frac{\sigma^2}{N-P} \s \quad\textrm{where}\quad \s \sim \chisq{N-P}, \]

and, furthermore, that \(\sigmahat^2\) is independent of \(\betahat\). Let's use these facts to get a probability distribution for \(\y_\new - \yhat_\new\) that takes the variability of \(\sigmahat\) into account. Write

\[ \begin{aligned} \frac{\y_\new - \yhat_\new}{\v \sigmahat} ={}& \frac{\y_\new - \yhat_\new}{\v \sqrt{\frac{\sigma^2}{N-P} \s}} \\={}& \left(\frac{\y_\new - \yhat_\new}{\v \sigma}\right) / \sqrt{\frac{\s}{N-P} }. \end{aligned} \]

As we showed before, \(\frac{\y_\new - \yhat_\new}{\v \sigma} \sim \gauss{0,1}\). Furthermore, it is independent of \(\s\). The denominator is approximately \(1\) for large \(N - P\), but has some sampling variability. The ratio of a standard normal to the square root of an independent chi squared divided by its degrees of freedom happens to be a known distribution.

Definition

Let \(\z \sim \gauss{0,1}\), and \(\s \sim \chisq{N-P}\), independently of one another. Then the distribution of \[ \t := \frac{\z}{\sqrt{\s / (N-P)}} \]

is called a "student-t distribution with \(N-P\) degrees of freedom." We write \(\t \sim \studentt{N - P}\). As long as \(N - P > 2\), \(\expect{\t} = 0\) and \(\var{\t} = (N - P) / (N - P - 2)\).
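As a quick sanity check, we can simulate this construction directly; the degrees of freedom below is an arbitrary choice standing in for \(N - P\).

    # Monte Carlo check of the student-t construction.
    set.seed(3)
    dof <- 10                        # stands in for N - P
    n_sims <- 100000
    z <- rnorm(n_sims)               # z ~ N(0, 1)
    s <- rchisq(n_sims, df = dof)    # s ~ chi squared with dof degrees of freedom
    t_draws <- z / sqrt(s / dof)
    var(t_draws)                     # close to dof / (dof - 2) = 1.25
    var(rt(n_sims, df = dof))        # draws from R's built-in student-t agree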

Exercise

Let \(\t \sim \studentt{K}\) with \(K > 2\).

  • Prove that \(\expect{\t} = 0\).
  • Without using the explicit formula above, prove that \(\var{\t} > 1\).
    (Hint: use the fact that, for any positive non-constant random variable, \(\RV{x}\), \(\expect{1 / \RV{x}} > 1 / \expect{\RV{x}}\), by Jensen’s inequality.)

You can find quantiles of the student-t distribution using the R function qt(), just as you would use qnorm() for the normal distribution.
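For example, at error rate \(\alpha = 0.05\), the student-t quantile is wider than the normal quantile, and the two converge as \(N - P\) grows:

    qnorm(0.975)          # about 1.96
    qt(0.975, df = 10)    # about 2.23
    qt(0.975, df = 1000)  # about 1.96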

Given all this, we have shown that

\[ \prob{\frac{\y_\new - \yhat_\new}{\v \sigmahat} \le \z_\alpha} = \prob{\t \le \z_\alpha} \quad\textrm{where}\quad \t \sim \studentt{N - P}. \]

Using this formula, we can find exact intervals for our marginal prediction error under normality.
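Putting the pieces together in the simulated example above, the exact interval simply replaces the normal quantile with a student-t quantile; it matches the prediction interval returned by R's built-in predict() for lm fits.

    # Exact marginal prediction interval using the student-t quantile
    # (continuing fit, x_new, v, sigma_hat, and alpha from above).
    t_alpha <- qt(1 - alpha / 2, df = N - length(coef(fit)))
    y_hat_new + c(-1, 1) * t_alpha * v * sigma_hat

    # R's built-in prediction interval agrees.
    predict(fit, newdata = data.frame(x1 = 0.3, x2 = -1.2),
            interval = "prediction", level = 1 - alpha)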