$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Consistency of OLS and the residual variance under the Gaussian assumptions


Goals

  • Recall the prediction problem
  • Under the Gaussian assumption, prove that the error in the OLS estimate goes to zero
  • Under the Gaussian assumption, prove that the error in the residual variance estimate goes to zero
  • Use these facts to provide a consistent prediction interval

Gaussian assumption

Recall that we want to form predictions for \(\y_\new\) using only \(\x_\new\) and the OLS predictor \(\betahat = (\X^\trans \X)^{-1} \X^\trans \Y\). We’re using the assumptions

  • A1: The \(\y_n\) (including \(\y_\new\)) are actually generated according to \(\y_n = \xv_n^\trans \beta + \res_n\) for some unknown \(\beta\). The \(\xv_n\) are not random.
  • A2: The \(\res_n\) (including \(\res_\new\)) are IID \(\gauss{0, \sigma^2}\) for some (unknown) \(\sigma\)

We’ll additionally need the following assumption:

  • A3: \(\frac{1}{N} \X^\trans \X = \meann \xv_n \xv_n^\trans\) converges to a non-singular matrix, \(\Xcov\). (Note that the \(\xv_n\) are not random, so this is deterministic convergence.) Also, \(\X^\trans\X\) is nonsingular for all \(N\) (so that \(\betahat\) is well-defined).

For now, we’re only interested in how much error we’ll commit when we keep the training data fixed. That is, we’re interested in the distribution of

\[ \begin{aligned} \y_\new ={}& \xv_\new^\trans \beta + \res_\new &\textrm{(by A1)} \\ \sim{}& \gauss{\xv_\new^\trans \beta, \sigma^2}. &\textrm{(by A2, and properties of normals)} \\ \end{aligned} \]

The trouble is, we don’t know \(\beta\) or \(\sigma\). One of our main goals today will be to show that, as \(N \rightarrow \infty\),

\[ \begin{aligned} \betahat \rightarrow{}& \beta \quad\textrm{and}\\ \sigmahat^2 := \meann \reshat_n^2 \rightarrow{}& \sigma^2. \end{aligned} \]

That means that, for large enough \(N\), we may as well use \(\betahat\) and \(\sigmahat\) in place of \(\beta\) and \(\sigma\).
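
To fix ideas, here is a minimal simulation sketch in Python of the setting described by A1-A3: a frozen design matrix, IID Gaussian noise, and the OLS and residual variance estimators defined above. The sizes, true parameter values, and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes and true parameters (not from the notes).
N, P = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5

# A fixed (non-random) design; here simply a single frozen draw.
X = rng.normal(size=(N, P))

# A1 + A2: y_n = x_n' beta + eps_n with IID Gaussian noise.
eps = rng.normal(scale=sigma_true, size=N)
y = X @ beta_true + eps

# OLS estimate and residual variance estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)

print(beta_hat)    # close to beta_true for large N
print(sigma2_hat)  # close to sigma_true ** 2
```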

Gaussian intervals

But first, let’s talk about what we’d do if we did know \(\beta\) and \(\sigma\). One way of expressing uncertainty is an interval, \((\y_{low}, \y_{up})\), such that \[ \prob{\y_\new \in (\y_{low}, \y_{up})} = 1 - \alpha \]

for some small \(\alpha\).

For the moment, suppose that we know \(\beta\) and \(\sigma\).
How can we find \(\y_{low}\) and \(\y_{up}\)? Let’s do it by building up an expression for \(\y_\new\) in terms of the standard normal distribution, whose properties we are able to calculate.

Recall that, if \(\RV{\z}\) follows a standard normal distribution, \(\RV{\z} \sim \gauss{0,1}\), then we can compute the probability \(\prob{\RV{\z} \le \z_0}\) for any \(\z_0\). This probability is usually denoted by \(\Phi(\cdot)\):

\[ \Phi(\z_0) := \prob{\RV{\z} \le \z_0}. \]

Furthermore, we can invert \(\Phi(\cdot)\) to find a \(\z_0\) such that \(\Phi(\z_0)\) is equal to some value:

\[ \prob{\RV{\z} \le \Phi^{-1}(a)} = a. \]
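
As a concrete aside, \(\Phi\) and \(\Phi^{-1}\) are available numerically; for example, scipy.stats.norm exposes them as norm.cdf and norm.ppf (assuming scipy is installed):

```python
from scipy.stats import norm

# Phi(z0): the probability that a standard normal is <= z0.
print(norm.cdf(1.96))            # approximately 0.975

# Phi^{-1}(a): the z0 such that Phi(z0) = a.
print(norm.ppf(0.975))           # approximately 1.96

# Round trip: Phi(Phi^{-1}(a)) = a.
print(norm.cdf(norm.ppf(0.9)))   # 0.9
```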

Exercise

Prove that \(\Phi(\cdot)\) has the following properties:

  • \(\Phi(-\infty) = 0\)
  • \(\Phi(\infty) = 1\)
  • \(\Phi(-\z) = 1 - \Phi(\z)\)
  • \(\Phi(\cdot)\) is increasing
  • \(\Phi(\cdot)\) is continuous
  • \(\Phi(\cdot)\) is invertible

Suppose we take \(\z_\alpha := \Phi^{-1}(1 - \alpha / 2)\), so that \(\prob{\RV{\z} \le \z_\alpha} = 1 - \alpha / 2\) for \(\alpha < 1\). Note that \(\z_\alpha > 0\) (why?). Then

\[ \begin{aligned} \prob{-\z_\alpha \le \RV{\z} \le \z_\alpha} ={}& 1 - \prob{\RV{\z} \le -\z_\alpha \textrm{ or } \RV{\z} \ge \z_\alpha} \\={}& 1 - \left( \prob{\RV{\z} \le -\z_\alpha} + \prob{\RV{\z} \ge \z_\alpha} \right) &\textrm{(the regions are disjoint)} \\={}& 1 - 2 \prob{\RV{\z} \ge \z_\alpha} &\textrm{(symmetry of $\RV{\z}$)} \\={}& 1 - 2 \left(1 - \Phi(\z_\alpha) \right) &\textrm{(definition of $\Phi$)} \\={}& 1 - 2 \frac{\alpha}{2} &\textrm{(definition of $\z_\alpha$)} \\={}& 1 - \alpha. \end{aligned} \]
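
A quick numerical check of this identity, using the same scipy functions as above and an arbitrary \(\alpha\):

```python
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.ppf(1 - alpha / 2)                   # about 1.96
two_sided = norm.cdf(z_alpha) - norm.cdf(-z_alpha)  # P(-z_alpha <= Z <= z_alpha)
print(z_alpha, two_sided)                           # two_sided equals 1 - alpha = 0.95
```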

Now we just need to transform the preceding expression, which is about a standard normal random variable, into an expression about \(\y_\new\). By standard transformations of normal random variables, \(\y_\new\) has the same distribution as \(\beta^\trans \xv_\new + \sigma \RV{\z}\). So we can write

\[ \begin{aligned} 1- \alpha ={}& \prob{-\z_\alpha \le \RV{\z} \le \z_\alpha} \\={}& \prob{\beta^\trans \xv_\new-\sigma \z_\alpha \le \beta^\trans \xv_\new + \sigma \RV{\z} \le \beta^\trans \xv_\new + \sigma \z_\alpha} \\={}& \prob{\beta^\trans \xv_\new-\sigma \z_\alpha \le \y_\new \le \beta^\trans \xv_\new + \sigma \z_\alpha}. \end{aligned} \]

We have thus found an interval for \(\y_\new\):

\[ \y_{low} = \beta^\trans \xv_\new - \sigma \z_\alpha \quad\textrm{and}\quad \y_{up} = \beta^\trans \xv_\new + \sigma \z_\alpha \quad\Rightarrow\quad \prob{\y_{low} \le \y_\new \le \y_{up}} = 1- \alpha. \]

In practice, we replace \(\sigma\) with \(\sigmahat\) and \(\beta\) with \(\betahat\), giving

\[ \hat\y_{low} = \betahat^\trans \xv_\new - \sigmahat \z_\alpha \quad\textrm{and}\quad \hat\y_{up} = \betahat^\trans \xv_\new + \sigmahat \z_\alpha. \]
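
A minimal sketch of the plug-in interval, reusing the illustrative simulated setup from earlier (all sizes, parameter values, and the new regressor \(\xv_\new\) below are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative training data satisfying A1-A3.
N, P = 500, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5
X = rng.normal(size=(N, P))
y = X @ beta_true + rng.normal(scale=sigma_true, size=N)

# Fit OLS and estimate the residual standard deviation.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma_hat = np.sqrt(np.mean((y - X @ beta_hat) ** 2))

# Plug-in (1 - alpha) prediction interval at a new regressor.
alpha = 0.05
z_alpha = norm.ppf(1 - alpha / 2)
x_new = np.array([0.3, -1.0, 2.0])
y_low = x_new @ beta_hat - sigma_hat * z_alpha
y_up = x_new @ beta_hat + sigma_hat * z_alpha
print(y_low, y_up)
```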

In general, \(\prob{\hat\y_{low} \le \y_\new \le \hat\y_{up}} \ne 1 - \alpha\). It may be higher or lower. In fact, the probability is random, as it depends on \(\betahat\) and \(\sigmahat\), which depend on the (random) training data. But the probability does approach \(1- \alpha\) as \(N \rightarrow \infty\) as long as \(\betahat \rightarrow \beta\) and \(\sigmahat \rightarrow \sigma\). So we now turn to showing that.

Exercise

Show that, if \(\betahat \rightarrow \beta\) and \(\sigmahat \rightarrow \sigma\), then

\[ \prob{\hat\y_{low} \le \y_\new \le \hat\y_{up}} \rightarrow 1 - \alpha \]

by the continuous mapping theorem applied to \(\Phi(\cdot)\).
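
A Monte Carlo sketch of this claim: for each simulated training set, the conditional coverage \(\prob{\hat\y_{low} \le \y_\new \le \hat\y_{up}}\) can be computed exactly as \(\Phi\left((\hat\y_{up} - \xv_\new^\trans\beta)/\sigma\right) - \Phi\left((\hat\y_{low} - \xv_\new^\trans\beta)/\sigma\right)\), and its average should move toward \(1 - \alpha\) as \(N\) grows. Everything below (sizes, parameters, number of replications) is illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5
x_new = np.array([0.3, -1.0, 2.0])
alpha = 0.05
z_alpha = norm.ppf(1 - alpha / 2)

def average_conditional_coverage(N, n_reps=200):
    # Freeze a design of size N, then redraw the noise n_reps times.
    X = rng.normal(size=(N, len(beta_true)))
    coverages = []
    for _ in range(n_reps):
        y = X @ beta_true + rng.normal(scale=sigma_true, size=N)
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        sigma_hat = np.sqrt(np.mean((y - X @ beta_hat) ** 2))
        center = x_new @ beta_hat
        lo, up = center - sigma_hat * z_alpha, center + sigma_hat * z_alpha
        mu_new = x_new @ beta_true   # true mean of y_new
        coverages.append(norm.cdf((up - mu_new) / sigma_true)
                         - norm.cdf((lo - mu_new) / sigma_true))
    return np.mean(coverages)

for N in [10, 50, 500, 5000]:
    print(N, average_conditional_coverage(N))   # approaches 1 - alpha = 0.95
```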

Coefficient consistency

Method 1 for coefficient consistency: Use normality

Plugging A1 into the OLS estimator, we get \[ \betahat = (\X^\trans \X)^{-1} \X^\trans \Y = (\X^\trans \X)^{-1} \X^\trans (\X \beta + \resv) = \beta + (\X^\trans \X)^{-1} \X^\trans \resv. \]

In this expression, only \(\resv\) is random! In fact, since \(\resv\) is Gaussian, and \(\betahat\) is an affine transformation of \(\resv\), then \(\betahat\) itself is Gaussian — albeit with an unknown mean and variance. Noting that

\[ \begin{aligned} \expect{\betahat} ={}& \beta + (\X^\trans \X)^{-1} \X^\trans \expect{\resv} = \beta \\ \cov{\betahat} ={}& \expect{(\betahat - \beta) (\betahat - \beta)^\trans} \\ ={}& \expect{(\X^\trans \X)^{-1} \X^\trans \resv \resv^\trans \X (\X^\trans \X)^{-1} } \\ ={}& (\X^\trans \X)^{-1} \X^\trans \expect{\resv \resv^\trans} \X (\X^\trans \X)^{-1} \\ ={}& (\X^\trans \X)^{-1} \X^\trans \sigma^2 \id{} \X (\X^\trans \X)^{-1} \\ ={}& \sigma^2 (\X^\trans \X)^{-1} \X^\trans \X (\X^\trans \X)^{-1} \\ ={}& \sigma^2 (\X^\trans \X)^{-1}, \end{aligned} \]

we get the beautiful expression \[ \betahat \sim \gauss{\beta, \sigma^2 (\X^\trans \X)^{-1}}. \]

We still need to understand how \((\X^\trans \X)^{-1}\) behaves. For this, we have assumption A3, which we apply to get \[ \betahat \sim \gauss{\beta, \frac{1}{N} \sigma^2 \left(\frac{1}{N} \X^\trans \X \right)^{-1}} \approx \gauss{\beta, \frac{1}{N} \sigma^2 \Sigmam_\X^{-1}}. \]

It follows that the variance of \(\betahat\) goes to zero as \(N \rightarrow \infty\), and so it concentrates at its mean, \(\beta\).
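
A simulation sketch of this sampling distribution: holding a single (illustrative) design fixed and redrawing only the noise, the empirical covariance of \(\betahat\) across replications should match \(\sigma^2 (\X^\trans \X)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 3
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5
X = rng.normal(size=(N, P))   # frozen design

# Redraw only the Gaussian noise (A2) and refit OLS each time.
beta_hats = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta_true +
                                    rng.normal(scale=sigma_true, size=N)))
    for _ in range(5000)
])

empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma_true ** 2 * np.linalg.inv(X.T @ X)
print(np.round(empirical_cov, 5))
print(np.round(theoretical_cov, 5))   # the two matrices should be close
```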

Exercise

How would the OLS solution behave if A3 were violated? Note that there are two parts: positive definiteness (equivalently, non-singularity) of the limit, and existence of the limit when scaling \(\X^\trans \X\) by \(1/N\).

Consider two examples:

  • Let \(\zv_n \sim \gauss{\zerov, \id}\) denote IID standard Gaussian \(P\)-vectors, and take \(\xv_n = \zv_n / n^2\). Show that A3 is violated. What happens to \(\betahat\) as \(N \rightarrow \infty\)?
  • Let \(\xv_n = \onev\) for all \(n\). Show that A3 is violated. What is \(\betahat\)?

Method 2 for coefficient consistency: Use the LLN

The previous proof relied heavily on normality. However, we can get a similar result using only the LLN. Write:

\[ \begin{aligned} \betahat - \beta ={}& \left(\X^\trans \X\right)^{-1} \X^\trans \resv \\={}& \left(\frac{1}{N} \X^\trans \X\right)^{-1} \frac{1}{N} \X^\trans \resv. \end{aligned} \]

We’ve already assumed that \(\frac{1}{N} \X^\trans \X \rightarrow \Xcov\), and since matrix inversion is continuous at non-singular matrices,

\[ \left(\frac{1}{N} \X^\trans \X \right)^{-1} \rightarrow \Sigmam_\X^{-1}. \]

What about the second term? This is something to which we can apply the vector LLN. If we write \(\zv_n := \xv_n \res_n\), then \(\expect{\zv_n} = \zerov\) and \(\cov{\zv_n} = \sigma^2 \xv_n \xv_n^\trans\). If we assume that \(\xv_n \xv_n^\trans\) is bounded for all \(n\), then we can apply our LLN to get

\[ \frac{1}{N} \X^\trans \resv = \meann \xv_n \res_n \rightarrow \expect{\xv_1 \res_1} = \xv_1 \expect{\res_1} = \zerov. \]

From this it follows that

\[ \betahat \rightarrow \beta \quad\textrm{as }N \rightarrow \infty \]

by the LLN alone.
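
A numerical sketch of this argument: with Gaussian noise and an (illustrative) frozen design of growing size, both \(\frac{1}{N}\X^\trans\resv\) and the resulting error \(\betahat - \beta\) shrink toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5

for N in [100, 1000, 10000, 100000]:
    X = rng.normal(size=(N, len(beta_true)))
    eps = rng.normal(scale=sigma_true, size=N)
    y = X @ beta_true + eps
    score = X.T @ eps / N                           # (1/N) X' eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(N, np.linalg.norm(score), np.linalg.norm(beta_hat - beta_true))
```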

Exercise

In fact, we do not require that \(\xv_n \xv_n^\trans\) be bounded for all \(n\); all we really need is that \(\max_{n\in\{1, \ldots, N\},\, p} \frac{1}{N} \x_{np}^2 \rightarrow 0\) as \(N \rightarrow \infty\). Refer back to our proof of the LLN and show that (a) this condition suffices for the LLN and (b) it is implied by our “nice regressor” assumption.

Method 3 for coefficient consistency: Plug into the general formula

At the beginning, we derived the (initially disappointing) general formula

\[ \betahat \rightarrow \Xcov^{-1} \expect{\xv_1 \y_1}. \]

If we revert to letting \(\xv_n\) be random, then we can simply plug in to get

\[ \begin{aligned} \betahat \rightarrow{}& \Xcov^{-1} \expect{\xv_1 (\xv_1^\trans \beta + \res_1)} \\={}& \Xcov^{-1} \expect{\xv_1 \xv_1^\trans} \beta + \Xcov^{-1} \expect{\xv_1 \res_1} \\={}& \Xcov^{-1} \Xcov \beta + \Xcov^{-1} \expect{\xv_1 \expect{\res_1 | \xv_1}} \\={}& \beta + \Xcov^{-1} \expect{\xv_1 0} \\={}& \beta. \end{aligned} \]
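
The same plug-in computation can be checked by simulation when the \(\xv_n\) are random: the empirical versions of \(\Xcov\) and \(\expect{\xv_1 \y_1}\) recover \(\beta\). The design distribution below is an arbitrary choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100000
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5

# Random regressors this time, so Sigma_X = E[x_1 x_1'] (the identity here).
X = rng.normal(size=(N, len(beta_true)))
y = X @ beta_true + rng.normal(scale=sigma_true, size=N)

Sigma_X_hat = X.T @ X / N   # estimates E[x_1 x_1']
xy_hat = X.T @ y / N        # estimates E[x_1 y_1]
print(np.linalg.solve(Sigma_X_hat, xy_hat))   # close to beta_true
```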

Though superficially different, all three of these methods relied on the same facts behind the scenes — namely, that \(\expect{\betahat} = \beta\), and \(\cov{\betahat} \rightarrow \zerov\) as \(N \rightarrow \infty\).

Estimating the residual variance

Method 1: Apply LLN to the squared residuals

How can we estimate \(\sigma^2\)? First, note that, if we observed the residuals \(\res_n\) (which we don’t), we could use the estimator \(\meann \res_n^2\), since

\[ \meann \res_n^2 \rightarrow \sigma^2 \]

by the LLN. Of course, we don’t observe \(\res_n\), since we don’t know \(\beta\), and \(\res_n = \y_n - \xv_n^\trans\beta\). However, we can estimate \(\beta\) with \(\betahat\), giving

\[ \reshat_n := \y_n - \xv_n^\trans \betahat, \]

and estimate

\[ \sigmahat^2 := \meann \reshat_n^2. \]

Does this work? Do we commit too large an error? It turns out, no. We can write

\[ \begin{aligned} \meann \reshat_n^2 ={}& \meann (\y_n - \xv_n^\trans \betahat)^2 \\={}& \meann (\res_n + \xv_n^\trans \beta - \xv_n^\trans \betahat)^2 & \textrm{(by A1)} \\={}& \meann (\res_n + \xv_n^\trans (\beta - \betahat))^2 \\={}& \meann \res_n^2 + 2 (\beta - \betahat)^\trans \meann \xv_n \varepsilon_n + (\beta - \betahat)^\trans \meann \xv_n \xv_n^\trans (\beta - \betahat). \end{aligned} \]

By the LLN, \(\meann \xv_n \res_n \rightarrow \zerov\), and by A3, \(\meann \xv_n \xv_n^\trans \rightarrow \Xcov\), which is bounded. We showed above that \(\beta - \betahat \rightarrow \zerov\), and \(\meann \res_n^2 \rightarrow \sigma^2\), again by the LLN. It follows that

\[ \sigmahat^2 \rightarrow \sigma^2, \]

as hoped.
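
A final simulation sketch of this decomposition: as \(N\) grows, the cross term and the quadratic term vanish, so \(\sigmahat^2\) tracks the infeasible average \(\meann \res_n^2\). All sizes and parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 1.5

for N in [100, 1000, 10000, 100000]:
    X = rng.normal(size=(N, len(beta_true)))
    eps = rng.normal(scale=sigma_true, size=N)
    y = X @ beta_true + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    delta = beta_true - beta_hat

    oracle = np.mean(eps ** 2)              # (1/N) sum_n eps_n^2
    cross = 2 * delta @ (X.T @ eps / N)     # cross term in the decomposition
    quad = delta @ (X.T @ X / N) @ delta    # quadratic term
    sigma2_hat = np.mean((y - X @ beta_hat) ** 2)

    # sigma2_hat equals oracle + cross + quad exactly; the last two shrink.
    print(N, sigma2_hat, oracle, cross, quad)
```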