STAT151A Homework 3

Author

Your name here

1 Omitted variables in inference versus prediction

Suppose we are interested in student performance as measured by fourth–year GPA, \(y_n\), centered at \(0\) and scaled to have unit variance. Suppose we imagine two regressors:

  • \(x_n\): Student’s academic ability (in an abstract sense, maybe not precisely measurable or defined), and
  • \(z_n\): Student’s performance on a standardized test before admission.

Suppose that \(y_n = \beta x_n + \varepsilon_n\), where \(\varepsilon_n\) is mean zero, independent of \(x_n\), and with variance \(\sigma_\varepsilon^2\). We will also assume that \(\beta > 0\). This simplistic model is unrealistic, but we will use it for illustrative purposes, and assume for this problem that this model determines the “true” relationship between \(x_n\), \(z_n\), and \(y_n\).

Note that \(z_n\) effectively has a zero coefficient in the “true” model: GPA is causally determined entirely by “ability,” not by test score.

Assume that we have centered and standardized \(x_n\) and \(z_n\), so that

\[ \begin{aligned} \frac{1}{N} \sum_{n=1}^Nx_n =& \frac{1}{N} \sum_{n=1}^Nz_n = 0 \quad\textrm{and} \\ \frac{1}{N} \sum_{n=1}^Nx_n^2 =& \frac{1}{N} \sum_{n=1}^Nz_n^2 = 1. \end{aligned} \]

But large values of \(x_n\) and \(z_n\) tend to occur together, so that \(\frac{1}{N} \sum_{n=1}^Nx_n z_n = 0.9\). (You might think of \(x_n\) and \(z_n\) as having been drawn from correlated random variables, even though they are not random for the purpose of this problem.)

(a)

Given the true model, how will we change a student’s expected GPA if we:

  • Increase their test score \(z_n\) by helping them memorize the answers to the test?
  • Improve their academic ability \(x_n\) by teaching them better time management skills?

(b)

If the number of observations \(N\) is large, and we run the regression \(y_n \sim \gamma z_n\), what will the expected coefficient \(\mathbb{E}\left[\hat{\gamma}\right]\) be? (In the expectation, only \(y_n\) and \(\varepsilon_n\) are random; the regressors are fixed.)

How does this compare to the “true” influence of \(z_n\) on GPA? How does this compare with the “true” influence of \(x_n\) on GPA?

(c)

Are we able to do effective inference using the regression \(y_n \sim \gamma z_n\)? Explain why or why not in intuitive terms.

(d)

Let \(x_{*}\), \(z_{*}\), and \(y_{*}\) denote a new observation not part of the training set. We still have \(y_* = \beta x_* + \varepsilon_*\), but assume now that \(x_*\) and \(z_*\) are random, independent of \(\varepsilon_*\) and of the training data, and that their moments match the training set’s sample moments:

\[ \begin{aligned} \mathbb{E}\left[x_*\right] ={}& \mathbb{E}\left[z_*\right] = 0 \\ \mathrm{Var}\left(x_*\right) ={}& \mathrm{Var}\left(z_*\right) = 1 \\ \mathbb{E}\left[x_* z_*\right] ={}& 0.9. \end{aligned} \]

Evaluate \(\mathbb{E}\left[y_*\right]\) and \(\mathrm{Var}\left(y_*\right)\), where the randomness is over \(x_*\), \(z_*\), and \(\varepsilon_*\). Note that the answer will be in terms of \(\sigma_\varepsilon^2\), the variance of \(\varepsilon_*\).

(e)

Using the regression \(y_n \sim \gamma z_n\), form the prediction \(\hat{y}_{*} = z_{*} \hat{\gamma}\). Evaluate \(\mathbb{E}\left[\hat{y}_* \vert \boldsymbol{Y}\right]\) and \(\mathrm{Var}\left(\hat{y}_* \vert \boldsymbol{Y}\right)\). Again, the answer will be in terms of \(\sigma_\varepsilon^2\), the variance of \(\varepsilon_*\). Hint: \(\hat{\gamma}\) is fixed (not random) conditional on \(\boldsymbol{Y}\).

(f)

Using (d) and (e), evaluate the conditional correlation between the response and prediction:

\[ \textrm{Correlation}(y_*, \hat{y}_* \vert \boldsymbol{Y}) = \frac{\mathbb{E}\left[y_{*} \hat{y}_{*} \vert \boldsymbol{Y}\right]} {\sqrt{\mathrm{Var}\left(\hat{y}_{*} \vert \boldsymbol{Y}\right)} \sqrt{\mathrm{Var}\left(y_* | \boldsymbol{Y}\right)}}. \]

Hint: \(\varepsilon_*\), \(x_*\), and \(z_*\) are independent of the training data \(\boldsymbol{Y}\), so the conditioning doesn’t affect their expectations.

(g)

Suppose that \(\sigma_\varepsilon/ \beta\) is not too large, so the residual standard deviation is not too large relative to the effect of academic ability on GPA. Are we then able to do effective prediction using the regression \(y_n \sim \gamma z_n\), say, to identify which students are likely to have good GPAs using only test scores? Explain why or why not in intuitive terms.

Solutions

(a)

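We will not change their expected GPA by helping them memorize the answers to the test, since \(z_n\) has a zero coefficient in the true model. Improving their academic ability will increase their expected GPA, since \(\beta > 0\).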

(b)

\[ \begin{aligned} \mathbb{E}\left[\hat{\gamma}\right] ={}& \mathbb{E}\left[\frac{\sum_{n=1}^Ny_n z_n}{\sum_{n=1}^Nz_n^2}\right] \\={}& \frac{\sum_{n=1}^N\mathbb{E}\left[y_n\right] z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\sum_{n=1}^N\mathbb{E}\left[\beta x_n + \varepsilon_n\right] z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\sum_{n=1}^N\left( \beta x_n + \mathbb{E}\left[\varepsilon_n\right]\right) z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\sum_{n=1}^N\beta x_n z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\frac{1}{N} \sum_{n=1}^Nx_n z_n}{\frac{1}{N} \sum_{n=1}^Nz_n^2} \beta \\={}& 0.9 \beta \end{aligned} \]

The true influence of \(z_n\) on \(y_n\) is \(0\), so \(\mathbb{E}\left[\hat{\gamma}\right] = 0.9 \beta\) overestimates it. However, it is less than \(\beta\), the “true” effect of \(x_n\) on \(y_n\), because the sample correlation of \(z_n\) with \(x_n\) is less than one.
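The calculation in (b) can be checked numerically. The following is a minimal sketch, not part of the original solution: the helper names `make_regressors` and `expected_gamma_hat`, the sample size, and the value \(\beta = 2\) are all assumptions chosen for illustration. It builds fixed regressors whose sample moments match the problem statement exactly and evaluates \(\mathbb{E}\left[\hat{\gamma}\right]\) from the formula derived above.

```python
import numpy as np


def make_regressors(n, rho, seed=0):
    """Build fixed x, z with sample mean 0, sample variance 1, sample covariance rho."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    x = (x - x.mean()) / x.std()           # now (1/N) sum x_n = 0 and (1/N) sum x_n^2 = 1
    u = rng.normal(size=n)
    u = u - u.mean()
    u = u - (u @ x) / (x @ x) * x          # make u exactly orthogonal to x
    u = u / u.std()
    z = rho * x + np.sqrt(1 - rho**2) * u  # unit sample variance, sample covariance rho with x
    return x, z


def expected_gamma_hat(x, z, beta):
    """E[gamma_hat] when y_n = beta * x_n + eps_n is regressed on z alone (regressors fixed)."""
    return beta * (x @ z) / (z @ z)


x, z = make_regressors(n=1000, rho=0.9)
beta = 2.0  # hypothetical value of the true coefficient
print(np.mean(x * z))                  # sample covariance, approximately 0.9
print(expected_gamma_hat(x, z, beta))  # approximately 0.9 * beta = 1.8
```

The coefficient on \(z_n\) picks up \(0.9\) of the effect of the omitted \(x_n\), even though \(z_n\) itself has no effect in the true model.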

(c)

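No, we are not. The true ability is a confounder: it drives GPA and is correlated with the test score. The true effect of \(z_n\) on \(y_n\) is \(0\), but the regression instead measures a positive association, \(\mathbb{E}\left[\hat{\gamma}\right] = 0.9 \beta\).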

(d)

We have

\[ \mathbb{E}\left[y_*\right] ={} \mathbb{E}\left[\beta x_* + \varepsilon_*\right] = \beta \mathbb{E}\left[x_*\right] + \mathbb{E}\left[\varepsilon_*\right] = 0 \]

and

\[ \mathrm{Var}\left(y_*\right) ={} \mathrm{Var}\left(\beta x_* + \varepsilon_*\right) = \beta^2 \mathrm{Var}\left(x_*\right) + \mathrm{Var}\left(\varepsilon_*\right) = \beta^2 + \sigma_\varepsilon^2. \]

(e)

We have

\[ \mathbb{E}\left[\hat{y}_* \vert \boldsymbol{Y}\right] ={} \mathbb{E}\left[\hat{\gamma}z_* \vert \boldsymbol{Y}\right] = \hat{\gamma}\mathbb{E}\left[z_* \vert \boldsymbol{Y}\right] = 0 \]

and

\[ \mathrm{Var}\left(\hat{y}_* \vert \boldsymbol{Y}\right) ={} \mathrm{Var}\left(\hat{\gamma}z_* \vert \boldsymbol{Y}\right) = \hat{\gamma}^2 \mathrm{Var}\left(z_* \vert \boldsymbol{Y}\right) = \hat{\gamma}^2. \]

(f)

We only have to compute

\[ \begin{aligned} \mathbb{E}\left[\hat{y}_* y_* \vert \boldsymbol{Y}\right] ={}& \mathbb{E}\left[(\hat{\gamma}z_*) (\beta x_* + \varepsilon_*) \vert \boldsymbol{Y}\right] \\={}& \mathbb{E}\left[\hat{\gamma}\beta z_* x_* + \hat{\gamma}z_* \varepsilon_* \vert \boldsymbol{Y}\right] \\={}& \hat{\gamma}\beta \mathbb{E}\left[ z_* x_* \vert \boldsymbol{Y}\right] + \hat{\gamma}\mathbb{E}\left[z_* \varepsilon_* \vert \boldsymbol{Y}\right] \\={}& \hat{\gamma}\beta \mathbb{E}\left[ z_* x_*\right] + \hat{\gamma}\mathbb{E}\left[z_* \varepsilon_*\right] \\={}& 0.9 \hat{\gamma}\beta. \end{aligned} \]

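Plugging in, and taking \(\hat{\gamma}> 0\) so that \(\hat{\gamma}/ \sqrt{\hat{\gamma}^2} = 1\) (as we expect for large \(N\), since \(\mathbb{E}\left[\hat{\gamma}\right] = 0.9 \beta > 0\)),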

\[ \textrm{Correlation}(y_*, \hat{y}_* \vert \boldsymbol{Y}) = \frac{0.9 \hat{\gamma}\beta}{\sqrt{\hat{\gamma}^2} \sqrt{\beta^2 + \sigma_\varepsilon^2}} = \frac{0.9 }{\sqrt{1 + \left( \frac{\sigma_\varepsilon}{\beta}\right)^2}}. \]
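As a sanity check on the closed form in (f), the sketch below compares it with a Monte Carlo estimate. It is an illustration under stated assumptions: the problem only specifies the first two moments of \(x_*\), \(z_*\), and \(\varepsilon_*\), so the Gaussian draws, the function names, and the parameter values here are choices made for the example, and \(\hat{\gamma}\) is simply fixed at a positive number.

```python
import numpy as np


def predictive_correlation(beta, sigma_eps, rho=0.9):
    """Closed form from (f), valid when gamma_hat > 0."""
    return rho / np.sqrt(1.0 + (sigma_eps / beta) ** 2)


def simulated_correlation(beta, sigma_eps, gamma_hat, rho=0.9, n_draws=200_000, seed=0):
    """Monte Carlo estimate of Corr(y_*, y_hat_*) with gamma_hat held fixed."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_draws)                                   # assumed Gaussian x_*
    z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n_draws)   # E[x_* z_*] = rho
    y = beta * x + sigma_eps * rng.normal(size=n_draws)            # y_* = beta x_* + eps_*
    y_hat = gamma_hat * z                                          # prediction from test score only
    return np.corrcoef(y, y_hat)[0, 1]


closed = predictive_correlation(beta=1.0, sigma_eps=0.5)
simulated = simulated_correlation(beta=1.0, sigma_eps=0.5, gamma_hat=0.9)
print(closed, simulated)  # the two values should agree to about two decimal places
```

Note that the correlation does not depend on the magnitude of \(\hat{\gamma}\), only on its sign, which is why the closed form contains no \(\hat{\gamma}\).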

(g)

Yes, we can perform prediction using \(y_n \sim \gamma z_n\), since the correlation between the prediction \(\hat{y}_*\) and \(y_*\) is high as long as \(\sigma_\varepsilon/ \beta\) is not too large. For prediction, it does not matter that \(z_n\) is not causally related to \(y_n\); it is enough that it is correlated with a variable that is causally related to \(y_n\).

2 Eigendecomposition and covariances

Consider the regression \(\boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\), where \(\boldsymbol{X}\) is full-rank, and we make the normal assumption \(\boldsymbol{Y}\vert \boldsymbol{X}\sim \mathcal{N}\left(\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{I}_N\right)\). In this problem, we relate the variance of \(\hat{\boldsymbol{\beta}}\) to the eigenvalues of \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}:= \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\).

Let \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}= \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal\) be the eigendecomposition of \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}\). Here, \(\Lambda\) is diagonal with entries \(\lambda_p\), and \(\boldsymbol{U}\) is orthonormal, meaning \(\boldsymbol{U}^\intercal\boldsymbol{U}= \boldsymbol{I}_P\).

(a) Define the new regressor \(\boldsymbol{z}_n = \boldsymbol{U}^\intercal\boldsymbol{x}_n\), with the corresponding regressor matrix \(\boldsymbol{Z}\). Show that \(\boldsymbol{Z}^\intercal\boldsymbol{Z}= N \Lambda\).

(b) For the regression \(\boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{\gamma}\), show that \(\boldsymbol{U}\hat{\boldsymbol{\gamma}}= \hat{\boldsymbol{\beta}}\), and so \(\hat{\boldsymbol{\gamma}}= \boldsymbol{U}^\intercal\hat{\boldsymbol{\beta}}\). (Hint: We have defined an invertible linear reparameterization of the regressors, and we proved something on a previous homework about this.)

(c) Show that \(\hat{\boldsymbol{\gamma}}\sim \mathcal{N}\left(\boldsymbol{U}^\intercal{\boldsymbol{\beta}^{*}}, \frac{\sigma^2}{N} \Lambda^{-1}\right)\), and so the entries \(\hat{\boldsymbol{\gamma}}_p\) are independent with variance \(\sigma^2 \lambda_p^{-1} / N\).

(d) We have shown that each entry of \(\hat{\boldsymbol{\beta}}= \boldsymbol{U}\hat{\boldsymbol{\gamma}}\) is a linear combination of the independent Gaussians \(\hat{\boldsymbol{\gamma}}_p\), which have variances \(\sigma^2 \lambda_p^{-1} / N\). What do you expect to happen to the covariance of \(\hat{\boldsymbol{\beta}}\) as \(\lambda_p\) approaches zero for some \(p\)?

(e) When \(\boldsymbol{X}\) is not full-rank, then at least one \(\lambda_p = 0\), and \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible. Interpret this fact in light of the above result.

Solutions

(a) Since \(\boldsymbol{z}_n = \boldsymbol{U}^\intercal\boldsymbol{x}_n\), the \(n\)-th row of \(\boldsymbol{Z}\) is \(\boldsymbol{z}_n^\intercal= \boldsymbol{x}_n^\intercal\boldsymbol{U}\), so \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{U}\). Then

\[ \begin{aligned} \boldsymbol{Z}^\intercal\boldsymbol{Z} ={}& (\boldsymbol{X}\boldsymbol{U})^\intercal(\boldsymbol{X}\boldsymbol{U}) \\={}& \boldsymbol{U}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{U} \\={}& \boldsymbol{U}^\intercal(N \hat{\boldsymbol{M}}_{\boldsymbol{X}}) \boldsymbol{U} \\={}& N \boldsymbol{U}^\intercal(\boldsymbol{U}\Lambda \boldsymbol{U}^\intercal) \boldsymbol{U} \\={}& N (\boldsymbol{U}^\intercal\boldsymbol{U}) \Lambda (\boldsymbol{U}^\intercal\boldsymbol{U}) \\={}& N \Lambda. \end{aligned} \]

(b) Using (a), the regression \(\boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\) can be written \(\boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{U}^\intercal\boldsymbol{\beta}= \boldsymbol{Z}\boldsymbol{\gamma}\) where \(\boldsymbol{\gamma}= \boldsymbol{U}^\intercal\boldsymbol{\beta}\), i.e., \(\boldsymbol{\beta}= \boldsymbol{U}\boldsymbol{\gamma}\). By the reparameterization result from a previous homework (an invertible linear reparameterization gives the same fitted values), we have \(\hat{\boldsymbol{\beta}}= \boldsymbol{U}\hat{\boldsymbol{\gamma}}\), so \(\boldsymbol{U}^\intercal\hat{\boldsymbol{\beta}}= \boldsymbol{U}^\intercal\boldsymbol{U}\hat{\boldsymbol{\gamma}}= \hat{\boldsymbol{\gamma}}\).

(c) Since \(\hat{\boldsymbol{\gamma}}= \boldsymbol{U}^\intercal\hat{\boldsymbol{\beta}}\), by the transformation of multivariate normals, \[ \hat{\boldsymbol{\gamma}}\sim \mathcal{N}\left(\boldsymbol{U}^\intercal{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{U}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{U}\right). \] Since \[ \boldsymbol{U}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{U}= (\boldsymbol{U}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{U})^{-1} = (\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} = N^{-1} \Lambda^{-1}, \] the result follows.
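The identities in (a) through (c) are easy to confirm numerically. The sketch below is an illustration only: the random design, the dimensions, and the value of \(\sigma^2\) are arbitrary assumptions, and `np.linalg.eigh` is used because \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}\) is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 3
X = rng.normal(size=(N, P))           # an arbitrary full-rank design
Y = rng.normal(size=N)                # any response works for checking (b)

M_hat = X.T @ X / N
lam, U = np.linalg.eigh(M_hat)        # M_hat = U diag(lam) U^T with orthonormal U
Z = X @ U                             # rows are z_n^T = x_n^T U

# (a) Z^T Z = N * Lambda
check_a = np.allclose(Z.T @ Z, N * np.diag(lam))

# (b) beta_hat = U gamma_hat
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
check_b = np.allclose(U @ gamma_hat, beta_hat)

# (c) Cov(gamma_hat) = sigma^2 U^T (X^T X)^{-1} U = (sigma^2 / N) Lambda^{-1}
sigma2 = 1.5                          # hypothetical noise variance
cov_gamma = sigma2 * U.T @ np.linalg.inv(X.T @ X) @ U
check_c = np.allclose(cov_gamma, sigma2 / N * np.diag(1.0 / lam))

print(check_a, check_b, check_c)
```

All three checks are expected to print `True`, up to floating-point tolerance.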

(d) From (c), the \(p\)-th component of \(\hat{\boldsymbol{\gamma}}\) has variance \(\sigma^2 \lambda_p^{-1}/N\). As \(\lambda_p \to 0\), this variance diverges to \(\infty\). Since \(\hat{\boldsymbol{\beta}}= \boldsymbol{U}\hat{\boldsymbol{\gamma}}\), each component of \(\hat{\boldsymbol{\beta}}\) is a linear combination of the components of \(\hat{\boldsymbol{\gamma}}\), and the covariance matrix is \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) = \frac{\sigma^2}{N} \boldsymbol{U}\Lambda^{-1} \boldsymbol{U}^\intercal\). The variance of any component of \(\hat{\boldsymbol{\beta}}\) with a non-zero entry in the \(p\)–th column of \(\boldsymbol{U}\) therefore diverges to \(\infty\) as well.

(e) When \(\boldsymbol{X}\) is not full-rank, at least one \(\lambda_p = 0\). From (c) this would correspond to an infinite variance for the corresponding component of \(\hat{\boldsymbol{\gamma}}\), meaning that component of the OLS estimator is completely undetermined by the data. Equivalently, \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, so the OLS normal equations have no unique solution. The result above shows this is not just a computational nuisance: the estimator genuinely has infinite (undefined) variance along directions in which \(\boldsymbol{X}\) has no variation.
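The behavior described in (d) and (e) can be seen in a small experiment with two nearly collinear regressors. This is a sketch under assumptions made for illustration: the function name, the way collinearity is generated (the second column is the first plus `delta` times noise), and the parameter values are not from the problem.

```python
import numpy as np


def smallest_eigenvalue_and_beta_variances(delta, n=500, sigma2=1.0, seed=0):
    """Return the smallest eigenvalue of X^T X / N and the variances of beta_hat."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + delta * rng.normal(size=n)    # nearly collinear with x1 when delta is small
    X = np.column_stack([x1, x2])
    lam = np.linalg.eigvalsh(X.T @ X / n)   # eigenvalues in ascending order
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    return lam[0], np.diag(cov_beta)


for delta in (1.0, 0.1, 0.01):
    lam_min, variances = smallest_eigenvalue_and_beta_variances(delta)
    print(delta, lam_min, variances)
```

As `delta` shrinks, the smallest eigenvalue approaches zero and both coefficient variances grow without bound. At `delta = 0` the matrix \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is singular and the inverse does not exist.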

3 The bias-variance tradeoff

Suppose we assume that we have training data \((\boldsymbol{x}_n, y_n)\) that are IID, and that we want to form a prediction at \(\boldsymbol{x}\) for an unobserved \(y\), where \((\boldsymbol{x}, y)\) are drawn from the same distribution as the training data. Here, \(\boldsymbol{x}\) can be thought of as fixed and known — it’s the point where we want to make a prediction — and \(y\) is random and unknown with distribution \(p(y\vert \boldsymbol{x})\).

Below, all expectations will be taken with respect to the randomness in the training responses \(\boldsymbol{Y}\) and the new, unobserved response, \(y\), with the training regressors \(\boldsymbol{X}\) and prediction regressor \(\boldsymbol{x}\) taken as fixed.

Define \(\overline{\boldsymbol{\beta}}= \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right]\), the conditional expectation of the OLS estimator given the training regressors. Let \(\mu(\boldsymbol{x}) := \mathbb{E}\left[y\vert \boldsymbol{x}\right]\) denote the true value of the conditional expectation of the predictive response.

(a) Show that

\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \mathrm{Var}\left(y\vert \boldsymbol{x}\right) + \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right]. \]

Hint: Inside the square, add and subtract \(\mu(\boldsymbol{x})\).

(b) Show that

\[ \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2 + \mathbb{E}\left[(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}\right]. \]

Hint: Inside the square, add and subtract \(\overline{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\).

(c) Show that

\[ \mathbb{E}\left[(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{x}. \]

Hint: Expand the square, and rearrange \(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}= \overline{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\) and \(\boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}= \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\) in one of the two factors of each term.

(d) Combining, write

\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \underbrace{\mathrm{Var}\left(y\vert \boldsymbol{x}\right)}_{\textrm{Irreducible error}} + \underbrace{(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2}_{\textrm{Bias term}} + \underbrace{\boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{x}}_{\textrm{Variance term}}. \]

This is known as the bias-variance tradeoff.

(e) Suppose the model is correctly specified, so there exists some \({\boldsymbol{\beta}^{*}}\) so that \(\mu(\boldsymbol{x}) = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\). Show that the “bias term” vanishes, even if \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right)\) is large.

(f) Suppose that \(N\) is very large so that \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \approx 0\). Show that the variance term vanishes, even though \((\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2\) may be large.

(g) Suppose that the model is correctly specified and \(N\) is very large. Do you still make any errors in predicting \(y\)? What is the source of these errors?

Solutions

(a) Write \(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}= (y- \mu(\boldsymbol{x})) + (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})\) and expand the square:

\[ (y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 = (y- \mu(\boldsymbol{x}))^2 + 2(y- \mu(\boldsymbol{x}))(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}) + (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2. \]

Taking conditional expectation given \(\boldsymbol{X}, \boldsymbol{x}\): the new response \(y\) is independent of the training data \(\boldsymbol{Y}\) (and hence of \(\hat{\boldsymbol{\beta}}\)), so

\[ \mathbb{E}\left[(y- \mu(\boldsymbol{x}))(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}) \vert \boldsymbol{X}, \boldsymbol{x}\right] = \mathbb{E}\left[y- \mu(\boldsymbol{x}) \vert \boldsymbol{x}\right] \cdot \mathbb{E}\left[\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}, \boldsymbol{x}\right] = 0, \]

since \(\mathbb{E}\left[y- \mu(\boldsymbol{x}) \vert \boldsymbol{x}\right] = 0\) by definition of \(\mu(\boldsymbol{x})\). Therefore

\[ \begin{aligned} \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 \vert \boldsymbol{X}, \boldsymbol{x}\right] ={}& \mathbb{E}\left[(y- \mu(\boldsymbol{x}))^2 \vert \boldsymbol{x}\right] + \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 \vert \boldsymbol{X}, \boldsymbol{x}\right] \\={}& \mathrm{Var}\left(y\vert \boldsymbol{x}\right) + \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 \vert \boldsymbol{X}, \boldsymbol{x}\right]. \end{aligned} \]

(b) As in (a), write \(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}= (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}) + (\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})\) and expand the square. The cross term is

\[ \begin{aligned} \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}) \vert \boldsymbol{X}, \boldsymbol{x}\right] ={}& (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}) \cdot \boldsymbol{x}^\intercal\mathbb{E}\left[\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = 0, \end{aligned} \]

since \(\mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = \overline{\boldsymbol{\beta}}\) by definition of \(\overline{\boldsymbol{\beta}}\). The first squared term is non-random given \(\boldsymbol{X}\), so

\[ \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 \vert \boldsymbol{X}, \boldsymbol{x}\right] = (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2 + \mathbb{E}\left[(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 \vert \boldsymbol{X}\right]. \]

(c) Write \[ \begin{aligned} \mathbb{E}\left[(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 \vert \boldsymbol{X}, \boldsymbol{x}\right] ={}& \mathbb{E}\left[(\boldsymbol{x}^\intercal(\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}))^2 \vert \boldsymbol{X}, \boldsymbol{x}\right] \\={}& \boldsymbol{x}^\intercal\mathbb{E}\left[(\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}}) (\overline{\boldsymbol{\beta}}- \hat{\boldsymbol{\beta}})^\intercal\vert \boldsymbol{X}\right] \boldsymbol{x}\\={}& \boldsymbol{x}^\intercal\mathbb{E}\left[(\hat{\boldsymbol{\beta}}- \overline{\boldsymbol{\beta}}) (\hat{\boldsymbol{\beta}}- \overline{\boldsymbol{\beta}})^\intercal\vert \boldsymbol{X}\right] \boldsymbol{x}\\={}& \boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{x}. \end{aligned} \]

(d) This is just plugging in.
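To see the three terms of the decomposition at work, the sketch below compares the formula with a Monte Carlo estimate of the predictive mean squared error for a deliberately misspecified linear fit. Everything specific here is an assumption for illustration: the quadratic true mean `mu`, the homoskedastic Gaussian noise (which gives \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) = \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\)), the design, and the prediction point.

```python
import numpy as np


def mu(t):
    """Hypothetical true conditional mean; the fitted linear model is misspecified."""
    return t**2


rng = np.random.default_rng(0)
N, sigma = 50, 0.5
t = np.linspace(-1.0, 1.0, N)
X = np.column_stack([np.ones(N), t])     # fixed training regressors: intercept and t
t_new = 0.8
x_new = np.array([1.0, t_new])           # prediction point

# Terms of the decomposition, computed from the formulas
XtX_inv = np.linalg.inv(X.T @ X)
beta_bar = XtX_inv @ X.T @ mu(t)         # E[beta_hat | X]
irreducible = sigma**2                   # Var(y | x)
bias_term = (mu(t_new) - x_new @ beta_bar) ** 2
variance_term = sigma**2 * x_new @ XtX_inv @ x_new
decomposed_mse = irreducible + bias_term + variance_term

# Monte Carlo over training responses Y and the new response y
n_sims = 50_000
Y = mu(t) + sigma * rng.normal(size=(n_sims, N))   # one training set per row
beta_hats = Y @ X @ XtX_inv                        # row i is beta_hat^T for training set i
y_new = mu(t_new) + sigma * rng.normal(size=n_sims)
mc_mse = np.mean((y_new - beta_hats @ x_new) ** 2)

print(irreducible, bias_term, variance_term)
print(decomposed_mse, mc_mse)   # these two should be close
```

With this design the bias term does not shrink as \(N\) grows, while the variance term does, which is the situation described in part (f).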

(e) If \(\mu(\boldsymbol{x}) = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\), then under correct specification, OLS is unbiased: \(\overline{\boldsymbol{\beta}}= \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = {\boldsymbol{\beta}^{*}}\). Therefore

\[ \mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}= {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}- \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}= 0. \]

(f) If \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \approx 0\) (entrywise), then for any fixed \(\boldsymbol{x}\):

\[ \boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{x}\approx 0. \]

(g) Yes, errors still occur. Even with correct specification and \(N \to \infty\), the irreducible error \(\mathrm{Var}\left(y\vert \boldsymbol{x}\right)\) remains. This reflects the inherent randomness in \(y\) that cannot be explained by \(\boldsymbol{x}\) alone, i.e., the noise \(\varepsilon_n\) in the data generating process.

4 The bias–variance tradeoff for ridge estimation

Recall the ridge estimator, \[ \hat{\boldsymbol{\beta}}_\lambda := \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}. \] For this problem, we will show how the bias–variance tradeoff changes as \(\lambda\) varies.

To make the problem simpler, we will assume that \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), and that the normal assumption holds, so that \(p(\boldsymbol{Y}| \boldsymbol{X}) = \mathcal{N}\left(\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{I}_N\right)\).

(a) Under the normal assumption, show that, in general, \[ p(\hat{\boldsymbol{\beta}}_\lambda | \boldsymbol{X}) = \mathcal{N}\left( \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1}\right). \]

(b) Using (a), show that, when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), then \[ \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right) = \frac{\sigma^2}{(1 + \lambda)^2} \boldsymbol{I}_P. \]

(c) Using (a), show that, when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), then \[ \overline{\boldsymbol{\beta}}:= \mathbb{E}\left[\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right] = \frac{1}{1 + \lambda} {\boldsymbol{\beta}^{*}}. \] Additionally, show that \(\mu(\boldsymbol{x}) := \mathbb{E}\left[y\vert \boldsymbol{x}\right] = \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}\), so that \[ \mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}= \frac{\lambda}{1 + \lambda} \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}} \]

(d) Using (b) and (c), show that, under the normal assumption and when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), the bias-variance decomposition is given by

\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}_\lambda)^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \underbrace{\sigma^2}_{\textrm{Irreducible error}} + \underbrace{\frac{\lambda^2}{(1 + \lambda)^2} (\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}})^2}_{\textrm{Bias term}} + \underbrace{\frac{\sigma^2}{(1 + \lambda)^2} \boldsymbol{x}^\intercal\boldsymbol{x}}_{\textrm{Variance term}}. \]

(e) Assuming \(\boldsymbol{x}\ne \boldsymbol{0}\) and \({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\ne 0\),

  • At what value of \(\lambda\) does the bias term have a maximum? What is the maximum bias term?
  • At what value of \(\lambda\) does the bias term have a minimum? What is the minimum bias term?
  • At what value of \(\lambda\) does the variance term have a minimum? What is the minimum variance term?
  • At what value of \(\lambda\) does the variance term have a maximum? What is the maximum variance term?

Explain your answers intuitively in terms of how ridge regression shrinks \(\hat{\boldsymbol{\beta}}_\lambda\) towards zero.

(f) Show (e.g. by computing derivatives), that the bias term is increasing in \(\lambda\), and the variance term is decreasing in \(\lambda\). From this, argue that the predictive mean squared error for a particular \(\boldsymbol{x}\) will be minimized at some non–zero value of \(\lambda\), even though the model is correctly specified and the resulting \(\hat{\boldsymbol{\beta}}_\lambda\) is biased.

(g) Will the \(\lambda\) that minimizes predictive error be the same for different \(\boldsymbol{x}\)? Argue why or why not.

Solutions

(a) This is identical to the derivation of the normal distribution for the OLS estimator, but with the ridge penalty included. Since \(\boldsymbol{Y}\vert \boldsymbol{X}\sim \mathcal{N}\left(\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{I}_N\right)\) and \(\hat{\boldsymbol{\beta}}_\lambda = \boldsymbol{A}\boldsymbol{Y}\) with \(\boldsymbol{A}= (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P)^{-1} \boldsymbol{X}^\intercal\), the estimator is a linear function of \(\boldsymbol{Y}\) and hence normally distributed. Its mean is

\[ \mathbb{E}\left[\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right] = (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \]

and its covariance is

\[ \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right) = \boldsymbol{A}\sigma^2 \boldsymbol{I}_N \boldsymbol{A}^\intercal = \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P)^{-1}. \]

(b) When \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), we have \[ (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P)^{-1} = \frac{1}{1+\lambda}\boldsymbol{I}_P. \]
Substituting into the formula from (a):

\[ \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right) = \sigma^2 \cdot \frac{1}{1+\lambda}\boldsymbol{I}_P \cdot \boldsymbol{I}_P \cdot \frac{1}{1+\lambda}\boldsymbol{I}_P = \frac{\sigma^2}{(1+\lambda)^2}\boldsymbol{I}_P. \]

(c) Using the expression for the inverse from (b), when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\)

\[ \begin{aligned} \hat{\boldsymbol{\beta}}_\lambda ={}& (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y} \\={}& \frac{1}{1+\lambda} \boldsymbol{X}^\intercal\boldsymbol{Y} \Rightarrow\\ \overline{\boldsymbol{\beta}}= \mathbb{E}\left[\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right] ={}& \frac{1}{1+\lambda} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{Y}\vert \boldsymbol{X}\right] \\={}& \frac{1}{1+\lambda} \boldsymbol{X}^\intercal\boldsymbol{X}{\boldsymbol{\beta}^{*}} \\={}& \frac{1}{1+\lambda} {\boldsymbol{\beta}^{*}}. \end{aligned} \]

Under the normal model, \(y\vert \boldsymbol{x}, {\boldsymbol{\beta}^{*}}\sim \mathcal{N}\left(\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}, \sigma^2\right)\), so \(\mu(\boldsymbol{x}) = \mathbb{E}\left[y\vert \boldsymbol{x}\right] = \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}\). Therefore

\[ \mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}} = \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}- \frac{1}{1+\lambda}\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}} = \frac{\lambda}{1+\lambda}\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}. \]

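(d) Using \(\mathrm{Var}\left(y\vert \boldsymbol{x}\right) = \sigma^2\) (irreducible error), the square of the bias from (c), and the variance term \(\boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right) \boldsymbol{x}\) from (b):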

\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}_\lambda)^2 \vert \boldsymbol{X}, \boldsymbol{x}\right] = \underbrace{\sigma^2}_{\textrm{Irreducible error}} + \underbrace{\frac{\lambda^2}{(1+\lambda)^2} (\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}})^2}_{\textrm{Bias term}} + \underbrace{\frac{\sigma^2}{(1+\lambda)^2} \boldsymbol{x}^\intercal\boldsymbol{x}}_{\textrm{Variance term}}. \]
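The simplified bias and variance terms can be checked against the general expressions from (a). In the sketch below, the function names, the dimensions, and the values of \({\boldsymbol{\beta}^{*}}\), \(\boldsymbol{x}\), \(\sigma^2\), and \(\lambda\) are assumptions made for illustration. A design with \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\) is obtained from the Q factor of a QR decomposition.

```python
import numpy as np


def ridge_terms_closed_form(lam, x, beta_star, sigma2):
    """Bias and variance terms from (d), valid when X^T X = I_P."""
    bias_term = (lam / (1 + lam)) ** 2 * (x @ beta_star) ** 2
    variance_term = sigma2 / (1 + lam) ** 2 * (x @ x)
    return bias_term, variance_term


def ridge_terms_general(lam, X, x, beta_star, sigma2):
    """Bias and variance terms from the general mean and covariance in (a)."""
    P = X.shape[1]
    A = np.linalg.inv(X.T @ X + lam * np.eye(P))
    beta_bar = A @ X.T @ X @ beta_star       # E[beta_hat_lambda | X]
    cov = sigma2 * A @ X.T @ X @ A           # Cov(beta_hat_lambda | X)
    bias_term = (x @ beta_star - x @ beta_bar) ** 2
    variance_term = x @ cov @ x
    return bias_term, variance_term


rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(100, 3)))   # orthonormal columns, so X^T X = I_P
beta_star = np.array([1.0, -2.0, 0.5])           # hypothetical true coefficients
x = np.array([0.3, 0.1, -0.7])                   # hypothetical prediction point
sigma2, lam = 0.8, 0.4

print(ridge_terms_closed_form(lam, x, beta_star, sigma2))
print(ridge_terms_general(lam, X, x, beta_star, sigma2))   # should match the line above
```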

(e) The factor \(\lambda^2 / (1 + \lambda)^2\) in the bias term is less than one, and approaches one as \(\lambda \rightarrow \infty\). So the bias term is largest in the limit \(\lambda = \infty\), where it takes the value \((\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}})^2\).

The bias term is greater than zero except at \(\lambda = 0\), so it is minimized when \(\lambda = 0\), at which point it is zero.

The variance term is \(> 0\), and approaches zero as \(\lambda \rightarrow \infty\). So it is smallest when \(\lambda = \infty\), at which point it is zero.

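The factor \(1 / (1 + \lambda)^2\) is \(< 1\) except at \(\lambda = 0\), when it is \(1\). So the variance term is maximized at \(\lambda = 0\), at which point it is \(\sigma^2 \boldsymbol{x}^\intercal\boldsymbol{x}\).

Intuitively, as \(\lambda\) grows, ridge regression shrinks \(\hat{\boldsymbol{\beta}}_\lambda\) towards the fixed value \(\boldsymbol{0}\). This removes the estimator’s variability, but pulls its mean away from \({\boldsymbol{\beta}^{*}}\). At \(\lambda = 0\) there is no shrinkage, and we recover the unbiased but more variable OLS estimator.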

(f) By differentiating,

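\[ \frac{d}{ d\lambda} \frac{\lambda^2}{(1 + \lambda)^2} = 2 \frac{\lambda}{(1 + \lambda)^2} - 2 \frac{\lambda^2}{(1 + \lambda)^3} = 2 \frac{\lambda}{(1 + \lambda)^2} \left(1 - \frac{\lambda}{1 + \lambda} \right) = \frac{2 \lambda}{(1 + \lambda)^3}, \] which is zero at \(\lambda = 0\) and strictly positive for \(\lambda > 0\), so the bias term is increasing in \(\lambda\).

Next, \(\lambda \mapsto (1 + \lambda)^2\) is increasing in \(\lambda\), and \(z\mapsto 1 / z\) is decreasing in \(z\), so the variance term is decreasing in \(\lambda\); its derivative at \(\lambda = 0\) is \(-2 \sigma^2 \boldsymbol{x}^\intercal\boldsymbol{x}< 0\). At \(\lambda = 0\) the derivative of the predictive mean squared error is therefore strictly negative, so a small amount of regularization reduces the error, and the minimum occurs at some \(\lambda > 0\).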

(g) The optimal \(\lambda^* = \frac{\sigma^2 \boldsymbol{x}^\intercal\boldsymbol{x}}{(\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}})^2}\) depends on both \(\boldsymbol{x}^\intercal\boldsymbol{x}\) and \((\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}})^2\), both of which vary with the prediction point \(\boldsymbol{x}\). In general, the optimal \(\lambda\) differs for different \(\boldsymbol{x}\), so no single fixed \(\lambda\) is simultaneously optimal for all prediction points.
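The formula for \(\lambda^*\) can be compared with a brute-force grid search over the predictive mean squared error from (d). This is a sketch for illustration: the function names, the values of \({\boldsymbol{\beta}^{*}}\) and \(\sigma^2\), the two prediction points, and the grid are all assumptions.

```python
import numpy as np


def predictive_mse(lam, x, beta_star, sigma2):
    """Predictive MSE from (d), assuming X^T X = I_P; lam may be a scalar or an array."""
    bias_term = (lam / (1 + lam)) ** 2 * (x @ beta_star) ** 2
    variance_term = sigma2 / (1 + lam) ** 2 * (x @ x)
    return sigma2 + bias_term + variance_term


def optimal_lambda(x, beta_star, sigma2):
    """Closed-form minimizer quoted in (g)."""
    return sigma2 * (x @ x) / (x @ beta_star) ** 2


beta_star = np.array([1.0, -0.5])   # hypothetical true coefficients
sigma2 = 0.25
grid = np.linspace(0.0, 5.0, 50_001)

results = []
for x in (np.array([1.0, 0.0]), np.array([1.0, 1.0])):
    lam_grid = grid[np.argmin(predictive_mse(grid, x, beta_star, sigma2))]
    lam_formula = optimal_lambda(x, beta_star, sigma2)
    results.append((lam_formula, lam_grid))
    print(x, lam_formula, lam_grid)
```

In this example the two prediction points have different optimal penalties, and in each case the grid minimizer agrees with the formula up to the grid spacing.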

5 Bayesian posterior and regularization

Suppose a partitioned vector has a multivariate normal distribution of the following form:

\[ \begin{pmatrix} \boldsymbol{a}\\ \boldsymbol{b} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \\ \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \\ \end{pmatrix} \right) \] It happens that the conditional distribution is given by \[ p(\boldsymbol{a}\vert \boldsymbol{b}) = \mathcal{N}\left( \boldsymbol{\mu}_a + \Sigma_{ab} \Sigma_{bb}^{-1} (\boldsymbol{b}- \boldsymbol{\mu}_b), \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1} \Sigma_{ba} \right). \]
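The conditioning formula translates directly into a few lines of linear algebra. The helper below is a sketch only, and `conditional_gaussian` is a hypothetical name. The example inputs reuse the unit variances and \(0.9\) covariance from Problem 1, which is an illustrative choice and not part of this problem.

```python
import numpy as np


def conditional_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, b):
    """Mean and covariance of a | b for a jointly normal (a, b), using the formula above."""
    # K = S_ab S_bb^{-1}, computed with a linear solve (S_bb is symmetric)
    K = np.linalg.solve(S_bb, S_ab.T).T
    mean = mu_a + K @ (b - mu_b)
    cov = S_aa - K @ S_ab.T          # S_ba = S_ab^T
    return mean, cov


# Scalar a and b with unit variances and covariance 0.9 (illustrative values)
mean, cov = conditional_gaussian(
    mu_a=np.array([0.0]),
    mu_b=np.array([0.0]),
    S_aa=np.array([[1.0]]),
    S_ab=np.array([[0.9]]),
    S_bb=np.array([[1.0]]),
    b=np.array([2.0]),
)
print(mean, cov)  # mean is 0.9 * 2.0 and the variance is 1 - 0.9**2, up to rounding
```

Observing \(\boldsymbol{b}\) shifts the mean of \(\boldsymbol{a}\) in proportion to the covariance and reduces its variance, which is the mechanism the remaining parts of this problem use to obtain a posterior.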

(a) Show that if \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are uncorrelated, then \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are independent. Hint: a sufficient condition for independence is that \(p(\boldsymbol{a}\vert \boldsymbol{b}) = p(\boldsymbol{a})\).

(b) Suppose that \(\boldsymbol{b}= c \boldsymbol{a}\) for some \(c\), so that \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are perfectly correlated. Show that \(\Sigma_{ab} = c\Sigma_{aa}\) and \(\Sigma_{bb} = c^2 \Sigma_{aa}\).

(c) Using (b), show that if \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are perfectly correlated, then \(\mathrm{Cov}\left(\boldsymbol{a}\vert \boldsymbol{b}\right) = \boldsymbol{0}\).

Now, suppose that we have a prior of the form \(\boldsymbol{\beta}\sim \mathcal{N}\left(\boldsymbol{0}, s^2 \boldsymbol{I}_P\right)\), and that \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}\sim \mathcal{N}\left(\boldsymbol{0}, \sigma^2 \boldsymbol{I}_N\right)\), for some known \(\sigma^2\) and \(s^2\). Here, take \(\boldsymbol{X}\) as fixed, and \(\boldsymbol{\varepsilon}\) and \(\boldsymbol{\beta}\) as independent.

(d) Compute the following quantities:

  • \(\mathbb{E}\left[\boldsymbol{Y}\right]\)
  • \(\mathbb{E}\left[\boldsymbol{\beta}\right]\)
  • \(\mathrm{Cov}\left(\boldsymbol{Y}\right)\)
  • \(\mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{Y}\right)\)
  • \(\mathrm{Cov}\left(\boldsymbol{\beta}\right)\)

(e) Using (d), write down the joint distribution of the stacked vector \((\boldsymbol{\beta}^\intercal, \boldsymbol{Y}^\intercal)^\intercal\).

(f) Using (e) and the formula for the conditional distribution of multivariate normals, find the posterior distribution \(p(\boldsymbol{\beta}\vert \boldsymbol{Y})\).

(g) What happens to the posterior mean \(\mathbb{E}\left[\boldsymbol{\beta}\vert \boldsymbol{Y}\right]\) as \(s\rightarrow \infty\)? What happens as \(s\rightarrow 0\)?

(h) What happens to the posterior variance \(\mathrm{Cov}\left(\sqrt{N} \boldsymbol{\beta}\vert \boldsymbol{Y}\right)\) as \(s\rightarrow \infty\)? (Note the scaling by \(\sqrt{N}\).) How does this relate to the limiting frequentist covariance of \(\sqrt{N}(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}})\) given by the CLT?

(i) For what value of \(\lambda\) is the posterior mean \(\mathbb{E}\left[\boldsymbol{\beta}\vert \boldsymbol{Y}\right]\) the same as the ridge regression estimator?

Solutions

(a) If \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are uncorrelated, then \(\Sigma_{ab} = \boldsymbol{0}\) (and \(\Sigma_{ba} = \boldsymbol{0}\)). The conditional distribution formula gives

\[ p(\boldsymbol{a}\vert \boldsymbol{b}) = \mathcal{N}\left(\boldsymbol{\mu}_a + \boldsymbol{0}\cdot \Sigma_{bb}^{-1}(\boldsymbol{b}- \boldsymbol{\mu}_b),\; \Sigma_{aa} - \boldsymbol{0}\cdot \Sigma_{bb}^{-1} \cdot \boldsymbol{0}\right) = \mathcal{N}\left(\boldsymbol{\mu}_a, \Sigma_{aa}\right) = p(\boldsymbol{a}). \]

Since \(p(\boldsymbol{a}\vert \boldsymbol{b}) = p(\boldsymbol{a})\), by the stated sufficient condition, \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are independent.

(b) Assuming \(\boldsymbol{b}= c\boldsymbol{a}\):

\[ \Sigma_{bb} = \mathrm{Cov}\left(\boldsymbol{b}\right) = \mathrm{Cov}\left(c\boldsymbol{a}\right) = c^2 \mathrm{Cov}\left(\boldsymbol{a}\right) = c^2 \Sigma_{aa}, \]

\[ \Sigma_{ab} = \mathrm{Cov}\left(\boldsymbol{a}, \boldsymbol{b}\right) = \mathrm{Cov}\left(\boldsymbol{a}, c\boldsymbol{a}\right) = c\mathrm{Cov}\left(\boldsymbol{a}\right) = c\Sigma_{aa}. \]

(c) Substituting \(\Sigma_{ab} = c\Sigma_{aa}\) and \(\Sigma_{bb} = c^2\Sigma_{aa}\) from (b):

\[ \mathrm{Cov}\left(\boldsymbol{a}\vert \boldsymbol{b}\right) = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Sigma_{aa} - (c\Sigma_{aa})(c^2\Sigma_{aa})^{-1}(c\Sigma_{aa}) = \Sigma_{aa} - \frac{c^2}{c^2}\Sigma_{aa}\Sigma_{aa}^{-1}\Sigma_{aa} = \Sigma_{aa} - \Sigma_{aa} = \boldsymbol{0}. \]
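As a sanity check on (b) and (c), the conditional covariance formula can be evaluated numerically. This is a minimal sketch: the matrix `sigma_aa` and the constant `c` are arbitrary illustrative choices, and the check assumes \(c \ne 0\) and an invertible \(\Sigma_{aa}\) so that \(\Sigma_{bb}^{-1}\) exists.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary symmetric positive definite covariance for a (illustrative only).
A = rng.normal(size=(3, 3))
sigma_aa = A @ A.T + np.eye(3)

c = -2.5  # hypothetical nonzero scaling, so that b = c * a

# Blocks implied by part (b).
sigma_ab = c * sigma_aa
sigma_bb = c**2 * sigma_aa

# Conditional covariance: Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba,
# where Sigma_ba = Sigma_ab^T.
cond_cov = sigma_aa - sigma_ab @ np.linalg.solve(sigma_bb, sigma_ab.T)

print(np.max(np.abs(cond_cov)))  # numerically zero, up to rounding error
```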

(d)

  • \(\mathbb{E}\left[\boldsymbol{Y}\right] = \mathbb{E}\left[\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\right] = \boldsymbol{X}\mathbb{E}\left[\boldsymbol{\beta}\right] + \mathbb{E}\left[\boldsymbol{\varepsilon}\right] = \boldsymbol{X}\boldsymbol{0}+ \boldsymbol{0}= \boldsymbol{0}.\)
  • \(\mathbb{E}\left[\boldsymbol{\beta}\right] = \boldsymbol{0}\)
  • \(\mathrm{Cov}\left(\boldsymbol{Y}\right) = \mathrm{Cov}\left(\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\right) = \boldsymbol{X}\mathrm{Cov}\left(\boldsymbol{\beta}\right)\boldsymbol{X}^\intercal+ \mathrm{Cov}\left(\boldsymbol{\varepsilon}\right) = s^2 \boldsymbol{X}\boldsymbol{X}^\intercal+ \sigma^2\boldsymbol{I}_N\)
  • \(\mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{Y}\right) = \mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\right) = \mathrm{Cov}\left(\boldsymbol{\beta}\right)\boldsymbol{X}^\intercal+ \mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{\varepsilon}\right) = s^2\boldsymbol{I}_P \cdot \boldsymbol{X}^\intercal+ \boldsymbol{0}= s^2\boldsymbol{X}^\intercal.\)
  • \(\mathrm{Cov}\left(\boldsymbol{\beta}\right) = s^2\boldsymbol{I}_P\)

(e)

\[ \begin{pmatrix} \boldsymbol{\beta}\\ \boldsymbol{Y}\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{0}\\ \boldsymbol{0}\end{pmatrix}, \begin{pmatrix} s^2\boldsymbol{I}_P & s^2\boldsymbol{X}^\intercal\\ s^2\boldsymbol{X}& s^2\boldsymbol{X}\boldsymbol{X}^\intercal+ \sigma^2\boldsymbol{I}_N \end{pmatrix} \right). \]

(f) Plugging the quantities from (e) into the conditional distribution formula,

\[ \begin{aligned} \mathbb{E}\left[\boldsymbol{\beta}\vert \boldsymbol{Y}\right] ={}& \mathbb{E}\left[\boldsymbol{\beta}\right] + \mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{Y}\right) \mathrm{Cov}\left(\boldsymbol{Y}\right)^{-1} (\boldsymbol{Y}- \mathbb{E}\left[\boldsymbol{Y}\right]) \\={}& \boldsymbol{0}+ s^2 \boldsymbol{X}^\intercal\left( s^2 \boldsymbol{X}\boldsymbol{X}^\intercal+ \sigma^2\boldsymbol{I}_N \right)^{-1} (\boldsymbol{Y}- \boldsymbol{0}) \\={}& s^2 \left( s^2 \boldsymbol{X}^\intercal\boldsymbol{X}+ \sigma^2\boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y} \\={}& \left( \boldsymbol{X}^\intercal\boldsymbol{X}+ \frac{\sigma^2}{s^2}\boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y} \end{aligned} \] (The third line uses the push-through identity, \(\boldsymbol{X}^\intercal\left( s^2 \boldsymbol{X}\boldsymbol{X}^\intercal+ \sigma^2\boldsymbol{I}_N \right)^{-1} = \left( s^2 \boldsymbol{X}^\intercal\boldsymbol{X}+ \sigma^2\boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\). Note that the identity matrix changes from \(N \times N\) to \(P \times P\).)

Then, the covariance is \[ \begin{aligned} \mathrm{Cov}\left(\boldsymbol{\beta}\vert \boldsymbol{Y}\right) ={}& \mathrm{Cov}\left(\boldsymbol{\beta}\right) - \mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{Y}\right) \mathrm{Cov}\left(\boldsymbol{Y}\right)^{-1} \mathrm{Cov}\left(\boldsymbol{Y}, \boldsymbol{\beta}\right) \\={}& s^2 \boldsymbol{I}_P - s^2 \boldsymbol{X}^\intercal\left(s^2 \boldsymbol{X}\boldsymbol{X}^\intercal+ \sigma^2\boldsymbol{I}_N \right)^{-1} s^2 \boldsymbol{X} \\={}& s^2 \left( \boldsymbol{I}_P - s^2 \left(s^2 \boldsymbol{X}^\intercal\boldsymbol{X}+ \sigma^2\boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\right) \\={}& s^2 \left(s^2 \boldsymbol{X}^\intercal\boldsymbol{X}+ \sigma^2\boldsymbol{I}_P \right)^{-1} \left( s^2 \boldsymbol{X}^\intercal\boldsymbol{X}+ \sigma^2\boldsymbol{I}_P - s^2\boldsymbol{X}^\intercal\boldsymbol{X}\right) \\={}& s^2 \sigma^2 \left(s^2 \boldsymbol{X}^\intercal\boldsymbol{X}+ \sigma^2\boldsymbol{I}_P \right)^{-1} \\={}& \left(\sigma^{-2} \boldsymbol{X}^\intercal\boldsymbol{X}+ s^{-2}\boldsymbol{I}_P \right)^{-1}. \end{aligned} \]
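The algebra in (f) can be checked numerically by comparing the \(N \times N\) expressions that come straight from the conditional formula against the simplified \(P \times P\) expressions. This is only a sketch on simulated data: the dimensions, the values of `sigma2` and `s2`, and the random `X` and `Y` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 3
sigma2, s2 = 1.5, 0.7          # hypothetical noise and prior variances
X = rng.normal(size=(N, P))    # illustrative fixed regressors
Y = rng.normal(size=N)         # any response vector works for checking the identity

# Direct application of the conditional formula (requires an N x N solve).
cov_Y = s2 * X @ X.T + sigma2 * np.eye(N)
cov_bY = s2 * X.T
mean_direct = cov_bY @ np.linalg.solve(cov_Y, Y)
cov_direct = s2 * np.eye(P) - cov_bY @ np.linalg.solve(cov_Y, cov_bY.T)

# Simplified forms derived above (only P x P solves).
XtX = X.T @ X
mean_simplified = np.linalg.solve(XtX + (sigma2 / s2) * np.eye(P), X.T @ Y)
cov_simplified = np.linalg.inv(XtX / sigma2 + np.eye(P) / s2)

print(np.allclose(mean_direct, mean_simplified))
print(np.allclose(cov_direct, cov_simplified))
```

When \(N\) is much larger than \(P\), the simplified forms are also cheaper to compute, since they avoid inverting an \(N \times N\) matrix.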

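(g) As \(s\rightarrow \infty\), \(\sigma^2 / s^2 \rightarrow 0\), so (assuming \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible) \(\mathbb{E}\left[\boldsymbol{\beta}\vert \boldsymbol{Y}\right] \rightarrow (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{X}^\intercal\boldsymbol{Y}= \hat{\boldsymbol{\beta}}\), the OLS solution. As \(s\rightarrow 0\), \(\sigma^2 / s^2 \rightarrow \infty\), so the posterior mean goes to \(\boldsymbol{0}\), the prior mean.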

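(h) As \(s\rightarrow \infty\), \(\mathrm{Cov}\left(\boldsymbol{\beta}\vert \boldsymbol{Y}\right) \rightarrow \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\), the OLS covariance under homoskedastic errors. With the \(\sqrt{N}\) scaling, \(\mathrm{Cov}\left(\sqrt{N} \boldsymbol{\beta}\vert \boldsymbol{Y}\right) = N \mathrm{Cov}\left(\boldsymbol{\beta}\vert \boldsymbol{Y}\right) \rightarrow \sigma^2 \left(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\right)^{-1}\). Provided \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\) converges to an invertible limit, this matches the limiting covariance of \(\sqrt{N}(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}})\) given by the CLT in the homoskedastic case. (As \(s\rightarrow 0\), the posterior covariance goes to zero.)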

(i) The ridge regression estimator is \(\hat{\boldsymbol{\beta}}_\lambda = (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda\boldsymbol{I}_P)^{-1}\boldsymbol{X}^\intercal\boldsymbol{Y}\). From (f), the posterior mean has exactly this form with \(\sigma^2/s^2\) in place of \(\lambda\), so the two are equal when \(\lambda = \sigma^2/s^2\).
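Parts (g) through (i) can be illustrated with a short simulation. This is a sketch under stated assumptions: the data are simulated, the helper names `posterior_mean` and `posterior_cov` are hypothetical, and the ridge objective is taken to be \(\|\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2\), which is the objective whose minimizer is the ridge formula quoted in (i).

```python
import numpy as np


def posterior_mean(X, Y, sigma2, s2):
    # Posterior mean from (f): (X'X + (sigma^2 / s^2) I_P)^{-1} X'Y.
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + (sigma2 / s2) * np.eye(P), X.T @ Y)


def posterior_cov(X, sigma2, s2):
    # Posterior covariance from (f): (X'X / sigma^2 + I_P / s^2)^{-1}.
    P = X.shape[1]
    return np.linalg.inv(X.T @ X / sigma2 + np.eye(P) / s2)


rng = np.random.default_rng(2)
N, P = 40, 3
X = rng.normal(size=(N, P))                              # illustrative regressors
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)  # simulated responses
sigma2, s2 = 1.0, 0.25
lam = sigma2 / s2                                        # ridge penalty from (i)

# (i): the gradient of the assumed ridge objective vanishes at the posterior mean.
beta_post = posterior_mean(X, Y, sigma2, s2)
ridge_grad = -2 * X.T @ (Y - X @ beta_post) + 2 * lam * beta_post

# (g): a very diffuse prior recovers OLS; a very tight prior shrinks to zero.
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
beta_wide = posterior_mean(X, Y, sigma2, 1e12)
beta_tight = posterior_mean(X, Y, sigma2, 1e-12)

# (h): N times the posterior covariance approaches sigma^2 (X'X / N)^{-1}.
scaled_cov_wide = N * posterior_cov(X, sigma2, 1e12)
ols_scaled_cov = sigma2 * np.linalg.inv(X.T @ X / N)

print(np.max(np.abs(ridge_grad)))
print(np.allclose(beta_wide, beta_ols), np.allclose(beta_tight, 0.0, atol=1e-8))
print(np.allclose(scaled_cov_wide, ols_scaled_cov))
```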