STAT151A Homework 3
1 Omitted variables in inference versus prediction
Suppose we are interested in student performance as measured by fourth-year GPA, \(y_n\), centered at \(0\) and scaled to have unit variance. Suppose we imagine two regressors:
- \(x_n\): Student’s academic ability (in an abstract sense, maybe not precisely measurable or defined), and
- \(z_n\): Student’s performance on a standardized test before admission.
Suppose that \(y_n = \beta x_n + \varepsilon_n\), where \(\varepsilon_n\) is mean zero, independent of \(x_n\), and with variance \(\sigma_\varepsilon^2\). We will also assume that \(\beta > 0\). This simplistic model is unrealistic, but we will use it for illustrative purposes, and assume for this problem that this model determines the “true” relationship between \(x_n\), \(z_n\), and \(y_n\).
Note that \(z_n\) effectively has a zero coefficient in the “true” model: a student’s GPA is causally determined entirely by “ability,” not by test score.
Assume that we have centered and standardized \(x_n\) and \(z_n\), so that
\[ \begin{aligned} \frac{1}{N} \sum_{n=1}^Nx_n =& \frac{1}{N} \sum_{n=1}^Nz_n = 0 \quad\textrm{and} \\ \frac{1}{N} \sum_{n=1}^Nx_n^2 =& \frac{1}{N} \sum_{n=1}^Nz_n^2 = 1. \end{aligned} \] But large values of \(x_n\) and \(z_n\) tend to occur together, so that \(\frac{1}{N} \sum_{n=1}^Nx_n z_n = 0.9\). (You might think of \(x_n\) and \(z_n\) as having been drawn from correlated random variables, even though they are not random for the purpose of this problem.)
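If it helps to have a concrete dataset in mind, the setup above can be sketched numerically. This is purely illustrative (the choices of \(\beta\), \(\sigma_\varepsilon\), \(N\), and the seed are arbitrary, and the sample moments only approximately hit \(0.9\)):

```python
import numpy as np

# Illustrative sketch of the setup: standardized regressors x, z with sample
# correlation close to 0.9, and a response from the "true" model y = beta*x + eps.
# beta, sigma_eps, N, and the seed are arbitrary illustrative choices.
rng = np.random.default_rng(0)
N, beta, sigma_eps, rho = 100_000, 1.0, 0.5, 0.9

# Draw (x, z) jointly normal with correlation rho, then center and scale the
# sample so the empirical first and second moments match the problem statement.
cov = np.array([[1.0, rho], [rho, 1.0]])
xz = rng.multivariate_normal(np.zeros(2), cov, size=N)
xz = (xz - xz.mean(axis=0)) / xz.std(axis=0)
x, z = xz[:, 0], xz[:, 1]

y = beta * x + rng.normal(scale=sigma_eps, size=N)

print(np.mean(x * z))  # close to 0.9 (exactly rho only in the population)
```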
(a)
Given the true model, how will we change a student’s expected GPA if we:
- Increase their test score \(z_n\) by helping them memorize the answers to the test?
- Improve their academic ability \(x_n\) by teaching them better time management skills?
(b)
If the number of observations \(N\) is large, and we run the regression \(y_n \sim \gamma z_n\), what will the expected coefficient \(\mathbb{E}\left[\hat{\gamma}\right]\) be? (In the expectation, only \(y_n\) and \(\varepsilon_n\) are random; the regressors are fixed.)
How does this compare to the “true” influence of \(z_n\) on GPA? How does this compare with the “true” influence of \(x_n\) on GPA?
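Once you have derived \(\mathbb{E}[\hat{\gamma}]\) by hand, a short Monte Carlo sketch can sanity-check the algebra: hold the regressors fixed, redraw only \(\varepsilon_n\), and average the fitted coefficient. All parameter values here are illustrative.

```python
import numpy as np

# Sanity-check sketch for (b): regress y on z alone, with the regressors held
# fixed across replications, and average the fitted slope over noise draws.
rng = np.random.default_rng(1)
N, beta, sigma_eps, rho = 50_000, 1.0, 0.5, 0.9

cov = np.array([[1.0, rho], [rho, 1.0]])
x, z = rng.multivariate_normal(np.zeros(2), cov, size=N).T  # fixed regressors

gamma_hats = []
for _ in range(200):
    y = beta * x + rng.normal(scale=sigma_eps, size=N)  # only eps is redrawn
    gamma_hats.append(np.sum(z * y) / np.sum(z ** 2))   # OLS slope, no intercept

gamma_bar = np.mean(gamma_hats)
print(gamma_bar)  # compare with your derived E[gamma_hat]
```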
(c)
Are we able to do effective inference using the regression \(y_n \sim \gamma z_n\)? Explain why or why not in intuitive terms.
(d)
Let \(x_{*}\), \(z_{*}\), and \(y_{*}\) denote a new observation not part of the training set. We still have \(y_* = \beta x_* + \varepsilon_*\), but assume now that \(x_*\) and \(z_*\) are random, independent of \(\varepsilon_*\) and of the training data, and that their moments match the training set’s sample moments:
\[ \begin{aligned} \mathbb{E}\left[x_*\right] ={}& \mathbb{E}\left[z_*\right] = 0 \\ \mathrm{Var}\left(x_*\right) ={}& \mathrm{Var}\left(z_*\right) = 1 \\ \mathbb{E}\left[x_* z_*\right] ={}& 0.9. \end{aligned} \]
Evaluate \(\mathbb{E}\left[y_*\right]\) and \(\mathrm{Var}\left(y_*\right)\), where the randomness is over \(x_*\), \(z_*\), and \(\varepsilon_*\). Note that the answer will be in terms of \(\sigma_\varepsilon^2\), the variance of \(\varepsilon_*\).
(e)
Using the regression \(y_n \sim \gamma z_n\), form the prediction \(\hat{y}_{*} = z_{*} \hat{\gamma}\). Evaluate \(\mathbb{E}\left[\hat{y}_* \vert \boldsymbol{Y}\right]\) and \(\mathrm{Var}\left(\hat{y}_* \vert \boldsymbol{Y}\right)\). Again, the answer will be in terms of \(\sigma_\varepsilon^2\), the variance of \(\varepsilon_*\). Hint: \(\hat{\gamma}\) is fixed (not random) conditional on \(\boldsymbol{Y}\).
(f)
Using (d) and (e), evaluate the conditional correlation between the response and prediction:
\[ \textrm{Correlation}(y_*, \hat{y}_* \vert \boldsymbol{Y}) = \frac{\mathbb{E}\left[y_{*} \hat{y}_{*} \vert \boldsymbol{Y}\right]} {\sqrt{\mathrm{Var}\left(\hat{y}_{*} \vert \boldsymbol{Y}\right)} \sqrt{\mathrm{Var}\left(y_* | \boldsymbol{Y}\right)}}. \]
Hint: \(\varepsilon_*\), \(x_*\), and \(z_*\) are independent of the training data \(\boldsymbol{Y}\), so the conditioning doesn’t affect their expectations.
(g)
Suppose that \(\sigma_\varepsilon/ \beta\) is not too large, so the residual standard deviation is not too large relative to the effect of academic ability on GPA. Are we then able to do effective prediction using the regression \(y_n \sim \gamma z_n\), say, to identify which students are likely to have good GPAs using only test scores? Explain why or why not in intuitive terms.
2 Eigendecomposition and covariances
Consider the regression \(\boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\), where \(\boldsymbol{X}\) is full-rank, and we make the normal assumption \(\boldsymbol{Y}\vert \boldsymbol{X}\sim \mathcal{N}\left(\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{I}_N\right)\). In this problem, we relate the variance of \(\hat{\boldsymbol{\beta}}\) to the eigenvalues of \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}:= \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\).
Let \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}= \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal\) be the eigendecomposition of \(\hat{\boldsymbol{M}}_{\boldsymbol{X}}\). Here, \(\Lambda\) is diagonal with entries \(\lambda_p\), and \(\boldsymbol{U}\) is orthonormal, meaning \(\boldsymbol{U}^\intercal\boldsymbol{U}= \boldsymbol{I}_P\).
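The objects above are easy to compute numerically, which can help you check your work in the parts below. A minimal sketch (the design matrix is arbitrary illustrative data):

```python
import numpy as np

# Sketch of the setup: form M_X = X^T X / N, eigendecompose it, and confirm
# that U Lambda U^T reconstructs M_X and that U is orthonormal.
rng = np.random.default_rng(0)
N, P = 500, 3
X = rng.normal(size=(N, P))  # illustrative design matrix

M = X.T @ X / N
lam, U = np.linalg.eigh(M)  # eigh: eigendecomposition for symmetric matrices

recon_err = np.max(np.abs(U @ np.diag(lam) @ U.T - M))
orth_err = np.max(np.abs(U.T @ U - np.eye(P)))
print(recon_err, orth_err)  # both near machine precision
```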
(a) Define the new regressor \(\boldsymbol{z}_n = \boldsymbol{U}^\intercal\boldsymbol{x}_n\), with the corresponding regressor matrix \(\boldsymbol{Z}\). Show that \(\boldsymbol{Z}^\intercal\boldsymbol{Z}= N \Lambda\).
(b) For the regression \(\boldsymbol{Y}\sim \boldsymbol{Z}\gamma\), show that \(\boldsymbol{U}\hat{\boldsymbol{\gamma}}= \hat{\boldsymbol{\beta}}\), and so \(\hat{\boldsymbol{\gamma}}= \boldsymbol{U}^\intercal\hat{\boldsymbol{\beta}}\). (Hint: We have defined an invertible linear reparameterization of the regressors, and we proved something on a previous homework about this.)
(c) Show that \(\hat{\boldsymbol{\gamma}}\sim \mathcal{N}\left(\boldsymbol{U}^\intercal{\boldsymbol{\beta}^{*}}, \frac{\sigma^2}{N} \Lambda^{-1}\right)\), and so the entries \(\hat{\boldsymbol{\gamma}}_p\) are independent with variance \(\sigma^2 \lambda_p^{-1} / N\).
(d) We have shown that each entry of \(\hat{\boldsymbol{\beta}}\) is a linear combination of independent Gaussians with variances \(\sigma^2 \lambda_p^{-1} / N\). What do you expect to happen to the covariance of \(\hat{\boldsymbol{\beta}}\) as \(\lambda_p\) approaches zero for some \(p\)?
(e) When \(\boldsymbol{X}\) is not full-rank, then at least one \(\lambda_p = 0\), and \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible. Interpret this fact in light of the above result.
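The variance claim in (c) can be checked by simulation: rotate the regressors, refit over many noise draws, and compare the empirical variances of \(\hat{\boldsymbol{\gamma}}_p\) with \(\sigma^2 / (N \lambda_p)\). All parameter values below are illustrative.

```python
import numpy as np

# Simulation sketch of (c): with Z = X U, the entries of gamma_hat should be
# approximately independent with variance sigma^2 / (N * lambda_p).
rng = np.random.default_rng(0)
N, P, sigma = 400, 3, 1.0
X = rng.normal(size=(N, P)) @ np.diag([2.0, 1.0, 0.3])  # unequal eigenvalues
beta_star = np.array([1.0, -1.0, 0.5])                  # illustrative truth

lam, U = np.linalg.eigh(X.T @ X / N)
Z = X @ U

draws = []
for _ in range(2000):
    Y = X @ beta_star + rng.normal(scale=sigma, size=N)
    draws.append(np.linalg.lstsq(Z, Y, rcond=None)[0])
emp_var = np.var(np.array(draws), axis=0)

print(emp_var)                  # empirical variances of gamma_hat entries
print(sigma ** 2 / (N * lam))   # theoretical sigma^2 / (N lambda_p)
```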
3 The bias-variance tradeoff
Suppose that we have training data \((\boldsymbol{x}_n, y_n)\) that are IID, and that we want to form a prediction at \(\boldsymbol{x}\) for an unobserved \(y\), where \((\boldsymbol{x}, y)\) are drawn from the same distribution as the training data. Here, \(\boldsymbol{x}\) can be thought of as fixed and known — it’s the point where we want to make a prediction — and \(y\) is random and unknown with distribution \(p(y\vert \boldsymbol{x})\).
Below, all expectations will be taken with respect to the randomness in the training responses \(\boldsymbol{Y}\) and the new, unobserved response, \(y\), with the training regressors \(\boldsymbol{X}\) and prediction regressor \(\boldsymbol{x}\) taken as fixed.
Define \(\overline{\boldsymbol{\beta}}= \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right]\), the conditional expectation of the OLS estimator given the training regressors. Let \(\mu(\boldsymbol{x}) := \mathbb{E}\left[y\vert \boldsymbol{x}\right]\) denote the true value of the conditional expectation of the predictive response.
(a) Show that
\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \mathrm{Var}\left(y\vert \boldsymbol{x}\right) + \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right]. \]
Hint: Inside the square, add and subtract \(\mu(\boldsymbol{x})\).
(b) Show that
\[ \mathbb{E}\left[(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = (\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2 + \mathbb{E}\left[(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}\right]. \]
Hint: Inside the square, add and subtract \(\overline{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\).
(c) Show that
\[ \mathbb{E}\left[(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}\right] = \boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{x}. \]
Hint: Expand the square, and rearrange \(\boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}= \overline{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\) and \(\boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}= \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\) in one of the two factors of each term.
(d) Combining, write
\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}})^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \underbrace{\mathrm{Var}\left(y\vert \boldsymbol{x}\right)}_{\textrm{Irreducible error}} + \underbrace{(\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2}_{\textrm{Bias term}} + \underbrace{\boldsymbol{x}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{x}}_{\textrm{Variance term}}. \]
This is known as the bias-variance tradeoff.
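The three terms in the decomposition can all be estimated by simulation, which is a useful check on the algebra. The sketch below deliberately misspecifies the model (a linear fit to a quadratic truth) so that the bias term is nonzero; every parameter choice is illustrative.

```python
import numpy as np

# Simulation sketch of the bias-variance decomposition: refit OLS over many
# noise draws, estimate each term, and compare the sum to the direct MSE.
rng = np.random.default_rng(0)
N, sigma = 50, 0.5
x_train = np.linspace(-1, 1, N)
X = np.column_stack([np.ones(N), x_train])  # fitted model: intercept + slope
mu = lambda t: t + 0.8 * t ** 2             # true conditional mean (quadratic)

x_new = np.array([1.0, 0.7])                # prediction point (intercept, x)
preds = []
for _ in range(5000):
    Y = mu(x_train) + rng.normal(scale=sigma, size=N)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    preds.append(x_new @ beta_hat)
preds = np.array(preds)

irreducible = sigma ** 2
bias_sq = (mu(x_new[1]) - preds.mean()) ** 2
variance = preds.var()

# Compare the summed decomposition to a direct estimate of the prediction MSE.
y_new = mu(x_new[1]) + rng.normal(scale=sigma, size=5000)
mse = np.mean((y_new - preds) ** 2)
print(irreducible + bias_sq + variance, mse)
```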
(e) Suppose the model is correctly specified, so there exists some \({\boldsymbol{\beta}^{*}}\) so that \(\mu(\boldsymbol{x}) = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\). Show that the “bias term” vanishes, even if \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right)\) is large.
(f) Suppose that \(N\) is very large so that \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \approx 0\). Show that the variance term vanishes, even though \((\mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}})^2\) may be large.
(g) Suppose that the model is correctly specified and \(N\) is very large. Do you still make any errors in predicting \(y\)? What is the source of these errors?
4 The bias–variance tradeoff for ridge estimation
Recall the ridge estimator, \[ \hat{\boldsymbol{\beta}}_\lambda := \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}. \] For this problem, we will show how the bias–variance tradeoff changes as \(\lambda\) varies.
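The estimator is a one-line computation, sketched below on arbitrary illustrative data; the function simply implements the displayed formula.

```python
import numpy as np

# Direct sketch of the ridge estimator formula above.
def ridge(X, Y, lam):
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

rng = np.random.default_rng(0)
N, P = 200, 4
X = rng.normal(size=(N, P))  # illustrative full-rank design
Y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=N)

beta_ols = ridge(X, Y, 0.0)     # lambda = 0 recovers OLS
beta_ridge = ridge(X, Y, 50.0)  # larger lambda shrinks toward zero
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```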
To make the problem simpler, we will assume that \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), and that the normal assumption holds, so that \(p(\boldsymbol{Y}| \boldsymbol{X}) = \mathcal{N}\left(\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{I}_N\right)\).
(a) Under the normal assumption, show that, in general, \[ p(\hat{\boldsymbol{\beta}}_\lambda | \boldsymbol{X}) = \mathcal{N}\left( \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}{\boldsymbol{\beta}^{*}}, \sigma^2 \left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\left(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_P \right)^{-1}\right). \]
(b) Using (a), show that, when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), then \[ \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right) = \frac{\sigma^2}{(1 + \lambda)^2} \boldsymbol{I}_P. \]
(c) Using (a), show that, when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), then \[ \overline{\boldsymbol{\beta}}:= \mathbb{E}\left[\hat{\boldsymbol{\beta}}_\lambda \vert \boldsymbol{X}\right] = \frac{1}{1 + \lambda} {\boldsymbol{\beta}^{*}}. \] Additionally, show that \(\mu(\boldsymbol{x}) := \mathbb{E}\left[y\vert \boldsymbol{x}\right] = \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}\), so that \[ \mu(\boldsymbol{x}) - \boldsymbol{x}^\intercal\overline{\boldsymbol{\beta}}= \frac{\lambda}{1 + \lambda} \boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}}. \]
(d) Using (b) and (c), show that, under the normal assumption and when \(\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{I}_P\), the bias-variance decomposition is given by
\[ \mathbb{E}\left[(y- \boldsymbol{x}^\intercal\hat{\boldsymbol{\beta}}_\lambda)^2 | \boldsymbol{X}, \boldsymbol{x}\right] = \underbrace{\sigma^2}_{\textrm{Irreducible error}} + \underbrace{\frac{\lambda^2}{(1 + \lambda)^2} (\boldsymbol{x}^\intercal{\boldsymbol{\beta}^{*}})^2}_{\textrm{Bias term}} + \underbrace{\frac{\sigma^2}{(1 + \lambda)^2} \boldsymbol{x}^\intercal\boldsymbol{x}}_{\textrm{Variance term}}. \]
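For parts (e) and (f), it may help to evaluate the bias and variance terms numerically on a grid of \(\lambda\). The \(\boldsymbol{x}\), \({\boldsymbol{\beta}^{*}}\), and \(\sigma\) below are illustrative placeholders.

```python
import numpy as np

# Numerical sketch of the decomposition in (d): evaluate the bias and variance
# terms on a grid of lambda for one illustrative x and beta_star.
sigma = 1.0
x = np.array([1.0, -0.5])
beta_star = np.array([2.0, 1.0])

lams = np.linspace(0.0, 10.0, 101)
bias_term = (lams ** 2 / (1 + lams) ** 2) * (x @ beta_star) ** 2
var_term = (sigma ** 2 / (1 + lams) ** 2) * (x @ x)
total = sigma ** 2 + bias_term + var_term

# The bias term grows with lambda while the variance term shrinks; their sum
# is minimized at an interior lambda for this choice of x and beta_star.
print(lams[np.argmin(total)])
```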
(e) Assuming \(\boldsymbol{x}\ne \boldsymbol{0}\) and \({\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\ne 0\),
- At what value of \(\lambda\) does the bias term have a maximum? What is the maximum bias term?
- At what value of \(\lambda\) does the bias term have a minimum? What is the minimum bias term?
- At what value of \(\lambda\) does the variance term have a minimum? What is the minimum variance term?
- At what value of \(\lambda\) does the variance term have a maximum? What is the maximum variance term?
Explain your answers intuitively in terms of how ridge regression shrinks \(\hat{\boldsymbol{\beta}}_\lambda\) towards zero.
(f) Show (e.g., by computing derivatives) that the bias term is increasing in \(\lambda\) and the variance term is decreasing in \(\lambda\). From this, argue that the predictive mean squared error for a particular \(\boldsymbol{x}\) will be minimized at some nonzero value of \(\lambda\), even though the model is correctly specified and the resulting \(\hat{\boldsymbol{\beta}}_\lambda\) is biased.
(g) Will the \(\lambda\) that minimizes predictive error be the same for different \(\boldsymbol{x}\)? Argue why or why not.
5 Bayesian posterior and regularization
Suppose a partitioned vector has a multivariate normal distribution of the following form:
\[ \begin{pmatrix} \boldsymbol{a}\\ \boldsymbol{b} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \\ \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \\ \end{pmatrix} \right) \] It happens that the conditional distribution is given by \[ p(\boldsymbol{a}\vert \boldsymbol{b}) = \mathcal{N}\left( \boldsymbol{\mu}_a + \Sigma_{ab} \Sigma_{bb}^{-1} (\boldsymbol{b}- \boldsymbol{\mu}_b), \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1} \Sigma_{ba} \right). \]
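The conditional-distribution formula above can be verified by Monte Carlo: sample \((\boldsymbol{a}, \boldsymbol{b})\) jointly, keep draws with \(\boldsymbol{b}\) near a fixed value, and compare the empirical conditional moments with the formula. The sketch below uses scalar blocks and illustrative values.

```python
import numpy as np

# Monte Carlo check of the conditional-normal formula (scalar a and b).
rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])              # (mu_a, mu_b)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # [[Sigma_aa, Sigma_ab], [Sigma_ba, Sigma_bb]]
b0 = 0.5                                # value to condition on

draws = rng.multivariate_normal(mu, Sigma, size=1_000_000)
near = draws[np.abs(draws[:, 1] - b0) < 0.01]  # condition on b ~= b0

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (b0 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(near[:, 0].mean(), cond_mean)  # empirical vs. formula mean
print(near[:, 0].var(), cond_var)    # empirical vs. formula variance
```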
(a) Show that if \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are uncorrelated, then \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are independent. Hint: a sufficient condition for independence is that \(p(\boldsymbol{a}\vert \boldsymbol{b}) = p(\boldsymbol{a})\).
(b) Suppose that \(\boldsymbol{b}= c \boldsymbol{a}\) for some \(c\), so that \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are perfectly correlated. Show that \(\Sigma_{ab} = c\Sigma_{aa}\) and \(\Sigma_{bb} = c^2 \Sigma_{aa}\).
(c) Using (b), show that if \(\boldsymbol{a}\) and \(\boldsymbol{b}\) are perfectly correlated, then \(\mathrm{Cov}\left(\boldsymbol{a}\vert \boldsymbol{b}\right) = \boldsymbol{0}\).
Now, suppose that we have a prior of the form \(\boldsymbol{\beta}\sim \mathcal{N}\left(\boldsymbol{0}, s^2 \boldsymbol{I}_P\right)\), and that \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\), where \(\boldsymbol{\varepsilon}\sim \mathcal{N}\left(\boldsymbol{0}, \sigma^2 \boldsymbol{I}_N\right)\), for some known \(\sigma^2\) and \(s^2\). Here, take \(\boldsymbol{X}\) as fixed, and \(\boldsymbol{\varepsilon}\) and \(\boldsymbol{\beta}\) as independent.
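Because \(\boldsymbol{\beta}\) and \(\boldsymbol{Y}\) can be sampled jointly from this model, the moment calculations asked for in (d) can be checked empirically. A sketch, with illustrative \(\boldsymbol{X}\), \(\sigma\), and \(s\):

```python
import numpy as np

# Monte Carlo sketch for checking (d): sample (beta, Y) jointly from the prior
# and the model, then estimate the moments empirically and compare with your
# closed forms. X, sigma, and s are illustrative.
rng = np.random.default_rng(0)
N, P, sigma, s = 3, 2, 0.7, 1.5
X = rng.normal(size=(N, P))

R = 200_000
betas = s * rng.normal(size=(R, P))                 # beta ~ N(0, s^2 I_P)
Ys = betas @ X.T + sigma * rng.normal(size=(R, N))  # Y = X beta + eps

print(Ys.mean(axis=0))              # compare with E[Y]
print(np.cov(Ys, rowvar=False))     # compare with Cov(Y)
cross = (betas - betas.mean(0)).T @ (Ys - Ys.mean(0)) / (R - 1)
print(cross)                        # compare with Cov(beta, Y)
```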
(d) Compute the following quantities:
- \(\mathbb{E}\left[\boldsymbol{Y}\right]\)
- \(\mathbb{E}\left[\boldsymbol{\beta}\right]\)
- \(\mathrm{Cov}\left(\boldsymbol{Y}\right)\)
- \(\mathrm{Cov}\left(\boldsymbol{\beta}, \boldsymbol{Y}\right)\)
- \(\mathrm{Cov}\left(\boldsymbol{\beta}\right)\)
(e) Using (d), write down the joint distribution of the stacked vector \((\boldsymbol{\beta}^\intercal, \boldsymbol{Y}^\intercal)^\intercal\).
(f) Using (e) and the formula for the conditional distribution of multivariate normals, find the posterior distribution \(p(\boldsymbol{\beta}\vert \boldsymbol{Y})\).
(g) What happens to the posterior mean \(\mathbb{E}\left[\boldsymbol{\beta}\vert \boldsymbol{Y}\right]\) as \(s\rightarrow \infty\)? What happens as \(s\rightarrow 0\)?
(h) What happens to the posterior covariance \(\mathrm{Cov}\left(\sqrt{N} \boldsymbol{\beta}\vert \boldsymbol{Y}\right)\) as \(s\rightarrow \infty\)? (Note the scaling by \(\sqrt{N}\).) How does this relate to the limiting frequentist covariance of \(\sqrt{N}(\hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}})\) given by the CLT?
(i) For what value of \(\lambda\) is the posterior mean \(\mathbb{E}\left[\boldsymbol{\beta}\vert \boldsymbol{Y}\right]\) the same as the ridge regression estimator?