This homework is due on Gradescope on Friday October 11th at 9pm.
1 Interpretation of transforms of the response
Suppose I have household data in which \(n = 1, \ldots, N\) indexes households and \(y_n\) is the household's expenditure on food in a time period (so that \(y_n > 0\) for all \(n\)). Suppose that households are randomly selected to either receive food stamps (for which \(x_n = 1\)) or to not receive food stamps (for which \(x_n = 0\)). Also, suppose we measure household income \(z_n\).
(a)
Suppose we regress \(y_n \sim \beta_0 + \beta_1 x_n + \beta_2 z_n\). Let \(\hat{f}(x, z) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 z\) denote the fit.
Using this regression, we might estimate the effect of food stamps on food expenditure by \(\hat{f}(1, z) - \hat{f}(0, z)\). How does this estimate depend on \(z\)?
(b)
Now suppose we regress \(\log y_n \sim \gamma_0 + \gamma_1 x_n + \gamma_2 z_n\). Let \(\hat{g}(x, z) = \hat{\gamma}_0 + \hat{\gamma}_1 x + \hat{\gamma}_2 z\) denote the fit.
Using this regression, we might estimate the effect of food stamps on food expenditure by \(\exp(\hat{g}(1, z)) - \exp(\hat{g}(0, z))\). How does this estimate depend on \(z\)?
(c)
The regressions in (a) and (b) make different implicit assumptions about how food stamps affect consumption for a particular household. State these assumptions in ordinary language.
In the first model, food stamps have an additive effect on expenditure. In the second model (with the log response), food stamps have a multiplicative effect on expenditure.
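To make the contrast concrete, here is a short worked comparison (a sketch, assuming the log-response fit in (b) is \(\hat{g}(x, z) = \hat{\gamma}_0 + \hat{\gamma}_1 x + \hat{\gamma}_2 z\)): \[
\begin{aligned}
\hat{f}(1, z) - \hat{f}(0, z) ={}& \hat{\beta}_1 & \textrm{(does not depend on } z\textrm{)}\\
\exp(\hat{g}(1, z)) - \exp(\hat{g}(0, z)) ={}& \left(e^{\hat{\gamma}_1} - 1\right)\exp(\hat{\gamma}_0 + \hat{\gamma}_2 z) & \textrm{(scales with a baseline that depends on } z\textrm{)}.
\end{aligned}
\]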
2 Fit and regressors
Given a regression on \(\boldsymbol{X}\) with \(P\) regressors and \(N\) data points, and the corresponding \(\boldsymbol{Y}\), \(\hat{\boldsymbol{Y}}\), and \(\hat{\varepsilon}\), define the following quantities: \[
\begin{aligned}
RSS :={}& \hat{\varepsilon}^\intercal\hat{\varepsilon}& \textrm{(Residual sum of squares)}\\
TSS :={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}& \textrm{(Total sum of squares)}\\
ESS :={}& \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}& \textrm{(Explained sum of squares)}\\
R^2 :={}& \frac{ESS}{TSS}.
\end{aligned}
\]
Prove that \(RSS + ESS = TSS\).
Express \(R^2\) in terms of \(TSS\) and \(RSS\).
What is \(R^2\) when we include no regressors? (\(P = 0\))
What is \(R^2\) when we include \(N\) linearly independent regressors? (\(P=N\))
Can \(R^2\) ever decrease when we add a regressor? If so, how?
Can \(R^2\) ever stay the same when we add a regressor? If so, how?
Can \(R^2\) ever increase when we add a regressor? If so, how?
Does a high \(R^2\) mean the regression is useful? (You may argue by example.)
Does a low \(R^2\) mean the regression is not useful? (You may argue by example.)
Solutions:
This follows from \(\hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}= \boldsymbol{0}\).
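In a bit more detail, writing \(\boldsymbol{Y}= \hat{\boldsymbol{Y}}+ \hat{\varepsilon}\): \[
\begin{aligned}
TSS = \boldsymbol{Y}^\intercal\boldsymbol{Y}
={}& (\hat{\boldsymbol{Y}}+ \hat{\varepsilon})^\intercal(\hat{\boldsymbol{Y}}+ \hat{\varepsilon})\\
={}& \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}+ 2\hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}+ \hat{\varepsilon}^\intercal\hat{\varepsilon}\\
={}& ESS + RSS.
\end{aligned}
\]
It follows that \(R^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\). With no regressors at all (\(P = 0\)), \(\hat{\boldsymbol{Y}}= \boldsymbol{0}\), so \(ESS = 0\) and \(R^2 = 0\) under this definition.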
\(R^2 = 1\) when we include \(N\) linearly independent regressors, since then \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}\) and \(RSS = 0\).
No, it cannot, since you project onto the same or a larger subspace, so \(RSS\) cannot increase.
Yes, if you add a regressor column that is collinear with the existing columns.
Yes, if you add a linearly independent regressor column that is not orthogonal to the current residuals.
No. For example, with \(P = N\) linearly independent regressors you get \(R^2 = 1\) even when the regressors have nothing to do with the response; such a fit is badly overfit and useless for prediction.
No. A regression can recover a real, useful relationship even when the noise variance is large; the signal-to-noise ratio is then low and so is \(R^2\), but the fit is still informative.
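The claims above about adding regressors can be checked numerically. Here is a minimal R simulation sketch (the data and variable names are made up for illustration and are not part of the assignment):

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)                      # pure noise, unrelated to y
y  <- 1 + 2 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_more  <- lm(y ~ x1 + x2)        # add an unrelated regressor
fit_dup   <- lm(y ~ x1 + I(2 * x1)) # add a collinear regressor

# Note: summary() reports R's (centered) R^2, but the monotonicity is the same.
summary(fit_small)$r.squared
summary(fit_more)$r.squared         # never smaller than the line above
summary(fit_dup)$r.squared          # identical: the collinear column adds nothing
```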
3 Prediction in the bodyfat example
This exercise will use the bodyfat example from the datasets. Suppose we're interested in predicting bodyfat, which is difficult to measure precisely, using other variables that are easier to measure: Height, Weight, and Abdomen circumference.
If we do so, we get the following sum of squared errors:
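The original code chunk and its numeric output are not reproduced here. As a rough sketch of how such a fit might be computed in R, assuming the data live in a data frame called bodyfat_df with columns bodyfat, Height, Weight, and Abdomen (names assumed):

```r
# Fit bodyfat on the easier-to-measure variables and compute the
# sum of squared errors (residual sum of squares).
fit <- lm(bodyfat ~ Height + Weight + Abdomen, data = bodyfat_df)
sum(residuals(fit)^2)
```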
Noting that Height, Weight, and Abdomen are on different scales, your colleague suggests that you might get a better fit by normalizing them. But when you do, here's what happened:
Our coefficients changed, but our fitted error didn’t change at all.
Explain why the fitted error did not change.
Explain why the coefficients did change.
Normalizing the regressors amounts to an invertible linear transformation of the design matrix (the centering is absorbed by the intercept column). This means \(\hat{\boldsymbol{Y}}\) does not change, but the coefficients do.
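A sketch of the algebra: with the intercept column included, the normalized design matrix can be written \(\tilde{\boldsymbol{X}}= \boldsymbol{X}\boldsymbol{A}\) for an invertible matrix \(\boldsymbol{A}\). Then \[
\tilde{\boldsymbol{X}}(\tilde{\boldsymbol{X}}^\intercal\tilde{\boldsymbol{X}})^{-1}\tilde{\boldsymbol{X}}^\intercal
= \boldsymbol{X}\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{A})^{-1}\boldsymbol{A}^\intercal\boldsymbol{X}^\intercal
= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{X}^\intercal,
\] so the projection (and hence \(\hat{\boldsymbol{Y}}\) and the fitted error) is unchanged, while the coefficients transform as \(\hat{\tilde{\boldsymbol{\beta}}} = \boldsymbol{A}^{-1}\hat{\boldsymbol{\beta}}\).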
(b)
Chastened, your colleague suggests that maybe it's the difference between normalized height and weight that would help us predict. After all, it makes sense that height should only matter relative to weight, and vice versa. So they run the regression on the difference:
Now, our fitted error didn’t change at all, but our difference coefficient wasn’t even estimated.
Explain why the fitted error did not change.
Explain why the difference coefficient was not estimated by R.
The new set of regressors spans the same column space as the original regressors, so \(\hat{\boldsymbol{Y}}\) does not change. But the difference is a linear combination of height_norm and weight_norm, so \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible and \(\hat{\boldsymbol{\beta}}\) is not uniquely defined. R reports NA for coefficients that it cannot estimate.
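A minimal sketch of this behavior in R, reusing the assumed bodyfat_df from above; height_norm and weight_norm follow the names in the text, while hw_diff is a hypothetical name for the difference:

```r
# Normalized height and weight, plus their difference (collinear with them).
bodyfat_df$height_norm <- as.numeric(scale(bodyfat_df$Height))
bodyfat_df$weight_norm <- as.numeric(scale(bodyfat_df$Weight))
bodyfat_df$hw_diff     <- bodyfat_df$height_norm - bodyfat_df$weight_norm

fit <- lm(bodyfat ~ Abdomen + height_norm + weight_norm + hw_diff,
          data = bodyfat_df)
coef(fit)               # the hw_diff coefficient is reported as NA
sum(residuals(fit)^2)   # same fitted error as without hw_diff
```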
(c)
Finally, your colleague suggests regressing instead on the ratio of weight to height. Here are the results:
Our fitted error is different this time, and we could estimate this coefficient.
Explain why we could estimate the coefficient of the ratio of weight to height, but not the difference.
Explain why the fitted error changed.
It happened that, by including the regressor hw_ratio, the fitted error decreased. Your colleague tells you that had it been a bad regressor, the error would have increased. Are they correct?
You can estimate the coefficient on the ratio because it is a nonlinear transformation of the existing regressors, so it is not a linear combination of them. With this new regressor, the fit can change. Including another regressor, no matter how poorly associated with the response, can never make the fitted error increase. Your colleague is incorrect.
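A sketch of the comparison in R, again with the assumed bodyfat_df; hw_ratio follows the name used in the text:

```r
# The ratio is a nonlinear function of Weight and Height, so it is not a
# linear combination of the existing regressors and can be estimated.
bodyfat_df$hw_ratio <- bodyfat_df$Weight / bodyfat_df$Height

fit_base  <- lm(bodyfat ~ Abdomen + Height + Weight, data = bodyfat_df)
fit_ratio <- lm(bodyfat ~ Abdomen + Height + Weight + hw_ratio,
                data = bodyfat_df)

# Adding a regressor can never increase the residual sum of squares:
sum(residuals(fit_base)^2)
sum(residuals(fit_ratio)^2)   # <= the value above, however weak hw_ratio is
```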
(d)
Let \(x_n = (\textrm{Abdomen}_n, \textrm{Height}_n, \textrm{Weight}_n)^\intercal\) denote our set of regressors.
Your colleague suggests a research project where you improve your fit by regressing \(y_n \sim z_n\) for new regressors \(z_n\) of the form \(z_n = \boldsymbol{A}x_n\), where the matrix \(\boldsymbol{A}\) is chosen using a machine learning algorithm.
Will this result produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?
Since each component of \(\boldsymbol{A}\boldsymbol{x}_n\) is a linear combination of the entries of \(\boldsymbol{x}_n\), regressing on \(\boldsymbol{z}_n\) can never give a better fit than regressing on \(\boldsymbol{x}_n\) alone. (It can give a worse fit if \(\boldsymbol{A}\) has rank less than the number of regressors.) Searching over all possible \(\boldsymbol{A}\) will never give you a better fit than \(y_n \sim \boldsymbol{x}_n\). This is not a good idea for a research project.
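A sketch of the argument: stacking the rows \(\boldsymbol{z}_n^\intercal = (\boldsymbol{A}\boldsymbol{x}_n)^\intercal = \boldsymbol{x}_n^\intercal\boldsymbol{A}^\intercal\) gives the design matrix \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{A}^\intercal\), so \[
\textrm{col}(\boldsymbol{Z}) = \textrm{col}(\boldsymbol{X}\boldsymbol{A}^\intercal) \subseteq \textrm{col}(\boldsymbol{X}).
\] Projecting \(\boldsymbol{Y}\) onto a subspace of \(\textrm{col}(\boldsymbol{X})\) can never produce a smaller residual sum of squares than projecting onto \(\textrm{col}(\boldsymbol{X})\) itself, with equality exactly when the two column spaces coincide (e.g., when \(\boldsymbol{A}\) has full rank).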
(e)
Finally, your colleague suggests a research project where you again regress \(y_n \sim z_n\), but now you let \(z_n = f(x_n)\) for any function \(f\), where you use a neural network to find the best fit to the data over all possible functions \(f(x_n)\).
Will this result produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?
Do you think this will produce useful predictions for new data? Why or why not?
Since \(f(\boldsymbol{x}_n)\) can be a nonlinear function, you can produce a better fit than regressing on \(\boldsymbol{x}_n\) alone. In fact, as long as no two datapoints share the same \(\boldsymbol{x}_n\), the best possible fit over all functions interpolates the data, achieving zero error; you don't need a neural network to identify this. But having zero training error does not mean the fit is useful for anything: it overfits the training data and will have poor predictive power on new data. This is not a good idea for a research project.
4 Leaving a single datapoint out of regression
This homework problem derives a closed-form expression for the effect of leaving a datapoint out of the regression.
We will use the following result, known as the Woodbury formula (but also by many other names, including the Sherman-Morrison-Woodbury formula). Let \(A\) denote an invertible matrix, and let \(\boldsymbol{u}\) and \(\boldsymbol{v}\) be vectors whose length matches the dimension of \(A\). Then \[
(A + \boldsymbol{u}\boldsymbol{v}^\intercal)^{-1} = A^{-1} - \frac{A^{-1}\boldsymbol{u}\boldsymbol{v}^\intercal A^{-1}}{1 + \boldsymbol{v}^\intercal A^{-1}\boldsymbol{u}}.
\]
We will also use the definition of a “leverage score” from lecture: \(h_n := \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n\). Note that \(h_n = (\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal)_{nn}\) is the \(n\)–th diagonal entry of the projection matrix \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\).
Let \(\hat{\boldsymbol{\beta}}_{-n}\) denote the least-squares estimate of \(\boldsymbol{\beta}\) with datapoint \(n\) left out. Similarly, let \(\boldsymbol{X}_{-n}\) denote the matrix \(\boldsymbol{X}\) with row \(n\) left out, and \(\boldsymbol{Y}_{-n}\) denote the vector \(\boldsymbol{Y}\) with entry \(n\) left out.
a
Show that \(\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n} = \boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) and \(\boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n} = \boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n\).
Solution
This follows from \(\boldsymbol{X}^\intercal\boldsymbol{X}= \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \sum_{n=1}^N\boldsymbol{x}_n y_n\).
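Spelled out, deleting row \(n\) just removes the \(n\)–th term from each sum: \[
\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n} = \sum_{m \neq n} \boldsymbol{x}_m \boldsymbol{x}_m^\intercal = \boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal,
\qquad
\boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n} = \sum_{m \neq n} \boldsymbol{x}_m y_m = \boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n.
\]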
b
Using the Woodbury formula, derive the following expression: \[
(\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} =
(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - h_n}
\]
Solution
Direct application of the formula with \(\boldsymbol{u}= \boldsymbol{x}_n\) and \(\boldsymbol{v}= -\boldsymbol{x}_n\) gives \[
(\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} =
(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} +
\frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n}.
\]
Then recognize the leverage score.
c
Combine (a) and (b) to derive the following explicit expression for \(\hat{\boldsymbol{\beta}}_{-n}\): \[
\hat{\boldsymbol{\beta}}_{-n} = \hat{\boldsymbol{\beta}}- \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n \hat{\varepsilon}_n}{1 - h_n},
\qquad\textrm{where } \hat{\varepsilon}_n := y_n - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}.
\]
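A sketch of the derivation, combining (a) and (b) and writing \(\hat{y}_n = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}\): \[
\begin{aligned}
\hat{\boldsymbol{\beta}}_{-n}
={}& (\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n})^{-1}\boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n}\\
={}& \left[(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}}{1 - h_n}\right](\boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n)\\
={}& \hat{\boldsymbol{\beta}}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n \left(\frac{\hat{y}_n - h_n y_n}{1 - h_n} - y_n\right)\\
={}& \hat{\boldsymbol{\beta}}- \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n \hat{\varepsilon}_n}{1 - h_n}.
\end{aligned}
\]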
d
Let \(\hat{y}_{n,-n} := \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n}\) denote the estimate of \(y_n\) after deleting the \(n\)–th observation. Using (c), derive the following explicit expression for the change in \(\hat{y}_n\) upon deleting the \(n\)–th observation: \[
\hat{y}_{n,-n} - \hat{y}_n = -\frac{h_n \hat{\varepsilon}_n}{1 - h_n}.
\]
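A sketch of the last step, using the expression for \(\hat{\boldsymbol{\beta}}_{-n}\) from (c): \[
\hat{y}_{n,-n} = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n}
= \hat{y}_n - \frac{\boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n \,\hat{\varepsilon}_n}{1 - h_n}
= \hat{y}_n - \frac{h_n \hat{\varepsilon}_n}{1 - h_n},
\qquad\textrm{so}\qquad
\hat{y}_{n,-n} - \hat{y}_n = -\frac{h_n \hat{\varepsilon}_n}{1 - h_n}.
\]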