This homework is due on Gradescope on Friday October 11th at 9pm.
1 Interpretation of transforms of the response
Suppose I have data in which \(n = 1, \ldots, N\) indexes households and \(y_n\) is household \(n\)'s expenditure on food in a time period (so that \(y_n > 0\) for all \(n\)). Suppose that households are randomly selected either to receive food stamps (for which \(x_n = 1\)) or to not receive food stamps (for which \(x_n = 0\)). Also, suppose we measure household income \(z_n\).
(a)
Suppose we regress \(y_n \sim \beta_0 + \beta_1 x_n + \beta_2 z_n\). Let \(\hat{f}(x, z) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 z\) denote the fit.
Using this regression, we might estimate the effect of food stamps on food expenditure by \(\hat{f}(1, z) - \hat{f}(0, z)\). How does this estimate depend on \(z\)?
(b)
Now suppose we regress \(\log(y_n) \sim \gamma_0 + \gamma_1 x_n + \gamma_2 z_n\). Let \(\hat{g}(x, z) = \hat{\gamma}_0 + \hat{\gamma}_1 x + \hat{\gamma}_2 z\) denote the fit.
Using this regression, we might estimate the effect of food stamps on food expenditure by \(\exp(\hat{g}(1, z)) - \exp(\hat{g}(0, z))\). How does this estimate depend on \(z\)?
(c)
The regressions in (a) and (b) make different implicit assumptions about how food stamps affect consumption for a particular household. State these assumptions in ordinary language.
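For concreteness, here is a minimal simulation sketch of the two regressions in (a) and (b); the data-generating process and coefficient values below are arbitrary assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
z = rng.uniform(1.0, 5.0, size=N)   # hypothetical household incomes
x = rng.integers(0, 2, size=N)      # randomly assigned food stamp indicator
# Hypothetical data-generating process, purely for illustration:
y = np.exp(0.5 + 0.3 * x + 0.2 * z + 0.1 * rng.normal(size=N))

# Regression (a): y on x and z.  Regression (b): log(y) on x and z.
X = np.column_stack([np.ones(N), x, z])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
gamma_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

def f_hat(x0, z0):
    return np.array([1.0, x0, z0]) @ beta_hat

def g_hat(x0, z0):
    return np.array([1.0, x0, z0]) @ gamma_hat

# Compare the two estimated "food stamp effects" at a few income levels.
for z0 in (1.0, 3.0, 5.0):
    print(z0, f_hat(1, z0) - f_hat(0, z0),
          np.exp(g_hat(1, z0)) - np.exp(g_hat(0, z0)))
```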
2 Fit and regressors
Given a regression on \(\boldsymbol{X}\) with \(P\) regressors and \(N\) data points, and the corresponding \(\boldsymbol{Y}\), \(\hat{\boldsymbol{Y}}\), and \(\hat{\varepsilon}\), define the following quantities: \[
\begin{aligned}
RSS :={}& \hat{\varepsilon}^\intercal\hat{\varepsilon}& \textrm{(Residual sum of squares)}\\
TSS :={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}& \textrm{(Total sum of squares)}\\
ESS :={}& \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}& \textrm{(Explained sum of squares)}\\
R^2 :={}& \frac{ESS}{TSS}.
\end{aligned}
\]
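As a numerical sanity check of these definitions (the simulated data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 3
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

# OLS fit, fitted values, and residuals
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat
eps_hat = Y - Y_hat

RSS = eps_hat @ eps_hat
TSS = Y @ Y
ESS = Y_hat @ Y_hat
print(RSS + ESS, TSS)        # should agree up to floating-point error
print("R^2 =", ESS / TSS)
```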
Prove that \(RSS + ESS = TSS\).
Express \(R^2\) in terms of \(TSS\) and \(RSS\).
What is \(R^2\) when we include no regressors? (\(P = 0\))
What is \(R^2\) when we include \(N\) linearly independent regressors? (\(P=N\))
Can \(R^2\) ever decrease when we add a regressor? If so, how?
Can \(R^2\) ever stay the same when we add a regressor? If so, how?
Can \(R^2\) ever increase when we add a regressor? If so, how?
Does a high \(R^2\) mean the regression is useful? (You may argue by example.)
Does a low \(R^2\) mean the regression is not useful? (You may argue by example.)
3 Prediction in the bodyfat example
This exercise will use the bodyfat example from the datasets. Suppose we’re interested in predicting bodyfat, which is difficult to measure precisely, using other variables that are easier to measure: Height, Weight, and Abdomen circumference.
(a)
If we do so, we get the following sum of squared errors:
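Below is a minimal sketch of how this regression and its sum of squared errors might be computed; the file name bodyfat.csv and the column names Bodyfat, Height, Weight, and Abdomen are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed file and column names, for illustration only.
df = pd.read_csv("bodyfat.csv")
y = df["Bodyfat"].to_numpy()
X = np.column_stack([np.ones(len(df)),
                     df["Height"].to_numpy(),
                     df["Weight"].to_numpy(),
                     df["Abdomen"].to_numpy()])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
print("Sum of squared errors:", resid @ resid)
```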
Noting that Height, Weight, and Abdomen are on different scales, your colleague suggests that you might get a better fit by normalizing them. But when you do, here’s what happens:
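A minimal sketch of the normalization step, continuing the sketch above (each non-intercept regressor is centered and scaled to unit standard deviation):

```python
# Normalize each non-intercept column: subtract the mean, divide by the standard deviation.
X_norm = X.copy()
X_norm[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

beta_hat_norm, *_ = np.linalg.lstsq(X_norm, y, rcond=None)
resid_norm = y - X_norm @ beta_hat_norm
print("Coefficients:", beta_hat_norm)
print("Sum of squared errors:", resid_norm @ resid_norm)
```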
Our coefficients changed, but our fitted error didn’t change at all.
Explain why the fitted error did not change.
Explain why the coefficients did change.
(b)
Chastened, your colleague suggests that maybe it’s the difference between normalized height and weight that would help us predict. After all, it makes sense that height should only matter relative to weight, and vice versa. So they run the regression on the difference:
The coefficient on the difference could not be estimated, so they instead try the ratio of height to weight, hw_ratio. Our fitted error is different this time, and we could estimate this coefficient.
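A minimal sketch of both attempts, continuing the sketches above; the name hw_ratio follows the text, and the exact construction of the difference and ratio regressors is an assumption for illustration. (Note that np.linalg.lstsq returns a minimum-norm solution even when the design is rank-deficient, so the collinear case will not raise an error.)

```python
# Add the difference of the normalized height and weight columns as a regressor.
hw_diff = X_norm[:, 1] - X_norm[:, 2]
X_diff = np.column_stack([X_norm, hw_diff])

# Add the ratio of (raw) height to weight as a regressor.
hw_ratio = df["Height"].to_numpy() / df["Weight"].to_numpy()
X_ratio = np.column_stack([X_norm, hw_ratio])

for name, X_try in [("difference", X_diff), ("ratio", X_ratio)]:
    beta_try, *_ = np.linalg.lstsq(X_try, y, rcond=None)
    resid_try = y - X_try @ beta_try
    print(name, "SSE:", resid_try @ resid_try)
```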
Explain why we could estimate the coefficient of the ratio of height to weight, but not that of the difference.
Explain why the fitted error changed.
(c)
It happened that, by including the regressor hw_ratio, the fitted error decreased. Your colleague tells you that had it been a bad regressor, the error would have increased. Are they correct?
(d)
Let \(\boldsymbol{x}_n = (\textrm{Abdomen}_n, \textrm{Height}_n, \textrm{Weight}_n)^\intercal\) denote our vector of regressors.
Your colleague suggests a research project where you improve your fit by regressing \(y_n \sim \boldsymbol{z}_n\) for new regressors \(\boldsymbol{z}_n\) of the form \(\boldsymbol{z}_n = \boldsymbol{A}\boldsymbol{x}_n\), where the matrix \(\boldsymbol{A}\) is chosen using a machine learning algorithm.
Will this procedure produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?
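A minimal numerical sketch of this comparison, continuing the sketches above; the random matrix \(\boldsymbol{A}\) below is an arbitrary stand-in for whatever matrix such an algorithm might choose.

```python
# Compare the fit using z_n = A x_n with the fit using x_n directly.
A = np.random.default_rng(2).normal(size=(3, 3))   # arbitrary 3x3 matrix
X_raw = X[:, 1:]                                    # Height, Weight, Abdomen (no intercept)
Z = X_raw @ A.T                                     # z_n = A x_n for every observation n

for name, design in [("x_n", X_raw), ("z_n = A x_n", Z)]:
    design1 = np.column_stack([np.ones(len(y)), design])   # add back an intercept
    b, *_ = np.linalg.lstsq(design1, y, rcond=None)
    r = y - design1 @ b
    print(name, "SSE:", r @ r)
```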
(e)
Finally, your colleague suggests a research project where you again regress \(y_n \sim \boldsymbol{z}_n\), but now you let \(\boldsymbol{z}_n = f(\boldsymbol{x}_n)\) for any function \(f\), where you use a neural network to find the best fit to the data over all possible functions \(f(\boldsymbol{x}_n)\).
Will this procedure produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?
Do you think this procedure will produce useful predictions for new data? Why or why not?
4 Leaving a single datapoint out of regression
This homework problem derives a closed-form expression for the effect of leaving a datapoint out of the regression.
We will use the following result, known as the Woodbury formula (but also by many other names, including the Sherman-Morrison-Woodbury formula). Let \(A\) denote an invertible matrix, and let \(\boldsymbol{u}\) and \(\boldsymbol{v}\) be vectors whose length matches the dimension of \(A\). Then
\[
(A + \boldsymbol{u}\boldsymbol{v}^\intercal)^{-1} = A^{-1} - \frac{A^{-1}\boldsymbol{u}\boldsymbol{v}^\intercal A^{-1}}{1 + \boldsymbol{v}^\intercal A^{-1}\boldsymbol{u}}.
\]
We will also use the definition of a “leverage score” from lecture: \(h_n := \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n\). Note that \(h_n = (\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal)_{nn}\) is the \(n\)–th diagonal entry of the projection matrix \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\).
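As an illustration, here is a minimal sketch, using simulated data, that checks the two expressions for \(h_n\) agree:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 30, 4
X = rng.normal(size=(N, P))
Y = X @ rng.normal(size=P) + rng.normal(size=N)

# Projection ("hat") matrix and its diagonal, the leverage scores.
XtX_inv = np.linalg.inv(X.T @ X)
proj = X @ XtX_inv @ X.T
h = np.diag(proj)

# The same quantity computed row by row: h_n = x_n^T (X^T X)^{-1} x_n.
h_rowwise = np.array([x @ XtX_inv @ x for x in X])
print(np.allclose(h, h_rowwise))   # True
```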
Let \(\hat{\boldsymbol{\beta}}_{-n}\) denote the estimate of \(\boldsymbol{\beta}\) with datapoint \(n\) left out. Similarly, let \(\boldsymbol{X}_{-n}\) denote the \(\boldsymbol{X}\) matrix with row \(n\) left out, and \(\boldsymbol{Y}_{-n}\) denote \(\boldsymbol{Y}\) with entry \(n\) left out.
Let \(\hat{y}_{n,-n} = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n}\) denote the estimate of \(y_n\) after deleting the \(n\)–th observation. Using (c), derive the following explicit expression for the change in \(\hat{y}_n\) upon deleting the \(n\)–th observation:
\[
\hat{y}_n - \hat{y}_{n,-n} = \frac{h_n \hat{\varepsilon}_n}{1 - h_n}.
\]
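Continuing the simulated-data sketch above, one can check this identity by brute force, refitting the regression with each observation deleted:

```python
# Full-data fit, fitted values, and residuals.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat
eps_hat = Y - Y_hat

for n in range(N):
    # Refit with observation n deleted and predict y_n from the reduced fit.
    X_minus = np.delete(X, n, axis=0)
    Y_minus = np.delete(Y, n)
    beta_minus, *_ = np.linalg.lstsq(X_minus, Y_minus, rcond=None)
    y_hat_loo = X[n] @ beta_minus
    # Compare with the closed-form expression above.
    closed_form = h[n] * eps_hat[n] / (1 - h[n])
    assert np.isclose(Y_hat[n] - y_hat_loo, closed_form)
print("Leave-one-out identity verified for all n.")
```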