STAT151A Homework 3.

Author

Your name here

This homework is due on Gradescope on Friday October 11th at 9pm.

1 Interpretation of transforms of the response

Suppose I have data where $n = 1, \ldots, N$ indexes households, $y_n$ is the expenditure on food in a time period (so that $y_n > 0$ for all $n$). Suppose that households are randomly selected either to receive food stamps (for which $x_n = 1$) or not to receive food stamps (for which $x_n = 0$). Also, suppose we measure household income $z_n$.

(a)

Suppose we regress $y_n \sim \beta_0 + \beta_1 x_n + \beta_2 z_n$. Let $\hat f(x, z) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 z$ denote the fit.

Using this regression, we might estimate the effect of food stamps on food expenditure by $\hat f(1, z) - \hat f(0, z)$. How does this estimate depend on $z$?

(b)

Suppose we regress $\log y_n \sim \gamma_0 + \gamma_1 x_n + \gamma_2 z_n$. Let $\hat g(x, z) = \hat\gamma_0 + \hat\gamma_1 x + \hat\gamma_2 z$.

Using this regression, we might estimate the effect of food stamps on food expenditure by $\exp(\hat g(1, z)) - \exp(\hat g(0, z))$. How does this estimate depend on $z$?

(c)

The regressions in (a) and (b) make different implicit assumptions about how food stamps affect consumption for a particular household. State these assumptions in ordinary language.

Solutions

(a) It doesn't depend on $z$:

$$\hat f(1, z) - \hat f(0, z) = \hat\beta_0 + \hat\beta_1 + \hat\beta_2 z - (\hat\beta_0 + \hat\beta_2 z) = \hat\beta_1.$$

(b) It is proportional to $\exp(\hat\gamma_2 z)$:

$$\exp(\hat g(1, z)) - \exp(\hat g(0, z)) = \exp(\hat\gamma_0 + \hat\gamma_1 + \hat\gamma_2 z) - \exp(\hat\gamma_0 + \hat\gamma_2 z) = \left(\exp(\hat\gamma_1) - 1\right) \exp(\hat\gamma_0) \exp(\hat\gamma_2 z).$$

(c)

In the first model, food stamps have an additive effect on expenditure. In the second model (with the log response), food stamps have a multiplicative effect on expenditure.
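
For intuition, here is a minimal R sketch on simulated data (all variable names and parameter values are hypothetical) contrasting the two estimated effects:

set.seed(151)
N <- 500
z <- rnorm(N)                                   # household income (standardized, simulated)
x <- rbinom(N, 1, 0.5)                          # randomized food-stamp indicator
y <- exp(0.5 + 0.3 * x + 0.4 * z + rnorm(N, sd = 0.2))   # positive expenditure
fit_level <- lm(y ~ x + z)                      # the model from part (a)
fit_log   <- lm(log(y) ~ x + z)                 # the model from part (b)
f_hat <- function(x, z) predict(fit_level, newdata = data.frame(x = x, z = z))
g_hat <- function(x, z) predict(fit_log,   newdata = data.frame(x = x, z = z))
# Estimated effects at two income levels, z = -1 and z = 1:
f_hat(1, c(-1, 1)) - f_hat(0, c(-1, 1))                    # identical at both z values
exp(g_hat(1, c(-1, 1))) - exp(g_hat(0, c(-1, 1)))          # scales with exp(gamma2 * z)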

2 Fit and regressors

Given a regression on $X$ with $P$ regressors and $N$ data points, and the corresponding $Y$, $\hat Y$, and $\hat\varepsilon$, define the following quantities:

$$\begin{aligned}
\text{RSS} &:= \hat\varepsilon^\top \hat\varepsilon && \text{(Residual sum of squares)} \\
\text{TSS} &:= Y^\top Y && \text{(Total sum of squares)} \\
\text{ESS} &:= \hat Y^\top \hat Y && \text{(Explained sum of squares)} \\
R^2 &:= \frac{\text{ESS}}{\text{TSS}}.
\end{aligned}$$

  1. Prove that $\text{RSS} + \text{ESS} = \text{TSS}$.
  2. Express $R^2$ in terms of TSS and RSS.
  3. What is $R^2$ when we include no regressors ($P = 0$)?
  4. What is $R^2$ when we include $N$ linearly independent regressors ($P = N$)?
  5. Can $R^2$ ever decrease when we add a regressor? If so, how?
  6. Can $R^2$ ever stay the same when we add a regressor? If so, how?
  7. Can $R^2$ ever increase when we add a regressor? If so, how?
  8. Does a high $R^2$ mean the regression is useful? (You may argue by example.)
  9. Does a low $R^2$ mean the regression is not useful? (You may argue by example.)

Solutions:

  1. This follows from $\hat Y^\top \hat\varepsilon = 0$: writing $Y = \hat Y + \hat\varepsilon$ gives $Y^\top Y = \hat Y^\top \hat Y + 2 \hat Y^\top \hat\varepsilon + \hat\varepsilon^\top \hat\varepsilon = \text{ESS} + \text{RSS}$. (A numerical check appears in the sketch after this list.)
  2. $R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$.
  3. $R^2 = 0$ when we include no regressors, since then $\hat Y = 0$ and $\text{ESS} = 0$.
  4. $R^2 = 1$ when we include $N$ linearly independent regressors, since then $\hat Y = Y$ and $\text{RSS} = 0$.
  5. No, it cannot, since adding a regressor projects $Y$ onto the same or a larger subspace, so RSS cannot increase.
  6. Yes, if you add a regressor column that is collinear with the existing columns.
  7. Yes, if you add a linearly independent regressor column.
  8. No, you might overfit; for example, with $P = N$ linearly independent noise regressors, $R^2 = 1$.
  9. No, you might have a low signal-to-noise ratio, so that even a useful regression explains only a small fraction of the variation.
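
Here is a quick numerical check of the decomposition in item 1 (a sketch on simulated data, using the uncentered definitions above):

set.seed(151)
N <- 100
X <- cbind(1, rnorm(N), rnorm(N))               # intercept plus two regressors
Y <- drop(X %*% c(1, 2, -1)) + rnorm(N)
fit <- lm(Y ~ X - 1)                            # X already contains the intercept column
RSS <- sum(residuals(fit)^2)
ESS <- sum(fitted(fit)^2)
TSS <- sum(Y^2)
all.equal(RSS + ESS, TSS)                       # should be TRUE: fitted and residuals are orthogonal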

3 Prediction in the bodyfat example

This exercise will use the bodyfat example from the datasets. Suppose we're interested in predicting bodyfat, which is difficult to measure precisely, using other variables that are easier to measure: Height, Weight, and Abdomen circumference.

If we do so, we get the following coefficients and mean squared error:

reg <- lm(bodyfat ~ Abdomen + Height + Weight, bodyfat_df)
print(reg$coefficients)
(Intercept)     Abdomen      Height      Weight 
-36.6147193   0.9515631  -0.1270307  -0.1307606 
print(sprintf("Error: %f", mean(reg$residuals^2)))
[1] "Error: 19.456161"

(a)

Noting that Height, Weight, and Abdomen are on different scales, your colleague suggests that you might get a better fit by normalizing them. But when you do, here's what happened:

bodyfat_df <- bodyfat_df %>%
    mutate(height_norm=(Height - mean(Height)) / sd(Height),
           weight_norm=(Weight - mean(Weight)) / sd(Weight),
           abdomen_norm=(Abdomen - mean(Abdomen)) / sd(Abdomen))
reg_norm <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm, bodyfat_df)
print(reg_norm$coefficients)
 (Intercept) abdomen_norm  height_norm  weight_norm 
   19.150794    10.260778    -0.465295    -3.842945 
print(sprintf("Error: %f", mean(reg_norm$residuals^2)))
[1] "Error: 19.456161"

Our coefficients changed, but our fitted error didn’t change at all.

  • Explain why the fitted error did not change.
  • Explain why the coefficients did change.

The normalized regressors, together with the intercept, are an invertible linear (affine) transform of the original regressors, so the column space of $X$ is unchanged. This means $\hat Y$ (and hence the fitted error) does not change, but the coefficients are re-expressed in the new basis and therefore do change.
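
As a quick check (assuming bodyfat_df, reg, and reg_norm are defined as above), the two fits produce identical fitted values:

# Both design matrices span the same column space, so the projections agree.
all.equal(as.numeric(fitted(reg)), as.numeric(fitted(reg_norm)))   # should be TRUE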

(b)

Chastened, your colleague suggests that maybe it's the difference between normalized height and weight that would help us predict. After all, it makes sense that height should only matter relative to weight, and vice versa. So they run the regression on the difference:

bodyfat_df <- bodyfat_df %>%
    mutate(hw_diff = height_norm - weight_norm)
reg_diff <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + hw_diff, bodyfat_df)
print(reg_diff$coefficients)
 (Intercept) abdomen_norm  height_norm  weight_norm      hw_diff 
   19.150794    10.260778    -0.465295    -3.842945           NA 
print(sprintf("Error: %f", mean(reg_diff$residuals^2)))
[1] "Error: 19.456161"

Now, our fitted error didn’t change at all, but our difference coefficient wasn’t even estimated.

  • Explain why the fitted error did not change.
  • Explain why the difference coefficient was not estimated by R.

The column hw_diff is an exact linear combination of height_norm and weight_norm, so adding it does not change the column space of $X$; hence $\hat Y$ and the fitted error are unchanged. However, $X^\top X$ is now singular, so $\hat\beta$ is not uniquely defined, and R reports NA for coefficients that it cannot estimate.
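
A quick way to see the rank deficiency in R (assuming the data frame and reg_diff constructed above):

# hw_diff is an exact linear combination of existing columns, so the design
# matrix gains a column but not a dimension.
alias(reg_diff)                       # reports hw_diff = height_norm - weight_norm
qr(model.matrix(reg_diff))$rank       # 4 (intercept + 3 independent columns), not 5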

(c)

Finally, your colleague suggests regressing instead on the ratio of normalized height to normalized weight. Here are the results:

bodyfat_df <- bodyfat_df %>%
    mutate(hw_ratio = height_norm / weight_norm)
reg_ratio <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + hw_ratio, bodyfat_df)
print(reg_ratio$coefficients)
 (Intercept) abdomen_norm  height_norm  weight_norm     hw_ratio 
19.149475485 10.271751693 -0.478683698 -3.848561820  0.009057317 
print(sprintf("Error: %f", mean(reg_ratio$residuals^2)))
[1] "Error: 19.423560"

Our fitted error is different this time, and we could estimate this coefficient.

  • Explain why we could estimate the coefficient of the ratio of height to weight, but not the difference.
  • Explain why the fitted error changed.
  • It happened that, by including the regressor hw_ratio, the fitted error decreased. Your colleague tells you that had it been a bad regressor, the error would have increased. Are they correct?

You can estimate the ratio coefficient because the ratio is a nonlinear transformation of the regressors, so it is not exactly collinear with them. Because this new regressor enlarges the column space, the fit can change. Including another regressor, no matter how poorly it is associated with the response, can never make the in-sample error increase. Your colleague is incorrect.
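
A small check of the last claim (a sketch; the noise column is purely artificial):

# Adding even a pure-noise regressor cannot increase the in-sample error, because
# the new fit projects onto a (weakly) larger column space.
set.seed(151)
bodyfat_df$noise <- rnorm(nrow(bodyfat_df))
reg_noise <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + noise, bodyfat_df)
mean(reg_noise$residuals^2) <= mean(reg_norm$residuals^2)   # should be TRUE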

(d)

Let $x_n = (\text{Abdomen}_n, \text{Height}_n, \text{Weight}_n)^\top$ denote our set of regressors.

Your colleague suggests a research project where you improve your fit by regressing $y_n \sim z_n$ for new regressors $z_n$ of the form $z_n = A x_n$, where the matrix $A$ is chosen using a machine learning algorithm.

  • Will this produce a better fit to the data than simply regressing $y_n \sim x_n$? Why or why not?

Since $A x_n$ is a linear combination of the entries of $x_n$, regressing on $z_n = A x_n$ can never give a better fit than regressing on $x_n$ itself. (It can give a worse fit if $A$ has lower rank than the number of regressors.) Searching over all possible $A$ will therefore never improve on $y_n \sim x_n$. This is not a good idea for a research project.
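
A sketch illustrating the point, with an arbitrary random matrix A standing in for the "learned" one:

# The columns of Z = X %*% t(A) lie in the column space of X, so regressing on Z
# can never beat regressing on X itself.
set.seed(151)
X <- as.matrix(bodyfat_df[, c("Abdomen", "Height", "Weight")])
A <- matrix(rnorm(9), 3, 3)                      # stand-in for a "learned" A
Z <- X %*% t(A)                                  # row n of Z is A %*% x_n
rss_x <- sum(lm(bodyfat_df$bodyfat ~ X)$residuals^2)
rss_z <- sum(lm(bodyfat_df$bodyfat ~ Z)$residuals^2)
rss_z >= rss_x                                   # should be TRUE (equal when A is invertible)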

(e)

Finally, your colleague suggests a research project where you again regress $y_n \sim z_n$, but now you let $z_n = f(x_n)$ for any function $f$, where you use a neural network to find the best fit to the data over all possible functions $f(x_n)$.

  • Will this produce a better fit to the data than simply regressing $y_n \sim x_n$? Why or why not?
  • Do you think this will produce useful predictions for new data? Why or why not?

Since $f(x_n)$ can be a nonlinear function of $x_n$, this can produce a better fit than regressing on $x_n$ alone. In fact, the best possible fit among all functions simply memorizes the data, fitting it perfectly with zero error; you don't need a neural network to find it. Having zero training error does not mean the fit is useful for anything: such a model overfits the training data and typically predicts poorly on new data. This is not a good idea for a research project.
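
To make the zero-error point concrete, here is a sketch (it assumes the regressor vectors $x_n$ are all distinct, so a function can map each one to its own response):

# A fully flexible f can simply memorize the data: define f(x_n) := y_n.
# The resulting "feature" fits perfectly in-sample but has no predictive value.
bodyfat_df$z_memorize <- bodyfat_df$bodyfat
reg_flex <- lm(bodyfat ~ z_memorize, bodyfat_df)
mean(reg_flex$residuals^2)                       # essentially zero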

4 Leaving a single datapoint out of regression

This homework problem derives a closed-form expression for the effect of leaving a datapoint out of the regression.

We will use the following result, known as the Woodbury formula (but also by many other names, including the Sherman-Morrison-Woodbury formula). Let $A$ denote an invertible matrix, and let $u$ and $v$ be vectors whose length matches the dimension of $A$. Then

$$(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}.$$
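
As a sanity check, here is a quick numerical verification of this identity in R (a sketch on a small random matrix):

# Verify the rank-one Woodbury / Sherman-Morrison identity numerically.
set.seed(151)
A <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)    # a generic invertible matrix
u <- rnorm(5); v <- rnorm(5)
A_inv <- solve(A)
lhs <- solve(A + u %*% t(v))
rhs <- A_inv - (A_inv %*% u %*% t(v) %*% A_inv) / as.numeric(1 + t(v) %*% A_inv %*% u)
all.equal(lhs, rhs)                                  # should be TRUE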

We will also use the definition of a "leverage score" from lecture, $h_n := x_n^\top (X^\top X)^{-1} x_n$. Note that $h_n = \left(X (X^\top X)^{-1} X^\top\right)_{nn}$ is the $n$-th diagonal entry of the projection matrix $P_X$.

Let $\hat\beta_{-n}$ denote the estimate of $\beta$ with datapoint $n$ left out. Similarly, let $X_{-n}$ denote the $X$ matrix with row $n$ left out, and $Y_{-n}$ denote the $Y$ vector with row $n$ left out.

a

Prove that

$$\hat\beta_{-n} = (X_{-n}^\top X_{-n})^{-1} X_{-n}^\top Y_{-n} = (X^\top X - x_n x_n^\top)^{-1} (X^\top Y - x_n y_n).$$

Solution

This follows from $X^\top X = \sum_{n=1}^N x_n x_n^\top$ and $X^\top Y = \sum_{n=1}^N x_n y_n$: leaving out row $n$ simply removes the $n$-th term from each sum.

b

Using the Woodbury formula, derive the following expression:

$$(X^\top X - x_n x_n^\top)^{-1} = (X^\top X)^{-1} + \frac{(X^\top X)^{-1} x_n x_n^\top (X^\top X)^{-1}}{1 - h_n}.$$

Solution

Direct application of the formula with $A = X^\top X$, $u = -x_n$, and $v = x_n$ gives

$$(X^\top X - x_n x_n^\top)^{-1} = (X^\top X)^{-1} + \frac{(X^\top X)^{-1} x_n x_n^\top (X^\top X)^{-1}}{1 - x_n^\top (X^\top X)^{-1} x_n}.$$

Then recognize the leverage score $h_n = x_n^\top (X^\top X)^{-1} x_n$ in the denominator.

c

Combine (a) and (b) to derive the following explicit expression for $\hat\beta_{-n}$:

$$\hat\beta_{-n} = \hat\beta - (X^\top X)^{-1} x_n \frac{1}{1 - h_n} \hat\varepsilon_n.$$

Solution

We have

$$(X^\top X - x_n x_n^\top)^{-1} X^\top Y = \hat\beta + \frac{(X^\top X)^{-1} x_n x_n^\top \hat\beta}{1 - h_n}$$

and

$$(X^\top X - x_n x_n^\top)^{-1} x_n y_n = (X^\top X)^{-1} x_n y_n + \frac{(X^\top X)^{-1} x_n h_n}{1 - h_n} y_n.$$

Combining,

$$\begin{aligned}
\hat\beta_{-n} &= \hat\beta + (X^\top X)^{-1} x_n \left( \frac{x_n^\top \hat\beta}{1 - h_n} - y_n - \frac{h_n}{1 - h_n} y_n \right) \\
&= \hat\beta + (X^\top X)^{-1} x_n \left( \frac{1}{1 - h_n} \hat y_n - \left(1 + \frac{h_n}{1 - h_n}\right) y_n \right) \\
&= \hat\beta + (X^\top X)^{-1} x_n \left( \frac{1}{1 - h_n} \hat y_n - \frac{1}{1 - h_n} y_n \right) \\
&= \hat\beta - (X^\top X)^{-1} x_n \frac{1}{1 - h_n} \hat\varepsilon_n.
\end{aligned}$$

d

Let $\hat y_{n,-n} := x_n^\top \hat\beta_{-n}$ denote the estimate of $y_n$ after deleting the $n$-th observation. Using (c), derive the following explicit expression for the change in $\hat y_n$ upon deleting the $n$-th observation:

$$\hat y_{n,-n} - \hat y_n = x_n^\top \hat\beta_{-n} - x_n^\top \hat\beta = -\frac{h_n}{1 - h_n} \hat\varepsilon_n.$$

This shows that the effect of deleting observation $n$ on $\hat y_n$ is large only if both the residual and the leverage score are large.

Solution

Taking $x_n^\top$ times each side of the result from (c) gives

$$\hat y_{n,-n} = x_n^\top \hat\beta_{-n} = x_n^\top \hat\beta - x_n^\top (X^\top X)^{-1} x_n \frac{1}{1 - h_n} \hat\varepsilon_n = \hat y_n - \frac{h_n}{1 - h_n} \hat\varepsilon_n.$$

The result follows by subtracting $\hat y_n$ from both sides.
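
Finally, here is a quick numerical check of the identities from (c) and (d) on simulated data (a sketch; all names and values are arbitrary):

# Compare the closed-form leave-one-out expressions against an explicit refit.
set.seed(151)
N <- 30; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
Y <- drop(X %*% c(1, -2, 0.5)) + rnorm(N)
fit <- lm(Y ~ X - 1)
h   <- hatvalues(fit)                             # leverage scores h_n
eps <- residuals(fit)                             # residuals eps_n
n <- 7                                            # any observation index
beta_refit   <- coef(lm(Y[-n] ~ X[-n, ] - 1))     # refit without observation n
beta_formula <- coef(fit) - solve(crossprod(X)) %*% X[n, ] * eps[n] / (1 - h[n])
all.equal(as.numeric(beta_refit), as.numeric(beta_formula))      # should be TRUE
# Change in the fitted value at point n, computed two ways:
as.numeric(X[n, ] %*% beta_formula) - fitted(fit)[n]
-h[n] / (1 - h[n]) * eps[n]                       # matches the line above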