STAT151A Homework 4

Author

Your name here

1 FWL and F-tests for subsets of regressors

In this homework problem, we use the FWL theorem to connect our F-test with a more common (but less general) form you will find in many textbooks. This question has been liberally adapted from Section 4.4 (“Tests of several restrictions”) of Davidson, MacKinnon, et al. (2004).

Suppose we have a normal regression model: \[ \boldsymbol{Y}= \boldsymbol{X}_A {\boldsymbol{\beta}^{*}}+ \boldsymbol{X}_B \boldsymbol{\gamma}^*+ \boldsymbol{\varepsilon} \quad\textrm{where}\quad \boldsymbol{\varepsilon}\sim \mathcal{N}\left(0, \sigma^2 \boldsymbol{I}_N\right), \] and want to test that the null hypotheiss that \(\boldsymbol{\gamma}= \boldsymbol{0}\). That is, we are asking whether we can reject the hypohtesis that the extra regressors \(\boldsymbol{X}_B\) add anything to the explanatory power of the model.

Define the two regressions: \[ \begin{aligned} \boldsymbol{Y}={}& \boldsymbol{X}_A \hat{\boldsymbol{\beta}}_A + \hat{\boldsymbol{\varepsilon}}_A & \textrm{(``restricted'')} \\ \boldsymbol{Y}={}& \boldsymbol{X}_A \hat{\boldsymbol{\beta}}_{AB} + \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB} & \textrm{(``unrestricted'')}. \end{aligned} \] Let \(\boldsymbol{X}_A\) have \(P\) columns and \(\boldsymbol{X}_B\) have \(Q\) columns.

(a) Since \[ \begin{pmatrix} \hat{\boldsymbol{\beta}}\\ \hat{\boldsymbol{\gamma}} \end{pmatrix} - \begin{pmatrix} {\boldsymbol{\beta}^{*}}\\ \boldsymbol{\gamma}^* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{0}\\ \boldsymbol{0} \end{pmatrix}, \sigma^2 \begin{pmatrix} \boldsymbol{X}_A^\intercal\boldsymbol{X}_A & \boldsymbol{X}_A^\intercal\boldsymbol{X}_B \\ \boldsymbol{X}_B^\intercal\boldsymbol{X}_A & \boldsymbol{X}_B^\intercal\boldsymbol{X}_B \\ \end{pmatrix}^{-1} \right). \] Note that \[ \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \boldsymbol{\beta}\\ \boldsymbol{\gamma} \end{pmatrix} = \boldsymbol{\gamma}, \] and so \[ \hat{\boldsymbol{\gamma}}- \boldsymbol{\gamma}^*\sim \mathcal{N}\left( \boldsymbol{0}, \sigma^2 \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \boldsymbol{X}_A^\intercal\boldsymbol{X}_A & \boldsymbol{X}_A^\intercal\boldsymbol{X}_B \\ \boldsymbol{X}_B^\intercal\boldsymbol{X}_A & \boldsymbol{X}_B^\intercal\boldsymbol{X}_B \\ \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{0}_{P \times Q} \\ \boldsymbol{I}_Q \end{pmatrix} \right). \]

Using these, show that the standard F-test is given by the test statistic \[ \phi = \frac{ \hat{\boldsymbol{\gamma}}^\intercal \left( \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \boldsymbol{X}_A^\intercal\boldsymbol{X}_A & \boldsymbol{X}_A^\intercal\boldsymbol{X}_B \\ \boldsymbol{X}_B^\intercal\boldsymbol{X}_A & \boldsymbol{X}_B^\intercal\boldsymbol{X}_B \\ \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{0}_{P \times Q} \\ \boldsymbol{I}_Q \end{pmatrix} \right)^{-1} \hat{\boldsymbol{\gamma}}/ Q }{\hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} / (N - P - Q)}. \] Our goal is to simplify this complicated expression.

(b) Using the fact that \(\hat{\boldsymbol{\varepsilon}}_A = \underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{Y}\), show that \[ \hat{\boldsymbol{\varepsilon}}_A = \underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB}, \] and so \[ \hat{\boldsymbol{\varepsilon}}_A^\intercal\hat{\boldsymbol{\varepsilon}}_A - \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} = \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}. \]

(c) Using the FWL theorem, the unrestricted regression is identical to the regression \[ \hat{\boldsymbol{Y}}_{AB} = \left(\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \right) \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB}. \]

It follows that \[ \hat{\boldsymbol{\gamma}}- \boldsymbol{\gamma}^*\sim \mathcal{N}\left(\boldsymbol{0}, \sigma^2 \left(\boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \right)^{-1} \right). \] From this, show that \[ \boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B = \left( \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \boldsymbol{X}_A^\intercal\boldsymbol{X}_A & \boldsymbol{X}_A^\intercal\boldsymbol{X}_B \\ \boldsymbol{X}_B^\intercal\boldsymbol{X}_A & \boldsymbol{X}_B^\intercal\boldsymbol{X}_B \\ \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{0}_{P \times Q} \\ \boldsymbol{I}_Q \end{pmatrix} \right)^{-1}. \]

Note: The Schur complement is another way to derive this identity.

(d) Plug in your results from (b) and (c) into (a) to derive the test statistic \[ \phi = \frac{\left( \hat{\boldsymbol{\varepsilon}}_A^\intercal\hat{\boldsymbol{\varepsilon}}_A - \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB}\right) / Q}{\hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} / (N - P - Q)}. \]

This is sometimes re-written in terms of \[ \begin{aligned} RSSR ={}& \hat{\boldsymbol{\varepsilon}}_A^\intercal\hat{\boldsymbol{\varepsilon}}_A & \textrm{(restricted sum of squared residuals)} \\ USSR ={}& \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} & \textrm{(unrestricted sum of squared residuals)} \end{aligned} \] as \[ \phi = \frac{(RSSR - USSR) / Q}{USSR / (N - P - Q)}. \] Note that when you include new regressors, the SSR must decrease, so \(USSR < RSSR\). This simple formula shows that the F-test is checking whether the relative decrease in sum of squared residuals when you include the new regressors is large relative to the appropriate F-distribution.

Solutions

(a) Plug the following expressions into the definintion of the F-statistic given in the lecture notes:

\[ \boldsymbol{A}= \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \quad a= \boldsymbol{0}_Q \quad \boldsymbol{X}^\intercal\boldsymbol{X}= \begin{pmatrix} \boldsymbol{X}_A^\intercal\boldsymbol{X}_A & \boldsymbol{X}_A^\intercal\boldsymbol{X}_B \\ \boldsymbol{X}_B^\intercal\boldsymbol{X}_A & \boldsymbol{X}_B^\intercal\boldsymbol{X}_B \\ \end{pmatrix} \quad \hat{\sigma}^2 = \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} / (N - P - Q). \]

There is nothing wrong with just using this expression for testing. We are trying to get a simpler formula in order to have some intuition about F-tests of subsets of regressors.

(b)

\[ \begin{aligned} \hat{\boldsymbol{\varepsilon}}_A ={}& \underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{Y} \\={}& \underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \left( \boldsymbol{X}_A \hat{\boldsymbol{\beta}}+ \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB} \right) \\={}& \underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB}, \end{aligned} \] since \(\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_A = \boldsymbol{0}\) and \(\hat{\boldsymbol{\varepsilon}}_{AB}\) is already orthogonal to \(\boldsymbol{X}_A\). Then

\[ \begin{aligned} \hat{\boldsymbol{\varepsilon}}_A^\intercal\hat{\boldsymbol{\varepsilon}}_A ={}& \left(\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB} \right)^\intercal \left(\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB} \right) \\={}& \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} + \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \hat{\boldsymbol{\varepsilon}}_{AB} + \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}} \\={}& \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} + \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{X}_B^\intercal\hat{\boldsymbol{\varepsilon}}_{AB} + \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\boldsymbol{X}_B \hat{\boldsymbol{\gamma}} \\={}& \hat{\boldsymbol{\gamma}}^\intercal\boldsymbol{X}_B^\intercal\underset{\boldsymbol{X}_A^\perp}{\boldsymbol{P}} \boldsymbol{X}_B \hat{\boldsymbol{\gamma}}+ \hat{\boldsymbol{\varepsilon}}_{AB}^\intercal\hat{\boldsymbol{\varepsilon}}_{AB}, \end{aligned} \] because \(\hat{\boldsymbol{\varepsilon}}_{AB}\) is orthogonal to \(\boldsymbol{X}_B\).

(c) We know that

\[ \hat{\gamma}= \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\beta}}\\ \hat{\boldsymbol{\gamma}} \end{pmatrix}, \] and so by the properties of the multivariate normal distribution, \[ \hat{\boldsymbol{\gamma}}- \boldsymbol{\gamma}^*= \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\beta}}- {\boldsymbol{\beta}^{*}}\\ \hat{\boldsymbol{\gamma}}- \boldsymbol{\gamma}^* \end{pmatrix} \sim \mathcal{N}\left( \boldsymbol{0}, \sigma^2 \begin{pmatrix} \boldsymbol{0}_{Q \times P} & \boldsymbol{I}_Q \end{pmatrix} \begin{pmatrix} \boldsymbol{X}_A^\intercal\boldsymbol{X}_A & \boldsymbol{X}_A^\intercal\boldsymbol{X}_B \\ \boldsymbol{X}_B^\intercal\boldsymbol{X}_A & \boldsymbol{X}_B^\intercal\boldsymbol{X}_B \\ \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{0}_{P \times Q} \\ \boldsymbol{I}_Q \end{pmatrix} \right). \] We have derived the same distribution two different ways, so the covariances must be equal.

Note that this proof doesn’t depend on actually assuming that the data is normally distributed — it’s a mathematical proof of the identity of two matrices, using the fact that if the data were normally distributed, the matrices would have to coincide. As mentioned in the problem, the same result can be derived with Schur complements, which is in turn another way to derive the FWL theorem.

(d) This is just plugging in.

2 Omitted variable bias

This homework problem is taken from Cinelli and Hazlett (2020).

Suppose we believe that the following model is correctly specified: \[ y= \beta_0 + \boldsymbol{\beta}^\intercal\boldsymbol{x}+ \boldsymbol{\gamma}z+ \tau d+ \varepsilon, \] where

\(\boldsymbol{x}\) is an observed vector of covariates,
\(z\) is an unobserved (omitted) scalar-valued covariate,
\(d\) is a binary indicator of a causal intervention, and
\(\varepsilon\) is a mean-zero error term that is independent of \(\boldsymbol{x}\), \(z\), and \(d\).

We’re interested in \(\tau\), the causal effect of \(d\), but are concerned about how much our OLS estimate is biased by the fact that we have excluded the variable \(z\). Let \(\boldsymbol{X}\), \(\boldsymbol{Z}\), and \(\boldsymbol{D}\) denote the regressor matrices. Write

\[ \begin{aligned} \boldsymbol{Y}={}& \boldsymbol{X}\hat{\boldsymbol{\beta}}+ \boldsymbol{Z}\hat{\gamma}+ \boldsymbol{D}\hat{\tau}+ \hat{\boldsymbol{\varepsilon}} \quad\textrm{(infeasible full regression)}\\ \boldsymbol{Y}={}& \boldsymbol{X}\hat{\boldsymbol{\beta}}_{\mathrm{res}}+ \boldsymbol{D}\hat{\tau}_{\mathrm{res}}+ \hat{\boldsymbol{\varepsilon}}_{\mathrm{res}} \quad\textrm{(feasible restricted regression)}\\ \end{aligned} \]

Assume all matrices are full rank, and that that \(N\) is large enough that, if we had run the full regression that \(\hat{\tau}\) would be accurate enough for practical purposes, so we would like to know \(\hat{\tau}\). But since we don’t observe \(Z\), we can only estimate \(\hat{\tau}_{\mathrm{res}}\). We are interested in estimating / understanding the omitted variable bias,

\[ \widehat{\mathrm{Bias}} := \hat{\tau}_{\mathrm{res}}- \hat{\tau}. \]

Define the orthogonal projection \[ \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} = (\boldsymbol{I}_N - \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal). \]

(a) Using the FWL theorem, show that

\[ \hat{\tau}_{\mathrm{res}}= \frac{ \left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right)^\intercal\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Y}\right) }{ \left\Vert\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right\Vert_2^2 }. \]

(b) Plugging in the full regression for \(\boldsymbol{Y}\) into (a), show that

\[ \hat{\tau}_{\mathrm{res}}= \frac{\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right)^\intercal \left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Z}\right)}{\left\Vert\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right\Vert_2^2} \hat{\gamma}+ \hat{\tau}. \]

Hint: \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{X}= \boldsymbol{0}\) and \(\boldsymbol{D}^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\).

(c) Consider the regression \(z\sim \delta d+ \boldsymbol{x}^\intercal\boldsymbol{\alpha}\), with corresponding OLS estimates \(\hat{\delta}\) and \(\hat{\boldsymbol{a}}\). Show that

\[ \hat{\delta}= \frac{\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right)^\intercal\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Z}\right)}{\left\Vert\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right\Vert_2^2}. \]

(d) Plug (c) into (b) to get the very nice formula

\[ \widehat{\mathrm{Bias}} = \hat{\tau}_{\mathrm{res}}- \hat{\tau}= \hat{\delta}\hat{\gamma}. \]

(e) Suppose that we believe that the model is correctly specified, and that \(y\) is independent of \(z\) — that is, the omitted variable doesn’t affect the outcome. What does this imply about the omitted variable bias?

(f) Suppose we believe that the model is correctly specified, and that the treatment \(d\) is independent of \(z\) — that is, the omitted variable doesn’t affect whether or not an observation receives treatment. What does this imply about the omitted variable bias?

(g) In ordinary lanuage, what kinds of omitted variables do we need to worry about?

Solutions

(a) The FWL theorem gives have that \(\hat{\tau}_{\mathrm{res}}\) can be derived from the regresssion \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Y}\sim \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\tau\), so \[ \hat{\tau}_{\mathrm{res}}= \left( \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right)^{-1} \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Y}. \] The result then follows because \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} = \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}\) and \(\left\Vert\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right\Vert_2^2\) is a scalar.

(b) We just plug in for \(\boldsymbol{Y}\) giving

\[ \begin{aligned} \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Y}={}& \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \left( \boldsymbol{X}\hat{\boldsymbol{\beta}}+ \boldsymbol{Z}\hat{\gamma}+ \boldsymbol{D}\hat{\tau}+ \hat{\boldsymbol{\varepsilon}} \right) \\={}& \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{X}\hat{\boldsymbol{\beta}}+ \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Z}\hat{\gamma}+ \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\hat{\tau}+ \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \hat{\boldsymbol{\varepsilon}} \\={}& \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Z}\hat{\gamma}+ \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\hat{\tau}+ \boldsymbol{D}^\intercal\hat{\boldsymbol{\varepsilon}}. \\={}& \boldsymbol{D}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{Z}\hat{\gamma}+ \left\Vert\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{D}\right\Vert_2^2 \hat{\tau}. \end{aligned} \]

Then plug in to (a).

(c) This is just another application of the FWL theorem.

(d) This is just plugging in.

(e) If \(y\) is independent of \(z\), then we expect the true \(\gamma = 0\), and with a large amount of data \(\hat{\gamma}\approx 0\). Plugging into (d), we see that the omitted variable bias should be approximately zero.

(f) Reasoning as in (e), we expect the true \(\delta = 0\), and so \(\hat{\delta}\approx 0\). Plugging into (d), we see that the omitted variable bias should again be approximately zero.

(g) We need to worry about ommitted variables that are associated with both the response variable and the covariate whose coefficient we are interested in.

3 Fixed effects and the FWL theorem

Suppose we are interested in measuring the effect of class size on teaching evalutaions. Let \(y_n\) denote the teaching evaluation submitted by a particular student in a particular class and let \(x_n\) denote the class size for the \(n\)–th teaching evaluation.

Suppose there are \(K\) students, and that each row was submitted by exactly one of the \(K\) students. Suppose also that each student submitted a teaching evaluation for each class they took, so that a particular student contributes to multiple rows. Let \(z_n\) denote a one–hot indicator recording which student submitted the \(n\)–th teaching evaluation. By \(k(n)\), denote the index of the student who submitted row \(n\). (For example, if student \(3\) wrote the review in row \(10\), then \(k(10) = 3\).)

We will be interested in \(\hat{\beta}\) in the regression \(y_n \sim x_n \beta + \boldsymbol{z}_n^\intercal\gamma\).

(a)

Let \(\underset{\boldsymbol{Z}}{\boldsymbol{P}}\) denote the projection matrix onto \(\boldsymbol{Z}\), the \(N \times K\) matrix of \(\boldsymbol{z}_n\) regressors. Show that \[ \left(\underset{\boldsymbol{Z}}{\boldsymbol{P}} \boldsymbol{Y}\right)_n = \frac{1}{N_{k(n)}} \sum_{m: k(m) = k(n)} y_m =: \bar{y}_{k(n)}, \] where \(N_{k(n)}\) is the number of rows in which student \(k(n)\) occurs and \(\bar{y}_{k}\) is the average rating submitted by student \(k\). That is, the projection onto \(\boldsymbol{Z}\) computes the average evaluation submitted by each student.

(b)

Define \(y'_n := y_n - \bar{y}_{k(n)}\) denote the students’ evaluations centered at each students’ mean evaluation, and let \(\boldsymbol{Y}' = (y'_1, \ldots, y'_N)\). Show that \(\boldsymbol{Y}' = \underset{\boldsymbol{Z}^\perp}{\boldsymbol{P}} \boldsymbol{Y}\).

(c)

Let \(\boldsymbol{X}= (x_1, \ldots x_N)^\intercal\) denote the vector of class sizes. Let \(\bar{x}_{k}\) denote the average size of class taken by student \(k(n)\). Let \(x'_n := x_n - \bar{x}_{k(n)}\). Using the FWL theorem, show that \(\hat{\beta}\) in the regression \(y_n \sim x_n \beta + \boldsymbol{z}_n^\intercal\gamma\) is equal to \(\hat{\beta}\) in the regression \(y'_n \sim x'_n \beta\).

(d)

Some students tend to give higher ratings than others, and some students tend to take larger classes than others. Models like this are sometime called “fixed effects models,” since they estimate the student-to-student variability with “fixed” regression parameters. How would you interpret \(\hat{\gamma}_k\), the \(k\)–th estimate coefficient in the regression \(y_n \sim x_n \beta + \boldsymbol{z}_n^\intercal\gamma\)?

Solutions

(a)

This follows from

\[ \boldsymbol{Z}^\intercal\boldsymbol{Z}= \begin{pmatrix} N_1 & 0 & \ldots & 0 \\ \vdots & N_2 & \ldots & 0 \\ 0 & \ldots & 0 & N_K \\ \end{pmatrix} \]

and

\[ \boldsymbol{Z}^\intercal\boldsymbol{Y}= \begin{pmatrix} \sum_{m: k(n) = 1} y_m \\ \vdots \\ \sum_{m: k(n) = K} y_m \\ \end{pmatrix}, \]

so that

\[ (\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}= \begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_K \end{pmatrix}. \]

Apply the formula \(\underset{\boldsymbol{Z}}{\boldsymbol{P}} \boldsymbol{Y}= \boldsymbol{Z}(\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}\), using the fact that \(z_n\) has a \(1\) in row \(k(n)\), and so picks out element \(\bar{y}_{k(n)}\) from \((\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}\).

(b)

This follows immediately from (a) and the fact that \(\underset{\boldsymbol{Z}^\perp}{\boldsymbol{P}}\boldsymbol{Y}= (\boldsymbol{I}- \underset{\boldsymbol{Z}}{\boldsymbol{P}}) \boldsymbol{Y}\).

(c)

By the same reasoning as (b) and (c), \(\boldsymbol{X}' = \underset{\boldsymbol{Z}^\perp}{\boldsymbol{P}} \boldsymbol{X}\). The result then folows from the FWL theorem.

(d) \(\hat{\gamma}_k\) can be interpreted as an estimate of a students’ average tendency to give high ratings, approximately controlling for class size.

4 Regression to the mean with noisy test set data

Suppose we are interesting in building models of the brain. In particular, we want to know whether it is easier to model the brain watching dog videos versus the brain watching cat videos.

For each of \(n=1,\ldots, N\) people, we show them videos of animals playing while recording signals from their brain. We then build a classifier to predict, using the same signals, whether that individuals is watching a dog or a cat. We then evaluate the model on new dog and cat videos, measuring for person a test set error \(\varepsilon_{dog,n}\) for the dog videos and \(\varepsilon_{cat,n}\) for the cat videos.

Since the test set is random, it is reasonable to model the errors as random and unbiased. For each \(n\), we thus have

\[ \mathbb{E}\left[\varepsilon_{dog,n}\right] = \mu_{dog,n} \quad\quad\textrm{and}\quad\quad \mathbb{E}\left[\varepsilon_{cat,n}\right] = \mu_{cat,n}, \]

where \(\mathrm{Var}\left(\varepsilon_{cat,n}\right) > 0\) and \(\mathrm{Var}\left(\varepsilon_{dog,n}\right) > 0\)

(a)

Intuitively, what would it mean if \(\mu_{dog,n} = \mu_{cat,n}\) for each \(n\)?

(b)

Suppose that \(\mu_{dog,n} = \mu_{cat,n} = \mu_n\) for each \(n\), and we run the regression \(\varepsilon_{dog,n} \sim \beta\varepsilon_{cat,n}\). For very large \(N\), do you expect \(\hat{\beta}\) to be larger or smaller than \(1\)? Justify your answer.

(c)

Again, suppose that \(\mu_{dog,n} = \mu_{cat,n} = \mu_n\) for each \(n\), and we run the regression \(\varepsilon_{cat,n} \sim \gamma \varepsilon_{dog,n}\). For very large \(N\), do you expect \(\hat{\gamma}\) to be larger or smaller than \(1\)? Justify your answer.

(d)

Suppose the researcher runs the regression \(\varepsilon_{dog,n} \sim \beta \varepsilon_{cat,n}\) and gets \(\hat{\beta}= 0.3\). They then conclude that:

\(\hat{\beta}\) is much less than one
Therefore dog errors are lower than cat errors,
Therefore dogs are easier to predict than cats.

Is this a resonable conclusion? Why or why not?

Solutions

(a) This would mean that, on average, dogs are as easy to predict as cats.

(b) Due to regression to the mean, we expect \(\hat{\beta}\) to be smaller than one, even though the two averages are the same.

(c) Just as in (b), we expect \(\hat{\gamma}\) to be smaller than one.

(d) This is not a reasonable conclusion; we expect \(\hat{\beta}\) to be smaller than one due to the randomness in the regrssor alone. How much lower will depend on the test set variability, and requires a more detailed analysis.

5 Leaving a single datapoint out of regression

This homework problem derives a closed-form expression for the effect of leaving a datapoint out of the regression.

We will use the following result, known as the Woodbury formula (but also many other names, including the Sherman-Morrison-Woodbury formula). Let \(A\) denote an invertible matrix, and \(\boldsymbol{u}\) and \(\boldsymbol{v}\) vectors the same length as \(A\). Then

\[ (A + \boldsymbol{u}\boldsymbol{v}^\intercal)^{-1} ={} A^{-1} - \frac{A^{-1} \boldsymbol{u}\boldsymbol{v}^\intercal A^{-1}}{1 + \boldsymbol{v}^\intercal A^{-1} \boldsymbol{u}}. \]

We will also use the definition of a “leverage score” from lecture \(h_n := \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n\). Note that \(h_n = (\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal)_{nn}\) is the \(n\)–th diagonal entry of the projection matrix \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\).

Let \(\hat{\boldsymbol{\beta}}_{-n}\) denote the estimate of \(\hat{\beta}\) with the datapoint \(n\) left out. Similarly, let \(\boldsymbol{X}_{-n}\) denote the \(\boldsymbol{X}\) matrix with row \(n\) left out, and \(\boldsymbol{Y}_{-n}\) denote the \(\boldsymbol{Y}\) matrix with row \(n\) left out.

a Prove that

\[ \hat{\boldsymbol{\beta}}_{-n} = (\boldsymbol{X}_{-n}^\intercal\boldsymbol{X}_{-n})^{-1} \boldsymbol{X}_{-n}^\intercal\boldsymbol{Y}_{-n} = (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} (\boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{x}_n y_n) \]

Solution

This follows from \(\boldsymbol{X}^\intercal\boldsymbol{X}= \sum_{n=1}^N\boldsymbol{x}_n \boldsymbol{x}_n^\intercal\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \sum_{n=1}^N\boldsymbol{x}_n y_n\).

Using the Woodbury formula, derive the following expression: \[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - h_n} \]

Solution

Direct application of the formula with \(\boldsymbol{u}= \boldsymbol{x}_n\) and \(\boldsymbol{v}= -\boldsymbol{x}_n\) gives \[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} }{1 - \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\boldsymbol{x}_n}. \]

Then recognize the leverage score.

Combine (a) and (b) to derive the following explicit expression for \(\hat{\boldsymbol{\beta}}_{-n}\):

\[ \hat{\boldsymbol{\beta}}_{-n} = \hat{\boldsymbol{\beta}}- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1}{1 - h_n} \hat{\varepsilon}_n. \]

Solutions

We have

\[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \hat{\beta}+ \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \boldsymbol{x}_n^\intercal\hat{\beta}}{1 - h_n}. \]

and

\[ (\boldsymbol{X}^\intercal\boldsymbol{X}- \boldsymbol{x}_n \boldsymbol{x}_n^\intercal)^{-1} \boldsymbol{x}_n y_n = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n y_n + \frac{(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n h_n }{1 - h_n} y_n. \]

Combining,

\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{-n} ={}& \hat{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{\boldsymbol{x}_n^\intercal\hat{\beta}}{1 - h_n} - y_n - \frac{h_n }{1 - h_n} y_n \right) \\={}& \hat{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{1 }{1 - h_n} \hat{y}_n - \left(1 + \frac{h_n }{1 - h_n} \right)y_n \right) \\={}& \hat{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \left( \frac{1 }{1 - h_n} \hat{y}_n - \frac{1 }{1 - h_n} y_n \right) \\={}& \hat{\beta}- (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1 }{1 - h_n} \hat{\varepsilon}_n \end{aligned} \]

Letting \(\hat{y}_{n,-n} = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n}\) denote the estimate of \(y_n\) after deleting the \(n\)–th observation. Using (c), derive the following explicit expression the change in \(\hat{y}_n\) upon deleting the \(n\)–th observation:

\[ \hat{y}_{n,-n} - \hat{y}_n = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n} - \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}= -\frac{h_n}{1 - h_n} \hat{\varepsilon}_n. \]

This shows that the effect of deleting observation \(n\) on \(\hat{y}_n\) is large only if both the residual and leverage score is large.

Solution Taking \(\boldsymbol{x}_n^\intercal\) times each side of the result from c gives

\[ \begin{aligned} \hat{y}_{n,-n} ={}& \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}_{-n} \\={}& \boldsymbol{x}_n^\intercal\hat{\beta}- \boldsymbol{x}_n^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{x}_n \frac{1 }{1 - h_n} \hat{\varepsilon}_n \\={}& \hat{y}_n - \frac{h_n}{1 - h_n} \hat{\varepsilon}_n. \end{aligned} \]

The result follows by subtracting \(\hat{y}_n\) from both sides.

6 Prediction in the bodyfat example

This exercise will use the bodyfat example from the datasets. Suppose we’re interested in predicting bodyfat, which is difficult to measure precisely, with other variables which are easier to measure: Height, Weight, and Abdomen circumference.

If we do so, we get the following sum of squared errors:

reg <- lm(bodyfat ~ Abdomen + Height + Weight, bodyfat_df)
print(reg$coefficients)

(Intercept)     Abdomen      Height      Weight 
-36.6147193   0.9515631  -0.1270307  -0.1307606

print(sprintf("Error: %f", mean(reg$residuals^2)))

[1] "Error: 19.456161"

(a)

Noting that Heigh, Weight, and Abdomen are on different scales, your colleague suggests that you might get a better fit by normalizing them. But when you do, here’s what happened:

bodyfat_df <- bodyfat_df %>%
    mutate(height_norm=(Height - mean(Height)) / sd(Height),
           weight_norm=(Weight - mean(Weight)) / sd(Weight),
           abdomen_norm=(Abdomen - mean(Abdomen)) / sd(Abdomen))
reg_norm <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm, bodyfat_df)
print(reg_norm$coefficients)

 (Intercept) abdomen_norm  height_norm  weight_norm 
   19.150794    10.260778    -0.465295    -3.842945

print(sprintf("Error: %f", mean(reg_norm$residuals^2)))

[1] "Error: 19.456161"

Our coefficients changed, but our fitted error didn’t change at all.

Explain why the fitted error did not change.
Explain why the coefficients did change.

(b)

Chastened, your colleague suggests that maybe it’s the difference between normalized height and weight that would helps us predict. After all, it makes sense that height should only matter relative to weight, and vice versa. So they run the regresson on the difference:

bodyfat_df <- bodyfat_df %>%
    mutate(hw_diff = height_norm - weight_norm)
reg_diff <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + hw_diff, bodyfat_df)
print(reg_diff$coefficients)

 (Intercept) abdomen_norm  height_norm  weight_norm      hw_diff 
   19.150794    10.260778    -0.465295    -3.842945           NA

print(sprintf("Error: %f", mean(reg_diff$residuals^2)))

[1] "Error: 19.456161"

Now, our fitted error didn’t change at all, but our difference coefficient wasn’t even estimated.

Explain why the fitted error did not change.
Explain why the difference coefficient was not estimated by R.

(c)

Finally, your colleague suggests regressing instead on the ratio of weight to height. Here are the results:

bodyfat_df <- bodyfat_df %>%
    mutate(hw_ratio = height_norm / weight_norm)
reg_ratio <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + hw_ratio, bodyfat_df)
print(reg_ratio$coefficients)

 (Intercept) abdomen_norm  height_norm  weight_norm     hw_ratio 
19.149475485 10.271751693 -0.478683698 -3.848561820  0.009057317

print(sprintf("Error: %f", mean(reg_ratio$residuals^2)))

[1] "Error: 19.423560"

Our fitted error is different this time, and we could estimate this coefficient.

Explain why we could we estimate the coefficient of the ratio of height to weight, but not the difference.
Explain why the fitted error changed.
It happened that, by including the regressor hw_ratio, the fitted error decreased. Your colleague tells you that had it been a bad regressor, the error would have increased. Are they correct?

(d)

Let \(x_n = (\textrm{Abdomen}_n, \textrm{Height}_n, \textrm{Weight}_n)^\intercal\) denote our set of regressors.

Your colleague suggests a research project where you improve your fit by regressing \(y_n \sim z_n\) for new regressors \(z_n\) of the form \(z_n = \boldsymbol{A}x_n\), where the matrix \(\boldsymbol{A}\) is chosen using a machine learning algorithm.

Will this result produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?

(e)

Finally, your colleague suggests a research project where you again regression \(y_n \sim z_n\), but now you let \(z_n = f(x_n)\) for any function \(f\), where you use a neural network to find the best fit to the data over all possible functions \(f(x_n)\).

Will this result produce a better fit to the data than simply regressing \(y_n \sim \boldsymbol{x}_n\)? Why or why not?
Do you think this result produce a useful prediction for new data? Why or why not?

Solutions

(a) The new regressors are invertible linear transforms of the original regressors. This means \(\hat{\boldsymbol{Y}}\) does not change, but the coefficients change.

(b) The new regressors are again invertible linear transforms of the original regressors. The difference is a linear combination of height_norm and weight_norm, so \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, and \betavhat is not uniquely defined. R reports NA for coefficients that it cannot estimate.

(c) You can estimate the ratio because it is a nonlinear transformation of the regressors. With this new regressor, the fit can change. Including another regressor, no matter poorly associated with the response, can never make the error increase. Your colleague is incorrect.

(d) Since \(\boldsymbol{A}x_n\) is a linear combination of \(\boldsymbol{x}_n\), it can never give a better fit than regressing on \(x_n\) alone. (It can give a worse fit if \(\boldsymbol{A}\) is lower rank than \(\boldsymbol{x}_n\).) Searching over all possible \(\boldsymbol{A}\) will never give you a better fit than \(y_n \sim \boldsymbol{x}_n\). This is not a good idea for a research project.

(e) Since \(f(\boldsymbol{x}_n)\) can be a nonlinear function, you can produce a better fit than regressing on \(\boldsymbol{x}_n\) alone. In fact, the best possible fit among all possible functions is a highly expressive function which fits the data perfectly with zero error. You don’t need a neural network to identify this. Having zero error does not mean the fit is useful for anything — it likely overfits the training data and has poor predictive power on new data. This is not a good idea for a research project.

7 References

Cinelli, Carlos, and Chad Hazlett. 2020. “Making Sense of Sensitivity: Extending Omitted Variable Bias.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1): 39–67.

Davidson, Russell, James G MacKinnon, et al. 2004. Econometric Theory and Methods. Vol. 5. Oxford University Press New York.