STAT151A Homework 4.

Author

Your name here

This homework is due on Gradescope on Monday October 28th at 9pm.

1 Fixed effects and the FWL theorem

Suppose we are interested in measuring the effect of class size on teaching evaluations. Let \(y_n\) denote the teaching evaluation submitted by a particular student in a particular class and let \(x_n\) denote the class size for the \(n\)–th teaching evaluation.

Suppose there are \(K\) students, and that each row was submitted by exactly one of the \(K\) students. Suppose also that each student submitted a teaching evaluation for each class they took, so that a particular student contributes to multiple rows. Let \(\boldsymbol{z}_n\) denote a one–hot indicator recording which student submitted the \(n\)–th teaching evaluation. Let \(k(n)\) denote the index of the student who submitted row \(n\). (For example, if student \(3\) wrote the review in row \(10\), then \(k(10) = 3\).)

We will be interested in \(\hat{\beta}\) in the regression \(y_n \sim x_n \beta + \boldsymbol{z}_n^\intercal\gamma\).

(a)

Let \(\underset{\boldsymbol{Z}}{\boldsymbol{P}}\) denote the projection matrix onto \(\boldsymbol{Z}\), the \(N \times K\) matrix of \(\boldsymbol{z}_n\) regressors. Show that \[ \left(\underset{\boldsymbol{Z}}{\boldsymbol{P}} \boldsymbol{Y}\right)_n = \frac{1}{N_{k(n)}} \sum_{m: k(m) = k(n)} y_m =: \bar{y}_{k(n)}, \] where \(N_{k(n)}\) is the number of rows in which student \(k(n)\) occurs and \(\bar{y}_{k}\) is the average rating submitted by student \(k\). That is, the projection onto \(\boldsymbol{Z}\) computes the average evaluation submitted by each student.

(b)

Let \(y'_n := y_n - \bar{y}_{k(n)}\) denote the evaluations centered at each student’s mean evaluation, and let \(\boldsymbol{Y}' = (y'_1, \ldots, y'_N)\). Show that \(\boldsymbol{Y}' = \underset{\boldsymbol{Z}^\perp}{\boldsymbol{P}} \boldsymbol{Y}\).

(c)

Let \(\boldsymbol{X}= (x_1, \ldots, x_N)^\intercal\) denote the vector of class sizes. Let \(\bar{x}_{k}\) denote the average size of the classes taken by student \(k\), and let \(x'_n := x_n - \bar{x}_{k(n)}\). Using the FWL theorem, show that \(\hat{\beta}\) in the regression \(y_n \sim x_n \beta + \boldsymbol{z}_n^\intercal\gamma\) is equal to \(\hat{\beta}\) in the regression \(y'_n \sim x'_n \beta\).

(d)

Some students tend to give higher ratings than others, and some students tend to take larger classes than others.
Models like this are sometimes called “fixed effects models,” since they estimate the student-to-student variability with “fixed” regression parameters. How would you interpret \(\hat{\gamma}_k\), the \(k\)–th estimated coefficient in the regression \(y_n \sim x_n \beta + \boldsymbol{z}_n^\intercal\gamma\)?

Solutions:

(a)

This follows from

\[ \boldsymbol{Z}^\intercal\boldsymbol{Z}= \begin{pmatrix} N_1 & 0 & \cdots & 0 \\ 0 & N_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & N_K \\ \end{pmatrix} \]

and

\[ \boldsymbol{Z}^\intercal\boldsymbol{Y}= \begin{pmatrix} \sum_{m: k(m) = 1} y_m \\ \vdots \\ \sum_{m: k(m) = K} y_m \\ \end{pmatrix}, \]

so that

\[ (\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}= \begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_K \end{pmatrix}. \]

Apply the formula \(\underset{\boldsymbol{Z}}{\boldsymbol{P}} \boldsymbol{Y}= \boldsymbol{Z}(\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}\), using the fact that \(\boldsymbol{z}_n\) has a \(1\) in entry \(k(n)\) and zeros elsewhere, so row \(n\) of the product picks out element \(\bar{y}_{k(n)}\) of \((\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}\).
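
As a quick numeric check (the student assignment and ratings below are made up for illustration), the projection formula indeed returns each student’s mean rating:

set.seed(0)
student <- rep(1:3, times=c(2, 3, 4))           # k(n) for N = 9 rows, K = 3 students
Z <- model.matrix(~ factor(student) - 1)        # N x K one-hot matrix of z_n
Y <- rnorm(length(student))                     # made-up ratings
proj_Y <- Z %*% solve(t(Z) %*% Z, t(Z) %*% Y)   # P_Z Y
cbind(proj_Y, ave(Y, student))                  # the two columns agree: ybar_{k(n)}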

(b)

This follows immediately from (a) and the fact that \(\underset{\boldsymbol{Z}^\perp}{\boldsymbol{P}}\boldsymbol{Y}= (\boldsymbol{I}- \underset{\boldsymbol{Z}}{\boldsymbol{P}}) \boldsymbol{Y}\).

(c)

By the same reasoning as in (a) and (b), \(\boldsymbol{X}' = \underset{\boldsymbol{Z}^\perp}{\boldsymbol{P}} \boldsymbol{X}\), where \(\boldsymbol{X}' = (x'_1, \ldots, x'_N)^\intercal\). The result then follows from the FWL theorem.

(d) \(\hat{\gamma}_k\) can be interpreted as an estimate of student \(k\)’s average tendency to give high ratings, approximately controlling for class size.
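
A short simulation sketch of (c) and (d) (the class sizes, student effects, and noise level below are made up for illustration): the coefficient on class size from the dummy-variable regression matches the coefficient from the within-student demeaned regression, as the FWL theorem says.

set.seed(42)
K <- 20                                       # number of students
n_per <- 5                                    # evaluations per student
student <- factor(rep(1:K, each=n_per))
x <- rnorm(K * n_per, mean=30, sd=10)         # class sizes
gamma <- rnorm(K)                             # student-specific rating tendencies
y <- -0.02 * x + gamma[as.integer(student)] + rnorm(K * n_per, sd=0.5)

beta_full <- coef(lm(y ~ x + student))["x"]   # y_n ~ x_n beta + z_n^T gamma
y_centered <- y - ave(y, student)             # y'_n
x_centered <- x - ave(x, student)             # x'_n
beta_within <- coef(lm(y_centered ~ x_centered - 1))["x_centered"]
c(beta_full, beta_within)                     # equal up to numerical error

(The intercept plus \(K - 1\) dummies that lm uses span the same column space as the full set of \(K\) indicators, so \(\hat{\beta}\) is unchanged.)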

2 Omitted variables in inference versus prediction

Suppose we are interested in student performance as measured by fourth–year GPA, \(y_n\), centered at \(0\) and scaled to have unit variance. Suppose we imagine two regressors:

  • \(x_n\): Student’s academic ability (in an abstract sense, maybe not precisely measurable or defined), and
  • \(z_n\): Student’s performance on a standardized test before admission.

Suppose that \(y_n = \beta x_n + \varepsilon_n\), where \(\varepsilon_n\) is mean zero, independent of \(x_n\), and with variance \(\sigma_\varepsilon^2\). We will also assume that \(\beta > 0\). This simplistic model is unrealistic, but we will use it for illustrative purposes, and assume for this problem that this model determines the “true” relationship between \(x_n\), \(z_n\), and \(y_n\).

Note that \(z_n\) effectively has a zero coefficient in the “true” model: GPA is causally determined entirely by “ability,” not by the test score.

Assume that we have centered and standardized \(x_n\) and \(z_n\), so that

\[ \begin{aligned} \frac{1}{N} \sum_{n=1}^Nx_n =& \frac{1}{N} \sum_{n=1}^Nz_n = 0 \quad\textrm{and} \\ \frac{1}{N} \sum_{n=1}^Nx_n^2 =& \frac{1}{N} \sum_{n=1}^Nz_n^2 = 1. \end{aligned} \] But large values of \(x_n\) and \(z_n\) tend to occur together, so that \(\frac{1}{N} \sum_{n=1}^Nx_n z_n = 0.9\). (You might think of \(x_n\) and \(z_n\) as having been drawn from correlated random variables, even though they are treated as fixed for the purposes of this problem.)

(a)

Given the true model, how will we change a student’s expected GPA if we:

  • Increase their test score \(z_n\) by helping them memorize the answers to the test?
  • Improve their academic ability \(x_n\) by teaching them better time management skills?

(b)

If the number of observations \(N\) is large, and we run the regression \(y_n \sim \gamma z_n\), what will the expected coefficient \(\mathbb{E}\left[\hat{\gamma}\right]\) be? (In the expectation, only \(y_n\) and \(\varepsilon_n\) are random; the regressors are fixed.)

How does this compare to the “true” influence of \(z_n\) on GPA? How does this compare with the “true” influence of \(x_n\) on GPA?

(c)

Are we able to do effective inference using the regression \(y_n \sim \gamma z_n\)? Explain why or why not in intuitive terms.

(d)

Let \(x_{*}\), \(z_{*}\), and \(y_{*}\) denote a new observation not part of the training set. We still have \(y_* = \beta x_* + \varepsilon_*\), but assume now that \(x_*\) and \(z_*\) are random, independent of \(\varepsilon_*\) and of the training data, and that their moments match the training set’s sample moments:

\[ \begin{aligned} \mathbb{E}\left[x_*\right] ={}& \mathbb{E}\left[z_*\right] = 0 \\ \mathrm{Var}\left(x_*\right) ={}& \mathrm{Var}\left(z_*\right) = 1 \\ \mathbb{E}\left[x_* z_*\right] ={}& 0.9. \end{aligned} \]

Evaluate \(\mathbb{E}\left[y_*\right]\) and \(\mathrm{Var}\left(y_*\right)\), where the randomness is over \(x_*\), \(z_*\), and \(\varepsilon_*\). Note that the answer will be in terms of \(\sigma_\varepsilon^2\), the variance of \(\varepsilon_*\).

(e)

Using the regression \(y_n \sim \gamma z_n\), form the prediction \(\hat{y}_{*} = z_{*} \hat{\gamma}\). Evaluate \(\mathbb{E}\left[\hat{y}_* \vert \boldsymbol{Y}\right]\) and \(\mathrm{Var}\left(\hat{y}_* \vert \boldsymbol{Y}\right)\). Again, the answer will be in terms of \(\sigma_\varepsilon^2\), the variance of \(\varepsilon_*\). Hint: \(\hat{\gamma}\) is fixed (not random) conditional on \(\boldsymbol{Y}\).

(f)

Using (d) and (e), evaluate the conditional correlation between the response and prediction:

\[ \textrm{Correlation}(y_*, \hat{y}_* \vert \boldsymbol{Y}) = \frac{\mathbb{E}\left[y_{*} \hat{y}_{*} \vert \boldsymbol{Y}\right]} {\sqrt{\mathrm{Var}\left(\hat{y}_{*} \vert \boldsymbol{Y}\right)} \sqrt{\mathrm{Var}\left(y_* | \boldsymbol{Y}\right)}}. \]

Hint: \(\varepsilon_*\), \(x_*\), and \(z_*\) are independent of the training data \(\boldsymbol{Y}\), so the conditioning doesn’t affect their expectations.

(g)

Suppose that \(\sigma_\varepsilon/ \beta\) is not too large, so the residual standard deviation is not too large relative to the effect of academic ability on GPA. Are we then able to do effective prediction using the regression \(y_n \sim \gamma z_n\), say, to identify which students are likely to have good GPAs using only test scores? Explain why or why not in intuitive terms.

Solutions

(a)

We will not change their GPA by helping them memorize the answers to the test. Improving their academic ability \(x_n\) will increase their expected GPA, since \(\beta > 0\).

(b)

\[ \begin{aligned} \mathbb{E}\left[\hat{\gamma}\right] ={}& \mathbb{E}\left[\frac{\sum_{n=1}^Ny_n z_n}{\sum_{n=1}^Nz_n^2}\right] \\={}& \frac{\sum_{n=1}^N\mathbb{E}\left[y_n\right] z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\sum_{n=1}^N\mathbb{E}\left[\beta x_n + \varepsilon_n\right] z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\sum_{n=1}^N\left( \beta x_n + \mathbb{E}\left[\varepsilon_n\right]\right) z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\sum_{n=1}^N\beta x_n z_n}{\sum_{n=1}^Nz_n^2} \\={}& \frac{\frac{1}{N} \sum_{n=1}^Nx_n z_n}{\frac{1}{N} \sum_{n=1}^Nz_n^2} \beta \\={}& 0.9 \beta \end{aligned} \]

The true influence of \(z_n\) on \(y_n\) is \(0\), so \(\mathbb{E}\left[\hat{\gamma}\right] = 0.9 \beta\) overestimates it. However, it is less than \(\beta\), the “true” effect of \(x_n\) on \(y_n\), because the sample correlation of \(z_n\) with \(x_n\) is less than one.
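
A quick simulation sketch of this calculation (the values of \(N\), \(\beta\), and \(\sigma_\varepsilon\) below are arbitrary illustrative choices):

library(mvtnorm)
set.seed(1)
n_obs <- 1e5; beta <- 2; sigma_eps <- 1
xz <- rmvnorm(n_obs, sigma=matrix(c(1, 0.9, 0.9, 1), nrow=2))   # corr(x, z) = 0.9
x <- xz[, 1]; z <- xz[, 2]
y <- beta * x + rnorm(n_obs, sd=sigma_eps)
coef(lm(y ~ z - 1))                            # close to 0.9 * beta = 1.8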

(c)

No, we are not. The true ability \(x_n\) is a confounder: the true effect of \(z_n\) is \(0\), but we instead measure a positive association because \(z_n\) is correlated with \(x_n\).

(d)

We have

\[ \mathbb{E}\left[y_*\right] ={} \mathbb{E}\left[\beta x_* + \varepsilon_*\right] = \beta \mathbb{E}\left[x_*\right] + \mathbb{E}\left[\varepsilon_*\right] = 0 \]

and

\[ \mathrm{Var}\left(y_*\right) ={} \mathrm{Var}\left(\beta x_* + \varepsilon_*\right) = \beta^2 \mathrm{Var}\left(x_*\right) + \mathrm{Var}\left(\varepsilon_*\right) = \beta^2 + \sigma_\varepsilon^2. \]

(e)

We have

\[ \mathbb{E}\left[\hat{y}_* \vert \boldsymbol{Y}\right] ={} \mathbb{E}\left[\hat{\gamma}z_* \vert \boldsymbol{Y}\right] = \hat{\gamma}\mathbb{E}\left[z_* \vert \boldsymbol{Y}\right] = 0 \]

and

\[ \mathrm{Var}\left(\hat{y}_* \vert \boldsymbol{Y}\right) ={} \mathrm{Var}\left(\hat{\gamma}z_* \vert \boldsymbol{Y}\right) = \hat{\gamma}^2 \mathrm{Var}\left(z_* \vert \boldsymbol{Y}\right) = \hat{\gamma}^2. \]

(f)

We only have to compute

\[ \begin{aligned} \mathbb{E}\left[\hat{y}_* y_* \vert \boldsymbol{Y}\right] ={}& \mathbb{E}\left[(\hat{\gamma}z_*) (\beta x_* + \varepsilon_*) \vert \boldsymbol{Y}\right] \\={}& \mathbb{E}\left[\hat{\gamma}\beta z_* x_* + \hat{\gamma}z_* \varepsilon_* \vert \boldsymbol{Y}\right] \\={}& \hat{\gamma}\beta \mathbb{E}\left[ z_* x_* \vert \boldsymbol{Y}\right] + \hat{\gamma}\mathbb{E}\left[z_* \varepsilon_* \vert \boldsymbol{Y}\right] \\={}& \hat{\gamma}\beta \mathbb{E}\left[ z_* x_*\right] + \hat{\gamma}\mathbb{E}\left[z_* \varepsilon_*\right] \\={}& 0.9 \hat{\gamma}\beta. \end{aligned} \]

Plugging in,

\[ \textrm{Correlation}(y_*, \hat{y}_* \vert \boldsymbol{Y}) = \frac{0.9 \hat{\gamma}\beta}{\sqrt{\hat{\gamma}^2} \sqrt{\beta^2 + \sigma_\varepsilon^2}} = \frac{0.9 }{\sqrt{1 + \left( \frac{\sigma_\varepsilon}{\beta}\right)^2}}. \]

(g)

Yes, we can perform prediction using \(y_n \sim z_n\), since the correlation between the prediction \(\hat{y}_*\) and \(y_*\) is high as long as \(\sigma_\varepsilon/ \beta\) is not too large. For prediction, it does not matter that \(z_n\) is not causally related to \(y_n\) — it is enough that it is correlated with a variable that is causally related to \(y_n\).
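
A simulation sketch of the correlation in (f) and (g), with arbitrary illustrative values of \(\beta\) and \(\sigma_\varepsilon\) and with \(\hat{\gamma}\) fixed at its large-sample value \(0.9\beta\) from (b):

library(mvtnorm)
set.seed(2)
n_test <- 1e5; beta <- 2; sigma_eps <- 1
xz <- rmvnorm(n_test, sigma=matrix(c(1, 0.9, 0.9, 1), nrow=2))
x_star <- xz[, 1]; z_star <- xz[, 2]
y_star <- beta * x_star + rnorm(n_test, sd=sigma_eps)
gamma_hat <- 0.9 * beta
y_hat <- gamma_hat * z_star
c(simulated=cor(y_star, y_hat),
  formula=0.9 / sqrt(1 + (sigma_eps / beta)^2))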

3 Regression to the mean with noisy test set data

Suppose we are interested in building models of the brain. In particular, we want to know whether it is easier to model the brain while it watches dog videos or while it watches cat videos.

For each of \(n=1,\ldots, N\) people, we show them videos of animals playing while recording signals from their brain. We then build a classifier to predict, using the same signals, whether that individual is watching a dog or a cat. We then evaluate the model on new dog and cat videos, measuring for each person a test set error \(\varepsilon_{dog,n}\) on the dog videos and \(\varepsilon_{cat,n}\) on the cat videos.

Since the test set is random, it is reasonable to model the errors as random and unbiased. For each \(n\), we thus have

\[ \mathbb{E}\left[\varepsilon_{dog,n}\right] = \mu_{dog,n} \quad\quad\textrm{and}\quad\quad \mathbb{E}\left[\varepsilon_{cat,n}\right] = \mu_{cat,n}, \]

where \(\mathrm{Var}\left(\varepsilon_{cat,n}\right) > 0\) and \(\mathrm{Var}\left(\varepsilon_{dog,n}\right) > 0\).

(a)

Intuitively, what would it mean if \(\mu_{dog,n} = \mu_{cat,n}\) for each \(n\)?

(b)

Suppose that \(\mu_{dog,n} = \mu_{cat,n} = \mu_n\) for each \(n\), and we run the regression \(\varepsilon_{dog,n} \sim \beta\varepsilon_{cat,n}\). For very large \(N\), do you expect \(\hat{\beta}\) to be larger or smaller than \(1\)? Justify your answer.

(c)

Again, suppose that \(\mu_{dog,n} = \mu_{cat,n} = \mu_n\) for each \(n\), and we run the regression \(\varepsilon_{cat,n} \sim \gamma \varepsilon_{dog,n}\). For very large \(N\), do you expect \(\hat{\gamma}\) to be larger or smaller than \(1\)? Justify your answer.

(d)

Suppose the researcher runs the regression \(\varepsilon_{dog,n} \sim \beta \varepsilon_{cat,n}\) and gets \(\hat{\beta}= 0.3\). They then conclude that:

  • \(\hat{\beta}\) is much less than one
  • Therefore dog errors are lower than cat errors,
  • Therefore dogs are easier to predict than cats.

Is this a reasonable conclusion? Why or why not?

Solutions

(a) This would mean that, on average over the test set randomness, dog videos are as easy to classify as cat videos for each person.

(b) Due to regression to the mean, we expect \(\hat{\beta}\) to be smaller than one, even though the two averages are the same.

(c) Just as in (b), we expect \(\hat{\gamma}\) to be smaller than one.

(d) This is not a reasonable conclusion; we expect \(\hat{\beta}\) to be smaller than one due to the randomness in the regressor alone. How much smaller it is depends on the test set variability, and determining that requires a more detailed analysis.
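
A simulation sketch of (b)–(d), using a made-up error model in which each person has the same mean error \(\mu_n\) for dogs and cats:

set.seed(3)
n_people <- 1e4
mu <- runif(n_people, 0.1, 0.4)                  # mu_n, shared by dog and cat errors
eps_dog <- mu + rnorm(n_people, sd=0.05)         # noisy dog test-set errors
eps_cat <- mu + rnorm(n_people, sd=0.05)         # noisy cat test-set errors
c(beta_hat=coef(lm(eps_dog ~ eps_cat))["eps_cat"],
  gamma_hat=coef(lm(eps_cat ~ eps_dog))["eps_dog"])   # both well below 1

Both slopes are attenuated toward zero because the regressor in each case is a noisy measurement of \(\mu_n\).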

4 Simulations

For this problem, we will simulate data from a linear regression and confirm some of our mathematical results.

Define the following quantities and their corresponding R variables:

  • \(N\): The number of observations
  • \(\sigma\): The residual standard deviation
  • \(\boldsymbol{\beta}\): A \(3\)–dimensional regression coefficient
  • \(\rho\): The covariance between the regressors \(x_{n1}\) and \(x_{n2}\)

We will assume that:

  • \(x_{n1}\) is normal with mean \(1\) and variance \(1\),
  • \(x_{n2}\) is normal with mean \(2\) and variance \(1\),
  • \(\mathrm{Cov}\left(x_{n1}, x_{n2}\right) = \rho\) (so that \(x_{n1}\) is not independent of \(x_{n2}\) if \(\rho \ne 0\)),
  • The regressor rows are independent of one another (\(x_{n1}\) and \(x_{n2}\) are independent of \(x_{m1}\) and \(x_{m2}\) for \(m \ne n\)),
  • The residuals \(\varepsilon_n\) are IID normal with mean \(0\) and variance \(\sigma^2\), independent of all regressors, and
  • For each \(n\), \(y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}+ \varepsilon_n\).

(a)

Write a function to generate regressors \(\boldsymbol{x}_1\), \(\boldsymbol{x}_2\), and a response \(\boldsymbol{Y}\) given the inputs \(N\), \(\sigma\), \(\boldsymbol{\beta}\), and \(\rho\). You may use the functions rnorm (which generates normal random variables) and rmvnorm from the mvtnorm package, which draws multivariate normal random variables.

Check that \(\boldsymbol{x}_1\) and \(\boldsymbol{x}_2\) have the correct covariance using cov.

(b)

Write a function to compute \(\hat{\boldsymbol{\beta}}\) for the regression \(\boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\) given \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) as inputs.

Do not use the lm function — use only the matrix operations solve(), t(), and %*%. Your function may assume that \(\boldsymbol{X}\) is full column rank. However, please do check that your answer is the same as that given by lm.

(c)

Using (a) and (b), run the regression \(y_n \sim 1 + x_{n1} + x_{n2}\) for the following inputs:

  • \(N = 5000\)
  • \(\sigma = 3\)
  • \(\boldsymbol{\beta}= (1, 2, 3)^\intercal\)
  • \(\rho = 0.8\)

How does \(\hat{\boldsymbol{\beta}}\) compare to the true value \(\boldsymbol{\beta}\)? Justify your observation mathematically.

(d)

Repeat (c), but with the regression \(y_n \sim 1 + x_{n1}\). How do the estimated coefficients compare with the true intercept and coefficient for \(x_{n1}\)? Justify your observation mathematically.

(e)

Repeat (c), but with the regression \(y_n \sim 1 + x_{n1}\) and with \(\rho = 0\). How do the estimated coefficients compare with the true intercept and coefficient for \(x_{n1}\)? Justify your observation mathematically.

(f)

Use the simulated data to illustrate a theoretical result of your choice from the class.

Solutions:

library(tidyverse)

check <- system.file(package = "mvtnorm")
if (check == "") {
  install.packages("mvtnorm")
}

library(mvtnorm)

(a)

x_dim <- 2
n_obs <- 5000
rho <- 0.8

DrawData <- function(n_obs, sigma, beta_true, rho) {
  # Regressors: means (1, 2), unit variances, covariance rho
  x_cov <- matrix(c(1, rho, rho, 1), nrow=2)
  x_mat <- rmvnorm(n_obs, mean=c(1, 2), sigma=x_cov)
  # Conditional mean plus IID normal noise with residual sd sigma
  ey <- cbind(rep(1, n_obs), x_mat) %*% beta_true
  y <- ey + rnorm(n_obs, sd=sigma)
  df <- data.frame(y=y, x1=x_mat[,1], x2=x_mat[,2])
  return(df)
}

beta_true <- c(1, 2, 3)
df <- DrawData(n_obs=5000, sigma=3, beta_true=beta_true, rho=0.8)

cov(df[, c("x1", "x2")])
          x1        x2
x1 0.9874899 0.7714599
x2 0.7714599 0.9787393

(b)

RunRegression <- function(x_mat, y_vec) {
  # OLS estimate betahat = (X^T X)^{-1} X^T Y using only matrix operations
  xtx <- t(x_mat) %*% x_mat
  betahat <- solve(xtx, t(x_mat) %*% y_vec)
  return(betahat)
}


df <- DrawData(n_obs=5000, sigma=3, beta_true=beta_true, rho=0.8)
RunRegression(cbind(1, df$x1, df$x2), df$y)
         [,1]
[1,] 1.121523
[2,] 2.012095
[3,] 2.939336
lm(y ~ 1 + x1 + x2, df)

Call:
lm(formula = y ~ 1 + x1 + x2, data = df)

Coefficients:
(Intercept)           x1           x2  
      1.122        2.012        2.939  

It matches lm.

(c)

df <- DrawData(n_obs=5000, sigma=3, beta_true=beta_true, rho=0.8)
RunRegression(cbind(1, df$x1, df$x2), df$y)
          [,1]
[1,] 0.8918375
[2,] 2.0007193
[3,] 3.0359272
beta_true
[1] 1 2 3

\(\hat{\boldsymbol{\beta}}\) is close to the true \(\boldsymbol{\beta}\), as expected: the regression is correctly specified, so \(\hat{\boldsymbol{\beta}}\) is unbiased and its sampling variance \(\sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) is small for \(N = 5000\).

(d)

df <- DrawData(n_obs=5000, sigma=3, beta_true=beta_true, rho=0.8)
RunRegression(cbind(1, df$x1), df$y)
         [,1]
[1,] 4.575103
[2,] 4.387203
beta_true[c(1, 2)]
[1] 1 2

Since we have omitted \(x_{n2}\), which is both correlated with \(x_{n1}\) and has a nonzero coefficient, both the intercept and the coefficient on \(x_{n1}\) are poorly estimated.
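
For large \(N\), the omitted variable bias calculation predicts approximately the values seen above (a quick sketch using the simulation’s population moments):

\[ \hat{\beta}_{x_1} \approx \beta_1 + \frac{\mathrm{Cov}\left(x_{n1}, x_{n2}\right)}{\mathrm{Var}\left(x_{n1}\right)} \beta_2 = 2 + 0.8 \times 3 = 4.4 \quad\textrm{and}\quad \hat{\beta}_0 \approx \mathbb{E}\left[y_n\right] - 4.4 \, \mathbb{E}\left[x_{n1}\right] = 9 - 4.4 = 4.6, \]

since \(\mathbb{E}\left[y_n\right] = 1 + 2 \times 1 + 3 \times 2 = 9\).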

(e)

df <- DrawData(n_obs=5000, sigma=3, beta_true=beta_true, rho=0.0)
RunRegression(cbind(1, df$x1), df$y)
         [,1]
[1,] 6.987840
[2,] 1.934007
beta_true[c(1, 2)]
[1] 1 2

Now the coefficient on \(x_{n1}\) is well estimated because \(x_{n2}\) is uncorrelated with \(x_{n1}\). However, the intercept is badly estimated because \(x_{n2}\) does not have mean zero, so its contribution is absorbed into the intercept.
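
Concretely, for large \(N\) the fitted coefficients should be approximately

\[ \hat{\beta}_{x_1} \approx \beta_1 = 2 \quad\textrm{and}\quad \hat{\beta}_0 \approx \beta_0 + \beta_2 \mathbb{E}\left[x_{n2}\right] = 1 + 3 \times 2 = 7, \]

matching the estimates above.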

(f)

Their choice, grade generously!
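
One possible illustration (any correct choice is acceptable): the spread of \(\hat{\boldsymbol{\beta}}\) across repeated simulations should be close to the classical standard error formula \(\sqrt{\mathrm{diag}\left(\sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\right)}\). A sketch using the functions defined above:

set.seed(4)
n_sims <- 500
betahat_draws <- replicate(n_sims, {
  df_sim <- DrawData(n_obs=500, sigma=3, beta_true=beta_true, rho=0.8)
  as.numeric(RunRegression(cbind(1, df_sim$x1, df_sim$x2), df_sim$y))
})
df_sim <- DrawData(n_obs=500, sigma=3, beta_true=beta_true, rho=0.8)
x_mat <- cbind(1, df_sim$x1, df_sim$x2)
rbind(monte_carlo=apply(betahat_draws, 1, sd),
      classical=sqrt(diag(3^2 * solve(t(x_mat) %*% x_mat))))

The two rows agree only approximately, since the classical formula conditions on a single draw of \(\boldsymbol{X}\) while the simulation redraws the regressors each time.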

5 Multivariate normal exercises

Let \(\boldsymbol{x}\sim \mathcal{N}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)\) where

\[ \boldsymbol{\mu}= \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \quad\textrm{and}\quad \boldsymbol{\Sigma}= \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 2 & 0.1 \\ 0 & 0.1 & 4 \\ \end{pmatrix}. \]

Let

\[ \boldsymbol{v}= \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \quad \boldsymbol{a}= \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix} \quad \boldsymbol{A}= \begin{pmatrix} 1 & 1 & 0 \\ 2 & 0 & 1 \\ \end{pmatrix} \]

Evaluate the following expressions.

  1. \(\mathbb{E}\left[\boldsymbol{v}^\intercal\boldsymbol{x}\right]\)
  2. \(\mathrm{Var}\left(\boldsymbol{v}^\intercal\boldsymbol{x}\right)\)
  3. \(\mathbb{E}\left[\boldsymbol{a}^\intercal\boldsymbol{x}\right]\)
  4. \(\mathrm{Var}\left(\boldsymbol{a}^\intercal\boldsymbol{x}\right)\)
  5. \(\mathbb{E}\left[\boldsymbol{v}\boldsymbol{x}^\intercal\right]\)
  6. \(\mathbb{E}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right]\)
  7. \(\mathbb{E}\left[\boldsymbol{x}^\intercal\boldsymbol{x}\right]\)
  8. \(\mathbb{E}\left[\mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal\right)\right]\)
  9. \(\mathbb{E}\left[\boldsymbol{A}\boldsymbol{x}\right]\)
  10. \(\mathrm{Cov}\left(\boldsymbol{A}\boldsymbol{x}\right)\)
  11. \(\mathbb{E}\left[(\boldsymbol{A}\boldsymbol{x}) (\boldsymbol{A}\boldsymbol{x})^\intercal\right]\)

Solutions

a_mat <- matrix(c(1, 2, 1, 0, 0, 1), nrow=2)
a <- matrix(c(2, 1, 1), nrow=3)
v <- matrix(c(0, 1, 0), nrow=3)
mu <- matrix(c(1, 2, 3), nrow=3)
sigmam <- matrix(c(1, 0.5, 0, 0.5, 2, 0.1, 0, 0.1, 4), nrow=3)
v %*% t(mu)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    2    3
[3,]    0    0    0
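
The remaining answers below can be checked directly from the same objects, for example:

t(v) %*% mu                                     # E[v^T x]
t(v) %*% sigmam %*% v                           # Var(v^T x)
t(a) %*% mu                                     # E[a^T x]
t(a) %*% sigmam %*% a                           # Var(a^T x)
sigmam + mu %*% t(mu)                           # E[x x^T]
sum(mu^2) + sum(diag(sigmam))                   # E[x^T x] = E[trace(x x^T)]
a_mat %*% mu                                    # E[A x]
a_mat %*% sigmam %*% t(a_mat)                   # Cov(A x)
a_mat %*% (mu %*% t(mu) + sigmam) %*% t(a_mat)  # E[(A x)(A x)^T]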

\(\mathbb{E}\left[\boldsymbol{v}^\intercal\boldsymbol{x}\right] = \boldsymbol{v}^\intercal\boldsymbol{\mu}=\) 2

\(\mathrm{Var}\left(\boldsymbol{v}^\intercal\boldsymbol{x}\right) = \boldsymbol{v}^\intercal\boldsymbol{\Sigma}\boldsymbol{v}=\) 2

\(\mathbb{E}\left[\boldsymbol{a}^\intercal\boldsymbol{x}\right] = \boldsymbol{a}^\intercal\boldsymbol{\mu}=\) 7

\(\mathrm{Var}\left(\boldsymbol{a}^\intercal\boldsymbol{x}\right) = \boldsymbol{a}^\intercal\boldsymbol{\Sigma}\boldsymbol{a}=\) 12.2

\(\mathbb{E}\left[\boldsymbol{v}\boldsymbol{x}^\intercal\right] = \boldsymbol{v}\boldsymbol{\mu}^\intercal=\)

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    2    3
[3,]    0    0    0

\(\mathbb{E}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] = \boldsymbol{\Sigma}+ \boldsymbol{\mu}\boldsymbol{\mu}^\intercal=\)

     [,1] [,2] [,3]
[1,]  2.0  2.5  3.0
[2,]  2.5  6.0  6.1
[3,]  3.0  6.1 13.0

\(\mathbb{E}\left[\boldsymbol{x}^\intercal\boldsymbol{x}\right] = \boldsymbol{\mu}^\intercal\boldsymbol{\mu}+ \mathrm{trace}\left(\boldsymbol{\Sigma}\right) =\) 21

\(\mathbb{E}\left[\mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal\right)\right] = \mathbb{E}\left[\boldsymbol{x}^\intercal\boldsymbol{x}\right] =\) 21

\(\mathbb{E}\left[\boldsymbol{A}\boldsymbol{x}\right] = \boldsymbol{A}\boldsymbol{\mu}=\)

     [,1]
[1,]    3
[2,]    5

\(\mathrm{Cov}\left(\boldsymbol{A}\boldsymbol{x}\right) = \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\intercal=\)

     [,1] [,2]
[1,]  4.0  3.1
[2,]  3.1  8.0

\(\mathbb{E}\left[(\boldsymbol{A}\boldsymbol{x}) (\boldsymbol{A}\boldsymbol{x})^\intercal\right] = \boldsymbol{A}(\boldsymbol{\mu}\boldsymbol{\mu}^\intercal+ \boldsymbol{\Sigma}) \boldsymbol{A}^\intercal=\)

     [,1] [,2]
[1,] 13.0 18.1
[2,] 18.1 33.0