STAT151A Homework 5.

Author

Your name here

1 The output of lm

For this problem, download the births14.csv dataset. We will manually reproduce most of the quantities computed by the lm function in R for the following regression:

library(tidyverse)
births_df <- read.csv("../datasets/births/births14.csv")
ols_fit <- lm(weight ~ 1 + mage + habit, births_df, x=TRUE, y=TRUE)
print(summary(ols_fit))

Call:
lm(formula = weight ~ 1 + mage + habit, data = births_df, x = TRUE, 
    y = TRUE)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0065 -0.6763  0.1039  0.8237  3.1172 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.890764   0.207443  33.218  < 2e-16 ***
mage         0.013291   0.007112   1.869   0.0619 .  
habitsmoker -0.580241   0.127545  -4.549 6.06e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.278 on 978 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.02507,   Adjusted R-squared:  0.02307 
F-statistic: 12.57 on 2 and 978 DF,  p-value: 4.063e-06
ols_coeffs <- summary(ols_fit)$coefficients

Note that, since I called lm with the arguments x=TRUE and y=TRUE, the matrix \(\boldsymbol{X}\) and the vector \(\boldsymbol{Y}\) are available as ols_fit$x and ols_fit$y respectively.

For the following questions, use only core R functions (matrix multiplication, matrix inversion, and other basic linear algebra) on \(\boldsymbol{X}\) and \(\boldsymbol{Y}\).

(a) Compute \(\hat{\beta}\) and show that it matches coefficients(ols_fit).

(b) Compute \(\hat{\boldsymbol{\varepsilon}}\) and show that it matches ols_fit$residuals.

(c) Compute \(\hat{\boldsymbol{Y}}\) and show that it matches ols_fit$fitted.values.

(d) Estimate the residual standard deviation, \(\hat{\sigma}\), and show that it matches sigma(ols_fit).

(e) Estimate the homoskedastic sampling covariance of \(\hat{\boldsymbol{\beta}}\) and show that it matches vcov(ols_fit).

(f) Estimate the heteroskedastic (sandwich) sampling covariance of \(\hat{\boldsymbol{\beta}}\) and show that it matches sandwich::vcovCL(ols_fit, type="HC0", cadjust=FALSE).

(g) Estimate the standard errors of \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "Std. Error"].

(h) Estimate the t–statistics for \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "t value"].

(i) Estimate the p–values for \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "Pr(>|t|)"].

(j) Estimate an \(\alpha = 0.05\) two–sided confidence interval for the coefficient of mage, and show that it matches confint(ols_fit, "mage", level=1 - alpha).

Solutions:

# Coefficients 

xmat <- ols_fit$x
yvec <- ols_fit$y

# Normal equations: betahat = (X^T X)^{-1} X^T Y
xtx <- t(xmat) %*% xmat
betahat <- solve(xtx, t(xmat) %*% yvec)

print(cbind(coefficients(ols_fit), betahat))
                   [,1]        [,2]
(Intercept)  6.89076356  6.89076356
mage         0.01329106  0.01329106
habitsmoker -0.58024070 -0.58024070
# Residuals
reshat <- yvec - xmat %*% betahat
print(max(abs(reshat - ols_fit$residuals)))
[1] 8.126833e-14
# Fitted
yhat <-  xmat %*% betahat
print(max(abs(yhat - ols_fit$fitted.values)))
[1] 8.171241e-14
# Residual variance: sigmahat^2 = RSS / (N - P)
sigmahat2 <- sum(reshat^2) / (nrow(xmat) - ncol(xmat))
print(c(sqrt(sigmahat2), sigma(ols_fit)))
[1] 1.278499 1.278499
# Betahat covariance under homoskedasticity: sigmahat^2 (X^T X)^{-1}
betahat_cov <- sigmahat2 * solve(xtx)
print(betahat_cov)
             (Intercept)          mage   habitsmoker
(Intercept)  0.043032582 -1.442567e-03 -0.0032354412
mage        -0.001442567  5.057444e-05  0.0000473339
habitsmoker -0.003235441  4.733390e-05  0.0162678535
print(vcov(ols_fit))
             (Intercept)          mage   habitsmoker
(Intercept)  0.043032582 -1.442567e-03 -0.0032354412
mage        -0.001442567  5.057444e-05  0.0000473339
habitsmoker -0.003235441  4.733390e-05  0.0162678535
# Sandwich (HC0) covariance: (X^T X)^{-1} X^T diag(reshat^2) X (X^T X)^{-1}
score_cov <- t(xmat) %*% (diag(as.numeric(reshat)^2) %*% xmat)
betahat_sandwich_cov <- solve(xtx, score_cov) %*% solve(xtx)
print(betahat_sandwich_cov)
             (Intercept)          mage   habitsmoker
(Intercept)  0.044942817 -1.537795e-03 -3.932433e-03
mage        -0.001537795  5.472591e-05  7.760102e-05
habitsmoker -0.003932433  7.760102e-05  2.416461e-02
print(sandwich::vcovCL(ols_fit, type="HC0", cadjust=FALSE))
             (Intercept)          mage   habitsmoker
(Intercept)  0.044942817 -1.537795e-03 -3.932433e-03
mage        -0.001537795  5.472591e-05  7.760102e-05
habitsmoker -0.003932433  7.760102e-05  2.416461e-02
# Standard errors
betahat_sd <- sqrt(diag(betahat_cov))
print(cbind(betahat_sd, ols_coeffs[,"Std. Error"]))
             betahat_sd            
(Intercept) 0.207442960 0.207442960
mage        0.007111571 0.007111571
habitsmoker 0.127545496 0.127545496
# Confidence intervals
alpha <- 0.05
mage_col <- 2
# Two-sided t interval: betahat +/- t_{1 - alpha/2, N - P} * se(betahat)
zalpha <- -1 * qt(alpha / 2, df=nrow(xmat) - ncol(xmat))
ci <- betahat[mage_col] + betahat_sd[mage_col] * c(-zalpha, zalpha)

print(ci)
[1] -0.0006646303  0.0272467564
print(confint(ols_fit, "mage", level=1 - alpha))
             2.5 %     97.5 %
mage -0.0006646303 0.02724676
# T values
t_value <- betahat / betahat_sd
print(cbind(t_value, ols_coeffs[, "t value"]))
                 [,1]      [,2]
(Intercept) 33.217630 33.217630
mage         1.868935  1.868935
habitsmoker -4.549284 -4.549284
# P values
df <- nrow(xmat) - ncol(xmat)
p_value <- pt(- abs(t_value), df=df) * 2
print(cbind(p_value, ols_coeffs[, "Pr(>|t|)"]))
                     [,1]          [,2]
(Intercept) 1.389298e-162 1.389298e-162
mage         6.193059e-02  6.193059e-02
habitsmoker  6.058598e-06  6.058598e-06

2 Simulating coverage

For each of the following model specifications, perform the following steps:

  • Fix \(\boldsymbol{\beta}\) as described in the model specification
  • Draw a single \(\boldsymbol{X}\) matrix, with dimensions \(N \times 2\), where the first column is all ones and the second column consists of IID standard normal random variables.
  • Repeat for \(s = 1, \ldots, N_{sim}\) simulations, with \(N_{sim} = 5000\):
    • Draw \(\boldsymbol{\varepsilon}\) as described in the model specification
    • Set \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\)
    • Compute an 80% confidence interval \(I_1\) for \(\boldsymbol{\beta}_1\) using homoskedastic, normal error assumptions
    • Record \(c_s = 1\) if \(\boldsymbol{\beta}_1 \in I_1\) and \(c_s = 0\) otherwise.
  • Estimate the coverage probability, \(\hat{C} := \frac{1}{N_{sim}} \sum_{s=1}^{N_{sim}} c_s\).

Note that if the confidence interval procedure is valid, then we should have \(\hat{C} \approx 0.8\). For each problem:

  • Evaluate whether \(\hat{C} \approx 0.8\), and explain whether this is expected.
  • Evaluate the average width of the confidence interval.

For each model, set \(\boldsymbol{\beta}= (1, 2)^\intercal\).

(a)

  • \(N = 500\)
  • \(\varepsilon_n \sim \mathcal{N}\left(0, 1\right)\)

(b)

  • \(N = 500\)
  • \(\varepsilon_n = z_n \cdot x_1\quad\) where \(\quad z_n \sim \mathcal{N}\left(0, 1\right)\)

(c)

  • \(N = 500\)
  • \(\varepsilon_n = \exp(z_n) - \exp(1/2)\quad\) where \(\quad z_n \sim \mathcal{N}\left(0, 1\right)\) (Note that \(\varepsilon_n\) is mean zero by properties of the lognormal distribution.)

(d)

Like (a), but \(N=10\).

(e)

Like (b), but \(N=10\).

(f)

Like (c), but \(N=10\).

(g)

Briefly describe the observed effect of non–Gaussian errors, heteroskedasticity, and low sample size on coverage and CI width.

Solutions:

See the file hw5_2_solution.qmd.
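
For reference, here is a minimal sketch of the simulation loop for case (a) (homoskedastic normal errors, \(N = 500\)), taking \(\boldsymbol{\beta}_1\) to be the slope coefficient. The variable names are illustrative and not taken from hw5_2_solution.qmd; only the error draw on the marked line changes across cases (b)–(f).

set.seed(151)

n_sim <- 5000
n_obs <- 500
beta <- c(1, 2)
alpha <- 0.2  # nominal level for an 80% confidence interval

# Draw a single X matrix: a column of ones and one standard normal column
x <- cbind(1, rnorm(n_obs))

covered <- rep(NA, n_sim)
width <- rep(NA, n_sim)
for (s in 1:n_sim) {
  eps <- rnorm(n_obs)  # <-- swap in the error draw for cases (b)-(f)
  y <- as.numeric(x %*% beta + eps)
  fit <- lm(y ~ x[, 2])
  # Homoskedastic, normal-theory CI for the slope (the second coefficient)
  ci <- confint(fit, 2, level = 1 - alpha)
  covered[s] <- (ci[1] <= beta[2]) && (beta[2] <= ci[2])
  width[s] <- ci[2] - ci[1]
}

cat("Estimated coverage:", mean(covered), "\n")
cat("Average CI width:", mean(width), "\n")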

3 Interpreting tests and confidence intervals

Consider a regression \(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), with the given confidence intervals for \(\beta_1\). In each problem, let \(\hat{v}= \left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \hat{\sigma}^2\right)_{11}\) denote the estimated sampling variance of \(\hat{\boldsymbol{\beta}}_1\), so that \(\sqrt{\hat{v}}\) is its standard error. Assume that we have a reasonably large number of observations.

Assume that the response is patient well-being, and \(x_{n1}\) is a treatment indicator for a medical procedure, so that \(\beta_1\) is the modeled effect of the procedure on the patient’s health. Here, \(\beta_1 > 0\) means the treatment has a positive (beneficial) effect on health. The other regressors are explanatory covariates. The treatment is randomly assigned, so that it is independent of the other covariates.

For each problem, discuss briefly whether the inferential statement is reasonable or unreasonable and why.

(a) Consider the one-sided interval \(I = (\hat{\beta}_1 - 1.64 \sqrt{\hat{v}}, \infty)\), and assume that \(\hat{\beta}_1 - 1.64 \sqrt{\hat{v}} = - 3.4\). Since this interval contains \(\beta_1 = 10\), we conclude that the treatment might have a very positive effect.

(b) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval contains \(0\), therefore we know that the treatment has no effect.

(c) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval does not contain \(0\), i.e. \(\hat{\beta}_1 - 1.96 \sqrt{\hat{v}} > 0\), which is tentative evidence that the treatment has some non–zero, positive effect.

(d) Consider (c), except that we plot the residuals \(\hat{\varepsilon}_n\) versus \(x_{n2}\) and find that the distribution of \(\hat{\varepsilon}_n\) has greater spread when \(\left|x_{n2}\right|\) is larger.

(e) We observe a new patient with regressors \(x_{*}\), and construct a confidence interval \(I_{*}\) such that \(\mathbb{P}\left(x_{*}^\intercal\boldsymbol{\beta}\in I_{*}\right) \ge 0.95\). We expect that their unobserved well-being, \(y_*\), is probably in \(I_{*}\).

Solutions:

(a) This one–sided interval extends to \(+\infty\) by construction, so it contains every large value of \(\beta_1\) regardless of the data; the corresponding test has no power against very positive effects. The conclusion is unreasonable.

(b) Failing to reject the null does not mean the null is true. The conclusion is unreasonable.

(c) This is reasonable for the population studied, though tentative: the conclusion holds only if the other assumptions of the model are correct, such as homoskedasticity, no outliers or very high leverage points, and so on.

(d) Since the residual plot suggests heteroskedasticity, the homoskedasticity-based intervals are likely to be invalid, and the conclusion is unreasonable. (A sandwich-based interval, as in Problem 1(f), would be more appropriate.)

(e) The interval \(I_{*}\) does not cover \(y_*\); it only covers \(x_{*}^\intercal\boldsymbol{\beta}\). The variability of the new residual \(\varepsilon_*\) must also be taken into account, which requires a prediction interval. The conclusion is unreasonable.

4 Chi squared random variables

Let \(s\sim \mathcal{\chi}^2_{K}\). Prove that

  • \(\mathbb{E}\left[s\right] = K\)
  • \(\mathrm{Var}\left(s\right) = 2 K\) (hint: if \(z\sim \mathcal{N}\left(0,\sigma^2\right)\), then \(\mathbb{E}\left[z^4\right] = 3\sigma^4\))
  • If \(a_n \sim \mathcal{N}\left(0,\sigma^2\right)\) IID for \(n = 1,\ldots,N\), then \(\frac{1}{\sigma^2} \sum_{n=1}^Na_n^2 \sim \mathcal{\chi}^2_{N}\)
  • \(\frac{1}{K} s\rightarrow 1\) in probability as \(K \rightarrow \infty\)
  • \(\frac{1}{\sqrt{K}} (s- K) \rightarrow \mathcal{N}\left(0, 2\right)\) in distribution as \(K \rightarrow \infty\)
  • Let \(\boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\) where \(a\in \mathbb{R}^{K}\). Then \(\left\Vert\boldsymbol{a}\right\Vert_2^2 \sim \mathcal{\chi}^2_{K}\)
  • Let \(\boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{\Sigma}\right)\) where \(a\in \mathbb{R}^{K}\). Then \(\boldsymbol{a}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{a}\sim \mathcal{\chi}^2_{K}\)

Solutions:

Write \(s= \sum_{k=1}^K z_k^2\) for \(z_k \sim \mathcal{N}\left(0,1\right)\) IID.

  • \(\mathbb{E}\left[\sum_{k=1}^K z_k^2\right] = \sum_{k=1}^K 1 = K\)
  • \(\mathrm{Var}\left(s\right) = \mathbb{E}\left[\left( \sum_{k=1}^K (z_k^2 - 1)\right)^2\right] = K \mathbb{E}\left[(z_k^2 - 1)^2\right] = K(\mathbb{E}\left[z_k^4\right] - 2 \mathbb{E}\left[z_k^2\right]+ 1) = K(3 - 2 + 1) = 2 K\), where the cross terms vanish because the \(z_k^2 - 1\) are independent and mean zero.
  • Since \(a_n / \sigma \sim \mathcal{N}\left(0,1\right)\) IID, \(\frac{1}{\sigma^2} \sum_{n=1}^Na_n^2 = \sum_{n=1}^N(a_n / \sigma)^2 \sim \mathcal{\chi}^2_{N}\) by the definition above.
  • This follows from the LLN applied to \(\frac{1}{K} \sum_{k=1}^K z_k^2\), since \(\mathbb{E}\left[z_k^2\right] = 1\).
  • This follows from the CLT applied to \(\frac{1}{\sqrt{K}} \sum_{k=1}^K (z_k^2 - 1)\), since \(\mathrm{Var}\left(z_k^2\right) = 2\).
  • The components of \(\boldsymbol{a}\) are IID standard normal, so \(\left\Vert\boldsymbol{a}\right\Vert_2^2 = \sum_{k=1}^K a_k^2 \sim \mathcal{\chi}^2_{K}\) by the definition above.
  • Let \(\boldsymbol{b}:= \boldsymbol{\Sigma}^{-1/2} \boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\); then \(\boldsymbol{a}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{a}= \left\Vert\boldsymbol{b}\right\Vert_2^2 \sim \mathcal{\chi}^2_{K}\) by the previous result.
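
As a quick numerical sanity check of the first two identities (illustrative only, not part of the proof; the choice \(K = 7\) is arbitrary):

set.seed(1)
k <- 7
s <- rchisq(1e6, df = k)
# The sample mean and variance should be close to K = 7 and 2K = 14.
print(c(mean = mean(s), var = var(s)))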