STAT151A Homework 5.

1 The output of lm

For this problem, download the births14.csv dataset. We will manually reproduce most of the quantities computed by the lm function in R for the following regression:

library(tidyverse)
births_df <- read.csv("../datasets/births/births14.csv")
ols_fit <- lm(weight ~ 1 + mage + habit, births_df, x=TRUE, y=TRUE)
print(summary(ols_fit))
Call:
lm(formula = weight ~ 1 + mage + habit, data = births_df, x = TRUE,
    y = TRUE)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0065 -0.6763  0.1039  0.8237  3.1172

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.890764   0.207443  33.218  < 2e-16 ***
mage         0.013291   0.007112   1.869   0.0619 .
habitsmoker -0.580241   0.127545  -4.549 6.06e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.278 on 978 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.02507,  Adjusted R-squared:  0.02307
F-statistic: 12.57 on 2 and 978 DF,  p-value: 4.063e-06
ols_coeffs <- summary(ols_fit)$coefficients
Note that, since I called lm with the arguments x=TRUE and y=TRUE, the \(\boldsymbol{X}\) matrix and \(\boldsymbol{Y}\) vector are available as ols_fit$x and ols_fit$y respectively.

For the following questions, use only core R functions (matrix multiplication, inversion, and basic linear algebra) on \(\boldsymbol{X}\) and \(\boldsymbol{Y}\).
(a) Compute \(\hat{\beta}\) and show that it matches coefficients(ols_fit).
(b) Compute \(\hat{\boldsymbol{\varepsilon}}\) and show that it matches ols_fit$residuals.
(c) Compute \(\hat{\boldsymbol{Y}}\) and show that it matches ols_fit$fitted.values.
(d) Estimate the residual standard deviation, \(\hat{\sigma}\), and show that it matches sigma(ols_fit).
(e) Estimate the homoskedastic sampling covariance of \(\hat{\boldsymbol{\beta}}\) and show that it matches vcov(ols_fit).
(f) Estimate the heteroskedastic (sandwich) sampling covariance of \(\hat{\boldsymbol{\beta}}\) and show that it matches sandwich::vcovCL(ols_fit, type="HC0", cadjust=FALSE).
(g) Estimate the standard errors of \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "Std. Error"].
(h) Estimate the t–statistics for \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "t value"].
(i) Estimate the p–values for \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "Pr(>|t|)"].
(j) Estimate an \(\alpha = 0.05\) two–sided confidence interval for the coefficient of mage, and show that it matches confint(ols_fit, "mage", level=1 - alpha).
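As a starting point, here is a minimal sketch (one possible approach, not a required layout) of recovering these quantities from \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) with core linear algebra; part (f) uses the standard HC0 sandwich formula \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal \mathrm{diag}(\hat{\boldsymbol{\varepsilon}}^2) \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\):

# A minimal sketch, assuming ols_fit was fit with x=TRUE and y=TRUE as above.
X <- ols_fit$x
Y <- ols_fit$y
n <- nrow(X)
p <- ncol(X)

XtX_inv <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% Y            # (a) coefficients
y_hat <- X %*% beta_hat                       # (c) fitted values
eps_hat <- Y - y_hat                          # (b) residuals
sigma_hat <- sqrt(sum(eps_hat^2) / (n - p))   # (d) residual standard deviation
vcov_hat <- sigma_hat^2 * XtX_inv             # (e) homoskedastic covariance
meat <- t(X) %*% (X * as.numeric(eps_hat)^2)  # X' diag(eps^2) X
vcov_hc0 <- XtX_inv %*% meat %*% XtX_inv      # (f) HC0 sandwich covariance
se_hat <- sqrt(diag(vcov_hat))                # (g) standard errors
t_stat <- as.numeric(beta_hat) / se_hat       # (h) t-statistics
p_val <- 2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)  # (i) p-values
alpha <- 0.05                                 # (j) CI for the mage coefficient
ci_mage <- beta_hat["mage", 1] +
  c(-1, 1) * qt(1 - alpha / 2, df = n - p) * se_hat["mage"]

Each line can then be checked against the corresponding lm output, e.g. max(abs(as.numeric(beta_hat) - coefficients(ols_fit))).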
2 Simulating coverage
For each of the following model specifications, perform the following steps:
- Fix \(\boldsymbol{\beta}\) as described in the model specification
- Draw a single \(\boldsymbol{X}\) matrix, with dimensions \(N \times 2\), where the first column is ones and the second column consists of IID standard normal random variables.
- Repeat for simulations \(s = 1, \ldots, N_{sim}\), with \(N_{sim} = 5000\):
- Draw \(\boldsymbol{\varepsilon}\) as described in the model specification
- Set \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\)
- Compute an 80% confidence interval \(I_s\) for \(\boldsymbol{\beta}_1\) using homoskedastic, normal error assumptions
- Record \(c_s = 1\) if \({\boldsymbol{\beta}_1 \in I_s}\) and \(c_s = 0\) otherwise.
- Estimate the coverage probability, \(\hat{C} := \frac{1}{N_{sim}} \sum_{s=1}^{N_{sim}} c_s\).
Note that if the confidence interval procedure achieves its nominal coverage, then we should have \(\hat{C} \approx 0.8\). For each problem:
- Evaluate whether \(\hat{C} \approx 0.8\), and explain whether this is expected.
- Evaluate the average width of the confidence intervals.
For each model, set \(\boldsymbol{\beta}= (1, 2)^\intercal\).
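A minimal sketch of this simulation loop in R follows. Assumptions to note: simulate_coverage and its draw_eps argument are hypothetical names of my own, \(\boldsymbol{\beta}_1\) is taken to be the slope on the non-constant column (the second entry of \(\boldsymbol{\beta}\)), and the 80% interval comes from confint, which uses the homoskedastic normal-theory formula.

simulate_coverage <- function(N, beta, draw_eps, n_sim = 5000, level = 0.80) {
  X <- cbind(1, rnorm(N))                 # one fixed design for all simulations
  covered <- numeric(n_sim)
  widths <- numeric(n_sim)
  for (s in seq_len(n_sim)) {
    eps <- draw_eps(N, X)                 # errors per the model specification
    Y <- drop(X %*% beta) + eps
    fit <- lm(Y ~ X[, 2])                 # intercept plus the random column
    ci <- confint(fit, 2, level = level)  # homoskedastic normal-theory CI
    covered[s] <- ci[1] <= beta[2] && beta[2] <= ci[2]
    widths[s] <- ci[2] - ci[1]
  }
  c(coverage = mean(covered), avg_width = mean(widths))
}

# Model (a): N = 500, homoskedastic standard normal errors.
simulate_coverage(500, c(1, 2), function(N, X) rnorm(N))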
(a)
- \(N = 500\)
- \(\varepsilon_n \sim \mathcal{N}\left(0, 1\right)\)
(b)
- \(N = 500\)
- \(\varepsilon_n = z_n \cdot x_{n1}\quad\) where \(\quad z_n \sim \mathcal{N}\left(0, 1\right)\) and \(x_{n1}\) is the \(n\)-th entry of the non-constant column
(c)
- \(N = 500\)
- \(\varepsilon_n = \exp(z_n) - \exp(1/2)\quad\) where \(\quad z_n \sim \mathcal{N}\left(0, 1\right)\) (Note that \(\varepsilon_n\) is mean zero by properties of the lognormal distribution.)
(d)
Like (a), but \(N=10\).
(e)
Like (b), but \(N=10\).
(f)
Like (c), but \(N=10\).
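The error models of (b)/(e) and (c)/(f) plug into the same hypothetical draw_eps slot; these one-liners are sketches under the indexing assumption above (the non-constant column is X[, 2]):

draw_het <- function(N, X) rnorm(N) * X[, 2]               # (b)/(e): z_n * x_n1
draw_lognormal <- function(N, X) exp(rnorm(N)) - exp(1/2)  # (c)/(f): mean-zero lognormal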
(g)
Briefly describe the observed effect of non–Gaussian errors, heteroskedasticity, and small sample sizes on coverage and CI width.
3 Interpreting tests and confidence intervals
Consider a regression \(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), with the given confidence intervals for \(\beta_1\). In each problem, let \(\hat{v}= \left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \hat{\sigma}^2\right)_{11}\) denote the estimated sampling variance of \(\hat{\beta}_1\), so that \(\sqrt{\hat{v}}\) is its standard error. Assume that we have a reasonably large number of observations.
Assume that the response is patient well-being, and \(x_{n1}\) is a treatment indicator for a medical procedure, so that \(\beta_1\) is the modeled effect of the procedure on the patient’s health. Here, \(\beta_1 > 0\) means the treatment has a positive (beneficial) effect on health. The other regressors are explanatory covariates. The treatment is randomly assigned, so it is independent of the other covariates.
For each problem, discuss briefly whether the inferential statement is reasonable or unreasonable and why.
(a) Consider the one-sided interval \(I = (\hat{\beta}_1 - 1.64 \sqrt{\hat{v}}, \infty)\), and assume that \(\hat{\beta}_1 - 1.64 \sqrt{\hat{v}} = -3.4\). Since this interval contains \(\beta_1 = 10\), we conclude that the treatment might have a very positive effect.
(b) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval contains \(0\); therefore we know that the treatment has no effect.
(c) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval does not contain \(0\) (that is, \(\hat{\beta}_1 - 1.96 \sqrt{\hat{v}} > 0\)), which is tentative evidence that the treatment has some non–zero, positive effect.
(d) Consider (c), except that when we plot the residuals \(\hat{\varepsilon}_n\) versus \(x_{n2}\), we find that the distribution of \(\hat{\varepsilon}_n\) has greater spread when \(\left|x_{n2}\right|\) is larger. Is the inference in (c) still reasonable?
(e) We observe a new patient with regressors \(x_{*}\), and construct a confidence interval \(I_{*}\) such that \(\mathbb{P}\left(x_{*}^\intercal\boldsymbol{\beta}\in I_{*}\right) \ge 0.95\). We expect that their unobserved well-being, \(y_*\), is probably in \(I_{*}\).
4 Chi squared random variables
Let \(s \sim \chi^2_{K}\). Prove that
- \(\mathbb{E}\left[s\right] = K\)
- \(\mathrm{Var}\left(s\right) = 2 K\) (hint: if \(z\sim \mathcal{N}\left(0,\sigma^2\right)\), then \(\mathbb{E}\left[z^4\right] = 3\sigma^4\))
- If \(a_n \sim \mathcal{N}\left(0,\sigma^2\right)\) IID for \(n = 1,\ldots,N\), then \(\frac{1}{\sigma^2} \sum_{n=1}^{N} a_n^2 \sim \chi^2_{N}\)
- \(\frac{1}{K} s \rightarrow 1\) in probability as \(K \rightarrow \infty\)
- \(\frac{1}{\sqrt{K}} (s - K) \rightarrow \mathcal{N}\left(0, 2\right)\) in distribution as \(K \rightarrow \infty\)
- Let \(\boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\) where \(\boldsymbol{a}\in \mathbb{R}^{K}\). Then \(\left\Vert\boldsymbol{a}\right\Vert_2^2 \sim \chi^2_{K}\)
- Let \(\boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{\Sigma}\right)\) where \(\boldsymbol{a}\in \mathbb{R}^{K}\) and \(\boldsymbol{\Sigma}\) is invertible. Then \(\boldsymbol{a}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{a}\sim \chi^2_{K}\)
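As a starting point for the first bullet, here is a sketch; it assumes the standard definition of \(\chi^2_{K}\) as a sum of squared IID standard normals (consistent with the third bullet when \(\sigma^2 = 1\)):

\[
s = \sum_{k=1}^{K} z_k^2, \qquad z_k \overset{\text{iid}}{\sim} \mathcal{N}\left(0, 1\right),
\qquad\text{so}\qquad
\mathbb{E}\left[s\right] = \sum_{k=1}^{K} \mathbb{E}\left[z_k^2\right] = \sum_{k=1}^{K} \mathrm{Var}\left(z_k\right) = K.
\]

The variance bullet follows the same pattern, using the hint \(\mathbb{E}\left[z^4\right] = 3\sigma^4\) to obtain \(\mathrm{Var}\left(z_k^2\right) = 3 - 1 = 2\) for each standard normal \(z_k\).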