STAT151A Homework 5.
1 The output of lm
For this problem, download the births14.csv dataset. We will manually reproduce most of the quantities computed by the lm function in R for the following regression:
library(tidyverse)
births_df <- read.csv("../datasets/births/births14.csv")
ols_fit <- lm(weight ~ 1 + mage + habit, births_df, x=TRUE, y=TRUE)
print(summary(ols_fit))
Call:
lm(formula = weight ~ 1 + mage + habit, data = births_df, x = TRUE,
y = TRUE)
Residuals:
Min 1Q Median 3Q Max
-6.0065 -0.6763 0.1039 0.8237 3.1172
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.890764 0.207443 33.218 < 2e-16 ***
mage 0.013291 0.007112 1.869 0.0619 .
habitsmoker -0.580241 0.127545 -4.549 6.06e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.278 on 978 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.02507, Adjusted R-squared: 0.02307
F-statistic: 12.57 on 2 and 978 DF, p-value: 4.063e-06
ols_coeffs <- summary(ols_fit)$coefficients
Note that, since I called lm with the arguments x=TRUE and y=TRUE, the \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) matrices are available as ols_fit$x and ols_fit$y respectively.
For the following questions, use only core R functions (matrix multiplication, inversion, and other basic linear algebra) on \(\boldsymbol{X}\) and \(\boldsymbol{Y}\).
(a) Compute \(\hat{\boldsymbol{\beta}}\) and show that it matches coefficients(ols_fit).
(b) Compute \(\hat{\boldsymbol{\varepsilon}}\) and show that it matches ols_fit$residuals.
(c) Compute \(\hat{\boldsymbol{Y}}\) and show that it matches ols_fit$fitted.values.
(d) Estimate the residual standard deviation, \(\hat{\sigma}\), and show that it matches sigma(ols_fit).
(e) Estimate the homoskedastic sampling covariance of \(\hat{\boldsymbol{\beta}}\) and show that it matches vcov(ols_fit).
(f) Estimate the heteroskedastic (sandwich) sampling covariance of \(\hat{\boldsymbol{\beta}}\) and show that it matches sandwich::vcovCL(ols_fit, type="HC0", cadjust=FALSE).
(g) Estimate the standard errors of \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "Std. Error"].
(h) Estimate the t–statistics for \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "t value"].
(i) Estimate the p–values for \(\hat{\boldsymbol{\beta}}\) and show that they match ols_coeffs[, "Pr(>|t|)"].
(j) Estimate an \(\alpha = 0.05\) two–sided confidence interval for the coefficient of mage, and show that it matches confint(ols_fit, "mage", level=1 - alpha).
Solutions:
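As a reference for the code below: the OLS estimate solves the normal equations, \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\). The solutions use solve(xtx, ...) rather than forming the inverse explicitly, which is the numerically preferable way to solve a linear system.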
# Coefficients
xmat <- ols_fit$x
yvec <- ols_fit$y
xtx <- t(xmat) %*% xmat
betahat <- solve(xtx, t(xmat) %*% yvec)
print(cbind(coefficients(ols_fit), betahat))
[,1] [,2]
(Intercept) 6.89076356 6.89076356
mage 0.01329106 0.01329106
habitsmoker -0.58024070 -0.58024070
# Residuals
reshat <- yvec - xmat %*% betahat
print(max(abs(reshat - ols_fit$residuals)))
[1] 8.126833e-14
# Fitted
yhat <- xmat %*% betahat
print(max(abs(yhat - ols_fit$fitted.values)))
[1] 8.171241e-14
# Residual variance
sigmahat2 <- sum(reshat^2) / (nrow(xmat) - ncol(xmat))
print(c(sqrt(sigmahat2), sigma(ols_fit)))
[1] 1.278499 1.278499
# Betahat covariance
betahat_cov <- sigmahat2 * solve(xtx)
print(betahat_cov)
(Intercept) mage habitsmoker
(Intercept) 0.043032582 -1.442567e-03 -0.0032354412
mage -0.001442567 5.057444e-05 0.0000473339
habitsmoker -0.003235441 4.733390e-05 0.0162678535
print(vcov(ols_fit))
(Intercept) mage habitsmoker
(Intercept) 0.043032582 -1.442567e-03 -0.0032354412
mage -0.001442567 5.057444e-05 0.0000473339
habitsmoker -0.003235441 4.733390e-05 0.0162678535
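For reference, the HC0 sandwich estimator computed below is \(\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}}) = (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \left( \sum_{n} \hat{\varepsilon}_n^2 \boldsymbol{x}_n \boldsymbol{x}_n^\intercal \right) (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\); the middle “meat” matrix estimates the covariance of the score \(\boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\).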
# Sandwich covariance
score_cov <- t(xmat) %*% (diag(as.numeric(reshat)^2) %*% xmat)
betahat_sandwich_cov <- solve(xtx, score_cov) %*% solve(xtx)
print(betahat_sandwich_cov)
(Intercept) mage habitsmoker
(Intercept) 0.044942817 -1.537795e-03 -3.932433e-03
mage -0.001537795 5.472591e-05 7.760102e-05
habitsmoker -0.003932433 7.760102e-05 2.416461e-02
print(sandwich::vcovCL(ols_fit, type="HC0", cadjust=FALSE))
(Intercept) mage habitsmoker
(Intercept) 0.044942817 -1.537795e-03 -3.932433e-03
mage -0.001537795 5.472591e-05 7.760102e-05
habitsmoker -0.003932433 7.760102e-05 2.416461e-02
# Standard errors
betahat_sd <- sqrt(diag(betahat_cov))
print(cbind(betahat_sd, ols_coeffs[, "Std. Error"]))
betahat_sd
(Intercept) 0.207442960 0.207442960
mage 0.007111571 0.007111571
habitsmoker 0.127545496 0.127545496
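The two-sided interval computed next is \(\hat{\beta}_k \pm t_{1 - \alpha/2,\, N - P}\, \widehat{\mathrm{SE}}(\hat{\beta}_k)\), where \(t_{1 - \alpha/2,\, N - P}\) is a quantile of the t distribution with \(N - P = 978\) degrees of freedom, obtained from qt().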
# Confidence intervals
alpha <- 0.05
mage_col <- 2
zalpha <- -1 * qt(alpha / 2, df=nrow(xmat) - ncol(xmat))
ci <- betahat[mage_col] + betahat_sd[mage_col] * c(-zalpha, zalpha)
print(ci)
[1] -0.0006646303 0.0272467564
print(confint(ols_fit, "mage", level=1 - alpha))
2.5 % 97.5 %
mage -0.0006646303 0.02724676
# T values
t_value <- betahat / betahat_sd
print(cbind(t_value, ols_coeffs[, "t value"]))
[,1] [,2]
(Intercept) 33.217630 33.217630
mage 1.868935 1.868935
habitsmoker -4.549284 -4.549284
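The two-sided p-value is \(2\, \mathbb{P}\left(t_{N - P} \le -\left|t\right|\right)\), computed from the t distribution CDF via pt().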
# P values
df <- nrow(xmat) - ncol(xmat)
p_value <- pt(-abs(t_value), df=df) * 2
print(cbind(p_value, ols_coeffs[, "Pr(>|t|)"]))
[,1] [,2]
(Intercept) 1.389298e-162 1.389298e-162
mage 6.193059e-02 6.193059e-02
habitsmoker 6.058598e-06 6.058598e-06
2 Simulating coverage
For each of the following model specifications, perform the following steps:
- Fix \(\boldsymbol{\beta}\) as described in the model specification
- Draw a single \(\boldsymbol{X}\) matrix, with dimensions \(N \times 2\), where the first column is ones and the second column consists of IID standard normal random variables.
- Repeat for \(s = 1, \ldots, N_{sim}\) simulations, with \(N_{sim} = 5000\):
- Draw \(\boldsymbol{\varepsilon}\) as described in the model specification
- Set \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\)
- Compute an 80% confidence interval, \(I_1\), for \(\boldsymbol{\beta}_1\) using homoskedastic, normal error assumptions
- Record \(c_s = 1\) if \(\boldsymbol{\beta}_1 \in I_1\) and \(c_s = 0\) otherwise.
- Estimate the coverage probability, \(\hat{C} := \frac{1}{N_{sim}} \sum_{s=1}^{N_{sim}} c_s\).
Note that if the original confidence interval is correct, then we should have \(\hat{C} \approx 0.8\). For each problem:
- Evaluate whether \(\hat{C} \approx 0.8\), and explain whether this is expected.
- Evaluate the average width of the confidence interval.
For each model, set \(\boldsymbol{\beta}= (1, 2)^\intercal\).
(a)
- \(N = 500\)
- \(\varepsilon_n \sim \mathcal{N}\left(0, 1\right)\)
(b)
- \(N = 500\)
- \(\varepsilon_n = z_n \cdot x_{n1}\quad\) where \(\quad z_n \sim \mathcal{N}\left(0, 1\right)\) and \(x_{n1}\) is the non-constant regressor
(c)
- \(N = 500\)
- \(\varepsilon_n = \exp(z_n) - \exp(1/2)\quad\) where \(\quad z_n \sim \mathcal{N}\left(0, 1\right)\) (Note that \(\varepsilon_n\) is mean zero by properties of the lognormal distribution.)
(d)
Like (a), but \(N=10\).
(e)
Like (b), but \(N=10\).
(f)
Like (c), but \(N=10\).
(g)
Briefly describe the observed effect of non–Gaussian errors, heteroskedasticity, and low sample size on coverage and CI width.
Solutions:
See the file hw5_2_solution.qmd.
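For concreteness, here is a minimal sketch of the simulation loop for case (a), taking \(\boldsymbol{\beta}_1\) to be the coefficient on the random column; the variable names are illustrative rather than taken from the solution file, and cases (b)–(f) change only the error draw and \(N\).
set.seed(42)
n_sim <- 5000
n_obs <- 500
beta <- c(1, 2)
alpha <- 0.2  # 80% confidence interval

# Draw a single X matrix: an intercept column plus one standard normal column.
xmat <- cbind(1, rnorm(n_obs))

covered <- rep(NA, n_sim)
widths <- rep(NA, n_sim)
for (s in 1:n_sim) {
  eps <- rnorm(n_obs)  # case (a); cases (b) and (c) change only this line
  yvec <- drop(xmat %*% beta + eps)
  fit <- lm(yvec ~ xmat - 1)  # xmat already contains the intercept column
  ci <- confint(fit, 2, level = 1 - alpha)  # homoskedastic, normal-theory CI
  covered[s] <- (ci[1] <= beta[2]) & (beta[2] <= ci[2])
  widths[s] <- ci[2] - ci[1]
}
print(mean(covered))  # estimated coverage C-hat; compare with 0.8
print(mean(widths))   # average CI width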
3 Interpreting tests and confidence intervals
Consider a regression \(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), with the given confidence intervals for \(\beta_1\). In each problem, let \(\hat{v} = \left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \hat{\sigma}^2\right)_{11}\) denote the estimated sampling variance of \(\hat{\beta}_1\), so that \(\sqrt{\hat{v}}\) is its standard error. Assume that we have a reasonably large number of observations.
Assume that the response is patient well-being, and \(x_{n1}\) is a treatment indicator for a medical procedure, so that \(\beta_1\) is the modeled effect of the procedure on the patient’s health. Here, \(\beta_1 > 0\) means the treatment has a beneficial effect on health. The other regressors are explanatory covariates. The treatment is randomly assigned, so that it is independent of the other covariates.
For each problem, discuss briefly whether the inferential statement is reasonable or unreasonable and why.
(a) Consider the one-sided interval \(I = (\hat{\beta}_1 - 1.64 \sqrt{\hat{v}}, \infty)\), and assume that \(\hat{\beta}_1 - 1.64 \sqrt{\hat{v}} = -3.4\). Since this interval contains \(\beta_1 = 10\), we conclude that the treatment might have a very positive effect.
(b) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval contains \(0\), therefore we know that the treatment has no effect.
(c) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval does not contain \(0\), i.e. \(\hat{\beta}_1 - 1.96 \sqrt{\hat{v}} > 0\), which is tentative evidence that the treatment has some non–zero, positive effect.
(d) Consider (c), except that when we plot the residuals \(\hat{\varepsilon}_n\) versus \(x_{n2}\), we find that the distribution of \(\hat{\varepsilon}_n\) has greater spread when \(\left|x_{n2}\right|\) is larger, and we draw the same conclusion as in (c).
(e) We observe a new patient with regressors \(x_{*}\), and construct a confidence interval \(I_{*}\) such that \(\mathbb{P}\left(x_{*}^\intercal\boldsymbol{\beta}\in I_{*}\right) \ge 0.95\). We expect that their unobserved well-being, \(y_*\), is probably in \(I_{*}\).
Solutions:
(a) This one–sided interval extends to \(+\infty\) by construction, so it is based on a test with no power against very large effects: containing \(\beta_1 = 10\) is not evidence for a large positive effect. The conclusion is unreasonable.
(b) Failing to reject the null does not mean the null is true. The conclusion is unreasonable.
(c) This is reasonable for the population studied, though tentative: the conclusion holds only if the other assumptions of the model are correct, such as homoskedasticity, no outliers or very high leverage points, and so on.
(d) Since there is heteroskedasticity, the intervals are likely to be invalid, and the conclusion is unreasonable.
(e) The interval \(I_{*}\) is constructed to cover the mean \(x_{*}^\intercal\boldsymbol{\beta}\), not the outcome \(y_*\); the variability of the residual \(\varepsilon_*\) must also be taken into account. The conclusion is unreasonable.
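To make (e) concrete, the following sketch contrasts the two intervals using the Problem 1 fit (assuming ols_fit is still in scope; the new observation is hypothetical). predict() with interval="prediction" accounts for \(\varepsilon_*\) and is therefore wider than the interval for the mean.
# A hypothetical new observation for the births regression:
new_obs <- data.frame(mage = 30, habit = "nonsmoker")

# Interval for the mean x*' beta; it is NOT designed to cover y*:
print(predict(ols_fit, newdata = new_obs, interval = "confidence", level = 0.95))

# Interval for the outcome y* itself; it adds the residual variability:
print(predict(ols_fit, newdata = new_obs, interval = "prediction", level = 0.95))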
4 Chi squared random variables
Let \(s \sim \chi^2_{K}\). Prove that
- \(\mathbb{E}\left[s\right] = K\)
- \(\mathrm{Var}\left(s\right) = 2 K\) (hint: if \(z\sim \mathcal{N}\left(0,\sigma^2\right)\), then \(\mathbb{E}\left[z^4\right] = 3\sigma^4\))
- If \(a_n \sim \mathcal{N}\left(0,\sigma^2\right)\) IID for \(n = 1,\ldots,N\), then \(\frac{1}{\sigma^2} \sum_{n=1}^N a_n^2 \sim \chi^2_{N}\)
- \(\frac{1}{K} s \rightarrow 1\) as \(K \rightarrow \infty\)
- \(\frac{1}{\sqrt{K}} (s - K) \rightarrow \mathcal{N}\left(0, 2\right)\) as \(K \rightarrow \infty\)
- Let \(\boldsymbol{a} \sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\) where \(\boldsymbol{a} \in \mathbb{R}^{K}\). Then \(\left\Vert\boldsymbol{a}\right\Vert_2^2 \sim \chi^2_{K}\)
- Let \(\boldsymbol{a} \sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{\Sigma}\right)\) where \(\boldsymbol{a} \in \mathbb{R}^{K}\). Then \(\boldsymbol{a}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{a} \sim \chi^2_{K}\)
Solutions:
Write \(s= \sum_{k=1}^K z_k^2\) for \(z_k \sim \mathcal{N}\left(0,1\right)\) IID.
- \(\mathbb{E}\left[\sum_{k=1}^K z_k^2\right] = \sum_{k=1}^K 1 = K\)
- \(\mathrm{Var}\left(s\right) = \mathbb{E}\left[\left( \sum_{k=1}^K (z_k^2 - 1)\right)^2\right] = K\, \mathbb{E}\left[\left(z_k^2 - 1\right)^2\right] = K\left(\mathbb{E}\left[z_k^4\right] - 2\, \mathbb{E}\left[z_k^2\right] + 1\right) = K(3 - 2 + 1) = 2 K\), where the cross terms vanish because the \(z_k\) are independent with \(\mathbb{E}\left[z_k^2 - 1\right] = 0\).
- For the third claim, \(a_n / \sigma \sim \mathcal{N}\left(0, 1\right)\) IID, so \(\frac{1}{\sigma^2} \sum_{n=1}^N a_n^2 = \sum_{n=1}^N (a_n / \sigma)^2 \sim \chi^2_{N}\) by the representation above.
- The fourth claim follows from the LLN applied to \(\frac{1}{K} \sum_{k=1}^K z_k^2\), since \(\mathbb{E}\left[z_k^2\right] = 1\).
- The fifth claim follows from the CLT applied to \(\frac{1}{\sqrt{K}} \sum_{k=1}^K (z_k^2 - 1)\), since \(\mathrm{Var}\left(z_k^2\right) = 2\).
- Use the fact that \(\left\Vert\boldsymbol{a}\right\Vert_2^2 = \sum a_k^2\).
- Use the fact that \(\boldsymbol{\Sigma}^{-1/2} \boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\).
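The first two facts are easy to check numerically; here is a small sketch (a sanity check, not part of the proof):
set.seed(1)
K <- 20
n_draws <- 1e5

# Represent chi-squared_K draws as sums of K squared standard normals.
z <- matrix(rnorm(n_draws * K), nrow = n_draws, ncol = K)
s <- rowSums(z^2)

print(c(mean(s), K))     # E[s] should be close to K
print(c(var(s), 2 * K))  # Var(s) should be close to 2K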