STAT151A Homework 2

Author

Your name here

1 Multivariate normal exercises

Let \(\boldsymbol{x}\sim \mathcal{N}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)\) where

\[ \boldsymbol{\mu}= \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \quad\textrm{and}\quad \boldsymbol{\Sigma}= \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 2 & 0.1 \\ 0 & 0.1 & 4 \\ \end{pmatrix}. \]

Let

\[ \boldsymbol{v}= \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \quad \boldsymbol{a}= \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix} \quad \boldsymbol{A}= \begin{pmatrix} 1 & 1 & 0 \\ 2 & 0 & 1 \\ \end{pmatrix} \]

Evaluate the following expressions.

  1. \(\mathbb{E}\left[\boldsymbol{v}^\intercal\boldsymbol{x}\right]\)
  2. \(\mathrm{Var}\left(\boldsymbol{v}^\intercal\boldsymbol{x}\right)\)
  3. \(\mathbb{E}\left[\boldsymbol{a}^\intercal\boldsymbol{x}\right]\)
  4. \(\mathrm{Var}\left(\boldsymbol{a}^\intercal\boldsymbol{x}\right)\)
  5. \(\mathbb{E}\left[\boldsymbol{v}\boldsymbol{x}^\intercal\right]\)
  6. \(\mathbb{E}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right]\)
  7. \(\mathbb{E}\left[\boldsymbol{x}^\intercal\boldsymbol{x}\right]\)
  8. \(\mathbb{E}\left[\mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal\right)\right]\)
  9. \(\mathbb{E}\left[\boldsymbol{A}\boldsymbol{x}\right]\)
  10. \(\mathrm{Cov}\left(\boldsymbol{A}\boldsymbol{x}\right)\)
  11. \(\mathbb{E}\left[(\boldsymbol{A}\boldsymbol{x}) (\boldsymbol{A}\boldsymbol{x})^\intercal\right]\)
Solutions

a_mat <- matrix(c(1, 2, 1, 0, 0, 1), nrow=2)
a <- matrix(c(2, 1, 1), nrow=3)
v <- matrix(c(0, 1, 0), nrow=3)
mu <- matrix(c(1, 2, 3), nrow=3)
sigmam <- matrix(c(1, 0.5, 0, 0.5, 2, 0.1, 0, 0.1, 4), nrow=3)
v %*% t(mu)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    2    3
[3,]    0    0    0

\(\mathbb{E}\left[\boldsymbol{v}^\intercal\boldsymbol{x}\right] = \boldsymbol{v}^\intercal\boldsymbol{\mu}=\) 2

\(\mathrm{Var}\left(\boldsymbol{v}^\intercal\boldsymbol{x}\right) = \boldsymbol{v}^\intercal\boldsymbol{\Sigma}\boldsymbol{v}=\) 2

\(\mathbb{E}\left[\boldsymbol{a}^\intercal\boldsymbol{x}\right] = \boldsymbol{a}^\intercal\boldsymbol{\mu}=\) 7

\(\mathrm{Var}\left(\boldsymbol{a}^\intercal\boldsymbol{x}\right) = \boldsymbol{a}^\intercal\boldsymbol{\Sigma}\boldsymbol{a}=\) 12.2

\(\mathbb{E}\left[\boldsymbol{v}\boldsymbol{x}^\intercal\right] = \boldsymbol{v}\boldsymbol{\mu}^\intercal=\)

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    2    3
[3,]    0    0    0

\(\mathbb{E}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] = \boldsymbol{\Sigma}+ \boldsymbol{\mu}\boldsymbol{\mu}^\intercal=\)

     [,1] [,2] [,3]
[1,]  2.0  2.5  3.0
[2,]  2.5  6.0  6.1
[3,]  3.0  6.1 13.0

\(\mathbb{E}\left[\boldsymbol{x}^\intercal\boldsymbol{x}\right] = \boldsymbol{\mu}^\intercal\boldsymbol{\mu}+ \mathrm{trace}\left(\boldsymbol{\Sigma}\right) =\) 21

\(\mathbb{E}\left[\mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal\right)\right] = \mathbb{E}\left[\boldsymbol{x}^\intercal\boldsymbol{x}\right] =\) 21

\(\mathbb{E}\left[\boldsymbol{A}\boldsymbol{x}\right] = \boldsymbol{A}\boldsymbol{\mu}=\)

     [,1]
[1,]    3
[2,]    5

\(\mathrm{Cov}\left(\boldsymbol{A}\boldsymbol{x}\right) = \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\intercal=\)

     [,1] [,2]
[1,]  4.0  3.1
[2,]  3.1  8.0

\(\mathbb{E}\left[(\boldsymbol{A}\boldsymbol{x}) (\boldsymbol{A}\boldsymbol{x})^\intercal\right] = \boldsymbol{A}(\boldsymbol{\mu}\boldsymbol{\mu}^\intercal+ \boldsymbol{\Sigma}) \boldsymbol{A}^\intercal=\)

     [,1] [,2]
[1,] 13.0 18.1
[2,] 18.1 33.0
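As a cross-check, the identities above can be verified numerically. The solutions' code is in R; the following is an equivalent sketch in pure Python (list-of-rows matrices, no external packages assumed).

```python
# Numeric check of the matrix identities above, mirroring the R code.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

mu = [[1], [2], [3]]
Sigma = [[1, 0.5, 0], [0.5, 2, 0.1], [0, 0.1, 4]]
a = [[2], [1], [1]]
A = [[1, 1, 0], [2, 0, 1]]

# Var(a^T x) = a^T Sigma a
var_ax = matmul(matmul(transpose(a), Sigma), a)[0][0]

# Cov(A x) = A Sigma A^T
cov_Ax = matmul(matmul(A, Sigma), transpose(A))

# E[(Ax)(Ax)^T] = A (mu mu^T + Sigma) A^T
mumuT = matmul(mu, transpose(mu))
second_moment = [[mumuT[i][j] + Sigma[i][j] for j in range(3)] for i in range(3)]
E_AxAxT = matmul(matmul(A, second_moment), transpose(A))
```

The printed values match the solutions: `var_ax` is 12.2, `cov_Ax` is [[4, 3.1], [3.1, 8]], and `E_AxAxT` is [[13, 18.1], [18.1, 33]] (up to floating-point rounding).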

2 Chi squared random variables and allies

Let \(s\sim \mathcal{\chi}^2_{K}\). Prove that

  • \(\mathbb{E}\left[s\right] = K\)
  • \(\mathrm{Var}\left(s\right) = 2 K\) (hint: if \(z\sim \mathcal{N}\left(0,\sigma^2\right)\), then \(\mathbb{E}\left[z^4\right] = 3\sigma^4\))
  • If \(a_n \sim \mathcal{N}\left(0,\sigma^2\right)\) IID for \(1,\ldots,N\), then \(\frac{1}{\sigma^2} \sum_{n=1}^Na_n^2 \sim \mathcal{\chi}^2_{N}\)
  • \(\frac{1}{K} s\rightarrow 1\) in probability as \(K \rightarrow \infty\)
  • \(\frac{1}{\sqrt{K}} (s- K) \rightarrow \mathcal{N}\left(0, 2\right)\) in distribution as \(K \rightarrow \infty\)
  • Let \(\boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\) where \(\boldsymbol{a}\in \mathbb{R}^{K}\). Then \(\left\Vert\boldsymbol{a}\right\Vert_2^2 \sim \mathcal{\chi}^2_{K}\)
  • Let \(\boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{\Sigma}\right)\) where \(\boldsymbol{a}\in \mathbb{R}^{K}\). Then \(\boldsymbol{a}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{a}\sim \mathcal{\chi}^2_{K}\)

Now, let \(t\sim \mathrm{StudentT}_{K}\) with \(K > 2\).

  • Prove that \(\mathbb{E}\left[t\right] = 0\).
  • Prove that \(\mathrm{Var}\left(t\right) > 1\). (You may use the explicit formula or Jensen’s inequality.)
  • Argue that a standard confidence interval based on a Student T distribution will always be wider than a confidence interval for the same data based on a normal assumption that ignores the variability of \(\hat{\sigma}^2\).
  • For a generic vector \(\boldsymbol{a}\), derive a Student-t confidence interval for \(\boldsymbol{a}^\intercal{\boldsymbol{\beta}^{*}}\).

Now, let \(\phi \sim \mathrm{FDist}_{K_1,K_2}\).

  • For large \(K_2\), show that the mean of the F distribution is approximately \(1\).
  • For large \(K_2\), show that the variance of the F distribution is approximately \(2 / K_1\). (Hint: use the definition of a \(\mathcal{\chi}^2_{K_1}\) as the sum of squares of independent standard normals.)
  • If we are testing the null that a set of coefficients are zero, argue why it only makes sense to reject for large values of our F-statistic.
Solutions

Write \(s= \sum_{k=1}^K z_k^2\) for \(z_k \sim \mathcal{N}\left(0,1\right)\) IID.

  • \(\mathbb{E}\left[\sum_{k=1}^K z_k^2\right] = \sum_{k=1}^K 1 = K\)
  • \(\mathbb{E}\left[\left( \sum_{k=1}^K (z_k^2 - 1)\right)^2\right] = K \mathbb{E}\left[(z_k^2 - 1)^2\right] = K(\mathbb{E}\left[z_k^4\right] - 2 \mathbb{E}\left[z_k^2\right]+ 1) = K(3 - 2 + 1) = 2 K\). (Here we used the independence of \(z_k\) to drop cross terms, and identical distribution to factor out \(K\).)
  • If \(a_n \sim \mathcal{N}\left(0, \sigma^2\right)\), then \(a_n / \sigma \sim \mathcal{N}\left(0, 1\right)\), so \((a_n / \sigma)^2 \sim \mathcal{\chi}^2_{1}\). Since the \(a_n\) are IID, the sum of \(N\) independent \(\mathcal{\chi}^2_{1}\) random variables is \(\mathcal{\chi}^2_{N}\), giving \(\frac{1}{\sigma^2} \sum_{n=1}^Na_n^2 \sim \mathcal{\chi}^2_{N}\).
  • This follows from the LLN applied to \(\frac{1}{K}s= \frac{1}{K}\sum_{k=1}^K z_k^2 \rightarrow \mathbb{E}\left[z_k^2\right] = 1\).
  • This follows from the CLT applied to \(\frac{1}{\sqrt{K}} \sum_{k=1}^K (z_k^2 - 1) \rightarrow \mathcal{N}\left(0, \mathrm{Var}\left(z_k^2\right)\right) = \mathcal{N}\left(0, 2\right)\).
  • Use the fact that \(\left\Vert\boldsymbol{a}\right\Vert_2^2 = \sum a_k^2\) where each \(a_k \sim \mathcal{N}\left(0,1\right)\) IID, so \(\left\Vert\boldsymbol{a}\right\Vert_2^2\) is the sum of \(K\) squared standard normals, which is \(\mathcal{\chi}^2_{K}\) by definition.
  • Use the fact that \(\boldsymbol{\Sigma}^{-1/2} \boldsymbol{a}\sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)\), so \(\boldsymbol{a}^\intercal\boldsymbol{\Sigma}^{-1} \boldsymbol{a}= (\boldsymbol{\Sigma}^{-1/2} \boldsymbol{a})^\intercal(\boldsymbol{\Sigma}^{-1/2} \boldsymbol{a}) = \left\Vert\boldsymbol{\Sigma}^{-1/2}\boldsymbol{a}\right\Vert_2^2 \sim \mathcal{\chi}^2_{K}\) by the previous result.
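These moment results can be sanity-checked by simulation. A minimal Monte Carlo sketch in Python (the degrees of freedom, sample size, and seed are arbitrary choices, not part of the proof):

```python
# Monte Carlo check: the sample mean and variance of chi-squared_K draws
# should be close to K and 2K.
import random

random.seed(0)
K, n_draws = 5, 200_000
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(K)) for _ in range(n_draws)]

mean = sum(draws) / n_draws                          # should be near K = 5
var = sum((d - mean) ** 2 for d in draws) / n_draws  # should be near 2K = 10
```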

Student T:

Recall \(t= z/ \sqrt{s/ K}\) where \(z\sim \mathcal{N}\left(0,1\right)\) and \(s\sim \mathcal{\chi}^2_{K}\) independently.

  • \(\mathbb{E}\left[t\right] = \mathbb{E}\left[z\right] \cdot \mathbb{E}\left[1/\sqrt{s/K}\right] = 0\) since \(z\) and \(s\) are independent (and \(\mathbb{E}\left[1/\sqrt{s/K}\right]\) is finite because \(K > 2\)).
  • By independence, \(\mathrm{Var}\left(t\right) = \mathbb{E}\left[t^2\right] = \mathbb{E}\left[z^2\right] \mathbb{E}\left[K / s\right] = \mathbb{E}\left[K / s\right]\). Since \(1/x\) is convex, Jensen’s inequality gives \(\mathbb{E}\left[1/s\right] > 1/\mathbb{E}\left[s\right] = 1/K\), so \(\mathrm{Var}\left(t\right) = K \mathbb{E}\left[1/s\right] > 1\). (The explicit formula gives \(\mathrm{Var}\left(t\right) = K / (K - 2)\) for \(K > 2\).)
  • Since the Student T critical values are larger than the normal critical values (because \(\mathrm{Var}\left(t\right) > 1\) implies heavier tails), the Student T confidence interval \(\hat{\beta}\pm t_{K, \alpha/2} \cdot \hat{\sigma}/ \sqrt{N}\) is always wider than the normal-based interval \(\hat{\beta}\pm z_{\alpha/2} \cdot \hat{\sigma}/ \sqrt{N}\), since \(t_{K, \alpha/2} > z_{\alpha/2}\).
  • For a generic vector \(\boldsymbol{a}\), \(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\sim \mathcal{N}\left(\boldsymbol{a}^\intercal{\boldsymbol{\beta}^{*}}, \sigma^2 \boldsymbol{a}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{a}\right)\). With \(\sigma^2\) estimated by \(\hat{\sigma}^2\), we have \(\frac{\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}- \boldsymbol{a}^\intercal{\boldsymbol{\beta}^{*}}}{\hat{\sigma}\sqrt{\boldsymbol{a}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{a}}} \sim \mathrm{StudentT}_{N - P}\). So a level-\(\alpha\) confidence interval for \(\boldsymbol{a}^\intercal{\boldsymbol{\beta}^{*}}\) is \[\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\pm t_{N-P, \alpha/2} \cdot \hat{\sigma}\sqrt{\boldsymbol{a}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{a}}.\]
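The representation \(t= z/ \sqrt{s/ K}\) also gives a direct way to simulate Student-t draws and check that \(\mathrm{Var}\left(t\right) = K/(K-2) > 1\). A Python sketch (degrees of freedom, sample size, and seed are arbitrary):

```python
# Simulate t = z / sqrt(s/K) with z standard normal and s ~ chi-squared_K,
# and check the variance against K / (K - 2).
import random

random.seed(1)
K, n_draws = 10, 100_000

def draw_t(K):
    z = random.gauss(0, 1)
    s = sum(random.gauss(0, 1) ** 2 for _ in range(K))
    return z / (s / K) ** 0.5

draws = [draw_t(K) for _ in range(n_draws)]
var_t = sum(d * d for d in draws) / n_draws  # near K/(K-2) = 1.25, and > 1
```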

F distribution:

Recall \(\phi = \frac{s_1 / K_1}{s_2 / K_2}\) where \(s_1 \sim \mathcal{\chi}^2_{K_1}\) and \(s_2 \sim \mathcal{\chi}^2_{K_2}\) independently.

  • \(\mathbb{E}\left[\phi\right] = \mathbb{E}\left[s_1 / K_1\right] \cdot \mathbb{E}\left[K_2 / s_2\right] = 1 \cdot \frac{K_2}{K_2 - 2}\). For large \(K_2\), \(\frac{K_2}{K_2-2} \approx 1\).
  • For large \(K_2\), \(s_2/K_2 \rightarrow 1\) by LLN, so \(\phi \approx s_1 / K_1 = \frac{1}{K_1}\sum_{k=1}^{K_1} z_k^2\). Then \(\mathrm{Var}\left(\phi\right) \approx \mathrm{Var}\left(s_1/K_1\right) = \frac{1}{K_1^2} \cdot 2K_1 = 2/K_1\).
  • Under the null \({\boldsymbol{\beta}^{*}}= \boldsymbol{0}\), the coefficients \(\hat{\boldsymbol{\beta}}\) are estimated from pure noise, so \(\hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}\) is small and \(\phi \approx 1\). Under the alternative \({\boldsymbol{\beta}^{*}}\ne \boldsymbol{0}\), \(\hat{\boldsymbol{\beta}}\) captures the true signal, inflating \(\hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}\) and making \(\phi\) large. So it only makes sense to reject for large \(\phi\).
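A simulation sketch of the large-\(K_2\) approximations, built directly from the ratio-of-chi-squareds definition of \(\phi\) (Python; the choices \(K_1 = 4\), \(K_2 = 200\), and the seed are illustrative):

```python
# For large K_2, phi = (s1/K1) / (s2/K2) behaves like chi2_{K1}/K1:
# mean near 1, variance near 2/K1.
import random

random.seed(2)
K1, K2, n_draws = 4, 200, 10_000

def chi2(df):
    """Draw one chi-squared variate as a sum of squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

draws = [(chi2(K1) / K1) / (chi2(K2) / K2) for _ in range(n_draws)]
mean = sum(draws) / n_draws                          # near 1
var = sum((d - mean) ** 2 for d in draws) / n_draws  # near 2/K1 = 0.5
```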

3 Fit and regressors

Given a regression on \(\boldsymbol{X}\) with \(P\) regressors and \(N\) data points, and the corresponding \(\boldsymbol{Y}\), \(\hat{\boldsymbol{Y}}\), and \(\hat{\varepsilon}\), define the following quantities: \[ \begin{aligned} RSS :={}& \hat{\varepsilon}^\intercal\hat{\varepsilon}& \textrm{(Residual sum of squares)}\\ TSS :={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}& \textrm{(Total sum of squares)}\\ ESS :={}& \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}& \textrm{(Explained sum of squares)}\\ R^2 :={}& \frac{ESS}{TSS}. \end{aligned} \]

  1. Prove that \(RSS + ESS = TSS\).
  2. Express \(R^2\) in terms of \(TSS\) and \(RSS\).
  3. What is \(R^2\) when we include no regressors? (\(P = 0\))
  4. What is \(R^2\) when we include \(N\) linearly independent regressors? (\(P=N\))
  5. Can \(R^2\) ever decrease when we add a regressor? If so, how?
  6. Can \(R^2\) ever stay the same when we add a regressor? If so, how?
  7. Can \(R^2\) ever increase when we add a regressor? If so, how?
  8. Does a high \(R^2\) mean the regression is useful? (You may argue by example.)
  9. Does a low \(R^2\) mean the regression is not useful? (You may argue by example.)
Solutions
  1. This follows from \(\hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}= \boldsymbol{0}\): expanding, \(TSS = (\hat{\boldsymbol{Y}}+ \hat{\varepsilon})^\intercal(\hat{\boldsymbol{Y}}+ \hat{\varepsilon}) = \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}+ 2 \hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}+ \hat{\varepsilon}^\intercal\hat{\varepsilon}= ESS + RSS\).
  2. \(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\)
  3. \(R^2 = 0\), since with no regressors \(\hat{\boldsymbol{Y}}= \boldsymbol{0}\) and so \(ESS = 0\).
  4. \(R^2 = 1\), since with \(N\) linearly independent regressors the fit is exact and \(RSS = 0\).
  5. No, it cannot, since you project onto the same or larger subspace.
  6. Yes, if you add a regressor column that is collinear with (i.e., in the span of) the existing columns.
  7. Yes, if you add a linearly independent regressor column.
  8. No, you might overfit: for example, with \(P = N\) linearly independent regressors, \(R^2 = 1\) even when \(\boldsymbol{Y}\) is pure noise.
  9. No, you might have a low signal-to-noise ratio: the regression can estimate a real, useful effect precisely even when the residual variance is large.
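Items 5-7 can be illustrated numerically: the models are nested, so \(RSS\) cannot increase and \(R^2\) cannot decrease when a regressor is added. A Python sketch with simulated data (the data-generating process is an arbitrary choice), solving the one- and two-regressor normal equations by hand:

```python
# R^2 = 1 - RSS/TSS (with the uncentered TSS defined above) for nested
# models: one regressor, then two.
import random

random.seed(3)
N = 200
x1 = [random.gauss(0, 1) for _ in range(N)]
x2 = [random.gauss(0, 1) for _ in range(N)]
y = [2 * a + 0.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

tss = sum(v * v for v in y)

# One regressor: beta_hat = (x1^T y) / (x1^T x1)
b1 = sum(a * v for a, v in zip(x1, y)) / sum(a * a for a in x1)
rss1 = sum((v - b1 * a) ** 2 for a, v in zip(x1, y))

# Two regressors: solve the 2x2 normal equations X^T X beta = X^T y
s11 = sum(a * a for a in x1); s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(b * b for b in x2)
t1 = sum(a * v for a, v in zip(x1, y)); t2 = sum(b * v for b, v in zip(x2, y))
det = s11 * s22 - s12 * s12
c1 = (s22 * t1 - s12 * t2) / det
c2 = (s11 * t2 - s12 * t1) / det
rss2 = sum((v - c1 * a - c2 * b) ** 2 for a, b, v in zip(x1, x2, y))

r2_one = 1 - rss1 / tss
r2_two = 1 - rss2 / tss  # never smaller than r2_one
```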

4 Power

For this problem, consider for simplicity regressing on a scalar \(y_n \sim \beta x_n\) without a constant. Suppose that \(y_n = {\beta^{*}}x_n + \varepsilon_n\) where \(\varepsilon_n\) are IID \(\mathcal{N}\left(0, \sigma^2\right)\), and \(x_n\) are deterministic and satisfy \(\frac{1}{N} \sum_{n=1}^Nx_n^2 \rightarrow v> 0\). Take \(\sigma^2\) as known, and set \(s^2 = \frac{1}{N} \sum_{n=1}^Nx_n^2\).

For the usual OLS estimator \(\hat{\beta}\) and null hypothesis \(H_0: {\beta^{*}}= \beta_0\), define the test statistic \[ t = \sqrt{N} \frac{\hat{\beta}- \beta_0}{\sigma / s}. \]

Let \(q_\alpha\) denote the \(1 - \alpha\)–th quantile of the standard normal, i.e. \(\Phi(q_\alpha) = 1 - \alpha\). Define the following rejection regions:

  • \(R_u = (q_\alpha, \infty)\)
  • \(R_l = (-\infty, -q_\alpha)\)
  • \(R_{ul} = (-\infty, -q_{\alpha/2}) \bigcup (q_{\alpha/2}, \infty)\)
  • \(R_{crazy} = (0, q_{1/2 - \alpha})\)

(a) Show that, if \(z\sim \mathcal{N}\left(0,1\right)\), then \(p(z\in R) = \alpha\) for each rejection region \(R\).

(b) Show that the test “reject if \(t \in R\)” is a valid level-\(\alpha\) test for \(H_0\), for each \(R\).

(c) Now, suppose that \(\beta_0 = 0\) but \({\beta^{*}}\) is very large and positive (e.g. \(1000\)). Which regions reject \(H_0\)? Which regions do not?

(d) Now, suppose that \(\beta_0 = 0\) but \({\beta^{*}}\) is very large and negative (e.g. \(-1000\)). Which regions reject \(H_0\)? Which regions do not?

(e) Now, suppose that \(\beta_0 = 0\) and \({\beta^{*}}\) is very small and positive relative to its estimated standard deviation, but not exactly zero (e.g. \(1 / 1000\)). What is the probability of each region rejecting \(H_0\)?

(f) Consider the following confidence intervals:

  • \(C_{u} = (-\infty, \hat{\beta}+ \frac{1}{\sqrt{N}} q_\alpha \sigma / s)\)
  • \(C_{l} = (\hat{\beta}- \frac{1}{\sqrt{N}} q_\alpha \sigma / s, \infty)\)
  • \(C_{lu} = (\hat{\beta}- \frac{1}{\sqrt{N}} q_{\alpha / 2} \sigma / s, \hat{\beta}+ \frac{1}{\sqrt{N}} q_{\alpha / 2} \sigma / s)\)
  • \(C_{crazy} = (-\infty, \hat{\beta}- \frac{1}{\sqrt{N}} q_{1/2 - \alpha} \sigma / s) \bigcup \, (\hat{\beta}, \infty)\)

Show that each confidence interval is precisely the set of null hypotheses \(\beta_0\) that you would not reject according to the corresponding rejection region.

(g) Show that \(p({\beta^{*}}\in C) = 1 - \alpha\) for each confidence interval.

(h) In ordinary language, describe reasons for and against using each confidence interval in terms of what kinds of alternatives you are interested in having power against.

Solutions

(a) With \(z\sim \mathcal{N}\left(0,1\right)\):

  • \(p(z\in R_u) = 1 - \Phi(q_\alpha) = 1 - (1 - \alpha) = \alpha\)
  • \(p(z\in R_l) = \Phi(-q_\alpha) = 1 - \Phi(q_\alpha) = \alpha\)
  • \(p(z\in R_{ul}) = \Phi(-q_{\alpha/2}) + 1 - \Phi(q_{\alpha/2}) = 2\alpha/2 = \alpha\)
  • \(p(z\in R_{crazy}) = \Phi(q_{1/2 - \alpha}) - \Phi(0) = 1 - (1/2 - \alpha) - 1/2 = \alpha\)

(b) Under \(H_0: {\beta^{*}}= \beta_0\), the test statistic \(t= \frac{\hat{\beta}- \beta_0}{\sigma / (s\sqrt{N})}\) is exactly \(\mathcal{N}\left(0,1\right)\) since \(\hat{\beta}- \beta_0 \sim \mathcal{N}\left(0, \sigma^2/(N s^2)\right)\). So \(p(t\in R \mid H_0) = p(z\in R) = \alpha\) by part (a).

(c) We can plug in \[ t = \frac{\hat{\beta}- {\beta^{*}}+ {\beta^{*}}- \beta_0}{\sigma / (\sqrt{N} s)} = \frac{\hat{\beta}- {\beta^{*}}}{\sigma / (\sqrt{N} s)} + \frac{{\beta^{*}}- \beta_0}{\sigma / (\sqrt{N} s)} = \mathcal{N}\left(0,1\right) + \sqrt{N} \frac{{\beta^{*}}- \beta_0}{\sigma / s}. \]

From this, it follows that if \({\beta^{*}}\gg 0\), then \(t \gg 0\) with high probability. So

  • \(R_u\): Rejects
  • \(R_l\): Does not reject
  • \(R_{ul}\): Rejects
  • \(R_{crazy}\): Does not reject

(d) Reasoning as in (c), \(t \ll 0\), so

  • \(R_u\): Does not reject
  • \(R_l\): Rejects
  • \(R_{ul}\): Rejects
  • \(R_{crazy}\): Does not reject

(e) Reasoning as above, for \({\beta^{*}}\approx 0\), we approximately have \(t \sim \mathcal{N}\left(0,1\right)\). So

  • \(R_u\): Rejects with probability \(\alpha\)
  • \(R_l\): Rejects with probability \(\alpha\)
  • \(R_{ul}\): Rejects with probability \(\alpha\)
  • \(R_{crazy}\): Rejects with probability only slightly higher than \(\alpha\)

(f) For each \(R\), we find the set of \(\beta_0\) such that \(t\notin R\). First, we can solve for \(\beta_0\) in the definition of \(t\):

\[ \beta_0 = \hat{\beta}- \frac{1}{\sqrt{N}} (\sigma / s) t, \] so that, for any \(t_0\), \[ t < t_0 \quad \Leftrightarrow \quad \beta_0 > \hat{\beta}- \frac{1}{\sqrt{N}} (\sigma / s) t_0. \]

The results follow from plugging this expression into the definitions of the rejection regions.

Note that because \(t\) was constructed by subtracting \(\beta_0\), the inequalities defining the rejection regions flip direction when solved for \(\beta_0\). We could equally well have defined the statistic as the negative of \(t\), in which case the rejection regions for the statistic and the confidence intervals for \(\beta_0\) would point in the same direction.

(g) When the null is true, \(p(t\in R) = \alpha\). And, by (f), \(t\in R\) if and only if \({\beta^{*}}\notin C\). So \[ \begin{aligned} p({\beta^{*}}\in C) ={}& p(t\notin R) & \textrm{(always true)} \\ ={}& 1 - \alpha. & \textrm{(under $H_0$)} \end{aligned} \]

(h)

We see that each region has power against different alternatives. Specifically, if \({\beta^{*}}\) is larger than \(\beta_0\), then \(R_u\) rejects but \(R_l\) does not, and \(R_u\) rejects more often than \(R_{ul}\). The rejection region \(R_{crazy}\) only has extra power against alternatives that are very close to zero — and not much extra power, at that. This is the reason \(R_{crazy}\) is not very useful, despite giving rise to a valid hypothesis test.
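This power comparison can be made concrete: when \(t \sim \mathcal{N}\left(\lambda, 1\right)\) with noncentrality \(\lambda = \sqrt{N} ({\beta^{*}}- \beta_0) / (\sigma / s)\), each region's rejection probability is an explicit normal-CDF expression. A Python sketch (the quantile values are the standard numerical approximations for \(\alpha = 0.05\), and \(\lambda = 2\) is an arbitrary alternative):

```python
# Exact rejection probabilities p(t in R) for t ~ N(lam, 1), for each
# rejection region, using the normal CDF via math.erf.
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# q_.05 ~ 1.645, q_.025 ~ 1.960, q_.45 ~ 0.126 (i.e., q_{1/2 - alpha})
q_a, q_a2, q_half_a = 1.645, 1.960, 0.126
lam = 2.0  # sqrt(N) (beta* - beta0) / (sigma / s) under the alternative

power_u = 1 - Phi(q_a - lam)                        # R_u
power_l = Phi(-q_a - lam)                           # R_l
power_ul = Phi(-q_a2 - lam) + 1 - Phi(q_a2 - lam)   # R_ul
power_crazy = Phi(q_half_a - lam) - Phi(-lam)       # R_crazy
```

For this positive alternative, `power_u` exceeds `power_ul`, `power_l` is essentially zero, and `power_crazy` is even smaller than \(\alpha\), matching the discussion above.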

5 Interpreting tests and confidence intervals

Consider a regression \(y_n \sim \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\), with the given confidence intervals for \(\beta_1\). In each problem, let \(\hat{v}= \left( (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \hat{\sigma}^2\right)_{11}\) denote the estimated variance of \(\hat{\boldsymbol{\beta}}_1\), so that \(\sqrt{\hat{v}}\) is its standard error. Assume that we have a reasonably large number of observations.

Assume that the response is patient well-being, and \(x_{n1}\) is a treatment indicator for a medical procedure, so that \(\beta_1\) is the modeled effect of the procedure on the patient’s health. Here, \(\beta_1 > 0\) means the treatment has a positive (good) health benefit. The other regressors are explanatory covariates. The treatment is randomly assigned, so that it is independent of the other covariates.

For each problem, discuss briefly whether the inferential statement is reasonable or unreasonable and why.

(a) Consider the one-sided interval \(I = (\hat{\beta}_1 - 1.64 \sqrt{\hat{v}}, \infty)\), and assume that \(\hat{\beta}_1 - 1.64 \sqrt{\hat{v}} = - 3.4\). Since this interval contains \(\beta_1 = 10\), we conclude that the treatment might have a very positive effect.

(b) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval contains \(0\), therefore we know that the treatment has no effect.

(c) Consider the two-sided interval \(I = \hat{\beta}_1 \pm 1.96 \sqrt{\hat{v}}\). We find that the interval does not contain \(0\), i.e. \(\hat{\beta}_1 - 1.96 \sqrt{\hat{v}} > 0\), which is tentative evidence that the treatment has some non-zero, positive effect.

(d) Consider (c), except that we plot the residuals \(\hat{\varepsilon}_n\) versus \(x_{n2}\) and find that the distribution of \(\hat{\varepsilon}_n\) has greater spread when \(\left|x_{n2}\right|\) is larger.

(e) We observe a new patient with regressors \(x_{*}\), and construct a confidence interval \(I_{*}\) such that \(\mathbb{P}\left(x_{*}^\intercal\boldsymbol{\beta}\in I_{*}\right) \ge 0.95\). We expect that their unobserved well-being, \(y_*\), is probably in \(I_{*}\).

Solutions

(a) This one-sided interval always extends to \(+\infty\): the underlying test has no power against large positive effects, so the fact that the interval contains \(\beta_1 = 10\) is no evidence that the effect is large. The conclusion is unreasonable.

(b) Failing to reject the null does not mean the null is true. The conclusion is unreasonable.

(c) This is reasonable for the population studied, though tentative: the conclusion depends on the other assumptions of the model being correct, such as homoskedasticity, the absence of outliers or very high leverage points, and so on.

(d) Since there is heteroskedasticity, the intervals are likely to be invalid, and the conclusion is unreasonable.

(e) The interval \(I_{*}\) does not cover \(y_*\); it only covers the mean response \(x_{*}^\intercal\boldsymbol{\beta}\). The variability of the residual \(\varepsilon_*\) must also be taken into account. The conclusion is unreasonable.
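Point (e) can be checked by simulation: an interval built for the mean response \(x_{*}^\intercal\boldsymbol{\beta}\) covers that mean at its nominal rate but badly under-covers the new outcome \(y_*\). A Python sketch with a single scalar regressor and known \(\sigma = 1\) (all settings are arbitrary illustrative choices):

```python
# Coverage of the mean response x* beta vs. the new outcome y* for an
# interval sized for the mean only (sigma = 1 known, scalar regressor).
import random

random.seed(4)
N, beta, n_reps = 50, 1.0, 4000
x = [random.gauss(0, 1) for _ in range(N)]
sxx = sum(v * v for v in x)
x_star = 1.0

cover_mean = cover_y = 0
for _ in range(n_reps):
    y = [beta * v + random.gauss(0, 1) for v in x]
    beta_hat = sum(a * b for a, b in zip(x, y)) / sxx
    # sd of x* beta_hat is |x*| sigma / sqrt(sxx), with sigma = 1
    half = 1.96 * abs(x_star) / sxx ** 0.5
    lo, hi = x_star * beta_hat - half, x_star * beta_hat + half
    cover_mean += lo <= x_star * beta <= hi
    y_star = x_star * beta + random.gauss(0, 1)  # a new observation
    cover_y += lo <= y_star <= hi

rate_mean = cover_mean / n_reps  # near the nominal 0.95
rate_y = cover_y / n_reps        # far below 0.95
```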

6 When F-tests and t-tests disagree

In this problem we show how t-tests, which perform inference on individual coefficients, and F-tests, which perform inference on groups of coefficients, can appear to disagree with one another.

For simplicity, consider a setting where the regressor matrix \(\boldsymbol{X}\) is orthogonal and normalized so that \(\boldsymbol{X}^\intercal\boldsymbol{X}= N \boldsymbol{I}_P\). We will also assume that for some \({\boldsymbol{\beta}^{*}}\), \(\boldsymbol{Y}= \boldsymbol{X}{\boldsymbol{\beta}^{*}}+ \boldsymbol{\varepsilon}\) where \(\varepsilon_n \sim \mathcal{N}\left(0, 1\right)\) independently of each other and of \(\boldsymbol{X}\). You may also assume that \(N\) is large enough so that \(\hat{\sigma}\approx \sigma = 1\).

Throughout, assume that we are doing two-sided t tests and one-sided F tests with level \(\alpha = 0.05\).

(a) Under our assumptions show that the t test statistic for the null \(H_0: {\boldsymbol{\beta}^{*}}_1 = 0\) is given by

\[ t_1 := \sqrt{N} \frac{\hat{\boldsymbol{\beta}}_1}{\hat{\sigma}} \approx \sqrt{N}\hat{\boldsymbol{\beta}}_1. \]

(b) Under our assumptions show that the F test statistic for the null \(F_0: {\boldsymbol{\beta}^{*}}= \boldsymbol{0}\) is given by \[ \phi := \frac{N}{P} \frac{\hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}}{\hat{\sigma}^2} \approx \frac{N}{P} \hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}. \]

(c) Now, suppose that \({\boldsymbol{\beta}^{*}}_1 = \delta / \sqrt{N}\), and that \({\boldsymbol{\beta}^{*}}_p = 0\) for \(p=2,\ldots,P\). Show that the components of \(\hat{\boldsymbol{\beta}}\) are independent Gaussians, each with variance \(1/N\), where \(\hat{\boldsymbol{\beta}}_1\) has mean \(\delta / \sqrt{N}\) and \(\hat{\boldsymbol{\beta}}_p\) has mean \(0\) for \(p \ge 2\).

(d) Under (c), show that \(t_1\) rejects \(H_0\) with high probability as \(\delta \rightarrow \infty\), no matter what \(P\) and \(N\) are.

(e) Under (c), show that for fixed \(\delta\) and \(N\), the F test fails to reject with non-vanishing probability as \(P \rightarrow N\). Hint: You can use the fact that \(\frac{1}{P} \mathcal{\chi}^2_{P} \approx \mathcal{N}\left(1, \frac{2}{P}\right)\) by the CLT.

(f) Combining (d) and (e), we have constructed a situation where we reject the null that \({\boldsymbol{\beta}^{*}}_1 = 0\) but fail to reject the null that \({\boldsymbol{\beta}^{*}}= \boldsymbol{0}\). But these results appear to contradict one another, since \({\boldsymbol{\beta}^{*}}_1 \ne 0\) implies \({\boldsymbol{\beta}^{*}}\ne \boldsymbol{0}\). How is this possible?

Solutions

Note throughout that the distribution of \(\hat{\boldsymbol{\beta}}\) is given by \[ \hat{\boldsymbol{\beta}}\sim \mathcal{N}\left({\boldsymbol{\beta}^{*}}, \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\right) = \mathcal{N}\left({\boldsymbol{\beta}^{*}}, \frac{1}{N} I_P\right). \]

(a) The t-statistic for \(H_0: {\boldsymbol{\beta}^{*}}_1 = 0\) is \[ t_1 = \frac{\hat{\boldsymbol{\beta}}_1 - 0}{\hat{\sigma}\sqrt{((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1})_{11}}} = \frac{\hat{\boldsymbol{\beta}}_1}{\hat{\sigma}/ \sqrt{N}} = \sqrt{N} \frac{\hat{\boldsymbol{\beta}}_1}{\hat{\sigma}} \approx \sqrt{N} \hat{\boldsymbol{\beta}}_1. \]

(b) The F-statistic for \(F_0: {\boldsymbol{\beta}^{*}}= \boldsymbol{0}\) is \[ \phi = \frac{\hat{\boldsymbol{\beta}}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X}) \hat{\boldsymbol{\beta}}/ P} {\hat{\sigma}^2} = \frac{N \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{I}_P \hat{\boldsymbol{\beta}}/ P} {\hat{\sigma}^2} = \frac{N}{P} \frac{\hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}}{\hat{\sigma}^2} \approx \frac{N}{P} \hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}. \]

(c) This follows by plugging into the above formula.

(d) Under (c), we can write \[ \hat{\boldsymbol{\beta}}_1 = \frac{\delta}{\sqrt{N}} + \mathcal{N}\left(0, 1/N\right), \] so \[ t_1 \approx \delta + \mathcal{N}\left(0,1\right). \] By making \(\delta\) large, we can make the probability of rejection as high as we like. This holds regardless of \(P\), for \(N\) large enough that we can use the approximation \(\hat{\sigma}\approx \sigma\).

(e) Under (c), we can write \(\hat{\boldsymbol{\beta}}^\intercal\hat{\boldsymbol{\beta}}= \frac{\delta^2}{N} + \mathcal{\chi}^2_{P} / N\), so that \[ \phi \approx \frac{\delta^2}{P} + \frac{1}{P}\mathcal{\chi}^2_{P} \approx \frac{\delta^2}{P} + \mathcal{N}\left(1, \frac{2}{P}\right). \] Since the rejection region will be centered at \(1\) and of width \(q / \sqrt{P}\) for some \(q\), we will fail to reject when \[ \frac{\delta^2}{P} < \frac{q}{\sqrt{P}} \quad\Leftrightarrow\quad \delta^2 < q \sqrt{P}. \] We need to choose \(N\) large enough that \(P\) can be made large enough, but for any \(\delta\) we can eventually find \(N\) and \(P\) large enough that the F test fails to reject with nonzero probability.

(f) The F-test and t-test have power against different kinds of alternatives. Specifically, the t-test has the advantage of looking only at exactly the right coefficient, whereas the F-test is performing a simultaneous test of all the coefficients at once, most of which are zero.
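The disagreement in (f) can be reproduced by simulating \(\hat{\boldsymbol{\beta}}\) directly from its sampling distribution \(\mathcal{N}\left({\boldsymbol{\beta}^{*}}, \frac{1}{N} \boldsymbol{I}_P\right)\) derived above, rather than from raw data. A Python sketch (the values of \(N\), \(P\), and \(\delta\) are arbitrary; the F cutoff uses the normal approximation from (e)):

```python
# With beta*_1 = delta/sqrt(N) and all other coefficients zero, the t-test
# on beta_1 rejects nearly always, while the F-test on all P coefficients
# usually does not (sigma = 1 known, as assumed above).
import math, random

random.seed(5)
N, P, delta, n_reps = 10_000, 400, 6.0, 1000
t_thresh = 1.96                          # two-sided t test, level 0.05
f_thresh = 1 + 1.645 * math.sqrt(2 / P)  # normal approximation to the F cutoff

t_rej = f_rej = 0
sd = 1 / math.sqrt(N)
for _ in range(n_reps):
    # beta_hat_1 has mean delta/sqrt(N); the rest have mean 0.
    beta_hat = [random.gauss(delta / math.sqrt(N), sd)]
    beta_hat += [random.gauss(0, sd) for _ in range(P - 1)]
    t1 = math.sqrt(N) * beta_hat[0]
    phi = (N / P) * sum(b * b for b in beta_hat)
    t_rej += abs(t1) > t_thresh
    f_rej += phi > f_thresh

t_rate = t_rej / n_reps  # close to 1
f_rate = f_rej / n_reps  # well below 1
```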