$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Confidence intervals and hypothesis testing


Goals

  • Understand the relationship between confidence intervals and hypothesis tests in the context of linear regression
    • Understand the t value and Pr(>|t|) fields in the output of lm
    • Be able to think critically about the meaning and limitations of strict hypothesis tests

Confidence intervals and hypothesis tests

T-statistics

Suppose we’re interested in the value \(\beta_k\), the \(k\)–th entry of \(\betav\) in some regression \(\y_n \sim \betav^\trans \xv_n\). Recall that we have been finding \(\v\) such that

\[ \sqrt{N} (\betahat_k - \beta_k) \rightarrow \gauss{0, \v}. \]

For example, under homoskedastic assumptions with \(\y_n = \xv_n^\trans \betav + \res_n\), we have

\[ \begin{aligned} \v =& \sigma^2 (\Xcov^{-1})_{kk} \textrm{ where } \\ \Xcov =& \lim_{N \rightarrow \infty} \frac{1}{N} \X^\trans \X \textrm{ and } \\ \sigma^2 =& \var{\res_n}. \end{aligned} \]

Typically we don’t know \(\v\), but we have an estimate \(\hat\v\) such that \(\hat\v \rightarrow \v\) as \(N \rightarrow \infty\). Again, under homoskedastic assumptions,

\[ \begin{aligned} \hat\v =& \hat\sigma^2 \left(\left(\frac{1}{N} \X^\trans \X \right)^{-1}\right)_{kk} \textrm{ where } \\ \hat\sigma^2 =& \frac{1}{N-P} \sumn \reshat_n^2. \end{aligned} \]
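As a concrete check, here is a minimal R sketch, with simulated data and illustrative variable names, that computes \(\hat\v\) by hand and confirms that \(\sqrt{\hat\v / N}\) matches the standard error reported by lm:

```r
# Simulated data; the design, coefficients, and seed are all illustrative.
set.seed(42)
N <- 500; P <- 3
X <- cbind(1, matrix(rnorm(N * (P - 1)), N, P - 1))
beta <- c(1, 2, -0.5)
y <- drop(X %*% beta) + rnorm(N)

fit <- lm(y ~ X - 1)
k <- 2  # look at the k-th coefficient

# Homoskedastic estimates of sigma^2 and of v = sigma^2 ((X'X / N)^{-1})_{kk}
sigma2_hat <- sum(residuals(fit)^2) / (N - P)
v_hat <- sigma2_hat * solve(t(X) %*% X / N)[k, k]

# sqrt(v_hat / N) should agree with lm's reported standard error
sqrt(v_hat / N)
summary(fit)$coefficients[k, "Std. Error"]
```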

Putting all this together, the quantity

\[ \t = \frac{\sqrt{N} (\betahat_k - \beta_k)}{\sqrt{\hat\v}} = \frac{\betahat_k - \beta_k}{\sqrt{\hat\v / N}} \]

has an approximately standard normal distribution for large \(N\).

Quantities of this form are called “T–statistics,” since, under our normal assumptions, we have shown that

\[ \t \sim \studentt{N-P}, \]

exactly for all \(N\). Despite it’s name, it’s worth remembering that a T–statistic is actually not Student T distributed in general; it is asymptotically normal. Recall that for large \(N\), the Student T and standard normal distributions coincide.

Plugging in values for \(\beta_k\)

However, there’s something funny about a “T-statistic” — as written, you cannot compute it, because you don’t know \(\beta_k\). In fact, finding what values \(\beta_k\) might plausibly take is the whole point of statistical inference.

So what good is a T–statistic? Informally, one way to reason about it is as follows. Let’s take some concrete values for an example. Suppose we guess that \(\beta_k^0\) is the true value, and we compute

\[ \betahat_k = 2 \quad\textrm{and}\quad \sqrt{\hat\v / N} = 3 \quad\textrm{so}\quad \t = \frac{2 - \beta_k^0}{3}. \]

We use the superscript \(0\) to indicate that \(\beta_k^0\) is our guess, not necessarily the true value.

Suppose we plug in some particular value, such as \(\beta_k^0 = 32\). Using this value, we compute our T–statistic and find that it’s very large in magnitude: in our example, we would have \(\t = (2 - 32) / 3 = -10\). It’s very unlikely to get a standard normal (or Student T) draw this far from zero. Therefore, either:

  • We got a very (very very very very) unusual draw of our standard normal, or
  • We guessed wrong, i.e. \(\beta_k \ne \beta_k^0 = 32\).

In this way, we might consider it plausible to “reject” the hypothesis that \(\beta_k = 32\).
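To put a number on “very unlikely”: under a standard normal reference distribution, the probability of a draw at least \(10\) away from zero is astronomically small. A quick check in R:

```r
# P(|Z| >= 10) for a standard normal Z
2 * pnorm(-10)
# roughly 1.5e-23
```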

There’s a subtle problem with the preceding reasoning, however. Suppose we do the same calculation with \(\beta_k^0 = 1\). Then \(\t = (2 - 1) / 3 = 1/3\). This is a much more typical value for a standard normal distribution. However, the probability of getting exactly \(1/3\) — or, indeed, any particular value — is zero, since the normal distribution is continuous valued. (This problem is easiest to see with continuous random variables, but the same basic problem will occur when the distribution is discrete but spread over a large number of possible values.)

Rejection regions

To resolve this problem, we can specify regions that we consider implausible. That is, suppose we take a region \(R\) such that, if \(\t\) is standard normal (or Student-T), then

\[ \prob{\t \in R} \le \alpha \quad\textrm{for some small }\alpha. \]

For example, take \(\Phi^{-1}(\cdot)\) to be the inverse CDF of the distribution of \(\t\) when \(\beta_k = \beta_k^0\) (standard normal or Student T, as appropriate). Then we can take

\[ R_{ts} = \{\t: \abs{\t} \ge q \} \quad\textrm{where } q = \Phi^{-1}(1 - \alpha / 2) \]

so that \(q\) is the \(1 - \alpha/2\) quantile of the distribution of \(\t\) and \(\prob{\t \in R_{ts}} = \alpha\). But there are other choices, such as

\[ \begin{aligned} R_{u} ={}& \{\t: \t \ge q \} \quad\textrm{where } q = \Phi^{-1}(1 - \alpha) \\ R_{l} ={}& \{\t: \t \le q \} \quad\textrm{where } q = \Phi^{-1}(\alpha) \\ R_{m} ={}& \{\t: \abs{\t} \le q \} \quad\textrm{where } q = \Phi^{-1}(0.5 + \alpha / 2) \quad\textrm{(!!!)}\\ R_{\infty} ={}& \begin{cases} \emptyset & \textrm{ with independent probability } 1 - \alpha \\ (-\infty,\infty) & \textrm{ with independent probability } \alpha \\ \end{cases} \quad\textrm{(!!!)} \end{aligned} \]

The last two may seem silly, but they are still rejection regions into which \(\t\) is unlikely to fall if it has a standard normal distribution.
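Here is a quick R check, using standard normal quantiles (the Student T case is analogous), that each of these regions has probability \(\alpha\) under the reference distribution; for \(R_\infty\), the probability is \(\alpha\) by construction:

```r
# Probability that a standard normal draw lands in each rejection region.
alpha <- 0.05
p_ts <- 2 * pnorm(-qnorm(1 - alpha / 2))        # P(|t| >= q), two-sided
p_u  <- 1 - pnorm(qnorm(1 - alpha))             # P(t >= q), upper tail
p_l  <- pnorm(qnorm(alpha))                     # P(t <= q), lower tail
p_m  <- 2 * pnorm(qnorm(0.5 + alpha / 2)) - 1   # P(|t| <= q), middle region
c(two_sided = p_ts, upper = p_u, lower = p_l, middle = p_m)  # all equal alpha
```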

How can we think about \(\alpha\), and about the choice of the region? Recall that

  • If \(\t \in R\), we “reject” the proposed value of \(\beta_k^0\)
  • If \(\t \notin R\), we “fail to reject” the given value of \(\beta_k^0\).

Of course, we don’t “accept” the value of \(\beta_k^0\) in the sense of believing that \(\beta_k^0 = \beta_k\) — if nothing else, there will always be multiple values of \(\beta_k^0\) that we do not reject, and \(\beta_k\) cannot be equal to all of them.

So there are two ways to make an error:

  • Type I error: Our guess is correct and \(\beta_k = \beta_k^0\), but \(\t \in R\) and we reject
  • Type II error: Our guess is incorrect and \(\beta_k \ne \beta_k^0\), but \(\t \notin R\) and we fail to reject

By definition of the region \(R\), we have that

\[ \prob{\textrm{Type I error}} \le \alpha. \]

This is true for all the regions above, including the silly ones!
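The following simulation sketch illustrates this; the data-generating process is an illustrative assumption, and the T–statistic is formed with the true coefficient so that the null is correct:

```r
# Simulation sketch: when the guessed value equals the true beta_k, each
# region (even the silly ones) rejects with probability about alpha.
set.seed(5)
alpha <- 0.05; N <- 50; beta_true <- 0.5
rejections <- replicate(5000, {
  x <- rnorm(N)
  y <- beta_true * x + rnorm(N)
  fit <- lm(y ~ x - 1)
  se <- summary(fit)$coefficients[1, "Std. Error"]
  t_stat <- (coef(fit)[1] - beta_true) / se   # null value = true value
  c(two_sided = abs(t_stat) >= qnorm(1 - alpha / 2),
    upper     = t_stat >= qnorm(1 - alpha),
    middle    = abs(t_stat) <= qnorm(0.5 + alpha / 2),
    random    = runif(1) < alpha)             # the R_infinity region
})
rowMeans(rejections)  # all approximately alpha
```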

What about the Type II error? It must depend on the “true” value of \(\beta_k\), and on the shape of the rejection region we choose. Note that

\[ \t = \frac{\betahat_k - \beta_k^0}{\sqrt{\hat\v / N}} = \frac{\betahat_k - \beta_k}{\sqrt{\hat\v / N}} + \frac{\beta_k - \beta_k^0}{\sqrt{\hat\v / N}} \]

The first term is approximately standard normal. So if the true value \(\beta_k \gg \beta_k^0\), the second term shifts \(\t\) far to the right of a typical standard normal draw, and so on.

For example:

  • Suppose \(\beta_k \gg \beta_k^0\).
    • Then \(\t\) tends to be large and positive.
    • \(R_u\) and \(R_{ts}\) will reject, but \(R_l\) will not.
    • The Type II error of \(R_u\) will be lowest, then \(R_{ts}\), then \(R_l\).
    • \(R_l\) actually has greater Type II error than the silly regions, \(R_\infty\) and \(R_m\).
  • Suppose \(\beta_k \ll \beta_k^0\).
    • Then \(\t\) tends to be large and negative.
    • \(R_l\) and \(R_{ts}\) will reject, but \(R_u\) will not.
    • The Type II error of \(R_l\) will be lowest, then \(R_{ts}\), then \(R_u\).
    • \(R_u\) actually has greater Type II error than the silly regions, \(R_\infty\) and \(R_m\).
  • Suppose \(\beta_k = \beta_k^0 + \delta\) for some very small \(\delta\).
    • Then \(\t\) has about the same distribution as when \(\beta_k^0 = \beta_k\).
    • All the regions reject just about as often as we commit a Type I error, that is, a proportion \(\alpha\) of the time.

Thus the shape of the region determines which alternatives you are able to reject. The probability of “rejecting” under a particular alternative is called the “power” of a test; the power is one minus the Type II error rate.
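The simulation sketch below illustrates this; the setup (a single regressor, a true coefficient above the null value) is an illustrative assumption:

```r
# Simulation sketch of power: the true beta_k is larger than the null value,
# so the upper-tail and two-sided regions reject often, while the lower-tail
# region almost never does.
set.seed(2)
alpha <- 0.05; N <- 50
beta_true <- 0.5; beta_null <- 0
rejections <- replicate(5000, {
  x <- rnorm(N)
  y <- beta_true * x + rnorm(N)
  fit <- lm(y ~ x - 1)
  se <- summary(fit)$coefficients[1, "Std. Error"]
  t_stat <- (coef(fit)[1] - beta_null) / se
  c(upper     = t_stat >= qnorm(1 - alpha),
    two_sided = abs(t_stat) >= qnorm(1 - alpha / 2),
    lower     = t_stat <= qnorm(alpha))
})
rowMeans(rejections)  # power: upper > two_sided > lower
```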

The null and alternative

Statistics has some formal language to distinguish between the “guess” \(\beta_k^0\) and other values.

  • The guess \(\beta_k^0\) is called the “null hypothesis”
    • Falsely rejecting the null hypothesis is called a Type I error
    • By construction, Type I errors occur with probability at most \(\alpha\)
  • The class of potential other values of \(\beta_k\) is called the “alternative hypothesis.”
    • Falsely failing to reject the null hypothesis is called a Type II error
    • The probability of a Type II error depends on the alternative(s) and on the shape of the rejection region.

The choice of a test statistic (here, \(\t\)), together with a rejection region (here, \(R\)), constitutes a “test” of the null hypothesis. In general, one can imagine constructing many different tests, with different theoretical guarantees and power.
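This is exactly what lm reports: the t value column is the T–statistic for the null \(\beta_k^0 = 0\), and Pr(>|t|) is the two-sided p-value based on the Student T distribution with \(N - P\) degrees of freedom, i.e., the test with region \(R_{ts}\). A sketch on simulated data:

```r
# Reproducing lm's "t value" and "Pr(>|t|)" by hand for a simulated fit.
set.seed(3)
N <- 100
x <- rnorm(N)
y <- 1 + 0.3 * x + rnorm(N)
fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# T-statistic for the null beta_k^0 = 0, and its two-sided p-value
t_manual <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]
p_manual <- 2 * pt(-abs(t_manual), df = N - 2)

c(manual = t_manual, lm = coefs["x", "t value"])
c(manual = p_manual, lm = coefs["x", "Pr(>|t|)"])
```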

Confidence intervals

Often in applied statistics, a big deal is made of a single hypothesis test, particularly the test of the null hypothesis that \(\beta_k = 0\). This is often not a good idea. Typically, we do not care whether \(\beta_k\) is precisely zero; rather, we care about the set of plausible values \(\beta_k\) might take. The distinction can be expressed as the difference between statistical and practical significance:

  • Statistical significance is the size of an effect relative to sampling variability
  • Practical significance is the size of the effect in terms of its real-world consequences.

For example, suppose that \(\beta_k\) is nonzero but very small, and that \(\sqrt{\hat\v / N}\) is smaller still. We might reject the null hypothesis \(\beta_k^0 = 0\) with a high degree of certainty, and call our result statistically significant. However, a small value of \(\beta_k\) may still not be a meaningful effect size for the problem at hand, i.e., it may not be practically significant.
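The gap is easy to see in a simulation sketch; the tiny true coefficient and the large sample size below are illustrative assumptions:

```r
# Simulation sketch: a tiny (arguably practically insignificant) effect
# becomes highly statistically significant once N is large enough.
set.seed(6)
N <- 1e6
x <- rnorm(N)
y <- 0.01 * x + rnorm(N)   # true slope is only 0.01
fit <- lm(y ~ x - 1)
summary(fit)$coefficients
# Estimate is about 0.01, yet the t value is around 10 and the p-value is tiny.
```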

A remedy is confidence intervals, which are closely related to our hypothesis tests. Recall that we have been constructing intervals of the form

\[ \prob{\beta_k \in I} \ge 1-\alpha \]

where

\[ I = \left(\betahat_k \pm q \sqrt{\hat\v / N}\right), \]

where \(q = \Phi^{-1}(1 - \alpha / 2)\), and \(\Phi\) is the CDF of either the standard normal or Student T distribution. It turns out that \(I\) is precisely the set of values \(\beta_k^0\) that we would not reject with region \(R_{ts}\). And, indeed, given a confidence interval, a valid test of the hypothesis \(\beta_k = \beta_k^0\) is given by rejecting if and only if \(\beta_k^0 \notin I\).
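A sketch of this correspondence in R, on simulated data (confint for lm objects uses the same Student T quantiles):

```r
# Sketch of the test/interval duality on simulated data.
set.seed(4)
N <- 100
x <- rnorm(N)
y <- 1 + 0.3 * x + rnorm(N)
fit <- lm(y ~ x)

alpha <- 0.05
q <- qt(1 - alpha / 2, df = N - 2)
est <- coef(fit)["x"]
se <- summary(fit)$coefficients["x", "Std. Error"]
ci_manual <- c(est - q * se, est + q * se)

ci_manual
confint(fit, "x", level = 1 - alpha)  # agrees with the manual interval

# The two-sided test rejects beta_k^0 exactly when it falls outside I
beta_null <- 0
abs((est - beta_null) / se) >= q                      # reject?
beta_null < ci_manual[1] || beta_null > ci_manual[2]  # outside the interval?
```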

This duality is entirely general:

  • The set of values that a valid test does not reject is a valid confidence interval
  • Checking whether a value falls in a valid confidence interval is a valid test