$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. 
\frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} 
\def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Confidence Intervals and Hypothesis Testing

\(\,\)

Goals

  • Understand the relationship between confidence intervals and hypothesis tests for the sample mean
    • Deriving hypothesis tests using a pivot
    • Deriving confidence intervals from a family of hypothesis tests
  • Be able to think critically about the meaning and limitations of strict hypothesis tests

Confidence intervals and hypothesis tests for sample means

Let us return again to our sample means problem from the very beginning of class and derive hypothesis tests and confidence intervals. Since sample means are a special case of linear regression, the intuition and techniques will extend readily to the more general case.

Here is a histogram of the final exam scores from last year’s 151A class. There were 50 students, and the maximum score was 40.

ggplot() + geom_histogram(aes(x=scores), bins=40)

Recall that we discussed separating the “innate difficulty of the class” from the “particularities of that particular class” by modeling \(\y_n\) as IID from some population with \(\expect{\y_n} =: \mu\), which we approximate with \(\ybar = \meann \y_n = 29.36\).

We will ask questions like:

  • Is it plausible that \(\mu\) was as high as \(35\)?
  • What range of values might \(\mu\) plausibly take?

The first is like a “hypothesis test” and the second is a “confidence interval.” We will see that a confidence interval is precisely the set of values that a hypothesis test does not reject, and, conversely, that checking whether a hypothesized value lies in a confidence interval is itself a valid hypothesis test.

Deriving a sampling distribution.

By the CLT, we know that, whatever \(\mu\) is, for large \(N\), we have approximately \[ \sqrt{N} (\ybar - \mu) = \frac{1}{\sqrt{N}} \sumn (\y_n - \mu) \sim \gauss{0, \sigma^2}, \] where \(\sigma^2 = \var{\y_n}\). In our example, we measured \(\sigma \approx 9.87\) using the sample standard deviation. Today, for simplicity, we will simply assume that this is the true value of \(\sigma\) — i.e., that we know \(\sigma\) and can just use it in our calculations. Later we will deal with the more complicated question of accounting for the uncertainty in \(\sigma\). Similarly, for simplicity, we will uncritically assume that the CLT holds exactly.
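A quick simulation illustrates the claim. This sketch is Python (the notes’ own code is R), and it uses a normal stand-in population purely for illustration; by the CLT, any population with the same mean and standard deviation would give the same standardized behavior for large \(N\):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50           # class size from the example
mu = 29.36       # a hypothetical "true" mean, for simulation only
sigma = 9.87     # treated as known, as in the text

# Draw 100,000 hypothetical classes and standardize each sample mean.
scores = rng.normal(mu, sigma, size=(100_000, N))
z = (scores.mean(axis=1) - mu) / (sigma / np.sqrt(N))

# If the CLT approximation is good, z should look standard normal.
print(z.mean(), z.std())   # near 0 and 1, respectively
```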

Using \(\sigma\) and properties of the normal distribution, we can define the random variable \[ \z := \frac{\ybar - \mu}{\sigma / \sqrt{N}} = \frac{29.36 - \mu}{9.87 / \sqrt{50} } = \frac{29.36 - \mu}{1.4} \sim \gauss{0,1}. \] This is called a “z-statistic,” a generic name for a statistic formed by subtracting a random observation’s true mean and dividing by its true standard deviation.

Since \(\z\) has a standard normal distribution, it seems to characterize the uncertainty in \(\ybar\) as an estimate of \(\mu\). However, there’s something funny about \(\z\) — as written, you cannot compute it, because you don’t know \(\mu\). In fact, finding what values \(\mu\) might plausibly take is the whole point of statistical inference.

So what good is a z–statistic? Informally, one way to reason about it is as follows. Let’s take some concrete values for an example. Suppose we guess that \(\mu^0 = 35\) is the true value, and compute

\[ \z = \frac{\ybar - \mu^0}{\sigma / \sqrt{N} } = \frac{29.36 - 35}{1.4} = -4.03 \]

We use the superscript \(0\) to indicate that \(\mu^0\) is our guess, not necessarily the true value.
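The same computation in code (a Python sketch; the exact value differs slightly from \(-4.03\) because the text rounds \(\sigma/\sqrt{N}\) to \(1.4\)):

```python
from statistics import NormalDist

# Values from the example; sigma treated as known.
ybar, sigma, N = 29.36, 9.87, 50
mu0 = 35                                  # the guessed (null) value
se = sigma / N ** 0.5                     # about 1.4
z = (ybar - mu0) / se                     # about -4.0
p = 2 * NormalDist().cdf(-abs(z))         # two-sided tail probability, tiny
print(z, p)
```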

A value as extreme as \(-4.03\) is quite unusual under the standard normal distribution. Therefore, either

  • We got a very unusual draw from our standard normal, or
  • We guessed wrong, i.e. \(\mu \ne \mu^0 = 35\).

In this way, we might consider it plausible to “reject” the hypothesis that \(\mu = 35\), since otherwise we must accept that we got a very unusual standard normal draw.

There’s a subtle problem with the preceding reasoning, however. Suppose we do the same calculation with \(\mu^0 = 30\). Then \[ \z = \frac{29.36 - 30}{1.4} = -0.46. \] This is a much more typical value for a standard normal distribution. However, the probability of getting exactly -0.46 — or, indeed, any particular value — is zero, since the normal distribution is continuous valued. (This problem is easiest to see with continuous random variables, but the same basic problem will occur when the distribution is discrete but spread over a large number of possible values.)

Rejection regions

To resolve this problem, we can specify regions that we consider implausible. That is, suppose we take a region \(R\) such that, if \(\z\) is standard normal, then

\[ \prob{\z \in R} \le \alpha \quad\textrm{for some small }\alpha. \]

For example, let \(\Phi^{-1}(\cdot)\) denote the inverse CDF of the standard normal. Then we can take

\[ R_{ts} = \{\z: \abs{z} \ge q \} \quad\textrm{where } q = \Phi^{-1}(1 - \alpha / 2). \] Here, the “ts” stands for “two–sided.” If \(\z \sim \gauss{0,1}\), then

\[ \begin{aligned} \prob{\z \in R_{ts}} ={}& \prob{\abs{\z} \ge q} \\={}& \prob{\z \ge q \textrm{ or } \z \le -q} \\={}& \prob{\z \ge q} + \prob{\z \le -q} & \textrm{(the regions are disjoint)} \\={}& 2 \prob{\z \ge q} & \textrm{(the standard normal is symmetric)} \\={}& 2 ( 1- \prob{\z < q}) \\={}& 2 ( 1- \Phi(q)) & \textrm{(definition of $\Phi$)} \\={}& 2 ( 1- \Phi(\Phi^{-1}(1 - \alpha / 2))) & \textrm{(definition of $q$)} \\={}& 2 ( 1- (1 - \alpha / 2)) \\={}& \alpha. \end{aligned} \]
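This calculation is easy to verify numerically (a Python sketch using the standard library’s normal distribution):

```python
from statistics import NormalDist

nd = NormalDist()                   # standard normal
alpha = 0.05
q = nd.inv_cdf(1 - alpha / 2)       # two-sided critical value, about 1.96
prob = 2 * (1 - nd.cdf(q))          # P(|z| >= q), per the derivation above
print(q, prob)                      # prob recovers alpha
```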

Putting our reasoning together, we might argue that

  • If \(\mu^0 = \mu\) (that is, if our hypothesis is correct), then \(\z \sim \gauss{0,1}\).
  • If \(\z \sim \gauss{0,1}\), then \(\prob{\z \in R_{ts}} \le \alpha\).
  • Therefore, if \(\z \in R_{ts}\), we either got an unusual draw of a standard normal or \(\mu^0 \ne \mu\).

Although superficially compelling, this reasoning is not watertight, since — as we will now see – there are many regions \(R\) such that \(\prob{\z \in R} = \alpha\) when \(\mu^0 = \mu\), some of which obviously tell us nothing about the true value of \(\mu\).

Other rejection regions, and type I and type II error

We can choose other rejection regions. You might be familiar with \[ \begin{aligned} R_{u} ={}& \{\z: \z \ge q \} \quad\textrm{where } q = \Phi^{-1}(1 - \alpha) \\ R_{l} ={}& \{\z: \z \le q \} \quad\textrm{where } q = \Phi^{-1}(\alpha). \end{aligned} \]

The “u” stands for “upper” and the “l” for “lower.”

Furthermore, there are silly options, such as

\[ \begin{aligned} R_{m} ={}& \{\z: \abs{\z} \le q \} \quad\textrm{where } q = \Phi^{-1}(0.5 + \alpha / 2) \quad\textrm{(!!!)}\\ R_{\infty} ={}& \begin{cases} (-\infty,\infty) & \textrm{ with independent probability } \alpha \\ \emptyset & \textrm{ with independent probability } 1 - \alpha \\ \end{cases} \quad\textrm{(!!!)} \end{aligned} \]

The last two may seem absurd, but they are still rejection regions into which \(\z\) is unlikely to fall if it has a standard normal distribution.
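Even the silly regions have Type I error at most \(\alpha\), which a small Monte Carlo check confirms (Python sketch; the trial count is arbitrary):

```python
import random
from statistics import NormalDist

alpha = 0.05
q_m = NormalDist().inv_cdf(0.5 + alpha / 2)   # critical value for R_m

random.seed(0)
trials = 100_000
# R_m rejects when |z| <= q_m; R_inf rejects with probability alpha,
# ignoring z entirely.
reject_m = sum(abs(random.gauss(0, 1)) <= q_m for _ in range(trials)) / trials
reject_inf = sum(random.random() < alpha for _ in range(trials)) / trials
print(reject_m, reject_inf)                   # both near alpha = 0.05
```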

Given this, how can we think about \(\alpha\), and about the choice of the region? Recall that

  • If \(\z \in R\), we “reject” the proposed value of \(\mu^0\)
  • If \(\z \notin R\), we “fail to reject” the given value of \(\mu^0\).

Of course, we don’t “accept” the value of \(\mu^0\) in the sense of believing that \(\mu^0 = \mu\) — if nothing else, there will always be multiple values of \(\mu^0\) that we do not reject, and \(\mu\) cannot be equal to all of them.

So there are two ways to make an error:

  • Type I error: Our guess is correct and \(\mu = \mu^0\), but \(\z \in R\) and we reject
  • Type II error: Our guess is incorrect and \(\mu \ne \mu^0\), but \(\z \notin R\) and we fail to reject

By definition of the region \(R\), we have that

\[ \prob{\textrm{Type I error}} \le \alpha. \]

This is true for all the regions above, including the silly ones!

What about the Type II error? It must depend on the “true” value of \(\mu\), and on the shape of the rejection region we choose. Note that

\[ \z = \frac{\ybar - \mu^0}{\sigma / \sqrt{N}} = \frac{\ybar - \mu}{\sigma / \sqrt{N}} + \frac{\mu - \mu^0}{\sigma / \sqrt{N}} \sim \gauss{0, 1} + \frac{\mu - \mu^0}{\sigma / \sqrt{N}}. \]

So if the true value \(\mu \gg \mu^0\), then our \(\z\) statistic is too large, and so on.

For example:

  • Suppose \(\mu \gg \mu^0\).
    • Then \(\z\) tends to be large and positive.
    • \(R_u\) and \(R_{ts}\) will reject, but \(R_l\) will not.
    • The Type II error of \(R_u\) will be lowest, then \(R_{ts}\), then \(R_l\).
    • \(R_l\) actually has greater Type II error than the silly regions, \(R_\infty\) and \(R_m\).
  • Suppose \(\mu \ll \mu^0\).
    • Then \(\z\) tends to be large and negative.
    • \(R_l\) and \(R_{ts}\) will reject, but \(R_u\) will not.
    • The Type II error of \(R_l\) will be lowest, then \(R_{ts}\), then \(R_u\).
    • \(R_u\) actually has greater Type II error than the silly regions, \(R_\infty\) and \(R_m\).
  • Suppose \(\mu = \mu^0 + \delta\) for some very small \(\delta\).
    • Then \(\z\) has about the same distribution as when \(\mu^0 = \mu\).
    • All the regions reject just about as often as we commit a Type I error, that is, a proportion \(\alpha\) of the time.

Thus the shape of the region determines which alternatives you are able to reject. The probability of “rejecting” under a particular alternative is called the “power” of a test; the power is one minus the Type II error rate.
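The ordering in these bullets can be checked directly. The sketch below (Python) computes each region’s power analytically under an assumed alternative \(\mu = \mu^0 + 3\); the 3-point shift is an arbitrary choice for illustration:

```python
from statistics import NormalDist

nd = NormalDist()
alpha = 0.05
se = 9.87 / 50 ** 0.5                  # sigma / sqrt(N) from the example
shift = 3 / se                         # assumed alternative: mu = mu0 + 3

# Under this alternative, z ~ Normal(shift, 1), so each region's power is
# the probability that such a draw lands in the region.
q_u = nd.inv_cdf(1 - alpha)
q_ts = nd.inv_cdf(1 - alpha / 2)
power_u = 1 - nd.cdf(q_u - shift)                              # upper region
power_ts = (1 - nd.cdf(q_ts - shift)) + nd.cdf(-q_ts - shift)  # two-sided
power_l = nd.cdf(-q_u - shift)                                 # lower region
print(power_u, power_ts, power_l)      # decreasing; power_l is below alpha
```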

The null and alternative

Statistics has some formal language to distinguish between the “guess” \(\mu^0\) and other values.

  • The guess \(\mu^0\) is called the “null hypothesis”
    • Falsely rejecting the null hypothesis is called a Type I error
    • By construction, a Type I error occurs with probability at most \(\alpha\)
  • The class of potential other values of \(\mu\) is called the “alternative hypothesis.”
    • Falsely failing to reject the null hypothesis is called a Type II error
    • The probability of a Type II error depends on the alternative(s) and on the shape of the rejection region.

The choice of a test statistic (here, \(\z\)), together with a rejection region (here, \(R\)), constitutes a “test” of the null hypothesis. In general, one can imagine constructing many different tests, with different theoretical guarantees and power.

Confidence intervals

Often in applied statistics, a big deal is made of a single hypothesis test, particularly of the null hypothesis \(\mu^0 = 0\). Often this is not a good idea. For example, in our test-score case, we can very easily reject \(\mu^0 = 0\), since

\[ \z = \frac{29.36 - 0}{1.4} = 20.97 \]

is extremely unlikely under a standard normal. However, this only tells us that the test scores did not have mean zero. We in fact know this with certainty — the test scores must be non–negative, so the only way they could have mean zero is if they were all identically zero, which we know to be false, since we observed at least one non–zero test score. So we can reject \(\mu^0 = 0\) with absolute certainty. (Note that the CLT approximation breaks down if \(\mu\) is actually close to zero.)

In fact, in many cases, we do not care whether \(\mu\) is precisely zero; rather, we care about the set of plausible values \(\mu\) might take. The distinction can be expressed as the difference between statistical and practical significance:

  • Statistical significance is the size of an effect relative to sampling variability and some reference value of interest (or some mindless default, like \(0\))
  • Practical significance is the size of the effect in terms of its effect on reality.

For example, suppose that \(\mu\) is nonzero but very small, and the standard error \(\sigma / \sqrt{N}\) is smaller still. We might then reject the null hypothesis \(\mu^0 = 0\) with a high degree of certainty, and call our result statistically significant. However, a small value of \(\mu\) may still not be a meaningful effect size for the problem at hand, i.e., it may not be practically significant.

A remedy is the confidence interval, which is closely related to our hypothesis tests. Let \(R^c\) denote the complement of \(R\), that is, all values not in \(R\).

Assuming that \(\z \sim \gauss{0,1}\), recall that we have been constructing regions \(R\) of the form

\[ \begin{aligned} 1 - \alpha \le{}& 1 - \prob{\z \in R} \\ ={}& \prob{\z \in R^c} \\ ={}& \prob{\frac{\ybar - \mu}{\sigma / \sqrt{N}} \in R^c}. \end{aligned} \]

We can solve this expression to get a region that \(\mu\) lies in with high probability. For example, with \(R_{ts}\),

\[ \begin{aligned} 1 - \alpha \le{}& \prob{-q \le \frac{\ybar - \mu}{\sigma / \sqrt{N}} \le q} \\={}& \prob{\frac{- q \sigma}{\sqrt{N}} \le \ybar - \mu \le \frac{q \sigma}{\sqrt{N}}} \\={}& \prob{\ybar - \frac{q \sigma}{\sqrt{N}} \le \mu \le \ybar + \frac{q \sigma}{\sqrt{N}}}. \end{aligned} \]

Taking the region \(I = \ybar \pm \frac{q \sigma}{\sqrt{N}}\), it follows that \(\prob{\mu \in I} \ge 1 - \alpha\). Here, \(I\) is precisely the set of values that we would not reject with region \(R_{ts}\). Note that the interval \(I\) is random (it depends on \(\ybar\)), so from one realization to the next, we expect \(I\) to change. However, the true \(\mu\) will lie in it at least \(1- \alpha\) of the time.
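With the numbers from the running example and \(\alpha = 0.05\), the interval works out as follows (Python sketch):

```python
from statistics import NormalDist

ybar, sigma, N, alpha = 29.36, 9.87, 50, 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
half = q * sigma / N ** 0.5
lo, hi = ybar - half, ybar + half
print(lo, hi)       # roughly 26.6 to 32.1
# Consistent with the tests above: 35 falls outside the interval (rejected),
# while 30 falls inside (not rejected).
```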

This duality is entirely general:

  • The set of values that a valid test does not reject is a valid confidence interval
  • Checking whether a value falls in a valid confidence interval is a valid test

Stepping back

What have we done?

  1. Starting from our measurement \(\ybar\), we constructed a statistic, \(\z\), which required \(\mu\) to compute, but whose distribution didn’t depend on \(\mu\) (and which also happened to be easy to compute). Such a statistic is called a “pivot.”
  2. We then took a rejection region of low probability under the pivot’s distribution. If the statistic falls into this unlikely region for some candidate value of \(\mu^0\), we “reject” the hypothesis that \(\mu = \mu^0\).
  3. The set of all values that we wouldn’t reject is a “confidence interval,” a random interval that contains \(\mu\) with high probability.
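These steps can be collected into one small routine. The sketch below is Python, and the function name `z_test` is illustrative, not from the notes:

```python
from statistics import NormalDist

def z_test(ybar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of mu = mu0 with known sigma, plus the dual CI."""
    nd = NormalDist()
    z = (ybar - mu0) / (sigma / n ** 0.5)   # step 1: the pivot, evaluated at mu0
    q = nd.inv_cdf(1 - alpha / 2)
    reject = abs(z) >= q                    # step 2: two-sided rejection region
    half = q * sigma / n ** 0.5
    ci = (ybar - half, ybar + half)         # step 3: the values not rejected
    return z, reject, ci

z, reject, ci = z_test(29.36, 35, 9.87, 50)
print(z, reject, ci)    # rejects mu0 = 35, which lies outside the interval
```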

Over and over for the next few weeks, we’re going to follow these steps, but with linear regressions rather than simple sample means. We will get more complicated statistics, and distributions other than the normal. But all of the reasoning — and in particular the concern about the power of the test — will apply equally.