---
title: "Sample means"
format:
html:
code-fold: false
code-tools: true
execute:
enabled: true
theme: [sandstone]
sidebar: true
---
::: {.content-visible when-format="html"}
{{< include /macros.tex >}}
:::
# Goals
- Understand the sample mean from a few different perspectives
- As a generic measure of centrality
- As a minimizer of squared loss
- As a quantity that converges to the population mean
- As a very special linear regression
# Reading
This lecture supplements the reading:
- @freedman:2009:statistical Chapter 1
- @hastie:2009:isl Chapter 2.1
# The sample mean of test scores
```{r}
#| echo: false
#| output: false
library(tidyverse)
library(quarto)

# Load the grades data and keep only the 151A class.
root_dir <- quarto::find_project_root()
grades_df <- read.csv(file.path(root_dir, "datasets/grades/grades.csv")) %>%
    filter(class == "151A")

# Fall 2024 scores, with the summary quantities used throughout the lecture:
# the sample mean, the plug-in standard deviation, and the normal quantile
# for a two-sided 95% interval.
scores <- grades_df %>% filter(semester == "Fall 2024") %>% pull(grade)
n_obs <- length(scores)
ybar <- mean(scores)
sigmahat <- sqrt(mean((scores - ybar)^2))
zalpha <- qnorm(1 - 0.05 / 2)
```
```{r}
#| echo: false
#| output: false
# Spring 2024 scores from the same class, for comparison.
scores_old <- grades_df %>% filter(semester == "Spring 2024") %>% pull(grade)
ybar_old <- mean(scores_old)
```
Here is a histogram of the final course grades from the Fall 2024 151A class. There
were `r n_obs` students, and the maximum score was 1. An A was in $[0.9, 1]$,
and so on.
```{r}
ggplot() + geom_histogram(aes(x=scores), bins=40)
```
Suppose I collected all the grades
and labeled them $y_1$, $y_2$, up to $y_N$, where there are $N$
students in the class. I then compute
$$
\ybar = \meann \y_n.
$$
In this case, $\ybar = `r ybar`$. Today we will talk about:
- What does this tell me?
- How can I understand what I've done mathematically?
- Conceptually?
- In what sense is there any "uncertainty" associated with this computation?
Though this may seem belaboured, it will set us up well for the corresponding
interpretations in the more complicated setting of linear regression.
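As a quick check of the notation, here is the formula computed directly and compared to R's built-in `mean()`. The scores below are simulated stand-ins (so the snippet is self-contained), not the real grades.

```{r}
# Sanity check: ybar = (1/N) sum(y_n) is exactly what mean() computes.
# The scores here are simulated stand-ins, not the real grades.
set.seed(42)
y <- runif(70, min = 0.5, max = 1)
N <- length(y)
all.equal(sum(y) / N, mean(y))
```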
# A numerical summary
One interpretation of the sample mean is simply as a measure of the "center" of the distribution.
```{r}
ggplot() +
geom_histogram(aes(x=scores), bins=40) +
geom_vline(aes(xintercept=ybar), color="red")
```
Of course, other measures of center exist, like the median, which in this
case was `r median(scores)`. In general, the mean and median are of course different:
$$
\textrm{Median} = `r median(scores)` \ne `r mean(scores)` = \textrm{Mean}.
$$
This is useful as a summary statistic, answering something
like "what was a typical score," maybe for the purpose of
assessing whether the course was too hard or too easy. For example,
we might compare the two semesters I taught 151A:
```{r}
ggplot(grades_df) +
geom_histogram(aes(x=grade), bins=40) +
facet_grid(semester ~ .)
```
These histograms contain more information than the simple difference
in means --- in Fall 2024 the mean was $`r ybar`$, and in Spring
2024 the mean was $`r ybar_old`$. This difference might be
part of an argument that the course was more difficult in Spring 2024
than it was in Fall 2024.
In this use, there is no *statistical* uncertainty. We haven't
really introduced randomness. We asked what
the sample mean of scores for *this particular class* was, and we computed
it.
- If you wanted to make the argument "Spring 2024 was harder than Fall 2024",
is this all the information you need?
- Why do you care about the question of which semester was harder? How
does that affect the kinds of evidence you find relevant?
- What kinds of uncertainty are part of this argument?
# The number that minimizes the squared "prediction error"
Suppose we don't start by trying to measure a property of some unknown distribution,
but are simply trying to make a prediction about next year, again under
the assumption that students are randomly sampled from the same distribution
and the class does not change.
Suppose we guess $\beta$ (for "guess"). We want our guess to be close
to the actual exam scores on average across students; suppose we're willing
to measure "close" by the squared error we make, so that if a student's
real exam score is $y$, and we guess $\beta$, we pay a "cost" of $(\y - \beta)^2$.
We'd like to find the $\beta^*$ that solves
$$
\beta^* := \argmin{\beta} \expect{(\y - \beta)^2},
$$
where the expectation is taken over the distribution of test scores. The problem
is that we don't know the distribution of $\y$, and so cannot take the above
expectation. So we approximate it with the sample average,
$$
\betahat := \argmin{\beta} \meann (\y_n - \beta)^2.
$$
It turns out that $\beta^* = \expect{y}$, and $\betahat = \ybar$. (To see
the first claim, expand $\expect{(\y - \beta)^2} = \var{y} + (\expect{y} - \beta)^2$,
which is minimized by $\beta = \expect{y}$.) Here,
we're only trying to make a good prediction rather than estimate a distribution
parameter, but the sample mean still turns out to be the quantity that we want.
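We can verify the second claim numerically. Here is a minimal sketch, using simulated stand-in scores and base R's one-dimensional optimizer `optimize()`:

```{r}
# Check numerically that the sample mean minimizes the average squared error.
set.seed(42)
y <- runif(70, min = 0.5, max = 1)  # simulated stand-in scores

# The average "cost" of guessing beta for every student.
squared_loss <- function(beta) mean((y - beta)^2)

# Minimize over an interval that contains the minimizer.
betahat <- optimize(squared_loss, interval = c(0, 1))$minimum
c(betahat = betahat, ybar = mean(y))  # agree up to optimizer tolerance
```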
Even though this may seem to be a more innocuous problem than estimating
an unknown population mean, we still have to make the same assumptions about
what causes variation from one year to the next. When you ask whether
this is likely to be a good predictor for this year's test scores, the same
concerns arise as when trying to estimate a population mean.
- Do you think that $`r ybar`$ is a good guess at what grade you'll get? Why
or why not?
- How does the fact that the two semesters' means were so different affect
your opinion of the quality of the guess $`r ybar`$?
- Do you think your grade this semester will be a random draw from this histogram?
In what ways yes, and in what ways no?
# An estimator of an unknown mean
We might worry that the number `r ybar` is unreasonably precise. After
all, if we had somehow given the "exact same class and exam" to a "different
set of students," then the final exam scores would have been different,
even though the thing we're ostensibly trying to measure, the "hardness"
of the exam, would be the same. In this model, we implicitly imagine that the test
scores we saw were $N$ draws from some hypothetical infinite distribution
of a random variable $y$. We would like to know $\expect{y}$, but have
only $\ybar$. It will be useful to give $\expect{y}$ a name; let's call it $\mu := \expect{y}$.
It's worth reflecting on how hard it is to make the preceding conceit precise. It's
impossible to teach the "same class" to a different set of students. Further,
even if you could, students choose
this class; they are not randomly assigned. And there is no infinite pool of
students. Still, this conceit seems to capture something intuitively
sensible --- that the precise score we saw may depend in some way on
the idiosyncrasies of these particular students, and would plausibly be
different had those idiosyncrasies been different but still "typical."
Let's spend some time talking about in what sense $\ybar$ is an "estimate"
of $\mu$. This is the kind of thing classical statistics thinks about very
carefully.
## The law of large numbers
Taking the conceit for granted, we want to know $\mu$, which is just a
number, but we observe $\ybar$, which we are imagining is a random variable (since it is
a function of $\y_1, \ldots, \y_N$, which are random). In what
sense does $\ybar$ tell us anything about $\mu$? As long
as $\var{y} < \infty$, the most
important property $\ybar$ has is "consistency":
$$
\ybar \rightarrow \mu \quad\textrm{as }N \rightarrow \infty.
$$
This follows by the **"law of large numbers," or LLN**, which will be
an important tool in discussing the properties of linear regression.
:::{.callout-note}
Note that the left hand side is a random variable, but the right hand
side is a (non--random) constant. We won't deal with this carefully
in this class, but formally we mean something like
$$
\prob{\abs{\ybar - \mu} > \varepsilon} \rightarrow 0
\quad\textrm{as }N \rightarrow \infty, \textrm{ for any }\varepsilon > 0.
$$
This is one of a class of deterministic limits that can apply to random
variables. Specifically, this is called "convergence in probability". There are in fact many different
modes of probabilistic convergence, and their study is a very interesting (but more advanced)
topic. (In this case, it happens that the LLN also applies with "almost sure" convergence.)
:::
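Here is a small illustration of consistency. The Beta(8, 2) distribution below is just an assumed stand-in for the unknown score distribution; its true mean is exactly 0.8.

```{r}
# LLN in action: the running sample mean approaches the true mean (0.8).
set.seed(42)
y_sim <- rbeta(10000, shape1 = 8, shape2 = 2)  # assumed stand-in distribution
running_mean <- cumsum(y_sim) / seq_along(y_sim)

# The running mean at increasing sample sizes:
running_mean[c(10, 100, 1000, 10000)]
```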
:::{.callout-note}
Of course, there are many estimators besides $\ybar$ which are also consistent. Here
are a few:
$$
\begin{aligned}
\frac{1}{N - 1} \sumn \y_n
\quad\quad\quad
\ybar + \frac{1}{N}
\quad\quad\quad
\exp(1/N) \ybar
\quad\quad\quad
\frac{1}{\lfloor N/2 \rfloor} \sum_{n=1}^{\lfloor N / 2 \rfloor} \y_n,
\end{aligned}
$$
and so on. Why you would choose one over another is a major topic in
statistics which we will touch on only lightly in this course.
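A quick simulation, again with an assumed Beta(8, 2) stand-in distribution (true mean 0.8), shows that all of these land near the truth at large $N$, even though they differ at any fixed $N$:

```{r}
# Each estimator above is consistent for the true mean (0.8 here), though at
# a fixed N some are biased or noisier than ybar. Simulated stand-in data.
set.seed(42)
y_sim <- rbeta(10000, shape1 = 8, shape2 = 2)
N <- length(y_sim)
c(
  ybar        = mean(y_sim),
  over_nm1    = sum(y_sim) / (N - 1),
  shifted     = mean(y_sim) + 1 / N,
  scaled      = exp(1 / N) * mean(y_sim),
  half_sample = mean(y_sim[1:floor(N / 2)])
)
```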
:::
## The central limit theorem
How close is $\ybar$ to $\mu$ for any particular $N$? It's impossible
to know precisely, since we don't actually know the distribution
of $\y$ --- we don't even know its mean. But for large $N$, we can
take advantage of another asymptotic result, the **central limit
theorem, or CLT**. Suppose that we know $\sigma := \sqrt{\var{\y}}$. Then
$$
\frac{1}{\sqrt{N}} \sumn \frac{\y_n - \mu}{\sigma} \rightarrow \gauss{0, 1}
\quad\textrm{as }N \rightarrow \infty.
$$
The CLT will also be a key tool in studying the properties of linear regression.
:::{.callout-note}
Note that the left hand side is a random variable and the right hand
side is also a random variable. Here, we mean
$$
\abs{
\prob{\frac{1}{\sqrt{N}} \sumn \frac{\y_n - \mu}{\sigma} \le z} -
\prob{\gauss{0, 1} \le z}
} \rightarrow 0
\quad\textrm{as }N \rightarrow \infty, \textrm{ for any }z.
$$
This is called "convergence in distribution." Note that it's the
same as saying the distribution function of the left hand side
converges pointwise to the distribution function of the right hand side. Again, we won't be too
concerned with modes of probabilistic convergence in this class.
:::
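Here is a sketch of the CLT at work, again with the assumed Beta(8, 2) stand-in distribution, for which we can compute $\mu$ and $\sigma$ exactly. We draw many hypothetical classes and compare a tail probability of the standardized sum to its normal limit:

```{r}
# CLT illustration: the standardized sum is approximately N(0, 1).
# For Beta(a, b): mean = a / (a + b), variance = ab / ((a + b)^2 (a + b + 1)).
set.seed(42)
mu_true <- 8 / 10
sigma_true <- sqrt(8 * 2 / (10^2 * 11))
N <- 70

# Standardized sum for each of 5000 simulated classes of size N.
z_draws <- replicate(
    5000, sum((rbeta(N, 8, 2) - mu_true) / sigma_true) / sqrt(N))

# Compare, e.g., P(Z <= 1) in the simulation and under N(0, 1).
c(simulated = mean(z_draws <= 1), normal = pnorm(1))
```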
## Confidence intervals
Using the CLT, we can say things like the following. Suppose that we choose
$\z_\alpha$ so that
$\prob{-\z_\alpha \le \gauss{0, 1} \le \z_\alpha} = 0.95$. Then by the CLT,
$$
\begin{aligned}
0.95 ={}& \prob{-\z_\alpha \le \gauss{0, 1} \le \z_\alpha}\\
\approx{}& \prob{-\z_\alpha \le \frac{1}{\sqrt{N}} \sumn \frac{\y_n - \mu}{\sigma} \le \z_\alpha} & \textrm{(by the CLT applied twice)}\\
=& \prob{-\sigma \z_\alpha \le \frac{1}{\sqrt{N}} \sumn (\y_n - \mu) \le \sigma \z_\alpha} & \textrm{(algebra)}\\
=& \prob{
- \frac{\sigma}{\sqrt{N}} \z_\alpha \le
\frac{1}{N} \sumn (\y_n - \mu) \le
\frac{\sigma}{\sqrt{N}} \z_\alpha
} & \textrm{(algebra)}\\
=& \prob{ - \frac{\sigma}{\sqrt{N}} \z_\alpha \le \ybar - \mu \le \frac{\sigma}{\sqrt{N}} \z_\alpha} & \textrm{(algebra)}\\
=& \prob{ \ybar - \frac{\sigma}{\sqrt{N}} \z_\alpha \le \mu \le \ybar + \frac{\sigma}{\sqrt{N}} \z_\alpha}. & \textrm{(algebra)}\\
\end{aligned}
$$
This means that, *whatever $\mu$ is*, with 95% probability (under IID random sampling)
it lies in the interval $\ybar \pm \frac{\sigma}{\sqrt{N}} \z_\alpha$. That is, we
have constructed a **95% two--sided confidence interval** for the unknown $\mu$,
which accounts for the random variability in $\ybar$ as a measure of $\mu$.
```{r}
#| echo: false
#| output: false
# Endpoints of the normal-approximation 95% confidence interval,
# rounded for display.
mu_lower <- round(ybar - zalpha * sigmahat / sqrt(length(scores)), 2)
mu_upper <- round(ybar + zalpha * sigmahat / sqrt(length(scores)), 2)
```
In this case, estimating $\sigma$ by the square root of the sample variance estimator $\meann (\y_n - \ybar)^2$,
the confidence interval is $[`r mu_lower`, `r mu_upper`]$.
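One way to see what the 95% statement does mean is a coverage simulation: drawing many datasets from a known distribution (here the assumed Beta(8, 2) stand-in) and building an interval from each, roughly 95% of the intervals contain the true mean.

```{r}
# Coverage check: about 95% of intervals ybar +/- z * sigmahat / sqrt(N)
# contain the true mean under repeated IID sampling. Simulated stand-in data.
set.seed(42)
mu_true <- 8 / 10
N <- 70
z <- qnorm(1 - 0.05 / 2)

covers <- replicate(5000, {
    y_sim <- rbeta(N, 8, 2)
    ybar_sim <- mean(y_sim)
    sigmahat_sim <- sqrt(mean((y_sim - ybar_sim)^2))
    half_width <- z * sigmahat_sim / sqrt(N)
    (ybar_sim - half_width <= mu_true) & (mu_true <= ybar_sim + half_width)
})
mean(covers)  # close to 0.95
```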
It's worth reflecting on what this does and does not mean. For example, does this mean that, if I don't change
the exam or syllabus this year, we are extremely unlikely to see an average grade above $`r mu_upper`$? Certainly
not; this is an underestimate of any reasonable notion of subjective uncertainty in this year's grades. But one
might think of it as a rough lower bound on the subjective uncertainty, since our model holds many things
constant that cannot plausibly be held constant in reality. For instance, in the Spring 2024 semester I also taught 151A. In that
semester, the mean grade was $`r ybar_old`$, which is well outside our confidence
interval $[`r mu_lower`, `r mu_upper`]$.
Here are some questions we will answer in this class:
- Does the fact that our estimate of $\sigma$ is uncertain affect the accuracy of our interval? How can we correct it?
- Would the bootstrap always give the same answer as this kind of computation?
- Is the interval $[`r mu_lower`, `r mu_upper`]$ a good guess for your particular grade?
# A very special linear regression
What is the connection with linear regression? Recall that linear regression is the
problem of finding $\betahat$ that solves
$$
\betahat := \argmin{\beta} \sumn (\y_n - \beta \x_n)^2,
$$
for some "regressors" $\x_n$. It is interpreted as finding the
"best" fit (in a squared error sense) through a cloud of points. If we take
$\x_n = 1$, we can see that the sample mean is in fact a linear regression
problem, with $\x_n$ taken to be identically one!
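You can check this directly: a minimal sketch with simulated stand-in scores, showing that the intercept-only regression coefficient from `lm()` is exactly the sample mean.

```{r}
# Regressing on x_n = 1 (an intercept only) recovers the sample mean.
set.seed(42)
y <- runif(70, min = 0.5, max = 1)  # simulated stand-in scores
fit <- lm(y ~ 1)
c(coef = unname(coef(fit)), ybar = mean(y))
```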
As a consequence, much of the intuition and many of the techniques we've developed
here will apply to the much more general linear regression setting,
with some additional formal complexity but essentially the same set of core ideas.
# Bibliography