Sample means

Goals

  • Understand the sample mean from a few different perspectives
    • As a generic measure of centrality
    • As a minimizer of squared loss
    • As a quantity that converges to the population mean
    • As a very special linear regression

Reading

This lecture supplements the reading

  • Freedman (2009) Chapter 1
  • Hastie, Tibshirani, and Friedman (2009) Chapter 2.1

The sample mean of test scores

Here is a histogram of the final course grades from the Fall 2024 151A class. There were 70 students, and the maximum score was 1. An A was in \([0.9, 1]\), and so on.

library(ggplot2)  # scores is the vector of the 70 final grades
ggplot() + geom_histogram(aes(x=scores), bins=40)

Suppose I collected all the grades and labeled them \(y_1\), \(y_2\), up to \(y_N\), where there are \(N\) students in the class. I then compute

\[ \bar{y}= \frac{1}{N} \sum_{n=1}^Ny_n. \]
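In R, this is a one-liner; a minimal sketch, assuming scores is the vector of grades plotted above:

ybar <- mean(scores)  # the sample mean of the N = 70 grades
ybar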

In this case, \(\bar{y}= 0.8942795\). Today we will talk about:

  • What does this tell me?
  • How can I understand what I’ve done mathematically?
  • Conceptually?
  • In what sense is there any “uncertainty” associated with this computation?

Though this may seem belaboured, it will set us up well for the corresponding interpretations in the more complicated setting of linear regression.

A numerical summary

One interpretation of the sample mean is simply as a measure of the “center” of the distribution.

ggplot() +
  geom_histogram(aes(x=scores), bins=40) +
  geom_vline(aes(xintercept=ybar), color="red")

Of course, other measures of center exist, like the median, which in this case was 0.9303375. In general, the mean and median differ:

\[ \textrm{Median} = 0.9303375 \ne 0.8942795 = \textrm{Mean}. \]
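Both summaries are built into R; again assuming the scores vector from above:

mean(scores)    # 0.8942795
median(scores)  # 0.9303375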

The mean is useful as a summary statistic, answering something like “what was a typical score,” maybe for the purpose of assessing whether the course was too hard or too easy. For example, we might compare the two semesters I taught 151A:

# grades_df stacks the grades from both semesters, with columns grade and semester
ggplot(grades_df) +
  geom_histogram(aes(x=grade), bins=40) +
  facet_grid(semester ~ .)

These histograms contain more information than the simple difference in means — in Fall 2024 the mean was \(0.8942795\), and in Spring 2024 the mean was \(0.8313854\). This difference might be part of an argument that the course was more difficult in Spring 2024 than it was in Fall 2024.

In this use, there is no statistical uncertainty, since we haven’t really introduced randomness. We asked what the sample mean of scores was for this particular class, and we computed it.

  • If you wanted to make the argument “Spring 2024 was harder than Fall 2024”, is this all the information you need?
  • Why do you care about the question of which semester was harder? How does that affect the kinds of evidence you find relevant?
  • What kinds of uncertainty are part of this argument?

The number that minimizes the squared “prediction error”

Suppose we don’t start by trying to measure a property of some unknown distribution, but instead simply try to make a prediction about next year, again under the assumption that students are randomly sampled from the same distribution and the class does not change.

Suppose we make a guess \(\beta\). We want our guess to be close to the actual exam scores on average across students; suppose we’re willing to measure “close” by the squared error we make, so that if a student’s real exam score is \(y\) and we guess \(\beta\), we pay a “cost” of \((y - \beta)^2\). We’d like to choose \(\beta\) to solve

\[ \beta^* := \underset{\beta}{\mathrm{argmin}}\, \mathbb{E}\left[(y- \beta)^2\right], \]

where the expectation is taken over the distribution of test scores. The problem is that we don’t know the distribution of \(y\), and so cannot take the above expectation. So we approximate it with the sample average,

\[ \hat{\beta}:= \underset{\beta}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N(y_n - \beta)^2. \]

It turns out that \(\beta^* = \mathbb{E}\left[y\right]\), and \(\hat{\beta}= \bar{y}\). Here, we’re only trying to make a good prediction rather than estimate a distribution parameter, but the sample mean still turns out to be the quantity that we want.
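To see why, note that the sample objective is a smooth convex function of \(\beta\), so we can set its derivative to zero:

\[ \frac{\partial}{\partial \beta} \frac{1}{N} \sum_{n=1}^N (y_n - \beta)^2 = -\frac{2}{N} \sum_{n=1}^N (y_n - \beta) = 0 \quad\Leftrightarrow\quad \beta = \frac{1}{N} \sum_{n=1}^N y_n = \bar{y}, \]

and applying the same argument to \(\mathbb{E}\left[(y - \beta)^2\right]\), differentiating under the expectation, gives \(\beta^* = \mathbb{E}\left[y\right]\).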

Even though this may seem to be a more innocuous problem than estimating an unknown population mean, we still have to make the same assumptions about what causes variation from one year to the next. When you ask whether this is likely to be a good predictor for this year’s test scores, the same set of concerns arises as when trying to estimate a population mean.

  • Do you think that \(0.8942795\) is a good guess at what grade you’ll get? Why or why not?
  • How does the fact that the two semesters’ means were so different affect your opinion of the quality of the guess \(0.8942795\)?
  • Do you think your grade this semester will be a random draw from this histogram? In what ways yes, and in what ways no?

An estimator of an unknown mean

We might suspect that the number 0.8942795 is unreasonably precise. After all, if we had somehow given the “exact same class and exam” to a “different set of students,” then the final exam scores would have been different, even though the thing we’re ostensibly trying to measure, the “hardness” of the exam, would be the same. In this model, we implicitly imagine that the test scores we saw were \(N\) draws from some hypothetical infinite distribution of a random variable \(y\). We would like to know \(\mathbb{E}\left[y\right]\), but have only \(\bar{y}\). It will be useful to give \(\mathbb{E}\left[y\right]\) a name; let’s call it \(\mu := \mathbb{E}\left[y\right]\).

It’s worth reflecting on how hard it is to make the preceding conceit precise. It’s impossible to teach the “same class” to a different set of students. Further, even if you could, students choose this class; they are not randomly assigned. And there is no infinite pool of students. Still, this conceit seems to capture something intuitively sensible — that the precise scores we saw may depend in some way on the idiosyncrasies of these particular students, and would plausibly have been different had those idiosyncrasies been different but still “typical.”

Let’s spend some time talking about in what sense \(\bar{y}\) is an “estimate” of \(\mu\). This is the kind of thing classical statistics thinks about very carefully.

The law of large numbers

Taking the conceit for granted, we want to know \(\mu\), which is just a number, but we observe \(\bar{y}\), which we are imagining is a random variable (since it is a function of \(y_1, \ldots, y_N\), which are random). In what sense does \(\bar{y}\) tell us anything about \(\mu\)? As long as \(\mathrm{Var}\left(y\right) < \infty\), the most important property \(\bar{y}\) has is “consistency”:

\[ \bar{y}\rightarrow \mu \quad\textrm{as }N \rightarrow \infty. \]

This follows from the “law of large numbers,” or LLN, which will be an important tool in discussing the properties of linear regression.
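A quick simulation illustrates the LLN. This is just a sketch, not the class data: I use a Beta(9, 1) distribution as an arbitrary stand-in for a score distribution on \([0, 1]\), whose true mean is \(0.9\):

set.seed(42)
N_max <- 5000
y_sim <- rbeta(N_max, shape1 = 9, shape2 = 1)    # stand-in scores in [0, 1]
running_mean <- cumsum(y_sim) / seq_len(N_max)   # ybar as a function of N

# The running sample mean settles down near the true mean 9 / (9 + 1) = 0.9.
ggplot() +
  geom_line(aes(x = seq_len(N_max), y = running_mean)) +
  geom_hline(aes(yintercept = 0.9), color = "red")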

Note

Note that the left hand side is a random variable, but the right hand side is a (non–random) constant. We won’t deal with this carefully in this class, but formally we mean something like

\[ \mathbb{P}\left(\left|\bar{y}- \mu\right| > \varepsilon\right) \rightarrow 0 \quad\textrm{as }N \rightarrow \infty, \textrm{ for any }\varepsilon > 0. \]

This is one of a class of deterministic limits that can apply to random variables. Specifically, this is called “convergence in probability.” There are in fact many different modes of probabilistic convergence, and their study is a very interesting (but more advanced) topic. (In this case, it happens that the LLN also applies with “almost sure” convergence.)

Note

Of course, there are many estimators besides \(\bar{y}\) which are also consistent. Here are a few:

\[ \begin{aligned} \frac{1}{N - 1} \sum_{n=1}^Ny_n \quad\quad\quad \bar{y}+ \frac{1}{N} \quad\quad\quad \exp(1/N) \bar{y} \quad\quad\quad \frac{1}{\lfloor N/2 \rfloor} \sum_{n=1}^{\lfloor N / 2 \rfloor} y_n, \end{aligned} \]

and so on. Why you would choose one over another is a major topic in statistics which we will touch on only lightly in this course.

The central limit theorem

How close is \(\bar{y}\) to \(\mu\) for any particular \(N\)? It’s impossible to know precisely, since we don’t actually know the distribution of \(y\) — we don’t even know its mean. But for large \(N\), we can take advantage of another asymptotic result, the central limit theorem, or CLT. Suppose that we know \(\sigma := \sqrt{\mathrm{Var}\left(y\right)}\). Then

\[ \frac{1}{\sqrt{N}} \sum_{n=1}^N\frac{y_n - \mu}{\sigma} \rightarrow \mathcal{N}\left(0, 1\right) \quad\textrm{as }N \rightarrow \infty. \]

The CLT will also be a key tool in studying the properties of linear regression.

Note

Note that the left hand side is a random variable and the right hand side is also a random variable. Here, we mean \[ \left| \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^N\frac{y_n - \mu}{\sigma} \le z\right) - \mathbb{P}\left(\mathcal{N}\left(0, 1\right) \le z\right) \right| \rightarrow 0 \quad\textrm{as }N \rightarrow \infty, \textrm{ for any }z. \]

This is called “convergence in distribution.” Note that it’s the same as saying the distribution function of the left hand side converges pointwise to the distribution function of the right hand side. Again, we won’t be too concerned with modes of probabilistic convergence in this class.
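To make this concrete, here is a small simulation sketch, again using the Beta(9, 1) stand-in distribution (so \(\mu\) and \(\sigma\) are known exactly), with \(N = 70\) as in our class:

set.seed(42)
N <- 70
n_reps <- 10000
mu <- 9 / 10                                        # mean of Beta(9, 1)
sigma <- sqrt((9 * 1) / ((9 + 1)^2 * (9 + 1 + 1)))  # sd of Beta(9, 1)

# Each replicate draws N stand-in scores and computes the CLT's standardized sum.
z_sim <- replicate(n_reps, {
  y_sim <- rbeta(N, shape1 = 9, shape2 = 1)
  sum((y_sim - mu) / sigma) / sqrt(N)
})

# This histogram should look approximately standard normal.
ggplot() + geom_histogram(aes(x = z_sim), bins = 40)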

Confidence intervals

Using the CLT, we can say things like the following. Suppose that we choose \(z_\alpha\) so that \(\mathbb{P}\left(-z_\alpha \le \mathcal{N}\left(0, 1\right) \le z_\alpha\right) = 0.95\). Then by the CLT,

\[ \begin{aligned} 0.95 ={}& \mathbb{P}\left(-z_\alpha \le \mathcal{N}\left(0, 1\right) \le z_\alpha\right)\\ \approx{}& \mathbb{P}\left(-z_\alpha \le \frac{1}{\sqrt{N}} \sum_{n=1}^N\frac{y_n - \mu}{\sigma} \le z_\alpha\right) & \textrm{(by the CLT applied twice)}\\ =& \mathbb{P}\left(-\sigma z_\alpha \le \frac{1}{\sqrt{N}} \sum_{n=1}^N(y_n - \mu) \le \sigma z_\alpha\right) & \textrm{(algebra)}\\ =& \mathbb{P}\left( - \frac{\sigma}{\sqrt{N}} z_\alpha \le \frac{1}{N} \sum_{n=1}^N(y_n - \mu) \le \frac{\sigma}{\sqrt{N}} z_\alpha \right) & \textrm{(algebra)}\\ =& \mathbb{P}\left( - \frac{\sigma}{\sqrt{N}} z_\alpha \le \bar{y}- \mu \le \frac{\sigma}{\sqrt{N}} z_\alpha\right) & \textrm{(algebra)}\\ =& \mathbb{P}\left( \bar{y}- \frac{\sigma}{\sqrt{N}} z_\alpha \le \mu \le \bar{y}+ \frac{\sigma}{\sqrt{N}} z_\alpha\right). & \textrm{(algebra)}\\ \end{aligned} \]

This means that, whatever \(\mu\) is, with 95% probability (under IID random sampling) it lies in the interval \(\bar{y}\pm \frac{\sigma}{\sqrt{N}} z_\alpha\). That is, we have constructed a 95% two–sided confidence interval for the unknown \(\mu\), which accounts for the random variability in \(\bar{y}\) as a measure of \(\mu\).

In this case, estimating \(\sigma\) with the square root of the sample variance \(\frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2\), the confidence interval is \([0.87, 0.92]\).
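In R, that computation is short; a minimal sketch, assuming the scores vector from above:

N <- length(scores)
ybar <- mean(scores)
sigma_hat <- sqrt(mean((scores - ybar)^2))  # square root of the 1/N sample variance
z_alpha <- qnorm(0.975)                     # P(-z_alpha <= N(0,1) <= z_alpha) = 0.95

ybar + c(-1, 1) * sigma_hat / sqrt(N) * z_alpha  # approximately [0.87, 0.92]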

It’s worth reflecting on what this does and does not mean. For example, does this mean that, if I don’t change the exam or syllabus this year, we are extremely unlikely to see an average grade above \(0.92\)? Certainly not; this is an underestimate of any reasonable notion of subjective uncertainty in this year’s grades. But one might think of it as a rough lower bound on the subjective uncertainty, since our model holds many things constant that cannot plausibly be held constant in reality. For instance, in the Spring 2024 semester I also taught 151A. In that semester, the mean grade was \(0.8313854\), which is well outside our confidence interval \([0.87, 0.92]\).

Here are some questions we will answer in this class:

  • Does the fact that our estimate of \(\sigma\) is uncertain affect the accuracy of our interval? How can we correct it?
  • Would the bootstrap always give the same answer as this kind of computation?
  • Is the interval \([0.87, 0.92]\) a good guess for your particular grade?

A very special linear regression

What is the connection with linear regression? Recall that linear regression is the problem of finding \(\hat{\beta}\) that solves

\[ \hat{\beta}:= \underset{\beta}{\mathrm{argmin}}\, \sum_{n=1}^N(y_n - \beta x_n)^2, \]

for some “regressors” \(x_n\). It is interpreted as finding the “best” fit (in a squared error sense) through a cloud of points. If we take \(x_n = 1\), we can see that the sample mean is in fact a linear regression problem, with the regressor taken to be identically one! We can check this directly in R, as in the sketch below.
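A minimal check, assuming the scores vector from above: an intercept-only regression (scores ~ 1 in R’s formula notation) recovers exactly the sample mean.

fit <- lm(scores ~ 1)  # regression with x_n = 1 for every n
coef(fit)              # the single fitted coefficient...
mean(scores)           # ...matches the sample mean exactly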

As a consequence, much of the intuition and many of the techniques we’ve developed here will apply in the much more general linear regression setting, with some additional formal complexity but essentially the same set of core ideas.

Bibliography

Freedman, David A. 2009. Statistical Models: Theory and Practice. Cambridge University Press.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.