---
title: "Review: Simple linear regression"
format:
  html:
    code-fold: false
    code-tools: true
execute:
  enabled: true
theme: [sandstone]
sidebar: true
---
::: {.content-visible when-format="html"}
{{< include /macros.tex >}}
:::
# Goals
- Review simple linear regression
- The idea of minimizing squared error by setting a derivative equal to zero
- Sample means, covariances, and correlations
- Representing the least squares line as a matrix equation
# Reading
This lecture supplements the following reading:
- @freedman:2009:statistical Chapter 2
- @ding:2024:linear Chapter 2.1
# Does doing well on the quizzes help you do well on the final exam?
```{r}
#| echo: false
#| output: false
library(tidyverse)
library(quarto)
root_dir <- quarto::find_project_root()
all_grades_df <- read.csv(file.path(root_dir, "datasets/grades/grades.csv"))
```
Let's look again at the grades dataset, and consider the question:
"will doing well on the quizzes help you do well on the final?" Note that
this is a causal question, which can't be answered directly with observational
data. But if there is a strong causal relationship, we might expect to see a strong
association. Is there one?
```{r}
final_v_quizzes <- lm(final ~ quizzes, all_grades_df)
all_grades_df %>%
mutate(final_pred=predict(final_v_quizzes, all_grades_df)) %>%
ggplot() +
geom_point(aes(x=quizzes, y=final)) +
geom_line(aes(x=quizzes, y=final_pred), color="red")
```
Note that this includes all classes, not just 151A. There's a relationship,
but not quite as strong as you might expect. One way to summarize this scatterplot
is to find the straight line that best fits the points. We'll talk a lot during
this lecture about how to fit it.
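To put rough numbers on the strength of this relationship, we can look at the fitted coefficients and the sample correlation (using the `final_v_quizzes` fit and `all_grades_df` from above).

```{r}
# The fitted intercept and slope of the red line above.
coef(final_v_quizzes)
# The sample correlation between quiz and final scores, a scale-free
# measure of the strength of the (linear) association.
cor(all_grades_df$quizzes, all_grades_df$final)
```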
The upwards slope strongly suggests that, on average, students who do better
on quizzes also do better on the final exam.
- Does this mean that if you decide to study more for the quizzes, you will do better on the final?
- How do you interpret the variability of points around the line? Does it look large or small?
- Does the spread around the line look to be about the same as the quiz score varies?
# Simple least squares
Recall the simple least squares model:
$$
\begin{align*}
\y_n :={}& \textrm{Response (e.g. final exam grade)} \\
\x_n :={}& \textrm{Regressor (e.g. quiz grade)}\\
\y_n ={}& \beta_2 \x_n + \beta_1 + \res_n \quad \textrm{Model (straight line through data)}.
\end{align*}
$${#eq-lm-simple}
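To make the model concrete, here is a small simulation from @eq-lm-simple with arbitrary, made-up values of $\beta_1$ and $\beta_2$; the red line is the true model, and the points scatter around it because of the $\res_n$ term.

```{r}
# Simulate from y_n = beta_2 x_n + beta_1 + eps_n with made-up parameters.
set.seed(42)
n_obs <- 100
beta_1 <- 1.0  # intercept (arbitrary choice for illustration)
beta_2 <- 0.5  # slope (arbitrary choice for illustration)
sim_df <- tibble(x=runif(n_obs, 0, 10)) %>%
  mutate(y=beta_2 * x + beta_1 + rnorm(n_obs, sd=1))
ggplot(sim_df, aes(x=x, y=y)) +
  geom_point() +
  geom_abline(intercept=beta_1, slope=beta_2, color="red")
```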
::: {.callout-tip title='Notation'}
Here are some key quantities and their names:
- $\y_n$: The 'response'
- $\x_n$: The 'regressors' or 'explanatory' variables
For a linear model, we also have:
- $\res_n$: The 'error' or 'residual'
- $\beta_2, \beta_1$: The 'coefficients', 'parameters', 'slope and intercept'
We might also have estimates of these quantities:
- $\betahat_p$: Estimate of $\beta_p$
- $\reshat_n$: Estimate of $\res_n$
- $\yhat_n$: A 'prediction' or 'fitted value', $\yhat_n = \betahat_1 + \betahat_2 \x_n$
When we form the estimator by minimizing the estimated residuals, we might call the estimate
- 'Ordinary least squares' (or 'OLS')
- 'Least-squares'
- 'Linear regression'
Unless stated otherwise, estimates will implicitly be least-squares estimates, but precisely what we mean by an estimate may have to come from context.
:::
Note that for any value of $\beta$, we get a value of the "error"
or "residual" $\res_n$:
$$
\res_n = \y_n - (\beta_2 \x_n + \beta_1).
$$
This notation may be a little strange. Note that we get a *different value* of
$\res_n$ for any particular values of $\beta_1$ and $\beta_2$. So it doesn't
make any sense to say something like "$\res_n$ is normally distributed." In
fact, for a fixed $\y_n$ and $\x_n$, it's best to think of the residual
as a *function* of $\beta_1$ and $\beta_2$.
The "least squares fit" is called this because we choose $\beta_1$ and $\beta_2$
to make $\sumn \res_n^2$ as small as possible:
$$
\begin{align*}
\textrm{Choose }\beta_1,\beta_2\textrm{ so that }
\sumn \res_n^2 = \sumn \left( \y_n - (\beta_2 \x_n + \beta_1) \right)^2
\textrm{ is as small as possible. }
\end{align*}
$$
We call the values that achieve this minimum $\betahat_1$ and $\betahat_2$. Similarly,
we set $\yhat_n = \betahat_2 \x_n + \betahat_1$ and define $\reshat_n = \y_n - \yhat_n$.
Note that $\reshat_n$ is just $\res_n$ computed at $\betahat_1$, $\betahat_2$.
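To emphasize that the least squares fit is just the solution to this minimization problem, here is a sketch that minimizes the sum of squared residuals numerically with `optim()` and compares the result with `lm()`, using the grades data from above:

```{r}
# The sum of squared residuals as a function of (beta_1, beta_2).
sum_sq_res <- function(beta) {
  res <- all_grades_df$final - (beta[2] * all_grades_df$quizzes + beta[1])
  sum(res^2)
}
# Minimize numerically, starting from (0, 0).
optim(c(0, 0), sum_sq_res)$par
# The least squares coefficients from lm() should (approximately) match.
coef(final_v_quizzes)
```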
# Covariances, variances, and correlations
How can we interpret this formula as a summary statistic? Let's introduce the notation
\newcommand{\xbar}{\overline{\x}}
\newcommand{\ybar}{\overline{\y}}
\newcommand{\xybar}{\overline{\x \y}}
\newcommand{\xxbar}{\overline{\x \x}}
$$
\begin{align*}
\ybar ={}& \meann \y_n \\
\xbar ={}& \meann \x_n \\
\xybar ={}& \meann \x_n \y_n \\
\xxbar ={}& \meann \x_n ^2.
\end{align*}
$$
You might recall that the solution to the least squares problem is given by
$$
\begin{align*}
\betahat_1 ={}& \ybar - \betahat_2 \xbar \\
\betahat_2 ={}&
\frac{\xybar - \xbar \, \ybar}
{\xxbar - \xbar^2}.
\end{align*}
$$
These have succinct expressions in terms of familiar summary statistics. We will
denote quantities like $\overline{\x}$ as
$$
\expecthat{\x} = \meann \x_n = \overline{\x}.
$$
This is also called the "sample average" or "sample expectation." It
depends on the dataset, and if we're imagining that dataset to be random,
then $\expecthat{\x}$ is random as well.
This is different from the expectation, $\expect{\x}$, which does not depend
on any dataset. Recall also the sample covariance is given by
$$
\covhat{\x, \y} = \meann (\x_n - \xbar) (\y_n - \ybar),
$$
with the sample variance $\varhat{\x} = \covhat{\x, \x}$ as a special case. In
these terms, we can see that
$$
\betahat_2 = \frac{\covhat{\x, \y}}{\varhat{\x}}.
$$
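We can check this identity numerically. Note that R's `cov()` and `var()` divide by $N - 1$ rather than $N$, but that factor cancels in the ratio.

```{r}
# Slope as the ratio of the sample covariance to the sample variance.
with(all_grades_df, cov(quizzes, final) / var(quizzes))
# The slope fitted by lm() above.
coef(final_v_quizzes)["quizzes"]
```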
Note that the covariance is *scale-dependent*, and will change if you rescale
$\x$. A scale-invariant measure of association is the sample correlation, which is
defined as
$$
\corrhat{\x, \y} = \frac{\covhat{\x, \y}}{\sqrt{\varhat{\x} \varhat{\y}}}.
$$
The sample correlation varies between $-1$ and $1$. Combining the previous two
displays, we can also write
$$
\betahat_2 = \sqrt{\frac{\varhat{\y}}{\varhat{\x}}} \corrhat{\x, \y}.
$$
In this sense, the quantity $\betahat_2$ is measuring the association
between $\x$ and $\y$. Note that $\corrhat{\x, \y} = \corrhat{\y, \x}$,
but the slope from regressing $\x$ on $\y$ is
$\sqrt{\varhat{\x} / \varhat{\y}} \, \corrhat{\x, \y}$, which is not
$1 / \betahat_2$ unless the correlation is exactly $\pm 1$. So we can't
just reverse the roles of $\x$ and $\y$ in the regression! This is because
the regression is minimizing *vertical distance*.
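A quick numerical illustration of both points, using the grades data: the slope equals the ratio of standard deviations times the correlation, and regressing `quizzes` on `final` does *not* give the reciprocal of the original slope.

```{r}
# Slope of final ~ quizzes as an sd ratio times the sample correlation.
with(all_grades_df, sd(final) / sd(quizzes) * cor(quizzes, final))
# Reversing the roles of x and y gives a different line...
coef(lm(quizzes ~ final, all_grades_df))["final"]
# ...whose slope is not the reciprocal of the original slope.
1 / coef(final_v_quizzes)["quizzes"]
```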
Note that the regression line always passes through the mean, since
$$
\betahat_1 + \betahat_2 \xbar = \ybar - \betahat_2 \xbar + \betahat_2 \xbar = \ybar.
$$
So the simple regression line is a line passing through the point of means,
with a slope equal to the ratio of the standard deviations times the
sample correlation.
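We can verify numerically that the fitted line passes through the point of means (assuming no missing values in these columns):

```{r}
# The prediction at the mean quiz score should equal the mean final score.
predict(final_v_quizzes, newdata=data.frame(quizzes=mean(all_grades_df$quizzes)))
mean(all_grades_df$final)
```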
# Simple least squares estimator derivation
Let's derive the simple least squares formula a few different ways.
The sum of squared errors that we're trying to minimize is smooth and convex, so
any minimum must satisfy
$$
\begin{align*}
\fracat{\partial \sumn \res_n^2}{\partial \beta_1}{\betahat_1, \betahat_2} ={}& 0 \quad\textrm{and} \\
\fracat{\partial \sumn \res_n^2}{\partial \beta_2}{\betahat_1, \betahat_2} ={}& 0.
\end{align*}
$$
::: {.callout-warning title='Question'}
When is it sufficient to set the gradient equal to zero to find a minimum?
:::
These translate to (after dividing by $-2 N$)
$$
\begin{align*}
\meann \y_n - \betahat_2 \meann \x_n - \betahat_1 ={}& 0 \quad\textrm{and}\\
\meann \y_n \x_n - \betahat_2 \meann \x_n^2 - \betahat_1 \meann \x_n ={}& 0.
\end{align*}
$$
Our estimator then must satisfy
$$
\begin{align*}
\xbar \betahat_2 + \betahat_1 ={}& \ybar \quad\textrm{and}\\
\xxbar \betahat_2 + \xbar \betahat_1 ={}& \xybar.
\end{align*}
$$
We have a linear system with two unknowns and two equations. An elegant way to solve them is to subtract $\xbar$ times the first equation from the second, giving:
$$
\begin{align*}
\xbar \betahat_1 - \xbar \betahat_1 +
\xxbar \betahat_2 - \xbar^2 \betahat_2 ={}&
\xybar - \xbar \ybar \Leftrightarrow\\
\betahat_2 ={}&
\frac{\xybar - \xbar \, \ybar}
{\xxbar - \xbar^2},
\end{align*}
$$
as long as $\xxbar - \xbar^2 \ne 0$.
::: {.callout-warning title='Question'}
In ordinary language, what does it mean for $\xxbar - \xbar^2 = 0$?
:::
We can then plug this into the first equation giving
$$
\betahat_1 = \ybar - \betahat_2 \xbar.
$$
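As a computational check of this derivation (a sketch using the grades data from the start of the lecture), we can compute the summary statistics, plug them into these formulas, and confirm that the normal equations hold at the result:

```{r}
x <- all_grades_df$quizzes
y <- all_grades_df$final
xbar <- mean(x); ybar <- mean(y)
xybar <- mean(x * y); xxbar <- mean(x^2)
# Closed-form least squares estimates.
betahat_2 <- (xybar - xbar * ybar) / (xxbar - xbar^2)
betahat_1 <- ybar - betahat_2 * xbar
c(betahat_1, betahat_2)
# Compare with lm().
coef(final_v_quizzes)
# The normal equations: the residuals average to zero and are
# orthogonal (on average) to the regressor.
res_hat <- y - (betahat_2 * x + betahat_1)
c(mean(res_hat), mean(x * res_hat))
```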
# Matrix multiplication version
Alternatively, the two equations that define the estimator can be written in matrix form as
$$
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix}
\begin{pmatrix}
\betahat_1 \\
\betahat_2
\end{pmatrix} =
\begin{pmatrix}
\ybar \\
\xybar
\end{pmatrix}
$${#eq-simple-est-as-matrix}
Recall that, as long as $\xxbar - \xbar^2 \ne 0$, there is a special matrix that
allows us to get an expression for $\betahat_1$ and $\betahat_2$:
$$
\begin{align*}
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix}^{-1} =
\frac{1}{\xxbar - \xbar^2}
\begin{pmatrix}
\xxbar & - \xbar \\
-\xbar & 1
\end{pmatrix}
\end{align*}
$$
This matrix is called the "inverse" because
$$
\begin{align*}
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix}^{-1}
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix} =
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix}.
\end{align*}
$$
::: {.callout-warning title='Exercise'}
Verify the preceding property.
:::
Multiplying both sides of @eq-simple-est-as-matrix by
the matrix inverse gives
$$
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix}
\begin{pmatrix}
\betahat_1 \\
\betahat_2
\end{pmatrix} =
\begin{pmatrix}
\betahat_1 \\
\betahat_2
\end{pmatrix} =
\frac{1}{\xxbar - \xbar^2}
\begin{pmatrix}
\xxbar & - \xbar \\
-\xbar & 1
\end{pmatrix}
\begin{pmatrix}
\ybar \\
\xybar
\end{pmatrix}.
$$
From this we can read off the familiar answer
$$
\begin{align*}
\betahat_2 ={}& \frac{\xybar - \xbar\,\ybar}{\xxbar - \xbar^2}\\
\betahat_1 ={}& \frac{\xxbar\,\ybar - \xybar\,\xbar}{\xxbar - \xbar^2}\\
={}& \frac{\xxbar\,\ybar -
\xbar^2 \ybar + \xbar^2 \ybar - \xybar\,\xbar}
{\xxbar - \xbar^2}\\
={}& \ybar - \frac{\xybar\,\xbar - \xbar^2 \ybar}
{\xxbar - \xbar^2} \\
={}& \ybar - \betahat_2 \xbar.
\end{align*}
$$
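Finally, here is the same computation done by forming the $2 \times 2$ system in @eq-simple-est-as-matrix and solving it with `solve()`, which also lets us check the inverse formula numerically:

```{r}
# Build the 2x2 system from @eq-simple-est-as-matrix and solve it directly.
x <- all_grades_df$quizzes
y <- all_grades_df$final
A <- matrix(c(1, mean(x), mean(x), mean(x^2)), nrow=2)
b <- c(mean(y), mean(x * y))
# solve(A) returns the matrix inverse; compare with the closed-form inverse.
solve(A)
matrix(c(mean(x^2), -mean(x), -mean(x), 1), nrow=2) / (mean(x^2) - mean(x)^2)
# solve(A, b) solves the linear system, giving (betahat_1, betahat_2).
solve(A, b)
coef(final_v_quizzes)
```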