---
title: "Review: Simple linear regression"
format:
  html:
    code-fold: false
    code-tools: true
execute:
  enabled: true
theme: [sandstone]
sidebar: true
---
::: {.content-visible when-format="html"}
{{< include /macros.tex >}}
:::
# Goals
- Review simple linear regression
- The idea of minimizing squared error by setting a derivative equal to zero
- Sample means, covariances, and correlations
- Representing the least squares line as a matrix equation
# Reading
This lecture supplements the following reading:
- @freedman:2009:statistical Chapter 2
- @ding:2024:linear Chapter 2.1
# Does doing well on the quizzes help you do well on the final exam?
```{r}
#| echo: false
#| output: false
library(tidyverse)
library(quarto)
root_dir <- quarto::find_project_root()
all_grades_df <- read.csv(file.path(root_dir, "datasets/grades/grades.csv"))
```
Let's look again at the grades dataset, and consider the question:
"will doing well on the quizzes help you do well on the final?" Note that
this is a causal question, which can't be answered directly with observational
data. But if there is a strong causal relationship, we might expect to see a strong
association. Is there one?
```{r}
final_v_quizzes <- lm(final ~ quizzes, all_grades_df)
all_grades_df %>%
mutate(final_pred=predict(final_v_quizzes, all_grades_df)) %>%
ggplot() +
geom_point(aes(x=quizzes, y=final)) +
geom_line(aes(x=quizzes, y=final_pred), color="red")
```
Note that this includes all classes, not just 151A. There's a relationship,
but not quite as strong as you might expect. One way to summarize this scatterplot
is to find the straight line that best fits the points. We'll talk a lot during
this lecture about how to fit it.
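To put rough numbers on the strength of this relationship, we can look at the fitted coefficients and the sample correlation (using the `final_v_quizzes` fit and `all_grades_df` from above).

```{r}
# The fitted intercept and slope of the red line above.
coef(final_v_quizzes)
# The sample correlation between quiz and final scores, a scale-free
# measure of the strength of the (linear) association.
cor(all_grades_df$quizzes, all_grades_df$final)
```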
The upwards slope strongly suggests that, on average, students who do better
on quizzes also do better on the final exam.
- Does this mean that if you decide to study more for the quizzes, you will do better on the final?
- How do you interpret the variability of points around the line? Does it look large or small?
- Does the spread around the line look to be about the same as the quiz score varies?
# Simple least squares
Recall the simple least squares model:
$$
\begin{align*}
\y_n :={}& \textrm{Response (e.g. final exam grade)} \\
\x_n :={}& \textrm{Regressor (e.g. quiz grade)}\\
\y_n ={}& \beta_2 \x_n + \beta_1 + \res_n \quad \textrm{Model (straight line through data)}.
\end{align*}
$${#eq-lm-simple}
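To make the model concrete, here is a small simulation from @eq-lm-simple with arbitrary, made-up values of $\beta_1$ and $\beta_2$; the red line is the true model, and the points scatter around it because of the $\res_n$ term.

```{r}
# Simulate from y_n = beta_2 x_n + beta_1 + eps_n with made-up parameters.
set.seed(42)
n_obs <- 100
beta_1 <- 1.0  # intercept (arbitrary choice for illustration)
beta_2 <- 0.5  # slope (arbitrary choice for illustration)
sim_df <- tibble(x=runif(n_obs, 0, 10)) %>%
  mutate(y=beta_2 * x + beta_1 + rnorm(n_obs, sd=1))
ggplot(sim_df, aes(x=x, y=y)) +
  geom_point() +
  geom_abline(intercept=beta_1, slope=beta_2, color="red")
```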
::: {.callout-tip title='Notation'}
Here are some key quantities and their names:
- $\y_n$: The 'response'
- $\x_n$: The 'regressors' or 'explanatory' variables
For a linear model, we also have:
- $\res_n$: The 'error' or 'residual'
- $\beta_2, \beta_1$: The 'coefficients', 'parameters', 'slope and intercept'
We might also have estimates of these quantities:
- $\betahat_p$: Estimate of $\beta_p$
- $\reshat_n$: Estimate of $\res_n$
- $\yhat_n$: A 'prediction' or 'fitted value', $\yhat_n = \betahat_1 + \betahat_2 \x_n$
When we form the estimator by minimizing the estimated residuals, we might call the estimate
- 'Ordinary least squares' (or 'OLS')
- 'Least-squares'
- 'Linear regression'
Unless stated otherwise, estimates will implicitly be least-squares estimates, but precisely what we mean by an estimate may have to come from context.
:::
Note that for any value of $\beta$, we get a value of the "error"
or "residual" $\res_n$:
$$
\res_n = \y_n - (\beta_2 \x_n + \beta_1).
$$
This notation may be a little strange. Note that we get a *different value* of
$\res_n$ for any particular values of $\beta_1$ and $\beta_2$. So it doesn't
make any sense to say something like "$\res_n$ is normally distributed." In
fact, for a fixed $\y_n$ and $\x_n$, it's best to think of the residual
as a *function* of $\beta_1$ and $\beta_2$.
The "least squares fit" is called this because we choose $\beta_1$ and $\beta_2$
to make $\sumn \res_n^2$ as small as possible:
$$
\begin{align*}
\textrm{Choose }\beta_1,\beta_2\textrm{ so that }
\sumn \res_n^2 = \sumn \left( \y_n - (\beta_2 \x_n + \beta_1) \right)^2
\textrm{ is as small as possible. }
\end{align*}
$$
We call the values that achieve this minimum $\betahat_1$ and $\betahat_2$. Similarly,
we set $\yhat_n = \betahat_2 \x_n + \betahat_1$ and define $\reshat_n = \y_n - \yhat_n$.
Note that $\reshat_n$ is just $\res_n$ computed at $\betahat_1$, $\betahat_2$.
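To emphasize that the least squares fit is just the solution to this minimization problem, here is a sketch that minimizes the sum of squared residuals numerically with `optim()` and compares the result with `lm()`, using the grades data from above:

```{r}
# The sum of squared residuals as a function of (beta_1, beta_2).
sum_sq_res <- function(beta) {
  res <- all_grades_df$final - (beta[2] * all_grades_df$quizzes + beta[1])
  sum(res^2)
}
# Minimize numerically, starting from (0, 0).
optim(c(0, 0), sum_sq_res)$par
# The least squares coefficients from lm() should (approximately) match.
coef(final_v_quizzes)
```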
# Covariances, variances, and correlations
How can we interpret this formula as a summary statistic? Let's introduce the notation
\newcommand{\xbar}{\overline{\x}}
\newcommand{\ybar}{\overline{\y}}
\newcommand{\xybar}{\overline{\x \y}}
\newcommand{\xxbar}{\overline{\x \x}}
$$
\begin{align*}
\ybar ={}& \meann \y_n \\
\xbar ={}& \meann \x_n \\
\xybar ={}& \meann \x_n \y_n \\
\xxbar ={}& \meann \x_n ^2.
\end{align*}
$$
You might recall that the solution to the least squares problem is given by
$$
\begin{align*}
\betahat_1 ={}& \ybar - \betahat_2 \xbar \\
\betahat_2 ={}&
\frac{\xybar - \xbar \, \ybar}
{\xxbar - \xbar^2}.
\end{align*}
$$
These have succinct expressions in terms of familiar summary statistics. We will
denote quantities like $\overline{\x}$ as
$$
\expecthat{\x} = \meann \x_n = \overline{\x}.
$$
This is also called the "sample average" or "sample expectation." It
depends on the dataset, and if we're imagining that dataset to be random,
then $\expecthat{\x}$ is random as well.
This is different from the expectation, $\expect{\x}$, which does not depend
on any dataset. Recall also the sample covariance is given by
$$
\covhat{\x, \y} = \meann (\x_n - \xbar) (\y_n - \ybar),
$$
with the sample variance $\varhat{\x} = \covhat{\x, \x}$ as a special case. In
these terms, we can see that
$$
\betahat_2 = \frac{\covhat{\x, \y}}{\varhat{\x}}.
$$
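We can check this identity numerically. Note that R's `cov()` and `var()` divide by $N - 1$ rather than $N$, but that factor cancels in the ratio.

```{r}
# Slope as the ratio of the sample covariance to the sample variance.
with(all_grades_df, cov(quizzes, final) / var(quizzes))
# The slope fitted by lm() above.
coef(final_v_quizzes)["quizzes"]
```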
Note that the covariance is *scale-dependent*, and will change if you rescale
$\x$. A scale-invariant measure of association is the sample correlation, which is
defined as
$$
\corrhat{\x, \y} = \frac{\covhat{\x, \y}}{\sqrt{\varhat{\x} \varhat{\y}}}.
$$
The sample correlation varies between $-1$ and $1$. Combining the previous two
displays, we can also write
$$
\betahat_2 = \sqrt{\frac{\varhat{\y}}{\varhat{\x}}} \corrhat{\x, \y}.
$$
In this sense, the quantity $\betahat_2$ is measuring the association
between $\x$ and $\y$. Note that $\corrhat{\x, \y} = \corrhat{\y, \x}$,
but the slope from regressing $\x$ on $\y$ is
$\sqrt{\varhat{\x} / \varhat{\y}} \, \corrhat{\x, \y}$, which is not
$1 / \betahat_2$ unless the correlation is exactly $\pm 1$. So we can't
just reverse the roles of $\x$ and $\y$ in the regression! This is because
the regression is minimizing *vertical distance*.
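A quick numerical illustration of both points, using the grades data: the slope equals the ratio of standard deviations times the correlation, and regressing `quizzes` on `final` does *not* give the reciprocal of the original slope.

```{r}
# Slope of final ~ quizzes as an sd ratio times the sample correlation.
with(all_grades_df, sd(final) / sd(quizzes) * cor(quizzes, final))
# Reversing the roles of x and y gives a different line...
coef(lm(quizzes ~ final, all_grades_df))["final"]
# ...whose slope is not the reciprocal of the original slope.
1 / coef(final_v_quizzes)["quizzes"]
```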
Note that the regression line always passes through the mean, since
$$
\betahat_1 + \betahat_2 \xbar = \ybar - \betahat_2 \xbar + \betahat_2 \xbar = \ybar.
$$
So the simple regression line is a line passing through the point of means,
with a slope equal to the ratio of the standard deviations times the
sample correlation.
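We can verify numerically that the fitted line passes through the point of means (assuming no missing values in these columns):

```{r}
# The prediction at the mean quiz score should equal the mean final score.
predict(final_v_quizzes, newdata=data.frame(quizzes=mean(all_grades_df$quizzes)))
mean(all_grades_df$final)
```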
# Simple least squares estimator derivation
Let's derive the simple least squares formula a few different ways.
The sum of squared errors that we're trying to minimize is smooth and convex, so
any minimum must satisfy
$$
\begin{align*}
\fracat{\partial \sumn \res_n^2}{\partial \beta_1}{\betahat_1, \betahat_2} ={}& 0 \quad\textrm{and} \\
\fracat{\partial \sumn \res_n^2}{\partial \beta_2}{\betahat_1, \betahat_2} ={}& 0.
\end{align*}
$$
::: {.callout-warning title='Question'}
When is it sufficient to set the gradient equal to zero to find a minimum?
:::
These translate to (after dividing by $-2 N$)
$$
\begin{align*}
\meann \y_n - \betahat_2 \meann \x_n - \betahat_1 ={}& 0 \quad\textrm{and}\\
\meann \y_n \x_n - \betahat_2 \meann \x_n^2 - \betahat_1 \meann \x_n ={}& 0.
\end{align*}
$$
Our estimator then must satisfy
$$
\begin{align*}
\xbar \betahat_2 + \betahat_1 ={}& \ybar \quad\textrm{and}\\
\xxbar \betahat_2 + \xbar \betahat_1 ={}& \xybar.
\end{align*}
$$
We have a linear system with two unknowns and two equations. An elegant way to solve them is to subtract $\xbar$ times the first equation from the second, giving:
$$
\begin{align*}
\xbar \betahat_1 - \xbar \betahat_1 +
\xxbar \betahat_2 - \xbar^2 \betahat_2 ={}&
\xybar - \xbar \ybar \Leftrightarrow\\
\betahat_2 ={}&
\frac{\xybar - \xbar \, \ybar}
{\xxbar - \xbar^2},
\end{align*}
$$
as long as $\xxbar - \xbar^2 \ne 0$.
::: {.callout-warning title='Question'}
In ordinary language, what does it mean for $\xxbar - \xbar^2 = 0$?
:::
We can then plug this into the first equation giving
$$
\betahat_1 = \ybar - \betahat_2 \xbar.
$$
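As a computational check of this derivation (a sketch using the grades data from the start of the lecture), we can compute the summary statistics, plug them into these formulas, and confirm that the normal equations hold at the result:

```{r}
x <- all_grades_df$quizzes
y <- all_grades_df$final
xbar <- mean(x); ybar <- mean(y)
xybar <- mean(x * y); xxbar <- mean(x^2)
# Closed-form least squares estimates.
betahat_2 <- (xybar - xbar * ybar) / (xxbar - xbar^2)
betahat_1 <- ybar - betahat_2 * xbar
c(betahat_1, betahat_2)
# Compare with lm().
coef(final_v_quizzes)
# The normal equations: the residuals average to zero and are
# orthogonal (on average) to the regressor.
res_hat <- y - (betahat_2 * x + betahat_1)
c(mean(res_hat), mean(x * res_hat))
```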
# Matrix multiplication version
Alternatively, the two equations that define the estimator can be written in matrix form as
$$
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix}
\begin{pmatrix}
\betahat_1 \\
\betahat_2
\end{pmatrix} =
\begin{pmatrix}
\ybar \\
\xybar
\end{pmatrix}
$${#eq-simple-est-as-matrix}
Recall that, as long as $\xxbar - \xbar^2 \ne 0$, there is a special matrix that
allows us to get an expression for $\betahat_1$ and $\betahat_2$:
$$
\begin{align*}
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix}^{-1} =
\frac{1}{\xxbar - \xbar^2}
\begin{pmatrix}
\xxbar & - \xbar \\
-\xbar & 1
\end{pmatrix}
\end{align*}
$$
This matrix is called the "inverse" because
$$
\begin{align*}
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix}^{-1}
\begin{pmatrix}
1 & \xbar \\
\xbar & \xxbar
\end{pmatrix} =
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix}.
\end{align*}
$$
::: {.callout-warning title='Exercise'}
Verify the preceding property.
:::
Multiplying both sides of @eq-simple-est-as-matrix by
the matrix inverse gives
$$
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix}
\begin{pmatrix}
\betahat_1 \\
\betahat_2
\end{pmatrix} =
\begin{pmatrix}
\betahat_1 \\
\betahat_2
\end{pmatrix} =
\frac{1}{\xxbar - \xbar^2}
\begin{pmatrix}
\xxbar & - \xbar \\
-\xbar & 1
\end{pmatrix}
\begin{pmatrix}
\ybar \\
\xybar
\end{pmatrix}.
$$
From this we can read off the familiar answer
$$
\begin{align*}
\betahat_2 ={}& \frac{\xybar - \xbar\,\ybar}{\xxbar - \xbar^2}\\
\betahat_1 ={}& \frac{\xxbar\,\ybar - \xybar\,\xbar}{\xxbar - \xbar^2}\\
={}& \frac{\xxbar\,\ybar -
\xbar^2 \ybar + \xbar^2 \ybar - \xybar\,\xbar}
{\xxbar - \xbar^2}\\
={}& \ybar - \frac{\xybar\,\xbar - \xbar^2 \ybar}
{\xxbar - \xbar^2} \\
={}& \ybar - \betahat_2 \xbar.
\end{align*}
$$
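Finally, here is the same computation done by forming the $2 \times 2$ system in @eq-simple-est-as-matrix and solving it with `solve()`, which also lets us check the inverse formula numerically:

```{r}
# Build the 2x2 system from @eq-simple-est-as-matrix and solve it directly.
x <- all_grades_df$quizzes
y <- all_grades_df$final
A <- matrix(c(1, mean(x), mean(x), mean(x^2)), nrow=2)
b <- c(mean(y), mean(x * y))
# solve(A) returns the matrix inverse; compare with the closed-form inverse.
solve(A)
matrix(c(mean(x^2), -mean(x), -mean(x), 1), nrow=2) / (mean(x^2) - mean(x)^2)
# solve(A, b) solves the linear system, giving (betahat_1, betahat_2).
solve(A, b)
coef(final_v_quizzes)
```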