Davidson and MacKinnon (2004), Sections 1.4–1.5
Have my grades been increasing over time?
Let’s look again at the grades dataset, and consider the question: “have my grades been increasing over time?” I personally am interested in whether some aspect of my courses has been causing an upward trend, and maybe you are interested in extrapolating such a trend to the present semester.
grade_v_time <- lm(grade ~ time, all_grades_df)
print(summary(grade_v_time))
Call:
lm(formula = grade ~ time, data = all_grades_df)
Residuals:
Min 1Q Median 3Q Max
-0.86223 -0.03835 0.03343 0.07532 0.13777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.846658 0.020112 42.098 <2e-16 ***
time 0.015571 0.007198 2.163 0.0315 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1234 on 252 degrees of freedom
Multiple R-squared: 0.01823, Adjusted R-squared: 0.01434
F-statistic: 4.68 on 1 and 252 DF, p-value: 0.03146
Here we see a weak but statistically significant upward trend, probably driven by some very low grades in the first two semesters.
However, this comparison, even as it is, is complicated by the fact that time periods 1 and 2 were 151A, and time periods 3 and 4 were 154. In fact, if I regress on the class label instead, I get a similar result:
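That fit can be reproduced with a call like the one below. This is only a sketch: it assumes the same all_grades_df data frame used above, and its output is omitted here.

# Regress grade on the class label alone; class is a factor, so this fits a
# separate mean grade for each class.
grade_v_class <- lm(grade ~ class, all_grades_df)
summary(grade_v_class)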
Plotting the two fits together shows how they differ: the prediction from the regression on class is a step function that jumps when the class being taught changes, whereas the fitted time trend is continuous. A sketch of such a plot follows.
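One way such a comparison plot might be drawn is sketched below; it assumes the grade_v_time and grade_v_class fits above and the ggplot2 and dplyr packages.

library(ggplot2)
library(dplyr)

# Attach both sets of fitted values to the data and plot them against time:
# a continuous line for the time trend, a step function for class.
all_grades_df %>%
  mutate(pred_time = predict(grade_v_time),
         pred_class = predict(grade_v_class)) %>%
  arrange(time) %>%
  ggplot(aes(x = time)) +
  geom_point(aes(y = grade), alpha = 0.3) +
  geom_line(aes(y = pred_time), color = "blue") +
  geom_step(aes(y = pred_class), color = "red")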
Question:
Can we possibly disentangle the effect of time from which class is being taught? How?
One way to ask this formally is to run the regression on both:
lm(grade ~ time + class, all_grades_df) %>% summary()
Call:
lm(formula = grade ~ time + class, data = all_grades_df)
Residuals:
Min 1Q Median 3Q Max
-0.86615 -0.04175 0.03580 0.07452 0.13385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.863349 0.027039 31.930 <2e-16 ***
time 0.002801 0.015584 0.180 0.858
class154 0.031012 0.033566 0.924 0.356
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1234 on 251 degrees of freedom
Multiple R-squared: 0.02156, Adjusted R-squared: 0.01376
F-statistic: 2.765 on 2 and 251 DF, p-value: 0.06488
Interestingly, although each covariate is statistically significant in its own separate regression, neither is significant in the combined regression. This is one way of saying, using regression, that we cannot disentangle the effects of class and time with this dataset. In other words, without more data, the answer to our question is inconclusive, although there certainly doesn’t seem to be strong evidence for a marked trend.
This is no longer simple linear regression! In this unit, we will study versions of regression that include more than one covariate in this way.
Matrix notation
The simple linear regression formula came from combining the equations that set the univariate gradients equal to zero, and then recognizing a matrix equation. We can in fact do both at the same time! But first we need some notation.
Here is a formal definition of the type of model that we will study for the vast majority of the semester:
\[
y_n = \beta_1 x_{n1} + \beta_2 x_{n2} + \ldots + \beta_P x_{nP} + \varepsilon_n
\quad\textrm{for }n = 1, \ldots, N.
\tag{1}
\]
I will always use \(N\) for the number of observed data points, and \(P\) for the dimension of the regression vector.
Equation 1 generalizes several simpler cases. For example, if we take \(P = 2\), \(x_{n1} \equiv 1\), and \(x_{n2} = x_n\) for some scalar \(x_n\), then Equation 1 becomes the simple linear regression model
\[
y_n = \beta_1 + \beta_2 x_n + \varepsilon_n.
\]
But in general we only observe \(y_n\) and \(x_{n1}, \ldots, x_{nP}\), and we choose \(\beta_1, \ldots, \beta_P\) to make the residuals small. (How we do this precisely will be something we talk about at great length.)
The general form of Equation 1 can be written more compactly using matrix and vector notation. Specifically, if we let
\[
\boldsymbol{x}_n := (x_{n1}, \ldots, x_{nP})^\intercal
\quad\textrm{and}\quad
\boldsymbol{\beta}:= (\beta_1, \ldots, \beta_P)^\intercal,
\]
then Equation 1 can be written as \(y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}+ \varepsilon_n\).
I have written \(\boldsymbol{x}_n^\intercal\boldsymbol{\beta}\) for the “dot product” or “inner product” between \(\boldsymbol{x}_n\) and \(\boldsymbol{\beta}\). Writing it in this way clarifies the relationship with matrix notation below.
There are many other ways to denote inner products in the literature, including \(\boldsymbol{x}_n \cdot \boldsymbol{\beta}\) and \(\langle \boldsymbol{x}_n, \boldsymbol{\beta}\rangle\).
We can compactify this even further if we stack the \(N\) observations:
\[
\begin{aligned}
y_1 ={}& \boldsymbol{x}_1 ^\intercal\boldsymbol{\beta}+ \varepsilon_{1} \\
y_2 ={}& \boldsymbol{x}_2 ^\intercal\boldsymbol{\beta}+ \varepsilon_{2} \\
\vdots\\
y_N ={}& \boldsymbol{x}_N ^\intercal\boldsymbol{\beta}+ \varepsilon_{N} \\
\end{aligned}
\]
As before, we can stack the responses and residuals into \(N\)–vectors, and collect the regressor rows into an \(N \times P\) matrix:
\[
\boldsymbol{Y}:= \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad
\boldsymbol{\varepsilon}:= \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix}, \quad
\boldsymbol{X}:= \begin{pmatrix} \boldsymbol{x}_1^\intercal\\ \vdots \\ \boldsymbol{x}_N^\intercal\end{pmatrix},
\quad\textrm{so that}\quad
\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}.
\]
I will use upper case bold letters for multi-dimensional matrices like \(\boldsymbol{X}\). But I may also use upper case bold letters even when the quantity could also be a column vector, when I think it’s more useful to think of the quantity as a matrix with a single column. Examples are \(\boldsymbol{Y}\) above, or \(\boldsymbol{X}\) when \(P = 1\).
The sum of squared residuals, \(\boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}= (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})\), is a quadratic function of the vector \(\boldsymbol{\beta}\). We wish to find the minimum of this quantity as a function of \(\boldsymbol{\beta}\). We might hope that the minimum occurs at a point where the gradient of this expression is zero.
Rather than compute the univariate derivative with respect to each component, we can compute the multivariate gradient with respect to the vector.
Let’s recall some facts from vector calculus.
Notation
Take \(\boldsymbol{z}\in \mathbb{R}^P\) to be a \(P\)–vector, and let \(f(\boldsymbol{z})\) be a scalar–valued function of the vector \(\boldsymbol{z}\). We write
Equation 4, the “normal equations” \(\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}^\intercal\boldsymbol{Y}\), is a set of \(P\) equations in \(P\) unknowns. If it is not degenerate, one can solve for \(\hat{\boldsymbol{\beta}}\). That is, if the matrix \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible, then we can multiply both sides of Equation 4 by \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) to get
\[
\hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}.
\]
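As a quick sanity check, this formula can be evaluated directly in R and compared with lm. The sketch below assumes the all_grades_df data frame from the grades example above.

# Build the design matrix (a column of ones plus the time covariate) and response.
X <- model.matrix(~ time, all_grades_df)   # N x 2: intercept column and time
Y <- all_grades_df$grade                   # N-vector of responses

# Solve the normal equations (X^T X) beta = (X^T Y) rather than forming the inverse.
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# Should agree with lm's coefficients up to numerical error.
cbind(beta_hat, coef(lm(grade ~ time, all_grades_df)))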
We’re going to be talking again and again about the “ordinary least squares” (“OLS”) problem \[
\hat{\boldsymbol{\beta}}= \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}
\quad\textrm{where}\quad
\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}
\quad\textrm{where}\quad
y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}+ \varepsilon_n \textrm{ for all }n=1,\ldots,N.
\]
It will be nice to have some shorthand for this problem so I don’t have to write this out every time. All of the following will be understood as shorthand for the preceding problem. In each, the fact that \(\hat{\boldsymbol{\beta}}\) minimizes the sum of squared residuals is implicit.
(The final shorthand is closest to the notation of R’s lm function.) This is convenient because, for example, we can state certain properties of the regression \(y_n \sim \boldsymbol{x}_n\) without committing to a particular symbol for the coefficients. Symbols for the missing pieces will hopefully be clear from context as necessary.
Notation
Unlike many other regression texts (and the lm function), I will not necessarily assume that a constant is included in the regression. One can always take a generic regression \(y_n \sim \boldsymbol{x}_n\) to include a constant by assuming that one of the entries of \(\boldsymbol{x}_n\) is one. At some points my convention of not including a constant by default will lead to formulas that may be at odds with some textbooks. But these differences are superficial, and are, in my mind, more than made up for by the generality and simplicity of treating constants as just another regressor.
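To see concretely that a constant is just another regressor, one can add an explicit column of ones and suppress R’s automatic intercept; the two fits in the sketch below (which again assumes the all_grades_df data frame) give identical coefficients.

# R's default: an intercept is added automatically.
fit_default <- lm(grade ~ time, all_grades_df)

# Equivalent fit: include a column of ones by hand and drop the automatic intercept.
df <- all_grades_df
df$one <- 1
fit_explicit <- lm(grade ~ one + time - 1, df)

coef(fit_default)   # (Intercept), time
coef(fit_explicit)  # one, time -- same numbers, different names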
Some simple familiar examples in matrix form
The sample mean
Sample means are in fact a special case of multiple linear regression. Seeing this will be helpful when we interpret categorical variables.
Notation
I will use \(\boldsymbol{1}\) to denote a vector full of ones. Usually it will be an \(N\)–vector, but sometimes its dimension will just be implicit. Similarly, \(\boldsymbol{0}\) is a vector of zeros.
We showed earlier that the sample mean is a special case of the regression \(y_n \sim 1 \cdot \beta\). This can be expressed in matrix notation by taking \(\boldsymbol{X}= \boldsymbol{1}\), an \(N\times 1\) vector of ones. We then have
\[
\boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{1}^\intercal\boldsymbol{1}= \sum_{n=1}^N1 \cdot 1 = N,
\]
so \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible as long as \(N > 0\) (i.e., if you have at least one datapoint), with \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = 1/N\). We also have
\[
\boldsymbol{X}^\intercal\boldsymbol{Y}= \boldsymbol{1}^\intercal\boldsymbol{Y}= \sum_{n=1}^N1 \cdot y_n = N \bar{y},
\]
so \(\hat{\beta}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \frac{1}{N} \cdot N \bar{y}= \bar{y}\): regressing on a constant alone recovers the sample mean.
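In R, this corresponds to regressing on the intercept alone, and the single fitted coefficient is exactly the sample mean. The sketch below uses simulated data purely for illustration.

set.seed(1)
y <- rnorm(20, mean = 3)   # any numeric response will do

fit <- lm(y ~ 1)           # the design matrix is a single column of ones
c(coef(fit), mean(y))      # the intercept equals the sample mean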
For completeness, let’s also see what happens when we omit the constant from simple linear regression. Suppose that we regress \(y_n \sim x_n\), where \(x_n\) is a scalar.
Warning
R adds a constant by default! To remove it, run something like lm(y ~ x - 1, my_dataframe).
Let’s suppose that \(\mathbb{E}\left[x_n\right] = 0\) and \(\mathrm{Var}\left(x_n\right) = \sigma^2 > 0\). We have
\[
\boldsymbol{X}^\intercal\boldsymbol{X}= \sum_{n=1}^Nx_n^2.
\]
Depending on the distribution of \(x_n\), it may be possible for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) to be non–invertible!
Exercise
Produce a distribution for \(x_n\) where \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is non–invertible with positive probability for any \(N\).
However, as \(N \rightarrow \infty\), \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\rightarrow \sigma^2\) by the LLN, and since \(\sigma^2 > 0\), \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\) will be invertible with probability approaching one as \(N\) goes to infinity.
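Since \(\boldsymbol{X}^\intercal\boldsymbol{Y}= \sum_{n=1}^Nx_n y_n\), the no-intercept estimate is simply \(\hat{\beta}= \sum_{n=1}^Nx_n y_n / \sum_{n=1}^Nx_n^2\). The following sketch, with simulated data, checks that lm with the constant removed returns the same number.

set.seed(2)
x <- rnorm(50)                     # a mean-zero covariate
y <- 2 * x + rnorm(50)             # a response with some noise

beta_hat <- sum(x * y) / sum(x^2)  # (X^T X)^{-1} X^T Y when X is a single column
c(beta_hat, coef(lm(y ~ x - 1)))   # should match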
Common quantities
Given a regression fit, there are a few key quantities that we’ll use over and over again.
The fitted value \(\hat{y}_n := \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}\), stacked into the vector \(\hat{\boldsymbol{Y}}:= \boldsymbol{X}\hat{\boldsymbol{\beta}}\), is the regression’s “guess” at the response for a given value of \(\boldsymbol{x}_n\). Analogously, we can define the “error”, “residual”, or “fitted residual” \(\hat{\varepsilon}_n := y_n - \hat{y}_n\), stacked as \(\hat{\boldsymbol{\varepsilon}}:= \boldsymbol{Y}- \hat{\boldsymbol{Y}}\). From these we define the following sums of squares:
\[
\begin{aligned}
\hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}:={}& ESS = \textrm{"Explained sum of squares"} \\
\boldsymbol{Y}^\intercal\boldsymbol{Y}:={}& TSS = \textrm{"Total sum of squares"} \\
\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}:={}& RSS = \textrm{"Residual sum of squares"}.
\end{aligned}
\]
In your homework, you will show that \(TSS = ESS + RSS\). It follows immediately that \(0 \le ESS \le TSS\), and we can define \[
R^2 := \frac{ESS}{TSS} \in [0, 1].
\]
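These quantities are easy to compute from any fitted regression. The sketch below, using the grade ~ time fit from earlier, checks the decomposition numerically. Note that the \(R^2\) defined here is the uncentered version, so it need not match the value reported by summary, which centers the response when an intercept is included.

fit  <- lm(grade ~ time, all_grades_df)
Y    <- all_grades_df$grade
Yhat <- fitted(fit)                # fitted values
eps  <- resid(fit)                 # fitted residuals

TSS <- sum(Y^2)                    # total sum of squares (uncentered)
ESS <- sum(Yhat^2)                 # explained sum of squares
RSS <- sum(eps^2)                  # residual sum of squares

TSS - (ESS + RSS)                  # numerically zero
ESS / TSS                          # the (uncentered) R^2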
High \(R^2\) thus means a “good fit” in the sense that the least squares fit has low residual error relative to the total sum of squares.
High \(R^2\) is not necessarily a good fit
But high \(R^2\) does not necessarily mean that the fit is accurate or useful! In particular, adding regressors to a model can only make \(R^2\) increase, so there are plenty of silly regressions with excellent \(R^2\); a small simulation below illustrates this.
Here are two examples:
\(\boldsymbol{Y}\sim \boldsymbol{Y}\)
\(\boldsymbol{Y}\sim \boldsymbol{I}\)
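To illustrate, here is a small simulation with purely hypothetical data: the response is unrelated to the regressors, yet \(R^2\) grows as more noise columns are added, and reaches 1 once there are as many coefficients as observations (essentially the \(\boldsymbol{Y}\sim \boldsymbol{I}\) example).

set.seed(3)
N <- 30
y     <- rnorm(N)                       # the response is pure noise
X_all <- matrix(rnorm(N * (N - 1)), N)  # N - 1 columns of unrelated noise

# R^2 from regressing y on the first P noise columns (plus an intercept).
r2_for <- function(P) summary(lm(y ~ X_all[, 1:P, drop = FALSE]))$r.squared

sapply(c(1, 5, 10, 20, N - 1), r2_for)  # nested fits, so R^2 can only increase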
References
Davidson, Russell, and James G. MacKinnon. 2004. Econometric Theory and Methods. New York: Oxford University Press.
Ding, Peng. 2024. “Linear Model and Extensions.” arXiv preprint arXiv:2401.00649.
Freedman, David. 2009. Statistical Models: Theory and Practice. Cambridge University Press.