$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} 
\def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Class organization and philosophy

Some guiding principles

This is a course about linear models. You probably already know what linear models are: in short, they are models that fit straight lines through data, possibly high-dimensional data. Every setting we consider in this class will have the following attributes (made concrete in the short sketch after this list):

  • A bunch of data points, which we’ll index with \(n = 1, \ldots, N\).
  • Each datapoint consists of:
    • A scalar-valued “response” \(y_n\)
    • A vector-valued “regressor” \(\xv_n = (\x_{n1}, \ldots, \x_{nP})\).
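Concretely, here is a minimal sketch of this setup in Python, with made-up data standing in for a real dataset (the names `X` and `y` follow the notation below):

```python
import numpy as np

# A minimal sketch of the regression setup, with made-up data.
# N datapoints; each has a scalar response y_n and a P-dimensional regressor x_n.
N, P = 100, 3

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(N, P))   # rows are the regressors x_1, ..., x_N
y = rng.normal(size=N)        # entries are the responses y_1, ..., y_N

# The n-th datapoint is the pair (y[n], X[n, :]).
print(X.shape, y.shape)       # (100, 3) (100,)
```
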
Notation

Throughout the course, I will (almost) always use the letter “x” for regressors and the letter “y” for responses. There will always be \(N\) datapoints, and the regressors will be \(P\)-dimensional. Vectors and matrices will be boldface.

Of course, I may deviate from this (and any) notation convention by saying so explicitly.

Linear models as a gateway into statistics and machine learning writ large

We will be interested in what \(\xv_n\) can tell us about \(\y_n\). This setup is called a “regression problem,” and it can be attacked with lots of models, including non-linear ones. But we will focus on approaches that fit straight lines to the dependence of \(\y_n\) on \(\xv_n\).
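To fix ideas (and at the risk of getting ahead of ourselves), “fitting a straight line” will typically mean positing a model of the form

$$
\y_n = \betav^\trans \xv_n + \res_n,
$$

and estimating the unknown \(\betav\) by least squares:

$$
\betavhat = \argmin{\betav} \sumn \left(\y_n - \betav^\trans \xv_n\right)^2.
$$

We will spend much of the course unpacking, generalizing, and critiquing exactly this display.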

Compared with a lot of other machine learning and statistical procedures, linear models are relatively easy to analyze and understand. Yet they are complex enough to exhibit many of the strengths and pitfalls of machine learning and statistics generally. So really, this is only partly a course about linear models per se. I hope to make it a course about concepts in statistics and machine learning more generally, viewed within the relatively simple framework of linear models. Some examples that we might touch on at least briefly include:

  • Asymptotics under misspecification
  • Regularization
  • Sparsity
  • The bias / variance tradeoff
  • The influence function
  • Domain transfer
  • Distributed learning
  • Conformal inference
  • Permutation testing
  • Bayesian methods
  • Benign overfitting

I will probably not get to all of these, though you should feel free to ask about them in office hours. Students in the past have also profitably studied more advanced methods as part of their final project.

Linear models as a gateway into epistemology

In Lecture XXI of Wittgenstein’s “Lectures on the Foundations of Mathematics” we find this diagram and quote:

[Figure: “Entrails of a Goose” (diagram and accompanying quote reproduced from the lectures)]

In the same passage, he goes on to say,

“A use of language has normally what we might call a point. This is immensely important. Although it’s true this is a matter of degree, and we can’t just say where it ends.”

When we run a regression, we must have a point. Otherwise we are just looking at goose entrails.

Our results and conclusions will be expressed in formal mathematical statements and in software. For the purpose of this class, I view mathematics and coding as analogous to language, grammar, and style: you need to have a command of these things in order to say something. But the content of this course doesn’t stop at math and coding, just as learning a language alone does not give you something to say. Linear models will be a mathematical and computational tool for communicating with and about the real world. Datasets can speak to us in the language of linear models, and we can communicate with other humans through the language of linear models. Learning to communicate effectively in this way is the most important content of this course, and is a skill that will remain relevant whether or not you ever interpret or fit another linear model in your life.

Statistical analyses are mathematical, and so are often formulated as beginning from “assumptions,” from which we derive statistical conclusions, from which we derive real-world conclusions. For example, we might look at housing price data and:

  1. Assume that housing prices are a linear function of square footage, with IID normally distributed errors,
  2. Using this assumption, construct a 95% confidence interval for the slope of the line (see the code sketch after this list), and
  3. Infer something about how much more a particular house would sell for if you were to build it larger.
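Here is a sketch of steps (1) and (2), assuming normal errors and using simulated data; the variable names and the numbers are invented for illustration, since we do not have the housing dataset in hand:

```python
import numpy as np
from scipy import stats

# Step (1) as a conceit: price is linear in square footage plus IID normal noise.
# The data are simulated; sqft and price stand in for a real housing dataset.
rng = np.random.default_rng(seed=0)
N = 50
sqft = rng.uniform(500, 3500, size=N)
price = 50_000 + 120 * sqft + rng.normal(scale=40_000, size=N)  # "true" slope 120

# Step (2): OLS fit and a 95% confidence interval for the slope.
X = np.column_stack([np.ones(N), sqft])           # design matrix with intercept
beta_hat, rss, _, _ = np.linalg.lstsq(X, price, rcond=None)
sigma2_hat = rss[0] / (N - 2)                     # unbiased noise variance estimate
se_slope = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])

t_crit = stats.t.ppf(0.975, df=N - 2)
ci = (beta_hat[1] - t_crit * se_slope, beta_hat[1] + t_crit * se_slope)
print(f"slope estimate: {beta_hat[1]:.1f}, 95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

Note that the interval is only as trustworthy as the conceit behind it; step (3) is where the real difficulty lives.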

A statistics class will typically spend a lot of time on getting from (1) to (2). (For better or worse, the present course will not be a complete exception.) The problem is that the assumption (1) is flatly absurd, and that the leap from (2) to (3) is far more problematic than the leap from (1) to (2). On the other hand, it is probably not really the case that steps (1) and (2) tell you nothing at all about the relationship between square footage and housing costs; it’s just that they don’t constitute any watertight or formal style of inference. Probably a better word for (1) would be “conceit” rather than “assumption,” since you know it to be false, but are willing to entertain it tentatively in order to learn something from the data. Here we can come back to Wittgenstein’s generosity: “They say, after all, that it gives them some guidance”.

Whether or not a statistical analysis is “good” cannot be evaluated outside a particular context. It is not a mathematical property, even though mathematical techniques can illuminate egregious errors. It seems too strict to say that because the assumptions (or conceits) are false, the whole analysis is nonsense; yet it is clear that statistical analysis does not (typically) live up to any strong claim of logical rigor. So we have to ask contextual questions. Why do we care about the conclusions of this analysis? What will they be used for? Who needs to understand our analysis? What are the consequences of certain kinds of errors, including the application of implausible conceits?

Outside of a classroom, you will probably never encounter a linear model without a real question and context attached to it. I will make a real effort in this class to respect this fact and to always present data in context, to the extent possible within a classroom setting. I hope you will in turn get into the habit of always thinking about the context of a problem, even when going through formal homework and lab exercises.

For pedagogical reasons we may have to step into abstract territory at times, but I will make an effort to tie what we learn back to reality, and, in grading, we’ll make sure to reward your efforts to do so as well. Just as there is no “correct” essay in an English class, there will often be no single “correct” analysis for a dataset, even though there are certainly better and worse approaches, as well as unambiguous errors.