Note on linear and affine functions
This course is called “linear models,” so it’s worth being clear about what makes a function linear. As we will see, in some sense it might be better to call the course “affine models.”
Suppose we have a function \(f(\cdot)\) from one space to another. For example, the input space can be \(\mathbb{R}^2\) (2-vectors) and the output can be scalars, \(\mathbb{R}\). We’ll write the input space as \(\mathbb{I}\) and the output as \(\mathbb{O}\). We assume that addition and scalar multiplication make sense in both \(\mathbb{I}\) and \(\mathbb{O}\).
Let \(\z\) and \(\z'\) be inputs in \(\mathbb{I}\), and let \(\alpha \in \mathbb{R}\). In some generality, a function \(f(\cdot)\) from one space to another is called linear if it satisfies, for all \(\alpha\), \(\z\), and \(\z'\):
\[ f(\cdot) \textrm{ is a linear function if and only if }\quad f(\alpha \z) = \alpha f(\z) \quad\textrm{and}\quad f(\z + \z') = f(\z) + f(\z'). \]
For example, fix \(\beta \in \mathbb{R}^2\), and let \(f(\z) = \beta^\trans \z\), so that \(\mathbb{I} = \mathbb{R}^2\) and \(\mathbb{O} = \mathbb{R}\). Then \(f(\cdot)\) is a linear function.
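To see this, we can verify both conditions of the definition directly:
\[ \begin{align*} f(\alpha \z) &= \beta^\trans (\alpha \z) = \alpha \beta^\trans \z = \alpha f(\z), \\ f(\z + \z') &= \beta^\trans (\z + \z') = \beta^\trans \z + \beta^\trans \z' = f(\z) + f(\z'). \end{align*} \]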
By this definition, the regression function \[ f(\z_n) = \beta_0 + \beta_1 \z_n + \res_n \]
is not linear in \(\z_n\). It’s not even linear if we take \(\x_n = (1, \z_n)\) and write \(\y_n = g(\x_n) = \beta^\trans \x_n + \res_n\), because the residual term does not scale with the input.
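Concretely, the first condition of linearity fails whenever \(\res_n \ne 0\):
\[ g(\alpha \x_n) = \alpha \beta^\trans \x_n + \res_n \ne \alpha \beta^\trans \x_n + \alpha \res_n = \alpha g(\x_n), \quad \textrm{unless } \alpha = 1 \textrm{ or } \res_n = 0. \]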
Of course, in informal language, we might describe these functions as “linear” simply because their graphs are straight lines. This is one justification for the class name “linear models”.
Formally, both \(f(\cdot)\) and \(g(\cdot)\) are “affine functions,” which means they are linear with an offset.
For the purposes of this class, we can define affine functions as:1
\[ g(\cdot) \textrm{ is an affine function if and only if } \textrm{there exists a } b \in \mathbb{O} \textrm{ such that } g(\cdot) - b \textrm{ is linear.} \]
Given this, we can see that the relationship \(\y_n = \beta_0 + \beta_1 \z_n + \res_n\) is affine when viewed as each of the maps
\[ \begin{align*} \beta &\mapsto \y_n\\ \z_n &\mapsto \y_n\\ \res_n &\mapsto \y_n. \end{align*} \]
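In each case, the offset \(b\) from the definition collects the terms that do not involve the argument of the map:
\[ \begin{align*} \beta &\mapsto \y_n: & \y_n - \res_n &= \beta_0 + \beta_1 \z_n \textrm{ is linear in } \beta, \\ \z_n &\mapsto \y_n: & \y_n - (\beta_0 + \res_n) &= \beta_1 \z_n \textrm{ is linear in } \z_n, \\ \res_n &\mapsto \y_n: & \y_n - (\beta_0 + \beta_1 \z_n) &= \res_n \textrm{ is linear in } \res_n. \end{align*} \]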
Maybe the course should be called “affine models”!
A final note: it is likely that these models are called “linear models” because the expectation of \(\y_n\) is, in fact, linear in both the regressors and the coefficients, under the assumption that the residuals have mean zero:
\[ \expect{}{\y_n} = \beta^\trans \x_n. \]
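This follows from the linearity of expectation and the mean-zero assumption on the residual:
\[ \expect{}{\y_n} = \expect{}{\beta^\trans \x_n + \res_n} = \beta^\trans \x_n + \expect{}{\res_n} = \beta^\trans \x_n. \]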
However, keeping with the spirit of the class, I prefer not to bake such an assumption into the definition of a linear model, reserving stochasticity for concrete situations.
Footnotes
As I did in lecture, one could instead write the linear transforms in a way that depends on \(b\), which is then asserted to exist. But that is a bit clumsy, and equivalent to the given definition. Wikipedia defines affine transformations in terms of invariance relations, whereas Wolfram appears to limit the definition to \(\mathbb{R}^d\). I’m not sure whether my definition here is official, but I think it strikes a nice balance.↩︎