Multilinear regression as loss minimization.
Goals
- Derive the general form of the ordinary least squares (OLS) estimator in matrix notation
- Review simple least squares derivation
- Review matrix notation
- Review vector calculus
- Derive the general OLS formula and show that the simple least squares is a special case
Simple least squares
Recall the simple least squares model:
\[ \begin{align*} \y_n :={}& \textrm{Response (e.g. sales price)} \\ \x_n :={}& \textrm{Regressor (e.g. square footage)}\\ \y_n ={}& \beta_2 \x_n + \beta_1 + \res_n \quad \textrm{Model (straight line through data)}. \end{align*} \tag{1}\]
Here are some key quantities and their names:
- \(\y_n\): The ‘response’
- \(\x_n\): The ‘regressors’ or ‘explanatory’ variables
For a linear model, we also have:
- \(\res_n\): The ‘error’ or ‘residual’
- \(\beta_2, \beta_1\): The ‘coefficients’, ‘parameters’, ‘slope and intercept’
We might also have estimates of these quantities:
- \(\betahat_p\): Estimate of \(\beta_p\)
- \(\reshat_n\): Estimate of \(\res_n\)
- \(\yhat_n\): A ‘prediction’ or ‘fitted value’ \(\yhat_n = \betahat_1 + \betahat_2 \x_n\)
When we form the estimator by minimizing the sum of squared residuals, we might call the estimate
- ‘Ordinary least squares’ (or ‘OLS’)
- ‘Least-squares’
- ‘Linear regression’
An estimate will implicitly be a least-squares estimate, but precisely what we mean by an estimate may have to come from context.
Note that for any values of \(\beta_1\) and \(\beta_2\), we get a value of the “error” or “residual” \(\res_n\):
\[ \res_n = \y_n - (\beta_2 \x_n + \beta_1). \]
The “least squares fit” is called this because we choose \(\beta_1\) and \(\beta_2\) to make \(\sumn \res_n^2\) as small as possible:
\[ \begin{align*} \textrm{Choose }\betahat_2,\betahat_1\textrm{ so that } \sumn \res_n^2 = \sumn \left( \y_n - (\betahat_2 \x_n + \betahat_1) \right)^2 \textrm{ is as small as possible.} \end{align*} \]
How do we do this for the simple least squares model? And what if we have more regressors?
Simple least squares estimator derivation
The quantity we’re trying to minimize is smooth and convex, so if a minimum exists, it must satisfy
\[ \begin{align*} \fracat{\partial \sumn \res_n^2}{\partial \beta_1}{\betahat_1, \betahat_2} ={}& 0 \quad\textrm{and} \\ \fracat{\partial \sumn \res_n^2}{\partial \beta_2}{\betahat_1, \betahat_2} ={}& 0. \end{align*} \]
When is it sufficient to set the gradient equal to zero to find a minimum?
These translate to (after dividing by \(-2 N\))
\[ \begin{align*} \meann \y_n - \betahat_2 \meann \x_n - \betahat_1 ={}& 0 \quad\textrm{and}\\ \meann \y_n \x_n - \betahat_2 \meann \x_n^2 - \betahat_1 \meann \x_n ={}& 0. \end{align*} \]
Let’s introduce the notation
\[ \begin{align*} \overline{y} ={}& \meann \y_n \\ \overline{x} ={}& \meann \x_n \\ \overline{xy} ={}& \meann \x_n \y_n \\ \overline{xx} ={}& \meann \x_n ^2. \end{align*} \]
Our estimator then must satisfy
\[ \begin{align*} \overline{x} \betahat_2 + \betahat_1 ={}& \overline{y} \quad\textrm{and}\\ \overline{xx} \betahat_2 + \overline{x} \betahat_1 ={}& \overline{xy}. \end{align*} \]
We have a linear system with two unknowns and two equations. An elegant way to solve them is to subtract \(\overline{x}\) times the first equation from the second, giving:
\[ \begin{align*} \overline{x} \betahat_1 - \overline{x} \betahat_1 + \overline{xx} \betahat_2 - \overline{x}^2 \betahat_2 ={}& \overline{xy} - \overline{x} \overline{y} \Leftrightarrow\\ \betahat_2 ={}& \frac{\overline{xy} - \overline{x} \, \overline{y}} {\overline{xx} - \overline{x}^2}, \end{align*} \]
as long as \(\overline{xx} - \overline{x}^2 \ne 0\).
In ordinary language, what does it mean for \(\overline{xx} - \overline{x}^2 = 0\)?
We can then plug this into the first equation giving
\[ \betahat_1 = \overline{y} - \betahat_2 \overline{x}. \]
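As a sanity check, here is a minimal R sketch on simulated data (all names and numbers below are made up for illustration) that computes \(\betahat_2\) and \(\betahat_1\) from the sample means and compares them to the output of `lm`.

```r
# Simulated data: any numeric vectors x and y of equal length would do.
set.seed(42)
n <- 100
x <- rnorm(n, mean = 2)
y <- 1 + 3 * x + rnorm(n)

xbar <- mean(x); ybar <- mean(y)
xybar <- mean(x * y); xxbar <- mean(x^2)

betahat_2 <- (xybar - xbar * ybar) / (xxbar - xbar^2)  # slope
betahat_1 <- ybar - betahat_2 * xbar                   # intercept

c(betahat_1, betahat_2)
coef(lm(y ~ x))   # should agree up to numerical error
```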
Matrix multiplication version
Alternatively, these two equations can be written in matrix form as
\[ \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{xx} \end{pmatrix} \begin{pmatrix} \betahat_1 \\ \betahat_2 \end{pmatrix} = \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix} \tag{2}\]
Recall that there is a special matrix that allows us to get an expression for \(\betahat_1\) and \(\betahat_2\):
\[ \begin{align*} \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{xx} \end{pmatrix}^{-1} = \frac{1}{\overline{xx} - \overline{x}^2} \begin{pmatrix} \overline{xx} & - \overline{x} \\ -\overline{x} & 1 \end{pmatrix} \end{align*} \]
This matrix is called the “inverse” because
\[ \begin{align*} \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{xx} \end{pmatrix}^{-1} \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{xx} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \end{align*} \]
Verify the preceding property.
Multiplying both sides of Equation 2 by the matrix inverse gives
\[ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \betahat_1 \\ \betahat_2 \end{pmatrix} = \begin{pmatrix} \betahat_1 \\ \betahat_2 \end{pmatrix} = \frac{1}{\overline{xx} - \overline{x}^2} \begin{pmatrix} \overline{xx} & - \overline{x} \\ -\overline{x} & 1 \end{pmatrix} \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}. \]
From this we can read off the familiar answer
\[ \begin{align*} \betahat_2 ={}& \frac{\overline{xy} - \overline{x}\,\overline{y}}{\overline{xx} - \overline{x}^2}\\ \betahat_1 ={}& \frac{\overline{xx}\,\overline{y} - \overline{xy}\,\overline{x}}{\overline{xx} - \overline{x}^2}\\ ={}& \frac{\overline{xx}\,\overline{y} - \overline{x}^2 \overline{y} + \overline{x}^2 \overline{y} - \overline{xy}\,\overline{x}} {\overline{xx} - \overline{x}^2}\\ ={}& \overline{y} - \frac{\overline{xy}\,\overline{x} - \overline{x}^2 \overline{y}} {\overline{xx} - \overline{x}^2} \\ ={}& \overline{y} - \betahat_2 \overline{x}. \end{align*} \]
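The same numbers can be obtained by solving Equation 2 numerically rather than by hand. Here is a short sketch (again with simulated data) that builds the \(2 \times 2\) matrix of sample means and solves the system with R’s `solve`.

```r
set.seed(1)
x <- rnorm(50)
y <- 2 - x + rnorm(50)
xbar <- mean(x); ybar <- mean(y); xybar <- mean(x * y); xxbar <- mean(x^2)

A <- matrix(c(1, xbar, xbar, xxbar), nrow = 2, byrow = TRUE)
b <- c(ybar, xybar)

solve(A, b)      # (betahat_1, betahat_2), solving the linear system directly
solve(A) %*% b   # same answer, using the explicit 2x2 inverse
coef(lm(y ~ x))  # agrees with lm
```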
Matrix notation
The preceding formula came from combining the equations that set the univariate gradients equal to zero, and then recognizing a matrix equation. We can in fact do both at the same time! But first we need some notation.
Here is a formal definition of the type of model that we will study for the vast majority of the semester:
\[ \begin{align*} \y_n ={}& \beta_1 \x_{n1} + \beta_2 \x_{n2} + \ldots + \beta_P \x_{nP} + \res_{n}, \quad\textrm{For }n=1,\ldots,N. \end{align*} \tag{3}\]
I will always use \(N\) for the number of observed data points, and \(P\) for the dimension of the regression vector.
Equation 3 is a general form of simpler cases. For example, if we take \(\x_{n1} \equiv 1\), \(\x_{n2}= \x_n\) to be some scalar, and \(P = 2\), then Equation 3 becomes Equation 1:
\[ \begin{align*} \y_n ={}& \beta_1 + \beta_2 \x_{n} + \res_{n}, \quad\textrm{For }n=1,\ldots,N. \end{align*} \]
The residuals \(\res_n\) measure the “misfit” of the line. If you know \(\beta_1, \ldots, \beta_P\), then you can compute
\[ \begin{align*} \res_n ={}& \y_n - (\beta_1 \x_{n1} + \beta_2 \x_{n2} + \ldots + \beta_P \x_{nP}). \end{align*} \]
But in general we only observe \(\y_n\) and \(\x_{n1}, \ldots, \x_{nP}\), and we choose \(\beta_1, \ldots, \beta_P\) to make the residuals small. (How we do this precisely will be something we talk about at great length.)
The general form of Equation 3 can be written more compactly using matrix and vector notation. Specifically, if we let
\[ \begin{align*} \xv_n := \begin{pmatrix} \x_{n1} \\ \x_{n2} \\ \vdots \\ \x_{nP} \end{pmatrix} \quad \textrm{and} \quad \betav := \begin{pmatrix} \beta_{1} \\ \beta_2 \\ \vdots \\ \beta_{P} \end{pmatrix} \end{align*} \]
Bold lowercase variables are column vectors (unless otherwise specified).
Recall that the “transpose” operator \((\cdot)^\trans\) flips the row and columns of a matrix. For example,
\[ \begin{align*} \xv_n ^\trans = \begin{pmatrix} \x_{n1} & \x_{n2} & \ldots & \x_{nP} \end{pmatrix}. \end{align*} \]
By matrix multiplication rules,
\[ \begin{align*} \xv_n^\trans \betav = \begin{pmatrix} \x_{n1} & \x_{n2} & \ldots & \x_{nP} \end{pmatrix} \begin{pmatrix} \beta_{1} \\ \beta_2 \\ \vdots \\ \beta_{P} \end{pmatrix} = \beta_1 \x_{n1} + \beta_2 \x_{n2} + \ldots + \beta_P \x_{nP}. \end{align*} \]
I have written \(\xv_n^\trans \betav\) for the “dot product” or “inner product” between \(\xv_n\) and \(\betav\). Writing it in this way clarifies the relationship with matrix notation below.
There are many other ways to denote inner products in the literature, including \(\xv_n \cdot \betav\) and \(\langle \xv_n, \betav \rangle\).
Then we can compactly write
\[ \begin{align*} \y_n ={}& \xv_n ^\trans \betav + \res_{n}, \quad\textrm{For }n=1,\ldots,N. \end{align*} \]
We can compactify it even further if we stack the \(N\) observations: \[ \begin{align*} \y_1 ={}& \xv_1 ^\trans \betav + \res_{1} \\ \y_2 ={}& \xv_2 ^\trans \betav + \res_{2} \\ \vdots\\ \y_N ={}& \xv_N ^\trans \betav + \res_{N} \\ \end{align*} \]
As before we can stack the responses and residuals:
\[ \begin{align*} \Y := \begin{pmatrix} \y_{1} \\ \y_2 \\ \vdots \\ \y_{N} \end{pmatrix} \quad \textrm{and} \quad \resv := \begin{pmatrix} \res_{1} \\ \res_2 \\ \vdots \\ \res_{N} \end{pmatrix} \end{align*} \]
We can also stack the regressors:
\[ \begin{align*} \X := \begin{pmatrix} \x_{11} & \x_{12} & \ldots & \x_{1P}\\ \x_{21} & \x_{22} & \ldots & \x_{2P}\\ \vdots\\ \x_{n1} & \x_{n2} & \ldots & \x_{nP}\\ \vdots\\ \x_{N1} & \x_{N2} & \ldots & \x_{NP} \end{pmatrix} = \begin{pmatrix} \xv_{1}^\trans \\ \xv_{2}^\trans \\ \vdots \\ \xv_n^\trans \\ \vdots \\ \xv_{N}^\trans \end{pmatrix} \end{align*} \]
I will use upper case bold letters for multi-dimensional matrices like \(\X\). But I may also use upper case bold letters even when the quantity could also be a column vector, when I think it’s more useful to think of the quantity as a matrix with a single column. Examples are \(\Y\) above, or \(\X\) when \(P = 1\).
Note that by matrix multiplication rules,
\[ \begin{align*} \X = \begin{pmatrix} \xv_{1}^\trans \\ \xv_{2}^\trans \\ \vdots \\ \xv_n^\trans \\ \vdots \\ \xv_{N}^\trans \end{pmatrix} \quad\quad\quad \X \betav = \begin{pmatrix} \xv_{1}^\trans\betav \\ \xv_{2}^\trans\betav \\ \vdots \\ \xv_n^\trans\betav \\ \vdots \\ \xv_{N}^\trans\betav \end{pmatrix} \end{align*} \]
so we end up with the extremely tidy expression
\[ \begin{align*} \y_n ={}& \beta_1 \x_{n1} + \beta_2 \x_{n2} + \ldots + \beta_P \x_{nP} + \res_{n}, \quad\textrm{For }n=1,\ldots,N \\\\&\textrm{is the same as}\quad\\\\ \Y ={}& \X \betav + \resv. \end{align*} \tag{4}\]
In the case of simple least squares, we can write
\[ \begin{align*} \X := \begin{pmatrix} 1 & \x_{1}\\ 1 & \x_{2}\\ \vdots & \vdots\\ 1 & \x_{N}\\ \end{pmatrix}, \end{align*} \tag{5}\]
and verify that the \(n\)–th row of Equation 4 is the same as Equation 1.
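To make the construction concrete, here is a tiny R sketch (with made-up numbers) that builds the \(\X\) of Equation 5 and checks that row \(n\) of \(\X \betav\) is \(\beta_1 + \beta_2 \x_n\).

```r
x <- c(1.2, 0.7, 3.1, 2.0)   # made-up regressor values
X <- cbind(1, x)             # N x 2 matrix: a column of ones and the column x
beta <- c(2, 0.5)            # (beta_1, beta_2), arbitrary coefficients

# Row n of X %*% beta equals beta_1 + beta_2 * x_n:
all.equal(as.numeric(X %*% beta), beta[1] + beta[2] * x)
```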
Least squares in matrix notation
Using our tidy expression Equation 4, we can easily write out the sum of the squared errors as
\[ \begin{aligned} \sumn \res_n^2 ={}& \resv^\trans \resv = (\Y - \X \betav)^\trans (\Y - \X \betav) \\={}& \Y^\trans \Y - \betav^\trans \X^\trans \Y - \Y^\trans \X \betav + \betav^\trans \X^\trans \X \betav \\={}& \Y^\trans \Y - 2 \Y^\trans \X \betav + \betav^\trans \X^\trans \X \betav. \end{aligned} \]
This is a quadratic function of the vector \(\betav\). We wish to find the minimum of this quantity as a function of \(\betav\). We might hope that the minimum occurs at a point where the gradient of this expression is zero.
Rather than compute the univariate derivative with respect to each component, we can compute the multivariate gradient with respect to the vector.
Let’s recall some facts from vector calculus.
Take \(\zv \in \mathbb{R}^P\) to be a \(P\)–vector, and let \(f(\zv)\) be a scalar–valued function of the vector \(\zv\). We write
\[ \frac{\partial f(\zv)}{\partial \zv} = \begin{pmatrix} \frac{\partial}{\partial \z_1} f(\zv) \\ \vdots\\ \frac{\partial}{\partial \z_P} f(\zv) \\ \end{pmatrix}. \]
That is, the partial \(\frac{\partial f(\zv)}{\partial \zv}\) is a \(P\)–vector of the stacked univariate derivatives.
Recall a couple of rules from vector calculus. Let \(\vv\) denote a \(P\)–vector and \(\A\) a symmetric \(P \times P\) matrix. Then
\[ \begin{align*} \frac{\partial \vv^\trans \zv}{\partial \zv} = \vv \quad\textrm{and}\quad \frac{\partial \zv^\trans \A \zv}{\partial \zv} = 2 \A \zv. \end{align*} \]
Prove these results above using univariate derivatives and our stacking convention.
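A finite–difference check is not a proof, but it is a useful way to catch mistakes. Here is a sketch in R; the helper `num_grad` is an ad hoc numerical differentiator written for this check, not a standard library function.

```r
set.seed(0)
P <- 4
v <- rnorm(P)
A <- crossprod(matrix(rnorm(P * P), P, P))   # a symmetric matrix
z <- rnorm(P)

# Central finite-difference approximation to the gradient of f at z.
num_grad <- function(f, z, h = 1e-6) {
  sapply(seq_along(z), function(p) {
    e <- replace(numeric(length(z)), p, h)
    (f(z + e) - f(z - e)) / (2 * h)
  })
}

max(abs(num_grad(function(z) sum(v * z), z) - v))                   # ~ 0
max(abs(num_grad(function(z) t(z) %*% A %*% z, z) - 2 * A %*% z))   # ~ 0
```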
Applying these two rules to our least squares objective,
\[ \begin{aligned} \frac{\partial\resv^\trans \resv }{\partial \betav} ={}& \frac{\partial }{\partial \betav} \Y^\trans \Y - 2 \frac{\partial }{\partial \betav} \Y^\trans \X \betav + \frac{\partial }{\partial \betav} \betav^\trans \X^\trans \X \betav \\ ={}& 0 - 2 \X^\trans \Y + 2 \X^\trans \X \betav. \end{aligned} \]
Assuming our estimator \(\betavhat\) sets these partial derivatives equal to zero, we then get
\[ \begin{align*} \X^\trans \X \betavhat ={}& \X^\trans \Y. \end{align*} \tag{6}\]
This is a set of \(P\) equations in \(P\) unknowns. If it is not degenerate, one can solve for \(\betavhat\). That is, if the matrix \(\X^\trans \X\) is invertible, then we can multiply both sides of Equation 6 by \((\X^\trans \X)^{-1}\) to get
\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y \tag{7}\]
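Here is a small R sketch of Equation 7 on simulated data (the particular design and coefficients are arbitrary), checking the formula against `lm`. Note that `solve(t(X) %*% X, t(X) %*% Y)` solves Equation 6 directly, which is numerically preferable to forming the inverse explicitly.

```r
set.seed(3)
N <- 200
X <- cbind(1, rnorm(N), runif(N))     # N x 3 design: a constant and two regressors
beta_true <- c(1, 2, -0.5)
Y <- drop(X %*% beta_true) + rnorm(N)

betahat <- solve(t(X) %*% X, t(X) %*% Y)   # solves X'X betahat = X'Y (Equation 6)
betahat
coef(lm(Y ~ X - 1))   # same fit; "- 1" because X already contains a constant column
```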
We’re going to be talking again and again about the “ordinary least squares” (“OLS”) problem \[ \betavhat = \argmin{\betav} \resv^\trans \resv \quad\textrm{where}\quad \Y = \X \betav + \resv \quad\textrm{where}\quad \y_n = \xv_n^\trans \betav + \res_n \textrm{ for all }n=1,\ldots,N. \]
It will be nice to have some shorthand for this problem so I don’t have to write this out every time. All of the following will be understood as shorthand for the preceding problem. In each, the fact that \(\betavhat\) minimizes the sum of squared residuals is implicit.
\[ \begin{aligned} \Y \sim{}& \X \betav + \resv& \textrm{Only the least squares criterion is implicit}\\ \Y \sim{}& \X \betav & \textrm{$\resv$ implicit}\\ \Y \sim{}& \X & \textrm{$\resv$, $\betav$ implicit}\\ \y_n \sim{}& \xv_n^\trans \betav & \textrm{$\res_n$, $N$ implicit}\\ \y_n \sim{}& \xv_n^\trans \betav + \res_n & \textrm{$N$ implicit}\\ \y_n \sim{}& \xv_n & \textrm{$\res_n$, $\betav$, $N$ implicit}\\ \y_n \sim{}& \x_{n1} + \x_{n2} + \ldots + \x_{nP} & \textrm{$\res_n$, $\betav$, $N$ implicit}\\ \end{aligned} \]
(The final shorthand is closest to the notation for the R `lm` function.) This is convenient because, for example, we can discuss certain properties of the regression \(\y_n \sim \xv_n\) without committing to which symbol we use for the coefficients. Symbols for the missing pieces will hopefully be clear from context as necessary.
Unlike many other regression texts (and the `lm` function), I will not necessarily assume that a constant is included in the regression. One can always take a generic regression \(\y_n \sim \xv_n\) to include a constant by assuming that one of the entries of \(\xv_n\) is one. At some points my convention of not including a constant by default will lead to formulas that may be at odds with some textbooks. But these differences are superficial, and are, in my mind, more than made up for by the generality and simplicity of treating constants as just another regressor.
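For example, in R you can either let `lm` add the intercept or include the column of ones yourself; the two calls below (on simulated data) give the same fit.

```r
set.seed(7)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)

coef(lm(y ~ x))                 # lm includes an intercept by default
coef(lm(y ~ cbind(1, x) - 1))   # same fit, with the constant as an explicit regressor
```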
Some examples
The formulas given in Equation 6 and Equation 7 are enough to calculate every least squares estimator we’ll encounter. But we’d like to have intuition for the meaning of the formulas, and for that it will be useful to start by applying the formulas in some familiar settings.
Recall that any fit of the form \(\X \betav\) that minimizes the sum of squared errors must take the form \(\X \betavhat\) where \(\betavhat\) satisfies
\[ \begin{align*} \X^\trans \X \betavhat ={}& \X^\trans \Y. \end{align*} \]
However, there may in general be many \(\betavhat\) that satisfy the preceding equation. But if \(\X^\trans \X\) is invertible, then \[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y, \]
and there is only one \(\betavhat\) that minimizes the sum of squared errors.
Let’s take a careful look at what \(\X^\trans \X\) is measuring, what it means for it to be invertible, as well as what it means when it is not invertible.
The sample mean
I will use \(\onev\) to denote a vector full of ones. Usually it will be an \(N\)–vector, but sometimes its dimension will just be implicit. Similarly, \(\zerov\) is a vector of zeros.
We showed earlier that the sample mean is a special case of the regression \(\y_n \sim 1 \cdot \beta\). This can be expressed in matrix notation by taking \(\X = \onev\) as an \(N\times 1\) vector. We then have
\[ \X^\trans \X = \onev^\trans \onev = \sumn 1 \cdot 1 = N, \]
so \(\X^\trans \X\) is invertible as long as \(N > 0\) (i.e., if you have at least one datapoint), with \((\X^\trans \X)^{-1} = 1/N\). We also have
\[ \X^\trans \Y = \onev^\trans \Y = \sumn 1 \cdot \y_n = N \ybar, \]
and so
\[ \betahat = (\X^\trans \X)^{-1} \X^\trans \Y = (\onev^\trans \onev)^{-1} \onev^\trans \Y = \frac{N \ybar}{N} = \ybar, \]
as expected.
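The same computation in R, with a few made-up response values:

```r
y <- c(2.0, 4.5, 3.5, 6.0)                  # any response values
X <- matrix(1, nrow = length(y), ncol = 1)  # the design matrix is one column of ones

solve(t(X) %*% X) %*% t(X) %*% y   # (1/N) * sum(y)
mean(y)                            # the same number
```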
A single regressor
Suppose that we regress \(\y_n \sim \x_n\) where \(\x_n\) is a scalar. Let’s suppose that \(\expect{\x_n} = 0\) and \(\var{\x_n} = \sigma^2 > 0\). We have
\[ \X = \begin{pmatrix} \x_1 \\ \x_2 \\ \vdots\\ \x_N \end{pmatrix} \]
so
\[ \X^\trans \X = \sumn \x_n^2. \]
Depending on the distribution of \(\x_n\), it may be possible for \(\X^\trans \X\) to be non–invertible!
Produce a distribution for \(\x_n\) where \(\X^\trans \X\) is non–invertible with positive probability for any \(N\).
However, as \(N \rightarrow \infty\), \(\frac{1}{N} \X^\trans \X \rightarrow \sigma^2\) by the LLN, and since \(\sigma^2 > 0\), \(\frac{1}{N} \X^\trans \X\) will be invertible with probability approaching one as \(N\) goes to infinity.
One–hot encodings
We discussed one-hot encodings in the context of the Ames housing data. Suppose we have a column \(k_n \in \{ g, e\}\) indicating whether a kitchen is “good” or “excellent”. A one–hot encoding of this categorical variable is given by
\[ \begin{aligned} \x_{ng} = \begin{cases} 1 & \textrm{ if }k_n = g \\ 0 & \textrm{ if }k_n \ne g \\ \end{cases} && \x_{ne} = \begin{cases} 1 & \textrm{ if }k_n = e \\ 0 & \textrm{ if }k_n \ne e \\ \end{cases} \end{aligned}. \]
We can then regress \(\y_n \sim \beta_g \x_{ng} + \beta_e \x_{ne} = \xv_n^\trans \betav\). The corresponding \(\X\) matrix might look like
\[ \begin{aligned} \mybold{k} = \begin{pmatrix} g \\ e \\ g \\ g \\ \vdots \end{pmatrix} && \X = (\xv_g \quad \xv_e) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ \vdots \end{pmatrix} \end{aligned} \]
Note that \(\xv_g^\trans \xv_g\) is just the number of entries with \(k_n = g\), and \(\xv_g^\trans \xv_e = 0\) because a kitchen is either good or excellent but never both.
We then have
\[ \X^\trans \X = \begin{pmatrix} \xv_g^\trans \xv_g & \xv_g^\trans \xv_e \\ \xv_e^\trans \xv_g & \xv_e^\trans \xv_e \\ \end{pmatrix} = \begin{pmatrix} N_g & 0 \\ 0 & N_e \\ \end{pmatrix}. \]
Then \(\X^\trans \X\) is invertible as long as \(N_g > 0\) and \(N_e > 0\), that is, as long as we have at least one observation of each kitchen type, and
\[ \left(\X^\trans \X\right)^{-1} = \begin{pmatrix} \frac{1}{N_g} & 0 \\ 0 & \frac{1}{N_e} \\ \end{pmatrix}. \]
Similarly, \(\xv_g^\trans \Y\) is just the sum of entries of \(\Y\) where \(k_n = g\), with the analogous conclusion for \(\xv_e\). From this we recover the result that
\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y = \begin{pmatrix} \frac{1}{N_g} & 0 \\ 0 & \frac{1}{N_e} \\ \end{pmatrix} \begin{pmatrix} \sum_{n: k_n=g} \y_n \\ \sum_{n: k_n=e} \y_n \end{pmatrix} = \begin{pmatrix} \frac{1}{N_g} \sum_{n: k_n=g} \y_n \\ \frac{1}{N_e} \sum_{n: k_n=e} \y_n \end{pmatrix}. \]
If we let \(\ybar_e\) and \(\ybar_g\) denote the sample means within each group, we have shown that \(\betahat_g = \ybar_g\) and \(\betahat_e = \ybar_e\), as we proved before without using the matrix formulation.
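Here is a sketch of this calculation in R, with simulated kitchen labels and prices (the numbers are invented for illustration).

```r
set.seed(11)
k <- sample(c("g", "e"), size = 50, replace = TRUE)   # kitchen quality labels
y <- ifelse(k == "g", 10, 12) + rnorm(50)             # made-up responses

X <- cbind(x_g = as.numeric(k == "g"),
           x_e = as.numeric(k == "e"))                # one-hot encoding

t(X) %*% X                      # diagonal matrix diag(N_g, N_e)
solve(t(X) %*% X, t(X) %*% y)   # the two within-group means
tapply(y, k, mean)              # same numbers (tapply orders groups alphabetically)
```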
One–hot encodings and constants
Recall in the Ames housing data, we ran the following two regressions:
\[ \begin{aligned} \y_n \sim{}& \beta_e \x_{ne} + \beta_g \x_{ng} \\ \y_n \sim{}& \gamma_0 + \gamma_g \x_{ng} + \res_n = \z_n^\trans \gammav, \end{aligned} \] where I take \(\gammav = (\gamma_0, \gamma_g)^\trans\) and \(\z_n = (1, \x_{ng})^\trans\).
We found using R that the best fits were given by
\[ \begin{aligned} \betahat_e =& \ybar_e & \betahat_g =& \ybar_g \\ \gammahat_0 =& \ybar_e & \gammahat_g =& \ybar_g - \ybar_e \\ \end{aligned} \]
We can compute the latter by constructing the \(\Z\) matrix whose rows are \(\z_n^\trans\). (We use \(\Z\) to differentiate the \(\X\) matrix from the previous example.) Using similar reasoning to the one–hot encoding, we see that
\[ \Z^\trans \Z = \begin{pmatrix} N & N_g \\ N_g & N_g \end{pmatrix}. \]
This is invertible as long as \(0 < N_g < N\), i.e., as long as we observe at least one kitchen of each type. We have
\[ (\Z^\trans \Z)^{-1} = \frac{1}{N_g (N - N_g)} \begin{pmatrix} N_g & -N_g \\ -N_g & N \end{pmatrix} \quad\textrm{and}\quad \Z^\trans \Y = \begin{pmatrix} \sumn \y_n \\ \sum_{n: k_n=g} \y_n \\ \end{pmatrix} \]
It is possible (but a little tedious) to prove \(\gammahat_0 = \ybar_e\) and \(\gammahat_g = \ybar_g - \ybar_e\) using these formulas. But an easier way to see it is as follows.
Note that \(\x_{ne} + \x_{ng} = 1\). That means we can always re-write the regression with a constant as
\[ \y_n \sim \gamma_0 + \gamma_g \x_{ng} = \gamma_0 (\x_{ne} + \x_{ng}) + \gamma_g \x_{ng} = \gamma_0 \x_{ne} + (\gamma_0 + \gamma_g) \x_{ng}. \]
Now, we already know from the one–hot encoding case that the sum of squared residuals is minimized by setting \(\gammahat_0 = \ybar_e\) and \(\gammahat_0 + \gammahat_g = \ybar_g\). We can then solve for \(\gammahat_g = \ybar_g - \ybar_e\), as expected.
This is a case where we have two regressions whose regressors are invertible linear combinations of one another:
\[ \zv_n = \begin{pmatrix} 1 \\ \x_{ng} \end{pmatrix} = \begin{pmatrix} \x_{ne} + \x_{ng} \\ \x_{ng} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix} \begin{pmatrix} \x_{ng} \\ \x_{ne} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix} \xv_n. \]
It follows that if you can achieve a least squares fit with \(\xv_n^\trans \betavhat\), you can achieve exactly the same fit with
\[ \betavhat^\trans \xv_n = \betavhat^\trans \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix}^{-1} \zv_n, \]
which can be achieved by taking
\[ \gammavhat^\trans = \betavhat^\trans \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix}^{-1} \Rightarrow \gammavhat = \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix}^{-T} \betavhat = \frac{1}{-1} \begin{pmatrix} 0 & -1 \\ -1 & 1\\ \end{pmatrix} \betavhat = \begin{pmatrix} \betahat_2 \\ \betahat_1 - \betahat_2 \\ \end{pmatrix}, \]
exactly as expected.
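A quick numerical check of this transformation, again with simulated kitchen data:

```r
set.seed(13)
k <- sample(c("g", "e"), size = 80, replace = TRUE)
y <- ifelse(k == "g", 10, 12) + rnorm(80)
x_g <- as.numeric(k == "g"); x_e <- as.numeric(k == "e")

betahat  <- coef(lm(y ~ x_g + x_e - 1))   # one-hot parameterization: (betahat_g, betahat_e)
gammahat <- coef(lm(y ~ x_g))             # constant + x_g: (gammahat_0, gammahat_g)

A <- matrix(c(1, 1, 1, 0), 2, 2, byrow = TRUE)               # z_n = A %*% x_n
all.equal(unname(gammahat), drop(solve(t(A)) %*% betahat))   # gammahat = A^{-T} betahat
```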
We will see this is an entirely general result: when regressions are related by invertible linear transformations of regressors, the fit does not change, but the optimal coefficients are linear transforms of one another.
Redundant regressors
Suppose we run the (silly) regression \(\y_n \sim \alpha \cdot 1 + \gamma \cdot 3 + \res_n\). That is, we regress on both the constant \(1\) and the constant \(3\). We have
\[ \X = \begin{pmatrix} 1 & 3 \\ 1 & 3 \\ 1 & 3 \\ \vdots \end{pmatrix} = (\onev \quad 3 \onev) \]
and so
\[ \X^\trans \X = \begin{pmatrix} \onev^\trans \onev & 3 \onev^\trans \onev \\ 3 \onev^\trans \onev & 9 \onev^\trans \onev \\ \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \]
This is not invertible (the second row is \(3\) times the first, and the determinant is \(9 - 3 \cdot 3 = 0\)). So \(\betavhat\) is not defined. What went wrong?
One way to see this is to define \(\beta = \alpha + 3 \gamma\) and write
\[ \y_n = (\alpha + 3 \gamma) + \res_n = \beta + \res_n. \]
There is obviously only one \(\betahat\) that minimizes \(\sumn \res_n^2\), namely \(\betahat = \ybar\). But there is an infinite set of choices for \(\alpha\) and \(\gamma\) satisfying
\[ \alpha + 3 \gamma = \betahat = \ybar. \]
Specifically, for any value of \(\gamma\) we can take \(\alpha = \ybar - 3 \gamma\), leaving \(\beta\) unchanged. All of these choices for \(\alpha,\gamma\) achieve the same \(\sumn \res_n^2\)! So the least squares criterion cannot distinguish among them.
In general, this is what it means for \(\X^\trans \X\) to be non–invertible. It happens precisely when there are redundant regressors, and many different regression coefficients result in the same fit.
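You can see this numerically: in R, `solve` refuses to invert \(\X^\trans\X\), and `lm` drops the redundant column and reports `NA` for its coefficient.

```r
set.seed(17)
N <- 30
y <- rnorm(N, mean = 5)
X <- cbind(rep(1, N), rep(3, N))   # regress on the constants 1 and 3

t(X) %*% X            # N * matrix(c(1, 3, 3, 9), 2, 2), which is singular
# solve(t(X) %*% X)   # errors: the matrix is singular
coef(lm(y ~ X - 1))   # one coefficient estimated, the redundant one reported as NA
```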
Redundant regressors and zero eigenvalues
In fact, \(\X^\trans \X\) is non–invertible precisely when it has a zero eigenvalue. In the preceding example, we can see that
\[ \X^\trans \X \begin{pmatrix} 3 \\ -1 \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \begin{pmatrix} 3 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \]
so \((3, -1)^\trans\) is a zero eigenvector. (In general you might find this by numerical eigenvalue decomposition, but in this case you can just guess the zero eigenvector.)
Going back to Equation 6, we see that this means that
\[ \X^\trans \Y = (\X^\trans \X) \begin{pmatrix} \alphahat \\ \gammahat \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \begin{pmatrix} \alphahat \\ \gammahat \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \left( \begin{pmatrix} \alphahat \\ \gammahat \end{pmatrix} + C \begin{pmatrix} 3 \\ -1 \end{pmatrix} \right) \]
for any value of \(C\). This means there is an infinite set of “optimal” values, all of which set the gradient of the loss to zero, and all of which have the same value of the loss function (i.e. achieve the same fit). And you can check that this family of values is exactly the set that satisfies \(\alpha + 3 \gamma = \betahat = \ybar\), since
\[ \alpha + 3 \gamma = \begin{pmatrix} 1 & 3 \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} \quad\quad\textrm{and}\quad\quad \begin{pmatrix} 1 & 3 \end{pmatrix} \begin{pmatrix} 3 \\ -1 \end{pmatrix} = 0. \]
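The following sketch checks these claims numerically for a tiny example; the specific \(N\) and coefficients are arbitrary.

```r
N <- 5
X <- cbind(rep(1, N), rep(3, N))
XtX <- t(X) %*% X

eigen(XtX)$values        # one positive eigenvalue and one (numerically) zero eigenvalue
eigen(XtX)$vectors[, 2]  # proportional to (3, -1), up to sign and normalization

# Moving along the zero eigenvector does not change the fitted values:
b <- c(2, 1)
max(abs(X %*% b - X %*% (b + 10 * c(3, -1))))   # exactly zero
```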
Soon, we will see that this is a general result: when \(\X^\trans \X\) is not invertible, that means there are many equivalent least squares fits, all characterized precisely by the zero eigenvectors of \(\X^\trans \X\).
Zero variance regressors
An example of redundant regressors occurs when the sample variance of \(\x_n\) is zero and a constant is included in the regression. Specifically, suppose that \(\overline{xx} - \overline{x}^2 = 0\).
Prove that \(\overline{xx} - \overline{x}^2 = 0\) means \(\x_n\) is a constant with \(\x_n = \xbar\). Hint: look at the sample variance of \(\x_n\).
Let’s regress \(\y_n \sim \beta_1 + \beta_2 \x_n\).
Since the sample variance of \(\x_n\) is zero, every \(\x_n\) is equal to \(\xbar\), and we can rewrite our estimating equation as
\[ \y_n = \beta_1 + \beta_2 \x_n + \res_n = (\beta_1 + \beta_2 \xbar) + \res_n. \]
We’re thus in the previous setting with \(\xbar\) in place of the number \(3\).
Orthogonal regressors
Suppose we have regressors such that the columns of \(\X\) are orthonormal. This seems strange at first, since we usually specify the rows of the regressors, not the columns. But in fact we have seen a near–example with one–hot encodings, which are defined row–wise, but which produce orthogonal column vectors in \(\X\). If we divide each one–hot column by the square root of the number of ones it contains, we produce a unit–length column vector.
If \(\X\) has orthonormal columns, then \(\X^\trans \X = \id\), the identity matrix, and so
\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y = \X^\trans \Y. \]
This is of course the same answer we would have gotten if we had tried to write \(\Y\) in the basis of the column vectors of \(\X\):
\[ \begin{aligned} \Y ={}& \betahat_1 \X_{\cdot 1} + \ldots + \betahat_P \X_{\cdot P} = \X \betavhat \Rightarrow \\ \X^\trans \Y ={}& \X^\trans \X \betavhat = \betavhat \end{aligned} \]
This regression is particularly simple — each component of \(\betavhat\) depends only on its corresponding column of \(\X\).
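Numerically, an orthonormal design matrix is easy to manufacture, for example from the QR decomposition of a random matrix. The sketch below checks that \(\X^\trans\X\) is the identity and that \(\betavhat = \X^\trans \Y\).

```r
set.seed(21)
N <- 50; P <- 3
X <- qr.Q(qr(matrix(rnorm(N * P), N, P)))   # an N x P matrix with orthonormal columns
Y <- rnorm(N)

max(abs(t(X) %*% X - diag(P)))              # ~ 0: the columns are orthonormal
betahat_direct <- t(X) %*% Y                # Equation 7 reduces to X'Y
betahat_ols    <- solve(t(X) %*% X, t(X) %*% Y)
all.equal(betahat_direct, betahat_ols)
```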
Note that if each entry of \(\xv_n\) is mean zero, unit variance, and uncorrelated with the other entries, then \(\frac{1}{N} \X^\trans \X \rightarrow \id\) by the LLN. Such a regressor matrix is not typically orthogonal for any particular \(N\), but it approaches orthogonality as \(N\) grows.