Familiar regression examples in matrix form.
Goals
- Review some familiar examples of linear regression in matrix form, seeing that:
  - One–hot encodings give orthogonal column vectors of \(\X\)…
  - …and orthonormal column vectors make for easy regression problems!
  - Mean-zero, uncorrelated regressors give rise (after suitable scaling) to asymptotically orthonormal \(\X\) columns, and so to easy regression problems.
  - Linear transformations of regressors give equivalent regressions.
  - Redundant regressors give rise to OLS coefficients that are not unique.
Some examples
The formulas given in Equation 1 and Equation 2 are enough to calculate every least squares estimator we’ll encounter. But we’d like to have intuition for the meaning of the formulas, and for that it will be useful to start by applying the formulas in some familiar settings.
Recall that any fit of the form \(\X \betav\) that minimizes the sum of squared errors must take the form \(\X \betavhat\), where \(\betavhat\) satisfies
\[ \begin{align*} \X^\trans \X \betavhat ={}& \X^\trans \Y. \end{align*} \tag{1}\]
In general, there may be many \(\betavhat\) that satisfy the preceding equation. But if \(\X^\trans \X\) is invertible, then \[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y, \tag{2}\]
and there is only one \(\betavhat\) that minimizes the sum of squared errors.
Let’s take a careful look at what \(\X^\trans \X\) is measuring, what it means for it to be invertible, and what it means when it is not.
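As a quick numerical sanity check, here is a minimal R sketch (using simulated data with made-up coefficients) showing that solving the normal equations of Equation 1 reproduces the coefficients that `lm()` reports:

```r
# Minimal sketch with simulated data: solving the normal equations
# X^T X beta = X^T Y recovers the same coefficients as lm().
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                       # design matrix with an intercept column
betahat <- solve(t(X) %*% X, t(X) %*% y)    # solve Equation 1 directly

cbind(normal_equations = drop(betahat),
      lm = coef(lm(y ~ x1 + x2)))
```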
Working examples
The following examples will be used to illustrate the main ideas.
The sample mean
I will use \(\onev\) to denote a vector full of ones. Usually it will be an \(N\)–vector, but sometimes its dimension will just be implicit. Similarly, \(\zerov\) is a vector of zeros.
We showed earlier that the sample mean is a special case of the regression \(\y_n \sim 1 \cdot \beta\). This can be expressed in matrix notation by taking \(\X = \onev\), an \(N \times 1\) matrix (a single column of ones). We then have
\[ \X^\trans \X = \onev^\trans \onev = \sumn 1 \cdot 1 = N, \]
so \(\X^\trans \X\) is invertible as long as \(N > 0\) (i.e., if you have at least one datapoint), with \((\X^\trans \X)^{-1} = 1/N\). We also have
\[ \X^\trans \Y = \onev^\trans \Y = \sumn 1 \cdot \y_n = N \ybar, \]
and so
\[ \betahat = (\X^\trans \X)^{-1} \X^\trans \Y = (\onev^\trans \onev)^{-1} \onev^\trans \Y = \frac{N \ybar}{N} = \ybar, \]
as expected.
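Here is a minimal R sketch (with simulated data) confirming that the matrix formula collapses to the sample mean when \(\X = \onev\):

```r
# Minimal sketch with simulated data: with X a single column of ones,
# (X^T X)^{-1} X^T Y is just the sample mean of y.
set.seed(1)
y <- rnorm(50, mean = 3)
X <- matrix(1, nrow = length(y), ncol = 1)   # the all-ones design

betahat <- solve(t(X) %*% X, t(X) %*% y)     # = (1/N) * sum(y)
c(betahat = drop(betahat), ybar = mean(y))
```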
A single (possibly continuous) regressor
Suppose that we regress \(\y_n \sim \x_n\) where \(\x_n\) is a scalar. Let’s suppose that \(\expect{\x_n} = 0\) and \(\var{\x_n} = \sigma^2 > 0\). We have
\[ \X = \begin{pmatrix} \x_1 \\ \x_2 \\ \vdots\\ \x_N \end{pmatrix} \]
so
\[ \X^\trans \X = \sumn \x_n^2. \]
Depending on the distribution of \(\x_n\), it may be possible for \(\X^\trans \X\) to be non–invertible!
Exercise: Produce a distribution for \(\x_n\) for which \(\X^\trans \X\) is non–invertible with positive probability for any \(N\).
However, as \(N \rightarrow \infty\), \(\frac{1}{N} \X^\trans \X \rightarrow \sigma^2\) by the LLN, and since \(\sigma^2 > 0\), \(\frac{1}{N} \X^\trans \X\) will be invertible with probability approaching one as \(N\) goes to infinity.
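A small R sketch (using simulated normal regressors, an arbitrary choice) illustrates the LLN claim:

```r
# Minimal sketch with simulated data: for a mean-zero scalar regressor,
# (1/N) X^T X = (1/N) sum(x_n^2) approaches sigma^2 as N grows.
set.seed(2)
sigma2 <- 4
for (N in c(10, 100, 10000)) {
  x <- rnorm(N, mean = 0, sd = sqrt(sigma2))
  cat(sprintf("N = %5d   (1/N) X^T X = %6.3f   (sigma^2 = %.1f)\n",
              N, sum(x^2) / N, sigma2))
}
```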
One–hot encodings
We discussed one-hot encodings in the context of the Ames housing data. Suppose we have a categorical variable \(k_n \in \{ g, e\}\) indicating whether a kitchen is “good” or “excellent”. A one–hot encoding of this categorical variable is given by
\[ \begin{aligned} \x_{ng} = \begin{cases} 1 & \textrm{ if }k_n = g \\ 0 & \textrm{ if }k_n \ne g \\ \end{cases} && \x_{ne} = \begin{cases} 1 & \textrm{ if }k_n = e \\ 0 & \textrm{ if }k_n \ne e \\ \end{cases} \end{aligned}. \]
We can then regress \(\y_n \sim \beta_g \x_{ng} + \beta_e \x_{ne} = \x_n^\trans \betav\). The corresponding \(\X\) matrix might look like
\[ \begin{aligned} \mybold{k} = \begin{pmatrix} g \\ e \\ g \\ g \\ \vdots \end{pmatrix} && \X = (\xv_g \quad \xv_e) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ \vdots \end{pmatrix} \end{aligned} \]
Note that \(\xv_g^\trans \xv_g\) is just the number of entries with \(k_n = g\), and \(\xv_g^\trans \xv_e = 0\) because a kitchen is either good or excellent but never both.
We then have
\[ \X^\trans \X = \begin{pmatrix} \xv_g^\trans \xv_g & \xv_g^\trans \xv_e \\ \xv_e^\trans \xv_g & \xv_e^\trans \xv_e \\ \end{pmatrix} = \begin{pmatrix} N_g & 0 \\ 0 & N_e \\ \end{pmatrix}. \]
Then \(\X^\trans \X\) is invertible as long as \(N_g > 0\) and \(N_e > 0\), that is, as long as we have at least one observation of each kitchen type, and
\[ \left(\X^\trans \X\right)^{-1} = \begin{pmatrix} \frac{1}{N_g} & 0 \\ 0 & \frac{1}{N_e} \\ \end{pmatrix}. \]
Similarly, \(\xv_g^\trans \Y\) is just the sum of entries of \(\Y\) where \(k_n = g\), with the analogous conclusion for \(\xv_e\). From this we recover the result that
\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y = \begin{pmatrix} \frac{1}{N_g} & 0 \\ 0 & \frac{1}{N_e} \\ \end{pmatrix} \begin{pmatrix} \sum_{n: k_n=g} \y_n \\ \sum_{n: k_n=e} \y_n \end{pmatrix} = \begin{pmatrix} \frac{1}{N_g} \sum_{n: k_n=g} \y_n \\ \frac{1}{N_e} \sum_{n: k_n=e} \y_n \end{pmatrix}. \]
If we let \(\ybar_g\) and \(\ybar_e\) denote the sample means within each group, we have shown that \(\betahat_g = \ybar_g\) and \(\betahat_e = \ybar_e\), as we proved before without using the matrix formulation.
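Here is a minimal R sketch of this example; the kitchen labels and responses are simulated, so the particular numbers are arbitrary:

```r
# Minimal sketch with simulated data: a one-hot design gives a diagonal
# X^T X, and the least squares coefficients are the within-group means.
set.seed(3)
k <- sample(c("g", "e"), size = 30, replace = TRUE)
y <- ifelse(k == "g", 10, 15) + rnorm(30)

X <- cbind(xg = as.numeric(k == "g"),
           xe = as.numeric(k == "e"))
t(X) %*% X                                   # diagonal: diag(N_g, N_e)
betahat <- solve(t(X) %*% X, t(X) %*% y)
cbind(betahat = drop(betahat),
      group_mean = c(mean(y[k == "g"]), mean(y[k == "e"])))
```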
Orthogonal regressions are just a bunch of univariate regressions
Suppose the columns of the regressor matrix are orthonormal. This may seem strange at first, since we usually specify the rows of the regressors, not the columns. But in fact we have seen a near–example with one–hot encodings, which are defined row–wise but produce orthogonal column vectors when stacked in a matrix. If we divide each one–hot column by the square root of its number of ones, we produce a unit–norm column vector.
Let’s call the design matrix with orthonormal columns \(\U\). Then \(\U^\trans \U = \id\), the identity matrix, and so, in the regression \(\Y \sim \U \betav\),
\[ \betavhat = (\U^\trans \U)^{-1} \U^\trans \Y = \id \U^\trans \Y = \begin{pmatrix} \U_{\cdot 1}^\trans \Y \\ \vdots \\ \U_{\cdot P}^\trans \Y \\ \end{pmatrix}. \]
This regression is particularly simple: each component of \(\betavhat\) depends only on its corresponding column of \(\U\), not on any of the other columns, and the multiple linear regression has become \(P\) separate univariate regression problems.
This is of course the same answer we would have gotten if we had tried to write \(\Y\) in the basis of the column vectors of \(\U\):
\[ \begin{aligned} \Y ={}& \betahat_1 \U_{\cdot 1} + \ldots + \betahat_P \U_{\cdot P} = \U \betavhat \Rightarrow \\ \U^\trans \Y ={}& \U^\trans \U \betavhat = \betavhat. \end{aligned} \]
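A short R sketch (using randomly generated orthonormal columns from a QR decomposition, purely for illustration) confirms that the matrix formula reduces to \(P\) separate univariate projections:

```r
# Minimal sketch with simulated data: with orthonormal columns U, the least
# squares coefficients are simply U^T Y, one projection per column.
set.seed(4)
n <- 200
U <- qr.Q(qr(matrix(rnorm(n * 3), n, 3)))    # orthonormalize 3 random columns
y <- rnorm(n)

round(t(U) %*% U, 10)                        # the identity matrix
betahat_matrix <- solve(t(U) %*% U, t(U) %*% y)
betahat_columns <- t(U) %*% y                # column-by-column projections
cbind(betahat_matrix, betahat_columns)       # identical
```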
Different ways to write the same regression
One–hot encodings and constants
Recall that for the Ames housing data, we ran the following two regressions:
\[ \begin{aligned} \y_n \sim{}& \beta_e \x_{ne} + \beta_g \x_{ng} \\ \y_n \sim{}& \gamma_0 + \gamma_g \x_{ng} = \z_n^\trans \gammav, \end{aligned} \] where I take \(\gammav = (\gamma_0, \gamma_g)^\trans\) and \(\z_n = (1, \x_{ng})^\trans\).
We found using R that the best fits were given by
\[ \begin{aligned} \betahat_e =& \ybar_e & \betahat_g =& \ybar_g \\ \gammahat_0 =& \ybar_e & \gammahat_g =& \ybar_g - \ybar_e \\ \end{aligned} \]
We can compute the latter by constructing the \(\Z\) matrix whose rows are \(\z_n^\trans\). (We write \(\Z\) to distinguish it from the \(\X\) matrix of the previous example.) Using reasoning similar to the one–hot encoding case, we see that
\[ \Z^\trans \Z = \begin{pmatrix} N & N_g \\ N_g & N_g \end{pmatrix}. \]
This is invertible as long as \(0 < N_g < N\), i.e., as long as we observe at least one kitchen of each type. We have
\[ (\Z^\trans \Z)^{-1} = \frac{1}{N_g (N - N_g)} \begin{pmatrix} N_g & -N_g \\ -N_g & N \end{pmatrix} \quad\textrm{and}\quad \Z^\trans \Y = \begin{pmatrix} \sumn \y_n \\ \sum_{n: k_n=g} \y_n \\ \end{pmatrix} \]
It is possible (but a little tedious) to prove \(\gammahat_0 = \ybar_e\) and \(\gammahat_g = \ybar_g - \ybar_e\) using these formulas. But an easier way to see it is as follows.
Note that \(\x_{ne} + \x_{ng} = 1\). That means we can always re-write the regression with a constant as
\[ \y_n \sim \gamma_0 + \gamma_g \x_{ng} = \gamma_0 (\x_{ne} + \x_{ng}) + \gamma_g \x_{ng} = \gamma_0 \x_{ne} + (\gamma_0 + \gamma_g) \x_{ng}. \]
Now, we already know from the one–hot encoding case that the sum of squared residuals is minimized by setting \(\gammahat_0 = \ybar_e\) and \(\gammahat_0 + \gammahat_g = \ybar_g\). We can then solve for \(\gammahat_g = \ybar_g - \ybar_e\), as expected.
This is a case where we have two regressions whose regressors are invertible linear combinations of one another:
\[ \zv_n = \begin{pmatrix} 1 \\ \x_{ng} \end{pmatrix} = \begin{pmatrix} \x_{ne} + \x_{ng} \\ \x_{ng} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix} \begin{pmatrix} \x_{ng} \\ \x_{ne} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix} \xv_n. \]
It follows that if you can achieve a least squares fit with \(\xv_n^\trans \betavhat\), you can achieve exactly the same fit with
\[ \betavhat^\trans \xv_n = \betavhat^\trans \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix}^{-1} \zv_n, \]
which can be achieved by taking
\[ \gammavhat^\trans = \betavhat^\trans \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix}^{-1} \Rightarrow \gammavhat = \begin{pmatrix} 1 & 1 \\ 1 & 0\\ \end{pmatrix}^{-T} \betavhat = \frac{1}{-1} \begin{pmatrix} 0 & -1 \\ -1 & 1\\ \end{pmatrix} \betavhat = \begin{pmatrix} \betahat_2 \\ \betahat_1 - \betahat_2 \\ \end{pmatrix}, \]
exactly as expected.
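Here is a minimal R sketch (with simulated kitchen data again) checking that the two parameterizations give identical fitted values, with coefficients related exactly as above:

```r
# Minimal sketch with simulated data: the one-hot regression and the
# intercept-plus-indicator regression give the same fit, with coefficients
# related by the invertible linear map worked out above.
set.seed(5)
k <- sample(c("g", "e"), size = 40, replace = TRUE)
y <- ifelse(k == "g", 10, 15) + rnorm(40)
xg <- as.numeric(k == "g")
xe <- as.numeric(k == "e")

fit_onehot <- lm(y ~ 0 + xg + xe)        # (beta_g, beta_e) = group means
fit_const  <- lm(y ~ xg)                 # (gamma_0, gamma_g)
coef(fit_onehot)                         # (ybar_g, ybar_e)
coef(fit_const)                          # (ybar_e, ybar_g - ybar_e)
all.equal(fitted(fit_onehot), fitted(fit_const))   # identical fitted values
```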
We will see this is an entirely general result: when regressions are related by invertible linear transformations of regressors, the fit does not change, but the optimal coefficients are linear transforms of one another.
Redundant regressors and zero eigenvalues
Suppose we run the (silly) regression \(\y_n \sim \alpha \cdot 1 + \gamma \cdot 3 + \res_n\). That is, we regress on both the constant \(1\) and the constant \(3\). We have
\[ \X = \begin{pmatrix} 1 & 3 \\ 1 & 3 \\ 1 & 3 \\ \vdots \end{pmatrix} = (\onev \quad 3 \onev) \]
and so
\[ \X^\trans \X = \begin{pmatrix} \onev^\trans \onev & 3 \onev^\trans \onev \\ 3 \onev^\trans \onev & 9 \onev^\trans \onev \\ \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \]
This is not invertible (the second row is \(3\) times the first, and the determinant is \(9 - 3 \cdot 3 = 0\)). So Equation 2 does not apply, and \(\betavhat\) is not uniquely defined. What went wrong?
One way to see this is to define \(\beta = \alpha + 3 \gamma\) and write
\[ \y_n = (\alpha + 3 \gamma) + \res_n = \beta + \res_n. \]
There is obviously only one \(\betahat\) that minimizes \(\sumn \res_n^2\), namely \(\betahat = \ybar\). But there are infinitely many choices of \(\alpha\) and \(\gamma\) satisfying
\[ \alpha + 3 \gamma = \betahat = \ybar. \]
Specifically, for any value of \(\gamma\) we can take \(\alpha = \ybar - 3 \gamma\), leaving \(\beta\) unchanged. All of these choices of \(\alpha, \gamma\) achieve the same \(\sumn \res_n^2\)! So the least squares criterion cannot distinguish among them.
In general, this is what it means for \(\X^\trans \X\) to be non–invertible: it happens precisely when there are redundant regressors, so that many different regression coefficients result in the same fit.
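A quick R sketch of this silly regression (with simulated \(\y_n\)) shows the singularity directly; note that `lm()` signals the redundancy by returning `NA` for one of the coefficients:

```r
# Minimal sketch with simulated data: regressing on the constants 1 and 3
# makes X^T X singular, and lm() drops one of the redundant coefficients.
set.seed(6)
y <- rnorm(25, mean = 2)
X <- cbind(one = rep(1, 25), three = rep(3, 25))

t(X) %*% X                               # N * matrix(c(1, 3, 3, 9), 2, 2)
det(t(X) %*% X)                          # 0: not invertible
coef(lm(y ~ 0 + X))                      # one coefficient comes back NA
```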
An eigenvalue perspective on the same result
In fact, \(\X^\trans \X\) fails to be invertible precisely when it has a zero eigenvalue. In the preceding example, we can see that
\[ \X^\trans \X \begin{pmatrix} 3 \\ -1 \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \begin{pmatrix} 3 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \]
so \((3, -1)^\trans\) is an eigenvector with eigenvalue zero (a “zero eigenvector”). (In general you might find such a vector by a numerical eigenvalue decomposition, but in this case you can just guess it.)
Going back to Equation 1, we see that this means that
\[ \X^\trans \Y = (\X^\trans \X) \begin{pmatrix} \alphahat \\ \gammahat \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \begin{pmatrix} \alphahat \\ \gammahat \end{pmatrix} = N \begin{pmatrix} 1 & 3 \\ 3 & 9 \\ \end{pmatrix} \left( \begin{pmatrix} \alphahat \\ \gammahat \end{pmatrix} + C \begin{pmatrix} 3 \\ -1 \end{pmatrix} \right) \]
for any value of \(C\). This means there is an infinite set of “optimal” values, all of which set the gradient of the loss to zero, and all of which have the same value of the loss function (i.e., achieve the same fit). And you can check that this family of values is exactly the set satisfying \(\alpha + 3 \gamma = \betahat = \ybar\), since
\[ \alpha + 3 \gamma = \begin{pmatrix} 1 & 3 \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} \quad\quad\textrm{and}\quad\quad \begin{pmatrix} 1 & 3 \end{pmatrix} \begin{pmatrix} 3 \\ -1 \end{pmatrix} = 0. \]
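The following R sketch (again with simulated \(\y_n\)) computes the eigendecomposition numerically and checks that moving along the zero eigenvector leaves the fitted values unchanged:

```r
# Minimal sketch with simulated data: X^T X from the 1-and-3 regression has
# a zero eigenvalue with eigenvector proportional to (3, -1), and shifting a
# solution along that eigenvector does not change the fit.
set.seed(7)
N <- 25
y <- rnorm(N, mean = 2)
X <- cbind(rep(1, N), rep(3, N))

eigen(t(X) %*% X)$values                 # one eigenvalue is (numerically) zero
eigen(t(X) %*% X)$vectors[, 2]           # proportional to (3, -1)

b1 <- c(mean(y), 0)                      # alpha = ybar, gamma = 0 solves Equation 1
b2 <- b1 + 5 * c(3, -1)                  # shift along the zero eigenvector
all.equal(drop(X %*% b1), drop(X %*% b2))   # same fitted values
```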
Soon, we will see that this is a general result: when \(\X^\trans \X\) is not invertible, there are many coefficient vectors that achieve the same least squares fit, and they are characterized precisely by the zero eigenvectors of \(\X^\trans \X\).
Zero variance regressors
An example of redundant regressors occurs when the sample variance of \(\x_n\) is zero and a constant is included in the regression. Specifically, suppose that \(\overline{xx} - \overline{x}^2 = 0\).
Exercise: Prove that \(\overline{xx} - \overline{x}^2 = 0\) implies that \(\x_n\) is constant, with \(\x_n = \xbar\) for all \(n\). Hint: look at the sample variance of \(\x_n\).
Let’s regress \(\y_n \sim \beta_1 + \beta_2 \x_n\).
Since \(\x_n = \xbar\) for every \(n\) (for example, \(\xbar = 3\)), we can rewrite our estimating equation as
\[ \y_n = \beta_1 + \beta_2 \x_n + \res_n = (\beta_1 + \beta_2 \xbar) + \res_n. \]
We’re thus in the previous setting with \(\xbar\) in place of the number \(3\), and so \(\beta_1\) and \(\beta_2\) are not uniquely determined by least squares.
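A one-line check in R (with simulated data): regressing on an intercept plus a constant regressor leaves one coefficient undetermined, which `lm()` reports as `NA`:

```r
# Minimal sketch with simulated data: a constant regressor alongside an
# intercept is redundant, so the least squares coefficients are not unique.
set.seed(8)
y <- rnorm(30, mean = 5)
x <- rep(3, 30)                          # zero sample variance

coef(lm(y ~ x))                          # the slope comes back NA
```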