$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Least squares as a projection.

Goals

  • Review / introduce some linear algebra
    • Orthogonal bases
    • Eigendecompositions
    • Matrix square roots
  • Derive the OLS estimator from a geometric perspective
    • Derive the form of projection operators without using vector calculus
    • Write the OLS estimator as a projection operator
    • Briefly discuss some consequences of OLS as a projection

Recap

Last lecture we derived the formula

\[ \X^\trans \X \betahat = \X^\trans \Y \]

which, if \(\X^\trans \X\) is inverible, gives

\[ \betahat = (\X^\trans \X)^{-1} \X^\trans \Y. \]

We derived this formula by minimizing the squared error using multivariate calculus. Today we’ll do the same thing using the perspective of orthogonal projection operators without using vector gradients at all.

Doing so will both shed light on the optimization perspective, as well as review a lot of linear algebra concepts that will be useful in the remainder of the class.

Our payoff will be best defined in terms of some quantities we’ll be seeing repeatedly, and which we will now give a name:

Notation

When \(\X^\trans \X \betahat = \X^\trans \Y\) — that is, when \(\betavhat\) is any value that minimizes the sum of squared residuals in the regression \(\Y \sim \X \betav + \resv\), we define

\[ \Yhat := \X \betavhat \quad\quad\textrm{and}\quad\quad \resvhat := \Y - \Yhat = \Y - \X \betavhat. \]

Note that, since \(\betavhat \in \argmin{\beta} \resv^\trans \resv\), we have

\[ \resvhat^\trans \resvhat = \min_\beta \resv^\trans \resv. \]

Note that \(\Yhat\) and \(\resvhat\) are always uniquely defined, even if \(\betavhat\) is not uniquely defined, and that

\[ \Y = \Yhat + \resvhat. \]

Bases and spans

When you first encounter linear algebra, you think of vectors as lists of numbers. A more advanced and useful way to think of vectors as linear combinations of basis functions.

For example, take a vector \(\vv \in \rdom{3}\), and take the unit vectors \(\ev_p\) containing a \(1\) in the \(p\)–th location and zeros elsewhere. We can write

\[ \vv = \begin{pmatrix} \v_1 \\ \v_2 \\ \v_3 \end{pmatrix} = \v_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + \v_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + \v_3 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \v_1 \ev_1 + \v_2 \ev_2 + \v_3 \ev_3 = \begin{pmatrix} \ev_1 & \ev_2 & \ev_3 \end{pmatrix} \begin{pmatrix} \v_1 \\ \v_2 \\ \v_3 \end{pmatrix}, \]

where the matrix in the final expression is \(3 \times 3\), and is formed by stacking the basis vectors \(\ev_p\) horizontally. (In fact, this vector is the identity matrix, which accounts for the final expression.)

In general, if we have \(P\) orthonormal basis vectors \(\u_p \in \rdom{P}\), which are stacked in a matrix

\[ \U = \begin{pmatrix} \u_1 & \ldots & \u_P \end{pmatrix}, \]

then \(\U^\trans \U = \id\), so \(\U^\trans = \U^{-1}\), and we can represent \(\vv \in \rdom{P}\) as a linear combination of the \(\u_p\) by solving

\[ % \begin{aligned} \vv ={}& \U \av \Leftrightarrow \\ \U^\trans \vv ={}& \U ^\trans \U \av \Leftrightarrow \\ \U^\trans \vv ={}& \av. \end{aligned} % \]

In other words, the cofficient of \(\vv\) in front of \(\uv_p\) is simply given by \(\vv^\trans \uv_p\).

The “span” of a set of all linear product of the vectors. A set of \(P\) vectors can span a space of dimension at most \(P\). The span of a set of vectors defines a “subspace”. In general, a subspace can have many different sets of vectors that span it, orthonormal, or not.

Let \(\Sc\) be a subspace spanned by orthonormal vectors \(\uv_1, \ldots, \uv_Q\), all in \(\rdom{P}\). We cannot have \(Q > P\), since there can be at most \(P\) linearly independent vectors in \(\rdom{P}\). Correspondingly, when \(Q < P\), there will be some vectors that do not lie in \(\Sc\), since there do exist \(P\) linearly independent vectors in \(\rdom{P}\). We can use the existence of such vectors to construct another subspace, \(\Sc^\perp\), spanned by orthonormal vectors \(\uv^\perp_1, \ldots, \uv^\perp_{P-Q}\), such that \((\uv^\perp_i) ^\trans \uv_j = 0\) for all \(i\) and \(j\). We will collect the vectors \(\uv^\perp_p\) as the column vectors of a matrix \(\U^\perp\). It follows that we can write any vector \(vv \in \rdom{P}\) as

\[ \vv = \U \av + \U^\perp \av^\perp \]

for some unique \(\av\) and \(\av^\perp\).

Projections

Take a subsapce \(\Sc\) of \(\rdom{P}\) with dimension \(Q < P\). We can define the projection of \(\vv \in \rdom{P}\) into \(\Sc\) as follows:

\[ \begin{aligned} \projop{\Sc} \vv = \argmin{\sv \in \Sc} \norm{\vv - \sv}_2^2 = \argmin{\sv \in \Sc} \left(\vv - \sv \right)^\trans \left(\vv - \sv \right). \end{aligned} \]

(This is formally an “orthogonal projection,” but we’ll only be discussing orthogonal projections in this class, so to save space I’ll just be saying “projection.”)

Since \(\sv \in \Sc\), we can write it as a linear combination of the orthonormal vectors \(\u_p\):

\[ \sv = \U \av \quad\textrm{ for some } \av \]

Similarly, we can write

\[ \begin{aligned} \vv =& \U \alphav + \U^\perp \alphav^\perp \textrm{ where } \alpha_p = \u_p^\trans \vv \textrm{ and } \alpha^\perp_p = (\u^\perp_p)^\trans \vv,\\ \textrm{i.e. } \alphav =& \U^\trans \vv \textrm{ and } \alphav^\perp = (\U^\perp)^\trans \vv. \end{aligned} \]

Putting these into the projection, we see that the quantity to be minimized can be written as a function of \(\av\), and that

\[ \begin{aligned} \argmin{\av} \norm{\U \alphav + \U^\perp \alphav^\perp - \U \av}_2^2 ={}& \argmin{\av} \norm{\U (\alphav - \av) + \U^\perp \alphav^\perp}_2^2 \\={}& \argmin{\av} \left( (\alphav - \av)^\trans \U^\trans \U (\alphav - \av) + (\alphav^\perp)^\trans (\U^\perp)^\trans \U^\perp \alphav^\perp + 2 (\alphav - \av)^\trans \U^\trans \U^\perp \alphav^\perp \right) \\={}& \argmin{\av} \left( \norm{\alphav - \av}_2^2 + \norm{\alphav^\perp}_2^2 \right). \end{aligned} \]

The minimum is achieved at

\[ \av = \alphav = \U^\trans \vv, \] so \[ \projop{\Sc} \vv = \U \av = \U \U^\trans \vv =: \proj{\Sc} \vv, \]

where we defined

\[ \proj{\Sc} = \U \U^\trans. \]

Effectively, we have shown that the solution to a minimization problem (the left hand side of the preceding equation) takes the form of a matrix multiplication (the right hand side of the preceding equation).

Projection basis invariance

Something suspicious has happened, though — we started with a subspace, and got an answer that depends on the basis we chose. What if we had chosen a different orthonormal basis, \(\V\), which spans \(\Sc\)? Would we have gotten a different answer?

It turns out, no. Note that the fact that \(\V\) spans \(\Sc\) means that we can write

\[ \V = \U \U^\trans \V \quad\textrm{and}\quad \U = \V \V^\trans \U. \]

That is, \(\V\) is unchanged when projecting into the span of \(\U\), and vice-versa. Using this, we can see that

\[ \begin{aligned} \U \U^\trans \vv ={}& \U (\V \V^\trans \U)^\trans \vv \\ ={}& \U \U^\trans \V \V^\trans \vv \\ ={}& (\U \U^\trans \V) \V^\trans \vv \\ ={}& \V \V^\trans \vv. \end{aligned} \]

So the projection operator does not depend on what basis we choose.

Some properties of projection matrices

The projection matrix has some special properties that are worth mentioning.

  • First, if \(\vv \in \Sc\), then \(\proj{\Sc}\vv = \vv\) Similarly, \(\proj{\Sc} \vv = 0\) is \(\vv \in \Sc^\perp\).

  • Second, \(\proj{\Sc}\) is symmetric, since \(\U \U^\trans = (\U \U^\trans)^\trans\).

  • Third, \(\proj{\Sc}\) is “idempotent”, meaning \(\proj{\Sc} = \proj{\Sc} \proj{\Sc}\).

Exercise

Prove these results.

These properties say that \(\proj{\Sc}\) has a special eigenstructure. In particular the fact that \(\proj{\Sc} = \proj{\Sc} \proj{\Sc}\) means the eigenvalues of \(\proj{\Sc}\) must be either zero or one, as we now show. Let \(\lambda\) be an eigenvalue of \(\proj{\Sc}\) with eigenvector \(\vv\). Then \[ \lambda \vv = \proj{\Sc} \vv = \proj{\Sc} \ldots \proj{\Sc} \vv = \lambda^k \vv. \]

Therefore we need \(\lambda^k = \lambda\) for any \(k\)! Only two values are possible: \(\lambda \in \{ 0, 1\}\). And we know that, for any basis vectors \(\av \in \Sc\) and \(\av^\perp \in \Sc^\perp\),

\[ \proj{\Sc} \av = \av \quad\textrm{and}\quad \proj{\Sc} \av^\perp = 0, \]

so the eigenvectors are precisely any vectors lying entirely in either \(\Sc\) or \(\Sc^\perp\).

The orthogonal projection matrix

Once we have the projection matrix for \(\Sc\), we can easily form the projection matrix onto the set’s orthogonal complement, \(\Sc^\perp\):

\[ \proj{\Sc^\perp} = \id - \proj{\Sc}. \]

One way to see this is via the identity between left and right inverses. By orthogonality,

\[ \begin{aligned} \begin{pmatrix} \U & \U^\perp \end{pmatrix}^\trans \begin{pmatrix} \U & \U^\perp \end{pmatrix} = \id. \end{aligned} \]

Then, by the equivalence between left and right inverses,

\[ \begin{aligned} \id = \begin{pmatrix} \U & \U^\perp \end{pmatrix} \begin{pmatrix} \U & \U^\perp \end{pmatrix}^\trans = \begin{pmatrix} \U & \U^\perp \end{pmatrix} \begin{pmatrix} \U^\trans \\ (\U^\perp)^\trans \end{pmatrix} \end{aligned} = \U \U^\trans + \U^\perp (\U^\perp)^\trans. \]

Solving gives

\[ \begin{aligned} \proj{\Sc^\perp} = \U^\perp (\U^\perp)^\trans = \id - \U \U^\trans = \id - \proj{\Sc}. \end{aligned} \]

Square roots and inverses via eigendecomposition

The projection matrix was defined in terms of orthogonal bases. However, we would like it for spaces spanned by a generic, non-orthonormal set of vectors to solve our least squares problem, since in general \(\X\) does not have orthonormal columns.

To bridge the gap, we can construct a linear operator that converts a set of vectors to an orthogonal basis.

First, recall that a square, symmetric \(P \times P\) matrix \(\A\) can always be written in the form

\[ \begin{aligned} \A &= \U \Lambda \U^\trans \quad\textrm{where}\quad \\ \U^\trans\U &= \id \textrm{ ($\U$ is orthogonal) and }\\ \Lammat &= \begin{pmatrix} \lambda_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 & 0\ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots 0 & \lambda_P \\ \end{pmatrix} \textrm{ is diagonal.} \end{aligned} \]

Exercise

Prove that the columns of \(\U\) are the eigenvectors, and the \(\lambda_p\) are the corresponding eigenvalues.

A matrix is called “positive definite” if the eigenvalues are all strictly positive, and positive semi-definite if they are all non-zero.

Exercise

Prove that positive definiteness is equivalent to \(\vv^\trans \A \vv > 0\), and semi-definiteness to \(\vv^\trans \A \vv \ge 0\).

Exercise

Prove that if \(\A\) is of the form \(\A = \Q^\trans \Q\) for some \(\Q\), then \(\A\) is positive semi-definite.

Note that \(\A\) is invertible if and only if its eigenvalues are non-zero. To see that nonzero eigenvalues are sufficient, we can write \(\A^{-1} = \U \Lambda^{-1} \U^\trans\).

Exercise

Verify that \(\A^{-1}\) so defined is an inverse.

Exercise

Prove that, if some \(\lambda_p = 0\), then \(\A\) is not invertible.

From the eigendecomposition, one can define the square root of a symmetric positive semit-definite matrix \(\A\): that is, a matrix \(\A^{1/2}\) such that \((\A^{1/2})^\trans \A^{1/2} = \A\). First, define

\[ \Lammat^{1/2} := \begin{pmatrix} \sqrt{\lambda_1} & 0 & \ldots & 0 \\ 0 & \sqrt{\lambda_2} & 0\ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots 0 & \sqrt{\lambda_P} \\ \end{pmatrix}. \]

Then \(\U \Lammat^{1/2} \U^\trans\) is a square root of \(\A\). Matrix square roots are not unique, but that will not matter — all that matters is that symmetric positive semi-definite matrices always have a square root. However, when I write \(\A^{1/2}\), I will refer to the symmetric square root given in the formula above.

Further \(\A^{1/2}\) itself is invertible if and only if \(\A\) is invertible.

Exercise

Prove that \(\A^{1/2}\) so defined is a square root.

Exercise

Find a different matrix than the one given above that is also a square root.

Exercise

Prove that any square root \(\A^{1/2}\) is invertible.

Matrix square roots for orthonormalization

In our case, recall that we want to find an orthonormal basis of the column span of \(\X\). Let us assume that \(\X\) has full column rank. Observe that

  1. \(\X^\trans \X\) is symmetric and positive definite.
  2. \(\Rightarrow\) We can form the inverse square root \((\X^\trans \X)^{-1/2}\).
  3. \(\Rightarrow\) We can define \(\U := \X (\X^\trans \X)^{-1/2}\).

Then

  1. \(\U\) is orthonormal and
  2. \(\U\) has the same column span as \(\X\).

Effectively we have produced an orthonormal basis for \(\X\).

Exercise

Prove that \(\U\) is orthonormal and \(\U\) has the same column span as \(\X\).

Note that right multiplying \(\X\) effectively transforms each row of \(\X\) — that is, the regressor \(\x_n\) — into a new regressor \(\z_n = (\X^\trans \X)^{-1/2} \z_n\) which has an identity covariance matrix.

Using our orthonormal basis, we can write the projection matrix as

\[ \proj{\Sc_\X} = \U \U^\trans = \X (\X^\trans \X)^{-1/2} (\X^\trans \X)^{-1/2} \X^\trans = \X (\X^\trans \X)^{-1} \X^\trans. \]

Exercise

Check directly that any vector that can be written in the form \(\vv = \X \beta\) satisfies \(\vv = \proj{\Sc_\X} \vv\), and that any \(\vv \in \Sc_\X^\perp\) satisfies \(\proj{\Sc_\X} \vv = \zerov\).

One might simply begin with the above formula and verify that it acts as a projection directly. However, doing so may seem mysterious or somehow lucky. Bulding the projection up as we have done shows where it comes from.

Finally, we have derived the projection operator onto the span of \(\X\):

\[ \begin{aligned} \proj{\Sc_\X} \Y ={}& \X (\X^\trans \X)^{-1} \X^\trans \Y ={} \X \betahat \quad\textrm{where}\\ \betahat ={}& (\X^\trans \X)^{-1} \X^\trans \Y. \end{aligned} \]

We give this fitted vector a special name:

\[ \Yhat := \proj{\Sc_\X} \Y = \X \betahat. \]

Additionally, we have that the fitted residuals are given by

\[ \resvhat = \Y - \X \betahat = \id \Y - \proj{\Sc_\X} \Y = (\id - \proj{\Sc_\X})\Y = \proj{\Sc_\X^\perp} \Y. \]

Note that \(\Y = \Yhat + \resvhat = (\proj{\Sc_\X} + \proj{\Sc_\X^\perp}) \Y\). Also, our squared error loss is given by \(\norm{\resvhat}_2^2 = \norm{\proj{\Sc_\X^\perp} \Y}_2^2\), the squared magnitude of the component of \(\Y\) not lying in \(\X\).

Conclusion

Putting this together, we see that

\[ \Yhat = \X \betavhat = \X (\X^\trans \X)^{-1} \X^\trans \Y = \proj{\X} \Y. \]

That is, our best guess for \(\Y\), in a least squares sense, is given by the projection of \(\Y\) onto the column span of \(\X\). Similarly,

\[ \resvhat = \Y - \Yhat = \id \Y - \proj{\X} \Y = (\id - \proj{\X}) \Y = \proj{\X^\perp} \Y \]

is the projection perpendicular to the column span of \(\X\). These particularly simple forms are independent of the particular value of \(\betavhat\), and are in fact uniquely defined even when \(\X\) is not full–column rank and \(\X^\trans \X\) is not invertible.