Ridge or ‘L2’ regression
Goals
- Introduce ridge regression
- As a complexity penalty
- As a tuneable hierarchy of models to be selected by cross-validation
- As a Bayesian posterior estimate
Approximately colinear regressors
Sometimes it makes sense to penalize very large values of \(\betahat\).
Consider the following contrived example. Let
\[ \begin{aligned} \y_n ={}& \z_n + \varepsilon_n \quad\textrm{where}\quad \z_n \sim{} \gauss{0,1} \quad\textrm{and}\quad \varepsilon_n \sim{} \gauss{0,1}. \end{aligned} \]
If we regress \(y \sim \beta z\), then \(\betahat \sim \gauss{1, 1/N}\) with no problems. But suppose we actually also observe \(\x_n = \z_n + \epsilon_n\) with \(\epsilon_n \sim \gauss{0, \delta}\) for very small \(\delta \ll 1\). Then \(\x_n\) and \(\z_n\) are highly correlated:
\[ \Xcov = \cov{\begin{pmatrix}\x_n \\ \z_n\end{pmatrix}} = \begin{pmatrix} 1 + \delta & 1 \\ 1 & 1 \end{pmatrix} \quad\Rightarrow\quad \Xcov^{-1} = \frac{1}{\delta} \begin{pmatrix} 1 & -1 \\ -1 & 1 + \delta \end{pmatrix}. \]
Therefore, if we try to regress \(\y \sim \beta_\x \x + \beta_z \z\), we get
\[ \cov{\begin{pmatrix}\beta_x \\ \beta_z \end{pmatrix}} = \frac{1}{N} \Xcov^{-1} = \frac{1}{N} \frac{1}{\delta} \begin{pmatrix} 1 & -1 \\ -1 & 1 + \delta \end{pmatrix}. \]
Note that \(\var{\beta_\x} = \frac{1}{N \delta}\), which is very large for small \(\delta\). With high probability, we will estimate a \(\betahat\) with a very large magnitude, \(\norm{\betahat}_2^2\), even though its two components should be nearly the negative of one another. This can be problematic in practice. For example, if \(\x_n\) and \(\z_n\) are not as well correlated in our test set or application as they were in our training set, we will get wild, highly variable predicted values.
Does it make sense to permit such a large variance? Wouldn’t it be better to choose a slightly smaller \(\betahat\) that is, in turn, somewhat more “stable”?
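To see this variance inflation concretely, here is a minimal simulation sketch in numpy (the sample size, \(\delta\), and number of replications are illustrative choices of mine, not from the notes):

```python
# Simulation sketch: x is a nearly collinear copy of z, and the OLS coefficient
# on x becomes extremely variable. Constants here are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N, delta, n_sims = 100, 0.01, 2000

beta_x_hats = []
for _ in range(n_sims):
    z = rng.normal(0.0, 1.0, N)
    x = z + rng.normal(0.0, np.sqrt(delta), N)   # x ~= z up to small noise
    y = z + rng.normal(0.0, 1.0, N)
    X = np.column_stack([x, z])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit of y ~ beta_x x + beta_z z
    beta_x_hats.append(beta_hat[0])

# The sampling variance of beta_x is roughly 1 / (N * delta), i.e. huge for small delta.
print(np.var(beta_x_hats), 1.0 / (N * delta))
```

With these values, \(\var{\betahat_\x} \approx 1/(N\delta) = 1\), compared to roughly \(1/N = 0.01\) if we regressed on \(\z\) alone.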
Penalizing large regressors
Recall that one perspective on regression is that we choose \(\betahat\) to minimize the loss
\[ \betahat := \argmin{\beta} \sumn (\y_n - \beta^\trans \x_n)^2 =: RSS(\beta). \]
We motivated this as an approximation to the expected prediction loss, \(L(\beta) = \expect{\y_\new,\x_\new}{(\y_\new - \beta^\trans \x_\new)^2}\). But that made sense when we had a fixed set of regressors, and we have shown that the correspondence breaks down when we are searching over the space of regressors. In particular, \(RSS(\beta)\) always decreases as we add more regressors, but \(L(\beta)\) may not.
Instead, let us consider minimizing \(RSS(\beta)\), but with an additional penalty for large \(\betahat\). There are many ways to do this! But one convenient one is as follows. For now, pick a \(\lambda\), and choose \(\betahat\) to minimize:
\[ \betahat(\lambda) := \argmin{\beta} L_{ridge}(\beta, \lambda) := RSS(\beta) + \lambda \norm{\beta}_2^2 = RSS(\beta) + \lambda \sum_{p=1}^P \beta_p^2. \]
This is known as ridge regression, or L2-penalized regression. The latter name comes from the fact that the penalty \(\norm{\beta}_2^2\) is the squared L2 norm of the coefficient vector; next time we will study the L1 version, which is also known as the Lasso.
The term \(\lambda \norm{\beta}_2^2\) is known as a “regularizer,” since it imposes some “regularity” to the estimate \(\betahat(\lambda)\). Note that
- As \(\lambda \rightarrow \infty\), then \(\betahat(\lambda) \rightarrow \zerov\)
- When \(\lambda = 0\), then \(\betahat(\lambda) = \betahat\), the OLS estimator.
So the effect of the penalty is to “shrink” the estimate \(\betahat(\lambda)\) towards zero. Note that, since the ridge loss adds a penalty on the norm, it is impossible for the ridge solution to have a larger norm than the OLS solution.
The ridge regression regularizer has the considerable advantage that the optimum is available in closed form, since
\[ \begin{aligned} L_{ridge}(\beta, \lambda) ={}& (\Y - \X\beta)^\trans (\Y - \X\beta) + \lambda \beta^\trans \beta \\={}& \Y^\trans \Y -2 \Y^\trans \X \beta + \beta^\trans \X^\trans \X \beta+ \lambda \beta^\trans \beta \\={}& \Y^\trans \Y -2 \Y^\trans \X \beta + \beta^\trans \left(\X^\trans \X + \lambda \id \right) \beta \quad \Rightarrow \\ \frac{\partial L_{ridge}(\beta, \lambda)}{\partial \beta} ={}& -2 \X^\trans \Y + 2 \left(\X^\trans \X + \lambda \id \right) \beta \quad\Rightarrow \\ \betahat(\lambda) ={}& \left(\X^\trans \X + \lambda \id \right)^{-1} \X^\trans \Y. \end{aligned} \]
Note that \(\X^\trans \X + \lambda \id\) is always invertible if \(\lambda > 0\), even if \(\X\) is not full–column rank. In this sense, the ridge regression can deal safely with colinearity.
Prove that \(\X^\trans \X + \lambda \id\) is invertible if \(\lambda > 0\). Hint: using the fact that \(\X^\trans \X\) is symmetric and positive semi-definite (for any vector \(v\), \(v^\trans \X^\trans \X v = \norm{\X v}_2^2 \geq 0\)), find a lower bound on the smallest eigenvalue of \(\X^\trans \X + \lambda \id\).
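To make the closed form concrete, here is a minimal sketch (the helper name `ridge_fit` and the toy data are my own, not from the notes); it also illustrates the point above that the ridge system is solvable even when \(\X\) is not full column rank:

```python
# Sketch of the closed-form ridge estimator derived above; `ridge_fit` and the
# toy data are illustrative, not from the notes.
import numpy as np

def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam I) beta = X^T Y rather than forming an explicit inverse."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.column_stack([X, X[:, 0]])   # duplicate a column, so X is not full column rank
Y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# With lam = 0 the matrix X^T X is singular here; with lam > 0 the solve is well posed.
print(ridge_fit(X, Y, lam=1.0))
```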
Standardizing regressors
Suppose we re-scale one of the regressors \(\x_{np}\) by some very small value \(\alpha \ll 1\), i.e., we regress on \(\x'_{np} = \alpha \x_{np}\) instead of \(\x_{np}\). As we know, the OLS minimizer becomes \(\betahat_p' = \betahat_p / \alpha\), and the fitted values \(\Yhat\) are unchanged at \(\lambda = 0\). But for a particular \(\lambda > 0\), how does this affect the ridge solution? We can write
\[ \lambda \betahat_p'^2 = \frac{\lambda}{\alpha^2} \betahat_p^2. \]
That is, we effectively “punish” large values of \(\betahat_p'\) much more than we would “punish” the corresponding values of \(\betahat_p\). In turn, for a particular \(\lambda\), the rescaled regressor’s coefficient will tend to be shrunk more aggressively towards zero (although this is not guaranteed in every case).
The point is that re-scaling the regressors affects the meaning of \(\lambda\). Correspondingly, if different regressors have very different typical scales, such as age versus income, then ridge regression will drive their coefficients to zero to very different degrees.
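As a quick check of this rescaling effect, the following sketch (with an arbitrary \(\alpha\) and \(\lambda\) of my choosing) compares ridge fits before and after rescaling one column; with OLS the fitted values would be identical:

```python
# Sketch: rescaling one regressor changes the ridge fit (it would not change the
# OLS fit). The values of alpha and lam are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

alpha, lam = 0.01, 10.0
X_scaled = X.copy()
X_scaled[:, 1] *= alpha   # rescale the second regressor

beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
beta_scaled = np.linalg.solve(X_scaled.T @ X_scaled + lam * np.eye(2), X_scaled.T @ Y)

# For OLS the fitted values would agree exactly (beta_scaled[1] = beta[1] / alpha);
# under the ridge penalty the rescaled coefficient is shrunk much more and the fits differ.
print(np.linalg.norm(X @ beta - X_scaled @ beta_scaled))
```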
Similarly, it often doesn’t make sense to penalize the constant, so we might take \(\beta_1\) to be the constant (\(\x_{n1} = 1\)), and write
\[ \betahat(\lambda) := \argmin{\beta} \left( RSS(\beta) + \lambda \sum_{p=2}^P \beta_p^2 \right). \]
But this gets tedious, and assumes we have included a constant.
Instead, we might invoke the FWL theorem, center the response and regressors at their mean values, and then do penalized regression.
For both these reasons, before performing ridge regression (or any other similar penalized regression), we typically standardize the regressors, defining
\[ \x'_{np} := \frac{\x_{np} - \xbar_p}{\sqrt{\meann (\x_{np} - \xbar_p)^2}}, \quad\textrm{where}\quad \xbar_p := \meann \x_{np}. \]
We then run ridge regression on \(\x'_n\) rather than \(\x_n\), so that we
- Don’t penalize the constant term and
- Penalize every coefficient the same regardless of its regressor’s typical scale.
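Here is one possible implementation of this recipe as a sketch (the helper names and toy data are hypothetical, not from the notes): standardize the columns, center the response, fit ridge on the standardized design, and recover the unpenalized intercept at the end.

```python
# Sketch of ridge regression with standardized regressors and an unpenalized
# intercept; the helper names and toy data are illustrative assumptions.
import numpy as np

def standardize_columns(X):
    mean = X.mean(axis=0)
    scale = np.sqrt(((X - mean) ** 2).mean(axis=0))
    return (X - mean) / scale, mean, scale

def ridge_standardized(X, Y, lam):
    Xs, x_mean, x_scale = standardize_columns(X)
    y_mean = Y.mean()
    P = X.shape[1]
    beta_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(P), Xs.T @ (Y - y_mean))
    beta = beta_s / x_scale              # coefficients on the original scale
    intercept = y_mean - x_mean @ beta   # the constant term, never penalized
    return intercept, beta

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(0, 1, 200),      # a regressor on a small scale
                     rng.normal(0, 1000, 200)])  # a regressor on a large scale
Y = 5 + X @ np.array([2.0, 0.001]) + rng.normal(size=200)
print(ridge_standardized(X, Y, lam=1.0))
```

Mapping the coefficients back to the original scale (and computing the intercept from the means) keeps the fitted values interpretable, while the penalty itself acts on the standardized coefficients.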
A complexity penalty
Suppose that \(\y_n = \beta^\trans \xv_n + \res_n\) for some \(\beta\), where the residuals \(\res_n\) have mean zero. Note that, for a fixed \(\xv_\new\) and a fixed \(\X\), taking expectations over the residuals,
\[ \begin{aligned} \expect{\y_\new - \xv_\new^\trans \betahat(\lambda)} ={}& \xv_\new^\trans \left(\beta - \expect{\betahat(\lambda)}\right) \\={}& \xv_\new^\trans \left(\id - \left(\X^\trans \X + \lambda \id \right)^{-1} (\X^\trans \X)\right) \beta \end{aligned} \]
That is, as \(\lambda\) grows, \(\betahat(\lambda)\) becomes biased. However, by the same reasoning as in the standard case, under homoskedasticity,
\[ \cov{\betahat(\lambda)} = \sigma^2 \left(\X^\trans \X + \lambda \id \right)^{-1} (\X^\trans \X) \left(\X^\trans \X + \lambda \id \right)^{-1} , \]
which is smaller than \(\cov{\betahat}\) in the sense that \(\cov{\betahat} - \cov{\betahat(\lambda)}\) is a positive definite matrix for \(\lambda > 0\) and full-rank \(\X\). In this sense, \(\betahat(\lambda)\) is a one-parameter family of estimators that trades off bias and variance. We can thus use the variable selection procedures of the previous lecture to choose the \(\lambda\) that minimizes the estimated MSE.
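For example, a minimal sketch of choosing \(\lambda\) by K-fold cross-validation might look like the following (the grid, fold count, helper names, and data are assumptions of mine):

```python
# Sketch: pick lambda by K-fold cross-validation on estimated out-of-sample MSE.
# The fold count, lambda grid, and data are illustrative choices.
import numpy as np

def ridge_fit(X, Y, lam):
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

def cv_mse(X, Y, lam, n_folds=5, seed=0):
    N = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(N), n_folds)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), test_idx)
        beta = ridge_fit(X[train_idx], Y[train_idx], lam)
        errs.append(np.mean((Y[test_idx] - X[test_idx] @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
Y = X @ rng.normal(size=10) + rng.normal(size=200)

lambdas = np.logspace(-3, 3, 13)
best_lam = min(lambdas, key=lambda lam: cv_mse(X, Y, lam))
print(best_lam)
```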
A Bayesian posterior
Another way to interpret the ridge penalty is as a Bayesian posterior mean. If
\[ \betav \sim \gauss{\zerov, \sigma_\beta^2 \id} \quad\textrm{and}\quad \Y \vert \betav, \X \sim \gauss{\X \betav, \sigma^2 \id}, \]
then
\[ \betav \vert \Y, \X \sim \gauss{ \left(\X^\trans \X + \frac{\sigma^{2}}{\sigma_\beta^{2}} \id \right)^{-1} \X^\trans \Y, \sigma^{2}\left( \X^\trans \X + \frac{\sigma^{2}}{\sigma_\beta^{2}} \id \right)^{-1} }. \]
One way to derive this is to recognize that, if \(\betav \sim \gauss{\mu, \Sigma}\), then
\[ \log \prob{\betav | \mu, \Sigma} = -\frac{1}{2} \beta^\trans \Sigma^{-1} \beta + \beta^\trans \Sigma^{-1} \mu + \textrm{Terms that do not depend on }\beta. \]
We can write out \(\log \prob{\beta, \Y}\), which equals \(\log \prob{\beta | \Y}\) up to terms that do not depend on \(\beta\) (since \(\prob{\beta | \Y} = \prob{\beta, \Y} / \prob{\Y}\)), gather the terms that do depend on \(\beta\), and read off the mean and covariance:
\[ \begin{aligned} \log \prob{\beta, \Y} ={}& -\frac{1}{2} \sigma^{-2} (\Y - \X\beta)^\trans (\Y - \X\beta) -\frac{1}{2} \sigma_\beta^{-2} \beta^\trans \beta + \textrm{Terms that do not depend on }\beta \\={}& -\frac{1}{2} \sigma^{-2} \beta^\trans \X^\trans \X\beta + \sigma^{-2} \beta^\trans \X^\trans \Y -\frac{1}{2} \sigma_\beta^{-2} \beta^\trans \beta + \textrm{Terms that do not depend on }\beta \\={}& -\frac{1}{2} \beta^\trans\left(\sigma^{-2} \X^\trans \X + \sigma_\beta^{-2} \id \right) \beta + \sigma^{-2} \beta^\trans \X^\trans \Y + \textrm{Terms that do not depend on }\beta. \end{aligned} \]
From this, we can read off \(\Sigma^{-1} = \sigma^{-2} \X^\trans \X + \sigma_\beta^{-2} \id\) and \(\Sigma^{-1} \mu = \sigma^{-2} \X^\trans \Y\); multiplying through by \(\sigma^2\) gives the posterior mean and covariance stated above.
If we take \(\lambda = \sigma^2 / \sigma_\beta^2\), then we can see that
\[ \expect{\beta | \Y} = \left(\X^\trans \X + \lambda \id \right)^{-1} \X^\trans \Y = \betahat(\lambda). \]
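As a sanity check of this correspondence, the following sketch (with arbitrary variances and data of my choosing) computes the posterior mean by completing the square and compares it to the ridge estimate with \(\lambda = \sigma^2/\sigma_\beta^2\):

```python
# Sketch: the Gaussian posterior mean (via completing the square) coincides with
# the ridge estimate at lam = sigma^2 / sigma_beta^2. Variances and data are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
N, P = 100, 3
sigma, sigma_beta = 1.0, 2.0

X = rng.normal(size=(N, P))
beta_true = rng.normal(0.0, sigma_beta, P)
Y = X @ beta_true + rng.normal(0.0, sigma, N)

# Posterior from the derivation above: Sigma^{-1} = X^T X / sigma^2 + I / sigma_beta^2
# and Sigma^{-1} mu = X^T Y / sigma^2.
Sigma_inv = X.T @ X / sigma**2 + np.eye(P) / sigma_beta**2
mu = np.linalg.solve(Sigma_inv, X.T @ Y / sigma**2)

# Ridge estimate with lam = sigma^2 / sigma_beta^2.
lam = sigma**2 / sigma_beta**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)
print(np.allclose(mu, beta_ridge))   # True: the posterior mean is the ridge estimate
```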
This gives the ridge procedure some interpretability. First of all, the use of the ridge penalty corresponds to a prior belief that \(\beta\) is not too large.
Second, the ridge penalty you use reflects the relative scale of the noise variance and prior variance in a way that makes sense:
- If \(\sigma \gg \sigma_\beta\), then the data is noisy (relative to our prior beliefs). We should not take fitting the data too seriously, and so should estimate a smaller \(\beta\) than \(\betahat\). And indeed, in this case \(\lambda\) is large, and a large \(\lambda\) shrinks the estimated coefficients towards zero.
- If \(\sigma_\beta \gg \sigma\), then we find it plausible that \(\beta\) is very large (relative to the variability in our data). We should not take our prior beliefs too seriously, and estimate a coefficient that matches \(\betahat\). And indeed, in this case, \(\lambda\) is small, and we do not shrink the coefficients much.