$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\iid}{\overset{\mathrm{IID}}{\sim}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\mybold{M}_{\X}} \newcommand{\Xcovhat}{\hat{\mybold{M}}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\betavstar}{{\betav^{*}}} \newcommand{\loss}{\mathscr{L}} \newcommand{\losshat}{\hat{\loss}} \newcommand{\f}{f} \newcommand{\fhat}{\hat{f}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\b{b} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\vhat{\hat{v}} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} 
\def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} \def\Q{\mybold{Q}} \def\eps{\varepsilon} $$

Lasso or ‘L1’ regression


Goals

  • Introduce lasso regression
    • As a complexity penalty
    • As a tuneable hierarchy of models to be selected by cross-validation
  • Show some examples comparing greedy variable selection, ridge, and lasso

Penalizing large regressors (slightly different)

Recall from last time

Recall that one perspective on regression is that we choose \(\betavhat\) to minimize the ridge or L2 loss,

\[ \betavhat_{L2}(\lambda) := \argmin{\betav}\left( RSS(\betav) + \lambda \norm{\betav}_2^2 \right), \]

and proved that \[ \betavhat_{L2}(\lambda) = \left(\X^\trans \X + \lambda \id \right)^{-1} \X^\trans \Y. \]
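As a quick sanity check of this closed form, here is a minimal numpy sketch; the data `X` and `Y` below are simulated placeholders of my own, not data from the lecture.

```python
import numpy as np

# Simulated placeholder data: N observations, P regressors.
rng = np.random.default_rng(0)
N, P = 100, 5
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=N)

def ridge_closed_form(X, Y, lam):
    """Compute (X'X + lam * I)^{-1} X'Y, solving the linear system directly."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

beta_ridge = ridge_closed_form(X, Y, lam=10.0)
print(beta_ridge)   # every entry is (generically) nonzero: a "dense" solution
```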

One problem with this might be that the solution \(\betavhat_{L2}(\lambda)\) is still “dense”, meaning that, in general, every entry of it is nonzero, and we still have to invert a \(P\times P\) matrix.

For example, consider our highly correlated regressor example from the previous lecture. The ridge regression will still include both regressors, and their coefficient estimates will still be highly negatively correlated, but both will be shrunk towards zero. Maybe it would make more sense to select only one variable to include. Let us try to think of how we can change the penalty term to achieve this.
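To make this concrete, here is a small simulation in the spirit of that example (the data-generating process below is my own illustration, not the one from the previous lecture): two nearly collinear regressors, with ridge fits for a few values of \(\lambda\).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + 0.01 * rng.normal(size=N)      # nearly identical to x1
X = np.column_stack([x1, x2])
Y = 2.0 * x1 + rng.normal(size=N)        # only x1 actually matters

for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
    print(lam, beta)
# As lambda grows, both coefficients shrink towards zero, but neither is set
# exactly to zero: ridge keeps both of the correlated regressors.
```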

A “sparse” solution is an estimator \(\betavhat\) in which many of the entries are zero — that is, an estimated regression line that does not use many of the available regressors.

In short: ridge regression estimates are not sparse. Let's try to derive an estimator that is, by changing the penalty.

A very intuitive way to produce a sparse estimate is as follows: \[ \betavhat_{L0}(\lambda) := \argmin{\betav}\left( RSS(\betav) + \lambda \sum_{p} 1\left(\beta_p \ne 0\right) \right) \quad\textrm{(practically difficult)} \]

This trades off fit to the data against a penalty for using more regressors. It makes sense, but it is very difficult to compute; in particular, the objective is highly non-convex. Bayesian statisticians do attempt to estimate models with a similar kind of penalty (so-called “spike and slab” models), but they are extremely computationally intensive and beyond the scope of this course.
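For a small number of regressors you can minimize this L0 objective by brute force, which makes the combinatorial cost clear: with \(P\) regressors there are \(2^P\) subsets to check. A sketch on simulated data of my own (the helper names and the choice \(\lambda = 5\) are illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, P, lam = 100, 8, 5.0
X = rng.normal(size=(N, P))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
Y = X @ beta_true + rng.normal(size=N)

best_obj, best_subset = np.inf, ()
# Enumerate all 2^P subsets of regressors -- this is why L0 is impractical for large P.
for k in range(P + 1):
    for subset in itertools.combinations(range(P), k):
        if k == 0:
            rss = np.sum(Y ** 2)
        else:
            beta_s, *_ = np.linalg.lstsq(X[:, subset], Y, rcond=None)
            rss = np.sum((Y - X[:, subset] @ beta_s) ** 2)
        obj = rss + lam * k
        if obj < best_obj:
            best_obj, best_subset = obj, subset

print(best_subset)   # with luck, the indices of the truly nonzero coefficients
```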

A convex approximation to the preceding loss is the L1 or Lasso loss, leading to Lasso or L1 regression:

\[ \betavhat_{L1}(\lambda) := \argmin{\betav}\left( RSS(\betav) + \lambda \sum_{p} \abs{\beta_p} \right) = \argmin{\betav}\left( RSS(\betav) + \lambda \norm{\betav}_1 \right). \]

This loss is convex (because it is the sum of two convex functions), and so is much easier to minimize. Furthermore, as \(\lambda\) grows, it does produce sparser and sparser solutions, though this may not be obvious at first.
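In practice you rarely implement the Lasso yourself. For instance, `sklearn.linear_model.Lasso` minimizes \(\frac{1}{2N} RSS(\betav) + \alpha \norm{\betav}_1\) (note the penalty is scaled differently from the notes). A sketch on simulated data, showing the solution getting sparser as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
N, P = 200, 20
X = rng.normal(size=(N, P))
beta_true = np.zeros(P)
beta_true[:3] = [4.0, -3.0, 2.0]           # only the first three regressors matter
Y = X @ beta_true + rng.normal(size=N)

for alpha in [0.01, 0.1, 0.5, 1.0, 5.0]:
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y)
    print(alpha, np.sum(fit.coef_ != 0))   # number of nonzero coefficients shrinks
```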

The Lasso produces sparse solutions

One way to see that the Lasso produces sparse solutions is to start with a very large \(\lambda\) and see what happens as it is slowly decreased.

Start with \(\lambda\) very large, so that \(\betavhat_{L1}(\lambda) = \zerov\). If we take a small step of size \(\varepsilon\) away from zero in entry \(\beta_p\), then \(\lambda \norm{\betavhat}_1\) increases by \(\varepsilon \lambda\), and the RSS changes by approximately \(\varepsilon\) times the derivative of the squared error,

\[ -2 \sumn (\y_n - \betavhat_{L1}(\lambda)^\trans\xv_n) \x_{np} = -2 \sumn \reshat_n \x_{np} = -2 \sumn \y_n \x_{np} \textrm{ (because $\betavhat_{L1}(\lambda) = \zerov$)}, \]

so stepping in the most favorable direction decreases the RSS by at most \(2 \varepsilon \abs{\sumn \y_n \x_{np}}\).

As long as \(2 \abs{\sumn \y_n \x_{np}} < \lambda\) for all \(p\), this decrease cannot outweigh the \(\varepsilon \lambda\) increase in the penalty, so we cannot improve the loss by moving away from \(\zerov\). Since the loss is convex, that means \(\zerov\) is the minimum.
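You can check this numerically. Under sklearn's scaling \(\frac{1}{2N} RSS(\betav) + \alpha \norm{\betav}_1\), the condition above becomes: the all-zero solution is optimal exactly when \(\alpha \ge \max_p \abs{\sumn \y_n \x_{np}} / N\). A sketch on simulated data of my own:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
N, P = 200, 10
X = rng.normal(size=(N, P))
Y = X @ np.concatenate([[3.0, -2.0], np.zeros(P - 2)]) + rng.normal(size=N)

# Under sklearn's objective (1/(2N)) * RSS + alpha * ||beta||_1, the all-zero
# solution is optimal exactly when alpha >= max_p |sum_n y_n x_np| / N.
alpha_max = np.max(np.abs(X.T @ Y)) / N

above = Lasso(alpha=1.01 * alpha_max, fit_intercept=False).fit(X, Y)
below = Lasso(alpha=0.90 * alpha_max, fit_intercept=False).fit(X, Y)
print(np.all(above.coef_ == 0.0))   # True: no coefficient escapes zero
print(np.sum(below.coef_ != 0))     # at least one coefficient is now nonzero
```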

Eventually, as we decrease \(\lambda\), we reach \(2 \abs{\sumn \y_n \x_{np}} = \lambda\) for some \(p\). At that point, \(\beta_p\) moves away from zero as \(\lambda\) decreases further, and the \(\reshat_n\) change with it. However, until \(2 \abs{\sumn \reshat_n \x_{nq}} = \lambda\) for some other \(q\), only \(\beta_p\) will be nonzero. As \(\lambda\) decreases, more and more variables tend to be added to the model, until at \(\lambda = 0\) we recover \(\betavhat_{L1}(0) = \betavhat\), the OLS solution.
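The whole path can be computed efficiently; for example, `sklearn.linear_model.lasso_path` returns the coefficients over a grid of decreasing penalties, and you can watch the variables enter roughly one at a time. A sketch on simulated data of my own:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
N, P = 200, 10
X = rng.normal(size=(N, P))
Y = X @ np.concatenate([[3.0, -2.0, 1.0], np.zeros(P - 3)]) + rng.normal(size=N)

# alphas are returned in decreasing order; coefs has shape (P, n_alphas).
alphas, coefs, _ = lasso_path(X, Y, n_alphas=10)
for alpha, beta in zip(alphas, coefs.T):
    print(f"alpha={alpha:.3f}  nonzero regressors={np.flatnonzero(beta)}")
# Large alpha: no regressors. As alpha shrinks, regressors enter the model;
# at alpha near zero the fit approaches the OLS solution.
```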

Standardization

Just as with ridge regression, you should standardize your regressors before applying the Lasso: the penalty \(\lambda \norm{\betav}_1\) treats all coefficients as directly comparable, which only makes sense if the regressors are on a common scale.
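With sklearn, a convenient pattern is to put the scaler and the Lasso in a single pipeline, so that cross-validation standardizes inside each fold. A sketch; the data below are simulated placeholders of my own.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
N, P = 200, 10
X = rng.normal(size=(N, P)) * np.arange(1, P + 1)   # columns on very different scales
Y = X[:, 0] + rng.normal(size=N)

# StandardScaler puts every regressor on the same scale before the L1 penalty
# is applied; LassoCV then chooses the penalty by cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, Y)
print(model.named_steps["lassocv"].alpha_)   # penalty chosen by cross-validation
print(model.named_steps["lassocv"].coef_)    # coefficients on the standardized scale
```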