Influence and outliers
Goals
- Look at different kinds of sensitivity analysis
- Concern with unusual \(\xv_n\) values
- Concern with unusual \(\res_n\) values
- Leverage scores
- Differential approaches
Gross-error robustness
Essentially all of our analysis has been under the assumption that \((\xv_n, \y_n)\) — and \((\xv_\new, \y_\new)\) — are drawn IID from the same distribution. Many things go wrong with regression when this assumption is violated.
One classical concern is “data corruption.” There’s no single definition, but the idea is that a relatively small number of your datapoints have been drawn from a very different distribution than the rest of your data. This might happen because of
- Data processing errors
- Poor sampling techniques
- Adversaries trying to confound your analysis
- Rare but extremely unusual behavior
Such aberrant datapoints are called “outliers.” There is no precise definition of an outlier; typically we mean a datapoint that you would remove from the dataset if you knew it was aberrant. Nor can there be a precise definition, because in principle you might want to ask a question that includes an attempt to model the aberrant data.
The best defense against outliers is good EDA. That means lots of sanity checks and careful data cleaning. Specifically, one should look for:
- Unusual \(\y_n\) values (e.g. with residual plots)
- Unusual \(\xv_n\) values (e.g. with leverage scores)
- Influential combinations of \(\xv_n\) and \(\res_n\) (e.g. with the influence function)
Unusual responses (look at residuals)
Recall that \(\betavhat = (\X^\trans \X)^{-1} \X^\trans \Y\). This can be written as a weighted sum of \(\y_n\):
\[ \betavhat = (\X^\trans \X)^{-1} \X^\trans \Y = \sumn (\X^\trans \X)^{-1} \xv_n \y_n =: \sumn \omegav_n \y_n. \]
It is clear that we can produce arbitrarily large changes in \(\betavhat\) by producing arbitrarily large changes in only a single \(\y_n\).
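To make this concrete, here is a minimal numpy sketch (on simulated data, with purely illustrative names and sizes) confirming that \(\betavhat\) is this weighted sum of the responses, and that corrupting a single \(\y_n\) shifts \(\betavhat\) by exactly \(\omegav_n\) times the corruption:

```python
import numpy as np

# Simulated data; the variable names and sizes are purely illustrative.
rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
omega = X @ XtX_inv                      # row n is omega_n = (X^T X)^{-1} x_n
betahat = omega.T @ Y                    # the weighted sum of the responses
assert np.allclose(betahat, np.linalg.lstsq(X, Y, rcond=None)[0])

# Corrupting a single response shifts betahat by omega_n times the corruption.
n, corruption = 17, 1000.0
Y_corrupt = Y.copy()
Y_corrupt[n] += corruption
betahat_corrupt = XtX_inv @ X.T @ Y_corrupt
assert np.allclose(betahat_corrupt - betahat, corruption * omega[n])
```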
Outlier \(\y_n\) values will also tend to have outlier residuals. To see this, let’s suppose there is one aberrant value, \(\y_*\), which is very large, and which we enumerate separately from the well-behaved \(\y_n\). We have
\[ \begin{aligned} \reshat_* ={}& \y_* - \yhat_* \\={}& \y_* - \xv_*^\trans \betahat \\={}& \y_* - \xv_*^\trans (\X^\trans \X + \xv_* \xv_*^\trans)^{-1} (\X^\trans \Y + \xv_* \y_*) \\\approx{}& \y_* - \xv_*^\trans (\X^\trans \X)^{-1} \xv_* \y_* \\={}& \left( 1 - \xv_*^\trans (\X^\trans \X)^{-1} \xv_* \right) \y_*. \end{aligned} \]
If we assume that \(\xv_*\) is not an outlier (i.e. the response is an outlier but the regressor is not), then we expect \(\xv_*^\trans (\X^\trans \X)^{-1} \xv_* \rightarrow 0\), since \((\X^\trans \X)^{-1}\) is of order \(1/N\). This means that although \(\y_*\) causes \(\betahat\) to grow very large in an attempt to fit it, its residual remains large, with the same sign as \(\y_*\).
This means you may be able to identify outlier responses by looking at a residual plot, e.g., a histogram of residuals, and seeing if any fitted residuals are atypical.
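For example, here is a minimal sketch of such a check, assuming numpy and matplotlib and using simulated data with one corrupted response:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data with one corrupted (outlier) response; names are illustrative.
rng = np.random.default_rng(1)
N, P = 200, 3
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
Y[0] += 25.0                             # the corrupted response

betahat = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ betahat

# A histogram of fitted residuals; the corrupted point sits far from the rest.
plt.hist(resid, bins=30)
plt.xlabel("fitted residual")
plt.show()

# Or simply list the most atypical residuals without plotting.
print(np.argsort(np.abs(resid))[-3:])    # indices of the three largest |residuals|
```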
Unusual regressors (look at leverage scores)
Datapoints with unusually large regressor values are called “high leverage points,” since small changes in \(\betav\) produce large changes in the fitted values at those points. Since \(\xv_n\) is a vector, measuring what it means for \(\xv_n\) to be an outlier is a little more subtle than doing so for a scalar like \(\y_n\). A sensible approach is to measure the size of \(\xv_n\) relative to \(\X^\trans \X\), which (up to a factor of \(1/N\)) estimates the spread of the \(\xv_n\) values (if they are centered, it is an estimate of their covariance). We define the “leverage score” \[ h_n := \xv_n^\trans (\X^\trans \X)^{-1} \xv_n, \]
and we can check for unusual \(\xv_n\) by looking for high leverage scores.
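For instance, a minimal numpy sketch (simulated regressors; illustrative names) that computes all of the leverage scores at once from this definition:

```python
import numpy as np

# Simulated regressors; leverage scores depend only on X, not on Y.
rng = np.random.default_rng(2)
N, P = 200, 3
X = rng.normal(size=(N, P))

# h_n = x_n^T (X^T X)^{-1} x_n, computed for every n at once.
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("np,pq,nq->n", X, XtX_inv, X)

print(np.sort(h)[-5:])                   # the five largest leverage scores
```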
Note that \(h_n\) is the \(n\)–th diagonal entry of the “hat” matrix \(\proj{\X} = \X (\X^\trans \X)^{-1} \X^\trans\), so called because it “puts the hat on \(\Y\)”, since \(\Yhat = \proj{\X} \Y\). There are a few useful consequences of this fact, which you prove in your homework:
- \(0 \le h_n \le 1\)
- \(\sumn h_n = P\)
- At least some \(h_n > 0\).
Additionally, we can see that \(\frac{d\yhat_n}{d \y_n} = h_n\). (One can use this fact to define “leverage scores” in settings beyond linear regression.)
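These facts are easy to check numerically. Here is a small numpy sketch (simulated data, illustrative names) verifying the diagonal-of-the-hat-matrix identity, the sum \(\sumn h_n = P\), and the derivative \(\frac{d\yhat_n}{d \y_n} = h_n\):

```python
import numpy as np

# Simulated data; names are illustrative.
rng = np.random.default_rng(3)
N, P = 100, 4
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T              # the hat matrix
h = np.diag(H)                                    # leverage scores
assert np.allclose(h, np.einsum("np,pq,nq->n", X, np.linalg.inv(X.T @ X), X))
assert np.all(h >= -1e-12) and np.all(h <= 1 + 1e-12)
assert np.isclose(h.sum(), P)

# d yhat_n / d y_n = h_n; since Yhat = H Y is linear, a finite difference is exact.
n, eps = 7, 1e-6
Y_plus = Y.copy()
Y_plus[n] += eps
assert np.isclose((H @ Y_plus - H @ Y)[n] / eps, h[n])
```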
Putting this together, we can see that
- Since \(\sumn h_n = P\) and \(0 \le h_n \le 1\), not too many leverage scores can be large.
- On average, a typical \(h_n \approx P / N\) if the data is well-behaved.
- If a leverage score is large, it means that the value of \(\y_n\) has a high influence on its own fit, \(\yhat_n\).
- If a leverage score is large relative to its typical value \(P / N\), then \(\xv_n\) is large relative to the estimated “covariance” \(\meann \xv_n \xv_n^\trans = \frac{1}{N} \X^\trans \X\).
Suppose now we have an outlier \(\delta \xv_*\), where \(\delta\) is very large, and ask what happens to \(\betavhat\). Without loss of generality, take \(\xv_*\) to be a unit vector. We have
\[ \begin{aligned} \betavhat ={}& (\X^\trans \X + \delta^2 \xv_* \xv_*^\trans)^{-1} (\X^\trans \Y + \delta \xv_* \y_*) \\ ={}& \left(\frac{1}{\delta} \X^\trans \X + \delta \xv_* \xv_*^\trans\right)^{-1} \left(\frac{1}{\delta} \X^\trans \Y + \xv_* \y_*\right) \\ \approx{}& \delta^{-1} \xv_* \xv_*^\trans \xv_* \y_* \\ \approx{}& \delta^{-1} \xv_* \y_*. \end{aligned} \]
Effectively, \(\betavhat\) ends up fitting only the outlier datapoint, and \(\yhat_* = \delta \xv_*^\trans \betavhat \approx \y_*\). Note that, in this case, we have \(h_* \approx 1\).
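A small numerical illustration of this (a numpy sketch on simulated data; \(\delta\), \(\xv_*\), and \(\y_*\) are arbitrary illustrative choices):

```python
import numpy as np

# Simulated data plus one point with a huge regressor; delta, x_star, y_star
# are arbitrary illustrative choices.
rng = np.random.default_rng(4)
N, P = 100, 3
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

delta, y_star = 1e4, -50.0
x_star = np.zeros(P)
x_star[0] = 1.0                                   # a unit vector
X_aug = np.vstack([X, delta * x_star])
Y_aug = np.append(Y, y_star)

betahat = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]
H = X_aug @ np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T

print(x_star @ betahat, y_star / delta)           # component along x_star ~ y_star / delta
print((delta * x_star) @ betahat, y_star)         # fitted value at the outlier ~ y_star
print(H[-1, -1])                                  # its leverage is essentially 1
```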
Data weights
The above results rely heavily on the special structure of linear regression: e.g. the fact that \(\betahat\) is a linear combination of \(\Y\), and that \(\Yhat\) is a projection of \(\Y\). In more general settings such results are not available. In order to motivate the use of such diagnostics in more general settings (not covered in this class), let me introduce a slightly different approach based on derivatives.
Suppose we assign each datapoint a weight, \(\w_n\), and write
\[ \betavhat(\wv) = \argmin{\betav} \sumn \w_n (\y_n - \xv_n^\trans \betav)^2. \]
I have written \(\betavhat(\wv)\) because the optimal solution depends on the vector of weights, \(\wv = (\w_1, \ldots, \w_N)^\trans\). When \(\wv = \onev\), we recover the original problem. When we set one of the entries to zero, we remove that datapoint from the problem. Using this, we can approximate the effect of removing a datapoint using the first-order Taylor series expansion:
\[ \betahat_{-n} \approx \betahat_{-n}^{linear} = \betahat + \frac{\partial \betahat(\wv)}{\partial \w_n}\vert_{\w_n=1} (0 - 1). \]
One can show that
\[ \frac{\partial \betahat(\wv)}{\partial \w_n}\vert_{\w_n=1} = (\X^\trans \X)^{-1} \xv_n \reshat_n. \]
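One can verify this formula numerically. Below is a minimal numpy sketch, on simulated data with illustrative names, comparing it to a finite-difference derivative of the weighted least squares solution:

```python
import numpy as np

# Simulated data; names are illustrative.
rng = np.random.default_rng(5)
N, P = 80, 3
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

def betahat_w(w):
    """Weighted least squares: argmin_beta sum_n w_n (y_n - x_n^T beta)^2."""
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * Y))

w = np.ones(N)
betahat = betahat_w(w)
resid = Y - X @ betahat

# Analytic derivative at w = 1 versus a finite difference in w_n.
n, eps = 11, 1e-6
analytic = np.linalg.inv(X.T @ X) @ X[n] * resid[n]
w_plus = w.copy()
w_plus[n] += eps
finite_diff = (betahat_w(w_plus) - betahat) / eps
print(np.max(np.abs(analytic - finite_diff)))     # should be very small
```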
Note that the Taylor series is a good approximation to the exact formula (given in the homework) when \(h_n \ll 1\), which is expected when \(h_n\) goes to zero at rate \(1/N\):
\[ \betahat_{-n} = \betahat - (\X^\trans \X)^{-1} \xv_n \frac{\reshat_n}{1 - h_n} \approx \betahat - (\X^\trans \X)^{-1} \xv_n \reshat_n = \betahat_{-n}^{linear}. \]
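The three quantities are easy to compare numerically; here is a small numpy sketch (simulated data, illustrative names) contrasting the exact leave-one-out refit, the exact formula above, and the linear approximation:

```python
import numpy as np

# Simulated data; names are illustrative.
rng = np.random.default_rng(6)
N, P = 80, 3
X = rng.normal(size=(N, P))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ Y
resid = Y - X @ betahat
h = np.einsum("np,pq,nq->n", X, XtX_inv, X)

n = 5
# Exact leave-one-out fit, by refitting without row n.
beta_loo = np.linalg.lstsq(np.delete(X, n, axis=0), np.delete(Y, n), rcond=None)[0]
# The exact formula and its linear (Taylor) approximation.
beta_exact = betahat - XtX_inv @ X[n] * (resid[n] / (1 - h[n]))
beta_linear = betahat - XtX_inv @ X[n] * resid[n]

print(np.max(np.abs(beta_loo - beta_exact)))      # ~ machine precision
print(np.max(np.abs(beta_loo - beta_linear)))     # small, since h_n << 1 here
```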
In more complicated nonlinear problems, the exact formula is unavailable, but the linear approximation can still typically be computed.
From this (or the exact formula), we can see that the effect of extreme values on \(\betahat\) is actually a product of both \(\reshat_n\) and \(\xv_n\). Large residuals will not have an effect when \(\xv_n = \zerov\), and outlier \(\xv_n\) will not have an effect when \(\reshat_n = 0\).
Rank-one updates for linear regression (Sherman–Morrison)
A famous result, which you prove in your homework, relates the effect of leaving out a single datapoint to leverage scores. That proof uses the Sherman–Morrison formula, which I now derive. Let \(U\) and \(V\) be \(N \times K\) matrices, and \(\id_N\) and \(\id_K\) the \(N\times N\) and \(K \times K\) identity matrices, respectively. Using the facts that
\[ \begin{aligned} (\id_N + U V^\trans)^{-1} (\id_N + U V^\trans) = \id_N \quad\Rightarrow\quad& (\id_N + U V^\trans)^{-1} = \id_N - (\id_N + U V^\trans)^{-1} U V^\trans & \textrm{(i)}\\ (\id_N + U V^\trans) U = U (\id_K + V^\trans U) \quad\Rightarrow\quad& U (\id_K + V^\trans U )^{-1} = (\id_N + U V^\trans)^{-1} U & \textrm{(ii)} \end{aligned} \]
we have that
\[ \begin{aligned} (\id_N + U V^\trans)^{-1} ={}& \id_N - (\id_N + U V^\trans)^{-1} U V^\trans & \textrm{by (i)} \\ ={}& \id_N - U (\id_K + V^\trans U )^{-1} V^\trans & \textrm{by (ii)}. \end{aligned} \]
We can use this to prove the Sherman–Morrison formula as a special case when \(K=1\).
\[ \begin{aligned} (A + \uv \vv^\trans)^{-1} ={}& (A (\id_N + A^{-1} \uv \vv^\trans))^{-1} \\={}& (\id_N + (A^{-1} \uv) \vv^\trans )^{-1} A^{-1} \\={}& (\id_N - A^{-1} \uv (1 + \vv^\trans A^{-1} \uv)^{-1} \vv^\trans ) A^{-1} \\={}& A^{-1} - \frac{A^{-1} \uv \vv^\trans A^{-1}}{1 + \vv^\trans A^{-1} \uv}. \end{aligned} \]
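As a quick numerical sanity check of this formula, here is a minimal numpy sketch with an arbitrary well-conditioned \(A\) and random \(\uv\), \(\vv\):

```python
import numpy as np

# Random, well-conditioned A and arbitrary u, v; purely illustrative.
rng = np.random.default_rng(7)
P = 4
A = rng.normal(size=(P, P))
A = A @ A.T + P * np.eye(P)
u = rng.normal(size=P)
v = rng.normal(size=P)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.max(np.abs(lhs - rhs)))                  # ~ machine precision
```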