$$ \newcommand{\mybold}[1]{\boldsymbol{#1}} \newcommand{\trans}{\intercal} \newcommand{\norm}[1]{\left\Vert#1\right\Vert} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\bbr}{\mathbb{R}} \newcommand{\bbz}{\mathbb{Z}} \newcommand{\bbc}{\mathbb{C}} \newcommand{\gauss}[1]{\mathcal{N}\left(#1\right)} \newcommand{\chisq}[1]{\mathcal{\chi}^2_{#1}} \newcommand{\studentt}[1]{\mathrm{StudentT}_{#1}} \newcommand{\fdist}[2]{\mathrm{FDist}_{#1,#2}} \newcommand{\argmin}[1]{\underset{#1}{\mathrm{argmin}}\,} \newcommand{\projop}[1]{\underset{#1}{\mathrm{Proj}}\,} \newcommand{\proj}[1]{\underset{#1}{\mybold{P}}} \newcommand{\expect}[1]{\mathbb{E}\left[#1\right]} \newcommand{\prob}[1]{\mathbb{P}\left(#1\right)} \newcommand{\dens}[1]{\mathit{p}\left(#1\right)} \newcommand{\var}[1]{\mathrm{Var}\left(#1\right)} \newcommand{\cov}[1]{\mathrm{Cov}\left(#1\right)} \newcommand{\sumn}{\sum_{n=1}^N} \newcommand{\meann}{\frac{1}{N} \sumn} \newcommand{\cltn}{\frac{1}{\sqrt{N}} \sumn} \newcommand{\trace}[1]{\mathrm{trace}\left(#1\right)} \newcommand{\diag}[1]{\mathrm{Diag}\left(#1\right)} \newcommand{\grad}[2]{\nabla_{#1} \left. #2 \right.} \newcommand{\gradat}[3]{\nabla_{#1} \left. #2 \right|_{#3}} \newcommand{\fracat}[3]{\left. \frac{#1}{#2} \right|_{#3}} \newcommand{\W}{\mybold{W}} \newcommand{\w}{w} \newcommand{\wbar}{\bar{w}} \newcommand{\wv}{\mybold{w}} \newcommand{\X}{\mybold{X}} \newcommand{\x}{x} \newcommand{\xbar}{\bar{x}} \newcommand{\xv}{\mybold{x}} \newcommand{\Xcov}{\Sigmam_{\X}} \newcommand{\Xcovhat}{\hat{\Sigmam}_{\X}} \newcommand{\Covsand}{\Sigmam_{\mathrm{sand}}} \newcommand{\Covsandhat}{\hat{\Sigmam}_{\mathrm{sand}}} \newcommand{\Z}{\mybold{Z}} \newcommand{\z}{z} \newcommand{\zv}{\mybold{z}} \newcommand{\zbar}{\bar{z}} \newcommand{\Y}{\mybold{Y}} \newcommand{\Yhat}{\hat{\Y}} \newcommand{\y}{y} \newcommand{\yv}{\mybold{y}} \newcommand{\yhat}{\hat{\y}} \newcommand{\ybar}{\bar{y}} \newcommand{\res}{\varepsilon} \newcommand{\resv}{\mybold{\res}} \newcommand{\resvhat}{\hat{\mybold{\res}}} \newcommand{\reshat}{\hat{\res}} \newcommand{\betav}{\mybold{\beta}} \newcommand{\betavhat}{\hat{\betav}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\betastar}{{\beta^{*}}} \newcommand{\bv}{\mybold{\b}} \newcommand{\bvhat}{\hat{\bv}} \newcommand{\alphav}{\mybold{\alpha}} \newcommand{\alphavhat}{\hat{\av}} \newcommand{\alphahat}{\hat{\alpha}} \newcommand{\omegav}{\mybold{\omega}} \newcommand{\gv}{\mybold{\gamma}} \newcommand{\gvhat}{\hat{\gv}} \newcommand{\ghat}{\hat{\gamma}} \newcommand{\hv}{\mybold{\h}} \newcommand{\hvhat}{\hat{\hv}} \newcommand{\hhat}{\hat{\h}} \newcommand{\gammav}{\mybold{\gamma}} \newcommand{\gammavhat}{\hat{\gammav}} \newcommand{\gammahat}{\hat{\gamma}} \newcommand{\new}{\mathrm{new}} \newcommand{\zerov}{\mybold{0}} \newcommand{\onev}{\mybold{1}} \newcommand{\id}{\mybold{I}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\etav}{\mybold{\eta}} \newcommand{\muv}{\mybold{\mu}} \newcommand{\Sigmam}{\mybold{\Sigma}} \newcommand{\rdom}[1]{\mathbb{R}^{#1}} \newcommand{\RV}[1]{\tilde{#1}} \def\A{\mybold{A}} \def\A{\mybold{A}} \def\av{\mybold{a}} \def\a{a} \def\B{\mybold{B}} \def\S{\mybold{S}} \def\sv{\mybold{s}} \def\s{s} \def\R{\mybold{R}} \def\rv{\mybold{r}} \def\r{r} \def\V{\mybold{V}} \def\vv{\mybold{v}} \def\v{v} \def\U{\mybold{U}} \def\uv{\mybold{u}} \def\u{u} \def\W{\mybold{W}} \def\wv{\mybold{w}} \def\w{w} \def\tv{\mybold{t}} \def\t{t} \def\Sc{\mathcal{S}} \def\ev{\mybold{e}} \def\Lammat{\mybold{\Lambda}} $$

Transformations of regressors: Some payoffs from the linear algebra perspective.


Goals

  • Use our technical results to answer some questions about prediction:
    • How can we form new predictors for prediction problems?

Prediction in the bodyfat example

Recall our bodyfat example. Suppose we’ve chosen a few variables to use for prediction: Height, Weight, and Abdomen.

reg <- lm(bodyfat ~ Abdomen + Height + Weight, bodyfat_df)
print(reg$coefficients)
(Intercept)     Abdomen      Height      Weight 
-36.6147193   0.9515631  -0.1270307  -0.1307606 
print(sprintf("Error: %f", mean(reg$residuals^2)))
[1] "Error: 19.456161"

Question 1

Noting that all three of these variables are on different scales, you might think you would get a better fit by normalizing them. Let’s see what happens:

bodyfat_df <- bodyfat_df %>%
    mutate(height_norm=(Height - mean(Height)) / sd(Height),
           weight_norm=(Weight - mean(Weight)) / sd(Weight),
           abdomen_norm=(Abdomen - mean(Abdomen)) / sd(Abdomen))
reg_norm <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm, bodyfat_df)
print(reg_norm$coefficients)
 (Intercept) abdomen_norm  height_norm  weight_norm 
   19.150794    10.260778    -0.465295    -3.842945 
print(sprintf("Error: %f", mean(reg_norm$residuals^2)))
[1] "Error: 19.456161"

Our coefficients changed, but our fitted error didn’t change at all! Why not?

Question 2

Suppose we think, okay, maybe it’s the difference between normalized height and weight that helps us predict.

bodyfat_df <- bodyfat_df %>%
    mutate(hw_diff = height_norm - weight_norm)
reg_diff <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + hw_diff, bodyfat_df)
print(reg_diff$coefficients)
 (Intercept) abdomen_norm  height_norm  weight_norm      hw_diff 
   19.150794    10.260778    -0.465295    -3.842945           NA 
print(sprintf("Error: %f", mean(reg_diff$residuals^2)))
[1] "Error: 19.456161"

Now, our fitted error didn’t change at all, but our difference coefficient wasn’t even estimated. What is going on?

Question 3

What about the ratio of weight to height?

bodyfat_df <- bodyfat_df %>%
    mutate(hw_ratio = height_norm / weight_norm)
reg_ratio <- lm(bodyfat ~ abdomen_norm + height_norm + weight_norm + hw_ratio, bodyfat_df)
print(reg_ratio$coefficients)
 (Intercept) abdomen_norm  height_norm  weight_norm     hw_ratio 
19.149475485 10.271751693 -0.478683698 -3.848561820  0.009057317 
print(sprintf("Error: %f", mean(reg_ratio$residuals^2)))
[1] "Error: 19.423560"

Our fitted error finally changed, and we could estimate this coefficient. What’s different between this example and the previous one?

Question 4

Let \(\x_n = (\textrm{Abdomen}_n, \textrm{Height}_n, \textrm{Weight}_n, \textrm{Wrist}_n ...)\) contain all the measured variables in the dataset.

A colleague suggests a research project where you improve your fit by regressing \(\y_n \sim \z_n\) for new regressors \(\z_n\) of the form \(\z_n = \A \x_n\) where the matrix \(\A\) is chosen optimally.

A different colleague suggests a research project where you try the same thing, but let \(\z_n = f(\x_n)\) for any function \(f\), where you optimize over all functions.

What do you think of these ideas?

Answer to Question 1

Linear combinations of variables in simple least squares

To answer Question 1, let’s look at a slightly simpler example, where we perform simple least squares on Height alone.

reg_height <- lm(bodyfat ~ Height, bodyfat_df)
print(reg_height$coefficients)
(Intercept)      Height 
 33.4944938  -0.2044753 
print(sprintf("Error: %f", mean(reg_height$residuals^2)))
[1] "Error: 69.199176"
reg_height_norm <- lm(bodyfat ~ height_norm, bodyfat_df)
print(reg_height_norm$coefficients)
(Intercept) height_norm 
 19.1507937  -0.7489636 
print(sprintf("Error: %f", mean(reg_height_norm$residuals^2)))
[1] "Error: 69.199176"

Why do these two have different coefficients, but the same fit? The answer comes from our projection result. Let’s let \(\x_n = \textrm{Height}_n\), and \(\z_n = \textrm{height\_norm}_n\). Recall that R includes a constant in the regression by default, so our two regressions are:

\[ \begin{aligned} \textrm{Height: } && \y_n &= \beta_0 + \beta_1 \x_n + \res_n \\ \textrm{height\_norm: } && \y_n &= \gamma_0 + \gamma_1 \z_n + \nu_n. \end{aligned} \]

What’s the relationship between these two regressions? Well, define \(\overline{x} := \meann \x_n\) and \(\sigmahat_x := \sqrt{\meann (\x_n - \overline{x})^2}\). In our case, \(\sigmahat_x > 0\), and we have set

\[ \z_n = \frac{\x_n - \overline{x}}{\sigmahat_x} = \frac{1}{\sigmahat_x} \x_n - \frac{\overline{x}}{\sigmahat_x}. \tag{1}\]

That means we can write

\[ \begin{aligned} \gamma_0 + \gamma_1 \z_n + \nu_n ={}& \gamma_0 + \gamma_1 \left( \frac{1}{\sigmahat_x} \x_n - \frac{\overline{x}}{\sigmahat_x} \right) + \nu_n \\ ={}& \left(\gamma_0 - \frac{\overline{x}}{\sigmahat_x} \gamma_1 \right) + \left( \frac{\gamma_1}{\sigmahat_x}\right) \x_n + \nu_n. \end{aligned} \]

By identifying

\[ \gamma_0 - \frac{\overline{x}}{\sigmahat_x} \gamma_1 = \beta_0 \quad\textrm{and}\quad \frac{\gamma_1}{\sigmahat_x} = \beta_1, \]

we see that the two regressions are exactly the same! From this, two conclusions follow:

  • It is impossible for you to get a better fit with one than the other.
  • In general the coefficients will be different.

In fact, we can write down the rule to convert between the two. Let’s check that it works:

xbar <- mean(bodyfat_df$Height)
sigmahat <- sd(bodyfat_df$Height)
gammahat_0 <- reg_height_norm$coefficients["(Intercept)"]
gammahat_1 <- reg_height_norm$coefficients["height_norm"]
betahat_0 <- reg_height$coefficients["(Intercept)"]
betahat_1 <- reg_height$coefficients["Height"]

print("Intercept: ")
[1] "Intercept: "
cat(betahat_0, " = ", gammahat_0 - gammahat_1 * xbar / sigmahat, "\n")
33.49449  =  33.49449 
cat(betahat_1, " = ", gammahat_1  / sigmahat, "\n")
-0.2044753  =  -0.2044753 

Linear combinations of variables in matrix form

To extend what we just did, let’s put our result in matrix notation. First, write

\[ \xv_n = \begin{pmatrix} 1 \\ \x_n \end{pmatrix} \quad\textrm{and}\quad \zv_n = \begin{pmatrix} 1 \\ \z_n \end{pmatrix}. \]

In this notation, we can write Equation 1 as

\[ \zv_n = \begin{pmatrix} 1 & 0 \\ -\frac{\overline{x}}{\sigmahat_x} & \frac{1}{\sigmahat_x} \end{pmatrix} \xv_n =: \A \xv_n. \]

We can see that \(\A\) is invertible whenever \(\sigmahat_x > 0\). Using the matrix \(\A\), we can write \(\Z = \X \A^\trans\), so

\[ \Y = \Z \gamma + \etav = \X \A^\trans \gamma + \etav = \X \beta + \resv, \]

which gives

\[ \etav = \resv \quad\Leftrightarrow \quad \A^\trans \gamma = \beta \quad\Leftrightarrow \quad \gamma = (\A^\trans)^{-1} \beta . \]

We can confirm that this condition does indeed hold at the optimum of each regression:

# The matrix A defined above, mapping (1, Height) to (1, height_norm).
a_mat <- matrix(NA, 2, 2)
a_mat[1, 1] <- 1
a_mat[1, 2] <- 0
a_mat[2, 1] <- -xbar / sigmahat
a_mat[2, 2] <- 1 / sigmahat
# Converting the normalized coefficients with t(A) recovers the raw-scale ones.
print(t(a_mat) %*% reg_height_norm$coefficients %>% as.numeric())
[1] 33.4944938 -0.2044753
print(reg_height$coefficients)
(Intercept)      Height 
 33.4944938  -0.2044753 
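We can also go the other direction. The following is a quick sketch (reusing a_mat, reg_height, and reg_height_norm from above) that recovers the normalized-regression coefficients from the raw-scale ones via \(\gammahat = (\A^\trans)^{-1}\betahat\):

# Solve t(A) gamma = beta for gamma rather than forming the inverse explicitly.
print(solve(t(a_mat), reg_height$coefficients) %>% as.numeric())
print(reg_height_norm$coefficients)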

Note that we were able to do this only because we included an intercept in the regression! In fact, we can show that we get different fits if we don’t include the intercept:

reg_height_noint <- lm(bodyfat ~ Height - 1, bodyfat_df)
print(sprintf("Error: %f", mean(reg_height_noint$residuals^2)))
[1] "Error: 72.237550"
reg_height_norm_noint <- lm(bodyfat ~ height_norm - 1, bodyfat_df)
print(sprintf("Error: %f", mean(reg_height_norm_noint$residuals^2)))
[1] "Error: 435.952073"

Linear combinations of variables in general

Note that the argument we made in the special case above actually holds in general. Let’s move to \(P\)–dimensional regression, but otherwise retain the notation \(\y_n \sim \z_n^\trans\gamma\) and \(\y_n \sim \x_n^\trans \beta\). If we can find an invertible \(\A\) such that \(\z_n = \A \x_n\), then we still have

\[ \Y = \Z \gamma + \etav = \X \A^\trans \gamma + \etav = \X \beta + \resv, \]

\[ \etav = \resv \quad\Leftrightarrow \quad \A^\trans \gamma = \beta \quad\Leftrightarrow \quad \gamma = (\A^\trans)^{-1} \beta . \]

How can we recognize a linear transformation from \(\x_n\) to \(\z_n\)?
It’s enough for each entry of \(\z_n\) to be a linear transform \(\z_{np}= \a_p^\trans \x_n\), since we can then just stack the \(\a_p^\trans\) vectors to form \(\A\).

  • Linear transformations of \(\x_n\) always look like \(\z_{np} = \a_{p1} \x_{n1} + \ldots + \a_{pP} \x_{nP}\).
  • Entries that are left unchanged correspond to the (trivial) identity transformation \(\a_{p} = (0, 0, \ldots, 0, 1, 0, \ldots, 0)\), a standard basis vector.

Note that adding a constant is an affine, not a linear, transformation. However, it can be written as a linear transformation if your model includes an intercept, or more generally if some linear combination of the entries of \(\x_n\) is always equal to the same constant.

Note that it is not always easy to see whether a linear transformation is invertible! We will talk about that problem shortly.

In this way, one can see that the normalization of Question 1 is a linear transformation.
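To make this concrete for Question 1, here is a sketch (the name a_mat_q1 below is just illustrative) of the \(4 \times 4\) matrix \(\A\) that maps \((1, \textrm{Abdomen}_n, \textrm{Height}_n, \textrm{Weight}_n)\) to \((1, \textrm{abdomen\_norm}_n, \textrm{height\_norm}_n, \textrm{weight\_norm}_n)\). Converting the normalized coefficients with \(\A^\trans\) should recover the coefficients of the original regression:

# Means and standard deviations used in the normalization above.
means <- sapply(bodyfat_df[c("Abdomen", "Height", "Weight")], mean)
sds <- sapply(bodyfat_df[c("Abdomen", "Height", "Weight")], sd)
# The first row handles the intercept; the remaining rows normalize each variable.
a_mat_q1 <- rbind(c(1, 0, 0, 0),
                  cbind(-means / sds, diag(1 / sds)))
# t(A) %*% gammahat should match the coefficients of the raw-scale regression.
print(t(a_mat_q1) %*% reg_norm$coefficients %>% as.numeric())
print(reg$coefficients)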

Linear transformations and column spans

Finally, recall our projection result that states

\[ \Yhat = \proj{\S_\X} \Y = \X (\X^\trans \X)^{-1} \X^\trans \Y, \]

where \(\S_\X\) is the space of linear combinations of the columns of \(\X\). If \(\A\) is invertible, then the column span of \(\Z = \X \A^\trans\) is equal to the column span of \(\X\). Consequently, a regression on an invertible linear combination of regressors cannot affect the fit.

Although this intuition is clear enough if you’re comfortable with linear algebra, you can also check directly that

\[ \begin{aligned} \proj{\S_\Z} \Y ={}& \Z (\Z^\trans \Z)^{-1} \Z^\trans \Y \\ ={}& \X \A^\trans (\A \X^\trans \X \A^\trans)^{-1} \A \X^\trans \Y \\ ={}& \X \A^\trans (\A^\trans)^{-1} (\X^\trans \X)^{-1} \A^{-1} \A \X^\trans \Y \\ ={}& \X (\X^\trans \X)^{-1} \X^\trans \Y \\ ={}& \proj{\S_\X} \Y. \end{aligned} \]
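As a quick sanity check (a sketch using the regressions already fit above), the fitted values from the raw and normalized regressions should agree up to floating-point error:

# Invertible linear transformations of the regressors leave the fit unchanged.
print(max(abs(fitted(reg) - fitted(reg_norm))))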

Answer to Question 2

Redundant variables in simple regression

Once again, it will be useful to start with a simpler example. Suppose you have a regression \(\y_n \sim \xv_n^\trans \beta\), with \(\xv_n = (1, \x_n)^\trans\), where \(\sigmahat_x > 0\) (as defined above). Now suppose you want to run a new regression \(\y_n \sim \zv_n^\trans \gamma\) with

\[ \zv_n = \begin{pmatrix} 1 \\ \x_n \\ \x_n - 1 \end{pmatrix}. \]

Note that the new regression is the same as

\[ \begin{aligned} \y_n ={}& \gamma_1 + \gamma_2 \x_n + \gamma_3 (\x_n - 1) + \eta_n \\ ={}& (\gamma_1 - \gamma_3) + (\gamma_2 + \gamma_3) \x_n + \eta_n \\ ={}& \beta_1 + \beta_2 \x_n + \res_n. \end{aligned} \]

We achieve the same fit — \(\res_n = \eta_n\) for all \(n\) — whenever

\[ \gamma_1 - \gamma_3 = \beta_1 \quad\textrm{and}\quad \gamma_2 + \gamma_3 = \beta_2. \]

By the arguments above, the two regressions must have the same optimal fit. But in this case, there is actually a whole infinite family of \(\gammav\) (a one-dimensional affine set) that corresponds to each \(\betav\). In particular, for any \(a\), we can take

\[ \begin{aligned} \gamma_1 &\rightarrow \gamma_1 + a \\ \gamma_2 &\rightarrow \gamma_2 - a \\ \gamma_3 &\rightarrow \gamma_3 + a, \end{aligned} \]

and achieve exactly the same fit. In other words, the optimal fit \(\Yhat\) is the same for the two regressions, but the optimal parameter

\[ \gammahat := \argmin{\gamma} \meann \eta_n^2 \]

is not uniquely defined; there is a whole family of coefficients that achieve the same optimal fit.
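Here is a small illustration of this non-uniqueness, using made-up data rather than the bodyfat dataset: with the redundant regressors \((1, \x_n, \x_n - 1)\), shifting \(\gammav\) along the direction \((1, -1, 1)\) leaves the fitted values untouched.

set.seed(42)                          # toy data for illustration only
x_toy <- rnorm(10)
z_toy <- cbind(1, x_toy, x_toy - 1)   # redundant: third column = second - first
gamma_a <- c(2, 3, 1)
gamma_b <- gamma_a + 5 * c(1, -1, 1)  # shift along the redundant direction
# The two coefficient vectors give (numerically) identical fitted values.
print(max(abs(z_toy %*% gamma_a - z_toy %*% gamma_b)))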

Redundant variables in matrix form

How can we understand this in matrix form? In this case, we can write the third element of the \(\zv_n\) regressor as a linear combination of the other two:

\[ \z_{n3} = \x_n - 1 = \z_{n2} - \z_{n1} \quad\Leftrightarrow\quad \z_{n3} - \z_{n2} + \z_{n1} = 0 \quad\Leftrightarrow\quad \zv_n^\trans \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix} =: \zv_n^\trans \vv = 0. \]

Since this is true for every row, we have that \(\Z \vv = \zerov\) as well, and so

\[ \vv^\trans (\Z^\trans \Z) \vv = 0. \]

In other words, \(\vv\) is a non-zero vector in the nullspace of \(\Z^\trans \Z\). It follows that \(\Z^\trans \Z\) is not invertible, and the OLS coefficient \((\Z^\trans\Z)^{-1} \Z^\trans \Y\) is not defined!

But that does not prevent us from finding the optimal projection. That is, we can still find

\[ \textrm{Well-defined: }\quad \min_{\gamma} \norm{\Y - \Z \gamma}_2^2 \quad\quad\quad \quad\quad\quad \textrm{Ill-defined: }\quad \argmin{\gamma} \norm{\Y - \Z \gamma}_2^2. \]

Specifically,

\[ \min_{\gamma} \norm{\Y - \Z \gamma}_2^2 = \norm{(\id - \proj{\S_\Z}) \Y}_2^2 \]

is the squared norm of the component of \(\Y\) perpendicular to the space spanned by the columns of \(\Z\); it’s just that this space is two-dimensional, not three-dimensional.

In R

It turns out that R does not actually estimate the OLS fit by forming \((\X^\trans \X)^{-1} \X^\trans \Y\). Instead, it uses a pivoted QR decomposition, which in effect builds up the fit one regressor at a time. When it reaches a regressor that is linearly dependent on the ones already included (that is, when \(\X^\trans \X\) has a non-trivial nullspace), it drops that regressor from the fit, keeps the coefficients it has already estimated, and reports NA for the rest.

In our example, the difference hw_diff is a linear combination of height_norm and weight_norm. Thus the regressor matrix has a non-trivial nullspace, and the best fit \(\Yhat\) is well defined even though the coefficients are not.
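We can check this directly, at least as a sketch using the bodyfat_df columns constructed above: the model matrix that includes hw_diff is rank-deficient.

# Build the regressor matrix Z for the regression that includes hw_diff.
z_diff <- model.matrix(~ abdomen_norm + height_norm + weight_norm + hw_diff, bodyfat_df)
# The vector v lies in the nullspace: height_norm - weight_norm - hw_diff = 0.
v <- c(0, 0, 1, -1, -1)
print(max(abs(z_diff %*% v)))   # (numerically) zero
print(qr(z_diff)$rank)          # rank 4, even though Z has 5 columns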

Answer to Question 3

The above arguments apply only to linear transformations. Non-linear transformations of the regressors will, in general, give different fits.

A single counterexample is enough to prove the general statement. Take \(\y_n = 1\) for all \(n\), \(N\) to be even, and take \[ \x_n = \begin{cases} -1 & \textrm{if }n\textrm{ is odd} \\ 1 & \textrm{if }n\textrm{ is even}. \\ \end{cases} \]

Take \(\z_n = \x_n^2\), which is a nonlinear function. Regressing \(\y_n \sim \x_n \beta\) and \(\y_n \sim \z_n \gamma\), we get

\[ \betahat = \frac{\sumn \y_n \x_n}{\sumn \x_n^2} = \frac{\sum_{n\textrm{ odd}} (-1) + \sum_{n\textrm{ even}}1}{\sumn 1} = 0 \]

but

\[ \gammahat = \frac{\sumn \y_n \z_n}{\sumn \z_n^2} = \frac{\sum_{n\textrm{ odd}} (-1)^2 + \sum_{n\textrm{ even}}1}{\sumn 1} = 1. \]
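A quick numerical check of this counterexample (a sketch with made-up data; the variable names are arbitrary):

n_obs <- 10
y_toy <- rep(1, n_obs)
x_toy <- rep(c(-1, 1), n_obs / 2)   # alternating -1, +1
z_toy <- x_toy^2                    # nonlinear transform: identically 1
print(coef(lm(y_toy ~ x_toy - 1)))  # betahat = 0
print(coef(lm(y_toy ~ z_toy - 1)))  # gammahat = 1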

Of course, a non-linear transform can still act linearly on a particular dataset. For example, if the \(\x_n\) are all equal to the same positive constant, then the column span of the regressors \(\x_n\) and \(\x_n^2\) is the same.
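For instance (a sketch with made-up data; the constant value 2 is arbitrary), the no-intercept regressions on a constant \(\x_n\) and on \(\x_n^2\) give identical fits:

set.seed(0)                    # toy data for illustration only
x_const <- rep(2, 10)          # a constant regressor
y_rand <- rnorm(10)
# Both regressors span the same one-dimensional column space.
print(mean(lm(y_rand ~ x_const - 1)$residuals^2))
print(mean(lm(y_rand ~ I(x_const^2) - 1)$residuals^2))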

A more pathological example is as follows. Suppose that the \(\x_n\) are all distinct, and define sets \(B_n\) such that \(\x_n \in B_n\) but \(\x_n \notin B_m\) for \(n \ne m\). Then define an \(N\)–dimensional regressor \[ \z_{np} = \begin{cases} 1 & \x_{n} \in B_p \\ 0 & \x_{n} \notin B_p. \end{cases} \]

This is a perfectly well-defined function of \(\x_n\)! However, \(\Z\) is the identity matrix, so the OLS solution to \(\y_n \sim \zv_n^\trans \betav\) is \(\betavhat = \Y\), with zero error.
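Here is a sketch of this pathology with made-up data, using factor() to build one indicator column per distinct value of \(\x_n\):

set.seed(1)                         # toy data for illustration only
x_toy <- rnorm(8)                   # all distinct with probability one
y_toy <- rnorm(8)
reg_sat <- lm(y_toy ~ factor(x_toy) - 1)   # one indicator per distinct x value
print(mean(reg_sat$residuals^2))    # (numerically) zero: a perfect in-sample fit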

It is difficult to say in general what effect nonlinear transformations will have on a regression. But to answer Question 3, it suffices to notice that hw_ratio = height_norm / weight_norm is a nonlinear function of the two variables, and so, in general, it provides a new, linearly independent regressor.
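As a quick check (using the bodyfat_df columns constructed above), hw_ratio has a nonzero residual when regressed on the other three regressors, so it does add a new direction to the column span:

ratio_reg <- lm(hw_ratio ~ abdomen_norm + height_norm + weight_norm, bodyfat_df)
print(mean(ratio_reg$residuals^2))   # strictly positive: hw_ratio is not redundant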

Answer to Question 4

Hopefully you will find that we have already answered Question 4! Neither idea is very promising. If \(\A\) is invertible, regressing on \(\z_n = \A \x_n\) cannot change the fit at all, and if \(\A\) is not invertible, it can only shrink the column span. Optimizing over arbitrary functions \(f\), meanwhile, leads to pathologies like the zero-error indicator example above.