---
title: "Transforming regressors."
format:
  html:
    code-fold: false
    code-tools: true
---
::: {.content-visible when-format="html"}
{{< include /macros.tex >}}
:::
<!--
This beamer output never really worked and I gave up trying
to get it to work.
The R output text was way too big, the figure
text was too small, and LaTeX math didn't render at all.
beamer:
slide-level: 1
include-before-body:
- file: ../macros.tex
fig-width: 6
fig-height: 3.5
-->
```{r}
#| echo: false
#| output: false
# Raster is not included in the Quarto build environment :(
#library(raster) # Load before tidyverse to avoid conflict with select
library(tidyverse)
library(gridExtra)
library(quarto)
root_dir <- quarto::find_project_root()
ames_df <- read.csv(file.path(root_dir, "datasets/ames_house/data/ames_de_cock.csv")) %>%
  filter(Sale.Condition=="Normal",
         MS.Zoning %in% c("RH", "RL", "RP", "RM")) %>%
  filter(Kitchen.Qual != "Po") %>% # only one observation!
  filter(!is.na(Garage.Qual))
```
```{r}
#| echo: false
#| output: false
if (FALSE) {
  RasterizeMatrix <- function(mat) {
    mat_df <-
      as.data.frame(raster(mat), xy = TRUE) %>%
      mutate(col_ind=round(x * ncol(mat) + 0.5),
             row_ind=round((1 - y) * nrow(mat) + 0.5)) %>%
      mutate(col_lab=colnames(mat)[col_ind]) %>%
      mutate(row_lab=rownames(mat)[row_ind])
    return(mat_df)
  }
  PlotMatrix <- function(mat) {
    RasterizeMatrix(mat) %>%
      ggplot() +
      geom_raster(aes(x=col_lab, y=row_lab, fill=layer))
  }
}
if (FALSE) {
  # Test
  mat <- matrix(runif(6), nrow=3)
  rownames(mat) <- letters[1:3]
  colnames(mat) <- c("A", "B")
  mat_df <- RasterizeMatrix(mat)
  for (col_lab in colnames(mat)) {
    for (row_lab in rownames(mat)) {
      mat_val <- mat_df %>%
        filter(row_lab == !!row_lab, col_lab == !!col_lab) %>%
        pull(layer)
      stopifnot(abs(mat_val - mat[row_lab, col_lab]) < 1e-8)
    }
  }
}
PlotMatrix <- function(mat) {
  image(mat)
}
```
# Goals
- Discuss transformations of regressors
- Polynomial regressors
- Interaction regressors
# Recall the Ames data
Let's try to predict `SalePrice` using the regressors in the Ames dataset. We'll
explore some different ways to include more regressors.
```{r}
ggplot(ames_df) +
geom_point(aes(x=Gr.Liv.Area, y=SalePrice))
```
# Let's run on a bunch of regressors
```{r}
regs <- c("Neighborhood", "Bldg.Type", "Overall.Qual", "Overall.Cond",
          "Exter.Qual", "Exter.Cond",
          "X1st.Flr.SF", "X2nd.Flr.SF", "Gr.Liv.Area", "Low.Qual.Fin.SF",
          "Kitchen.Qual",
          "Garage.Area", "Garage.Qual", "Garage.Cond",
          "Paved.Drive", "Open.Porch.SF", "Enclosed.Porch",
          "Pool.Area", "Yr.Sold")
if (any(!(regs %in% names(ames_df)))) {
  print(setdiff(regs, names(ames_df)))
}
```
I'll run a bunch of regressions, including more and more
regressors, and see what happens.
```{r}
#| echo: false
#| output: false
fits_df <- data.frame()
r2_df <- data.frame()
for (i in seq_along(regs)) {
  reg_form <- paste(regs[1:i], collapse=" + ")
  lm_fit <- lm(formula(paste0("SalePrice ~ ", reg_form)), ames_df)
  fits_df <- bind_rows(fits_df,
                       data.frame(x=ames_df$Gr.Liv.Area,
                                  y=ames_df$SalePrice,
                                  yhat=fitted(lm_fit),
                                  num_regs=i))
  r2_df <- bind_rows(r2_df, data.frame(
    r2=summary(lm_fit)$r.squared,
    num_regs=i
  ))
}
```
# How does the "fit" change?
```{r}
#| echo: false
ggplot(r2_df) +
  geom_line(aes(x=num_regs, y=r2)) +
  ylim(0, 1) +
  xlab("Number of regressors included") +
  ylab("R2")
```
# How does the "fit" change?
```{r}
#| echo: false
fits_df %>%
  filter(num_regs %% 4 == 0) %>%
  ggplot() +
  geom_point(aes(x=y, y=yhat)) +
  geom_abline(aes(slope=1, intercept=0)) +
  facet_grid(~ num_regs)
```
- You can't use R2 alone to decide which regressors to include,
  since R2 will always tell you to add more
- It helps to be smart about which regressors you include
- How can we make more regressors from the ones we already have?
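As a quick illustration of the first point, here is a small simulated sketch (all names and data are made up): in-sample R2 cannot decrease when you add a regressor, even one that is pure noise.

```{r}
# Hypothetical sketch: adding a pure-noise regressor never lowers in-sample R2.
set.seed(42)
demo_df <- data.frame(y=rnorm(100), x1=rnorm(100), noise=rnorm(100))
r2_without <- summary(lm(y ~ x1, demo_df))$r.squared
r2_with <- summary(lm(y ~ x1 + noise, demo_df))$r.squared
print(c(r2_without, r2_with))  # r2_with >= r2_without, always
```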
# Polynomial regressors
```{r}
reg <- lm(SalePrice ~ Gr.Liv.Area, ames_df)
```
```{r}
#| echo: false
ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.5) +
  geom_line(aes(y=predict(reg, ames_df), color="Linear fit"), lwd=3)
```
It looks like there may be non-linear dependence. How can we fit
a nonlinear curve with "linear regression"?
# Regress on the square too
```{r}
reg_sq <- lm(SalePrice ~ Gr.Liv.Area + I(Gr.Liv.Area^2), ames_df)
```
```{r}
#| echo: false
ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.5) +
  geom_line(aes(y=predict(reg, ames_df), color="Linear fit"), lwd=3) +
  geom_line(aes(y=predict(reg_sq, ames_df), color="Quadratic fit"), lwd=3)
```
Adding the squared term produces a curve even though this is still "linear
regression": the model remains linear in the coefficients, just not in `Gr.Liv.Area`.
# Regress on the higher-order polynomials
```{r}
reg_order1 <- lm(SalePrice ~ Gr.Liv.Area, ames_df)
reg_order2 <- lm(SalePrice ~ Gr.Liv.Area + I(Gr.Liv.Area^2), ames_df)
reg_order3 <- lm(SalePrice ~ Gr.Liv.Area + I(Gr.Liv.Area^2) + I(Gr.Liv.Area^3), ames_df)
reg_order4 <- lm(SalePrice ~ Gr.Liv.Area + I(Gr.Liv.Area^2) +
                 I(Gr.Liv.Area^3) + I(Gr.Liv.Area^4), ames_df)
```
```{r}
#| echo: false
ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.3) +
  geom_line(aes(y=predict(reg_order1, ames_df), color="Order 1"), lwd=3) +
  geom_line(aes(y=predict(reg_order2, ames_df), color="Order 2"), lwd=3) +
  geom_line(aes(y=predict(reg_order3, ames_df), color="Order 3"), lwd=3) +
  geom_line(aes(y=predict(reg_order4, ames_df), color="Order 4"), lwd=3)
```
What will happen if you regress on higher and higher-order polynomials? How
do you decide when to stop?
# Regress on the higher-order polynomials
The `poly` function does the same thing more compactly. It
also uses "orthogonal polynomials," which can improve the
numerical stability of the fit.
```{r}
reg_order4 <- lm(SalePrice ~ Gr.Liv.Area + I(Gr.Liv.Area^2) +
                 I(Gr.Liv.Area^3) + I(Gr.Liv.Area^4), ames_df)
reg_order4_v2 <- lm(SalePrice ~ poly(Gr.Liv.Area, 4), ames_df)
print(max(abs(fitted(reg_order4) - fitted(reg_order4_v2))))
```
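To see the "orthogonal" part concretely, here is a small self-contained check (the input grid `1:100` is arbitrary): the columns that `poly` returns are orthonormal, so their cross-product matrix is the identity.

```{r}
# Sketch: poly() returns orthonormal columns, so crossprod is the identity.
p <- poly(1:100, degree=3)
print(round(crossprod(p), 10))
```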
# Interactions
```{r}
#| echo: false
# But there may be some grouping with Kitchen.Qual
ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.5, aes(color=Kitchen.Qual)) +
  geom_line(aes(y=predict(reg, ames_df)), lwd=2)
```
Recall that we found some grouping by kitchen quality. Let's
include the indicator in the regression:
```{r}
reg_additive <- lm(SalePrice ~ Gr.Liv.Area + Kitchen.Qual, ames_df)
```
Now each level of `Kitchen.Qual` has its own intercept.
```{r}
#| echo: false
additive_plot <-
  ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.3) +
  geom_line(aes(y=predict(reg, ames_df), color="Original fit"),
            lwd=2, linetype="dashed") +
  geom_line(aes(y=predict(reg_additive, ames_df),
                group=Kitchen.Qual, color=Kitchen.Qual), lwd=2)
print(additive_plot)
```
The fit is now different for each kitchen quality, and the
overall slope changed too!
However, the slope is still the same within each kitchen
quality group. **Why?**
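One way to see why: the additive model has a single slope coefficient for the continuous regressor, shared by every level of the factor. A tiny made-up example (hypothetical data, not the Ames set) makes this visible:

```{r}
# Hypothetical sketch: y ~ x + g fits one shared slope plus per-level offsets.
toy_df <- data.frame(x=c(1, 2, 3, 1, 2, 3),
                     g=factor(rep(c("a", "b"), each=3)),
                     y=c(1, 2, 3, 11, 12, 13))
print(coefficients(lm(y ~ x + g, toy_df)))
# One "x" slope (here 1) and one offset "gb" (here 10); no per-level slopes.
```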
# XTX plot
We can see in the XTX matrix that there is some association
between living area and kitchen quality.
**How would we expect the slope to have changed if this matrix were diagonal?**
```{r}
x <- model.matrix(reg_additive) %>% scale(center=FALSE)
xtx <- t(x) %*% x / nrow(ames_df)
PlotMatrix(xtx)
```
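A simulated sketch of the diagonal case (all names and data here are made up): when two regressors are generated independently, adding one barely moves the other's coefficient. It is the off-diagonal association that shifts slopes.

```{r}
# Hypothetical sketch: with (nearly) orthogonal regressors, adding x2
# leaves the coefficient on x1 essentially unchanged.
set.seed(2)
x1 <- rnorm(500)
x2 <- rnorm(500)  # generated independently of x1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(500)
b_alone <- coefficients(lm(y ~ x1))["x1"]
b_joint <- coefficients(lm(y ~ x1 + x2))["x1"]
print(c(b_alone, b_joint))  # close, though not identical in a finite sample
```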
# Level dropping
Note that one of the levels is automatically dropped. **Why?**
```{r}
table(ames_df$Kitchen.Qual)
print(coefficients(reg_additive))
```
# Interactions
How would you fit a different slope for each kitchen quality?
Use an interaction term. You can use the `R` notation `:`
and it will construct the interaction automatically for you.
```{r}
reg_inter <- lm(SalePrice ~ Gr.Liv.Area:Kitchen.Qual, ames_df)
bind_cols(select(ames_df, Gr.Liv.Area, Kitchen.Qual),
          as.data.frame(model.matrix(reg_inter)))[1,] %>%
  t()
```
Here's a good cheat sheet for `R` formulas:
`https://www.econometrics.blog/post/the-r-formula-cheatsheet/`
# Interaction XTX matrix
```{r}
x <- model.matrix(reg_inter) %>% scale(center=FALSE)
xtx <- t(x) %*% x / nrow(ames_df)
PlotMatrix(xtx)
```
# Interaction plot
```{r}
#| echo: false
interaction_plot <-
  ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.3) +
  geom_line(aes(y=predict(reg, ames_df), color="Overall fit"),
            lwd=3, linetype="dashed") +
  geom_line(aes(y=predict(reg_inter, ames_df),
                group=Kitchen.Qual, color=Kitchen.Qual), lwd=2)
grid.arrange(interaction_plot +
               ggtitle("Interaction slopes") +
               theme(legend.position="none"),
             additive_plot +
               ggtitle("Interaction offsets") +
               theme(legend.position="none"),
             ncol=2)
```
**Is there a point where all the slope interaction lines cross?**
# Both slopes and offsets
Using the `R` notation `*`, we can add both constant interactions
and slope interactions:
```{r}
reg_add_inter <- lm(SalePrice ~ Gr.Liv.Area*Kitchen.Qual, ames_df)
colnames(model.matrix(reg_add_inter))
```
```{r}
#| echo: false
additive_and_interaction_plot <-
  ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.3) +
  geom_line(aes(y=predict(reg, ames_df), color="Overall fit"),
            lwd=3, linetype="dashed") +
  geom_line(aes(y=predict(reg_add_inter, ames_df),
                group=Kitchen.Qual, color=Kitchen.Qual), lwd=2)
grid.arrange(interaction_plot +
               ggtitle("Interaction slopes") +
               theme(legend.position="none"),
             additive_and_interaction_plot +
               ggtitle("Interaction slopes and offsets") +
               theme(legend.position="none"),
             ncol=2)
```
# Higher--order interactions
We can create still higher--order interactions with the
symbol `^`:
```{r}
table(ames_df$Garage.Qual)
table(ames_df$Exter.Qual)
table(ames_df$Kitchen.Qual)
```
```{r}
reg_higher <- lm(
  SalePrice ~ (Exter.Qual + Kitchen.Qual + Garage.Qual)^3,
  ames_df)
```
# Three--way interactions have a lot of NA estimates.
```{r}
coeff_names <- names(coefficients(reg_higher))
cbind(coeff_names[1:5], coeff_names[21:25], coeff_names[71:75])
```
```{r}
print(sum(is.na(coefficients(reg_higher))))
```
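The mechanism, in a self-contained sketch (made-up data): `lm` reports `NA` whenever a column of the model matrix is linearly dependent on earlier columns. An interaction cell that never occurs in the data produces an all-zero column, which is the extreme case of this.

```{r}
# Hypothetical sketch: a perfectly redundant column gets an NA coefficient.
set.seed(1)
x1 <- rnorm(10)
x2 <- 2 * x1  # perfectly collinear with x1: adds no new information
y <- rnorm(10)
print(coefficients(lm(y ~ x1 + x2)))  # the x2 coefficient is NA
```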
**Why are there so many NA coefficients?**
# Higher--order Interaction Plot
**Question:** Why do the higher--order interactions look like step functions?
```{r}
#| echo: false
ggplot(ames_df, aes(y=SalePrice, x=Gr.Liv.Area)) +
  geom_point(alpha=0.3) +
  geom_line(aes(y=predict(reg, ames_df), color="Original fit"),
            lwd=2, linetype="dashed") +
  geom_point(aes(y=predict(reg_higher, ames_df),
                 color="Higher order indicator interactions"))
```
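The step-function look follows because the three-way model uses only indicator regressors: the fitted value is constant within each factor combination. A tiny made-up example (hypothetical data):

```{r}
# Hypothetical sketch: a factor-only regression predicts one value per group,
# so its fitted values are piecewise constant in any continuous covariate.
toy_df <- data.frame(g=factor(rep(c("lo", "hi"), each=3)),
                     y=c(1, 2, 3, 10, 11, 12))
print(unique(fitted(lm(y ~ g, toy_df))))  # only two distinct fitted values
```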