Three example datasets for linear modeling

In Lecture XXI of Wittgenstein’s “Lectures on the Foundations of Mathematics” we find this diagram and quote:

In the same passage, he goes on to say,

“A use of language has normally what we might call a point. This is immensely important. Although it’s true this is a matter of degree, and we can’t just say where it ends.”

When we run a regression, we must have a point. Otherwise we are just looking at goose entrails.

library(tidyverse)
library(gridExtra)
library(car)

options(repr.plot.width = 15, repr.plot.height = 8, repr.plot.res = 300)

── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine


Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some

Spotify data

Almost 30,000 Songs from the Spotify API with black-box machine learning quantifications of musical features.

https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs

spotify_df <- read.csv("../datasets/spotify_songs.csv")
head(spotify_df)
print(sprintf("%d rows in the dataframe.", nrow(spotify_df)))
print(names(spotify_df))

A data.frame: 6 × 23
	track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	⋯	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
	<chr>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	⋯	<int>	<dbl>	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
1	6f807x0ima9a1j3VPbc7VN	I Don't Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don't Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	⋯	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
2	0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	⋯	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
3	1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	⋯	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
4	75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	⋯	7	-3.778	1	0.1020	0.0287	9.43e-06	0.2040	0.277	121.956	169093
5	1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	⋯	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
6	7fvUMiyapMsRRxr07cU8Ef	Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	⋯	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

[1] "32833 rows in the dataframe."
 [1] "track_id"                 "track_name"              
 [3] "track_artist"             "track_popularity"        
 [5] "track_album_id"           "track_album_name"        
 [7] "track_album_release_date" "playlist_name"           
 [9] "playlist_id"              "playlist_genre"          
[11] "playlist_subgenre"        "danceability"            
[13] "energy"                   "key"                     
[15] "loudness"                 "mode"                    
[17] "speechiness"              "acousticness"            
[19] "instrumentalness"         "liveness"                
[21] "valence"                  "tempo"                   
[23] "duration_ms"

grid.arrange(
    qplot(x=track_popularity, geom="histogram", data=spotify_df),
    qplot(x=log10(track_popularity), geom="histogram", data=spotify_df), ncol=2)

Warning message:
“`qplot()` was deprecated in ggplot2 3.4.0.”
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Removed 2703 rows containing non-finite values (`stat_bin()`).”

ggplot(spotify_df) + geom_point(aes(y=track_popularity, x=danceability))

ggplot(spotify_df) + geom_density_2d_filled(aes(y=track_popularity, x=danceability))

reg_small <- lm(track_popularity ~ danceability, spotify_df)
print(summary(reg_small))

ggplot(spotify_df) +
    geom_density_2d_filled(aes(y=track_popularity, x=danceability)) +
    geom_line(aes(x=danceability, y=reg_small$fitted.values), color="red", linewidth=2)


Call:
lm(formula = track_popularity ~ danceability, data = spotify_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.024 -18.415   2.934  19.480  57.105 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   35.1757     0.6361   55.30   <2e-16 ***
danceability  11.1497     0.9484   11.76   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.93 on 32831 degrees of freedom
Multiple R-squared:  0.004192,  Adjusted R-squared:  0.004162 
F-statistic: 138.2 on 1 and 32831 DF,  p-value: < 2.2e-16

Bodyfat

Source: https://www.kaggle.com/datasets/fedesoriano/body-fat-prediction-dataset

Motivation

Measuring bodyfat precisely is hard, and proxies are very useful. Can we predict the outcome of an expensive, inconvenient procedure using only easy-to-gather measurements?

Image from https://www.topendsports.com/testing/tests/underwater.htm

bodyfat_df <- read.csv("../datasets/bodyfat.csv")
head(bodyfat_df)
nrow(bodyfat_df)

A data.frame: 6 × 15
	Density	bodyfat	Age	Weight	Height	Neck	Chest	Abdomen	Hip	Thigh	Knee	Ankle	Biceps	Forearm	Wrist
	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	1.0708	12.3	23	154.25	67.75	36.2	93.1	85.2	94.5	59.0	37.3	21.9	32.0	27.4	17.1
2	1.0853	6.1	22	173.25	72.25	38.5	93.6	83.0	98.7	58.7	37.3	23.4	30.5	28.9	18.2
3	1.0414	25.3	22	154.00	66.25	34.0	95.8	87.9	99.2	59.6	38.9	24.0	28.8	25.2	16.6
4	1.0751	10.4	26	184.75	72.25	37.4	101.8	86.4	101.2	60.1	37.3	22.8	32.4	29.4	18.2
5	1.0340	28.7	24	184.25	71.25	34.4	97.3	100.0	101.9	63.2	42.2	24.0	32.2	27.7	17.7
6	1.0502	20.9	24	210.25	74.75	39.0	104.5	94.4	107.8	66.0	42.0	25.6	35.7	30.6	18.8

252

vars <- c("Weight", "Height", "Neck", "Chest", "Abdomen", "Hip", 
          "Thigh", "Knee", "Ankle", "Biceps", "Forearm", "Wrist")
stopifnot(all(vars %in% names(bodyfat_df)))

grid.arrange(
    ggplot(bodyfat_df) + geom_point(aes(x=Abdomen, y=bodyfat)),
    ggplot(bodyfat_df) + geom_point(aes(x=Height, y=bodyfat)),
    ggplot(bodyfat_df) + geom_point(aes(x=Weight, y=bodyfat)),
    ncol=3)

regressors <- paste(vars, collapse=" + ")
reg_form <- formula(paste0("bodyfat ~ 1 + ", regressors))
reg <- lm(reg_form, bodyfat_df)
summary(reg)

ggplot() +
    geom_point(aes(x=bodyfat_df$bodyfat, y=reg$fitted.values)) +
    geom_abline(aes(slope=1, intercept=0)) +
    xlab("Actual bodyfat") + ylab("Predicted bodyfat")


Call:
lm(formula = reg_form, data = bodyfat_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4341  -3.0551  -0.0158   3.1951   9.7682 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -19.54803   17.43111  -1.121   0.2632    
Weight       -0.10974    0.05266  -2.084   0.0382 *  
Height       -0.09410    0.09569  -0.983   0.3264    
Neck         -0.42995    0.23280  -1.847   0.0660 .  
Chest        -0.01728    0.09964  -0.173   0.8625    
Abdomen       1.02953    0.07761  13.266   <2e-16 ***
Hip          -0.22995    0.14626  -1.572   0.1172    
Thigh         0.13476    0.13510   0.997   0.3195    
Knee          0.13187    0.23554   0.560   0.5761    
Ankle         0.12974    0.22150   0.586   0.5586    
Biceps        0.20538    0.17163   1.197   0.2326    
Forearm       0.38964    0.19756   1.972   0.0497 *  
Wrist        -1.27227    0.50602  -2.514   0.0126 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.329 on 239 degrees of freedom
Multiple R-squared:  0.7452,    Adjusted R-squared:  0.7324 
F-statistic: 58.24 on 12 and 239 DF,  p-value: < 2.2e-16

Aluminum

Stress-strain plots of AL data. For a given amount of force, how much does a bar of aluminum deform?

Taken from https://data.mendeley.com/datasets/rd6jm9tyb6/2

See paper and methods at https://www.sciencedirect.com/science/article/pii/S0263823118313831?via%3Dihub

al_df <- read.csv(file="../datasets/aluminum_dataset.csv") %>% 
  filter(loading_type == "T", temp == 20, lot == "A")
head(al_df)

A data.frame: 6 × 7
	Strain	Stress_MPa	loading_type	temp	lot	specimen	filename
	<dbl>	<dbl>	<chr>	<int>	<chr>	<int>	<chr>
1	1.746888e-05	0.1470586	T	20	A	1	T_020_A_1_001_022_03.csv
2	2.096203e-05	0.1480829	T	20	A	1	T_020_A_1_001_022_03.csv
3	1.871861e-05	0.1676320	T	20	A	1	T_020_A_1_001_022_03.csv
4	1.634951e-05	0.1929697	T	20	A	1	T_020_A_1_001_022_03.csv
5	1.951097e-05	0.2212170	T	20	A	1	T_020_A_1_001_022_03.csv
6	2.155187e-05	0.2569143	T	20	A	1	T_020_A_1_001_022_03.csv

print(table(al_df[c("lot", "specimen")]))

   specimen
lot   1   2   3
  A 614 645 630

ggplot(al_df) +
  geom_point(aes(x=Strain, y=Stress_MPa, color=factor(specimen)))

al_df %>% filter(Strain < 0.0035) %>%
  ggplot() +
    geom_point(aes(x=Strain, y=Stress_MPa, color=factor(specimen)))

al_reg_df <- al_df %>% filter(Strain < 0.0035, Strain > 0.0001, specimen == 1)
ggplot(al_reg_df) +
  geom_point(aes(x=Strain, y=Stress_MPa, color=factor(specimen)))

reg <- lm(Stress_MPa ~ 1 + Strain, al_reg_df)
al_reg_df$fit <- reg$fitted.values
al_df %>% filter(Strain < 0.0035) %>%
  ggplot() +
    geom_point(aes(x=Strain, y=Stress_MPa, color=factor(specimen))) +
    geom_line(aes(x=Strain, y=fit, color="Specimen 1 Fitted"), lwd=2, data=al_reg_df)