The Best Subset Selection approach considers every possible combination of features and creates a model for each one. Using one’s chosen evaluation statistic, one can select the model with the best results.
Best Subset Selection is likely overfit so ideally we use evaluation statistics that estimate the test error for each model. In the analysis below, Mallow’s Cp is used.
The data set includes baseball player statistics and salaries.
library(ISLR)
summary(Hitters)
AtBat Hits HmRun Runs RBI
Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50 1st Qu.: 30.00
Median :413.0 Median :103.0 Median : 9.00 Median : 52.00 Median : 47.00
Mean :403.6 Mean :107.8 Mean :11.62 Mean : 54.75 Mean : 51.49
3rd Qu.:526.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00 3rd Qu.: 71.00
Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00 Max. :121.00
Walks Years CAtBat CHits
Min. : 0.00 Min. : 1.000 Min. : 19.0 Min. : 4.0
1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5 1st Qu.: 212.0
Median : 37.00 Median : 6.000 Median : 1931.0 Median : 516.0
Mean : 41.11 Mean : 7.312 Mean : 2657.5 Mean : 722.2
3rd Qu.: 57.00 3rd Qu.:10.000 3rd Qu.: 3890.5 3rd Qu.:1054.0
Max. :105.00 Max. :24.000 Max. :14053.0 Max. :4256.0
CHmRun CRuns CRBI CWalks League Division
Min. : 0.00 Min. : 2.0 Min. : 3.0 Min. : 1.0 A:139 E:129
1st Qu.: 15.00 1st Qu.: 105.5 1st Qu.: 95.0 1st Qu.: 71.0 N:124 W:134
Median : 40.00 Median : 250.0 Median : 230.0 Median : 174.0
Mean : 69.24 Mean : 361.2 Mean : 330.4 Mean : 260.3
3rd Qu.: 92.50 3rd Qu.: 497.5 3rd Qu.: 424.5 3rd Qu.: 328.5
Max. :548.00 Max. :2165.0 Max. :1659.0 Max. :1566.0
PutOuts Assists Errors Salary NewLeague
Min. : 0.0 Min. : 0.0 Min. : 0.000 Min. : 67.5 A:141
1st Qu.: 113.5 1st Qu.: 8.0 1st Qu.: 3.000 1st Qu.: 190.0 N:122
Median : 224.0 Median : 45.0 Median : 7.000 Median : 425.0
Mean : 290.7 Mean :118.8 Mean : 8.593 Mean : 535.9
3rd Qu.: 322.5 3rd Qu.:192.0 3rd Qu.:13.000 3rd Qu.: 750.0
Max. :1377.0 Max. :492.0 Max. :32.000 Max. :2460.0
There are 59 missing values in the Salary variable we want to predict. I decide to remove this observations. The function na.omit removes all incomplete observations
Hitters <- na.omit(Hitters)
To check that na.omit worked, I use the anyNA function to quickly determine if there are any NA fields.
anyNA(Hitters$Salary)
[1] FALSE
I use the package leaps
to evaluate all the best-subset models.
install.packages("leaps")
Error in install.packages : Updating loaded packages
library(leaps)
The leaps
function regsubsets runs the best subset regression. The ~.
tells R to create the models with all features except Salary.
bestregfit.full <- regsubsets(Salary~., data = Hitters)
summary(bestregfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " "
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " "
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*" " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" " "
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " " "*"
LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " "*" " " " " " "
4 ( 1 ) " " "*" "*" " " " " " "
5 ( 1 ) " " "*" "*" " " " " " "
6 ( 1 ) " " "*" "*" " " " " " "
7 ( 1 ) " " "*" "*" " " " " " "
8 ( 1 ) " " "*" "*" " " " " " "
The summary indicates only eight features were modeled in subsets as that is the default. I run it again with the full number of features, 19.
bestregfit.full19 <- regsubsets(Salary~., data = Hitters, nvmax=19)
summary(bestregfit.full19)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " "
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" " "
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*" " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" " "
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " " "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " "*" " " " " " "
4 ( 1 ) " " "*" "*" " " " " " "
5 ( 1 ) " " "*" "*" " " " " " "
6 ( 1 ) " " "*" "*" " " " " " "
7 ( 1 ) " " "*" "*" " " " " " "
8 ( 1 ) " " "*" "*" " " " " " "
9 ( 1 ) " " "*" "*" " " " " " "
10 ( 1 ) " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*"
Let’s view the compnents provided in the summary.
names(summary(bestregfit.full19))
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
I plot the Cp (Mallow’s Cp) component to determine which model has the lowest Cp.
bestregfit.summary <- summary(bestregfit.full19)
plot(bestregfit.summary$cp, xlab="Number of Variables", ylab="Cp")
The model with 10 variables appears to have the lowest Cp. I can also query this specifically with which.min
which.min(bestregfit.summary$cp)
[1] 10
I’m going to mark the lowest point on my plot.
plot(bestregfit.summary$cp, xlab="Number of Variables", ylab="Cp")
points(10, bestregfit.summary$cp[10], pch=20, col="red")
I can also plot regsubsets
object itself.
plot(bestregfit.full19, scale="Cp")
We can finally get the coefficients for the 10 variables in that best model.
coef(bestregfit.full19, 10)
(Intercept) AtBat Hits Walks CAtBat CRuns
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
CRBI CWalks DivisionW PutOuts Assists
0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680