install.packages("ISLR")
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.4
library(MASS)
These libraries contain the data sets used below; one of them is a data frame named Boston.
?Boston
## starting httpd help server ... done
The Boston data frame contains housing values in the suburbs of Boston, with 506 rows and 14 columns. Each observation is a town.
anyNA(Boston)
## [1] FALSE
There are no missing values in the data set.
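As a quick extra check beyond anyNA, a per-column count of missing values can be computed (a minimal sketch):
# Count missing values in each column; all zeros confirms the anyNA result
colSums(is.na(Boston))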
I plot the median value of owner-occupied homes against the percentage of "lower status" population. Note that the median home values look low because this data is several decades old.
plot(medv~lstat, Boston)
The plot indicates that as the percentage of "lower status" residents increases, the median home value decreases. It also shows that the towns with the highest percentages do not necessarily have the lowest median home values.
Now we fit a linear model to this data.
m1 <- lm(medv~lstat, data=Boston)
m1
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(m1)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
The coefficient of lstat is negative, as expected. Calling summary provides the p-value for lstat, indicating that lstat has a significant relationship to medv. Now add the fitted line to the plot.
plot(medv~lstat, Boston)
abline(m1, col="red")
What components are provided by the model?
names(m1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Get confidence intervals for each coefficient, 95% by default.
confint(m1)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
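confint also accepts a level argument if a different confidence level is wanted, for example a 99% interval (not run above):
confint(m1, level = 0.99)  # wider intervals than the default 95%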
Now let's try to predict medv for some hypothetical values of lstat. We will also request the confidence interval.
predict(m1, data.frame(lstat=c(5,10,15,20,22)), interval="confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
## 4 15.55285 14.77355 16.33216
## 5 13.65276 12.75796 14.54755
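A confidence interval describes uncertainty in the mean response. To capture the uncertainty of an individual town's medv at those lstat values, a prediction interval is the appropriate choice; the same call with interval = "prediction" gives wider bounds (a sketch using the same hypothetical lstat values):
# Prediction intervals also account for the irreducible error term, so they are wider
predict(m1, data.frame(lstat=c(5,10,15,20,22)), interval="prediction")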
Next, I fit a linear model using multiple predictors. I add the age feature, which gives the percentage of owner-occupied units in each town built prior to 1940.
m2 <- lm(medv~lstat+age, data=Boston)
m2
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Coefficients:
## (Intercept) lstat age
## 33.22276 -1.03207 0.03454
summary(m2)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
The coefficient of lstat is negative, as expected. The coefficient of age is small but positive, indicating that the larger a town's share of older homes, the higher the median home value. This also makes sense, as we typically expect homes to appreciate over time. The p-values of lstat and age are both significant at the 95% confidence level. Note that R prints significance codes (the asterisks) next to each p-value as a quick visual indicator.
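The full coefficient table behind those significance codes can also be extracted programmatically (a brief sketch):
coef(summary(m2))                # estimates, standard errors, t values, and p-values
coef(summary(m2))[, "Pr(>|t|)"]  # just the p-values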
I now fit a linear model that uses all of the other variables as predictors of medv.
m3 <- lm(medv~., data=Boston)
m3
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Coefficients:
## (Intercept) crim zn indus chas
## 3.646e+01 -1.080e-01 4.642e-02 2.056e-02 2.687e+00
## nox rm age dis rad
## -1.777e+01 3.810e+00 6.922e-04 -1.476e+00 3.060e-01
## tax ptratio black lstat
## -1.233e-02 -9.527e-01 9.312e-03 -5.248e-01
summary(m3)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Note that age, which was significant on its own, is far less significant in the presence of the additional variables.
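One common reason a predictor loses significance in a larger model is correlation with the other predictors. Variance inflation factors give a quick check; this sketch assumes the car package is installed:
# install.packages("car")  # if not already installed
library(car)
vif(m3)  # values well above 5-10 suggest problematic collinearity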
I'm going to plot my new model. To prepare the plot area, I use par to create a 2-by-2 grid of panels, because I know plot will produce four diagnostic charts.
par(mfrow=c(2,2))
plot(m3)
I refit the model, removing age because it was not significant at my confidence level. I use the update function to modify the existing model.
m4 <- update(m3, ~ . - age)
m4
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat, data = Boston)
##
## Coefficients:
## (Intercept) crim zn indus chas
## 36.436927 -0.108006 0.046334 0.020562 2.689026
## nox rm dis rad tax
## -17.713540 3.814394 -1.478612 0.305786 -0.012329
## ptratio black lstat
## -0.952211 0.009321 -0.523852
summary(m4)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
plot(m4)
I actually forgot to use par this time, but I much prefer the way the charts look separately - a pleasant mistake! (Not sure how this will look in the HTML rendering...)
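indus is also not significant and could be dropped in the same call with update(m3, ~ . - age - indus). To check more formally whether dropping age costs anything relative to the full model, the two nested models can be compared with an F-test (a brief sketch):
anova(m4, m3)  # a large p-value would indicate age adds little beyond the other predictors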
Fit a model with an interaction between lstat and dis.
m5 <- lm(medv~lstat*dis, Boston)
m5
##
## Call:
## lm(formula = medv ~ lstat * dis, data = Boston)
##
## Coefficients:
## (Intercept) lstat dis lstat:dis
## 37.8743 -1.0025 -0.5119 -0.0176
summary(m5)
##
## Call:
## lm(formula = medv ~ lstat * dis, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.445 -3.864 -1.461 1.962 22.474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.87434 1.19775 31.621 <2e-16 ***
## lstat -1.00251 0.07475 -13.411 <2e-16 ***
## dis -0.51191 0.26035 -1.966 0.0498 *
## lstat:dis -0.01760 0.02282 -0.771 0.4408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.1 on 502 degrees of freedom
## Multiple R-squared: 0.5627, Adjusted R-squared: 0.5601
## F-statistic: 215.4 on 3 and 502 DF, p-value: < 2.2e-16
We find the interaction of lstat and dis is not significant.
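A nested-model F-test makes a related point: comparing the simple lstat model to the interaction model tests whether dis and the interaction term jointly improve the fit (a sketch):
anova(m1, m5)  # tests dis and lstat:dis together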
Fit a model using a quadratic term. We try this because the relationship between lstat and medv did not look exactly linear. Wrapping the squared term in I() ensures that ^2 is interpreted as raising lstat to the power of 2 rather than as a formula operator.
m6 <- lm(medv~lstat+I(lstat^2), Boston)
m6
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Coefficients:
## (Intercept) lstat I(lstat^2)
## 42.86201 -2.33282 0.04355
summary(m6)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
Both the linear and the quadratic coefficients are significant. For each value of lstat, the points function overlays the fitted values from m6 on the scatterplot. Using attach keeps me from having to specify the data set in every call.
attach(Boston)
plot(medv~lstat)
points(lstat, fitted(m6), col="red",pch=20)
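Since m1 is nested in m6, an F-test can quantify how much the quadratic term improves the fit, and poly() offers a compact way to try higher-order polynomials (a sketch; the model name m6b is just for illustration):
anova(m1, m6)  # compares the linear fit to the quadratic fit
m6b <- lm(medv ~ poly(lstat, 5), data = Boston)  # fifth-degree polynomial fit
summary(m6b)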
For these analyses, I use the Carseats data set. It provides sales data for child car seats at 400 different stores. An easy way to view a data set, if it isn't too large, is fix. For those used to Excel, fix feels more familiar than native R; with fix you can actually make changes to the data set.
fix(Carseats)
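In RStudio, View() gives a similar spreadsheet-style display without the risk of accidentally editing the data (an aside):
View(Carseats)  # read-only viewer, unlike fix()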
Otherwise, we can determine what is in it using R commands.
dim(Carseats)
## [1] 400 11
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75
## Median :272.0 Median :117.0 Medium:219 Median :54.50
## Mean :264.8 Mean :115.8 Mean :53.32
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00
## Max. :509.0 Max. :191.0 Max. :80.00
## Education Urban US
## Min. :10.0 No :118 No :142
## 1st Qu.:12.0 Yes:282 Yes:258
## Median :14.0
## Mean :13.9
## 3rd Qu.:16.0
## Max. :18.0
head(Carseats)
Carseats has 400 observations with 11 features. There are no NAs. Most of the features are numeric and a few are factors.
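The missing-value claim can be verified the same way as before (a quick check, not run above):
anyNA(Carseats)  # expected to return FALSE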
We can find out how R codes the qualitative predictors as dummy variables using the contrasts function.
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
contrasts(Carseats$Urban)
## Yes
## No 0
## Yes 1
contrasts(Carseats$US)
## Yes
## No 0
## Yes 1
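R uses the first factor level (alphabetically, Bad) as the baseline. If a different baseline is preferred, the factor can be releveled before fitting (a sketch; ShelveLoc2 is just an illustrative name):
ShelveLoc2 <- relevel(Carseats$ShelveLoc, ref = "Medium")  # make Medium the baseline
contrasts(ShelveLoc2)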
Fit a simple model first.
attach(Carseats)
m7 <- lm(Sales~ShelveLoc)
m7
##
## Call:
## lm(formula = Sales ~ ShelveLoc)
##
## Coefficients:
## (Intercept) ShelveLocGood ShelveLocMedium
## 5.523 4.691 1.784
summary(m7)
##
## Call:
## lm(formula = Sales ~ ShelveLoc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3066 -1.6282 -0.0416 1.5666 6.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5229 0.2388 23.131 < 2e-16 ***
## ShelveLocGood 4.6911 0.3484 13.464 < 2e-16 ***
## ShelveLocMedium 1.7837 0.2864 6.229 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared: 0.3172, Adjusted R-squared: 0.3138
## F-statistic: 92.23 on 2 and 397 DF, p-value: < 2.2e-16
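Because ShelveLoc is the only predictor, the coefficients have a simple interpretation: the intercept is the mean sales for a bad shelf location, and the other two coefficients are the differences for good and medium locations. A quick check against the group means (a sketch, relying on the earlier attach(Carseats)):
tapply(Sales, ShelveLoc, mean)  # should match 5.52, 5.52 + 4.69, and 5.52 + 1.78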
Shelf location appears to be a strong predictor of sales: relative to a bad shelf location, good and medium locations are associated with substantially higher sales. Now fit a model with several predictors.
attach(Carseats)
## The following objects are masked from Carseats (pos = 3):
##
## Advertising, Age, CompPrice, Education, Income, Population,
## Price, Sales, ShelveLoc, Urban, US
m7.1 <- lm(Sales~Urban+ShelveLoc+US)
m7.1
##
## Call:
## lm(formula = Sales ~ Urban + ShelveLoc + US)
##
## Coefficients:
## (Intercept) UrbanYes ShelveLocGood ShelveLocMedium
## 4.8965 0.0999 4.6400 1.8168
## USYes
## 0.8506
summary(m7.1)
##
## Call:
## lm(formula = Sales ~ Urban + ShelveLoc + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.713 -1.634 -0.019 1.738 5.823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8966 0.3398 14.411 < 2e-16 ***
## UrbanYes 0.0999 0.2543 0.393 0.6947
## ShelveLocGood 4.6400 0.3453 13.438 < 2e-16 ***
## ShelveLocMedium 1.8168 0.2834 6.410 4.14e-10 ***
## USYes 0.8506 0.2424 3.510 0.0005 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared: 0.3384, Adjusted R-squared: 0.3317
## F-statistic: 50.51 on 4 and 395 DF, p-value: < 2.2e-16
Shelf location and store location (inside or outside of the US) appear to be predictors of sales, although they account for only a small portion of the variance. Other features are needed.
m7.2 <- update(m7.1, Sales~Urban+ShelveLoc+US+Income +Advertising +Age)
summary(m7.2)
##
## Call:
## lm(formula = Sales ~ Urban + ShelveLoc + US + Income + Advertising +
## Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.255 -1.500 -0.084 1.409 6.023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.853454 0.521578 11.223 < 2e-16 ***
## UrbanYes 0.096288 0.231839 0.415 0.678
## ShelveLocGood 4.689904 0.315117 14.883 < 2e-16 ***
## ShelveLocMedium 1.916887 0.259456 7.388 9.05e-13 ***
## USYes -0.244106 0.303018 -0.806 0.421
## Income 0.016652 0.003788 4.396 1.42e-05 ***
## Advertising 0.107690 0.021745 4.952 1.09e-06 ***
## Age -0.040723 0.006513 -6.252 1.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.103 on 392 degrees of freedom
## Multiple R-squared: 0.4554, Adjusted R-squared: 0.4457
## F-statistic: 46.83 on 7 and 392 DF, p-value: < 2.2e-16
We see that Urban and US location drop in significance when the additional numeric features are added; however, shelf location stays significant. The R-squared still indicates we are not capturing enough information.
m8 <- lm(Sales~.-Urban-US, data = Carseats)
summary(m8)
##
## Call:
## lm(formula = Sales ~ . - Urban - US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8403 -0.6846 0.0151 0.6702 3.3481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6472014 0.5973864 9.453 < 2e-16 ***
## CompPrice 0.0929439 0.0041451 22.422 < 2e-16 ***
## Income 0.0157321 0.0018426 8.538 3.09e-16 ***
## Advertising 0.1141447 0.0080123 14.246 < 2e-16 ***
## Population 0.0002632 0.0003641 0.723 0.470
## Price -0.0953854 0.0026726 -35.690 < 2e-16 ***
## ShelveLocGood 4.8350741 0.1527240 31.659 < 2e-16 ***
## ShelveLocMedium 1.9553465 0.1255836 15.570 < 2e-16 ***
## Age -0.0459891 0.0031817 -14.454 < 2e-16 ***
## Education -0.0197355 0.0196387 -1.005 0.316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.02 on 390 degrees of freedom
## Multiple R-squared: 0.8725, Adjusted R-squared: 0.8696
## F-statistic: 296.6 on 9 and 390 DF, p-value: < 2.2e-16
Let's see if we can improve the adjusted R-squared by removing the predictors that are not significant.
m9 <- update(m8,Sales~.-Population -Education)
summary(m9)
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
## ShelveLoc + Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7728 -0.6954 0.0282 0.6732 3.3292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.475226 0.505005 10.84 <2e-16 ***
## CompPrice 0.092571 0.004123 22.45 <2e-16 ***
## Income 0.015785 0.001838 8.59 <2e-16 ***
## Advertising 0.115903 0.007724 15.01 <2e-16 ***
## Price -0.095319 0.002670 -35.70 <2e-16 ***
## ShelveLocGood 4.835675 0.152499 31.71 <2e-16 ***
## ShelveLocMedium 1.951993 0.125375 15.57 <2e-16 ***
## Age -0.046128 0.003177 -14.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 392 degrees of freedom
## Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
## F-statistic: 381.4 on 7 and 392 DF, p-value: < 2.2e-16
plot(m9)
The diagnostic plots for this model are still rough. The adjusted R-squared did not improve, although the model still has a high R-squared. As we have a small number of features, my next step would be to use best subset selection to try to improve the model.
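As a sketch of that next step, best subset selection is available through the leaps package (assuming it is installed; the object name best is just for illustration):
# install.packages("leaps")  # if not already installed
library(leaps)
best <- regsubsets(Sales ~ ., data = Carseats, nvmax = 10)
summary(best)$adjr2  # adjusted R-squared of the best model at each size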