Install the Data Sets

install.packages("ISLR")

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.4
library(MASS)

These libraries have data sets used below. One is a data frame named Boston.

?Boston
## starting httpd help server ... done

The Boston data frame contains housing values in the suburbs of Boston, with 506 rows and 14 columns. Each observation is a town.

anyNA(Boston)
## [1] FALSE

There are no missing values in the data set.

I plot the median value of owner-occupied homes against the percent of 'lower status' population. Note that the median home values appear low because this data is several decades old.

plot(medv~lstat, Boston)

The plot indicates that as percentage of “lower status” people increases, median home value decreases. It also indicates towns with the most “lower status” people do not necessarily have the lowest median home values.
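The strength of that negative relationship can be quantified with a correlation coefficient. A minimal sketch (Boston comes from MASS, loaded above):

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation between lstat and medv; a value near -1
# indicates a strong negative linear association
cor(Boston$lstat, Boston$medv)  # about -0.74
```

Its square, about 0.54, matches the R-squared of the simple regression fit below.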

Simple Linear Regression

Now we fit a linear model to this data.

m1 <- lm(medv~lstat, data=Boston)
m1
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95
summary(m1)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

The coefficient of lstat is negative, as expected. Calling summary provides the p-value for lstat, which indicates a significant relationship between lstat and medv. Add a line to the plot.

plot(medv~lstat, Boston)
abline(m1, col="red")

What components are provided by the model?

names(m1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Get confidence intervals for each coefficient, 95% by default.

confint(m1)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
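These intervals can be reproduced by hand from the coefficient table, as estimate ± t-quantile × standard error. A sketch (the model is refit here so the chunk is self-contained):

```r
library(MASS)
m1 <- lm(medv ~ lstat, data = Boston)

# 95% CI for the lstat slope: estimate +/- t * SE on 504 residual df
est <- coef(summary(m1))["lstat", "Estimate"]
se  <- coef(summary(m1))["lstat", "Std. Error"]
tq  <- qt(0.975, df = m1$df.residual)
c(lower = est - tq * se, upper = est + tq * se)  # matches confint(m1)
```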

Now let’s try to predict medv with some imaginary values of lstat. We will also request the confidence interval.

predict(m1, data.frame(lstat=c(5,10,15,20,22)), interval="confidence")
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
## 4 15.55285 14.77355 16.33216
## 5 13.65276 12.75796 14.54755
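Note that interval="confidence" expresses uncertainty about the mean response at each lstat value. For the range a single town's medv might fall in, request a prediction interval instead, which is wider because it also includes the residual variance. A sketch:

```r
library(MASS)
m1 <- lm(medv ~ lstat, data = Boston)

# Prediction intervals cover individual responses, not just the mean,
# so they are wider than the confidence intervals above
predict(m1, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
```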

Multiple Linear Regression

I fit a linear model to the data again, this time using multiple predictors. I add the age feature, which gives the percentage of owner-occupied units built prior to 1940 in each town.

m2 <- lm(medv~lstat+age, data=Boston)
m2
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat          age  
##    33.22276     -1.03207      0.03454
summary(m2)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

The coefficient of lstat is negative, as expected. The coefficient of age is small but positive, indicating that towns with more older homes tend to have higher median home values. This also makes sense, as we typically expect homes to appreciate over time. The p-values of lstat and age are both significant at the 95% confidence level.

Note that R provides little indicators of each p-value’s significance.

I now fit a linear model with all predictors except medv.

m3 <- lm(medv~., data=Boston)
m3
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Coefficients:
## (Intercept)         crim           zn        indus         chas  
##   3.646e+01   -1.080e-01    4.642e-02    2.056e-02    2.687e+00  
##         nox           rm          age          dis          rad  
##  -1.777e+01    3.810e+00    6.922e-04   -1.476e+00    3.060e-01  
##         tax      ptratio        black        lstat  
##  -1.233e-02   -9.527e-01    9.312e-03   -5.248e-01
summary(m3)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Note that age has drastically decreased in significance in the presence of the additional variables.

I'm going to plot my new model. To prepare the plot area, I use par to split it into quadrants, since plot produces four diagnostic charts for a linear model.

par(mfrow=c(2,2))
plot(m3)
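Beyond the four default diagnostic plots, individual diagnostics can be extracted directly. For example, leverage values flag observations with unusual predictor combinations; a sketch (the model is refit here so the chunk is self-contained):

```r
library(MASS)
m3 <- lm(medv ~ ., data = Boston)

# Hat values (leverage); the average leverage is (p + 1)/n = 14/506
# here, and points far above that average deserve a closer look
h <- hatvalues(m3)
which.max(h)  # index of the highest-leverage observation
mean(h)
```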

I recreate my model, intending to remove age and indus because they were not significant at my confidence level. I use the update function to change the existing model. Careful with the syntax, though: because of the comma below, only age is actually dropped; to drop both terms the formula should be ~ . - age - indus, which is why indus still appears in the output.

m4 <- update(m3, ~.-age,-indus)
m4
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
##     rad + tax + ptratio + black + lstat, data = Boston)
## 
## Coefficients:
## (Intercept)         crim           zn        indus         chas  
##   36.436927    -0.108006     0.046334     0.020562     2.689026  
##         nox           rm          dis          rad          tax  
##  -17.713540     3.814394    -1.478612     0.305786    -0.012329  
##     ptratio        black        lstat  
##   -0.952211     0.009321    -0.523852
summary(m4)
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
##     rad + tax + ptratio + black + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6054  -2.7313  -0.5188   1.7601  26.2243 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
## crim         -0.108006   0.032832  -3.290 0.001075 ** 
## zn            0.046334   0.013613   3.404 0.000719 ***
## indus         0.020562   0.061433   0.335 0.737989    
## chas          2.689026   0.859598   3.128 0.001863 ** 
## nox         -17.713540   3.679308  -4.814 1.97e-06 ***
## rm            3.814394   0.408480   9.338  < 2e-16 ***
## dis          -1.478612   0.190611  -7.757 5.03e-14 ***
## rad           0.305786   0.066089   4.627 4.75e-06 ***
## tax          -0.012329   0.003755  -3.283 0.001099 ** 
## ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
## black         0.009321   0.002678   3.481 0.000544 ***
## lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7343 
## F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16
plot(m4)

I actually forgot to use par but I much prefer the way the charts look separately - pleasant mistake! (Not sure how this will look in the html rendering…)

Interactions

Fit a model with an interaction between lstat and dis.

m5 <- lm(medv~lstat*dis, Boston)
m5
## 
## Call:
## lm(formula = medv ~ lstat * dis, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat          dis    lstat:dis  
##     37.8743      -1.0025      -0.5119      -0.0176
summary(m5)
## 
## Call:
## lm(formula = medv ~ lstat * dis, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.445  -3.864  -1.461   1.962  22.474 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.87434    1.19775  31.621   <2e-16 ***
## lstat       -1.00251    0.07475 -13.411   <2e-16 ***
## dis         -0.51191    0.26035  -1.966   0.0498 *  
## lstat:dis   -0.01760    0.02282  -0.771   0.4408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.1 on 502 degrees of freedom
## Multiple R-squared:  0.5627, Adjusted R-squared:  0.5601 
## F-statistic: 215.4 on 3 and 502 DF,  p-value: < 2.2e-16

We find the interaction of lstat and dis is not significant.
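Whether dis contributes anything beyond lstat can also be tested jointly with an F-test on the nested models (both refit here so the chunk is self-contained):

```r
library(MASS)
m1 <- lm(medv ~ lstat, data = Boston)
m5 <- lm(medv ~ lstat * dis, data = Boston)

# Joint F-test for adding dis and lstat:dis to the lstat-only model
anova(m1, m5)
```

The joint test can come out significant even when the interaction term alone is not, since it also covers the main effect of dis.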

Non-linear Models

Fit a model using a quadratic term. We try this because the relationship between lstat and medv did not look exactly linear. The I() function ensures '^2' is interpreted as arithmetic exponentiation (lstat squared) rather than as part of the formula syntax.

m6 <- lm(medv~lstat+I(lstat^2), Boston)
m6
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Coefficients:
## (Intercept)        lstat   I(lstat^2)  
##    42.86201     -2.33282      0.04355
summary(m6)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

Both the linear and the quadratic coefficients are significant. The points function overlays the fitted values from m6 on the scatterplot, one point for each value of lstat. Using attach saves me from specifying the data set in every call.

attach(Boston)
plot(medv~lstat)
points(lstat, fitted(m6), col="red",pch=20)
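An equivalent and often tidier way to add polynomial terms is poly(), which uses orthogonal polynomials and scales naturally to higher degrees. A sketch:

```r
library(MASS)

# Degree-2 orthogonal polynomial; the same fit as
# medv ~ lstat + I(lstat^2), just parameterized differently
m6p <- lm(medv ~ poly(lstat, 2), data = Boston)
summary(m6p)$r.squared  # matches m6's R-squared of 0.6407
```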

Qualitative Predictors

For these analyses, I use the data set Carseats. It provides sales data for car seats at 400 different stores. An easy way to view a data set, if it isn't too large, is fix. For those used to Excel, fix feels more familiar than native R; with fix you can actually make changes to the data set.

fix(Carseats)

Otherwise, we can determine what is in it using R commands.

dim(Carseats)
## [1] 400  11
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age       
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00  
##    Education    Urban       US     
##  Min.   :10.0   No :118   No :142  
##  1st Qu.:12.0   Yes:282   Yes:258  
##  Median :14.0                      
##  Mean   :13.9                      
##  3rd Qu.:16.0                      
##  Max.   :18.0
head(Carseats)

Carseats has 400 observations with 11 features. There are no NAs. Most of the features are numeric and a few are factors.

We can find out how R codes the qualitative predictors as dummy variables using the contrasts function.

contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
contrasts(Carseats$Urban)
##     Yes
## No    0
## Yes   1
contrasts(Carseats$US)
##     Yes
## No    0
## Yes   1
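By default R uses the first factor level alphabetically (here Bad) as the baseline. The reference level can be changed with relevel(); for example, making Medium the baseline so the dummy columns contrast Bad and Good against a medium shelf location. A sketch:

```r
library(ISLR)

# Recode ShelveLoc so "Medium" is the reference category
sl <- relevel(Carseats$ShelveLoc, ref = "Medium")
contrasts(sl)  # dummy columns are now Bad and Good
```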

Fit a simple model first.

attach(Carseats)
m7 <- lm(Sales~ShelveLoc)
m7
## 
## Call:
## lm(formula = Sales ~ ShelveLoc)
## 
## Coefficients:
##     (Intercept)    ShelveLocGood  ShelveLocMedium  
##           5.523            4.691            1.784
summary(m7)
## 
## Call:
## lm(formula = Sales ~ ShelveLoc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3066 -1.6282 -0.0416  1.5666  6.1471 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.5229     0.2388  23.131  < 2e-16 ***
## ShelveLocGood     4.6911     0.3484  13.464  < 2e-16 ***
## ShelveLocMedium   1.7837     0.2864   6.229  1.2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared:  0.3172, Adjusted R-squared:  0.3138 
## F-statistic: 92.23 on 2 and 397 DF,  p-value: < 2.2e-16

Shelf location is a significant predictor of sales at every level. Next, fit a model with several qualitative predictors.

attach(Carseats)
## The following objects are masked from Carseats (pos = 3):
## 
##     Advertising, Age, CompPrice, Education, Income, Population,
##     Price, Sales, ShelveLoc, Urban, US
m7.1 <- lm(Sales~Urban+ShelveLoc+US)
m7.1
## 
## Call:
## lm(formula = Sales ~ Urban + ShelveLoc + US)
## 
## Coefficients:
##     (Intercept)         UrbanYes    ShelveLocGood  ShelveLocMedium  
##          4.8965           0.0999           4.6400           1.8168  
##           USYes  
##          0.8506
summary(m7.1)
## 
## Call:
## lm(formula = Sales ~ Urban + ShelveLoc + US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.713 -1.634 -0.019  1.738  5.823 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.8966     0.3398  14.411  < 2e-16 ***
## UrbanYes          0.0999     0.2543   0.393   0.6947    
## ShelveLocGood     4.6400     0.3453  13.438  < 2e-16 ***
## ShelveLocMedium   1.8168     0.2834   6.410 4.14e-10 ***
## USYes             0.8506     0.2424   3.510   0.0005 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3317 
## F-statistic: 50.51 on 4 and 395 DF,  p-value: < 2.2e-16

Shelf location and store location (inside or outside the US) appear to be predictors of sales, although they account for only a small portion of the variance. Other features are needed.

m7.2 <- update(m7.1, Sales~Urban+ShelveLoc+US+Income +Advertising +Age)
summary(m7.2)
## 
## Call:
## lm(formula = Sales ~ Urban + ShelveLoc + US + Income + Advertising + 
##     Age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.255 -1.500 -0.084  1.409  6.023 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.853454   0.521578  11.223  < 2e-16 ***
## UrbanYes         0.096288   0.231839   0.415    0.678    
## ShelveLocGood    4.689904   0.315117  14.883  < 2e-16 ***
## ShelveLocMedium  1.916887   0.259456   7.388 9.05e-13 ***
## USYes           -0.244106   0.303018  -0.806    0.421    
## Income           0.016652   0.003788   4.396 1.42e-05 ***
## Advertising      0.107690   0.021745   4.952 1.09e-06 ***
## Age             -0.040723   0.006513  -6.252 1.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.103 on 392 degrees of freedom
## Multiple R-squared:  0.4554, Adjusted R-squared:  0.4457 
## F-statistic: 46.83 on 7 and 392 DF,  p-value: < 2.2e-16

We see that Urban and US drop in significance when additional numeric features are added; shelf location, however, stays significant. The R-squared still indicates we are not capturing enough of the variance. Next, I fit a model with all predictors except Urban and US.

m8 <- lm(Sales~.-Urban-US, data = Carseats)
summary(m8)
## 
## Call:
## lm(formula = Sales ~ . - Urban - US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8403 -0.6846  0.0151  0.6702  3.3481 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.6472014  0.5973864   9.453  < 2e-16 ***
## CompPrice        0.0929439  0.0041451  22.422  < 2e-16 ***
## Income           0.0157321  0.0018426   8.538 3.09e-16 ***
## Advertising      0.1141447  0.0080123  14.246  < 2e-16 ***
## Population       0.0002632  0.0003641   0.723    0.470    
## Price           -0.0953854  0.0026726 -35.690  < 2e-16 ***
## ShelveLocGood    4.8350741  0.1527240  31.659  < 2e-16 ***
## ShelveLocMedium  1.9553465  0.1255836  15.570  < 2e-16 ***
## Age             -0.0459891  0.0031817 -14.454  < 2e-16 ***
## Education       -0.0197355  0.0196387  -1.005    0.316    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 390 degrees of freedom
## Multiple R-squared:  0.8725, Adjusted R-squared:  0.8696 
## F-statistic: 296.6 on 9 and 390 DF,  p-value: < 2.2e-16

Let's see if we can improve the adjusted R-squared by removing the predictors that are not significant.

m9 <- update(m8,Sales~.-Population -Education)
summary(m9)
## 
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
##     ShelveLoc + Age, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7728 -0.6954  0.0282  0.6732  3.3292 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.475226   0.505005   10.84   <2e-16 ***
## CompPrice        0.092571   0.004123   22.45   <2e-16 ***
## Income           0.015785   0.001838    8.59   <2e-16 ***
## Advertising      0.115903   0.007724   15.01   <2e-16 ***
## Price           -0.095319   0.002670  -35.70   <2e-16 ***
## ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
## ShelveLocMedium  1.951993   0.125375   15.57   <2e-16 ***
## Age             -0.046128   0.003177  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 392 degrees of freedom
## Multiple R-squared:  0.872,  Adjusted R-squared:  0.8697 
## F-statistic: 381.4 on 7 and 392 DF,  p-value: < 2.2e-16
plot(m9)

This manual pruning is a rough approach. The adjusted R-squared did not improve, although it remains high. As we have a small number of features, my next step would be to use best subset selection to try to improve the model.
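Best subset selection is provided by regsubsets() in the leaps package (an assumption: leaps is not loaded above and may need installing first). A minimal sketch:

```r
# install.packages("leaps")  # if not already installed
library(leaps)
library(ISLR)

# Exhaustive search over subsets of the Carseats predictors
bs <- regsubsets(Sales ~ ., data = Carseats, nvmax = 11)
bs.sum <- summary(bs)
which.max(bs.sum$adjr2)  # subset size with the best adjusted R-squared
```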