I use the ISLR library to get the Carseats data set. I also install and load the tree package.
library(ISLR)
data("Carseats")
install.packages("tree")  # only needed once
library(tree)
I create a histogram of Sales to help decide where to split it into the two classes. The values on the x-axis are in thousands of dollars.
attach(Carseats)
hist(Sales)
Based on the histogram, I create a binary variable that splits Sales into "high" and "low" at the $8,000 mark. I encode it as a factor so that tree() fits a classification tree (in R >= 4.0, data.frame() no longer converts character columns to factors automatically).
# Factor response: "No" for sales at or below $8,000, "Yes" above
salesVol <- as.factor(ifelse(Sales <= 8, "No", "Yes"))
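As an optional check (my own addition, not part of the original walkthrough), I can mark the cut point on the histogram and look at the resulting class balance:
# Mark the $8,000 cut point on the earlier histogram and tabulate the classes
abline(v = 8, col = "red", lty = 2)
table(salesVol)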
I create a new data frame to hold the Carseats data and the class label for each observation.
Carseats <- data.frame(Carseats,salesVol)
head(Carseats)
I create a decision tree model using the new Carseats data frame. I make sure to exclude Sales from the model, since my salesVol variable is derived directly from Sales. I also get a summary of the decision tree model.
tree.carseats <- tree(salesVol~.-Sales, data=Carseats)
summary(tree.carseats)
Classification tree:
tree(formula = salesVol ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc" "Price" "Income" "CompPrice"
[5] "Population" "Advertising" "Age" "US"
Number of terminal nodes: 27
Residual mean deviance: 0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400
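Two details worth noting (my gloss, not in the original output): the residual mean deviance divides the total deviance 170.7 by n minus the number of terminal nodes (400 - 27 = 373), and the 0.09 training misclassification rate can be reproduced from the model's fitted class labels. A minimal check, assuming predict() with no new data returns fitted values for the training set:
# Reproduce the reported training error: 36 misclassified out of 400 = 0.09
train.pred <- predict(tree.carseats, type = "class")
mean(train.pred != salesVol)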
I end up with 27 terminal nodes. I can actually plot the model to see a tree diagram.
plot(tree.carseats)
text(tree.carseats, pretty = 0, cex = 0.45)
The label under each terminal node gives the predicted class for observations in that segment. The length of each branch is proportional to the decrease in deviance achieved by the split, and those improvements get smaller as we make more splits.
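If the proportional branch lengths make the lower splits hard to read, plot.tree also accepts type = "uniform" (a small aside I'm adding), which redraws the same tree with equal branch lengths:
# Redraw the same tree with uniform branch lengths for readability
plot(tree.carseats, type = "uniform")
text(tree.carseats, pretty = 0, cex = 0.45)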
I create a training data set with 250 of the 400 total observations.
set.seed(1011)
train <- sample(1:nrow(Carseats), 250)
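The 150 rows left out of train will serve as the test set. Just to make that explicit (test.idx is an illustrative helper I'm adding; the code below keeps indexing with -train):
# The held-out rows form the test set; shown for illustration only
test.idx <- setdiff(1:nrow(Carseats), train)
length(test.idx)  # 150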
I create and plot a decision tree using only the training data.
tree.carseatsTrainModel <- tree(salesVol~.-Sales, data = Carseats, subset=train)
plot(tree.carseatsTrainModel)
text(tree.carseatsTrainModel, pretty=0, cex=0.45)
I predict class labels for the 150 held-out test observations. Setting type = "class" tells predict() to return class labels rather than class probabilities.
tree.pred <- predict(tree.carseatsTrainModel, Carseats[-train,], type="class")
I evaluate the error with a confusion matrix.
with(Carseats[-train,], table(tree.pred,salesVol))
         salesVol
tree.pred No Yes
      No  72  27
      Yes 18  33
(72+33)/150
[1] 0.7
That is, 72 + 33 = 105 of the 150 test observations are classified correctly, an accuracy of 0.70, so the misclassification error rate is 1 - 0.70 = 0.30.
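Equivalently (a one-liner I'm adding here), the test error rate can be computed directly from the predictions:
# Test misclassification rate, computed directly
mean(tree.pred != Carseats$salesVol[-train])  # 0.30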
I use cross-validation (cv.tree) to decide how far to prune the tree. Passing FUN = prune.misclass tells cv.tree to use the misclassification error rate, rather than the default deviance, to guide the pruning.
cv.carseats <- cv.tree(tree.carseatsTrainModel, FUN=prune.misclass)
plot(cv.carseats)
I pick 13 as the number of terminal nodes, since it sits in the middle of the flat region where the cross-validated misclassification error is lowest.
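To double-check the reading from the plot, I can inspect the components cv.tree returns (size and dev are part of its actual return value; the programmatic minimum may land on a neighboring size, since several sizes tie near the bottom of the curve):
# Candidate tree sizes and their cross-validated misclassification counts
cbind(size = cv.carseats$size, misclass = cv.carseats$dev)
cv.carseats$size[which.min(cv.carseats$dev)]  # size with the fewest CV errors
With 13 chosen, I prune the tree: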
prune.carseats <- prune.misclass(tree.carseatsTrainModel, best=13)
plot(prune.carseats)
text(prune.carseats, pretty=0, cex=0.7)
I will use this new “pruned” model on the test data set, then evaluate the error.
tree.pred.Pruned <- predict(prune.carseats, Carseats[-train,], type="class")
with(Carseats[-train,], table(tree.pred.Pruned,salesVol))
                salesVol
tree.pred.Pruned No Yes
             No  72  28
             Yes 18  32
(72+32)/150
[1] 0.6933333
I find that the error rate did not improve with the pruned model; at roughly 0.31 versus 0.30, it is essentially the same, though the pruned tree delivers that performance with only 13 terminal nodes.
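For a final side-by-side check (my addition, using the predictions computed above):
# Unpruned vs. pruned test error rates
mean(tree.pred != Carseats$salesVol[-train])         # full tree: 0.30
mean(tree.pred.Pruned != Carseats$salesVol[-train])  # pruned tree: ~0.31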