Chapter Status: This chapter was originally written using the tree package. Currently being re-written to exclusively use the rpart package, which seems more widely suggested and provides better plotting features.
In this document, we will use the package tree for both classification and regression trees. Note that there are many packages to do this in R. rpart may be the most common; however, we will use tree for simplicity.

26.1 Classification Trees
To understand classification trees, we will use the Carseats dataset from the ISLR package. We will first modify the response variable Sales from its original use as a numerical variable to a categorical variable, with High for high sales and Low for low sales.

We first fit an unpruned classification tree using all of the predictors. Details of this process can be found using ?tree and ?tree.control.
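A sketch of this setup follows. The cutoff of 8 for the High versus Low split and the object name seat_tree are illustrative assumptions, not taken from the original text.

```r
library(ISLR)
library(tree)

# recode the numeric Sales response as a two-level factor
# (a cutoff of 8 is assumed here)
Carseats$Sales = as.factor(ifelse(Carseats$Sales <= 8, "Low", "High"))

# fit an unpruned classification tree using all predictors
seat_tree = tree(Sales ~ ., data = Carseats)
summary(seat_tree)
```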
We see this tree has 27 terminal nodes and a misclassification rate of 0.09.
Above we plot the tree. Below we output the details of the splits.
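A sketch of those two steps, using the tree object from above:

```r
# plot the tree structure with split labels
plot(seat_tree)
text(seat_tree, pretty = 0)
title(main = "Unpruned Classification Tree")

# print the full details of each split
seat_tree
```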
We now test-train split the data so we can evaluate how well our tree is working. We use 200 observations for each.
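A sketch of the split and the refit, with assumed object names:

```r
set.seed(2)
seat_idx = sample(nrow(Carseats), 200)
seat_trn = Carseats[seat_idx, ]
seat_tst = Carseats[-seat_idx, ]

# refit the tree using only the training data
seat_tree_trn = tree(Sales ~ ., data = seat_trn)
summary(seat_tree_trn)
```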
Note that the tree is not using all of the available variables.
Also notice that this new tree is slightly different than the tree fit to all of the data.
When using the predict() function on a tree, the default type is vector, which gives predicted probabilities for both classes. We will use type = class to directly obtain classes. We first fit the tree using the training data (above), then obtain predictions on both the train and test set, then view the confusion matrix for both.
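A sketch of these prediction steps, assuming the objects from the split above:

```r
# predicted classes on train and test
seat_trn_pred = predict(seat_tree_trn, seat_trn, type = "class")
seat_tst_pred = predict(seat_tree_trn, seat_tst, type = "class")

# confusion matrices for both sets
table(predicted = seat_trn_pred, actual = seat_trn$Sales)
table(predicted = seat_tst_pred, actual = seat_tst$Sales)
```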
Here it is easy to see that the tree has been over-fit. The train set performs much better than the test set.
We will now use cross-validation to find a tree by considering trees of different sizes which have been pruned from our original tree.
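Something like the following performs this cross-validation; the seed and object names are assumptions.

```r
set.seed(3)
seat_tree_cv = cv.tree(seat_tree_trn, FUN = prune.misclass)

# with prune.misclass, dev counts CV misclassifications;
# divide by the training size to get a rate
plot(seat_tree_cv$size, seat_tree_cv$dev / nrow(seat_trn), type = "b",
     xlab = "Tree Size", ylab = "CV Misclassification Rate")
```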
It appears that a tree of size 9 has the fewest misclassifications of the considered trees, via cross-validation.
We use prune.misclass() to obtain that tree from our original tree, and plot this smaller tree.
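A sketch, continuing with the assumed names from above:

```r
# prune to the size chosen by cross-validation (size 9 here)
seat_tree_prune = prune.misclass(seat_tree_trn, best = 9)
plot(seat_tree_prune)
text(seat_tree_prune, pretty = 0)
title(main = "Pruned Classification Tree")
```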
We again obtain predictions using this smaller tree, and evaluate on the test and train sets.
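```r
# predictions from the pruned tree on train and test
prune_trn_pred = predict(seat_tree_prune, seat_trn, type = "class")
prune_tst_pred = predict(seat_tree_prune, seat_tst, type = "class")

table(predicted = prune_trn_pred, actual = seat_trn$Sales)
table(predicted = prune_tst_pred, actual = seat_tst$Sales)
```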
The train set has performed almost as well as before, and there was a small improvement in the test set, but it is still obvious that we have over-fit. Trees tend to do this. We will look at several ways to fix this, including bagging, boosting, and random forests.
26.2 Regression Trees
To demonstrate regression trees, we will use the Boston data. Recall medv is the response. We first split the data in half, then fit an unpruned regression tree to the training data.
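A sketch of these steps, assuming Boston comes from the MASS package and using illustrative object names:

```r
library(MASS)
library(tree)

set.seed(18)
boston_idx = sample(nrow(Boston), nrow(Boston) / 2)
boston_trn = Boston[boston_idx, ]
boston_tst = Boston[-boston_idx, ]

# unpruned regression tree on the training half
boston_tree = tree(medv ~ ., data = boston_trn)
summary(boston_tree)
```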
As with classification trees, we can use cross-validation to select a good pruning of the tree.
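For example, continuing the sketch above:

```r
boston_tree_cv = cv.tree(boston_tree)

# for regression, dev is the total CV sum of squared errors;
# convert to an RMSE-style scale for plotting
plot(boston_tree_cv$size, sqrt(boston_tree_cv$dev / nrow(boston_trn)),
     type = "b", xlab = "Tree Size", ylab = "CV-RMSE")
```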
While the tree of size 9 does have the lowest RMSE, we’ll prune to a size of 7 as it seems to perform just as well. (Otherwise we would not be pruning.) The pruned tree is, as expected, smaller and easier to interpret.
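```r
# prune to size 7 and plot the smaller tree
boston_tree_prune = prune.tree(boston_tree, best = 7)
plot(boston_tree_prune)
text(boston_tree_prune, pretty = 0)
title(main = "Pruned Regression Tree")
```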
Let’s compare this regression tree to an additive linear model and use RMSE as our metric.
We obtain predictions on the train and test sets from the pruned tree. We also plot actual vs predicted. This plot may look odd. We’ll compare it to a plot for linear regression below.
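A sketch of these steps; the rmse() helper is ours, not from a package.

```r
# RMSE helper
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# pruned-tree predictions and test RMSE
boston_tree_tst_pred = predict(boston_tree_prune, newdata = boston_tst)
rmse(boston_tst$medv, boston_tree_tst_pred)

# actual vs predicted; a tree predicts a single value per terminal
# node, so the points fall in vertical bands
plot(boston_tree_tst_pred, boston_tst$medv,
     xlab = "Predicted", ylab = "Actual")
abline(0, 1, col = "dodgerblue", lwd = 2)
```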
Here, using an additive linear regression, the actual vs predicted plot looks much more like what we are used to (see the sketch below).
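```r
# additive linear model for comparison
boston_lm = lm(medv ~ ., data = boston_trn)
boston_lm_tst_pred = predict(boston_lm, newdata = boston_tst)

plot(boston_lm_tst_pred, boston_tst$medv,
     xlab = "Predicted", ylab = "Actual")
abline(0, 1, col = "dodgerblue", lwd = 2)

rmse(boston_tst$medv, boston_lm_tst_pred)
summary(boston_lm)
```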
We also see a lower test RMSE. The most obvious linear regression beats the tree! Again, we'll improve on this tree soon. Also note the summary of the additive linear regression above. Which is easier to interpret: that output, or the small tree above?
26.3 rpart Package
The rpart package is an alternative method for fitting trees in R. It is much more feature rich, including fitting multiple cost complexities and performing cross-validation by default. It also has the ability to produce much nicer trees. Based on its default settings, it will often result in smaller trees than using the tree package. See the references below for more information. rpart can also be tuned via caret.
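A minimal rpart sketch, reusing the Carseats training data from above (object names assumed):

```r
library(rpart)
library(rpart.plot)

# classification tree via rpart
seat_rpart = rpart(Sales ~ ., data = seat_trn, method = "class")
rpart.plot(seat_rpart)

# rpart cross-validates over cost-complexity values by default;
# printcp() shows the resulting table used for pruning
printcp(seat_rpart)
```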
26.4 External Links

- An Introduction to Recursive Partitioning Using the rpart Routines - Details of the rpart package.
- rpart.plot Package - Detailed manual on plotting with rpart using the rpart.plot package.
26.5 rmarkdown
The rmarkdown file for this chapter can be found here. The file was created using R version 4.0.2. The following packages (and their dependencies) were loaded when knitting this file: