In this unit, we will cover an extension of simple decision tree models, namely models that combine multiple trees for improved performance (but reduced interpretability).
As discussed in the last unit, trees have many good properties: they are interpretable, work for both regression and classification, can deal with missing values, and automatically select predictors.
The main disadvantage of single-tree models is that their performance is often not that great, and more complicated trees that perform better are prone to overfitting. However, it is possible to build models that combine many single trees. Usually these so-called ensemble models (the general name for an ML model that obtains an aggregate result by combining many other models) sacrifice some interpretability for performance. The idea behind more sophisticated tree-based methods is to take multiple individual trees and add/average them together in a smart way to end up with an ensemble of trees that often performs much better than a single tree. Some of the most common approaches are Bagging (Bootstrap aggregation), Random Forests, and Boosting. We’ll cover each one very briefly.
The main goal of bagging is to reduce variance, which is the main factor that generally hurts the performance of a single tree (even if one uses pruning or cross-validation). Remember that if we have N independent observations, each with the same variance, then their average has that variance divided by N. Similarly, if we had M independent datasets for the same process, each with N observations, and we built a model for each dataset, we could average over the models and thereby reduce the variance to roughly 1/M of that of a single model. We don’t have M different datasets, but we can re-sample our existing data (using bootstrapping) and build a model for each sample. We can then, in the end, average over the different models to reduce variance. Here, each model is a tree, but we could also apply this approach to other models. The final model is the average of all the trees/individual models. Since the bagging procedure reduces the variance, individual trees are not pruned. Bagging leads to models with less variance and thus generally better model performance/predictive power. What is lost is the ease of interpretation, since our final model is now the sum of a (possibly large) number of individual trees.
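To make the bootstrapping-and-averaging idea concrete, here is a minimal hand-rolled bagging sketch in R. It uses the rpart package and the built-in mtcars data purely for illustration; the number of trees, the formula, and the tree settings are arbitrary example choices, not recommendations.

```r
library(rpart)

set.seed(123)
n_trees <- 100
dat <- mtcars   # built-in data set; we predict mpg from all other variables

# grow one unpruned tree per bootstrap re-sample of the data
bagged_trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(nrow(dat), replace = TRUE)        # bootstrap re-sample
  rpart(mpg ~ ., data = dat[boot_rows, ],
        control = rpart.control(cp = 0, minsplit = 2))  # deliberately not pruned
})

# the bagged prediction is the average of the individual tree predictions
bagged_predict <- function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}

head(bagged_predict(bagged_trees, dat))
```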
The random forest model/algorithm is similar to bagging. The difference is that at each split of each tree, instead of considering all possible predictors, we pick a random subset of predictors and only split on the best among those. Since we are artificially limiting ourselves, we obtain many trees that don’t perform too well on their own. However, this random choice of predictor subsets leads to more “diversity” in the tree structure (i.e., it counteracts the greedy nature of the standard tree-building algorithm). This helps de-correlate the trees when we combine them in the final model, which often further improves model performance. The cost is the same as for bagging, namely that the final model is a sum of trees which is hard to understand and interpret.
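For fitting a random forest in practice, a minimal sketch using tidymodels/parsnip with the ranger engine could look like the following; mtcars and the values for mtry (the size of the random predictor subset) and trees are again just illustrative.

```r
library(tidymodels)

# model specification: 500 trees, 3 randomly chosen predictors per split
rf_spec <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# fit and predict (on the training data, just to show the mechanics)
rf_fit <- fit(rf_spec, mpg ~ ., data = mtcars)
predict(rf_fit, new_data = mtcars) %>% head()
```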
Boosting takes a somewhat different approach than the other two. Instead of averaging over full trees, we build many small trees. The procedure starts by building a tree with a specified, often small, number of splits (this number is a tuning parameter). This small tree is added to the model, the change in performance is computed (e.g., the reduction in RMSE or misclassification error), and a new tree is built that tries to reduce the leftover (residual) errors. In this fashion, many small trees are added, each one trying to correct the errors left by the previous trees. In the end, one obtains a sum of many small trees as the final model. This tree ensemble is expected to perform much better than each of the individual trees. Again, the final model is somewhat hard to interpret.
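A corresponding boosting sketch with parsnip and the xgboost engine might look like this; the shallow tree depth and the large number of trees reflect the “many small trees” idea, and all parameter values are illustrative.

```r
library(tidymodels)

# many shallow trees (depth 2), each added with a small learning rate
boost_spec <- boost_tree(trees = 1000, tree_depth = 2, learn_rate = 0.01) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boost_fit <- fit(boost_spec, mpg ~ ., data = mtcars)
predict(boost_fit, new_data = mtcars) %>% head()
```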
All the advantages mentioned for a single-tree model also apply to many-tree models. The main additional advantage is that these models often have (much) improved performance compared to a single tree. It is often possible to use an algorithm that implements one of these many-tree models and, with little adjustment and tuning, obtain a model with excellent performance. It is, however, still important to tune/train the model, as in the sketch below.
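As a rough illustration of what such tuning can look like with tidymodels, here is a minimal sketch that cross-validates a boosted tree over a small grid of tuning parameters; the data, the grid ranges, and 5-fold cross-validation are arbitrary example choices.

```r
library(tidymodels)

# mark the number of trees and the tree depth as parameters to tune
tune_spec <- boost_tree(trees = tune(), tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(mpg ~ .)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)                  # 5-fold cross-validation
grid  <- grid_regular(trees(range = c(100, 1000)),
                      tree_depth(range = c(1, 3)),
                      levels = 3)                 # 3 x 3 grid of candidate values

res <- tune_grid(wf, resamples = folds, grid = grid)
show_best(res, metric = "rmse")
```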
I already mentioned the main disadvantage: these models are hard to interpret. If the goal is to have a user-friendly model that can be applied by humans, single trees are best. If the user is ok with typing values into a computer (or smartphone) and letting an algorithm run in the background to produce answers, then more complex models might be fine.
Another possible disadvantage of many-tree models is that they generally take longer to train, so if speed is important, one might not be able to use all of these model variants to their full extent.
As mentioned, bagging and boosting are approaches that can be applied to methods other than trees. For instance, one can bag/boost linear models, GAMs, etc.
The many-tree methods described above are examples of what are often called ensemble methods. Boosting is an example of using an ensemble of weak learners, i.e., a combination of models that individually don’t perform that well, but when combined, they often have outstanding performance. The combination of different individual models/learners (weak or not) is often called model averaging or model stacking, and such methods lead to some of the best-performing models in machine learning. For more on this, see e.g., the Stacked Models chapter of HMLR. We will not cover these models further, but if you have a situation where you need the best model performance available, such models might be worth looking at.
The tidymodels framework allows access to many different types of packages/algorithms that implement tree-based models. You can see the full list here. Many of these methods and packages are similar and differ in the details of the model implementation, the speed at which they run, whether one can parallelize them, etc. Some common ones that people use are rpart for individual trees, ranger for random forests, and xgboost for boosting. Many others are available too, e.g., the bonsai package has several other methods. Often it doesn’t matter much which package you use. Sometimes you might need specific features that only a certain package can give you, so you need to go with that one.
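If you want to check from within R which engines parsnip currently supports for a given model type (instead of looking at the online list), something like the following should work.

```r
library(parsnip)

# list the engines parsnip currently knows about for two tree-based model types
show_engines("rand_forest")
show_engines("boost_tree")
```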
Since tidymodels is still under development, there are many other packages for tree-based and related model fitting that can’t yet be accessed through it. If you absolutely need one that’s not yet supported, you can either try to implement it through tidymodels or just write code that uses the model directly.
For more details on the many-tree methods described here, see chapter 8 of ISLR and chapters 10-12 of HMLR.