In this unit, we will cover a class of models called classification and regression trees.
GLMs have the main limitation of only capturing linear relations between inputs and outcomes–any higher order terms or interactions must be manually specified. The polynomial and spline models we discussed allow for more flexible patterns, at the cost of more potential overfitting and also reduced interpretability. Another class of models that allow non-linear relation between inputs and outcomes are based on Classification and regression trees (CART). Outside of data science/statistics, those trees are often also known as decision trees.
Such tree models (as they are usually called) have the advantage of being fairly easy to understand and use, even by non-statisticians. As the name suggests, CART models can be used both for regression and classification, and in the latter case, can easily deal with multiple categories. They can also deal with missing values in the predictors.
The following shows a tree from Afonso et al. 2012, which is a study led by one of my colleagues Mark Ebell to determine if there are simple decision rules that can determine if a person has an influenza infection or not.
As you can see, this is a very simple model that is very intuitive and could be easily understood by physicians and other laypeople. The tree is used here for classification, with a binary outcome (presence/absence of flu).
The next example is a tree used for a regression problem to predict/model a continuous outcome.
In this study, the authors tried to determine the survival time of patients who had cancers with certain characteristics. The tree was used to group individuals, and for each group the tree/model predicts median survival.
The way one computes a tree is a little bit similar in concept to forward selection in a linear model. We start by looking at each predictor. We take the tree, which, when the outcome is split at some value of that predictor, leads to the best (cross-validated) performance increase in the model (e.g., lowest SSR/RMSE or highest Accuracy/F1 score). Let’s say we want to predict BMI, and our performance measure is RMSE of predicted value from the true outcome. The null model using no predictors gives us some value for the RMSE. We then try each predictor (let’s say we have exercise, calories, age, sex) and find that if we use exercise and assign everyone who exercises more than 23min per day one BMI value, and everyone who exercises less than 23min another BMI value, we get the best reduction in RMSE we could get (across all predictors and possible cut-off values). This is our first split using the
exercise predictor at 23min. We now have a tree with one root and 2 nodes (in this case, they are leaves, which is the name for the terminal nodes).
We now take each leaf and run through all predictors again including the one we just used and see which predictor split at some value will further improve model performance. Let’s say we find that for those exercising more than 23min, we can split by sex and get 2 new BMI values, one for males exercising >23min and one for females exercising >23min, which gives the maximum reduction in RMSE. Similarly, we find that for the <23min exercise people, splitting at 1500 calories a day is best. We now have a tree with a root, 2 internal nodes, and 4 terminal nodes.
We keep building the tree in this way until some criterion is met (e.g., no more than n observations in a given leaf, no further increase in performance). Note that in this procedure, some predictors may never enter the model. In that way, a tree performs an automatic subset selection, i.e., the algorithm might decide to not include predictor variables it doesn’t find useful. Also, any predictor could be used more than once.
Algorithms that implement the tree building routine differ in their details, which is why you will see many different options in both
caret. In general, if you have a good signal in your data, any algorithm should pick up on it.
Since trees have a tendency to overfit, it is common to regularize them by including a penalty for tree size. For instance, if we were to minimize SSR/RMSE, we would add a penalty to the cost function, C, to get
\[C = SSR + \lambda T, \]
where T is the number of leaves of the tree. More leaves means a larger tree, which is being penalized. The tuning parameter \(\lambda\) needs to be determined using parameter tuning or model training. Reducing a tree in this way is also called pruning.
While trees are fairly simple to understand, sometimes, especially if there are many branching points, a complex decision tree might still not be suitable in practice. Further, even with regularization, trees might overfit. There is a type of tree, called fast and frugal tree (FFT), which can potentially help with both aspects. The difference between a regular tree and a FFT is that for the FFT, at each split/decision point, at least one of the splits needs to end in a terminal node. Check e.g. the example diagrams shown in the FFT Wikipedia article to see what this means. This constraint makes the trees often simpler and thus easier to implement in practice (e.g. by a doctor) and they might also be more robust, i.e. their performance on future data might be better than a larger tree. Disadvantages of FFT are that sometimes they might be too simple and thus not perform as well as a full tree, and they only work for binary outcomes.
A very nice R package called
FFTrees implements FFT (and automatically compares their performance to regular trees and some of the tree based methods discussed below). You can find more about this R package here.
A great feature of trees is that they are relatively quick and easy to build and especially easy to understand. They can easily be communicated to and used by non-experts (e.g., doctors, other decision-makers). As such, trees (sometimes called decision diagrams or other names) are common in many fields. As mentioned, trees are also able to handle missing data in predictors, and they are often reasonably robust in the presence of collinear or near-zero-variance predictors since trees tend to use one of the variables and ignore the others. Tree models also tend to excel at inferring the presence of nonlinear interactions between variables. Often trees don’t need predictors to be standardized either.
The main disadvantage of trees is that they usually have reduced performance compared to other models. Thus, if a simple, interpretable model is the primary goal, trees are ideal. If instead, a high-performance predictive model is the goal, trees are rarely the best choice.
This excellent interactive tutorial gives a very nice, visual introduction to machine learning in general and trees in particular. It covers and nicely illustrates some of the topics discussed here, as well as topics discussed in previous units. In part 2 of the tutorial the authors discuss overfitting. I strongly recommend you check it out, even if you just skim through it. It is fun and informative!