In this unit, we will very briefly cover some further points related to ML that didn’t really fit into any other sections.
You learned previously that one can combine many trees into a larger model (e.g., a random forest or boosted regression trees), and that such models often perform better. In the units on many-tree models, I briefly mentioned that this is an example of an ensemble model.
Instead of combining multiple models of the same kind (e.g., multiple trees), it is also possible to build models that combine different types of base models, e.g., a tree-based model with an SVM. Such approaches are known variously as ensemble methods/models, super learners, or stacked methods/models. By combining different models in a smart way, it is often possible to increase performance beyond what any single model can achieve. The disadvantages are that fitting the model is more complex, takes more time, and the results are even less interpretable (more black box) than those of any single model. And since each base model has parameters that need tuning, the combined model has more parameters overall, which means more data is needed for robust results. Nevertheless, if predictive performance is the only aspect that counts and plenty of data is available, ensemble methods are often a good choice.
Properly fitting ensemble models is not easy and usually requires a lot of data. In fact, so far I have never tried to fit an ensemble model for any of my research projects. Nevertheless, it is useful to know about them if you encounter them in the literature or if you have a problem/data where you think they might be helpful. For more on those models, check out the Stacked Models chapter of HMLR, the ensemble models chapter in TMWR, and the stacks package, which integrates well with tidymodels.
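To give you an idea of what this looks like in practice, here is a minimal sketch of a stacked ensemble built with tidymodels and the stacks package. The data (the ames housing data from the modeldata package), the choice of predictors, and the two candidate models (a tuned decision tree and a plain linear model) are arbitrary choices for illustration, not a recommendation.

```r
# packages: tidymodels for the modeling framework, stacks for the ensembling
library(tidymodels)
library(stacks)

set.seed(123)

# example data: predict (log) sale price in the ames housing data
data(ames, package = "modeldata")
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
ames_split <- initial_split(ames)
ames_train <- training(ames_split)
ames_folds <- vfold_cv(ames_train, v = 5)

rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Lot_Area, data = ames_train)

# candidate 1: a tuned decision tree
tree_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(decision_tree(cost_complexity = tune()) %>%
              set_engine("rpart") %>%
              set_mode("regression"))
tree_res <- tune_grid(tree_wf, resamples = ames_folds, grid = 5,
                      control = control_stack_grid())

# candidate 2: a plain linear model
lm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))
lm_res <- fit_resamples(lm_wf, resamples = ames_folds,
                        control = control_stack_resamples())

# stack the candidates: blend_predictions() estimates the stacking weights,
# fit_members() refits the members that received non-zero weight
ens <- stacks() %>%
  add_candidates(tree_res) %>%
  add_candidates(lm_res) %>%
  blend_predictions() %>%
  fit_members()

predict(ens, new_data = testing(ames_split)) %>% head()
```

The blend_predictions() step uses a penalized regression to decide how much weight each candidate gets; candidates that receive zero weight are dropped before fit_members() refits the remaining ones on the full training data.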
While I previously mentioned unsupervised learning here and there, we haven't focused much on it in this course. The reason is that most data analysis problems deal with supervised learning, i.e., with data that has a specific outcome of interest. However, there are situations where data without a given outcome needs to be analyzed. For instance, images or customers might need to be sorted into categories. This analysis approach is also called clustering. Sometimes, unsupervised learning is also used as a preparatory step in a supervised learning setting. For instance, it can be used to reduce the number of predictors. This is called dimension reduction. It is common in areas where one measures lots of variables but has few observations, e.g., genetic information on a few hundred individuals, with thousands of genetic markers measured for each person. In such a case, one can reduce the original predictors to a smaller set of combinations of those predictors that retains most of the important information, and then use that reduced set to perform supervised learning.
Methods like K-means clustering or hierarchical clustering are (as their names suggest) used for clustering of unlabeled data. Partial least squares (PLS) and principal component analysis (PCA) are methods for dimension reduction. Since for unsupervised learning a performance measure like RMSE or accuracy does not exist, other metrics are used to judge the quality of the model results, and different algorithms use different ways to perform the unsupervised learning task.
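As a small, self-contained illustration of both tasks, the following base R sketch runs k-means clustering and PCA on the built-in iris data (iris is used purely for convenience; the species labels are only used at the end to check the clusters).

```r
# clustering: group observations without using the species labels
features <- scale(iris[, 1:4])       # standardize the four numeric measurements
set.seed(123)
km <- kmeans(features, centers = 3)  # ask for 3 clusters
table(km$cluster, iris$Species)      # compare clusters to the held-out labels

# dimension reduction: replace 4 correlated predictors with fewer components
pca <- prcomp(features)
summary(pca)                         # proportion of variance explained per component
reduced <- pca$x[, 1:2]              # keep the first 2 components as new predictors
```

In a supervised setting, `reduced` (rather than the original four variables) could then be used as the predictors.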
The Unsupervised Learning chapter of ISL discusses several unsupervised learning methods. So do the Dimension Reduction and Clustering sections of HMLR and the Clustering chapter of IDS. For an R implementation, check out the tidyclust package.
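To give an idea of what tidyclust looks like, here is a minimal sketch of a k-means fit written in the tidymodels style. This is based on my reading of the tidyclust documentation; the data and variable names are again just for illustration.

```r
library(tidymodels)
library(tidyclust)

# model specification, analogous to a parsnip model spec
km_spec <- k_means(num_clusters = 3) %>%
  set_engine("stats")

# unsupervised fit: a one-sided formula, no outcome variable
km_fit <- km_spec %>%
  fit(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

extract_cluster_assignment(km_fit)  # cluster membership for each observation
extract_centroids(km_fit)           # cluster centers
```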
See the references provided in the sections above, as well as the general references on the ML Introduction page.