Overview

In this unit, we will very briefly cover some further points related to ML that didn’t really fit into any other sections.

Learning Objectives

  • Be familiar with the idea of ensemble models.
  • Know about several unsupervised modeling approaches.

Ensemble methods/models

You learned previously that one can combine many trees into a larger model (e.g., a random forest or a boosted regression trees model), and that those models often have better performance. In the many-tree units, I briefly mentioned that this is an example of an ensemble model.

Instead of combining just the same kind of model (e.g., multiple trees), it is possible to build models which are combinations of different types of base models, e.g., combine a tree-based model with an SVM. Those approaches are known variously as ensemble methods/models or super learners or stacked methods/models. By combining different models in a smart way, it is often possible to increase performance beyond what can be achieved from any one individual model. The disadvantage is that fitting the model is more complex, takes more time, and results are even less interpretable (more black box) than any single model. And since each model has parameters that need tuning, more parameters means more data is needed for robust results. Nevertheless, if the only aspect that counts is predictive performance, and plenty of data is available, ensemble methods are often good choices.

Properly fitting ensemble models is not easy, and requires usually a lot of data. In fact, so far I have never tried to fit an ensemble model for any of my research projects. Nevertheless, it is useful to know about them if you encounter them in the literature or if you have a problem/data where you think they might be helpful. For more on those models, check out the Stacked Models chapter of HMLR.

Unsupervised learning

While I previously mentioned unsupervised learning here and there, we haven’t focused much on it in this course. The reason is that most data analysis problems deal with supervised learning, i.e. with data that has a specific outcome of interest. However, there are situations where data without a given outcome needs to be analyzed. For instance images or customers might need to be sorted into categories. This analysis approach is also called clustering. Sometimes, unsupervised learning is also used as a preparatory step in a supervised learning setting. For instance it can be used to reduce the number of predictors. This is called dimension reduction. It is common in areas where one measures lots of variables but the observations are small, e.g. genetic information on a few hundred individuals, with 1000s of genetic markers measured for each person. In such a case, one can reduce the number of predictor variables into a set of combinations of the original predictors such that the new set contains the most important information. Then one can use that reduced set to perform supervised learning.

Methods like K-means clustering or Hierarchical clustering are – as the name suggests – used for clustering of unlabeled data. Partial least squares (PLS) and Principal component analysis (PCA) are methods for dimension reduction. Since for unsupervised learning, a performance measure like RMSE or Accuracy does not exist, other metrics are used to define the quality of model results. Different algorithms use different ways to perform the unsupervised learning task.

The Unsupervised Learning chapter of ISL discusses several unsupervised learning methods. So do the Dimension Reduction and Clustering sections of HMLR and the Clustering chapter of IDS.

Further reading

See the references provided in the sections above, as well as the general references provided in the ML Introduction page.