Machine Learning Models - Overview
Overview
In this brief unit, I make a few general comments about machine learning models, before we look at several types of those models.
Learning Objectives
- Know what ML models are.
Introduction
By now, I am sure you have picked up on the fact that there is a whole zoo of different models out there one can use to analyze data. You have also likely picked up that the terminology is fuzzy. We could call a standard linear or logistic regression model a machine learning (ML) model, and people do. Or we can reserve the term ML for more complex models - however fuzzily we define “complex”. In the end, getting hung up about terminology is not that useful and important.
What is important is that you are at least a bit familiar with some of the more common models. Most of the time, you do not need to understand every detail of a model. You do however want to understand enough about how the model works, its strengths and limitations, to determine if a certain model is useful for a given situation or not. This way, when you read someone else’s modeling results, or try your own, you have an idea of what’s going in.
We will do a whirlwind tour of several classes of common models. There’s no way you can learn them in detail in the short amount of time we spend on them. However, you should be able to pick up enough about a type of model such that if you happen to work on a problem where you think a certain kind of model might be useful, you will then want to revisit that model and learn more.
General ML considerations
I want to make an important point here: The statistics/model fitting topics we have discussed so far, namely pre-processing data, choosing the right performance metric, using methods like train/test and cross-validation to minimize overfitting, and carefully evaluating your model results apply to pretty much any data analysis project, no matter if you fit a simple linear model or a complicated neural net.
Most models that are considered ML models tend to be more complex than the GLM type models we discussed so far. More complexity usually means more predictive performance, but also an increased chance for overfitting and more difficulty understanding what is going on inside the model and how to interpret model results.
ML models often have parameters that need to be tuned as part of the fitting process. In a first pass, you can often do those tuning operations without knowing much about the models. Once you dig deeper, it is useful to understand enough about the model to get an idea of what tuning a specific parameter for a given model actually means. Either way, always critically evaluate what your models return. Just because complex models are often black boxes and it’s hard to understand everything that goes on inside doesn’t mean you can skip your critical thinking and accept as reasonable whatever the model gives you back.
I think by now you have also picked up on the idea that there is no recipe for choosing a specific machine learning/statistical modeling approach. It depends on the data, the question, the overall goal, what others in the field are using, and potentially further factors. In general, the most thorough approach is to try both simple and complex models, and then decide based on model performance and other considerations such as simplicity, speed, usability, etc. on a specific model. There is always a level of subjectivity involved, i.e. different analysts might favor different models. As long as the thought process behind choosing a specific model is well explained and justified, you should be allowed to choose the model that you think is overall best for a given situation. Since it is very easy to fit multiple different models and compare results, it is not a bad idea to do that. You can report the results form the main model you chose as the main findings, with results from other models as supplementary material.
And with those general points out of the way, we’ll look at several different statistical/ML models in the following units.
Further resources
The books we have been using throughout this course all cover many aspects of ML. Especially ISL, IDS and HMLR are very good starting points for learning more about different machine learning methods. I will point to specific chapters when we discuss specific models.
There are also tons of online resources on machine learning models, the quality varies widely, but it might be worth looking around a bit. TheInsaneApp.com’s 10 Most Used Machine Learning Algorithms in Python with Code is a nice overview. Note that at the time I’m writing this, when I did a brief read-through I noticed some inaccuracies. E.g., they claim that a logistic regression model predicts 0/1, which is not quite right, it predicts probability which then is usually converted to 0/1 by defining a threshold. In general, when you look at resources like that, they are rarely completely wrong but might occasionally not be fully accurate (I’m sure my course website is no different 🙄). So it’s often good to cross-check with resources that are fairly certain to be right (e.g., textbooks or Wikipedia).
If you want to practice some more ML modeling using the tidymodels
framework, check out this free online course by Julia Silge, one of the main tidymodels
maintainers. It consists of 4 case studies that teach you both general ML ideas and how to do them with tidymodels
. Another online course focusing on tidymodels
is Allison Hill’s course.