Model performance revisited

Author

Andreas Handel

Published

2024-01-25

Modified

2024-03-20

Overview

This unit discusses the idea of assessing a model based on how well it performs on external data.

Learning Objectives

  • Understand the concept of overfitting.
  • Know that a model should generally be assessed by its performance with independent data.

Introduction

We discussed how to specify a metric and use that metric to try to find the model with the best performance. We covered the idea of defining a single numerical value (cost function/metric) and optimizing (usually minimizing) it to find the best model. There is, however, a very big caveat to this. The main point is: It usually doesn’t matter how well your model performs on the data that you used to build and fit your model!

This is a very important point, and one that unfortunately a majority of scientific papers still get completely wrong! It is one area where modern machine learning is much more careful than the traditional way statistics is taught/used. In the machine learning field, it is very much recognized that it doesn’t matter a whole lot how well your model performs on the data that you used to build your model! What matters is performance on new data that is similar to the data used to fit the model.

I’m going to repeat this and similar sentences a bunch of times throughout the rest of the course 😁. If you only take away 2 main points from this course, this would be one of them. The other is that doing data analysis in a reproducible (automated) manner is critical.

So let’s go into some more details regarding this important point.

Should we really minimize the cost function?

We discussed that once we have chosen a cost function for our problem, we try to find a model that minimizes this cost function (e.g., minimizes the RMSE or the misclassification error), and that models with smaller cost function values are considered better. The problem with this concept is that, in general, a bigger model will be more flexible and thus able to fit the data better. However, when we do data analysis, we are generally not (only) interested in having a model that works well for the specific data sample we used to fit our model. Our main question/hypothesis usually does not concern the actual data we have/fit. Instead, we generally want to say something about ‘the larger world’.

If we are asking inferential questions, we are interested in what the data analysis teaches us about this system in general. E.g., if we analyze data to see if there is a correlation between levels of atmospheric pollutants and cases of asthma among our study population, we are usually really interested in knowing if such a correlation is real in general.

If we are asking predictive questions, we are interested in a model that can predict future observations, not the ones we already have. E.g., if we analyze data for a specific treatment, we are not very interested in how well the model predicts the effect of the drug on the people for whom we collected the data (we already know that). Instead, we want to make general predictions about the effectiveness of the treatment on future patients.

In either case, what we want is a model that is generalizable (also sometimes called externally valid), one that applies equally well to new, similar data beyond the data we already collected.

What truly matters is how well our model can explain/predict other/future data, not just the data we already observed!

If we build a very complex model in an effort to match our existing data as closely as possible, what generally happens is that our model overfits. That means it becomes very good at modeling the data we used to build the model, but it won’t generalize well to other/future data. The reason is that there is noise (random variability) in any dataset, and a model that is too flexible will not only match the overall signal/pattern (if there is any) but will also capture the noise in our sample, which leads to worse performance on future data that have different amounts and types of noise/variability.
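To see this in action, here is a minimal sketch in Python (the simulated data, the sine-curve “signal”, and the choice of polynomial models are all just illustrative assumptions, not anything specific to this course): we fit a moderately flexible and a very flexible model to one noisy sample, then check both against a fresh sample from the same process.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def simulate(n=20, noise=0.3):
    """Noisy observations around a smooth underlying curve (the 'signal')."""
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0, noise, n)
    return x, y

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

x_fit, y_fit = simulate()   # the data used to build/fit the models
x_new, y_new = simulate()   # new data from the same underlying process

for degree in (3, 12):      # a moderately flexible and a very flexible model
    model = Polynomial.fit(x_fit, y_fit, degree)
    print(f"degree {degree:2d}: "
          f"RMSE on fitting data = {rmse(y_fit, model(x_fit)):.2f}, "
          f"RMSE on new data = {rmse(y_new, model(x_new)):.2f}")
```

With a setup like this, the very flexible model will typically show a much lower RMSE on the data it was fit to, but do no better (and often worse) on the new sample, which is exactly the overfitting pattern described above.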

Bias-variance trade-off

As model complexity increases, models tend to fit the data used to build them better. However, such complex models also generally perform worse on future/new data than simpler models do. This is an important general concept and is known in statistics as the bias-variance trade-off.

Bias describes the fact that a model that is too simple might get the data “systematically wrong”. A more restricted model, like a simple linear model, usually has more bias. Another way of saying this is that the model underfits, i.e., there are still patterns in the data that the model does not capture. More complex models generally reduce the bias and thus the underfitting problem.

Variance describes how much a model would vary if it were fit to another, similar dataset. If a model follows the training data very closely, it will likely produce a different fit if we re-fit it to a new dataset. Such a model is overfitting the data. More complex models tend to be more likely to overfit.
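To make the variance part a bit more concrete, here is another small Python sketch (again with made-up simulated data and polynomial models chosen purely for illustration): we repeatedly simulate datasets from the same underlying process, refit a simple and a flexible model each time, and compare how far each model’s prediction at one fixed point is off on average (bias) and how much it varies across refits (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

def true_curve(x):
    return np.sin(2 * np.pi * x)

def simulate(noise=0.3):
    """One dataset: a fixed grid of x values plus new random noise each time."""
    x = np.linspace(0, 1, 20)
    return x, true_curve(x) + rng.normal(0, noise, x.size)

x0 = 0.25               # point at which we compare predictions
truth = true_curve(x0)  # the true value we would like to recover

for degree, label in ((1, "simple (linear)"), (12, "flexible (degree 12)")):
    # refit the model to 200 different simulated datasets
    preds = np.array([Polynomial.fit(*simulate(), degree)(x0) for _ in range(200)])
    print(f"{label:22s}: bias = {preds.mean() - truth:+.2f}, "
          f"variance of predictions = {preds.var():.3f}")
```

You should typically see that the simple model is systematically off (noticeable bias) but very stable across refits (small variance), while the flexible model is roughly right on average but its predictions swing a lot from one simulated dataset to the next.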

While the concept sounds somewhat technical, you can get a very good and quick intuitive understanding by looking at the following figure.

[Figure: The left panel shows fits of a linear, cubic, and higher-order spline model to the data; the linear model underfits, the cubic model fits well, and the higher-order spline overfits. The right panel plots mean squared error as a function of model flexibility and shows that while a more flexible model always fits the training data better, the fit to the test data is best at an intermediate flexibility.]

Bias-variance trade-off. Source: ISLR.

In the example shown in this figure, the data were produced by taking the black curve and adding some noise on top. This gives the data shown as circles. Three models are fit. A linear model (yellow) is too restrictive and misses important patterns. The next model (blue line) is more flexible and is able to capture the main patterns. The most complex model (green line) gets fairly close to the data. But you can tell that it is trying to get too close to the data and thus overfits. If we had another data sample (took the black line and added some noise on top), the green model would not do so well. This is shown on the right side, where the grey line plots the MSE of each model on the given dataset. As the models get more complex/flexible, they get closer to the data, and the MSE goes down. However, what matters is the model performance on an independent dataset. This is shown with the red curve. Here, you can see that the blue model has the lowest MSE.
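You can reproduce the qualitative pattern of the right panel yourself. The following sketch (my own simulation, using the same kind of made-up data as in the earlier sketch rather than the actual ISLR example) sweeps over polynomial degrees as a stand-in for model flexibility and records the MSE on the data used for fitting and on an independent dataset.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def simulate(n=50, noise=0.3):
    """Noisy data around a smooth underlying curve."""
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(0, noise, n)

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

x_train, y_train = simulate()   # data used to fit the models
x_test, y_test = simulate()     # independent data from the same process

print("degree  MSE (fitting data)  MSE (independent data)")
for degree in range(1, 16):     # increasing degree = increasing flexibility
    model = Polynomial.fit(x_train, y_train, degree)
    print(f"{degree:6d}  {mse(y_train, model(x_train)):18.3f}"
          f"  {mse(y_test, model(x_test)):22.3f}")
```

The MSE on the fitting data keeps dropping as flexibility increases, while the MSE on the independent data typically reaches its minimum at an intermediate degree and then rises again, which is the U-shape of the red curve in the figure.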

The same concept holds for categorical outcomes and for models with multiple predictors. No matter the model, there is always a sweet spot for model complexity somewhere “in the middle”. Where this “middle” lies depends on the data and the question. Often, linear models are as good as one can get, and more complex models will overfit. Even for linear models, we might have to remove predictors to prevent overfitting (we’ll discuss that later). At other times, somewhat complicated models (e.g., neural nets) might perform best. In general, the more data (in both quantity and richness), the less likely it is that a more complex model will lead to overfitting. However, we always need to check.

Overfitting and machine learning

If you only fit simple models (e.g., a linear model), and maybe decide based on scientific knowledge which predictors need to be in the model, then your risk of overfitting – while still present – is not that large. However, in machine learning, you often have complex models with many components that can be adjusted/tuned (we’ll get into that) to improve model performance. The danger is that if you have a very flexible model that can be finely tuned to perform well on the data, you run a very large risk of overfitting, namely of ending up with a model that is well tuned and performs very well on the data you used to build the model, but does not work so well on other data. Therefore, while overfitting is always something to be careful about, once you start using larger and more flexible models, you definitely need to guard against it.
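As a small illustration of that danger, here is a sketch using scikit-learn’s decision tree on simulated data (the tree and the simulation are just convenient stand-ins for “a flexible, tunable model”, not a method this course necessarily uses): if you tune the tree depth by looking only at performance on the fitting data, you will always be pushed toward the most complex setting, even though an independent dataset shows that this setting overfits.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

def simulate(n=200, noise=0.3):
    """Simulated data: one predictor, a smooth signal, plus noise."""
    x = rng.uniform(0, 1, (n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, noise, n)
    return x, y

x_train, y_train = simulate()   # data used to build and tune the model
x_test, y_test = simulate()     # independent data from the same process

print("max_depth  MSE (fitting data)  MSE (independent data)")
for depth in (1, 2, 4, 8, 16, None):   # None = let the tree grow without limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    print(f"{str(depth):>9}  "
          f"{mean_squared_error(y_train, tree.predict(x_train)):18.3f}  "
          f"{mean_squared_error(y_test, tree.predict(x_test)):22.3f}")
```

If you picked max_depth based only on the first column, you would always choose the unrestricted tree; the second column is what actually tells you how the model is likely to do on data it has not seen.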

Dealing with overfitting

Now that you have learned that the model that performs best on the data used to fit it (using whatever metric you chose) is not necessarily the best model, how can we evaluate model performance in a better way? There are different options, which we’ll discuss in the next units.

Summary

To repeat (again): We generally want to know how well a model performs in general and on new data, not on the sample we fit it to. Testing/reporting model performance on the data the model was fit to very often leads to overfitting and to overly optimistic/wrong conclusions about new/future data. There are several good ways to minimize overfitting, which we’ll cover next.

Further Resources

MLNAR’s blog post provides a nice further discussion of the idea of generalization and how different areas of data science (statistics, machine learning, causal modeling) think about this problem. I think the most important paragraph is actually the short last one. I want to add that while different areas might think about the question of generalization differently, all of them more or less agree that, in the end, what is important is the general conclusions you can draw from your statistical modeling analysis. It doesn’t matter (by itself) how well your model performs on the data that you used to build your model! What matters is what it means more generally.

Chapter 2 of ISL covers the bias-variance trade-off.

Test yourself

Which of the following is NOT an important topic discussed in this unit?

  • Overfitting is only a problem for large machine learning models.
  • Most of the time, we want a model that performs well in general/on new data, not just for our sample.

Practice

  • Revisit the papers you found for the previous unit’s exercise. Go through them again and specifically focus on the model structure and complexity, and whether the results make sense. Try, as best as you can, to critically evaluate whether the authors made suitable choices and explained their choices well.