Model diagnostics

Author

Andreas Handel

Published

2024-01-25

Modified

2024-03-20

Overview

This unit discusses different diagnostics that are helpful to assess models.

Learning Objectives

  • Be familiar with diagnostic approaches that can help assess model quality.

Introduction

We already discussed several approaches for assessing models. One was based on a broad comparison with the real world; the other focused narrowly on model evaluation based on some performance metric. There are several other ways to assess models that fall, in some sense, between those very broad and very narrow approaches.

Algorithm and Code assessment

These days, every model is implemented and fitted with some kind of numerical algorithm, and it is important to ensure that the algorithm worked. While this is not an assessment of the actual model, if your model fitting algorithm didn't work, you can't use the results and will likely have to modify your model.

For simple models, such as GLM-type models, this is almost never a problem. But as models get more complex, things can go wrong. A common issue is lack of algorithm convergence, meaning that your fitting routine can't find the parameter values that optimize your objective function/performance metric. This is often related to overfitting: you are likely trying to fit a model that's too complex given the data you have available.
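As an illustration of the kind of algorithmic failure you might run into, here is a minimal R sketch with made-up toy data: the outcome is perfectly separated by the predictor, so the logistic model is too flexible for the data and glm() typically produces warnings and implausibly large coefficients.

```r
# Toy example of a fitting algorithm running into trouble:
# the outcome is perfectly separated by x, so the maximum likelihood
# estimate does not exist and glm() typically warns (e.g., about fitted
# probabilities of 0 or 1) and returns a huge, unstable coefficient.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

fit <- glm(y ~ x, family = binomial)
coef(fit)      # implausibly large coefficient for x
summary(fit)   # enormous standard errors are another red flag
```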

It is generally not necessary to understand the details of how the underlying algorithm works. However, if you get error or warning messages, or if there are diagnostic readouts that indicate a problem, then you’ll have to first fix those before you can further consider the model. Often, the fix is to simplify the model.

Many fitting functions return useful information as output. Take a close look. If you thought you had 200 data points but your fitting function reports N = 180, the function might have dropped observations with missing values without warning you (R unfortunately does things like that at times). By carefully inspecting (reading and plotting) what the fitting function returns, you can diagnose your model and catch problems.
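As a minimal sketch of this scenario (using a hypothetical data frame dat with some missing predictor values), base R's lm() silently drops incomplete rows, and you can catch that by checking what the fitted object reports:

```r
# Hypothetical data: 200 observations, 20 with a missing predictor value
dat <- data.frame(y = rnorm(200), x = rnorm(200))
dat$x[1:20] <- NA

fit <- lm(y ~ x, data = dat)
nobs(fit)      # 180, not 200 -- incomplete rows were silently dropped
summary(fit)   # also notes "(20 observations deleted due to missingness)"
```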

If you get strange results (either unexpectedly bad or unexpectedly good), look carefully. As you know, most often things that are too good to be true are in fact not true. Bugs (either of the coding or thinking type) in any step of your analysis can lead to strange results.

If at any time during your analysis you get warning messages, you need to investigate them carefully. Sometimes it is ok to ignore warning messages in R, but only after you know precisely what they mean and why they happen! Frequently, warning messages indicate you are doing something you shouldn't be doing, which can lead to wrong results.
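One simple safeguard in R (just a sketch, not the only way to do it) is to temporarily turn warnings into errors while you scrutinize a piece of code, so a warning can never slip by unnoticed:

```r
# Promote warnings to errors while running the code you want to scrutinize
old <- options(warn = 2)   # any warning now stops execution
# ... run your model fitting code here ...
options(old)               # restore the previous warning behavior
```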

Null model comparisons

You should always compare your models to a “dumb” null/baseline model. A null model for a continuous outcome would be one that always predicts the mean; that is, it uses no information from the predictor variables. For categorical outcomes, a null model would always predict the most common category, again without using any information about the predictor variables. If your more complicated model can't beat such a null model (based on the performance metric you chose), you need to question what your model is good for. Remember that the choice of metric is important: a null model with, e.g., high accuracy might not be as good as a different model with lower accuracy. Recall the brain cancer example mentioned in a previous unit.
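As a minimal sketch (with made-up toy vectors), computing such baselines takes only a few lines of base R:

```r
# Continuous outcome: the null model always predicts the mean
obs_cont  <- c(3.1, 4.5, 2.2, 5.0, 3.8)           # toy observed values
null_pred <- mean(obs_cont)
null_rmse <- sqrt(mean((obs_cont - null_pred)^2))  # baseline RMSE to beat

# Categorical outcome: the null model always predicts the most common category
obs_cat  <- c("yes", "no", "no", "no", "yes")      # toy observed classes
null_cat <- names(which.max(table(obs_cat)))       # "no"
null_acc <- mean(obs_cat == null_cat)              # baseline accuracy to beat
```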

Single predictor comparisons

Before building and fitting a big model, it is useful to look at the different predictors individually (or, if you have too many, at least at those you think are important).

To that end, you can perform bivariable analyses: fit models that each contain only a single predictor and evaluate the performance of these single-predictor models (ideally using cross-validation). You should definitely do that for the predictor(s) of main interest.
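Here is a minimal sketch of that idea, using a hypothetical data frame dat with outcome y and predictors x1, x2, x3 (in-sample RMSE only, no cross-validation, to keep it short):

```r
# Fit one model per predictor and record a performance measure for each
dat <- data.frame(y = rnorm(100), x1 = rnorm(100),
                  x2 = rnorm(100), x3 = rnorm(100))
predictors <- c("x1", "x2", "x3")

single_rmse <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "y"), data = dat)
  sqrt(mean(resid(fit)^2))   # in-sample RMSE for this single-predictor model
})
single_rmse   # compare to the null model and to the multi-predictor model
```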

Once you start fitting your larger models, you can compare them to your single-predictor models. Conceptually, if you only look at performance on the data used to build the model, your multi-predictor model always performs at least as well as your single-predictor models, which in turn should perform at least as well as your null model. If that is not the case, something went wrong in the analysis. Of course, now that you are aware of overfitting, you know that if you evaluate your models through cross-validation, the bigger multi-predictor model does not always perform better.

As an example, if you want to predict mortality and you have a model (say a logistic model with outcome of 5-year mortality yes/no) that includes as predictors diet, exercise, BMI, gender and age, and such a model performs as well as a model with only age, it means that including those additional variables does not add to model performance.

Observed versus predicted values

The model performance metric gives you a single quantity describing how well the model matches the data, given the metric you defined. That’s useful, but you generally want to dig deeper and compare model predictions to the observed data for individual observations.

For continuous outcomes, you can plot those two quantities on the x- and y-axes. For a (hypothetical) perfect model, all points lie along the 45-degree line. You don't actually want them all perfectly on the line, since this suggests overfitting. Some scatter along the line is expected and healthy. However, you want to look for systematic deviations from this line, as they suggest potential problems: your model is likely biased and not flexible enough to capture important patterns still found in the data (i.e., your model is underfitting). In that case, you will want to try different models.
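A minimal sketch of such a plot in base R, assuming a hypothetical fitted model fit and a data frame dat with the observed outcome in dat$y:

```r
# Observed versus predicted values for a continuous outcome
pred <- predict(fit, newdata = dat)
plot(dat$y, pred, xlab = "Observed", ylab = "Predicted")
abline(0, 1)   # 45-degree line; look for systematic deviations from it
```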

Similarly, for categorical outcomes, you can look at the confusion matrix to see if there are lots of false negatives (FN) or false positives (FP), which might suggest the model gets certain categories systematically wrong.

In the common case that you only have 2 categories (e.g., yes/no) and you use a logistic model or some other model that predicts probabilities, you can plot those model-predicted probabilities together with the observed data (which will be just 0 or 1). You are again looking for any patterns that might indicate that there are systematic deviations between model predictions and data that could suggest that your model needs tweaking.
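As a minimal sketch for the binary case (assuming a hypothetical fitted logistic model fit and a data frame dat with a 0/1 outcome column y):

```r
# Confusion matrix and predicted-probability plot for a binary outcome
prob       <- predict(fit, newdata = dat, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)

table(predicted = pred_class, observed = dat$y)   # check for many FP or FN

# Predicted probabilities against observed 0/1 values (jittered for visibility)
plot(prob, jitter(dat$y, amount = 0.05),
     xlab = "Predicted probability", ylab = "Observed outcome (jittered)")
```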

Residuals

Instead of (or in addition to) plotting observed versus predicted values for continuous outcomes, you can plot the difference between the two. These differences are called the residuals. What you are looking for is a cloud of points with no discernible pattern. If there is a pattern (e.g., an overall skew, or more points above the zero line than below), it again suggests that there is still some pattern/signal in the data that the model didn't capture.
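A minimal sketch, again assuming a hypothetical fitted model fit:

```r
# Residuals versus fitted values; you want an unstructured cloud around zero
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)
```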

Simulation

One of the best general approaches toward testing models is to simulate synthetic data and fit the model to the simulated data. You know what you used to generate the synthetic data, so you know what the model should return. For instance, if you simulate data for the linear model \[Y=b_0 + b_1X_1 + b_2X_2\] (and add a bit of noise on top), and you chose \(b_0=b_1=b_2=2\), then when you fit such a linear model to this simulated data, those are the values for the coefficients you expect to get back (not exactly, since you added a bit of noise, but they shouldn't be far off). If you can't get out what you stuck in, you have a problem. Most likely, it means you are overfitting: more than one combination of parameter values gives almost the same performance measure, and the fitting routine can't differentiate between them. You should then either get more data or make your model simpler.
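Here is a minimal sketch of exactly this check in R (made-up simulation settings):

```r
# Simulate data from a known linear model with b0 = b1 = b2 = 2 plus noise,
# then check whether fitting recovers coefficients close to the truth
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 2 * x1 + 2 * x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2)
coef(fit)   # should be close to (2, 2, 2); large deviations signal a problem
```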

This approach of simulating data is a very useful general approach - which is why we covered it in some detail. For anything beyond the simplest models, you should probably use it. The more complex your model becomes, the more useful this type of diagnosis is. It can at times be difficult to assess a model for which it is not clear how inputs map to outcomes (e.g., a complex machine learning model). Even then, if you simulate data where, say, one predictor is strongly correlated with the outcome and another is pure noise, your model should be able to recover that pattern.

Sensitivity analysis

Sometimes you have observations that have a strong impact on the model, i.e., without those observations, the best-fitting model would look quite different. If you decided that those points are real (i.e., not data entry or other mistakes), you might want to run the model both with those data points present and absent to investigate how results change. Similarly, you might make decisions along the data cleaning path (e.g., to remove all observations with missing values, or instead to remove some variables with lots of missing values) which could affect results. Performing analyses in multiple ways to see how these decisions affect outcomes is useful. If you find that the overall results stay the same, it instills confidence in the robustness of your findings. If, in contrast, different decisions lead to different answers, it might point to something worth investigating further.
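As a minimal sketch (hypothetical data frame dat, model y ~ x1 + x2, and a vector suspect_rows holding the indices of the influential observations):

```r
# Refit the model with and without the suspect observations and compare
fit_all    <- lm(y ~ x1 + x2, data = dat)
fit_subset <- lm(y ~ x1 + x2, data = dat[-suspect_rows, ])

cbind(all_data = coef(fit_all), without_suspects = coef(fit_subset))
```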

Since you are learning to set up your analysis in a fully automated way, doing such additional analyses is fairly painless. Often it just requires a small adjustment in code and waiting for the additional analysis to run.

Note that sometimes the term sensitivity analysis is used in a more limited sense, namely just the exploration of the impact of model parameters on the outcomes. However, in the broader sense of the word, it is the exploration of how changes in the analysis pipeline (different subsets of data, different modeling assumptions, etc.) might or might not impact the results.

Model Interpretation

A nice feature of simple models, such as GLMs, is that one can easily understand how specific variables influence the outcome. You just need to look at the coefficients in front of the input variables.1

Those simple approaches are not available anymore for more complex models. With complex models, we generally give up interpretability in exchange for better performance. However, we ideally want both. There has been a good amount of work on methods that allow one to peek inside the black box, i.e., to understand why and how a complex model makes its predictions, and what role specific predictor variables play in determining the outcome. This area is generally called interpretable ML (or AI).

There’s no point in me repeating what others have already said (much better than I could) 😃, therefore, instead of me writing more on this, please take a look at chapters 2 and 3 of the Interpretable Machine Learning (IML) book, which gives a very nice introduction on this topic. As you can tell by the title, the whole book is about interpretation of machine learning methods, and is a great resource.

As models get more complex, making sense of them will increase in importance. Even if you are mainly interested in predictive performance and less in understanding your system, it is a good idea to investigate your model in some detail. You might often figure out things that can help further increase model performance. Also, without understanding at least to some extent how complex models make their predictions, the potential for unnoticed bias increases. A good example of this can be found in this Science paper, which describes racial bias in a machine learning model that tries to predict healthcare needs of patients. See also this commentary on the article. Thus, this area of interpreting results from complex models will likely see much development in the near future, hopefully leading to models that are both powerful (and thus likely complex) and interpretable.

Given how easy it is to apply some of these methods to your models, I recommend that if you decide to use a somewhat complex model (or even a not-so-complex one), you always do at least some analyses that probe the model, e.g., perform a variable importance analysis and related investigations as described in the references above.
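As one example of such a probe (just a sketch; the IML book covers more general, model-agnostic approaches), a variable importance analysis for a random forest could look like this, assuming the randomForest package is installed and dat is a data frame with outcome y:

```r
library(randomForest)

# Fit a random forest and inspect which predictors it relies on most
rf_fit <- randomForest(y ~ ., data = dat, importance = TRUE)
importance(rf_fit)   # importance scores for each predictor
varImpPlot(rf_fit)   # quick visual summary
```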

Summary

When assessing your models, it is important to go beyond the performance metric and look at individual model predictions and how closely they agree with the data. If you spot residual patterns, that might indicate that you could (but don’t have to) make your model more complex/flexible to try and capture additional details of the data. In general, model assessment is a holistic approach that you should do carefully and thoroughly.

Further Resources

Test yourself

What’s the term for the differences between data and model that you want to see distributed like a symmetric cloud?

On a predicted versus observed plot, the points should cluster along a horizontal line to indicate a good model fit.

If your overall model metric (e.g. Accuracy) is very good, you don’t need to look at individual predictions.

Practice

  • Revisit any of the papers you found in one of the previous exercises. See if the authors used any of the approaches discussed here to assess their model(s). Often, this kind of information would be in the supplement. At the minimum, you’d want the authors to mention that they did these checks. Unfortunately, you’ll often see it missing. It seems that at times, authors/analysts don’t want to look too closely, otherwise they would need to acknowledge that their cherished model is actually not that good 😁.

Footnotes

  1. Well, it can actually be quite tricky to interpret coefficients for anything but a linear model. For instance, for a logistic model (and many other GLMs), the impact of a change in a predictor on the outcome depends on the value of the predictor. Sometimes a 1-unit increase in some predictor (say, drug dose) can lead to a strong change in the outcome (say, cholesterol level), while for other values of that predictor, increasing it further by 1 unit might have almost no impact on the outcome. So one needs to be careful even when interpreting fairly simple GLMs. However, the whole model is known, so you can always figure out how one part relates to the other. That's not the case anymore for complex ML models.↩︎
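A minimal sketch of this point, assuming a hypothetical fitted logistic model fit with a single predictor dose: the same 1-unit increase changes the predicted probability by very different amounts depending on the starting value.

```r
# Effect of a 1-unit increase in dose at low versus high starting values
p_low  <- predict(fit, newdata = data.frame(dose = c(0, 1)),   type = "response")
p_high <- predict(fit, newdata = data.frame(dose = c(10, 11)), type = "response")
diff(p_low)    # change in predicted probability going from dose 0 to 1
diff(p_high)   # change going from dose 10 to 11 -- generally not the same
```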