Introduction

In the previous units, we discussed how to evaluate model quality by looking at performance based on a cost function/metric we defined. While this is an important component of building and choosing models, one needs to go further.

The absolute value of a model’s performance is not always very meaningful. Is an RMSE of 3.2 good? Is 90% accuracy good? We can’t say without knowing something about the system and looking more closely.

Performing a detailed model diagnosis often increases our understanding of the data/model/analysis, and can help us figure out how to fix or improve the model. The following sections briefly describe some checks you can do to evaluate the quality of your model further.

Null model comparisons

You should always compare your models to a “dumb” null/baseline model. A null model for a continuous outcome is one that always predicts the mean; that is, it uses no information from the predictor variables. For categorical outcomes, a null model always predicts the most common category, again without using any information from the predictor variables. If your more complicated model can’t beat such a null model (based on the performance metric you chose), you need to question what your model is good for. Remember that the choice of metric is important: a null model with, e.g., high accuracy might not be as good as a different model with lower accuracy. Recall the brain cancer example.

In tidymodels, you can use the null_model() function to set up such a null/baseline model and compute its performance.
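As a minimal sketch (assuming a hypothetical data frame `mydata` with a continuous outcome `y`), setting up a null model and computing its RMSE could look like this; ideally you would evaluate it with cross-validation, as discussed previously:

```r
library(tidymodels)

# null model specification: for regression, it always predicts the mean outcome
null_spec <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression")

# fit to the (hypothetical) data; a formula is required but predictors are ignored
null_fit <- fit(null_spec, y ~ ., data = mydata)

# RMSE of the null model, to use as a baseline for other models
mydata %>%
  bind_cols(predict(null_fit, new_data = mydata)) %>%
  rmse(truth = y, estimate = .pred)
```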

Single predictor comparisons

Before building and fitting a big model, it is useful to look at the different predictors individually (or, if you have too many, at least at those you think are important).

To that end, you can perform bivariate analyses: fit models that each contain only a single predictor and evaluate the performance of such single-predictor models (ideally using cross-validation). You should definitely do that for the predictor(s) of main interest.
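For instance (again assuming a hypothetical data frame `mydata` with continuous outcome `y`; the fold number and metric below are arbitrary choices), one could loop over all predictors and cross-validate a single-predictor linear model for each:

```r
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")
folds <- vfold_cv(mydata, v = 5)

# cross-validated RMSE of each single-predictor model
predictors <- setdiff(names(mydata), "y")
single_pred_results <- purrr::map_dfr(predictors, function(p) {
  wf <- workflow() %>%
    add_formula(as.formula(paste("y ~", p))) %>%
    add_model(lm_spec)
  fit_resamples(wf, resamples = folds) %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
    mutate(predictor = p)
})
single_pred_results
```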

Once you start fitting your larger models, you can compare them to your single-predictor models. Conceptually, if you only look at performance on the data used to build the model, your multi-predictor model always performs at least as well as your single-predictor models, which in turn should perform at least as well as your null model. If that is not the case, something went wrong in the analysis. Of course, now that you are aware of overfitting, you know that if you evaluate your models through cross-validation, the bigger multi-predictor model does not always perform better.

As an example, if you want to predict mortality and you have a model (say, a logistic model with 5-year mortality yes/no as the outcome) that includes diet, exercise, BMI, gender, and age as predictors, and such a model performs no better than a model with only age, it means that including those additional variables does not add to model performance.

Observed versus predicted values

Once you have a model with good performance, you want to inspect its actual predictions. For continuous outcomes, you can plot observed (data) versus predicted (model) outcomes. For a (hypothetical) perfect model, all points lie along the 45-degree line. You don’t actually want them all on the line, since this suggests overfitting; some scatter along the line is expected and “healthy”. However, you want to look for systematic deviations from this line, as they suggest potential problems: your model is likely biased and not flexible enough to capture important patterns still present in the data (i.e., your model is underfitting). In that case, you will want to try different models. Similarly, for categorical outcomes, you can look at the confusion matrix to see if there are lots of false negatives or false positives, which might suggest the model gets certain categories systematically wrong.
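A sketch of such a plot (assuming a hypothetical fitted tidymodels workflow or parsnip model `model_fit`, and data `mydata` with continuous outcome `y`):

```r
library(tidymodels)

# combine observed outcomes with model predictions
pred_df <- mydata %>%
  bind_cols(predict(model_fit, new_data = mydata))

# observed vs. predicted, with the 45-degree line for reference
ggplot(pred_df, aes(x = .pred, y = y)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Predicted", y = "Observed")
```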

Residuals

Instead of (or in addition to) plotting observed versus predicted values for continuous outcomes, you can plot the difference between the two. These differences are called the residuals. What you are looking for is a cloud of points with no discernible pattern. If there is a pattern (e.g., an overall skew, or more points above the zero line than below), it again suggests that there is still some pattern/signal in the data that the model didn’t capture.
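Continuing the sketch from the previous section (same hypothetical `model_fit` and data `mydata` with outcome `y`), a residual plot is only a small modification:

```r
library(tidymodels)

# residuals = observed minus predicted
mydata %>%
  bind_cols(predict(model_fit, new_data = mydata)) %>%
  mutate(residual = y - .pred) %>%
  ggplot(aes(x = .pred, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted", y = "Residual (observed - predicted)")
```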

Data simulation

One of the best general approaches toward testing models is to simulate data and fit the model to the simulated data. You know what you used to generate the fake data, so you know what the model should return. For instance, if you simulate data for the linear model \[Y=b_0 + b_1X_1 + b_2X_2\] (and add a bit of noise on top), and you chose \(b_0=b_1=b_2=2\), then when you fit such a linear model to the simulated data, those are the values for the coefficients you expect to get back (not exactly, since you added a bit of noise, but they shouldn’t be far off). If you can’t get out what you put in, you have a problem. Most likely, it means you are overfitting: more than one combination of parameter values gives almost the same performance measure, and the model can’t differentiate between them. You should then either get more data or make your model simpler.
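A minimal version of this check (the sample size and noise level below are arbitrary choices):

```r
set.seed(123)
n <- 500

# simulate predictors and outcome with known coefficients b0 = b1 = b2 = 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 2 * x1 + 2 * x2 + rnorm(n, sd = 0.5)  # add a bit of noise

# fit the corresponding linear model to the simulated data
sim_fit <- lm(y ~ x1 + x2)
coef(sim_fit)  # estimates should be close to 2, 2, 2
```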

This approach of simulating data is a very useful general strategy. You should always consider if it is an option for your specific data and question, and use it if possible. The more complex your model becomes, the more useful this type of diagnosis is. It can, however, at times be difficult if you use a model for which it is not clear how inputs map to outcomes (e.g., a complex machine learning model). Even then, if you simulate data where, say, one predictor is strongly correlated with the outcome and another one is only noise, your model should be able to identify the informative predictor and ignore the noisy one.

Sensitivity analysis

Sometimes you have observations that might have a strong impact on the model, i.e., without those observations, the best-fitting model would look quite different. If you have decided that those points are real (i.e., not data entry or other mistakes), you might want to run the model both with and without those data points to investigate how the results change. Similarly, you might make decisions along the data cleaning path (e.g., to remove all observations with missing values, or instead to remove some variables with lots of missing values) which could affect results. Performing analyses in multiple ways to see how these decisions affect outcomes is useful. If you find that the overall results stay the same, it instills confidence in the robustness of your findings. If, in contrast, different decisions lead to different answers, it might point to something worth investigating further.
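As a simple sketch (hypothetical data frame `mydata` with outcome `y`, and a hypothetical vector `influential_rows` of row indices you have flagged), you could refit the same model with and without those observations and compare:

```r
# fit the same model with and without the flagged observations
fit_all     <- lm(y ~ ., data = mydata)
fit_without <- lm(y ~ ., data = mydata[-influential_rows, ])

# compare coefficient estimates side by side
cbind(all_data = coef(fit_all), without_flagged = coef(fit_without))
```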

Since you are learning to set up your analysis in a fully automated way, doing such additional analyses is fairly painless. Often it just requires a small adjustment in code and waiting for the additional analysis to run.

Note that sometimes the term sensitivity analysis is used in a more limited sense, namely just the exploration of the impact of model parameters on the outcomes. However, in the broader sense of the word, it is the exploration of how changes in the analysis pipeline (different subsets of data, different modeling assumptions, etc.) might or might not impact the results.

Model Interpretation

A nice feature of using subset selection or LASSO with GLMs, or of fitting a single tree, is that the algorithm decides which predictor variables are important and drops the remaining ones. For the ones that stay in the model, we can look at the coefficients in front of each predictor variable, or at the final tree/decision diagram, to assess the impact of individual predictor variables on the outcome. This provides an easy way to understand how specific variables influence the outcome.
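For example (a hedged sketch using the glmnet package directly, with a hypothetical numeric predictor matrix `X` and outcome vector `y`), you can inspect which coefficients a LASSO fit kept:

```r
library(glmnet)

# LASSO (alpha = 1) with cross-validation to pick the penalty
cvfit <- cv.glmnet(X, y, alpha = 1)

# coefficients at the cross-validated penalty;
# predictors dropped by the LASSO have coefficient zero (printed as ".")
coef(cvfit, s = "lambda.min")
```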

Those simple approaches are no longer available for more complex models. With complex models, we generally give up interpretability in exchange for better performance. However, we ideally want both. The last several years have seen a good amount of development of methods that allow one to peek inside the black box, i.e., to understand why and how a complex model makes its predictions, and what the role of specific predictor variables is for the outcome. This area is generally called interpretable machine learning.

There’s no point in me repeating what others have already said (much better than I could) 😃. Therefore, instead of me writing more on this, please take a look at chapters 2 and 3 of the Interpretable Machine Learning (IML) book, which gives a very nice introduction to this topic. As you can tell from the title, the whole book is about the interpretation of machine learning methods, and it is a great resource.

For a shorter, but also great resource, check out chapter 16 of HMLR, which provides a quick introduction and overview of the topic, and lists and illustrates several R packages for various interpretation tasks.

Another good introduction is this chapter of the Tidy Modeling with R book.

As models get more complex, making sense of them will increase in importance. Even if you are mainly interested in predictive performance and less in understanding your system, it is a good idea to investigate your model in some detail. You might often figure out things that can help further increase model performance. Also, without understanding at least to some extent how complex models make their predictions, the potential for unnoticed bias increases. A good example of this can be found in this recent Science paper, which describes racial bias in a machine learning model that tries to predict the healthcare needs of patients. See also this commentary on the article. Thus, this area of interpreting results from complex models will likely see much development in the near future, hopefully leading to models that are both powerful (and thus likely complex) and interpretable.

Given how easy it is to apply some of these methods to your models, I recommend that if you decide to use a somewhat complex model (or even a not-so-complex one), you should always do at least some analyses that probe the model, e.g., perform a variable importance analysis and related investigations as described in the references above.
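As one example of such probing (a hedged sketch assuming a hypothetical fitted workflow `model_fit` whose engine provides an importance measure, e.g., a random forest fit with ranger and importance = "permutation"), the vip package used in the HMLR chapter can produce a variable importance plot:

```r
library(tidymodels)
library(vip)

# pull out the underlying engine fit (e.g., the ranger object) and plot
# the importance of the top 10 predictors
model_fit %>%
  extract_fit_engine() %>%
  vip(num_features = 10)
```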

Other checks

Many fitting functions return useful information as output (which you can inspect with summary() or similar commands). Take a close look. If you thought you had 200 data points but your fitting function reports that N = 180, it likely means the function dropped observations with missing values without warning you (R unfortunately does things like that at times). By carefully reading/plotting what the model returns, you can diagnose your model and catch problems.
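For example (with a hypothetical data frame `mydata`, outcome `y`, and a standard lm() fit), a quick check is to compare how many observations you think you have to how many the model actually used:

```r
fit <- lm(y ~ ., data = mydata)

nrow(mydata)  # how many observations you think you have
nobs(fit)     # how many observations the model actually used
summary(fit)  # also notes observations deleted due to missingness
```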

If you get strange results (either unexpectedly bad or unexpectedly good), look into them carefully. As you know, most often things that are too good to be true are in fact not true. Bugs (either of the coding or thinking type) in any step of your analysis can lead to strange results.

If at any time during your analysis you get warning messages, you need to investigate them carefully. Sometimes it is ok to ignore warning messages in R, but only after you know precisely what they mean and why they happen! Frequently, warning messages indicate that you are doing something you shouldn’t be doing, which can lead to wrong results.

Further learning

In addition to the sources mentioned above, other good reads are the Judging model effectiveness chapter and the When should you trust your predictions? chapter of TMWR.