In the previous units, we discussed a way to evaluate model quality
by looking at **performance** based on some cost
function/metric we defined. While this is an important component of the
model building and choosing process, one needs to go further.

The absolute value of a model’s performance is not always very meaningful. Is an RMSE of 3.2 good? Does 90% accuracy indicate a good model? We can’t say without knowing something about the system and looking more closely.

Performing a detailed model diagnosis often increases our understanding of the data/model/analysis, and can help us figure out how to fix or improve the model. The following sections briefly describe some checks you can do to evaluate the quality of your model further.

You should always compare your models to a “dumb” null/baseline model. A null model for a continuous outcome would be one that always predicts the mean. That is, it uses no information from the predictor variables. For categorical outcomes, a null model would always predict the most common category, again without using any information about predictor variables. If your more complicated model can’t beat such a null model (based on the performance metric you chose), you need to question what your model is good for.
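As a rough illustration of what such a null model looks like in practice, here is a minimal Python sketch. The function names (`null_rmse`, `null_accuracy`) and the data are made up for this example; the point is only that the baseline ignores all predictors.

```python
import math
import statistics
from collections import Counter

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def null_rmse(observed):
    """RMSE of a null model that always predicts the mean outcome."""
    mean_y = statistics.fmean(observed)
    return rmse(observed, [mean_y] * len(observed))

def null_accuracy(labels):
    """Accuracy of a null model that always predicts the most common category."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up outcomes, for illustration only
y_continuous = [2.0, 4.0, 6.0, 8.0]
y_categorical = ["no", "no", "no", "yes"]

print("Null-model RMSE:", null_rmse(y_continuous))
print("Null-model accuracy:", null_accuracy(y_categorical))
```

Whatever metric you use for your real model, compute the same metric for the null model and report the two side by side.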

Before building and fitting a big model, it is useful to look at the different predictors individually. (Or if you have too many, at least at some you think are important.)

To that end, you can perform bi-variable analyses: fit models that each contain only a single predictor, and evaluate the performance of these single-predictor models (ideally using cross-validation). You should definitely do that for the predictor(s) that are of main interest.
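The following Python sketch shows one way such a comparison could look: k-fold cross-validated RMSE for a null model and for two single-predictor linear models. The data are simulated, and the variable names (`age`, `noise_var`) and helper functions are hypothetical, chosen only for this illustration.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def cv_rmse(x, y, k=5, use_predictor=True, seed=1):
    """k-fold cross-validated RMSE for a single-predictor or a null (mean-only) model."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
        if use_predictor:
            b0, b1 = fit_line(x_tr, y_tr)
        else:
            b0, b1 = statistics.fmean(y_tr), 0.0  # null model: training mean only
        sq_errors.extend((y[i] - (b0 + b1 * x[i])) ** 2 for i in fold)
    return math.sqrt(statistics.fmean(sq_errors))

# Simulated data: the outcome depends on age but not on noise_var
rng = random.Random(42)
age = [rng.uniform(20, 80) for _ in range(200)]
noise_var = [rng.gauss(0, 1) for _ in range(200)]
outcome = [0.5 * a + rng.gauss(0, 5) for a in age]

rmse_null = cv_rmse(age, outcome, use_predictor=False)
rmse_age = cv_rmse(age, outcome)
rmse_noise = cv_rmse(noise_var, outcome)
print("null:", rmse_null, "age only:", rmse_age, "noise only:", rmse_noise)
```

Here the age-only model should clearly beat the null model, while the noise-only model should not. Under cross-validation it can even do slightly worse than the null model.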

Once you start fitting your larger models, you can compare those to your single-predictor models. Conceptually, if you only look at the performance on the data used to build the model, your multi-predictor model always performs at least as well as your single-predictor models, which in turn should perform at least as well as your null model. If that is not the case, it means something went wrong in the analysis. Of course, now that you are aware of overfitting, you know that if you evaluate your models through cross-validation, the bigger multi-variable model does not always perform better.

As an example, if you want to predict mortality and you have a model (say a logistic model with outcome of 5-year mortality yes/no) that includes as predictors diet, exercise, BMI, gender and age, and such a model performs as well as a model with only age, it means that including those additional variables does not add to model performance.

Once you have a model with good performance, you want to inspect
its actual predictions. For continuous outcomes, you can plot observed
(data) versus predicted (model) outcomes. For a (hypothetical) perfect
model, all points are along the 45-degree line. You don’t actually want
them all on the line since this suggests overfitting. Some scatter along
the line is expected and “healthy”. However, you want to look for
systematic deviations from this line, as they suggest potential problems:
your model is likely *biased* and not flexible
enough to capture important patterns still found in the data (i.e., your
model is underfitting). In that case, you will want to try different
models. Similarly, for categorical outcomes, you can look at the
confusion matrix to see if there are lots of false negatives (FN) or
false positives (FP), which might
suggest the model gets certain categories systematically wrong.
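For the categorical case, a confusion matrix is just a tally of observed versus predicted categories. A minimal Python sketch for a binary outcome follows. The labels and the `confusion_counts` helper are invented for illustration.

```python
from collections import Counter

def confusion_counts(observed, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary outcome."""
    counts = Counter()
    for obs, pred in zip(observed, predicted):
        if pred == positive:
            counts["TP" if obs == positive else "FP"] += 1
        else:
            counts["FN" if obs == positive else "TN"] += 1
    return counts

# Made-up observed and predicted labels
observed  = ["yes", "yes", "yes", "no", "no", "no",  "no", "yes"]
predicted = ["yes", "no",  "no",  "no", "no", "yes", "no", "no"]

cm = confusion_counts(observed, predicted)
print(dict(cm))
```

In this toy example the model misses three of the four true “yes” cases (FN = 3), which is the kind of systematic error the overall accuracy alone would hide.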

Instead of (or in addition to) plotting observed versus predicted for continuous outcomes, you can plot the difference between the two. These differences are called the residuals. What you are looking for is a cloud of points with no discernible pattern. If there is a pattern (e.g., an overall skew, or more points above the zero line than below), it again suggests that there is still some pattern/signal in the data that the model didn’t capture.
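To make the idea of a residual pattern concrete, here is a small Python sketch that deliberately fits a straight line to curved data and then looks at the signs of the residuals (observed minus predicted). The data and the `fit_line` helper are made up for this example.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Curved "truth", deliberately misfit by a straight line
x = [float(i) for i in range(11)]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Sign of each residual, ordered by x
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
```

The residuals are positive at both ends and negative in the middle. Their mean is zero, as it always is for a least-squares fit with an intercept, so the average alone would not reveal the problem. You need to look at the residuals against the predictor or the predicted values.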

One of the best general approaches toward testing models is to
simulate data and fit the model to the simulated data. You know what you
used to generate the fake data, so you know what the model should
return. For instance, if you simulate data for the linear model \[Y=b_0 + b_1X_1 + b_2X_2\] (and add a bit
of noise on top), and you chose specific values for \(b_0\), \(b_1\) and \(b_2\),
then fitting the model to the simulated data should return estimates close to the values you chose.

Simulating data is a very useful general approach. You should always consider if it is an option for your specific data and question, and use it if possible. The more complex your model becomes, the more useful this type of diagnosis is. It can, however, be difficult at times if you use a model for which it is not clear how the mapping of inputs to outcomes works (e.g. a complex machine learning model). Even then, if you make some data where, say, one predictor is strongly correlated with the outcome and another one is only noise, your model should correctly pick up on this.
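Here is a minimal Python sketch of this simulate-and-recover check for a linear model with two predictors. The coefficient values, noise level, and sample size are arbitrary choices for this illustration, and the small least-squares solver is written out by hand only to keep the example self-contained. In practice you would use your usual fitting function.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Arbitrary "true" values for b0, b1, b2, chosen for this illustration only
true_b = [1.0, 2.0, -0.5]

rng = random.Random(123)
rows = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(500)]
y = [true_b[0] + true_b[1] * x1 + true_b[2] * x2 + rng.gauss(0, 0.5) for x1, x2 in rows]

estimates = fit_ols(rows, y)
for name, truth, est in zip(["b0", "b1", "b2"], true_b, estimates):
    print(f"{name}: true = {truth}, estimated = {est:.3f}")
```

If the estimates were far from the values used to generate the data, that would point to a bug in the simulation, the fitting code, or your understanding of the model.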

Sometimes you have observations that might have a strong impact on
the model, i.e., without those observations, the best fitting model
would look quite different. If you decided that those points are
*real* (i.e., not data entry or other mistakes), you might want
to run the model both with those data points present and absent to
investigate how results might change. Similarly, you might make
decisions along the data cleaning path (e.g., to remove all observations
with missing values, or instead remove some variables with many
missing values) which could affect results. Performing analyses in multiple
ways to see how these decisions affect outcomes is useful. If you find
that the overall results stay the same, it instills confidence in the
robustness of your findings. If in contrast, different decisions lead to
different answers, it might point to something worth investigating
further.

Since you are learning to set up your analysis in a fully automated way, doing such additional analyses is fairly painless. Often it just requires a small adjustment in code and waiting for the additional analysis to run.
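As a small illustration of running an analysis with and without specific observations, the Python sketch below refits a simple linear model after leaving out each data point in turn and records how much the slope changes. The data (including one deliberately unusual point) and the helper functions are invented for this example.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_without(x, y, drop_index):
    """Slope of the fit after removing one observation."""
    keep = [i for i in range(len(x)) if i != drop_index]
    return fit_line([x[i] for i in keep], [y[i] for i in keep])[1]

# Made-up data; the last point lies far from the trend of the others
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]

slope_all = fit_line(x, y)[1]
leave_one_out = [slope_without(x, y, i) for i in range(len(x))]

# Which single observation changes the slope the most when removed?
most_influential = max(range(len(x)), key=lambda i: abs(leave_one_out[i] - slope_all))

print("slope with all points:", slope_all)
print("slopes leaving one point out:", leave_one_out)
print("most influential index:", most_influential)
```

In this toy data set, removing the last point changes the slope from roughly zero to about one, so any conclusion about the relationship depends heavily on whether that observation is kept.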

Note that sometimes the term sensitivity analysis is used to imply a more limited approach, namely just the exploration of the impact of model parameters on the outcomes. However, in the broader sense of the term, it is the exploration of how changes in the analysis pipeline (different subsets of data, different modeling assumptions, etc.) might or might not impact the results.

A nice feature of using subset selection or LASSO with GLMs, or
fitting a single tree, is that the algorithm decides which predictor
variables are important, and throws out the remaining ones. For the ones
that stay in the model, we can look at the coefficients in front of each
predictor variable, or look at the final tree/decision diagram, to
assess the importance of individual predictor variables on the outcome.
This provides an easy way to understand *how* specific variables
influence the outcome.

Those simple approaches are not available anymore for more complex
models. With complex models, we generally give up interpretability in
exchange for better performance. However, we ideally want both. The last
several years have seen a good amount of development to come up with
methods that allow one to peek inside the black box, i.e. to understand
why and how a complex model makes its predictions, and what the role of
specific predictor variables is on the outcome. This area is generally
called *interpretable machine learning*.

There’s no point in me repeating what others have already said (much
better than I could) 😃, therefore, instead of me writing more on this,
please take a look at chapters 2
and 3 of the *Interpretable Machine Learning (IML) book*,
which gives a very nice introduction on this topic. As you can tell by
the title, the whole book is about interpretation of machine learning
methods, and is a great resource.

For a shorter, but also great resource, check out chapter 16 of HMLR, which provides both a quick introduction to and an overview of the topic, and lists and illustrates the use of several R packages to do various interpretation tasks.

Another good introduction is this chapter of the *Tidy
Modeling with R* book.

As models get more complex, making sense of them will increase in importance. Even if you are mainly interested in predictive performance and less in understanding your system, it is a good idea to investigate your model in some detail. You might often figure out things that can help further increase model performance. Also, without understanding at least to some extent how complex models make their predictions, the potential for unnoticed bias increases. A good example of this can be found in this very recent Science paper, which describes racial bias in a machine learning model that tries to predict healthcare needs of patients. See also this commentary on the article. Thus, this area of interpreting results from complex models will likely see much development in the near future, hopefully leading to models that are both powerful (and thus likely complex) and interpretable.

Given how easy it is to apply some of these methods to your models, I recommend that if you decide to use a somewhat complex model (or even a not-so-complex one), you always do at least some analyses that probe the model, e.g. perform a variable importance analysis and related investigations as described in the references above.
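One common model-agnostic way to probe variable importance is permutation importance: shuffle one predictor at a time and see how much the model’s performance gets worse. The Python sketch below is a bare-bones version under simplifying assumptions. The `black_box` function is a hypothetical stand-in for any fitted model, and the data are simulated. The R packages discussed in the references above provide much more complete implementations.

```python
import math
import random

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Average increase in RMSE when each predictor column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Replace only this column with its shuffled values
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(rmse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Simulated data: the outcome depends on the first predictor only
rng = random.Random(7)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
y = [3.0 * x1 + rng.gauss(0, 0.5) for x1, _ in rows]

def black_box(row):
    """Stand-in for a fitted model; it happens to use only the first predictor."""
    return 3.0 * row[0]

importance = permutation_importance(black_box, rows, y)
print("importance of predictor 1:", importance[0])
print("importance of predictor 2:", importance[1])
```

Shuffling the first predictor should clearly hurt performance, while shuffling the second one changes nothing, since this stand-in model never uses it.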

Many fitting functions return useful information as output (which you
can read with the `summary` or similar commands). Take a
close look. If you thought you had 200 data points but your fitting
function result states that N=180, it means the function might have
dropped observations with missing values without warning you
(`R` unfortunately does things like that at times). By
carefully reading/plotting model returns, you can diagnose your model
and catch problems.
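A quick way to guard against silently dropped observations is to count complete cases yourself before fitting and compare that number with what the fitting function reports. The Python sketch below illustrates the idea. The records, column names, and helper functions are made up for this example.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_cases(records, columns):
    """Keep only records with no missing value in any of the given columns."""
    return [r for r in records if not any(is_missing(r.get(c)) for c in columns)]

# Made-up records, for illustration only
records = [
    {"age": 34, "bmi": 22.1, "died": 0},
    {"age": 51, "bmi": None, "died": 0},
    {"age": 67, "bmi": 27.4, "died": 1},
    {"age": 45, "bmi": float("nan"), "died": 0},
    {"age": 72, "bmi": 30.2, "died": 1},
]
model_columns = ["age", "bmi", "died"]

used = complete_cases(records, model_columns)
n_total, n_used = len(records), len(used)
if n_used < n_total:
    print(f"Warning: {n_total - n_used} of {n_total} observations have missing values")
```

If the N reported by your model output matches `n_used` rather than `n_total`, you know the difference comes from missing values and can decide deliberately how to handle them.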

If you get *strange* results (either unexpectedly bad or
unexpectedly good), look carefully. As you know, most often things that
are too good to be true are in fact not true. Bugs (either of the coding
or thinking type) in any step of your analysis can lead to strange
results.

If at any time during your analysis, you get warning messages, you
need to investigate them carefully. Sometimes it is ok to ignore warning
messages in R **but only after you know precisely what they mean
and why they happen!** Frequently, warning messages indicate that you
are doing something you shouldn’t be doing, which can lead to wrong
results.

In addition to the sources mentioned above, other good reads are the
*Judging model
effectiveness* chapter and the *When should you trust your
predictions?* chapter of TMWR.