Up to now, I have not shown much math or equations. Below, you will encounter some. The equations are fairly simple and I’m sure you’ll be able to follow once you spend a few minutes to think through them. Most often, there are functions in R that compute all those quantities for you. But sometimes, you will have to compute a few of them yourself. It is therefore good to know their definitions. Once you do, you realize that you can simply compute them yourself with just a few lines of R code. Again, most often you won’t need to do that, but sometimes it is useful to have that ability and know how to do it.

Introduction

When you fit a model, e.g., a linear model using lm() in R (either directly, or calling it through some other package, such as parsnip or tidymodels), what exactly happens? What does it mean when we ask the computer to fit a model? Think about what’s going on for a minute.

What lm() or similar functions do is vary the parameters of a model until the model-predicted outcomes are as close as possible to the actual observed outcomes in the data. Let’s say you want to fit a multivariable linear model in which some outcome \(Y\) is modeled as a function of predictor variables \(X_i\) (e.g., predicting the weight of a person given their height and age). The model equation looks like this:

\[Y_m=b_0 + b_1X_1 + b_2X_2 + \ldots + b_n X_n.\]

When you ask the computer to fit this model, it will guess some values of the coefficients \(b_i\) and then vary them (in a smart way) until the predicted \(Y_m\) (often also called \(\hat{y}\) or \(\hat{Y}\)) from the model are as close as possible to the actual data values \(Y_d\) (often also just called \(y\) or \(Y\)).

This procedure for fitting models is very general and applies to any kind of model, as well as both continuous and categorical outcomes. What changes between problems and approaches is how exactly the fitting routine goes about varying the parameters (something we won’t discuss in this course), and how we define closeness between data and model outcomes. The latter is a scientific choice, and you need to specify it when you fit models.

The cost/loss/objective function

For the computer to perform the routine of varying model parameters to get the model as close to the data as possible, we have to quantify this closeness (or difference) between model and data. To do so, we define some function of both the model predictions (\(Y_m\)) and the measured/data (\(Y_d\)) outcomes. This function goes by different names, but the terms most commonly used are cost function or loss function. In a more general math or stats context, this is also called an objective function. Other names are common in different areas of science, see e.g., those listed in the linked Wikipedia article. Another term used by tidymodels is simply metric; this term is somewhat more common in machine learning settings. The yardstick package from tidymodels is all about such metrics. I will use cost, loss, and objective function, as well as metric, interchangeably, so just be aware that they all mean the same thing: a function that measures how close the model predictions are to the data.

The convention is to set up the problem such that larger discrepancies between model and data lead to a larger numerical value of this function. Then our goal is to minimize the objective function, which means minimizing the difference between the data and the model predictions. (Occasionally, problems are set up to maximize this function. You can always change it into a minimization problem by e.g., taking the inverse or adding a minus sign.)

In equation form, we have in the general case: \[C = f(Y_m,Y_d).\]

The \(Y_m\) are all the predictions made by the model for the outcome, and the \(Y_d\) are the actual outcomes from the data. If you have \(N\) observations, you will have \(N\) pairs of \(Y_m\) and \(Y_d\) and then you need to specify some function, f, that uses all those values to compute a single value to measure your performance. The different types of functions, f, depend on the type of data and other considerations. Once you define f, this is the performance metric which quantifies how well a given model fits the data.

When you do a single model fit, e.g., using lm(), the computer calculates f for different values of the parameters, \(b_i\), until it finds the ones that produce the best (i.e., minimum) value of \(C\).¹
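To make this concrete, here is a minimal sketch in R. It uses the built-in mtcars dataset and the sum of squared differences between predictions and data as the measure of closeness (the least-squares metric discussed in the next section); the choice of dataset and the perturbation values are just illustrative assumptions.

```r
# Small illustration: fit a linear model, then check that the fitted
# coefficients give a smaller sum of squared differences between model
# and data than slightly perturbed coefficients do.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# sum of squared differences at the fitted (optimal) coefficients
ssr_best <- sum((mtcars$mpg - predict(fit))^2)

# the same quantity for a manually perturbed set of coefficients
b <- coef(fit) + c(0.5, -0.2, 0.01)
pred_perturbed <- b[1] + b[2] * mtcars$wt + b[3] * mtcars$hp
ssr_perturbed <- sum((mtcars$mpg - pred_perturbed)^2)

c(best = ssr_best, perturbed = ssr_perturbed)  # the fitted value is smaller
```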

Similarly, if you fit two different models (e.g., a linear model with 3 predictor variables, and another one with 5), you can use the value of C to compare model performance (with some important caveats, which we will discuss shortly).

Cost/loss/objective functions for continuous outcomes

Least squares

You are likely familiar with one of the most widely used functions for f, the least-squares method. If you have ever fit a linear model (e.g. using lm() in R or an equivalent function in a different statistical software), chances are you used least squares as your objective function (maybe without knowing that you did). For least squares, we compute the squared difference between model prediction and data for each observation and sum it all up. In equation form:

\[C = f(Y_m,Y_d) = \sum_i (Y_m^i - Y_d^i)^2\]

Again, \(Y^i_m\) are all the predictions made by the model for the outcome, and \(Y^i_d\) are the actual outcomes from the data. The quantity \(C\) in this equation goes by many names: least squares error, sum of squared residuals (SSR), residual sum of squares (RSS), sum of squares (SS), residual square error (RSE), and a number of similar terms. You will usually be able to tell from the context what is being used as the performance metric.

You will often see a variation where one divides by the sample size, i.e. \(C\) will look like

\[C = \frac{1}{N} \sum_i (Y_m^i - Y_d^i)^2\]

This is called the mean squared error (MSE). Other names exist as well, of course.

Dividing by the sample size has the advantage of allowing you to compare values across samples of different sizes from the same dataset (but it doesn’t really work for comparing across different datasets). For instance, if you compare model performance on training and test data (to be discussed shortly), and the two have different sample sizes, you need to make sure you standardize by the sample size.

If you want to compare different models on the same dataset which might include some missing values, and one of your models can deal with missing data while the other cannot, you need to be careful. One option is to fit both models only to the data without missing values. If you decide to allow one model to use the observations that have some missing values, while the other model does not, you definitely need to standardize by the sample size. Even then, care is needed, since the samples with some missing data might be systematically different from those without and thus the datasets might not be equivalent anymore.

Another variant takes the square root at the end, i.e.

\[C = \sqrt{\frac{1}{N} \sum_i (Y_m^i - Y_d^i)^2}\]

which is called the root mean squared error (RMSE). The advantage of taking the square root at the end is that the units of \(C\) are now the same as those of your outcome. This often makes interpretation of the results easier. In general, it is best to use MSE or RMSE. In tidymodels, the yardstick package has the rmse() metric built in.
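As a concrete sketch (reusing the illustrative mtcars fit from above), the SSR, MSE, and RMSE can all be computed with a few lines of R, and the RMSE can be checked against yardstick's rmse() metric:

```r
# Sketch: computing SSR, MSE, and RMSE by hand for a linear model fit to the
# built-in mtcars data, then checking the RMSE against yardstick's rmse().
library(yardstick)

fit <- lm(mpg ~ wt + hp, data = mtcars)
y_data  <- mtcars$mpg    # Y_d, the observed outcomes
y_model <- predict(fit)  # Y_m, the model predictions

ssr <- sum((y_model - y_data)^2)   # sum of squared residuals
mse <- ssr / length(y_data)        # mean squared error
rmse_manual <- sqrt(mse)           # root mean squared error

# the same RMSE via yardstick, which expects a data frame with truth/estimate
res <- data.frame(truth = y_data, estimate = y_model)
rmse(res, truth = truth, estimate = estimate)
rmse_manual
```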

Coefficient of determination

An equivalent alternative to the SSR is a quantity called the coefficient of determination, more commonly known as \(R^2\) (R-squared). This quantity is defined as \[R^2 = 1 - RSS/TSS,\] where RSS is the residual sum of squares introduced above and TSS is the total sum of squares. The latter is defined as

\[TSS = \sum_i (Y_{av} - Y_d^i)^2\]

where \(Y_{av}\) is the mean of the data. Therefore, the equation for \(R^2\) is

\[R^2 = 1- \frac{\sum_i (Y_m^i - Y_d^i)^2}{\sum_i (Y_{av} - Y_d^i)^2}\]

Since TSS is fixed for a given dataset, minimizing SSR is equivalent to maximizing \(R^2\). You might see \(R^2\) reported in papers, and it is highly likely that the fitting was performed by minimizing SSR (or MSE or RMSE). The TSS is a useful quantity since it tells us the performance of a dumb/null model that uses no information from any predictor variables and instead just predicts the mean of the outcomes. Any model you build that includes predictors needs to do better than this dumb null model.
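Here is a short sketch (again with the illustrative mtcars fit) that computes \(R^2\) directly from RSS and TSS and compares it to the value that summary() reports for the lm() fit:

```r
# Sketch: computing R-squared from RSS and TSS for the illustrative mtcars
# fit and comparing it to the value reported by summary() for the lm() fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)
y_data  <- mtcars$mpg
y_model <- predict(fit)

rss <- sum((y_model - y_data)^2)       # residual sum of squares
tss <- sum((mean(y_data) - y_data)^2)  # total sum of squares (null model)
rsq_manual <- 1 - rss / tss

c(manual = rsq_manual, from_lm = summary(fit)$r.squared)  # should agree
```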

Least squares fitting is simple and frequently used. A lot of standard routines in major statistical packages use least squares. It often makes good sense to penalize predictions that deviate from the actual value with the squared distance (and under certain conditions, this is equivalent to maximizing the likelihood). However, sometimes a different way to define the function f might be useful. We’ll discuss a few of them briefly.

Beyond least squares

An alternative to least squares is to penalize not with the distance squared, but linearly with the (absolute) distance. This metric is called the (mean) absolute error or (least) absolute deviation, and the cost function is

\[C = f(Y_m,Y_d) = \sum_i |Y_m^i - Y_d^i|\]

This approach can be useful if the data contain outliers (that are real, and that one can’t justify removing during cleaning). With a squared distance penalty, outliers have a strong impact on the model fit. With a linear penalty, such outliers carry less weight. Because of this, the linear difference approach is sometimes referred to as robust estimation.

Robust estimation methods such as MAE and many others often require the use of specialized R packages. A large number of R packages exist that allow fitting this way, see e.g., the CRAN Robust Task View. The yardstick package has the metric mae(), which computes the mean absolute error.
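As a quick sketch (again with the illustrative mtcars fit), the mean absolute error can be computed by hand and compared to yardstick's mae():

```r
# Sketch: mean absolute error computed by hand and via yardstick's mae(),
# again using the illustrative mtcars fit.
library(yardstick)

fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- data.frame(truth = mtcars$mpg, estimate = predict(fit))

mean(abs(res$estimate - res$truth))          # MAE by hand
mae(res, truth = truth, estimate = estimate) # MAE via yardstick
```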

Another way to define f is with step functions. The idea is that as long as model and data are within some distance, the penalty is zero. Once model and data differ by some threshold, a penalty is given, e.g., a fixed value or a linear or quadratic penalty. Such types of schemes to define f are common in the class of models called Support Vector Machines, which we will look at later in the course.
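To illustrate the idea, here is a hypothetical sketch of such a step-type loss written as a plain R function: differences below a threshold incur no penalty, larger differences are penalized linearly (this is the “epsilon-insensitive” loss used in support vector regression). The function name, threshold, and numbers are all made up for illustration.

```r
# Hypothetical sketch of a step-type loss: differences smaller than a
# threshold eps incur no penalty, larger differences are penalized linearly
# (the 'epsilon-insensitive' loss used in support vector regression).
eps_loss <- function(y_model, y_data, eps = 1) {
  d <- abs(y_model - y_data)
  sum(pmax(d - eps, 0))  # zero inside the band, linear penalty outside
}

# small deviations (within eps) are ignored, large ones count
eps_loss(y_model = c(1.2, 3.5, 10), y_data = c(1, 3, 5), eps = 1)  # 4
```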

Some other metrics are implemented in yardstick. You can check out which ones here. For instance, there is one called huber_loss(), which is a combination of least squares and absolute error.

No matter what scheme you choose, it might at times be useful to weigh data points. In the examples given above, each model-data pair was given the same importance. Sometimes it might be that you have some data points that should carry more weight. A common case is if you have measurements not on the individual level but some aggregate level. As an example, assume you have a fraction of heart attacks among all patients for some time period in different hospitals, and you want to fit that fraction. You don’t know the number of people who had heart attacks, only the fraction. But you do know something about the total number of beds each hospital has. You could then argue that hospitals with more beds have more patients and thus likely more heart attacks, and therefore the data from larger hospitals should get more weight, and you could e.g., multiply each term in the sum by bed size. Note that this is a scientific decision based on your expert knowledge of the data. Almost all fitting routines allow you to provide weights for your data and you can then perform weighted least squares (or a weighted version of whatever other approaches you choose). Tidymodels has some support for weighted fitting.
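A minimal sketch of weighted fitting with base R's lm(), using made-up hospital-level data along the lines of the example above (the variable names, values, and the predictor are all hypothetical):

```r
# Sketch of weighted least squares with base R's lm(): the weights argument
# multiplies each observation's squared residual by the supplied weight.
# All values (and the predictor) are made up for illustration.
hospitals <- data.frame(
  frac_heart_attack = c(0.020, 0.032, 0.015, 0.027, 0.022),  # outcome
  mean_patient_age  = c(61, 70, 55, 66, 63),                 # hypothetical predictor
  beds              = c(120, 450, 80, 300, 200)              # used as weights
)

fit_unweighted <- lm(frac_heart_attack ~ mean_patient_age, data = hospitals)
fit_weighted   <- lm(frac_heart_attack ~ mean_patient_age, data = hospitals,
                     weights = beds)

coef(fit_unweighted)
coef(fit_weighted)  # larger hospitals pull the weighted fit more strongly
```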

Performance measures for categorical outcomes

For categorical outcomes, one needs different ways to specify the loss function/metric. Again, a lot of different options exist. Here, we’ll just discuss a few common ones.

Accuracy

The simplest way to determine the performance of a model for categorical data is to count how often the model did or did not correctly predict the outcome. The fraction of correct predictions made by the model is called the accuracy. If we instead focus on the fraction of times the model got it wrong, it is called the (mis)classification error. The yardstick package has this as the metric accuracy().
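As a small sketch with made-up observed and predicted classes, accuracy can be computed by hand or with yardstick's accuracy():

```r
# Sketch: accuracy computed by hand and with yardstick's accuracy(), using
# small made-up observed (truth) and predicted (estimate) classes.
library(yardstick)

d <- data.frame(
  truth    = factor(c("yes", "no", "no", "yes", "no", "yes"), levels = c("yes", "no")),
  estimate = factor(c("yes", "no", "yes", "yes", "no", "no"), levels = c("yes", "no"))
)

mean(d$truth == d$estimate)                       # fraction correct: 4/6
accuracy(d, truth = truth, estimate = estimate)   # same value via yardstick
```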

While accuracy is often not a bad choice of metric, sometimes just counting the number of times a model didn’t get it right might not be the best idea. A common situation where accuracy (just counting how often the model prediction is right/wrong) is not very useful is for rare (but important) outcomes. Data of this type is often called unbalanced data. As an example, say we had some algorithm that tried to use brain images to predict if people have brain cancer. Fortunately, brain cancer is rare. Let’s say (I’m making this number up) that 1 in a million people who undergo this screening procedure actually have this cancer. A model that predicts that nobody has cancer would therefore be a very accurate model; it would only make one mistake in a million. However, missing that one person would be a very important mistake. We would likely prefer a model that flags 10 people as (potentially) having cancer, including the person who really does plus 9 false positives. This model has worse accuracy, since it gets 9 out of 1 million wrong. But it’s likely more important to catch the one true case, even if it means temporarily scaring several other individuals until further checks show that they are healthy. This kind of trade-off is very common. You likely know it in the context of balancing sensitivity and specificity of, say, a clinical test. Other areas where this happens are, e.g., predicting the (hopefully rare) errors when building a plane engine, flagging the (rare) cases of credit card fraud, and many others of that type.
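A quick sketch of this problem with made-up numbers: a “model” that always predicts the majority class reaches near-perfect accuracy on such unbalanced data while completely missing the one case that matters.

```r
# Sketch with made-up numbers: a 'model' that always predicts the majority
# class ('healthy') on very unbalanced data has near-perfect accuracy but
# misses the single case that matters.
n <- 1e6
truth <- factor(c("cancer", rep("healthy", n - 1)), levels = c("cancer", "healthy"))
pred  <- factor(rep("healthy", n), levels = c("cancer", "healthy"))

mean(truth == pred)  # accuracy = 0.999999, yet the one cancer case is missed
```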

Beyond accuracy

In instances where accuracy/misclassification is not a good performance metric, other metrics are more helpful. A large variety of such metrics exist. Some only apply to the special (but very common) case of a binary outcome, and others apply more generally. Fairly common ones include sensitivity and specificity, the F1 score, and the area under the ROC curve (AUC).

Remember that yardstick is part of the tidymodels set of packages, so if you work in the tidymodels framework, you have access to these (and other) metrics.

The confusion matrix

For the common case of a binary outcome, one can construct what is called the confusion matrix (also known as 2x2 table in epidemiology). The confusion matrix tracks 4 quantities: true positives (TP, model predicts positive, and data is positive), true negative (TN, both model and data are negative), false positive (FP, model wrongly predicts positive) and false negative (FN, model wrongly predicts negative).

The confusion matrix comes with very confusing terminology since many of the quantities are labeled differently in different fields. For instance, epidemiologists tend to call the quantity TP/(TP+FN) sensitivity, while in other areas, such as machine learning, it is called recall. For a good overview of those different quantities and terminology see this Wikipedia article.
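As a sketch, reusing the small made-up truth/estimate data from the accuracy example above, yardstick can produce the confusion matrix and compute sensitivity/recall:

```r
# Sketch: confusion matrix and sensitivity/recall with yardstick, reusing the
# small made-up truth/estimate data from the accuracy example above.
library(yardstick)

d <- data.frame(
  truth    = factor(c("yes", "no", "no", "yes", "no", "yes"), levels = c("yes", "no")),
  estimate = factor(c("yes", "no", "yes", "yes", "no", "no"), levels = c("yes", "no"))
)

conf_mat(d, truth = truth, estimate = estimate)  # table of TP, FP, FN, TN
sens(d, truth = truth, estimate = estimate)      # sensitivity = TP / (TP + FN)
recall(d, truth = truth, estimate = estimate)    # same quantity, ML terminology
```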

One can of course also make a table for a case with more than 2 outcomes and track the true values versus the predictions. Some of the metrics generalize to such a scenario of more than 2 outcomes, but not all do.

Custom cost functions/metrics

While one of the standard loss functions/metrics (e.g., those just discussed or other common ones) might work for your data and question, it is worth thinking carefully about what exactly you want to optimize, i.e., how exactly the function f should work.

There are many different performance measures that have already been “invented”. All metrics currently implemented in yardstick are listed here.

If no existing measure is suitable for your problem, you can and should define your own. Based on scientific/expert knowledge, you should define the most appropriate cost function and then measure your models based on that. Getting the cost function right is an important, and often overlooked, part of the modeling process. For instance, revisiting the cancer screening example from above, if we think that falsely flagging a healthy person has a cost (in some units, which could be monetary or otherwise) of 1, but failing to identify a person who does have cancer has a cost of 1000, then we should weigh false positives and false negatives accordingly when computing our cost function and designing our model. This is similar to applying weights in the continuous case. The yardstick package allows you to define your own metrics, as described here.
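As a hypothetical sketch of such a custom cost function (written as a plain R function with made-up cost values, not using the yardstick custom-metric machinery described in the link):

```r
# Hypothetical custom cost function with asymmetric penalties: a false
# positive costs 1, a false negative costs 1000 (made-up values). This is a
# plain R function, not a full yardstick metric.
asym_cost <- function(truth, estimate, cost_fp = 1, cost_fn = 1000) {
  fp <- sum(estimate == "yes" & truth == "no")   # false positives
  fn <- sum(estimate == "no"  & truth == "yes")  # false negatives
  fp * cost_fp + fn * cost_fn
}

# example with made-up predictions containing 1 FP and 1 FN
truth    <- c("yes", "no", "no", "yes", "no", "yes")
estimate <- c("yes", "no", "yes", "yes", "no", "no")
asym_cost(truth, estimate)  # 1 * 1 + 1 * 1000 = 1001
```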

Likelihood approach

The most general approach to defining the cost/loss function, f, is the likelihood approach. Using this approach, which underlies a lot of modern statistics (both frequentist and Bayesian), you start by explicitly specifying a model/distribution for the data (both the deterministic and the probability parts). As an example, you might postulate that the expected number of cases of cholera in a given month depends linearly on the monthly amount of rainfall (deterministic component) and that the actual/measured number of cases is Poisson distributed (probability component). Both the deterministic and probabilistic components of your model have parameters. You write down (or simulate) the likelihood and then run some optimization routine that varies the parameters until it finds the optimal result (the maximum of the likelihood or, equivalently and more commonly, the minimum of the negative log-likelihood).
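Here is a minimal sketch of this approach in base R, loosely following the cholera/rainfall example: the data are simulated, the expected number of cases depends linearly on rainfall, the counts are Poisson distributed, and optim() is used to minimize the negative log-likelihood (all values are made up for illustration).

```r
# Minimal sketch of likelihood-based fitting with base R, loosely following
# the cholera/rainfall example: expected cases depend linearly on rainfall
# (deterministic part), observed counts are Poisson distributed (probability
# part). All data are simulated, and the parameter values are made up.
set.seed(123)
rainfall <- runif(50, min = 0, max = 10)            # monthly rainfall, arbitrary units
cases    <- rpois(50, lambda = 2 + 1.5 * rainfall)  # simulated observed case counts

# negative log-likelihood as a function of the parameters b0 and b1
negll <- function(par) {
  lambda <- par[1] + par[2] * rainfall  # model-predicted expected cases
  if (any(lambda <= 0)) return(Inf)     # guard against invalid (negative) means
  -sum(dpois(cases, lambda = lambda, log = TRUE))
}

# let an optimizer vary the parameters until the negative log-likelihood is minimal
fit <- optim(par = c(1, 1), fn = negll)
fit$par  # should be close to the true values 2 and 1.5
```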

Using the likelihood is the most general and flexible approach to model fitting and, for more complex problems, often the only feasible approach. It applies to any kind of outcome, and you can specify any model you want. Unfortunately, it is generally more complicated and often requires more custom-written code (though many more user-friendly R packages have become available in recent years). We can’t get much into likelihood-based fitting in this course. For more information, I recommend Ben Bolker’s book “Ecological Models and Data in R” as a good starting point for (mostly frequentist) likelihood fitting approaches. Another book that introduces the ideas from a Bayesian perspective is Statistical Rethinking. Unfortunately, neither book is freely available online (as far as I’m aware). If you know of any good introductory books that teach model fitting using likelihood approaches with R, please let me know.

Further materials

If you want to get some more intuition for fitting a continuous outcome, go through this interactive tutorial. If it looks like the RStudio Primer tutorials you’ve seen, that is no accident: both use the learnr package. The difference is that I wrote this tutorial, while the other ones were written by RStudio and are therefore more polished and less buggy 😄.

The model evaluation section of HMLR lists and describes various performance metrics. This chapter of the Tidy Modeling with R book also discusses different performance metrics.

Chapter 2 of ISLR, which you already looked at previously, also discusses the topic of performance, as well as several other issues we are discussing in this module. Now, or once you are finished with the other readings of this module, might be a good time to revisit that chapter.


  1. Technically, for lm() and some other simple models, it is possible to find the best parameter values other than by ‘trial and error’, but for many other models, this is more or less what is happening. Developing routines that can quickly, efficiently, and reliably find the optimum value of a function (here the minimum of C) is part of the large field of optimization. Fortunately, for our purposes, we can let the different functions in R do this task without us having to worry too much about how the optimal value is determined. However, for more advanced analyses, the built-in optimizers sometimes do not work. In that case, I recommend using the nloptr package - but we won’t need it in this course.↩︎