Assessment - Improving Models

Author

Andreas Handel

Modified

2024-03-20

Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Project

Review and provide feedback on part 3 of the projects you have been assigned to as described in the Projects section.

Discussion

Because you’ll be doing the project reviews, there is no discussion assignment for this module.

Exercise

We’ll practice some of the topics covered in this and prior units by continuing with the exercise we started previously.

Setup

You’ll continue working on the exercise you started previously. We’ll also do some group work again, using the – by now familiar – M1, M2, … setup. Assign each other a number. As much as possible, do it such that you end up working with group members you have not worked with before (or not in a while). Use the ‘circular’ setup such that everyone will work on their own repository and on one other person’s repository.

This exercise has 3 parts. Part 1 is due Wednesday morning, part 2 Friday morning, and part 3 Friday evening. Coordinate with your team member accordingly.

This exercise draws a good bit on the Getting Started tidymodels tutorial. If you haven’t yet, now would be a great time to at least skim through the tutorial and make sure you understand the overall steps.

Part 1

We’ll continue this exercise using the previous Quarto fitting-exercise.qmd file.

At this stage, I assume that all the data wrangling and EDA code, as well as the model fitting code you worked on previously, is present and fully functional. If there are still some issues that need to be resolved, go ahead and do so now.

Setting a random seed

This exercise involves sampling. Sampling means that under the hood, the computer draws some random numbers and does certain actions based on the values of those numbers (for instance it might draw a random number between 0 and 1 and if it’s >0.5 it does one thing, otherwise another thing).

For a computer, these random numbers are really pseudo-random: they come from a long sequence of random-looking numbers, and if one starts at the same point in that sequence, one gets the same collection of numbers. This is important for reproducibility. To ensure we get the same random numbers each time, we need to set a seed.

There are a few options. One option is to set a single seed at the top of your script. This way, if you re-run the script with the same seed, everything should stay the same; if you change the seed, everything changes. The disadvantage of setting a single seed at the beginning is that if you later add code that performs more operations requiring random numbers, you advance in the sequence of random numbers, and everything downstream that depends on random numbers changes.

A second option is to set a seed before each operation you know involves random numbers. For that, you do need to know which parts involve randomness and which don’t. It also means that if you want to try a different seed - always a good idea to ensure results are robust and don’t depend on random variation - you have to change the seed in multiple places in your code.

Yet another option, and the one I personally like best, is to specify a seed value at the start and then re-set the seed to this value before every operation that involves random numbers. A potential drawback is that some samples might look too similar, but since they are almost always used internally in different ways, one should get enough variability.

For this exercise, we’ll try the last method. At this point in our code, add a statement that defines a seed value, rngseed = 1234. We’ll use it shortly.
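For instance, that statement could simply look like this (either <- or = works for the assignment):

# define the seed value once; we'll re-use it with set.seed() below
rngseed <- 1234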

Data prep

We are making one more change to the data. Since the RACE variable has a few values that are coded oddly (7 or 88, which probably indicate some kind of missing value), we’ll completely remove that variable for this exercise.

So for this exercise, we’ll have a data frame with 120 observations and 6 variables, namely Y, DOSE, AGE, SEX, WT, HT.
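As a rough sketch, the variable selection could look something like this; the data frame name dat is just a placeholder for whatever your cleaned data object is called:

library(dplyr)

# drop RACE by keeping only the variables we need for this exercise
dat <- dat |>
  select(Y, DOSE, AGE, SEX, WT, HT)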

Since we are about to do some random sampling of our data, now is the time to set a seed with set.seed(rngseed). After you’ve done that, write code that takes the data and splits it randomly into a 75% train and 25% test set, following for instance the example in the Data Splitting section of the Get Started tidymodels tutorial.
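Here is one possible sketch, assuming your data frame is called dat and you defined rngseed above (object names are placeholders):

library(rsample)

# re-set the seed right before the random split so results are reproducible
set.seed(rngseed)

# put 75% of the data into the training set, 25% into the test set
data_split <- initial_split(dat, prop = 0.75)
train_data <- training(data_split)
test_data  <- testing(data_split)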

We only have 120 observations here. This is not much, so for a real project, it might not make a lot of sense to do a train/test split. Instead, one would probably use all data during the model fitting process, and then use cross-validation to try and estimate model performance on unseen data. But since this is an exercise, we’ll do the train/test split.

Also note that we are doing a random train/test split. But for a case like the one we have here, one might want to split differently. For instance, we could split by using everyone who got the lower doses as training data, and then test if we can predict the higher-dose outcomes. This is a common question of interest in drug development.

Model fitting

Use the tidymodels framework to fit two linear models to our continuous outcome of interest (here, Y). The first model should only use DOSE as predictor, the second model should use all predictors. For both models, the metric to optimize should be RMSE.

This is the same as you did previously, but for this exercise, you should only use the training data set for fitting.
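A minimal sketch of what this could look like, assuming the training data object from the split above is called train_data:

library(tidymodels)

# linear regression model specification (uses the "lm" engine by default)
lm_spec <- linear_reg()

# model 1: DOSE as the only predictor
lm_fit1 <- lm_spec |> fit(Y ~ DOSE, data = train_data)

# model 2: all predictors
lm_fit2 <- lm_spec |> fit(Y ~ ., data = train_data)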

Model performance assessment 1

Compute predictions for the two models on the training data (we’ll use the test data later). Then use the observed and predicted values to compute the RMSE for each model fit.

Also compute the RMSE of a null-model, one that would just predict the mean outcome for each observation, without using any predictor information. You can use the null_model() function from tidymodels for this (using the parsnip engine), or just do it “by hand”.
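A possible sketch for the predictions and RMSE computations, doing the null model “by hand” and assuming the fitted model objects from above (lm_fit1, lm_fit2) and train_data exist, with tidymodels loaded:

# predictions on the training data, combined with the observed values
pred1 <- predict(lm_fit1, new_data = train_data) |> bind_cols(train_data)
pred2 <- predict(lm_fit2, new_data = train_data) |> bind_cols(train_data)

# RMSE for each model (yardstick expects columns for truth and estimate)
rmse(pred1, truth = Y, estimate = .pred)
rmse(pred2, truth = Y, estimate = .pred)

# null model "by hand": predict the mean outcome for every observation
pred_null <- train_data |> mutate(.pred = mean(Y))
rmse(pred_null, truth = Y, estimate = .pred)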

Compare the 3 RMSE values and discuss which model is performing the best (according to those metrics). You should have gotten values of 948, 702 and 627 for the null model and models 1 and 2 respectively. If your values are different, it’s likely that you didn’t set the random seed.

Model performance assessment 2

So far, we have assessed performance on the training set. But as you know by now, that is often not really of interest; we want to know how the model performs in general. We could use our test data to check, but we want to reserve that until the very end, to get a final “fair” assessment of our best model.

We could use an information criterion, such as AIC, to try to estimate how well the models might perform on unseen data. Even better, we can use cross-validation to estimate performance on unseen data.

Since CV includes sampling, and therefore random numbers, it’s time again to (re)-set the random number seed. We’ll use the same value as before.

Follow the Evaluate your model with resampling section of the tutorial and do a 10-fold cross-validation. Doing so, you fit the 2 models to the data 10 times. Each time, you use 90% of the data to fit the model and 10% to evaluate it (compute the RMSE). If you take the average of the RMSE over the 10 folds, you get an estimate of model performance on data not used for fitting. It’s the same idea as the initial train/test split, but now we use it as part of the model building and selection process, while we keep the test data for a final evaluation of the model and won’t use it before then.
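A possible sketch, assuming the lm_spec, train_data, and rngseed objects from earlier are available:

# re-set the seed, then create 10 folds from the training data
set.seed(rngseed)
folds <- vfold_cv(train_data, v = 10)

# fit model 2 (all predictors) to each fold; repeat analogously for model 1 (Y ~ DOSE)
cv_fit2 <- lm_spec |> fit_resamples(Y ~ ., resamples = folds)

# average RMSE across the 10 folds, including a standard error
collect_metrics(cv_fit2)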

Compute the RMSE for both models again. Of course nothing changes for the null model. Compare the new RMSE estimates obtained through CV with those obtained earlier. What did and didn’t change?

Also look at the standard error for the RMSE. Since you are now sampling, you don’t just get a single estimate for the RMSE but one for each fold, so you can look at the variation in RMSE. This gives you a good indication of how robust your model performance estimate is.

Finally, run the code again that creates the CV folds and does the fitting. This time, choose a different value for the random seed. The RMSE values for the CV fits will change. That’s just due to the randomness in the data splitting. If we had more data, we would expect to get less variability. The overall pattern in the RMSE values, i.e., how the models compare to each other, should still be the same as what you saw for the fits to the training data without CV.

If you want to get more robust RMSE estimates with CV, you can try to set repeats to some value. That creates more samples by repeating the whole CV procedure several times. In theory this might give more robust results. You might encounter some warning messages. This is likely related to the fact that occasionally, by chance, the data is split in a way that some information (e.g., a certain value for SEX in our data) is missing from one of the folds. That can cause issues.
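For instance, with a (hypothetical) choice of 5 repeats:

# 10-fold CV repeated 5 times gives 50 resamples in total
folds_rep <- vfold_cv(train_data, v = 10, repeats = 5)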

Wrap up part 1

Make sure everything runs and works as expected. Also make sure everything is well commented/documented/explained! Then commit, push and tell your classmate that they can take over. Finish this by Wednesday.

Part 2

Fork and clone (or if you are added as collaborator, clone directly) your classmate’s repository. Open their fitting-exercise.qmd file.

Add a heading that says # This section added by YOURFULLNAME. I need this so I can grade accordingly.

Model predictions

Nothing beats a visual inspection of results. A very useful figure is one that plots predicted versus observed values. This can quickly show you if there are important deviations that suggest the model isn’t very good.

Put the observed values and the predicted values from your 3 original model fits to all of the training data (the ones without the CV) into a data frame. Also add a label to indicate the model. Then use ggplot to create a figure that plots (as symbols) observed values on the x-axis and predictions (from each of the 3 models, including the null model) on the y-axis. Use a different color and/or a different symbol to differentiate between the 3 model predictions. Alternatively, you can use facets. Let both x and y axes go from 0 to 5000 and add a 45 degree line. For a good model, the points will fall along that line, meaning observed and predicted values agree, with some scatter.
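One possible way to build such a figure; the prediction objects (pred_null, pred1, pred2) are assumed to be the ones computed in part 1 and are placeholders for whatever they are called in the document you are working on:

library(ggplot2)

# combine observed values and predictions from the 3 models into one data frame
plot_dat <- dplyr::bind_rows(
  data.frame(observed = train_data$Y, predicted = pred_null$.pred, model = "null model"),
  data.frame(observed = train_data$Y, predicted = pred1$.pred, model = "model 1 (DOSE)"),
  data.frame(observed = train_data$Y, predicted = pred2$.pred, model = "model 2 (all)")
)

ggplot(plot_dat, aes(x = observed, y = predicted, color = model, shape = model)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  xlim(0, 5000) +
  ylim(0, 5000)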

You should see that the “predictions” from the null model are a straight horizontal line. We expect that, since we predict exactly the same value (the mean) for each observation. For the model that only includes dose, you should see the data fall along 3 horizontal lines. Can you understand why? If it’s not clear, look at the DOSE variable and the values it takes. That should tell you why you only get 3 different predicted values for the outcome.

The model with all predictors looks the best, though there’s still a lot of scatter around the diagonal line. If that scatter is fairly random, it could mean that we captured all the “signal” with our model and the rest is noise. If there seems to be some pattern to the scatter – as it seems to be here, with model predictions lower than observed values for high values – it can indicate that there’s still a lot of the outcome pattern that our model can’t explain.

A good way to see if there are patterns is to plot predictions versus residuals (the latter being just residuals = predicted - observed). Make such a plot for model 2. Also add a straight horizontal line at 0. Make sure your y-axis goes the same amount into the positive and the negative direction.
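A sketch of such a residual plot for model 2, with an assumed (arbitrary) symmetric y-axis range and the pred2 object from part 1:

# residuals defined as predicted - observed, as described above
resid_dat <- pred2 |> dplyr::mutate(residual = .pred - Y)

ggplot(resid_dat, aes(x = .pred, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ylim(-2500, 2500)  # symmetric range; adjust to your data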

For a good model, there should be a cloud of points without any discernible pattern. Here, we again see that there’s some residual pattern; in general, there are more and larger negative values compared to positive ones. That could either be because we are missing important information, i.e., we need more variables, or it could be that the model is too simple - for instance, the outcome might depend on some variable in a nonlinear way.

We can’t do anything about the former problem (other than collecting more data), but we can try to explore if more complex models can capture the data better. We’ll do that in an upcoming exercise.

For this exercise, we’ll do one more exploration of our existing models.

Model predictions and uncertainty

As you learned, model predictions without a measure of uncertainty are usually not that useful. We generally always want to determine uncertainty in our estimates/predictions.

While for a linear model it is possible to get uncertainty directly from the uncertainty of the parameter estimates, for more complex models, this is not possible. However, it is always possible to use the bootstrap method to sample the data, fit models to the data, and get uncertainty that way.

We’ll give that a try. Since our plot above suggested that model 1 is not good, we’ll focus here on model 2.

First, we need to (re)-set the random seed again, we’ll use the same value as above. Then, use the bootstraps function from the rsample package to create 100 bootstraps. These bootstraps should use the training data.

Next, write a loop (or use a map or apply function) to fit the model to each of the bootstrap samples and make predictions from this model for the original training data. Record all predictions (e.g., in an array or a list).

If you are wondering how to get the individual bootstrap samples out of the object that bootstraps gave you back: Assuming you called the object that you got back dat_bs, you can get a single bootstrap sample with this code:

dat_sample = rsample::analysis(dat_bs$splits[[i]])
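Putting those pieces together, the whole procedure could look roughly like this, assuming the lm_spec, train_data, and rngseed objects from part 1 are available:

# re-set the seed, then create 100 bootstrap samples of the training data
set.seed(rngseed)
dat_bs <- rsample::bootstraps(train_data, times = 100)

# matrix to hold predictions: one row per bootstrap sample,
# one column per observation in the training data
pred_bs <- matrix(NA, nrow = 100, ncol = nrow(train_data))

for (i in 1:100) {
  # pull out bootstrap sample i, fit the model to it,
  # then predict for the original training data
  dat_sample <- rsample::analysis(dat_bs$splits[[i]])
  fit_bs <- lm_spec |> fit(Y ~ ., data = dat_sample)
  pred_bs[i, ] <- predict(fit_bs, new_data = train_data)$.pred
}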

Once you have all your predictions stored, compute the median and confidence intervals. Assuming your predictions are stored in an array called pred_bs that has as many rows as samples and as many columns as data points, this bit of code computes the median and 89% confidence intervals.

preds <- pred_bs |> apply(2, quantile, c(0.055, 0.5, 0.945)) |> t()

Finally, make a figure that plots observed values on the x-axis and, on the y-axis, the point estimate (obtained from your original predictions on the training data) as well as the median and the upper and lower bounds obtained by the bootstrap sampling and stored in preds. You can, for instance, use black symbols for the original predictions (the point estimate) and some colors to indicate the median and the lower and upper confidence limits. As above, make sure the x- and y-axes are on the same scale and add a 45 degree line.
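One possible sketch of that figure, assuming preds comes from the quantile code above and pred2 holds the original model 2 predictions on the training data:

# combine observed values, original point estimate, and bootstrap quantiles
ci_dat <- data.frame(
  observed = train_data$Y,
  point    = pred2$.pred,
  lower    = preds[, 1],
  median   = preds[, 2],
  upper    = preds[, 3]
)

ggplot(ci_dat, aes(x = observed)) +
  geom_point(aes(y = point), color = "black") +
  geom_point(aes(y = median), color = "blue") +
  geom_point(aes(y = lower), color = "red") +
  geom_point(aes(y = upper), color = "red") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  xlim(0, 5000) +
  ylim(0, 5000)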

Interpret what you see in this plot.

The bootstrap uses sampling, and CV uses sampling. The difference is that for the bootstrap, we sample from all the (training) data with replacement. For CV, we randomly partition the data: one part is used to fit, the other part is used to evaluate/assess. So the two approaches are conceptually similar in that they use a sampling-based method, but they are not the same.

Wrap up part 2

Make sure everything runs and works as expected. Then commit, push, and if you forked the repo, initiate a pull request. Tell your classmate that it’s done.

Part 3

Now it’s back to you, the portfolio repository owner. Based on the work you and your classmate did, perform a final evaluation task and then summarize/discuss the models.

Final evaluation using test data

Let’s do a final model evaluation, this time using the test data.

Take the fit of model 2 to the training data that you computed previously, and now use this fitted model to make predictions for the test data. This gives us an indication of how well our model generalizes to data that wasn’t used to construct/fit/train the model.

Then make a plot that shows predicted versus observed for both the training data (which you did above) and in the same plot, also show predicted versus observed for the test data (using, e.g. different symbols or colors).
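A possible sketch, assuming the fitted model 2 object (lm_fit2), the training-data predictions (pred2), and the test data (test_data) carry the names used earlier:

# predictions for the test data from the previously fitted model 2
pred_test <- predict(lm_fit2, new_data = test_data) |> dplyr::bind_cols(test_data)

# label and combine train and test predictions, then plot them together
both_dat <- dplyr::bind_rows(
  pred2 |> dplyr::mutate(data = "train"),
  pred_test |> dplyr::mutate(data = "test")
)

ggplot(both_dat, aes(x = Y, y = .pred, color = data, shape = data)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  xlim(0, 5000) +
  ylim(0, 5000)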

You should see that the observed/predicted values for the test data are mixed in with the train data. That’s a good sign. If the test data points were systematically “off”, it would indicate a problem such as overfitting to the training data.

Overall model assessment

Let’s critique all our models. Here are some points for discussion:

  • We want to make sure that any model we have performs better than the null model. Is that the case?

  • Does model 1 with only dose in the model improve results over the null model? Do the results make sense? Would you consider this model “usable” for any real purpose?

  • Does model 2 with all predictors further improve results? Do the results make sense? Would you consider this model “usable” for any real purpose?

Use all the assessment bits you computed above to reason about the models. I’m not really looking for “the right” answer here (if that even exists); I just want to see that you are thinking about what your fit results, and thus the underlying models, mean, and that you can assess their strengths and weaknesses and interpret what you are doing and seeing.

Test and website update

Make sure everything works. Make any further updates that you think need to be made. Then rebuild your portfolio website and make sure everything works and shows up as expected. Then commit and push.

Since this is part of your portfolio site, you don’t need to post anything, I know where to find it.

More explorations

If you want additional practice, you can re-do this exercise, but this time consider the categorical/binary variable SEX as your outcome and the others as predictors. That does of course not make a lot of scientific sense, but it’s a useful exercise if you want to get some practice with categorical outcomes.

Your metric will change from RMSE to one of your choice that is suitable for categorical/classification problems. That also means you can’t do a scatterplot for observed versus predicted, but you can look at e.g. a confusion matrix. Overall, it’s the same conceptual workflow. So if you want to try a categorical outcome, go ahead. This is completely optional.
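If you do give it a try, one possible starting point could look like this; logistic regression and a confusion matrix are just one choice among several, the object names follow the earlier sketches, and SEX would need to be coded as a factor:

# logistic regression with SEX as outcome (uses the "glm" engine by default);
# SEX must be a factor for this to work
log_spec <- logistic_reg()
log_fit  <- log_spec |> fit(SEX ~ ., data = train_data)

# class predictions and a confusion matrix on the training data
pred_class <- predict(log_fit, new_data = train_data) |> dplyr::bind_cols(train_data)
yardstick::conf_mat(pred_class, truth = SEX, estimate = .pred_class)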