Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Exercise

For this exercise, we will implement some of the machine learning models covered in this module, and some of the general approaches we covered previously. This is a continuation of the Flu Analysis exercise you’ve been working on over the last few weeks. The goal is that by the end of this exercise, you will have an example of a nice full analysis. So let’s get started!

Note: This exercise is likely time-consuming and you might get stuck. Please plan accordingly. Start early and ask for help in Slack.

Setup

This is a solo exercise, and it’s part of your portfolio.

The goal is to have both a complete and well-organized data analysis example at the end of the exercise. So as you go through this exercise, in addition to doing the indicated tasks and writing the code, keep organizing things as needed. You want to make sure you have readme files or other documentation that briefly explain what files are where and in which order to run them. Make sure your R/Quarto files are easy to understand and that they all run.

As you re-organize, you can decide the mix of R and Quarto files and what you want to combine and what to split. However you do it, make sure it’s documented and easy to understand.

Pre-processing

We previously covered pre-processing but haven’t yet applied it in an exercise, so let’s add that part. The pre-processing can go in various places in the code/files you already created: you can, for instance, make it part of the cleaning/wrangling code, or include it in the recipe part of your tidymodels workflow (or a mix).

Feature/Variable removal

You probably realized that some of the variables contain the same information in slightly different ways. Specifically, Weakness, Cough and Myalgia exist both as a severity score and as a Yes/No variable, and there are 2 Yes/No variables for cough. These variables are strongly correlated and thus, at a minimum, don’t help when you model; more likely, they will confuse your model and lead to errors/warnings/poor model performance (e.g., “predictions from a rank-deficient fit may be misleading”). So let’s fix that. For those symptoms where you have both multiple severity levels and Yes/No, remove all the Yes/No versions. That should remove 4 variables.
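Here is a minimal sketch of that removal step. The data frame name (flu_data) and the Yes/No column names (CoughYN, CoughYN2, WeaknessYN, MyalgiaYN) are assumptions for illustration; check your own data and adjust the names accordingly.

  library(dplyr)

  # hypothetical object/column names -- adjust to your cleaned flu data
  flu_clean <- flu_data %>%
    select(-c(CoughYN, CoughYN2, WeaknessYN, MyalgiaYN))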

Categorical/Ordinal predictors

Some of your predictors are categorical (e.g., Yes/No) and the 3 symptom severity factors are ordinal, with None < Mild < Moderate < Severe.

We can code the categorical variables as unordered factors and the severity variables as ordered factors. I want you to do that as practice. The functions step_dummy() and step_ordinalscore() will help; see, e.g., the help file example for step_ordinalscore(). Note that to deal with ordered factors properly in a statistical analysis, one needs special approaches (we won’t actually use those for this exercise).
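As an illustration, here is one possible sketch, assuming the three severity variables are called Weakness, CoughIntensity and Myalgia (hypothetical names; use whatever they are called in your data). It converts them to ordered factors and then, if you go the recipe route, scores them numerically while dummy-coding the remaining nominal predictors.

  library(tidymodels)

  # hypothetical variable names -- adjust to your data
  severity_vars <- c("Weakness", "CoughIntensity", "Myalgia")

  flu_clean <- flu_clean %>%
    mutate(across(all_of(severity_vars),
                  ~ factor(.x, levels = c("None", "Mild", "Moderate", "Severe"),
                           ordered = TRUE)))

  # if you handle this in the recipe: score the ordered factors first,
  # then dummy-code the remaining (unordered) nominal predictors
  rec_example <- recipe(BodyTemp ~ ., data = flu_clean) %>%
    step_ordinalscore(all_of(severity_vars)) %>%
    step_dummy(all_nominal_predictors())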

Low (“near-zero”) variance predictors

If you look at your data, you’ll see that some predictors are fairly unbalanced, with most patients reporting No and only a few Yes. If almost everyone is in one category and almost nobody in the others, that often (but not always) means those predictor variables are not very helpful for fitting/predicting the outcome. Furthermore, if you do cross-validation and one of your samples happens to not include a predictor level at all, things might go wrong when the fit is applied to the holdout sample. (There are methods to deal with this specific problem, but they are more complicated than what we will do here.) Thus, it is worth considering whether we want to remove them. The recipes package in tidymodels has the function step_nzv(), which can do that for you automatically. If you have lots of predictors, that might be good to use. But it’s often better to decide manually, for each variable and based on your scientific expertise, whether you want to remove it or not. We’ll take that approach here. After looking at the data, we decide to remove those binary predictors that have <50 entries in one category (there are 2). Write code to remove them.
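A quick sketch of one manual way to do this. The column names used below (Hearing and Vision) are assumptions for illustration; count the categories in your own data first and drop whichever two binary predictors actually have fewer than 50 entries in one category.

  # count category frequencies for candidate predictors (hypothetical names)
  flu_clean %>% count(Hearing)
  flu_clean %>% count(Vision)

  # drop the two sparse binary predictors
  flu_clean <- flu_clean %>%
    select(-c(Hearing, Vision))

  # quick check of the resulting dimensions
  dim(flu_clean)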

You should end up with a data frame that has 730 observations and 26 variables. This is the dataset we’ll use for modeling.

Analysis code

Start a new Quarto file for this analysis, call it machinelearning.qmd, and place it in the code folder. Note that for this exercise, some of your code might take a long time to run. This is one of the situations where it is often good to have a setup where some of the heavy computations are done by separate R scripts/functions, with the results saved and then pulled into and displayed in the Quarto file. How you want to do it here is up to you, as long as the Quarto file shows all the main results and your code/file(s) are well documented.

To keep things a bit simpler, for this exercise we focus on a single outcome, the continuous, numerical value of body temperature (BodyTemp). Thus, we are fitting regression models here. Once you have finished the whole workflow for that outcome, you are welcome to (optionally) do it again and run a classification for the categorical outcome Nausea. Since we are doing a regression, and we don’t have any specific expert knowledge that tells us how we should build our performance metric, we’ll go with one of the usual ones, namely RMSE. But remember to always think about your performance metric and don’t just use the default without at least considering other options.

Data Setup

  • Start by setting the random seed to 123. This should make everything reproducible and everyone should get the same results.

  • Split the dataset into 70% training, 30% testing. Also use the outcome BodyTemp as stratification. This allows for more balanced outcome values in the train and test sets. See e.g., section 3 of the Get Started tutorial.

  • We want to do 5-fold cross-validation, 5 times repeated. (There’s no specific reason to do this 5x5 pattern, other than to show you that there are different ways to pick the sample, and that I want you to not use the default.) For the CV folds, we also want to stratify on BodyTemp, as we did for the main train/test split. Use the vfold_cv() function to create a resample object for the training data with these specifications.

  • Create a recipe for the data and fitting. You won’t need to do much; just make sure you code the categorical variables as dummy variables, otherwise things might not work smoothly. For that, you want to use the step_dummy() function and pick all nominal predictor variables (which are actually all predictor variables here, since the only continuous variable is our outcome). A sketch of these setup steps follows this list.
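Here is a minimal sketch of the setup steps above, assuming the processed data frame is called flu_clean (a hypothetical name carried over from earlier; use your own). Depending on your recipes version, you may need all_nominal(), -all_outcomes() instead of all_nominal_predictors().

  library(tidymodels)

  set.seed(123)

  # 70/30 split, stratified on the outcome
  data_split <- initial_split(flu_clean, prop = 0.7, strata = BodyTemp)
  train_data <- training(data_split)
  test_data  <- testing(data_split)

  # 5-fold cross-validation, 5 times repeated, also stratified on BodyTemp
  cv_folds <- vfold_cv(train_data, v = 5, repeats = 5, strata = BodyTemp)

  # simple recipe: dummy-code all nominal predictors
  flu_rec <- recipe(BodyTemp ~ ., data = train_data) %>%
    step_dummy(all_nominal_predictors())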

Null model performance

Write some code to compute the performance of a null model, i.e., a “model” that doesn’t use any predictor information. For a continuous outcome and RMSE as our metric, a null model is one that always predicts the mean of the outcome. Compute the RMSE for both training and test data for such a “model”. We’ll use that later to compare it to the performance of our real models. Of course, we expect/hope our real models that use predictor information to be better. If they aren’t, that means they are no good.
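One way to sketch this, using yardstick’s rmse() and the object names assumed above (train_data, test_data). You could also look at parsnip’s null_model() function, which achieves something similar.

  # "null model": always predict the mean outcome of the training data
  mean_bodytemp <- mean(train_data$BodyTemp)

  null_rmse_train <- train_data %>%
    mutate(.pred = mean_bodytemp) %>%
    rmse(truth = BodyTemp, estimate = .pred)

  null_rmse_test <- test_data %>%
    mutate(.pred = mean_bodytemp) %>%
    rmse(truth = BodyTemp, estimate = .pred)

  null_rmse_train
  null_rmse_test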

Model tuning and fitting

We’ll fit a tree, a LASSO model, and a random forest. I chose those because they are used in the tutorial on the tidymodels website. You can of course add further models. For the tree, see the Tune model parameters section of the Get Started tutorial. For LASSO and the random forest, check out the Case Study section of the Get Started tutorial. Note that you will need to adjust the code for our scenario since we have a continuous outcome.

If you follow the tutorial, you’ll likely use the packages rpart, glmnet and ranger to fit those 3 models. Make sure they are installed and loaded.

I suggest you write code for each model separately. A lot of the code will look similar, so once you have the first one set up, the other two should be easier. They mainly differ in the commands specifying the tuning parameters and the tuning grid.

Each of these models requires some tuning. For the choices regarding the tuning parameters, you can follow the examples. Most of the models have more things that can be tuned, but for now you can stick to what is shown in the tutorial. Follow the examples: set up a workflow, set a tuning grid, and then use the tune_grid() function to tune the model using cross-validation.

Note that the tuning part, i.e., calling tune_grid() might take a good bit of time to run (possibly minutes).

The steps (blocks of code) you should have here are 1) model specification, 2) workflow definition, 3) tuning grid specification, and 4) tuning using cross-validation with the tune_grid() function.
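For orientation, here is a hedged sketch of those four steps for the tree model, reusing the hypothetical flu_rec and cv_folds objects from the setup sketch. The tuning parameters and grid follow the Get Started tutorial; the LASSO and random forest code will look very similar, just with different model specifications and grids.

  library(rpart)

  # 1) model specification: a regression tree with two tunable parameters
  tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
    set_engine("rpart") %>%
    set_mode("regression")

  # 2) workflow: recipe + model
  tree_wf <- workflow() %>%
    add_recipe(flu_rec) %>%
    add_model(tree_spec)

  # 3) tuning grid
  tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)

  # 4) tune via cross-validation (this is the slow step)
  tree_tune_res <- tree_wf %>%
    tune_grid(resamples = cv_folds,
              grid = tree_grid,
              metrics = metric_set(rmse))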

Model evaluation

Once you have done the tuning, you can take a look at some diagnostics by sending your object returned from the tune_grid() function to autoplot(). For instance if you tuned the tree and saved the result as tree_tune_res, you can run tree_tune_res %>% autoplot(). Depending on the model, the plot will be different, but in general it shows you what happened during the tuning process.

Next, you want to get the model that the tuning process has determined is the best. You can get the best-fit model with select_best() and finalize_workflow() and then do one more fit to the training data with this final workflow using the fit() function. Follow the examples in the tutorial.
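Continuing the tree example from above (with the same hypothetical object names), this might look roughly like:

  # pick the best tuning parameters (by RMSE) and finalize the workflow
  best_tree_params <- select_best(tree_tune_res, metric = "rmse")

  final_tree_wf <- tree_wf %>%
    finalize_workflow(best_tree_params)

  # one more fit to the training data with the finalized workflow
  best_tree_fit <- final_tree_wf %>%
    fit(data = train_data)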

To evaluate the final fit for each model, do the following.

  1. Make two plots, one that shows predictions from the tuned model versus actual outcomes, and one that plots the residuals. The actual outcomes you get straight from the data; the predicted outcomes you can get by applying the predict() function to the final fit.

  2. Look at/print the model performance and compare it with the null model (still only on training data). Here, we want the performance of the tuned, best-fitting model on the CV folds (we are not yet touching the test data). You can get that, for instance, with the show_best() function, which gives you the mean cross-validated performance for the best models. It also shows the standard error of that performance estimate. Compare that model performance with the null model. A sketch of these evaluation steps follows this list.
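A sketch for the tree example, again using the hypothetical best_tree_fit and tree_tune_res objects from above:

  library(ggplot2)

  # predictions from the tuned model on the training data
  tree_preds <- predict(best_tree_fit, train_data) %>%
    bind_cols(train_data %>% select(BodyTemp))

  # predicted versus actual outcomes
  ggplot(tree_preds, aes(x = BodyTemp, y = .pred)) +
    geom_point() +
    geom_abline(linetype = "dashed")

  # residuals versus predicted values
  tree_preds %>%
    mutate(residual = BodyTemp - .pred) %>%
    ggplot(aes(x = .pred, y = residual)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed")

  # mean cross-validated RMSE (and its standard error) for the best models
  show_best(tree_tune_res, metric = "rmse")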

The mean and standard error of the performance give you a measure of overall performance and of the variability in that measure. The plots show you if there are any systematic deviations between model and data. Taken together, these can be compared across the different models, and based on those (and, as wanted, other considerations) a final model can be chosen.

Implement the model tuning/fitting and evaluating steps for all 3 models.

Comments for specific models

Here are some more suggestions and hints. Most of this is optional but worth trying.

For the tree model, if you want to plot the tree, you can use the rpart.plot package and run this command: rpart.plot(extract_fit_parsnip(best_tree_fit)$fit) (assuming your result from the final workflow fit is called best_tree_fit). You might get a warning message, but the tree will show. You will likely find, when you look at the actual/predicted plot or the residual plot, that the tree model does not perform very well and only predicts a few discrete outcome values. That’s also noticeable when you compare the RMSE for the tree model and the null model: they are very similar.

For the LASSO model, you will likely find that it performs a bit better than the tree, but not a lot. If you want to see a plot of how the number of predictors included in the LASSO model changes with the tuning parameter, you can extract the model from your final fit (say it’s called best_lasso_fit) with x <- best_lasso_fit$fit$fit$fit and then plot(x, "lambda"). I know, it’s awful code to have to dig that deep into the best_lasso_fit object. You might also be able to use the extract_fit_engine() function to get the underlying fit object produced by LASSO, e.g., x <- extract_fit_engine(best_lasso_fit) might work (this is new in tidymodels and I haven’t fully tried it yet).

For the random forest model, you will likely again find that it performs a bit better than the tree, but not a lot. The tuning setup might require some fiddling around; I had a few initial tries where the whole tuning failed. For a model like a random forest, all variables stay in the model, but there are ways to look at the variables that are most important. If you want to do that, you again need to pull out the fit object. Say it’s called best_rf_fit; you can do that with x <- best_rf_fit$fit$fit$fit and then use the vip() function from the vip package to plot the importance of the variables. Alternatively, extract_fit_engine() should also work.
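A short sketch of the variable-importance plot. Note the assumption in the comment: for ranger to store importance scores, you need to have set an importance mode in the engine (e.g., set_engine("ranger", importance = "impurity")); the best_rf_fit name is hypothetical, as above.

  library(vip)

  # assumes the random forest spec used something like
  # set_engine("ranger", importance = "impurity"),
  # otherwise there are no importance scores to plot
  rf_engine_fit <- extract_fit_engine(best_rf_fit)  # or best_rf_fit$fit$fit$fit
  vip(rf_engine_fit)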

Model selection

Once you have implemented the above steps for the 3 models, you should have a “best” fit for each one based on the tuning process. For each best model, you should have the performance, the uncertainty around that performance measure, and some diagnostic plots. While for any real research project you would likely want to look deeper (e.g., at uncertainty in predictions instead of just overall performance), for now this is enough. Pick one of the three models and explain why you picked it. There is no single correct answer; I just want you to practice reasoning about why you are doing something, in this case justifying why you picked the model you did.

Final evaluation

Once you have picked your final model, you are allowed to fit it to the test data once, and only once, and check how well it performs on that data. This gives you a somewhat honest estimate of how the model might perform on new, unseen data. You can do that using the last_fit() function applied to the model you end up choosing. For the final model applied to the test set, report the performance and the diagnostic plots as above.
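A sketch of this final step, assuming (hypothetically) that you picked the tree model and that final_tree_wf and data_split are the objects from the earlier sketches:

  # final, one-time evaluation on the test data
  final_fit <- final_tree_wf %>%
    last_fit(data_split)

  # test-set performance and predictions (for the diagnostic plots)
  collect_metrics(final_fit)
  test_preds <- collect_predictions(final_fit)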

And that concludes what is likely a fairly long exercise. The code itself is not that long, but it will take you time to cobble it together from the tidymodels tutorial and possibly other sources.

Test and website update

Make sure your analysis and all results are nicely documented and everything runs/renders correctly. Then, add the newly created Quarto document as an entry into your _quarto.yml file, as a sub-menu of Flu fitting. Call it Machine Learning. Recompile your portfolio website and make sure everything works and shows up as expected. Then commit and push.

Since this is part of your portfolio site, you don’t need to post anything, I know where to find it. Therefore there is no exercise Slack channel for this module.

Discussion

Let’s use this week’s discussion to talk a bit more about the projects. Last week you received feedback on your project from a few classmates; this week I want everyone to have a chance to hear about and comment on all projects.

To that end, post a summary description of your project to the discussion channel. Briefly describe your data, your question(s), your (planned) methods and (expected) results, sort of like an abstract for a paper. It doesn’t have to be very long, but it should contain enough detail that others can get the overall idea. This way, everyone can get a quick glimpse of what everyone else is doing. Also provide a link to the repo for easy access (if your project is in a public repository; for private repos, you can skip this).

Then, also mention specific questions/struggles/concerns you might have regarding your project. It’s quite likely that there is some overlap in approaches and questions among you and your classmates. Hopefully, through this discussion you can provide each other with some help and input.

Post this by Wednesday. Then read the summaries of your classmates’ projects. If you see a question/topic that you think you might be able to help with, either by answering a specific question, or by providing some general feedback, do so.

And of course, as you look at each other’s projects, it’s a 2-way street. You can provide help/feedback, but you are also welcome to take inspiration from what others are doing and integrate some of it into your own project.

Since you get feedback from me at the other submission points, I plan to largely stay out of this week’s discussion 😁. So this is all you helping each other. Each of you has already seen a few projects and will be reviewing 2 projects in depth at the end, but hopefully the discussion this week gives you a bit of an idea of what everyone else is doing and lets you provide feedback to others earlier in the process.