Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
For this exercise, we will implement some of the machine learning models covered in this module, and some of the general approaches we covered previously. This is a continuation of the data analysis exercise you’ve been working on over the last few weeks, analyzing flu data. The goal is that by the end of this exercise, you will have an example of a nice full analysis. So let’s get started!
Note: This exercise is likely time-consuming and you might get stuck. Please plan accordingly. Start early and ask for help in Slack.
This is a solo exercise. Make sure your previous repo for this exercise is fully up to date and synced.
The goal is to have both a complete and well-organized data analysis example at the end of the exercise. So as you go through this exercise, in addition to doing the indicated tasks and writing the code, keep organizing things.
You want to remove anything that’s not part of the exercise (any files/folders from the template). You also want to make sure you have readme files or other documentation that briefly explain what files are where and in which order to run them. Make sure your R/Rmd scripts are easy to understand and that they all run.
As you re-organize, you can decide the mix of R and Rmd files and what you want to combine and what to split. However you do it, make sure it’s documented and easy to understand.
We previously covered pre-processing, but haven’t specifically looked at it in an exercise yet. So let’s add that part. This can be added in various places in the code. You can, for instance, make it part of the cleaning code, or include it in the recipe part of your tidymodels workflow (or a mix).
You probably realized that some of the variables contain the same information in slightly different ways. Specifically, Myalgia exists both as a severity score and as Yes/No. Further, there are 2 variables for cough Yes/No. These variables are strongly correlated and thus, at a minimum, don’t help when you model; they are actually more likely to confuse your model and lead to errors/warnings/poor model performance (e.g., “predictions from a rank-deficient fit may be misleading”). So let’s fix that. For those symptoms where you have both multiple severity levels and a Yes/No version, remove all the Yes/No versions. That should remove 4 variables.
We also want to code the 3 symptom severity factors as ordinal, so make sure they are coded as ordered factors. The order should of course be None < Mild < Moderate < Severe.
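A minimal sketch of how these two steps might look with dplyr, assuming the data frame is called d; the variable names below (WeaknessYN, CoughYN, CoughYN2, MyalgiaYN, Weakness, CoughIntensity, Myalgia) are only examples and need to be adjusted to your actual data:

```r
library(dplyr)

# Example/hypothetical column names -- adjust to your actual data.
# Drop the Yes/No duplicates for symptoms that also exist as severity scores.
d <- d %>%
  select(-WeaknessYN, -CoughYN, -CoughYN2, -MyalgiaYN)

# Code the three severity variables as ordered factors: None < Mild < Moderate < Severe
severity_levels <- c("None", "Mild", "Moderate", "Severe")
d <- d %>%
  mutate(across(c(Weakness, CoughIntensity, Myalgia),
                ~ factor(.x, levels = severity_levels, ordered = TRUE)))
```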
If you look at your data, you’ll see that some predictors are fairly unbalanced, with most patients reporting No and only a few reporting Yes. If almost everyone is in one category and almost nobody in the others, that often (but not always) means those predictor variables are not very helpful in fitting/predicting the outcome. Furthermore, if you do cross-validation and one of your samples happens to not include one predictor level at all, things might go wrong when the fit is applied to the holdout sample. (There are methods to deal with this specific problem, but they are more complicated than what we will do here.) Thus, it is worth considering whether we want to remove them. The recipes package in tidymodels has the function step_nzv(), which can do that for you automatically. If you have lots of predictors, that might be good to use. But it’s often better to decide manually for each variable, based on your scientific expertise, whether you want to remove it or not. We’ll take that approach here. After looking at the data, we decide to remove those binary predictors that have <50 entries in one category (there are 2). Write code to remove them.
You should end up with a data frame that has 730 observations and 26 variables. This is the dataset we’ll use for modeling.
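One possible sketch of this step, continuing with the hypothetical data frame d from above; it drops any two-level factor predictor with fewer than 50 entries in one category and then checks the dimensions:

```r
library(dplyr)
library(purrr)

# Find binary (two-level factor) variables with fewer than 50 entries in one category
low_count_vars <- d %>%
  select(where(~ is.factor(.x) && nlevels(.x) == 2)) %>%
  map_lgl(~ min(table(.x)) < 50) %>%
  which() %>%
  names()

d <- d %>% select(-all_of(low_count_vars))

dim(d)  # should show 730 observations and 26 variables
```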
I suggest you put all the analysis code into one R Markdown file. Note that for this exercise, some of your code might take a long time to run. In such situations, it is often good to have an alternative setup, where some of the heavy computations are done by separate R scripts/functions and the results are saved, then pulled into and displayed in the main R Markdown file. How you want to do it here is up to you, as long as the R Markdown file shows all the main results and your code/file(s) is/are well documented.
You will likely be able to recycle bits of our earlier code, but you are also welcome to start fresh.
To keep things a bit simpler, for this exercise we focus on a single outcome, the continuous, numerical value of body temperature. Thus, we are fitting regression models here. Once you have finished the whole workflow for that outcome, you are welcome to (optionally) do it again and run a classification for the categorical outcome of Nausea. Since we are doing a regression, and we don’t have any specific expert knowledge that tells us how we should build our performance metric, we’ll go with one of the usual ones, namely RMSE. But remember to always think about your performance metric and don’t just use the default without at least considering other options.
Start by setting the random seed to 123. This should make everything reproducible and everyone should get the same results.
Split the dataset into 70% training and 30% testing. Also use the outcome BodyTemp as stratification. This allows for more balanced outcome values in the train and test sets. See, e.g., section 3 of the Get Started tutorial.
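A minimal sketch with tidymodels/rsample, again assuming the processed data frame is called d:

```r
library(tidymodels)

set.seed(123)

# 70/30 train/test split, stratified on the outcome
data_split <- initial_split(d, prop = 0.7, strata = BodyTemp)
train_data <- training(data_split)
test_data  <- testing(data_split)
```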
We want to do 5-fold cross-validation, 5 times repeated. (There’s no specific reason to do this 5x5 pattern, other than to show you that there are different ways to pick the sample, and that I want you to not use the default.) For the CV folds, we also want to stratify on BodyTemp, as we did for the main train/test split. Use the vfold_cv() function to create a resample object for the training data with these specifications.
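For instance (using the hypothetical train_data object from the split above):

```r
# 5-fold cross-validation, repeated 5 times, stratified on the outcome
folds <- vfold_cv(train_data, v = 5, repeats = 5, strata = BodyTemp)
```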
Create a recipe for the data and fitting. You won’t need to do much, just make sure you code the categorical variables as dummy variables, otherwise things might not work smoothly. For that, you want to use the step_dummy() function and pick all nominal predictor variables (which are actually all predictor variables here, since the only continuous variable is our outcome).
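A sketch of such a recipe (flu_rec is just a placeholder name):

```r
# Model BodyTemp as a function of all other variables,
# turning every nominal predictor into dummy variables
flu_rec <- recipe(BodyTemp ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())
```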
Write some code to compute the performance of a null model, i.e., a “model” that doesn’t use any predictor information. For a continuous outcome and RMSE as our metric, a null model is one that always predicts the mean of the outcome. Compute the RMSE for both training and test data for such a “model”. We’ll use that later to compare it to the performance of our real models. Of course, we expect/hope our real models that use predictor information to be better. If they aren’t, that means they are no good.
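One way to sketch this by hand with yardstick, continuing with the train/test objects defined above:

```r
# Null model: always predict the mean outcome of the training data
mean_bodytemp <- mean(train_data$BodyTemp)

# RMSE of this "model" on the training and test data
null_rmse_train <- train_data %>%
  mutate(null_pred = mean_bodytemp) %>%
  rmse(truth = BodyTemp, estimate = null_pred)

null_rmse_test <- test_data %>%
  mutate(null_pred = mean_bodytemp) %>%
  rmse(truth = BodyTemp, estimate = null_pred)

null_rmse_train
null_rmse_test
```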
We’ll fit a tree, a LASSO model, and a random forest. I chose those because they are used in the tutorial on the tidymodels website. You can of course add further models. For the tree, see the Tune model parameters section of the Get Started tutorial. For LASSO and the random forest, check out the Case Study section of the Get Started tutorial. Note that you will need to adjust the code for our scenario since we have a continuous outcome.
If you follow the tutorial, you’ll likely use the rpart, glmnet, and ranger packages to fit those 3 models. Make sure they are installed and loaded.
I suggest you write code for each model separately. A lot of the code will look similar, so once you have the first one set up, the other two should be easier. They mainly differ in the commands specifying the tuning parameters and the tuning grid.
Each of these models requires some tuning. For the choices regarding the tuning parameters, you can follow the examples. Most of the models have more things that can be tuned, but for now you can stick to what they show in the tutorial. Follow the examples by setting up a workflow, setting a tuning grid, and then using the tune_grid() function to tune the model using cross-validation.
Note that the tuning part, i.e., calling tune_grid(), might take a good bit of time to run (possibly minutes).
The steps (blocks of code) you should have here are 1) model specification, 2) workflow definition, 3) tuning grid specification, and 4) tuning using cross-validation with the tune_grid() function. A rough sketch of these steps for the decision tree is shown below.
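As a hedged illustration (adapted from the Get Started tutorial and adjusted for a continuous outcome), the four steps for the tree might look roughly like this; names such as tree_spec, tree_wf, tree_grid, and tree_tune_res are placeholders, and flu_rec and folds come from the earlier sketches:

```r
# 1) Model specification: a decision tree for regression, with two tuning parameters
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# 2) Workflow: combine the model specification and the recipe
tree_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(flu_rec)

# 3) Tuning grid: a regular grid over the tuning parameters
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)

# 4) Tune the model with cross-validation
tree_tune_res <- tree_wf %>%
  tune_grid(resamples = folds, grid = tree_grid, metrics = metric_set(rmse))
```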
Once you have done the tuning, you can take a look at some diagnostics by sending the object returned from the tune_grid() function to autoplot(). For instance, if you tuned the tree and saved the result as tree_tune_res, you can run tree_tune_res %>% autoplot(). Depending on the model, the plot will be different, but in general it shows you what happened during the tuning process.
Next, you want to get the model that the tuning process has determined is the best. You can select the best tuning parameters with select_best(), plug them into the workflow with finalize_workflow(), and then do one more fit to the training data with this final workflow using the fit() function. Follow the examples in the tutorial.
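For the tree example above, this might look something like:

```r
# Pick the best tuning-parameter combination (by RMSE) and finalize the workflow
tree_best <- tree_tune_res %>% select_best(metric = "rmse")
tree_final_wf <- tree_wf %>% finalize_workflow(tree_best)

# Fit the finalized workflow one more time to the training data
tree_final_fit <- tree_final_wf %>% fit(data = train_data)
```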
To evaluate the final fit for each model, do the following.
Make two plots, one that shows model predictions from the tuned model versus actual outcomes, and one that plots the residuals. The actual outcomes you get straight from the data; the predicted outcomes you can get by applying the predict() function to the final fit.
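A possible sketch with ggplot2, using the hypothetical tree_final_fit object from above:

```r
library(ggplot2)

# Predictions on the training data, combined with the observed outcomes and residuals
tree_pred <- predict(tree_final_fit, new_data = train_data) %>%
  bind_cols(train_data %>% select(BodyTemp)) %>%
  mutate(resid = BodyTemp - .pred)

# Predicted vs. actual outcomes (a perfect model would fall on the diagonal)
ggplot(tree_pred, aes(x = BodyTemp, y = .pred)) +
  geom_point() +
  geom_abline(linetype = "dashed")

# Residuals vs. predictions (ideally no systematic pattern)
ggplot(tree_pred, aes(x = .pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```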
Look at/print the model performance and compare it with the null model (still only on training data). Here, we want the performance of the tuned, best-fitting model on the CV folds (we are not yet touching the test data). You can get that, for instance, with the show_best() function, which gives you the mean cross-validated performance for the best models. It also shows the standard error for that performance estimate. Compare that model performance with the null model.
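For example:

```r
# Mean cross-validated RMSE (and its standard error) for the best tuning-parameter combination
tree_tune_res %>% show_best(metric = "rmse", n = 1)
```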
The mean and standard error of the performance give you a measure of overall performance and of the variability in that measure. The plots show you if there are any systematic deviations between model and data. Taken together, these can be compared for the different models, and based on those (and, if wanted, other considerations) a final model can be chosen.
Implement the model tuning/fitting and evaluating steps for all 3 models.
Once you have implemented the above steps for the 3 models, you should have a “best” fit for each one based on the tuning process. For each best model, you should have the performance, the uncertainty around the performance measure, and some diagnostic plots. While for any real research project you would likely want to look deeper (e.g., at uncertainty in predictions instead of just overall performance), for now this is enough. Pick one of the three models. Explain why you pick it. There is no single correct answer; I just want you to learn to reason/argue for why you are doing something: in this case, justify why you are picking the model you do.
Once you picked your final model, you are allowed to once – and only once – fit it to the test data and check how well it performs on that data. This gives you a somewhat honest estimate of how the model might perform for new, unseen data. You can do that using the last_fit() function applied to the model you end up choosing. For the final model applied to the test set, report performance and the diagnostic plots as above.
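A sketch of this last step, assuming the chosen model is the tree from the examples above (using the hypothetical tree_final_wf and data_split objects):

```r
# Fit the chosen workflow on the training data and evaluate it once on the test data
final_fit <- tree_final_wf %>%
  last_fit(split = data_split, metrics = metric_set(rmse))

# Test-set performance and predictions (for the diagnostic plots described above)
final_fit %>% collect_metrics()
final_pred <- final_fit %>% collect_predictions()
```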
And that concludes what is likely a fairly long exercise. The code itself is not that long, but it will take you time to cobble it together from the tidymodels tutorial and possibly other sources.
Make sure your analysis and all results are nicely documented and when knitted, your main analysis Rmd file produces a nice-looking HTML page. In a future exercise, you will add this to your portfolio. No need to post about the exercise, I know where to find it.
Let’s use this week’s discussion to talk about projects. Post a short summary of your project, very briefly describing your data, your question(s), your (planned) methods and (expected) results. Sort of like an abstract for a paper. This way, everyone can get a quick glimpse as to what everyone else is doing. Also provide a link to the repo for easy access.
Then, also post any specific questions/struggles/concerns you might have regarding your project. It’s quite likely that there is some overlap in approaches and questions among you and your classmates. Hopefully, through this discussion you can provide each other with some help and input.
Post this by Wednesday. Then read the summaries of your classmates’ projects. If you see a question/topic that you think you might be able to help with, either by answering a specific question, or by providing some general feedback, do so.
And of course, as you look at each other’s projects, it’s a 2-way street. You can provide help/feedback, but you are also welcome to take inspiration from what others are doing and integrate some of that into your own project.
Since you get feedback from me at each of your submission points, I plan to stay out of this week’s discussion 😁. So this is all about you helping each other. Each of you will be reviewing 2 projects in a lot of depth at the end, but hopefully with the discussion this week you get a bit of an idea of what everyone else is doing, and you can provide feedback to others earlier.