Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
For this exercise, we will implement some of the machine learning models covered in this module, and some of the general approaches we covered previously. This is a continuation of the data analysis exercise you’ve been working on over the last few weeks, analyzing flu data. The goal is that by the end of this exercise, you will have an example of a nice full analysis. So let’s get started!
Note: This exercise is likely time-consuming and you might get stuck. Please plan accordingly. Start early and ask for help in Slack.
This is a solo-exercise. Make sure your previous repo for this exercise is fully up-to-date and synced.
The goal is to have both a complete and well-organized data analysis example at the end of the exercise. So as you go through this exercise, in addition to doing the indicated tasks and writing the code, keep organizing things.
You want to remove anything that’s not part of the exercise (any files/folders from the template). You also want to make sure you have readme files or other documentation that briefly explain what files are where and in which order to run them. Make sure your R/Rmd scripts are easy to understand and that they all run.
As you re-organize, you can decide the mix of R and Rmd files and what you want to combine and what to split. However you do it, make sure it’s documented and easy to understand.
We previously covered pre-processing, but haven’t specifically looked at that in an exercise yet. So let’s add that part. This can go in various places in the code: you can, for instance, make it part of the cleaning code, or include it in the recipe part of your tidymodels workflow (or a mix).
You probably realized that some of the variables contain the same information in slightly different ways. Specifically, Weakness, Cough and Myalgia exist both as a severity score and as Yes/No. Further, there are 2 variables for cough yes/no. These variables are strongly correlated and thus, at a minimum, don’t help when you model; they are more likely to confuse your model and lead to errors/warnings/poor model performance (e.g. “predictions from a rank-deficient fit may be misleading”). So let’s fix that. For those symptoms where you have both multiple levels and yes/no, remove all the yes/no versions. That should remove 4 variables.
We also want to code the 3 symptom severity factors as ordinal, so make sure they are coded as ordered factors. The order should of course be None < Mild < Moderate < Severe.
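As a rough sketch, removing the redundant variables and recoding the severity scores could look like the code below. The column names for the yes/no versions (WeaknessYN, CoughYN, CoughYN2, MyalgiaYN) and the data frame name d are assumptions here; adapt them to whatever your cleaned data actually uses.

```r
library(dplyr)

# Drop the 4 redundant yes/no versions of symptoms that also exist
# as severity scores (names below are placeholders - adjust to your data)
d <- d %>%
  select(-c(WeaknessYN, CoughYN, CoughYN2, MyalgiaYN))

# Code the 3 severity scores as ordered factors:
# None < Mild < Moderate < Severe
severity_levels <- c("None", "Mild", "Moderate", "Severe")
d <- d %>%
  mutate(across(c(Weakness, Cough, Myalgia),
                ~ factor(.x, levels = severity_levels, ordered = TRUE)))
```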
If you look at your data, you’ll see that some predictors are fairly unbalanced, with most patients reporting No and only a few Yes. If almost everyone is in one category and almost nobody in the others, that often (but not always) means those predictor variables are not very helpful in fitting/predicting the outcome. Furthermore, if you do cross-validation and one of your samples happens to not include one predictor level at all, things might go wrong when the model is applied to the holdout sample. (There are methods to deal with this specific problem, but they are more complicated than what we will do here.) Thus, it is worth considering whether we want to remove them. The recipes package in tidymodels has the function step_nzv(), which can do that for you automatically. If you have lots of predictors, that might be good to use. But it’s often better to decide manually for each variable, based on your scientific expertise, whether you want to remove it or not. We’ll take that approach here. After looking at the data, we decide to remove those binary predictors that have <50 entries in one category (there are 2). Write code to remove them.
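One way this could look is sketched below. Which two variables fall below the 50-entry threshold depends on your data, so inspect the counts first; the names Hearing and Vision in the final line are placeholders, not the answer.

```r
library(dplyr)

# Check how unbalanced each factor variable is
# (smallest category count per variable; adjust if your yes/no
# variables are stored as characters rather than factors)
d %>%
  summarize(across(where(is.factor), ~ min(table(.x))))

# Drop the two binary predictors with fewer than 50 entries in one
# category - replace the placeholder names with what you found above
d <- d %>%
  select(-c(Hearing, Vision))
```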
You should end up with a data frame that has 730 observations and 26 variables. This is the dataset we’ll use for modeling.
I suggest you put all the analysis code into one R Markdown file. Note that for this exercise, some of your code might take a long time to run. In such situations, it is often good to have an alternative setup, where some of the heavy computations are done by separate R scripts/functions, and the results are saved, then pulled into and displayed in the main R Markdown file. How you want to do it here is up to you, as long as the R Markdown file shows all the main results and your code/file(s) is/are well documented.
You will likely be able to recycle bits of our earlier code, but you are also welcome to start fresh.
To keep things a bit simpler, for this exercise we focus on a single outcome, the continuous, numerical value of body temperature. Thus, we are fitting regression models here. Once you have finished the whole workflow for that outcome, you are welcome to (optionally) do it again and run a classification for the categorical outcome of Nausea. Since we are doing a regression, and we don’t have any specific expert knowledge that tells us how we should build our performance metric, we’ll go with one of the usual ones, namely RMSE. But remember to always think about your performance metric and don’t just use the default without at least considering other options.
Start by setting the random seed to 123. This should make everything reproducible and everyone should get the same results.
Split the dataset into 70% training, 30% testing. Also use the outcome BodyTemp as stratification. This allows for more balanced outcome values in the train and test sets. See e.g. section 3 of the Get Started tutorial.
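A minimal sketch of the seed and the split, assuming your cleaned data frame is called d (the object names below are just suggestions):

```r
library(tidymodels)

set.seed(123)  # make the random split reproducible

# 70/30 train/test split, stratified on the outcome BodyTemp
data_split <- initial_split(d, prop = 0.7, strata = BodyTemp)
train_data <- training(data_split)
test_data  <- testing(data_split)
```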
We want to do 5-fold cross-validation, 5 times repeated. (There’s no specific reason to do this 5x5 pattern, other than to show you that there are different ways to pick the samples, and that I want you to not use the default.) For the CV folds, we also want to stratify on BodyTemp, as we did for the main train/test split. Use the vfold_cv() function to create a resample object for the training data with these specifications.
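That could look like this (the object name folds and the train_data object from the earlier sketch are assumptions):

```r
# 5-fold cross-validation, repeated 5 times, stratified on BodyTemp
folds <- vfold_cv(train_data, v = 5, repeats = 5, strata = BodyTemp)
```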
Create a recipe for the data and fitting. You won’t need to do much, just make sure you code the categorical variables as dummy variables, otherwise things might not work smoothly. For that, you want to use the step_dummy() function and pick all nominal predictor variables (which are actually all predictor variables here, since the only continuous variable is our outcome).
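A sketch of such a recipe, assuming the training data object from before and a recipe name of flu_rec (newer recipes versions provide the all_nominal_predictors() selector used here; older code in the tutorial may instead use all_nominal() combined with -all_outcomes()):

```r
# BodyTemp as outcome, everything else as predictor,
# with all nominal predictors turned into dummy variables
flu_rec <- recipe(BodyTemp ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())
```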
Write some code to compute the performance of a null model, i.e. a “model” that doesn’t use any predictor information. For a continuous outcome and RMSE as our metric, a null model is one that always predicts the mean of the outcome. Compute the RMSE for both training and test data for such a “model”. We’ll use that later to compare it to the performance of our real models. Of course, we expect/hope our real models that use predictor information to be better. If they aren’t, that means they are no good.
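One simple way to do this, assuming the train_data/test_data objects from the earlier sketch and using yardstick’s rmse_vec() (loaded as part of tidymodels):

```r
# Null model: always predict the mean of the training outcome
mean_temp <- mean(train_data$BodyTemp)

# RMSE of the null model on the training data
rmse_vec(truth = train_data$BodyTemp,
         estimate = rep(mean_temp, nrow(train_data)))

# RMSE of the null model on the test data
# (still predicting the training mean)
rmse_vec(truth = test_data$BodyTemp,
         estimate = rep(mean_temp, nrow(test_data)))
```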
We’ll fit a tree, a LASSO model, and a random forest. I chose those because they are used in the tutorial on the tidymodels website. You can of course add further models. For the tree, see the Tune model parameters section of the Get Started tutorial. For LASSO and the random forest, check out the Case Study section of the Get Started tutorial. Note that you will need to adjust the code for our scenario, since we have a continuous outcome.
If you follow the tutorial, you’ll likely use the packages rpart, glmnet and ranger to fit those 3 models. Make sure they are installed and loaded.
I suggest you write code for each model separately. A lot of the code will look similar, so once you have the first one set up, the other two should be easier. They mainly differ in the commands specifying the tuning parameters and the tuning grid.
Each of these models requires some tuning. For the choices regarding the tuning parameters, you can follow the examples. Most of the models have more things that can be tuned, but for now you can stick to what they show in the tutorial. Follow the examples by setting up a workflow, setting a tuning grid, and then using the tune_grid() function to tune the model using cross-validation.
Note that the tuning part, i.e., calling tune_grid(), might take a good bit of time to run (possibly minutes).
The steps (blocks of code) you should have here are 1) model specification, 2) workflow definition, 3) tuning grid specification and 4) tuning using cross-validation and the tune_grid() function.
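As an illustration for the tree (adapted from the Tune model parameters tutorial for a continuous outcome), the four steps could look roughly like this. The object names, and the flu_rec recipe and folds resamples from the earlier sketches, are assumptions; the other two models follow the same pattern with different specifications and grids.

```r
library(rpart)  # engine used by the decision tree

# 1) model specification, with the parameters to tune marked by tune()
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# 2) workflow: model + recipe
tree_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(flu_rec)

# 3) tuning grid over the two parameters
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)

# 4) tune with cross-validation (this step can take a while)
tree_tune_res <- tree_wf %>%
  tune_grid(resamples = folds,
            grid = tree_grid,
            metrics = metric_set(rmse))
```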
Once you have done the tuning, you can take a look at some diagnostics by sending the object returned from the tune_grid() function to autoplot(). For instance, if you tuned the tree and saved the result as tree_tune_res, you can run tree_tune_res %>% autoplot(). Depending on the model, the plot will be different, but in general it shows you what happened during the tuning process.
Next, you want to get the model that the tuning process has determined is the best. You can get the best-fit model with select_best() and finalize_workflow(), and then do one more fit to the training data with this final workflow using the fit() function. Follow the examples in the tutorial.
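For the tree, continuing the sketch from above (object names are again just suggestions):

```r
# Pick the best tuning parameter combination (lowest cross-validated RMSE)
best_tree <- tree_tune_res %>% select_best(metric = "rmse")

# Finalize the workflow with those values and fit once more to the training data
best_tree_wf  <- tree_wf %>% finalize_workflow(best_tree)
best_tree_fit <- best_tree_wf %>% fit(data = train_data)
```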
To evaluate the final fit for each model, do the following.
Make two plots, one that shows model predictions from the tuned model versus actual outcomes, and one that plots the residuals. The actual outcomes you get straight from the data; the predicted outcomes you can get by applying the predict() function to the final fit.
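A possible version of those two plots for the tree fit from the sketch above (ggplot2 comes loaded with tidymodels; the object names are assumptions):

```r
# Predictions from the final fit on the training data
tree_preds <- predict(best_tree_fit, new_data = train_data) %>%
  bind_cols(train_data %>% select(BodyTemp))

# Predicted vs. observed outcomes
ggplot(tree_preds, aes(x = BodyTemp, y = .pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Observed body temperature", y = "Predicted")

# Residuals vs. predicted values
ggplot(tree_preds, aes(x = .pred, y = BodyTemp - .pred)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted", y = "Residual")
```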
Look at/print the model performance and compare it with the null model (still only on training data). Here, we want the performance of the tuned, best-fitting model on the CV dataset (we are not yet touching the test data). You can get that, for instance, with the show_best() function, which gives you the mean cross-validated performance for the best models. It also shows the standard error of that performance estimate. Compare that model performance with the null model.
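For example, continuing with the tree tuning object from the earlier sketch:

```r
# Mean cross-validated RMSE (and its standard error) for the best tree model
tree_tune_res %>% show_best(metric = "rmse", n = 1)
```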
The mean and standard error of the performance give you a measure of overall performance and the variability in that measure. The plots show you if there are any systematic deviations between model and data. Taken together, these can be compared for the different models and based on those (and, as wanted, other considerations) a final model can be chosen.
Implement the model tuning/fitting and evaluating steps for all 3 models.
Once you have implemented the above steps for the 3 models, you should have a “best” fit for each one based on the tuning process. For each best model, you should have performance, uncertainty around the performance measure, and some diagnostic plots. While for any real research project you would likely want to look deeper (e.g. at uncertainty in predictions instead of just overall performance), for now this is enough. Pick one of the three models. Explain why you picked it. There is no single correct answer; I just want you to learn to reason/argue for why you are doing something: in this case, justify why you are picking the model you do.
Once you have picked your final model, you are allowed to once – and only once – fit it to the test data and check how well it performs on that data. This gives you a somewhat honest estimate of how the model might perform on new, unseen data. You can do that using the last_fit() function applied to the model you end up choosing. For the final model applied to the test set, report performance and the diagnostic plots as above.
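A sketch of that final step, using whichever finalized workflow you chose (shown here with the tree workflow and the data_split object from the earlier sketches, both of which are assumptions):

```r
# One-time evaluation on the test data using the chosen workflow
final_fit <- best_tree_wf %>% last_fit(split = data_split)

# Performance metrics on the test set
final_fit %>% collect_metrics()

# Predictions on the test set, for the diagnostic plots
final_preds <- final_fit %>% collect_predictions()
```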
And that concludes what is likely a fairly long exercise. The code itself is not that long, but it will take you time to cobble it together from the tidymodels tutorial and possibly other sources.
Make sure your analysis and all results are nicely documented and when knitted, your main analysis Rmd file produces a nice-looking HTML page. In a future exercise, you will add this to your portfolio. No need to post about the exercise, I know where to find it.
Let’s use this week’s discussion to talk about projects. Post a short summary of your project, very briefly describing your data, your question(s), your (planned) methods and (expected) results. Sort of like an abstract for a paper. This way, everyone can get a quick glimpse as to what everyone else is doing. Also provide a link to the repo for easy access.
Then, also post any specific questions/struggles/concerns you might have regarding your project. It’s quite likely that there is some overlap in approaches and questions among you and your classmates. Hopefully, through this discussion you can provide each other with some help and input.
Post this by Wednesday. Then read the summaries of your classmates’ projects. If you see a question/topic that you think you might be able to help with, either by answering a specific question, or by providing some general feedback, do so.
And of course, as you look at each other’s projects, it’s a 2-way street. You can provide help/feedback, but you are also welcome to take inspiration from what others are doing and integrate some of that into your own project.
Since you get feedback from me at each of your submission points, I plan to stay out of this week’s discussion 😁. So this is all about you helping each other. Each of you will be reviewing 2 projects in a lot of depth at the end, but hopefully with this week’s discussion you get a bit of an idea of what everyone else is doing, and you can provide feedback to others earlier.
Comments for specific models
Here are some more suggestions and hints. Most of this is optional but worth trying.
For the tree model, if you want to plot the tree, you can use the rpart.plot package and run this command: rpart.plot(extract_fit_parsnip(best_tree_fit)$fit) (assuming your result from the final workflow fit is called best_tree_fit). You might get a warning message, but the tree will show. You will likely find, when you look at the actual/predicted plot or the residual plot, that the tree model does not perform very well, and the model only predicts a few discrete outcome values. That’s also noticeable when you compare RMSE for the tree model and the null model: they are very similar.

For the LASSO model, you will likely find that it performs a bit better than the tree, but not a lot. If you want to see a plot of how the number of predictors included in the LASSO model changes with the tuning parameter, you can extract the model from your final fit (say it’s called best_lasso_fit) with x <- best_lasso_fit$fit$fit$fit and then run plot(x, "lambda"). I know, this is awful code, having to dig that deep into the best_lasso_fit object. Maybe there’s a better way of doing this with some tidymodels command (potentially some combination of $fit and extract_fit_parsnip()). If you know one, use it.

For the random forest model, you will likely again find that it performs a bit better than the tree, but not a lot. The tuning setup might require some fiddling around; I had a few initial tries where the whole tuning failed. For a model like random forest, all variables stay in the model. There are ways to look at the variables that are most important. If you want to do that, you again need to pull out the fit object. Say it’s called best_rf_fit, you can do that with x <- best_rf_fit$fit$fit$fit and then use the vip() function from the vip package to plot the importance of the variables.