Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
For this exercise, we will implement some of the machine learning
models covered in this module, and some of the general approaches we
covered previously. This is a continuation of the
Flu Analysis
exercise you’ve been working on over the last
few weeks. The goal is that by the end of this exercise, you will have
an example of a nice full analysis. So let’s get started!
Note: This exercise is likely time-consuming and you might get stuck. Please plan accordingly. Start early and ask for help in Slack.
This is a solo exercise, and it's part of your portfolio.
The goal is to have both a complete and well-organized data analysis example at the end of the exercise. So as you go through this exercise, in addition to doing the indicated tasks and writing the code, keep organizing things as needed. You want to make sure you have readme files or other documentation that briefly explain what files are where and in which order to run them. Make sure your R/Quarto files are easy to understand and that they all run.
As you re-organize, you can decide the mix of R and Quarto files and what you want to combine and what to split. However you do it, make sure it’s documented and easy to understand.
We previously covered pre-processing, but haven't specifically looked at that yet in an exercise. So let's add that part. This can be added in various places to the code/files you already created previously. You can, for instance, make it part of the cleaning/wrangling code, or include it in the recipe part of your tidymodels workflow (or a mix).
You probably realized that some of the variables contain the same information in slightly different ways. Specifically, Weakness, Cough and Myalgia exist both as a severity score and as Yes/No. Further, there are 2 variables for cough Yes/No. These variables are strongly correlated and thus, at a minimum, don't help when you model; they are actually more likely to confuse your model and lead to errors/warnings/poor model performance (e.g. "predictions from a rank-deficient fit may be misleading"). So let's fix that. For those symptoms where you have both multiple severity levels and Yes/No, remove all the Yes/No versions. That should remove 4 variables.
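Here is a rough sketch of what that could look like. I'm assuming your cleaned data frame is called flu_data and that the redundant Yes/No variables use the names shown below; check your own data and adjust the names as needed.

```r
# Sketch only: flu_data and the variable names below are assumptions, adjust to your data
library(dplyr)

flu_data <- flu_data %>%
  select(-c(WeaknessYN, CoughYN, CoughYN2, MyalgiaYN)) # drop the 4 redundant Yes/No versions
```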
Some of your predictors are categorical (e.g., Yes/No) and the 3 symptom severity factors are ordinal, with None < Mild < Moderate < Severe.
We can code the categorical variables as unordered factors and the others as ordered factors. I want you to do that as practice. The functions step_dummy() and step_ordinalscore() will help. See e.g. the help file example for step_ordinalscore(). Note that to deal with ordered factors in a statistical analysis, one needs special approaches (we won't actually do those for this exercise).
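As a hedged sketch of one way to do this (the variable names and the flu_data object below are assumptions, adjust them to match your data):

```r
# Sketch only: variable names and flu_data are assumptions
library(dplyr)
library(recipes)

# code the 3 severity variables as ordered factors
flu_data <- flu_data %>%
  mutate(across(c(Weakness, CoughIntensity, Myalgia),
                ~ factor(.x,
                         levels = c("None", "Mild", "Moderate", "Severe"),
                         ordered = TRUE)))

# in the recipe: turn ordered factors into scores, dummy-code the remaining nominal predictors
example_rec <- recipe(BodyTemp ~ ., data = flu_data) %>%
  step_ordinalscore(Weakness, CoughIntensity, Myalgia) %>%
  step_dummy(all_nominal_predictors())
```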
If you look at your data, you'll see that some predictors are fairly unbalanced, with most patients reporting No and only a few Yes. If almost everyone is in one category and almost nobody in the others, that often (but not always) means those predictor variables are not very helpful in fitting/predicting the outcome. Furthermore, if you do cross-validation and one of your samples happens to not include one predictor level at all, things might go wrong when applied to the holdout sample. (There are methods to deal with this specific problem, but they are more complicated than what we will do here.) Thus, it is worth considering if we want to remove them. The recipes package in tidymodels has the function step_nzv() which can do that for you automatically. If you have lots of predictors, that might be good to use. But it's often better to decide manually for each variable, based on your scientific expertise, if you want to remove it or not. We'll take that approach here. After looking at the data, we decide to remove those binary predictors that have <50 entries in one category (there are 2). Write code to remove them.
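One possible way to do this, as a sketch (flu_data and the two variable names being dropped are placeholders; use whichever two variables your inspection identifies):

```r
# Sketch only: inspect the category counts, then drop the 2 very unbalanced predictors
library(dplyr)

flu_data %>%
  summarise(across(where(is.factor), ~ min(table(.x)))) %>%
  glimpse()  # smallest category count for each factor variable

flu_data <- flu_data %>%
  select(-c(Hearing, Vision))  # placeholder names for the 2 predictors with <50 in one category
```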
You should end up with a data frame that has 730 observations and 26 variables. This is the dataset we’ll use for modeling.
Start a new Quarto file for this analysis, call it machinelearing.qmd, and place it into the code folder. Note that for this exercise, some of your code might take a long time to run. This is one of the situations where it is often good to have a setup where some of the heavy computations are done by separate R scripts/functions, and the results are saved, then pulled into and displayed in the Quarto file. How you want to do it here is up to you, as long as the Quarto file shows all the main results and your code/file(s) are well documented.
To keep things a bit simpler, for this exercise we focus on a single outcome, the continuous, numerical value of Body temperature. Thus, we are fitting regression models here. Once you have finished the whole workflow for that outcome, you are welcome to (optionally) do it again and run a classification for the categorical outcome of Nausea. Since we are doing a regression, and we don't have any specific expert knowledge that tells us how we should build our performance metric, we'll go with one of the usual ones, namely RMSE. But remember to always think about your performance metric and don't just use the default without at least considering other options.
Start by setting the random seed to 123
. This should
make everything reproducible and everyone should get the same
results.
Split the dataset into 70% training, 30% testing. Also use the
outcome BodyTemp
as stratification. This allows for more
balanced outcome values in the train and test sets. See e.g., section 3 of the
Get Started tutorial.
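A minimal sketch for the seed and the split, assuming the processed data frame is called flu_data (the object names below are placeholders):

```r
# Sketch only: flu_data, data_split, train_data, test_data are placeholder names
library(tidymodels)

set.seed(123)
data_split <- initial_split(flu_data, prop = 0.7, strata = BodyTemp)
train_data <- training(data_split)
test_data  <- testing(data_split)
```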
We want to do 5-fold cross-validation, 5 times repeated. (There's no specific reason to do this 5x5 pattern, other than to show you that there are different ways to pick the sample, and that I want you to not use the default.) For the CV folds, we also want to stratify on BodyTemp, as we did for the main train/test split. Use the vfold_cv() function to create a resample object for the training data with these specifications.
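For example, continuing with the placeholder names from above, this could look like:

```r
# 5-fold CV, repeated 5 times, stratified on the outcome
set.seed(123)
folds <- vfold_cv(train_data, v = 5, repeats = 5, strata = BodyTemp)
```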
Create a recipe for the data and fitting. You won’t need to do
much, just make sure you code the categorical variables as dummy
variables, otherwise things might not work smoothly. For that, you want
to use the step_dummy()
function and pick all nominal
predictor variables (which are actually all predictor variables here,
since the only continuous variable is our outcome).
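A minimal recipe sketch (flu_rec and train_data are placeholder names):

```r
flu_rec <- recipe(BodyTemp ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())
```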
Write some code to compute the performance of a null model, i.e. a "model" that doesn't use any predictor information. For a continuous outcome and RMSE as our metric, a null model is one that always predicts the mean of the outcome. Compute the RMSE for both training and test data for such a "model". We'll use that later to compare it to the performance of our real models. Of course, we expect/hope that our real models that use predictor information will be better. If they aren't, that means they are no good.
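One way to compute this by hand with yardstick, using the placeholder names from above:

```r
# Null model: always predict the mean of the training outcome
library(yardstick)

mean_bt <- mean(train_data$BodyTemp)

rmse_vec(truth = train_data$BodyTemp, estimate = rep(mean_bt, nrow(train_data)))  # training RMSE
rmse_vec(truth = test_data$BodyTemp,  estimate = rep(mean_bt, nrow(test_data)))   # test RMSE
```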
We’ll fit a tree, a LASSO model, and a random forest. I chose those
because they are used in the tutorial on the tidymodels
website. You can of course add further models. For the tree, see the
Tune model parameters section of the Get Started
tutorial. For LASSO and the random forest, check out the Case
Study section of the Get Started
tutorial. Note that you will need to adjust the code for our
scenario since we have a continuous outcome.
If you follow the tutorial, you'll likely use the packages rpart, glmnet and ranger to fit those 3 models. Make sure they are installed and loaded.
I suggest you write code for each model separately. A lot of the code will look similar, so once you got the first one set up, the other two should be easier. They mainly differ in the commands specifying the tuning parameters and the tuning grid.
Each of these models requires some tuning. For the choices regarding
the tuning parameters, you can follow the examples. Most of the models
have more things that can be tuned, but for now you can stick to what
they show in the tutorial. Follow the examples by setting up a workflow, setting a tuning grid, and then using the tune_grid() function to tune the model using cross-validation.
Note that the tuning part, i.e., calling
tune_grid()
might take a good bit of time to run (possibly
minutes).
The steps (blocks of code) you should have here are 1) model specification, 2) workflow definition, 3) tuning grid specification, and 4) tuning using cross-validation and the tune_grid() function.
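As an illustration, a sketch of those 4 blocks for the tree model might look roughly like this. It loosely follows the Get Started tutorial, adjusted for a continuous outcome; flu_rec and folds are the placeholder objects from above.

```r
# 1) model specification
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# 2) workflow definition
tree_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(flu_rec)

# 3) tuning grid specification
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)

# 4) tuning with cross-validation
tree_tune_res <- tree_wf %>%
  tune_grid(resamples = folds,
            grid = tree_grid,
            metrics = metric_set(rmse))
```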
Once you have done the tuning, you can take a look at some diagnostics by sending the object returned from the tune_grid() function to autoplot(). For instance, if you tuned the tree and saved the result as tree_tune_res, you can run tree_tune_res %>% autoplot(). Depending on the model, the plot will be different, but in general it shows you what happened during the tuning process.
Next, you want to get the model that the tuning process has
determined is the best. You can get the best-fit model with
select_best()
and finalize_workflow()
and then
do one more fit to the training data with this final
workflow using the fit()
function. Follow the examples in
the tutorial.
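Continuing the tree example, this could look roughly like this (all names are placeholders):

```r
best_tree_params <- select_best(tree_tune_res, metric = "rmse")
final_tree_wf    <- finalize_workflow(tree_wf, best_tree_params)
best_tree_fit    <- fit(final_tree_wf, data = train_data)
```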
To evaluate the final fit for each model, do the following.
Make two plots, one that shows model predictions from the tuned
model versus actual outcomes, and one that plots residuals. The actual
outcomes you get straight from the data, the predicted outcomes you can
get by applying the predict()
function to the final
fit.
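A sketch of the two plots for the tree model (the same idea works for the other models; names are placeholders):

```r
library(ggplot2)

tree_preds <- predict(best_tree_fit, new_data = train_data) %>%
  bind_cols(train_data %>% select(BodyTemp)) %>%
  mutate(residual = BodyTemp - .pred)

# predicted versus observed
ggplot(tree_preds, aes(x = BodyTemp, y = .pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")

# residuals versus predicted
ggplot(tree_preds, aes(x = .pred, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```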
Look at/print the model performance and compare it with the null
model (still only on training data). Here, we want the performance of
the tuned, best-fitting model on the CV dataset (we are not yet touching
the test data). You can get that for instance with the
show_best()
function, which gives you the mean
cross-validated performance for the best models. It also shows the
standard deviation for the performance. Compare that model performance
with the null model.
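For instance, for the tree example from above:

```r
# mean and standard error of the cross-validated RMSE for the best tuning parameter combination
show_best(tree_tune_res, metric = "rmse", n = 1)
```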
The mean and standard deviation of the performance give you a measure of overall performance and variability in that measure. The plots show you if there are any systematic deviations between model and data. Taken together, these can be compared for the different models and based on those (and as wanted, other considerations) a final model can be chosen.
Implement the model tuning/fitting and evaluating steps for all 3 models.
Once you have implemented above steps for the 3 models, you should have a “best” fit for each one based on the tuning process. For each best model you should have performance, uncertainty around the performance measure, and some diagnostic plots. While for any real research project, you likely want to look deeper (e.g. at uncertainty in predictions instead of just overall performance), for now this is enough. Pick one of the three models. Explain why you pick it. There is no single answer that’s correct, I just want you to practice reasoning for why you are doing something: in this case justify why you are picking the model you do.
Once you picked your final model, you are allowed to once –
and only once – fit it to the test data and check how
well it performs on that data. This gives you a somewhat honest estimate
of how the model might perform for new, unseen data. You can do that
using the last_fit()
function applied to the model you end
up choosing. For the final model applied to the test set, report
performance and the diagnostic plots as above.
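A sketch, assuming the tree turned out to be your choice and data_split is the rsample split object from earlier:

```r
final_test_fit <- last_fit(final_tree_wf, split = data_split)

collect_metrics(final_test_fit)                     # test-set performance
test_preds <- collect_predictions(final_test_fit)   # use these for the diagnostic plots
```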
And that concludes what is likely a fairly long exercise. The code itself is not that long, but it will take you time to cobble it together from the tidymodels tutorial and possibly other sources. Make sure your analysis and all results are nicely documented and everything runs/renders correctly. Then, add the newly created Quarto document as an entry into your _quarto.yml file, as a sub-menu of Flu fitting. Call it Machine Learning. Recompile your portfolio website and make sure everything works and shows up as expected. Then commit and push.
Since this is part of your portfolio site, you don’t need to post anything, I know where to find it. Therefore there is no exercise Slack channel for this module.
Let's use this week's discussion to talk a bit more about the projects. Last week you received feedback on your project from a few classmates; this week I want everyone to have a chance to hear about and comment on all projects.
To that end, post a summary description of your project to the discussion channel. Briefly describe your data, your question(s), your (planned) methods and (expected) results. Sort of like an abstract for a paper. It doesn’t have to be very long, but should contain enough detail that others can get the overall idea. This way, everyone can get a quick glimpse as to what everyone else is doing. Also provide a link to the repo for easy access (if your project is in a public repository. For private repos, you can skip this).
Then, also mention specific questions/struggles/concerns you might have regarding your project. It’s quite likely that there is some overlap in approaches and questions among you and your classmates. Hopefully, through this discussion you can provide each other with some help and input.
Post this by Wednesday. Then read the summaries of your classmates’ projects. If you see a question/topic that you think you might be able to help with, either by answering a specific question, or by providing some general feedback, do so.
And of course, as you look at each other's projects, it's a 2-way street. You can provide help/feedback, but you are also welcome to take inspiration from what others are doing and integrate some of that into your own project.
Since you get feedback from me at the other submission points, I plan to largely stay out of this week's discussion 😁. So this is all you helping each other. Each of you has already seen a few projects and will be reviewing 2 projects in a lot of depth at the end, but hopefully the discussion this week gives you a bit of an idea of what everyone else is doing, and lets you provide feedback to others earlier.
Comments for specific models
Here are some more suggestions and hints. Most of this is optional but worth trying.
For the tree model, if you want to plot the tree, you can use the rpart.plot package and run rpart.plot(extract_fit_parsnip(best_tree_fit)$fit) (assuming your result from the final workflow fit is called best_tree_fit). You might get a warning message, but the tree will show. You will likely find when you look at the actual/predicted plot or the residual plot that the tree model does not perform very well, and the model only predicts a few discrete outcome values. That's also noticeable when you compare RMSE for the tree model and the null model: they are very similar.
For the LASSO model, you will likely find that it performs a bit better than the tree, but not a lot. If you want to see a plot for how the number of predictors included in the LASSO model changes with the tuning parameter, you can extract the model from your final fit (say it's called best_lasso_fit) with x <- best_lasso_fit$fit$fit$fit and then plot(x, "lambda"). I know, this is awful code, having to dig that deep into the best_lasso_fit object. You might also be able to use the extract_fit_engine() function to get the underlying fit object produced by LASSO, e.g., x <- extract_fit_engine(best_lasso_fit) might work (this is new in tidymodels and I haven't fully tried it yet).
For the random forest model, you will likely again find that it performs a bit better than the tree, but not a lot. The tuning setup might require some fiddling around; I had a few initial tries where the whole tuning failed. For a model like random forest, all variables stay in the model. There are ways to look at the variables that are most important. If you want to do that, you again need to pull out the fit object. Say it's called best_rf_fit; you can do that with x <- best_rf_fit$fit$fit$fit and then use the vip() function from the vip package to plot the importance of the variables. Alternatively, extract_fit_engine() should also work.
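Since the sketches in the main part of this exercise only showed the tree, here is a hedged sketch of what the other two model specifications might look like, roughly following the Case Study tutorial but set to regression (all names are placeholders):

```r
# LASSO: penalized linear regression with mixture = 1
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# random forest with tuned mtry and min_n
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# each specification then gets its own workflow, tuning grid, and tune_grid() call,
# analogous to the tree example earlier
```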