Get the quiz sheet for this module from the general **Assessments** page.
Fill it in, then submit to the online grading system before the
deadline.

For this exercise, we will implement some of the machine learning models covered in this module, and some of the general approaches we covered previously. This is a continuation of the data analysis exercise you’ve been working on over the last few weeks, analyzing flu data. The goal is that by the end of this exercise, you will have an example of a nice full analysis. So let’s get started!

**Note: This exercise is likely time-consuming and you might
get stuck. Please plan accordingly. Start early and ask for help in
Slack.**

This is a solo-exercise. Make sure your previous repo for this exercise is fully up-to-date and synced.

The goal is to have both a complete and well-organized data analysis example at the end of the exercise. So as you go through this exercise, in addition to doing the indicated tasks and writing the code, keep organizing things.

You want to remove anything that’s not part of the exercise (any files/folders from the template). You also want to make sure you have readme files or other documentation that briefly explain what files are where and in which order to run them. Make sure your R/Rmd scripts are easy to understand and that they all run.

As you re-organize, you can decide the mix of R and Rmd files and what you want to combine and what to split. However you do it, make sure it’s documented and easy to understand.

We previously covered pre-processing, but haven’t specifically looked
at that yet in an exercise. So let’s add that part. This can be added in
various places to the code. You can for instance make it part of the
cleaning code, or include it in the `recipe`

part of your
`tidymodels`

workflow (or a mix).

You probably realized that some of the variables contain the same
information in slightly different ways. Specifically,
`Weakness`

, `Cough`

and `Myalgia`

exist
on both a severity score and as Yes/No. Further, there are 2 variables
for cough yes/no. These variables are strongly correlated and thus at a
minimum, don’t help when you model, and are actually more likely to
confuse your model and lead to errors/warnings/poor model performance
(e.g. “predictions from a rank-deficient fit may be misleading). So
let’s fix that. For those symptoms where you have both multiple levels
and yes/no, remove all the yes/no versions. That should remove 4
variables.

We also want to code the 3 symptom severity factors as ordinal, so
make sure they are coded as **ordered factors**. The order
should of course be None < Mild < Moderate < Severe.

If you look at your data, you’ll see that some predictors are fairly
**unbalanced**, with most patients reporting
`No`

and only a few `Yes`

. If almost everyone is
in one category and almost nobody in others, that often (but not always)
means those predictor variables are not very helpful in
fitting/predicting the outcome. Furthermore, if you do cross-validation
and one of your samples happen to not include one predictor level at
all, things might go wrong when applied to the holdout-sample. (There
are methods to deal with this specific problem, but they are more
complicated than what we will do here.) Thus, it is worth considering if
we want to remove them. The `recipes`

package in
`tidymodels`

has the function `step_nzv()`

which
can do that for you automatically. If you have lots of predictors, that
might be good to use. But it’s often better to decide *manually*
for each variable based on your scientific expertise if you want to
remove it or not. We’ll take that approach here. After looking at the
data, we decide to remove those **binary** predictors that
have <50 entries in one category (there are 2). Write code to remove
them.

You should end up with a data frame that has 730 observations and 26 variables. This is the dataset we’ll use for modeling.

I suggest you put all the analysis code into one R Markdown file. Note that for this exercise, some of your code might take long to run. In such situations, it is often good to have an alternative setup, where some of the heavy computations are done by separate R scripts/functions, and the results saved, pulled into and displayed in the main R Markdown file. How you want to do it here is up to you, as long as the R Markdown file shows all the main results and your code/file(s) is/are well documented.

You will likely be able to recycle bits of our earlier code, but you are also welcome to start fresh.

To keep things a bit simpler, for this exercise we focus on a single
outcome, the **continuous, numerical** value of
**Body temperature**. Thus, we are fitting
**regression** models here. Once you finished the whole
workflow for that outcome, you are welcome to (optionally) do it again
and run a **classification** for the
**categorical** outcome of **Nausea**. Since
we are doing a regression, and we don’t have any specific expert
knowledge that tells us how we should build our performance metric,
we’ll go with one of the usual ones, namely RMSE. But remember to always
think about your performance metric and don’t just use the default
without at least considering other options.

Start by setting the random seed to

`123`

. This should make everything reproducible and everyone should get the same results.Split the dataset into 70% training, 30% testing. Also use the outcome

`BodyTemp`

as stratification. This allows for more balanced outcome values in the train and test sets. See e.g., section 3 of the*Get Started*tutorial.We want to do 5-fold cross-validation, 5 times repeated. (There’s no specific reason to do this 5x5 pattern, other than to show you that there are different ways to pick the sample, and that I want you to not use the default.) For the CV folds, we also want to stratify on

`BodyTemp`

, as we did for the main train/test split. Use the`vfold_cv()`

function to create a resample object for the training data with these specifications.Create a recipe for the data and fitting. You won’t need to do much, just make sure you code the categorical variables as dummy variables, otherwise things might not work smoothly. For that, you want to use the

`step_dummy()`

function and pick all nominal predictor variables (which are actually all predictor variables here, since the only continuous variable is our outcome).

Write some code to compute the performance of a null model, i.e. a “model” that doesn’t use any predictor information. For a continuous outcome and RMSE as our metric, a null model is one that always predicts the mean of the outcome. Compute the RMSE for both training and test data for such a “model”. We’ll use that later to compare it to the performance of our real models. Of course, we expect/hope our real models that use predictor information to be better. If they aren’t that means they are no good.

We’ll fit a tree, a LASSO model, and a random forest. I chose those
because they are used in the tutorial on the `tidymodels`

website. You can of course add further models. For the tree, see the
*Tune model parameters* section of the *Get Started*
tutorial. For LASSO and the random forest, check out the *Case
Study* section of the *Get Started*
tutorial. Note that you will need to adjust the code for our
scenario since we have a continuous outcome.

If you follow the tutorial, you’ll likely use the packages
`rpart`

, `glmnet`

and `ranger`

to fit
those 3 models. Make sure they are installed and loaded.

I suggest you write code for each model separately. A lot of the code will look similar, so once you got the first one set up, the other two should be easier. They mainly differ in the commands specifying the tuning parameters and the tuning grid.

Each of these models requires some tuning. For the choices regarding
the tuning parameters, you can follow the examples. Most of the models
have more things that can be tuned, but for now you can stick to what
they show in the tutorial. Follow the examples by setting up a workflow,
set a tuning grid, and then use the `tune_grid()`

function to
tune the model using cross-validation.

**Note that the tuning part, i.e., calling
tune_grid() might take a good bit of time to run (possibly
minutes).**

The steps (block of code) you should have here are 1) model
specification, 2) workflow definition, 3) tuning grid specification and
4) tuning using cross-validation and the `tune_grid()`

function.

Once you have done the tuning, you can take a look at some
diagnostics by sending your object returned from the
`tune_grid()`

function to `autoplot()`

. For
instance if you tuned the tree and saved the result as
`tree_tune_res`

, you can run
`tree_tune_res %>% autoplot()`

. Depending on the model,
the plot will be different, but in general it shows you what happened
during the tuning process.

Next, you want to get the model that the tuning process has
determined is the best. You can get the best-fit model with
`select_best()`

and `finalize_workflow()`

and then
do one more fit to the **training data** with this final
workflow using the `fit()`

function. Follow the examples in
the tutorial.

To evaluate the final fit for each model, do the following.

Make two plots, one that shows model predictions from the tuned model versus actual outcomes, and one that plots residuals. The actual outcomes you get straight from the data, the predicted outcomes you can get by applying the

`predict()`

function to the final fit.Look at/print the model performance and compare it with the null model (still only on training data). Here, we want the performance of the tuned, best-fitting model on the CV dataset (we are not yet touching the test data). You can get that for instance with the

`show_best()`

function, which gives you the mean cross-validated performance for the best models. It also shows the standard deviation for the performance. Compare that model performance with the null model.

The mean and standard deviation of the performance give you a measure of overall performance and variability in that measure. The plots show you if there are any systematic deviations between model and data. Taken together, these can be compared for the different models and based on those (and as wanted, other considerations) a final model can be chosen.

**Implement the model tuning/fitting and evaluating steps for
all 3 models.**

Once you have implemented above steps for the 3 models, you should a “best” fit for each one based on the tuning process. For each best model you should have performance, uncertainty around the performance measure, and some diagnostic plots. While for any real research project, you likely want to look deeper (e.g. at uncertainty in predictions instead of just overall performance), for now this is enough. Pick one of the three models. Explain why you pick it. There is no single answer that’s correct, I just want you to learn to reason/argue for why you are doing something: in this case justify why you are picking the model you do.

Once you picked your final model, you are allowed to once –
**and only once** – fit it to the test data and check how
well it performs on that data. This gives you a somewhat honest estimate
of how the model might perform for new, unseen data. You can do that
using the `last_fit()`

function applied to the model you end
up choosing. For the final model applied to the test set, report
performance and the diagnostic plots as above.

And that concludes what is likely a fairly long exercise. The code
itself is not that long, but it will take you time to cobble it together
from the `tidymodel`

tutorial and possibly other sources.

Make sure your analysis and all results are nicely documented and when knitted, your main analysis Rmd file produces a nice-looking HTML page. In a future exercise, you will add this to your portfolio. No need to post about the exercise, I know where to find it.

Let’s use this week’s discussion to talk about projects. Post a short summary of your project, very briefly describing your data, your question(s), your (planned) methods and (expected) results. Sort of like an abstract for a paper. This way, everyone can get a quick glimpse as to what everyone else is doing. Also provide a link to the repo for easy access.

Then, also post any specific questions/struggles/concerns you might have regarding your project. It’s quite likely that there is some overlap in approaches and questions among you and your classmates. Hopefully, through this discussion you can provide each other with some help and input.

Post this by Wednesday. Then read the summaries of your classmates’ projects. If you see a question/topic that you think you might be able to help with, either by answering a specific question, or by providing some general feedback, do so.

And of course, as you look at each others projects, it’s a 2-way street. You can provide help/feedback, but you are also welcome to take inspiration from what others are doing and integrate some of that into your own project.

Since you get feedback from me at each of your submission points, I plan to stay out of this week’s discussion 😁. So this is all you helping each other. Each of you will be reviewing 2 projects in a lot of depth at the end, but hopefully with the discussion this week you get a bit of an idea of what everyone else is doing, and you can provide feedback to others and earlier.

## Comments for specific models

Here are some more suggestions and hints. Most of this is optional but worth trying.

For the tree model, if you want to plot the tree, you can use the`rpart.plot`

package and run this command`rpart.plot(extract_fit_parsnip(best_tree_fit)$fit)`

(assuming your result from the final workflow fit is called`best_tree_fit`

). You might get a warning message, but the tree will show. You will likely find when you look at the actual/predicted plot or the residual plot that the tree model does not perform very well, and the model only predicts a few discrete outcome values. That’s also noticeable when you compare RMSE for the tree model and the null model, they are very similar.For the lasso model, you will likely find that it performs a bit better than the tree, but not a lot. If you want to see a plot for how the number of predictors included in the LASSO model changes with the tuning parameter, you can extract the model from your final fit (say it’s called`best_lasso_fit`

) with`x <- best_lasso_fit$fit$fit$fit`

and then`plot(x, "lambda")`

. I know, this is awful code having to dig that deep into the`best_lasso_fit`

object. Maybe there’s a better way of doing this with some`tidymodels`

command (potentially some combination of`$fit`

and`extract_fit_parnsip()`

). If you know one, use it.For the random forest model, you will likely again find that it performs a bit better than the tree, but not a lot. The tuning setup might require some fiddling around, I had a few initial tries where the whole tuning failed. For a model like random forest, all variables stay in the model. There are ways to look at the variables that are most important. If you want to do that, you again need to pull out the fit object. Say it’s called`best_rf_fit`

, you can do that with`x <- best_rf_fit$fit$fit$fit`

and then use the`vip()`

function from the`vip`

package to plot the importance of the variables.