Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
We’ll practice the topics covered in this and prior units by continuing with the exercise we started previously.
For this week’s exercise, we’ll continue on the exercise/repository we started previously. We’ll also do some group work again, using the – by now familiar – M1, M2, … setup. Assign each other a number. As much as possible, do it such that you end up working with group members you have not (or not in a while) worked with. The goal is that, as in past exercises, everyone will work on their own repository and on one other person’s repository. If you are in a group of 4, you can also do pairs. That of course doesn’t work in a group of 3, so you need to use the ‘circular’ setup there.
Make sure you have the latest version of the repository on your local machine. As needed, pull/push/merge to make sure remote and local repositories are in sync. Open the project in RStudio.
At this stage, I assume all the data wrangling and EDA code, as well as the model fitting code you worked on previously is present and fully functional. If there are still some issues that need to be resolved, go ahead and do so.
Start with a new/clean analysis script. Start with a clean analysis
script. Use another R Markdown file. If you want, you can place some of
the code into a separate R script and then source/load it from your
Rmd
file, or have everything inside the Rmd
file. In either case, document/comment well.
Write code that takes the data and splits it randomly into a train and test that, following for instance the example in the Data Splitting section of the Get Started tidymodels tutorial.
Next, following the example in the Create Recipes section
of the Get Started tidymodels tutorial, create a simple recipe that
fits our categorical outcome of interest to all
predictors (we’ll start with categorical and all predictors since that’s
the closest to the shown example). For now, you can ignore the concept
of roles
and features
they mention.
Set a model as you did in the previous exercise, then use the
workflow()
package to create a simple workflow that fits a
logistic model to all predictors using the glm
function. To
that end, follow the Fit a model with a recipe section of the
tutorial and adjust for your case.
You should end up with a fit object similar to the one shown at the end of that section in the tutorial - of course, yours will look somewhat different since you are using a different dataset, but overall things should look similar.
Follow the example in the Use a trained workflow to predict section of the tutorial to look at the predictions, ROC curve and ROC-AUC for your data. Apply it to both the training and the test data. ROC curve analysis and ROC-AUC is another common performance measure/metric for categorical outcomes. If you are not familiar with it, you can read more about them by following the link in the tutorial. It’s not too important to go into the details for now. The focus here is on getting the code to work. In general, a ROC-AUC of 0.5 means the model is no good, 1 is a perfect model. Generally, somewhere above 0.7 do people think the model might be useful. If your model has a ROC-AUC a lot less than 0.5, you likely have an issue with how your factors are coded or how tidymodels is interpreting them.
Let’s re-do the fitting but now with a model that only fits the main
predictor to the categorical outcome. You should notice that the only
thing you have to change is to set up a new recipe
, this
time one that only has the name of the predictor of interest on the
right side of the formula (instead of the .
symbol, which
is shorthand notation for “all predictors”.) Then you can set up a new
workflow with the new recipe, rerun the fit and evaluate performance
using the same code as above. In general, if you do multiple
models/recipes, you might want to write a loop to go over them, or
parallelize/vectorize things. For now, just copying and pasting most of
the code is ok.
Make sure everything runs and works as expected. Then commit, push and tell your classmate that they can take over. Finish this by Wednesday.
Fork and clone (or if you are added as collaborator, clone directly) your classmate’s repository. Add code at the bottom of theirs. Mark where your contribution starts with your name.
Add code that repeats the above steps, but now fits linear models to the continuous outcome. One model with all predictors, one model with just the main predictor. For that, you need to change your metric. RMSE is a good one to choose. You should find that a lot of the code your classmate wrote can copied/pasted/re-used with only minimal modifications.
Make sure everything runs and works as expected. Then commit, push and if you forked the repo, initiate a pull request. Tell our classmate that its done.
The original repository owner should make sure everything works, and update as needed. At this point, you might also want to perform some cleanup in your repository. You will keep working on this exercise, and eventually add it to your online portfolio website, so it’s a good idea to clean up a bit. Update README files, remove files and folders that are not used, etc. Ideally, you should have only materials in there that are relevant to the various exercise parts you (and your classmate contributors) did for this. Any leftover stuff from the original template should go or be updated.
Once all is ready, push to GitHub. No need to post a link since it’s the same one as last week.
We also covered overfitting and strategies to minimize it (e.g., cross-validation), and further model assessment strategies in the materials. We’ll practice those in future exercises. I figured the above will already take a good bit of getting used to, and we will be able to practice more when we look at further model types in the coming weeks.
Review and provide feedback on part 3 of the projects you have been assigned to as described in the Projects section.
Because you’ll be doing the project reviews, there is no discussion assignment for this module.