Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Exercise

Overview

For this exercise, we will participate in one more Tidy Tuesday exercise. This time, we’ll apply a full analysis workflow to the data.

This is another solo exercise, so no group work this week, but of course more GitHub. This exercise will go into your portfolio.

Setup

Use your portfolio website. Make sure it’s up to date and fully synced. Open it in R Studio. From last week’s exercise, you should have an empty Rmd file called tidytuesday_exercise2.Rmd. Open it, that’s where you’ll write your Tidy Tuesday analysis.

Go to the TidyTuesday Github repository. Look for the dataset for this week, and read the instructions on how to get the data. You will also be provided with a data dictionary. If the data is available for download, place it somewhere in your portfolio repository (e.g., you can make a new folder called data).

What to do

Write code and text to perform the following steps:

  1. Load, wrangle and explore the data. By now you know this is an iterative procedure, so it’s ok to have these parts of the process/code intertwined.

  2. Once you understand the data sufficiently, formulate a question/hypothesis. This will determine your outcome of interest and, if applicable, main predictor(s) of interest. Since we don’t know the data yet, it might be that the question is a bit contrived and not actually too interesting, but I’m sure there will be more than one potentially reasonable question one can ask, no matter what the data will be.

  3. Once you determine the question and thus your outcome and main predictors, further pre-process and clean the data as needed. Then split into train/test.

  4. Fit at least 4 different ML models to the data using the tidymodels framework we practiced. Use the CV approach for model training/fitting. Explore the quality of each model by looking at performance, residuals, uncertainty, etc. All of this should still be evaluated using the training/CV data. You can of course recycle code from the previous exercise, but I also encourage you to explore further, e.g. try different ML models or use different metrics. You might have to do that anyway, depending on your question/outcome.

  5. Based on the model evaluations, decide on one model you think is overall best. Explain why. It doesn’t have to be the model with the best performance. You make the choice, just explain why you picked the one you picked.

  6. As a final, somewhat honest assessment of the quality of the model you chose, evaluate it (performance, residuals, uncertainty, etc.) on the test data. This is the only time you are allowed to touch the test data, and only once. Report model performance on the test data.

  7. Summarize everything you did and found in a discussion. Of course, your Rmd file should contain commentary/documentation on everything you do for each step.

General pointers

Add ample information/documentation in the form of either code comments or text in the Rmd file. Ideally both. You should provide a running commentary on what you do, why you do it, and if you use some R trickery, maybe also how (so if you stare at it in a few weeks, you can remember what in the world you did back then 😄).

I don’t want to see any p-values anywhere 😁! This is an exploratory analysis, thus there is no place for p-values.

Finish up

Once done with your Tidy Tuesday analysis, rebuild your portfolio site to make sure everything works and looks good, that all the links work, etc. Then push to GitHub by the deadline. Since this will be on your portfolio website, and I know where to find it, there is no need to post any link this week.

And you are again free to broadcast your effort on social media, e.g., using the #tidytuesday hash tag on Twitter. I’ll again be looking out for #MADA2021 tagged tweets 😁.

Discussion

Write a post in this week’s discussion channel that addresses this topic:

Now that we’ve essentially reached the end of the course and did the whole data-analysis pipeline, which specific topics in the data-analysis workflow do you still find unclear/confusing? Which ones do you wish we had covered more? Are you looking for further resources on a specific topic and haven’t found anything good?

Post by Wednesday, then comment on each others posts. I hope this discussion will allow us to point each other to further possibly helpful resources and maybe help clarify some still open questions.

And just to re-iterate: I very much welcome contributions to the course pages, so if you have resources you think should be added, or sections that are unclear and you can improve, please contribute! By now you know how to do that: Fork the course repo, make changes, send a pull request. I’ll take a look, if things look good I’ll merge. And of course add you to the Acknowledgments section of the course 😄.