Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
For this exercise, we will participate in one more Tidy Tuesday exercise. This time, we’ll apply a full analysis workflow to the data.
This is another solo exercise, so no group work this week, but of course more GitHub. This exercise will go into your portfolio.
Use your portfolio website. Make sure it’s up to date and fully
synced. Open it in R Studio. From last week’s exercise, you should have
an empty Rmd file called tidytuesday_exercise2.qmd
. Open
it, that’s where you’ll write your Tidy Tuesday analysis.
Go to the TidyTuesday Github
repository. Look for the dataset for this week, and read the
instructions on how to get the data. You will also be provided with a
data dictionary. If the data is available for download, place it
somewhere in your portfolio repository (e.g., you can make a new folder
called data
).
Write code and text to perform the following steps:
Load, wrangle and explore the data. By now you know this is an iterative procedure, so it’s ok to have these parts of the process/code intertwined.
Once you understand the data sufficiently, formulate a question/hypothesis. This will determine your outcome of interest and, if applicable, main predictor(s) of interest. Since we don’t know the data yet, it might be that the question is a bit contrived and not actually too interesting, but I’m sure there will be more than one potentially reasonable question one can ask, no matter what the data will be.
Once you determine the question and thus your outcome and main predictors, further pre-process and clean the data as needed. Then split into train/test. (It might be that the data is too small for this split to make sense in real life but for this exercise, we’ll just do it.)
Fit at least 3 different model types to the data using the
tidymodels
framework we practiced. Use the CV approach for
model training/fitting. Explore the quality of each model by looking at
performance, residuals, uncertainty, etc. All of this should still be
evaluated using the training/CV data. You can of course recycle code
from previous exercises, but I also encourage you to explore further,
e.g. try different ML models or use different metrics. You might have to
do that anyway, depending on the data and your
question/outcome.
Based on the model evaluations and your scientific question/hypothesis, decide on one model you think is overall best. Explain why. It doesn’t have to be the model with the best performance. You make the choice, just explain why you picked the one you picked.
As a final, somewhat honest assessment of the quality of the model you chose, evaluate it (performance, residuals, uncertainty, etc.) on the test data. This is the only time you are allowed to touch the test data, and only once. Report model performance on the test data.
Summarize everything you did and found in a discussion. Make sure you discuss your findings with regard to your original question/hypothesis. What did you learn? If feasible, show a summary figure or table that illustrates your main scientific finding from this analysis.
Add ample information/documentation in the form of both code comments and explanatory text You should provide a running commentary on what you do, why you do it, and how your R code accomplishes that (so that if you stare at it in a few weeks, you can remember what in the world you did back then 😄).
I don’t want to see any p-values anywhere 😁! This is an exploratory analysis, thus there is no place for p-values.
Once done with your Tidy Tuesday analysis, rebuild your portfolio site to make sure everything works and looks good, that all the links work, etc. Then push to GitHub by the deadline. Since this will be on your portfolio website, and I know where to find it, there is no need to post any link this week.
Write a post in this week’s discussion channel that addresses this topic:
Now that we’ve essentially reached the end of the course and did the whole data-analysis pipeline, which specific topics in the data-analysis workflow do you still find unclear/confusing? Which ones do you wish we had covered more? Are you looking for further resources on a specific topic and haven’t found anything good?
Post by Wednesday, then comment on each others posts. I hope this discussion will allow us to point each other to further possibly helpful resources and maybe help clarify some still open questions.
And just to re-iterate: I very much welcome contributions to the course pages, so if you have resources you think should be added, or sections that are unclear and you can improve, please contribute! By now you know how to do that: Fork the course repo, make changes, send a pull request. I’ll take a look, if things look good I’ll merge. And of course add you to the Acknowledgments section of the course 😄.