Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
This exercise lets you do a bit of data loading and wrangling. And of course more group work and GitHub 😁.
While this is going to be a small and incomplete data analysis, we’ll practice using the full setup. To that end, grab another copy of the dataanalyis-template, make it your own repository (name it
YOURNAME-MADA-analysis2), and clone it to your local computer. Then use the structure of this repository to do the project as described below.
We’ll also use the same group setup as last week. Assign each member in your group an (arbitrary) number (I’m calling them M1, M2, …). The order should be different than last week so you get to interact with a different group member. Everyone will first work on their own and finish this part by Wednesday. Then M1 will contribute to M2’s repository, M2 will work on M3, etc. The last person (M3 or M4), will work on M1’s repository. This way, everyone will work on their own and one group member’s repository.
Previously, you did a quick exploration of a dataset that came with an R package (
gapminder data inside
dslabs package). A lot of datasets can be found inside R packages. For instance, this page lists what is likely only a small fraction. There is even a Reddit group dedicated to R datasets. The good and the bad about datasets that come with R packages is that they are often fairly clean/tidy. That’s unfortunately not how most “real world” datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. We’ll start practicing this here by getting data that might or might not be very clean.
Go to the CDC’s data website at https://data.cdc.gov/. Browse through the site and identify a dataset of interest.
Which dataset you choose is up to you. To make a future exercise easier on you, I suggest you pick a dataset that has at least 5 different variables, and a mix of continuous and categorical ones. Often, 5 variables means 5 columns. That would be the case in properly formatted data. However, some of the data provided by the CDC is rather poorly formatted. For instance this dataset has the same variable (age) in separate columns, and it is also discretized. As we’ll discuss, these are two really bad things you can do to your data, so I recommend staying away from such datasets. There are plenty on that website, so I’m sure you’ll find one that is suitable and interesting to you.
If you absolutely can’t find a good dataset on the CDC website, you can get one somewhere else. It needs to be real world data (so no training/teaching data repositories), decently documented, and readily available.
Get the dataset you selected into R. You can do that by downloading the file to your computer first and placing it into the
raw_data folder. Then modify the
processingscript.R file to read in your data file. You can delete or comment out anything that’s already in that R script.
If no file is available for download, or if it is too large to download and place in the repo (GitHub doesn’t like really large files, one of its limitations), you can instead write code to pull it directly from the source through the API (Google is your friend to figure out what commands you need to write).
Once the data is loaded, you’ll likely need to do some processing. Comprehensive and full cleaning of the data is not necessary. Instead, decide on a few variables of interest and clean those. Think of one variable as the main outcome of interest (in a plot, that would generally be shown on the y-axis) and some other variable(s) as the predictors of interest. Similar to the height/weight example code. Or it could be some quantity changing over time. It’s your choice.
Once you written code that processes the bits of the data you are interested in, save them as an RDS file, similar to what you see in the example code.
Add enough commentary into your R script such that your classmate who will take over next knows what you are doing and what variables they should work with!
When all the cleaning parts work, commit and push your changes to GitHub. Then let the group member who will take over (see above) know where to find your repository.
Fork and clone your fellow group member’s repository as you did previously. Once you have it on your local computer, open the
analysisscript.R file and replace the existing code with new code that loads the data cleaned by your colleague and produces a few quick plots and/or tables. Then update the
Manuscript.Rmd file by removing the old results and showing your new results.
Once all done, commit, push to your repo on Github (the fork of the original), then initiate a pull request (PR) to the original repository.
The original repository owner should check the PR they received from their colleague, request changes if needed, approve if all looks okay, then merge and update their own local repo. Then post the link to the repository into the
module4_exercise channel. I’ll take a look/grade the project after the deadline.
No discussion assignment this week. Instead, submit project part 1.
Submission of part 1 is due. Submit a link with a URL to your repository to the project_topics Slack channel.