Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Exercise

This exercise lets you do a bit of data loading and wrangling. And of course more group work and GitHub 😁.

The first part of the exercise is due by Wednesday, so your classmate can do their part before the Friday deadline.

Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything; I know where to find it. Therefore, there is no exercise Slack channel for this module.

Setup

We’ll also use the same group setup as last week. Assign each member in your group an (arbitrary) number (I’m calling them M1, M2, …). The order should be different from last week’s so you get to interact with a different group member. Everyone will first work on their own and finish their first part by Wednesday. Then M1 will contribute to M2’s repository, M2 will work on M3’s, etc. The last person will work on M1’s repository. This way, everyone will work on their own repository and on one other group member’s repository.

This is going to be a small data analysis. I generally recommend making each data analysis (or other) project its own GitHub repository, and always using a structure like the one provided in the Data Analysis Template (or something similar). However, for this small exercise and for logistical reasons, you’ll use your portfolio/website repository. If you want, you can make a new folder for this exercise inside your portfolio repository (call it, e.g., dataanalysis-exercise) and in that folder, create sub-folders similar to the ones in the Data Analysis Template (e.g., a data folder and a code folder).

Finding Data

Previously, you did a quick exploration of a dataset that came with an R package (the gapminder data inside the dslabs package). A lot of datasets can be found inside R packages. For instance, this page lists what is likely only a small fraction. There is even a Reddit group dedicated to R datasets. The good and the bad thing about datasets that come with R packages is that they are often fairly clean/tidy. That’s unfortunately not what most “real world” datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. We’ll start practicing this here by getting data that might or might not be very clean.

Go to the CDC’s data website at https://data.cdc.gov/. Browse through the site and identify a dataset of interest.

Which dataset you choose is up to you. I suggest you pick a dataset that has at least 5 different variables, with a mix of continuous and categorical ones. Often, 5 variables means 5 columns; that would be the case in properly formatted data. However, some of the data provided by the CDC is rather poorly formatted. For instance, this dataset has the same variable (age) spread across separate columns, and it is also discretized. As we’ll discuss, these are two really bad things you can do to your data, so I recommend staying away from such datasets. There are plenty of datasets on that website, so I’m sure you’ll find one that is suitable and interesting to you.
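(For what it’s worth, data structured like that can be rescued by reshaping. Here is a minimal sketch using tidyr’s pivot_longer; the data and column names are made up for illustration:)

```r
library(tidyr)
library(dplyr)

# Hypothetical wide-format data: the same variable (cases)
# spread across one column per age group
wide_data <- tibble(
  state = c("GA", "FL"),
  cases_age_0_17  = c(10, 20),
  cases_age_18_64 = c(100, 200),
  cases_age_65up  = c(50, 60)
)

# Reshape so that age group becomes a single variable
long_data <- wide_data %>%
  pivot_longer(
    cols = starts_with("cases_age"),
    names_to = "age_group",
    names_prefix = "cases_age_",
    values_to = "cases"
  )
```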

If you absolutely can’t find a good dataset on the CDC website, you can get one somewhere else. It needs to be real world data (so no training/teaching data repositories), decently documented, and readily available.

Getting the data

To get the dataset you selected, it is easiest if you download the file to your computer and place it inside your repository (ideally into a rawdata folder within this exercise folder).

Remember that GitHub doesn’t like large files. So if you pick a large data file (>100MB), first place it somewhere outside your repository, then reduce it by, e.g., writing some R code that selects only a portion of the data (see the sketch below). Once it’s small enough, you can place it into the GitHub repository. If no file is available for easy download, or if it is too large to download and place in the repo, you can instead write code to pull it directly from the source. This is generally done via an API provided by the site you get the data from (Google is your friend to figure out what commands you need to write).
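As a rough sketch of the reduction step, assuming a CSV file and the readr and dplyr packages (the file paths and column names here are made up; adapt them to your dataset):

```r
library(readr)
library(dplyr)

# Read the full file from a location OUTSIDE the repository
# (hypothetical path and file name)
rawdata <- read_csv("~/Downloads/big_cdc_data.csv")

# Keep only the rows and columns you actually need,
# e.g., a few variables for a single year
smalldata <- rawdata %>%
  select(state, year, cases, deaths) %>%
  filter(year == 2020)

# Now the reduced file can go inside the repository
write_csv(smalldata, "dataanalysis-exercise/rawdata/cdc_data_small.csv")
```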

Loading and processing the data

Now, write code that loads the data and processes/cleans it.

Add a dataanalysis_exercise.qmd file to the main folder of your repository (the place where the other _exercise.qmd files are). You can write code either into that Quarto file, or do a combined R script + Quarto, like the examples in the Data Analysis Template.

At the top of the Quarto file, add a brief description of the data, where you got it, what it contains. Also add a link to the source.

Then write code that reads/loads and processes the data. Comprehensive and full cleaning of all the data is not necessary. Instead, decide on a few variables of interest and clean those. Think of one variable as the main outcome of interest (in a plot, that would generally be shown on the y-axis) and some other variable(s) as the predictors of interest, similar to the height/weight example code in the Data Analysis Template. Or it could be some quantity changing over time. It’s your choice.
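A minimal sketch of what this could look like, again assuming the readr and dplyr packages (variable names and paths are hypothetical):

```r
library(readr)
library(dplyr)

# Load the raw data (placeholder path)
rawdata <- read_csv("dataanalysis-exercise/rawdata/cdc_data_small.csv")

# Clean only the variables of interest: here, one continuous
# outcome (cases) and two predictors (year, state)
cleandata <- rawdata %>%
  select(state, year, cases) %>%
  mutate(
    state = as.factor(state),
    cases = as.numeric(cases)
  ) %>%
  filter(!is.na(cases))
```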

Once you have written code that processes the bits of the data you are interested in, save the cleaned data as an RDS file. Also include a summary table of your cleaned data. Feel free to peek at the Manuscript.qmd file of the Data Analysis Template for inspiration.
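Continuing the hypothetical example from above, saving and summarizing could look like this:

```r
# Save the cleaned data as an RDS file (placeholder path)
saveRDS(cleandata, file = "dataanalysis-exercise/data/cleandata.rds")

# A quick summary table of the cleaned data; base R's summary()
# works, packages such as skimr or gtsummary produce nicer tables
summary(cleandata)
```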

Add enough commentary to your Quarto file and R code such that your classmate who will take over next knows what you are doing and what variables they should work with.

When all these parts are done and work, commit and push your changes to GitHub. Then let the group member who will take over (see above) know that it’s their turn.

Wrangling and exploring data

Fork and clone your fellow group member’s repository using the same workflow you used in a previous exercise. Once you have it on your local computer, open the Quarto (or Quarto+R) file(s).

Add code and text below the part your classmate did. Add a heading to indicate where your section starts and also add your name. Specifically, have a heading that says # This section added by YOURFULLNAME. I need this so I can grade accordingly.

Write code that loads the RDS file of the cleaned data that your colleague produced. Then add some code that produces a few plots and/or tables. These can be purely descriptive and exploratory, or, if you feel comfortable with some basic R statistical commands, you can also fit a model to the data.
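A minimal sketch, reusing the hypothetical variables from the example above (ggplot2 for the plot, plus an optional simple linear model):

```r
library(ggplot2)

# Load the cleaned data your colleague saved
cleandata <- readRDS("dataanalysis-exercise/data/cleandata.rds")

# A simple exploratory plot: outcome over time, by state
ggplot(cleandata, aes(x = year, y = cases, color = state)) +
  geom_point() +
  geom_line()

# Optional: a basic linear model for the outcome
fit <- lm(cases ~ year + state, data = cleandata)
summary(fit)
```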

Add enough commentary such that your classmate and any reader knows what you are doing (and why).

Once all done, commit and push to your repo on GitHub (the fork of the original), then initiate a pull request (PR) to the original repository.

Posting the “final” product

The original repository owner should check the PR they received from their colleague, request changes if needed, approve if all looks okay, then merge and update their own repo.

In a final step, update the _quarto.yml file and include a menu item for “Data Analysis Exercise” pointing to the new file. Re-create your website and make sure it all works and the new project shows up.
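The new entry in _quarto.yml might look something like this (the exact structure depends on how your site’s navigation is set up):

```yaml
website:
  navbar:
    left:
      - text: "Data Analysis Exercise"
        href: dataanalysis_exercise.qmd
```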

Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything; I know where to find it. I will assess both the original contribution and the addition made by the second person.

Discussion

No discussion assignment this week. Instead, submit project part 1.

Project

Submission of part 1 is due. Post the URL of your project repository to the project_related Slack channel.