Assessment - Data Processing

Author

Andreas Handel

Modified

2024-03-20

Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Discussion

How has your experience with AI tools for this course been so far? What have you found helpful, what doesn’t seem to work? Any cool tools, tricks, resources you found? Write an initial post by Wednesday, then comment on and learn from each other by Friday.

Exercise

This exercise lets you do a bit of data loading and wrangling. You’ll also work together through GitHub again.

The first part of the exercise is due by Wednesday, so your classmate can do their part before the Friday deadline.

Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything, I know where to find it. Therefore there is no exercise Slack channel for this module.

Setup

We’ll also use the familiar group setup. Check the group assignment to make sure you know who else is in your group for this week. Assign each member in your group an (arbitrary) number (I’m calling them M1, M2, …). The order should be different than before so you get to interact with a different group member. Everyone will first work on their own repository. Then M1 will contribute to M2’s repository, M2 will work on M3, etc. The last person, will work on M1’s repository. This way, everyone will work on their own and one group member’s repository.

Part 1

Part 1 is due by Wednesday.

Finding Data

Previously, you did a quick exploration of a dataset that came with an R package (gapminder data inside dslabs package). A lot of datasets can be found inside R packages. For instance, this page lists what is likely only a small fraction. The good and the bad about datasets that come with R packages is that they are often fairly clean/tidy. That’s unfortunately not how most “real world” datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. We’ll start practicing this here by getting data that might or might not be very clean.

Go to the CDC’s data website at https://data.cdc.gov/. Browse through the site and identify a dataset of interest.

Which dataset you choose is up to you. I suggest you pick a dataset that has at least 100 observations with 5 different variables, and a mix of continuous and categorical ones. Often, 5 variables means 5 columns. That would be the case in properly formatted data. However, some of the data provided by the CDC is rather poorly formatted. For instance CDC’s dataset on traumatic brain injury has the same variable (age) in separate columns, and it is also discretized. As we’ll discuss, these are two really bad things you can do to your data, so I recommend staying away from such datasets. There are plenty on that website, so I’m sure you’ll find one that is suitable and interesting to you.

Getting the data

To get the dataset you selected, it is easiest if you download the file to your computer and place it inside your portfolio repository. Note that in general, you should make each data analysis (or other) project its own GitHub repository, and always use a structure like the one provided in the Data Analysis Template (or something similar). However, for this small exercise and for logistic reasons, you’ll use your portfolio/website repository, and just a single folder. Make a new folder called cdcdata-exercise inside your portfolio repository. Place the data into that folder.

Remember that GitHub doesn’t like large files. So if you pick a large data file (>50MB), first place it somewhere outside your repository, then reduce it by e.g., writing some R code that selects only a portion of the data. Once it’s small enough, you can place it into the GitHub repository.

While you should be able to find data for direct download from the CDC website, sometimes you need to write a bit of code to pull data from a source. This is usually done through an API. R has packages that make this relatively easy. If you ever encounter that situation, search online for instructions. Google/Stackoverflow are your friends to figure out what commands you need to write).

Exploring the data

Now, write code that explores the data. Add a new Quarto document called cdcdata-exercise.qmd to the folder you just created.

Start by providing a brief description of the data, where you got it, what it contains. Also add a link to the source.

Then write code that reads/loads the data. As needed, process the data (e.g., if there are weird symbols in the data, or missing values coded as 999 or anything of that sort, write code to fix it.) If your dataset has a lot of variables, pick a few of them (at least 5).

Once you have the data processed and cleaned, perform some exploratory/descriptive analysis on this cleaned dataset. Make some summary tables, produce some figures. Try to summarize each variable in a way that it can be described by a distribution. For instance if you have a categorical variable, show what % are in each category. If you have a continuous variable, make a plot to see if it’s approximately normal, then try to summarize it to determine its mean and standard deviation.

The idea is that your descriptive analysis will provide enough information for your classmate to make synthetic data that looks similar, along the lines discussed in the synthetic data module.

Remember to add both text to your Quarto file and comments into your code to explain what you are doing.

In a final step, update the _quarto.yml file and include a menu item for “Data Analysis Exercise” pointing to the new file. Follow the format of the existing entries. Remember to be very careful about the right amount of empty space. Re-create your website and make sure it all works and the new project shows up on the website.

A new GitHub workflow

If everything works as expected, commit and push your changes to GitHub. Instead of using the fork + pull-request workflow we’ve tried a few times, we’ll explore a different collaborative approach. In this approach, you and your collaborator work on the same repository. To that end, you need to add your classmate as collaborator. Go to Github.com, find your portfolio repository, go to Settings, then Collaborators. Choose Add Collaborator and add your classmate. Your classmate should receive an invitation, which they need to accept. With this, they are now able to directly push and pull to your repository, without them needing to create a fork. (You can remove them after this exercise if you don’t want them to be able to continue having write access to your repository).

To avoid any potential merge conflicts, once your classmate takes over, you shouldn’t make further changes to the repository.

Part 2

Part 2 is due by Friday.

Joining and cloning the GitHub repository

You should have received an invitation to be a collaborator on your classmate’s repository. Accept it, then directly clone (not fork) the repository to your local computer.

Find and open the cdcdata-exercise.qmd file. At the bottom, write a comment that says something like This section contributed by YOURNAME. This needs to be there for me to be able to grade your contribution.

Making synthetic data

Take a look at the descriptive analysis your classmate did.

Next, produce a new synthetic data set with the same structure as their cleaned/processed data. You are encouraged to use some LLM AI tools to help write the code. If you do, specify in the Quarto document or as comments the AI prompts you are using.

Write code that produces synthetic data, then summarizes/explores the data with a few tables and figures similar to those made by your classmate for the original data.

Add comments in the code and text into the Quarto document to explain what you did and how close your synthetic data is compared to the original data.

Make sure everything works and the website renders ok.

Finishing the GitHub workflow

Once all is done and works, commit, push to GitHub. Note that you are now directly pushing to the original repo which is owned by your classmate. This is easier, you don’t need to do fork and pull request (PR). It’s also more dangerous, since you could potentially mess up your classmate’s repo. So make sure things work before committing and pushing.

Some more comments on GitHub workflows:

In general, if you work closely with someone on a project, it might make sense to add them as collaborator, and as needed coordinate with them to avoid merge conflicts. Otherwise, telling someone to contribute by forking and sending a pull request is the safer approach, and you have control if you want to accept their changes or not.

There is yet another common way to use GitHub, namely collaborators working in the same repository, but with different branches. Think of a branch like a fork, but it happens inside the repository. Work can occur independently on branches, and at some point one can merge branches. This allows people to work in a single repository, but minimizes possible merge conflicts. This approach is standard for larger projects with many collaborators. For this class, we won’t use branches, but note that they are useful and commonly used.

Part 3

The original repository owner doesn’t really need to do anything. What their classmate added hopefully works and shows up on the website. You still might want to check to make sure that everything is ok.

Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything, I know where to find it. I will assess both the original contribution and the addition made by the second person.