Assessment - Data Processing
Quiz
Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
Discussion
How has your experience with AI tools for this course been so far? What have you found helpful, and what doesn't seem to work? Any cool tools, tricks, or resources you've found? Write an initial post by Wednesday, then comment on each other's posts and learn from each other by Friday.
Exercise
This exercise lets you do a bit of data loading and wrangling. You’ll also work together through GitHub again.
The first part of the exercise is due by Wednesday, so your classmate can do their part before the Friday deadline.
Setup
We'll also use the familiar group setup. Check the group assignment to make sure you know who else is in your group for this week. Assign each member in your group an (arbitrary) number (I'm calling them M1, M2, …). The order should be different from before so you get to interact with a different group member. Everyone will first work on their own repository. Then M1 will contribute to M2's repository, M2 to M3's, and so on. The last person will contribute to M1's repository. This way, everyone works on their own repository and on one group member's repository.
Part 1
Part 1 is due by Wednesday.
Finding Data
Previously, you did a quick exploration of a dataset that came with an R package (the gapminder data inside the dslabs package). A lot of datasets can be found inside R packages; for instance, this page lists what is likely only a small fraction. The good and the bad about datasets that come with R packages is that they are often fairly clean/tidy. Unfortunately, that's not what most "real world" datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. We'll start practicing this here by getting data that might or might not be very clean.
Go to the CDC’s data website. Browse through the site and identify a dataset of interest.
Which dataset you choose is up to you. I suggest you pick a dataset that has at least 100 observations and 5 different variables, with a mix of continuous and categorical ones. Often, 5 variables means 5 columns; that would be the case in properly formatted data. However, some of the data provided by the CDC is rather poorly formatted. For instance, CDC's dataset on traumatic brain injury spreads the same variable (age) across separate columns, and the variable is also discretized. As we'll discuss, these are two really bad things you can do to your data, so I recommend staying away from such datasets. There are plenty on that website, so I'm sure you'll find one that is suitable and interesting to you.
Getting the data
To get the dataset you selected, it is easiest if you download the file to your computer and place it inside your portfolio repository. Note that in general, you should make each data analysis (or other) project its own GitHub repository, and always use a structure like the one provided in the Data Analysis Template (or something similar). However, for this small exercise and for logistical reasons, you'll use your portfolio/website repository and just a single folder. Make a new folder called cdcdata-exercise inside your portfolio repository. Place the data into that folder.
Remember that GitHub doesn't like large files. So if you pick a large data file (>50MB), first place it somewhere outside your repository, then reduce it, e.g., by writing some R code that keeps only a portion of the data. Once it's small enough, you can place it into the GitHub repository.
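Here is a minimal sketch of what that reduction could look like. All file names and column names below are placeholders you'd replace with your own:

```r
# shrink a large CSV before adding it to the repository
library(readr)
library(dplyr)

# the large raw file lives outside the repository (path is a placeholder)
rawdata <- read_csv("~/Downloads/large-cdc-file.csv")

smalldata <- rawdata |>
  select(year, state, sex, cases, rate) |> # keep only the variables you need
  slice_sample(n = 10000)                  # and/or a random subset of rows

# write the reduced file into the exercise folder inside the repository
write_csv(smalldata, "cdcdata-exercise/cdcdata-small.csv")
```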
While you should be able to find data for direct download from the CDC website, sometimes you need to write a bit of code to pull data from a source. This is usually done through an API. R has packages that make this relatively easy. If you ever encounter that situation, search online for instructions; Google/StackOverflow are your friends for figuring out what commands you need to write.
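As a rough illustration (the URL below is a placeholder, and the endpoint pattern is an assumption about the CDC data portal's Socrata-style API, so check the documentation for your specific dataset), pulling data from an API can be as simple as reading a URL:

```r
library(readr)

# placeholder URL; many datasets on data.cdc.gov expose an endpoint of the
# form https://data.cdc.gov/resource/<dataset-id>.csv that returns CSV
mydata <- read_csv("https://data.cdc.gov/resource/xxxx-xxxx.csv")
```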
Exploring the data
Now, write code that explores the data. Add a new Quarto document called cdcdata-exercise.qmd to the folder you just created.
Start by providing a brief description of the data, where you got it, what it contains. Also add a link to the source.
Then write code that reads/loads the data. As needed, process the data (e.g., if there are weird symbols in the data, or missing values coded as 999, or anything of that sort, write code to fix it). If your dataset has a lot of variables, pick a few of them (at least 5).
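A sketch of what loading and light cleaning might look like follows; the file name, the column names, and the 999 missing-value code are all assumptions to adapt to your dataset:

```r
library(readr)
library(dplyr)

rawdata <- read_csv("cdcdata-small.csv")

cleandata <- rawdata |>
  select(year, state, sex, cases, rate) |> # keep at least 5 variables
  mutate(rate = na_if(rate, 999),          # recode 999 as missing
         sex  = as.factor(sex))            # treat categories as factors

summary(cleandata) # quick check that the cleaning did what you expect
```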
Once you have the data processed and cleaned, perform some exploratory/descriptive analysis on the cleaned dataset. Make some summary tables and produce some figures. Try to summarize each variable in a way that it can be described by a distribution. For instance, if you have a categorical variable, show what percentage falls into each category; if you have a continuous variable, make a plot to see if it's approximately normal, then summarize it by its mean and standard deviation.
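For example, assuming the hypothetical cleandata object and columns from the sketch above, summaries like these would work:

```r
library(dplyr)
library(ggplot2)

# categorical variable: percentage in each category
cleandata |>
  count(sex) |>
  mutate(percent = 100 * n / sum(n))

# continuous variable: check the shape of the distribution...
ggplot(cleandata, aes(x = rate)) +
  geom_histogram(bins = 30)

# ...then summarize it by mean and standard deviation
cleandata |>
  summarize(mean_rate = mean(rate, na.rm = TRUE),
            sd_rate   = sd(rate, na.rm = TRUE))
```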
The idea is that your descriptive analysis will provide enough information for your classmate to make synthetic data that looks similar, along the lines discussed in the synthetic data module.
Remember to add both text to your Quarto file and comments into your code to explain what you are doing.
In a final step, update the _quarto.yml file and include a menu item for "Data Analysis Exercise" pointing to the new file. Follow the format of the existing entries, and remember to be very careful about the right amount of empty space, since YAML is whitespace-sensitive. Re-render your website and make sure it all works and the new exercise shows up on the website.
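The new entry might look roughly like this, though the exact nesting depends on how your existing _quarto.yml is set up, so match the entries already there:

```yaml
website:
  navbar:
    left:
      - text: "Data Analysis Exercise"
        href: cdcdata-exercise/cdcdata-exercise.qmd
```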
GitHub Collaboration
If everything works as expected, commit and push your changes to GitHub. Let your team member know that things are ready. We are using the fork + pull-request workflow again. If you forgot, go back to the previous exercise and follow those detailed instructions.
Part 2
Part 2 is due by Friday.
Making synthetic data
Get a copy of your classmate’s repository using the fork+clone approach discussed previously.
Find and open the cdcdata-exercise.qmd file. At the bottom, write a comment that says something like This section contributed by YOURNAME. This needs to be there for me to be able to grade your contribution.
Take a look at the descriptive analysis your classmate did.
Next, produce a new synthetic dataset with the same structure as their cleaned/processed data. You are encouraged to use LLM-based AI tools to help write the code. If you do, record the prompts you used in the Quarto document or as code comments.
Write code that produces the synthetic data, then summarizes/explores it with a few tables and figures similar to those your classmate made for the original data.
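A minimal sketch of the generation step is below. All numbers (sample size, category proportions, mean, SD) and variable names are placeholders to be replaced with the values your classmate reported:

```r
set.seed(123) # make the synthetic data reproducible
n <- 500      # placeholder sample size

syndata <- data.frame(
  # categorical variable: sample categories with the reported proportions
  sex  = sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.48, 0.52)),
  # continuous variable: draw from a normal with the reported mean and SD
  rate = rnorm(n, mean = 12.3, sd = 4.1)
)

# compare against the original summaries
prop.table(table(syndata$sex))
mean(syndata$rate)
sd(syndata$rate)
```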
Add comments in the code and text in the Quarto document to explain what you did and how closely your synthetic data matches the original data.
Make sure everything works and the website renders ok.
Finishing the GitHub workflow
Once all is done and works, commit and push to GitHub. Then visit your fork of the repo on GitHub.com and initiate a pull request as you did in a previous exercise.
Part 3
The original repository owner should receive a PR. Hopefully there are no conflicts and the updates can be merged. Pull the updates to your local machine and make sure it all works. Rebuild/render the whole website. Then push back. Check your website online to make sure everything shows up as it should and that all your exercises are reachable through the menu.
Since this will be part of your portfolio site, and you already posted a link to that previously, you don't need to post anything; I know where to find it.
Another fork and pull exercise
This is optional. You can do it at any time during this course (and more than once) 😁.
Help improve the course with your contributions! Find something wrong, unclear, or worth improving in this course (e.g., a typo, something confusing, a broken link, a suggestion for a new reference, or anything else). Go to the GitHub repository for this course. Follow the steps outlined above: fork the course to your personal account, clone it to your local computer, implement your updates, push them back to GitHub, then initiate a pull request. I will get a notification of your pull request. If things look ok and no conflicts exist, I will merge your improvements into the course. And just like that, you have contributed to improving this course! (And of course, you will be listed in the Acknowledgments section of the main course page.)
Another option for helping to improve the course website is to file a GitHub Issue. Feel free to do so any time during the course to let me know of anything that needs fixing.