Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit it to the online grading system before the deadline.
This exercise lets you do a bit of data loading and wrangling. And of course more group work and GitHub.
The first part of the exercise is due by Wednesday, so your classmate can do their part before the Friday deadline.
Since this will be part of your portfolio site, and you already posted a link to that previously, you don't need to post anything; I know where to find it. Therefore, there is no exercise Slack channel for this module.
We'll also use the same group setup as last week. Assign each member in your group an (arbitrary) number (I'm calling them M1, M2, ...). The order should be different than last week so you get to interact with a different group member. Everyone will first work on their own and finish their first part by Wednesday. Then M1 will contribute to M2's repository, M2 will work on M3's, etc. The last person will work on M1's repository. This way, everyone will work on their own repository and on one group member's repository.
This is going to be a small data analysis. I generally recommend making each data analysis (or other) project its own GitHub repository, and always using a structure like the one provided in the Data Analysis Template (or something similar). However, for this small exercise and for logistic reasons, you'll use your portfolio/website repository. If you want, you can make a new folder for this exercise inside your portfolio repository (call it, e.g., `dataanalysis-exercise`) and in that folder, create sub-folders similar to the ones in the Data Analysis Template (e.g., a `data` folder and a `code` folder).
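One possible layout would look something like this sketch (folder and file names here are just suggestions, matching the ones used elsewhere in this exercise):

```
portfolio-repo/
├── dataanalysis_exercise.qmd
├── _quarto.yml
└── dataanalysis-exercise/
    ├── code/
    └── data/
        └── rawdata/
```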
Previously, you did a quick exploration of a dataset that came with an R package (the `gapminder` data inside the `dslabs` package). A lot of datasets can be found inside R packages. For instance, this page lists what is likely only a small fraction. There is even a Reddit group dedicated to R datasets. The good and the bad about datasets that come with R packages is that they are often fairly clean/tidy. That's unfortunately not what most "real world" datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. We'll start practicing this here by getting data that might or might not be very clean.
Go to the CDC's data website at https://data.cdc.gov/. Browse through the site and identify a dataset of interest.
Which dataset you choose is up to you. I suggest you pick a dataset that has at least 5 different variables, and a mix of continuous and categorical ones. Often, 5 variables means 5 columns. That would be the case in properly formatted data. However, some of the data provided by the CDC is rather poorly formatted. For instance, this dataset has the same variable (age) in separate columns, and it is also discretized. As we'll discuss, these are two really bad things you can do to your data, so I recommend staying away from such datasets. There are plenty on that website, so I'm sure you'll find one that is suitable and interesting to you.
If you absolutely canât find a good dataset on the CDC website, you can get one somewhere else. It needs to be real world data (so no training/teaching data repositories), decently documented, and readily available.
To get the dataset you selected, it is easiest if you download the file to your computer and place it inside your repository (ideally into a `rawdata` folder within this exercise folder).
Remember that GitHub doesn't like large files. So if you pick a large data file (>100 MB), first place it somewhere outside your repository, then reduce it by, e.g., writing some R code that selects only a portion of the data (see the sketch below). Once it's small enough, you can place it into the GitHub repository. If no file is available for easy download, or if it is too large to download and place in the repo, you can instead write code to pull it directly from the source. This is generally done via an API that the place you get the data from provides (Google is your friend to figure out what commands you need to write).
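Here is a minimal sketch of how reducing a large file might look. The file name, path, and column names are all hypothetical; adjust them to your dataset.

```r
library(readr)
library(dplyr)

# Read the full file from a location outside the repository
# (file name and path are placeholders).
raw <- read_csv("~/Downloads/big_cdc_data.csv")

# Keep only the columns and rows you actually need.
# These column names are made up; replace with ones from your data.
smaller <- raw %>%
  select(year, state, cases, deaths) %>%
  filter(year >= 2015)

# Write the reduced file into the repository, where it is now
# small enough for GitHub.
write_csv(smaller, "dataanalysis-exercise/data/rawdata/cdc_data_small.csv")
```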
Now, write code that loads the data and processes/cleans it.
Add a `dataanalysis_exercise.qmd` file to the main folder of your repository (the place where the other `_exercise.qmd` files are). You can write code either into that Quarto file, or do a combined R script + Quarto, like the examples in the Data Analysis Template.
At the top of the Quarto file, add a brief description of the data: where you got it and what it contains. Also add a link to the source.
Then write code that reads/loads and processes the data. Comprehensive and full cleaning of all the data is not necessary. Instead, decide on a few variables of interest and clean those. Think of one variable as the main outcome of interest (in a plot, that would generally be shown on the y-axis) and some other variable(s) as the predictors of interest, similar to the height/weight example code in the Data Analysis Template. Or it could be some quantity changing over time. It's your choice.
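A load-and-clean step might look something like the following sketch. Again, the file path and variable names are placeholders, assuming the folder layout suggested above.

```r
library(readr)
library(dplyr)
library(tidyr)

# Load the raw data (placeholder path).
rawdata <- read_csv("dataanalysis-exercise/data/rawdata/cdc_data_small.csv")

# Keep a few variables of interest and clean them up.
# Variable names are hypothetical; use the ones in your dataset.
cleandata <- rawdata %>%
  select(year, state, cases) %>%
  drop_na() %>%                      # remove rows with missing values
  mutate(state = as.factor(state))   # treat state as categorical
```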
Once you have written code that processes the bits of the data you are interested in, save the cleaned data as an RDS file. Also include a summary table of your cleaned data. Feel free to peek at the `Manuscript.qmd` file of the Data Analysis Template for inspiration.
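Saving and summarizing could be as simple as this sketch (the output path is a suggestion):

```r
# Save the processed data as an RDS file for the next person to load.
saveRDS(cleandata, file = "dataanalysis-exercise/data/processeddata.rds")

# A quick summary table; base R's summary() works on any data frame.
summary(cleandata)

# Optionally, if you have the skimr package installed, this gives a
# more detailed overview:
# skimr::skim(cleandata)
```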
Add enough commentary to your Quarto file and R code such that your classmate who will take over next knows what you are doing and what variables they should work with.
When all these parts are done and work, commit and push your changes to GitHub. Then let the group member who will take over (see above) know that it's their turn.
Fork and clone your fellow group member's repository using the same workflow you used in a previous exercise. Once you have it on your local computer, open the Quarto (or Quarto+R) file(s).
Add code and text below the part your classmate did. Add a heading to indicate where your section starts and also add your name. Specifically, have a heading that says `# This section added by YOURFULLNAME`. I need this so I can grade accordingly.
Write code that loads the RDS file of the cleaned data that your colleague produced. Then add some code that produces a few plots and/or tables. These can be purely descriptive and exploratory, or if you feel comfortable with some basic R statistical commands, you can also fit some model to the data.
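The second contributor's part might look something like the following sketch. The file path and variable names are placeholders that match the earlier examples; use whatever your colleague actually produced.

```r
library(ggplot2)

# Load the cleaned data saved by the first contributor
# (placeholder path).
cleandata <- readRDS("dataanalysis-exercise/data/processeddata.rds")

# An exploratory plot: outcome over time, colored by a category.
ggplot(cleandata, aes(x = year, y = cases, color = state)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Optionally, a simple linear model of the outcome.
fit <- lm(cases ~ year + state, data = cleandata)
summary(fit)
```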
Add enough commentary such that your classmate and any reader knows what you are doing (and why).
Once all is done, commit and push to your repo on GitHub (the fork of the original), then initiate a pull request (PR) to the original repository.
The original repository owner should check the PR they received from their colleague, request changes if needed, approve if all looks okay, then merge and update their own repo.
In a final step, update the `_quarto.yml` file and include a menu item for "Data Analysis Exercise" pointing to the new file. Re-create your website and make sure it all works and the new project shows up.
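Depending on how your site navigation is set up, the new entry in `_quarto.yml` might look roughly like this (a sketch; your file may use a sidebar or a different structure):

```yaml
website:
  navbar:
    left:
      - text: "Data Analysis Exercise"
        href: dataanalysis_exercise.qmd
```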
Since this will be part of your portfolio site, and you already posted a link to that previously, you don't need to post anything; I know where to find it. I will assess both the original contribution and the addition made by the second person.
No discussion assignment this week. Instead, submit project part 1.
Submission of part 1 is due. Post a link (URL) to your project repository in the `project_related` Slack channel.