Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
Discussion
Look online and find an example of a research project that provides (or claims to provide) all materials to allow reproduction of results, similar to Dr. Brian McKay’s project I shared with you. If you are able, download the materials and see if you can reproduce things. I suggest you focus on projects that are done with our set of tools (R/Quarto, etc.), but that’s not required. Report the project you found and your experience being able (or not) to reproduce it as a post in the Module 2 Discussion channel. Do so by Wednesday 5pm.
Then take a look at a few of your classmates’ postings and discuss/comment on what they found. Do so by the Friday deadline.
Exercise
For this exercise, you will perform a small toy data analysis that allows you to continue practicing our tools. You will start doing group work.
As a reminder, you are allowed to use AI tools for this - or any other - task in this course. If you do, you should add comments to your R/Quarto files to indicate where and how you used AI.
Group setup
Find your fellow group members and organize yourselves. You can find group assignments in the important-information channel. There is also a dedicated channel for each group. Get in touch with your group members. You will need to exchange GitHub user names. Assign each group member an (arbitrary) number (I’m calling them M1, M2, …).
You will start working in your portfolio repository and finish this part by Wednesday. Then M2 will contribute to M1’s repository, M3 will work on M2’s, etc. M1 will contribute to the repository of the last person in the group (M3/M4/M5, based on the number of people in your group). This way, everyone will work on their own and one other group member’s repository.
Because there are multiple parts to this exercise, the due dates are adjusted.
Part 1
Part 1 is due by Tuesday.
Getting started
Open your portfolio website repository/project. Then open the - currently empty - file for this exercise, called starter-analyis-exercise.qmd.
Documenting well is very important! Add lots of comments to your code/file. I suggest that your code should be more than half comments. For every block of code, you have a few lines of comments at the beginning explaining what the code block does, and then each line of code gets its separate line of comment with more details. Comment on both the how and why of your code. This much commentary might seem overkill initially. But as your code gets more complex, it will be very useful. Both your collaborators, and your future self looking at the code you wrote several weeks ago will be incredibly thankful for your comments!
If you write R code, your comments will be lines that start with #. For Quarto or R Markdown files, you can add comments as Markdown text above/below your code, add comments inside your R code chunks, or both. Doing both is ideal. Positron allows you to quickly turn sections of a document into comments or un-comment them. Just select the part you want to change, and hit Ctrl/Cmd + /. That can be useful for turning code on/off during testing, or for hiding parts of the text that are just meant for you but not for the reader.
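As a rough illustration of the commenting style described above, here is a small sketch on made-up numbers. The object names (heights_cm, heights_m, mean_height_m) and the values are hypothetical and not part of the exercise.

```r
# Block purpose: convert a few made-up height measurements from
# centimeters to meters and compute their mean. The data are invented
# purely to illustrate the commenting style.

# store the raw measurements in cm (hypothetical values)
heights_cm <- c(150, 160, 170)
# convert to meters, assuming later steps expect SI units
heights_m <- heights_cm / 100
# average the converted values; na.rm = TRUE guards against missing entries
mean_height_m <- mean(heights_m, na.rm = TRUE)
# print the result so it shows up in the rendered document
print(mean_height_m)
```

Note that the comments say both what each line does and why it is there.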
Loading and checking data
We’ll look at and play with some data from the dslabs package. Write a code chunk using the library() function that loads the package (install the dslabs package first if you don’t have it yet).
It is good practice to load all packages at the beginning of your code. So if you are using some R package, instead of loading it with library just before you use it, place all your library commands at the beginning of your R script or Quarto file. Also, add a short comment explaining why you are loading a certain package. For a complex project, it might even make sense to list all packages you use in a readme file. You can also use something like the renv package which keeps track of all your packages and makes sure someone running your project at a later time gets exactly the same packages you use. While renv needs a bit of getting used to, and it’s not required for the course, I encourage you to check it out and if you want to, use it.
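A minimal sketch of what the top of a script or Quarto chunk could look like is shown below. The packages stats and graphics ship with R and only stand in for whatever packages your own analysis needs; the object name has_dslabs is hypothetical.

```r
# Load all packages at the top of the file, each with a short reason.
library(stats)     # lm() for the model fits
library(graphics)  # plot() for the figures

# dslabs provides the gapminder data. Check whether it is installed
# without installing anything automatically; installation is better
# done once, interactively, in the console.
has_dslabs <- requireNamespace("dslabs", quietly = TRUE)
if (!has_dslabs) {
  message("dslabs is not installed; run install.packages('dslabs') in the console")
}
```

Keeping the installation step out of the script is a design choice here, so that rendering the document never changes the installed packages as a side effect.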
All the code you write for this (and any other) project should be written into an R or Quarto file, not in the R console. The reason for that is that you want a permanent record of what you did, and the ability to modify and re-run your analysis easily. For this exercise, the best option is to write the code directly into the Quarto document.
We’ll look at the gapminder dataset from dslabs. Once you have installed and loaded the dslabs package, the dataset is available. That is, unlike datasets you get from external sources, those that come with R packages are available right after you load the package. If you want to learn more about the dataset, you can run help(gapminder) or ?gapminder in the R console. This is one instance where an R call should only be done interactively, not inside a file, since you don’t want the help page to open when you run your code in an automated/scripted manner.
Write code to use the str() and summary() functions to take a look at the data. Use the class() function to check what type of object gapminder is. To illustrate how that should look, you should have something like these lines of code and R output so far for this exercise.
# load dslabs package
library("dslabs")
Warning: package 'dslabs' was built under R version 4.5.2
# only run the next command interactively, not in a script
# help(gapminder)
# get an overview of data structure
str(gapminder)
'data.frame': 10545 obs. of 9 variables:
$ country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ infant_mortality: num 115.4 148.2 208 NA 59.9 ...
$ life_expectancy : num 62.9 47.5 36 63 65.4 ...
$ fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
$ population : num 1636054 11124892 5270844 54681 20619075 ...
$ gdp : num NA 1.38e+10 NA NA 1.08e+11 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
$ region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
# get a summary of data
summary(gapminder)
                country           year      infant_mortality  life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50    Min.   :13.20
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00    1st Qu.:57.50
 Angola             :   57   Median :1988   Median : 41.50    Median :67.54
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31    Mean   :64.81
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10    3rd Qu.:73.00
 Armenia            :   57   Max.   :2016   Max.   :276.90    Max.   :83.90
 (Other)            :10203                  NA's   :1453
   fertility        population             gdp               continent
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13
 NA's   :187     NA's   :185         NA's   :2972
             region
 Western Asia   :1026
 Eastern Africa : 912
 Western Africa : 912
 Caribbean      : 741
 South America  : 684
 Southern Europe: 684
 (Other)        :5586
# determine the type of object gapminder is
class(gapminder)
[1] "data.frame"
Processing data
You can accomplish the next steps (and pretty much anything) with just basic R commands and not use additional packages. However, things are often easier with packages. For data processing tasks, the packages from the tidyverse are very useful. You can do the following tasks with any commands/packages you like.
Write code that assigns only the African countries to a new object/variable called africadata. Run str and summary on the new object you created. You should now have 2907 observations, down from 10545. Depending on how you do this, you might also notice that all the different categories are still kept in the continent (and other) variables, but show 0. R does not automatically remove categories of what in R is called a factor variable (a categorical variable), even if they are empty. We don’t have to worry about that just now, but it is something to keep in mind, since it can sometimes lead to strange behavior.
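To illustrate the point about empty factor levels, here is a minimal base R sketch on a made-up data frame. The object names (toy, toy_africa, toy_africa_clean) and all values are invented; the same pattern should carry over to the real data.

```r
# made-up data frame standing in for a larger dataset (values are invented)
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = factor(c("Africa", "Asia", "Africa", "Europe")),
  value     = c(1.2, 3.4, 5.6, 7.8)
)

# keep only the rows whose continent is Africa
toy_africa <- toy[toy$continent == "Africa", ]

# the row count dropped, but the unused factor levels are still there
nrow(toy_africa)
table(toy_africa$continent)

# droplevels() removes the empty levels if you ever need that
toy_africa_clean <- droplevels(toy_africa)
levels(toy_africa_clean$continent)
```

If you prefer the tidyverse, dplyr's filter() does the same row selection; the empty factor levels remain in that case as well.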
Take the africadata object and create two new objects (name them whatever you want), one that contains only infant_mortality and life_expectancy and one that contains only population and life_expectancy. You should have two new objects/variables with 2907 rows and two columns. Use the str and summary commands to take a look at both. Make sure you add comments to your code to explain what each line of code is doing, and as needed, also add explanatory Markdown text to your Quarto file.
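Selecting columns by name could look roughly like the following sketch, again on invented data. The object names toy, toy_im and toy_pop are hypothetical.

```r
# made-up data with the same column names as in the exercise (values invented)
toy <- data.frame(
  infant_mortality = c(100, 80, NA),
  life_expectancy  = c(50, 55, 60),
  population       = c(1e6, 5e6, 2e7)
)

# select two columns by name; the result is still a data frame
toy_im <- toy[, c("infant_mortality", "life_expectancy")]
# same idea for the second pair of columns
toy_pop <- toy[, c("population", "life_expectancy")]

# check structure and basic statistics of the new objects
str(toy_im)
summary(toy_pop)
```

Selecting by name rather than by column position keeps the code readable and robust if the column order ever changes.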
I find it the least confusing to call things which store values in R objects (e.g., x is an object here: x <- 2 + 2) and reserve the word variable for a data variable, i.e., usually a column. However, it is common in programming to also refer to an object as a variable. Because of that, I sometimes use that terminology (inadvertently) too. So if I talk about a variable, you need to determine from the context if I mean a certain variable in the data (e.g. height or weight), or a variable in R (e.g. x or result) that stores some content.
Plotting
Using the new variables you created, plot life expectancy as a function of infant mortality and as a function of population size. Make two separate plots. Plot the data as points. For the plot with population size on the x-axis, set the x-axis to a log scale.
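A scatter plot with a log-scaled x-axis can be sketched in base R as follows. The data frame toy and its values are made up for illustration.

```r
# made-up data (values invented); all populations must be positive
# for a log scale to work
toy <- data.frame(
  population      = c(1e5, 1e6, 1e7, 1e8),
  life_expectancy = c(48, 52, 57, 61)
)

# scatter plot as points; log = "x" puts only the x-axis on a log scale
plot(
  toy$population, toy$life_expectancy,
  log  = "x",
  pch  = 19,
  xlab = "Population (log scale)",
  ylab = "Life expectancy (years)"
)
```

With ggplot2, the corresponding pieces would be geom_point() and scale_x_log10().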
You should see a negative correlation between infant mortality and life expectancy, which makes sense. You should also see a positive correlation between population size and life expectancy. In both plots, especially the second one, you will see ‘streaks’ of data that seem to go together. Can you figure out what is going on here? Take another look at the africadata data we generated, which should give you a hint of what’s happening. Add descriptive text into your Quarto file to explain what you see and why.
More data processing
I’m sure you realized that the pattern we see in the data is due to the fact that we have different years for individual countries, and that over time these countries increase in population size and also life expectancy. Let’s pick only one year and see what patterns we find. We want a year for which we have the most data. You might have noticed that in africadata, there are 226 NA (i.e., missing values) for infant mortality. Write code that figures out which years have missing data for infant mortality. You should find that data is missing up to 1981 and then again for 2016. So we’ll avoid those years and go with 2000 instead. Create a new object by extracting only the data for the year 2000 from the africadata object. You should end up with 51 observations and 9 variables. Check it with str and summary.
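One possible way to find the years with missing values, sketched on a small made-up data frame (the object names toy, is_missing, years_with_na and toy_2000 are hypothetical, and the values are invented):

```r
# made-up data: two countries, three years each, some values missing
toy <- data.frame(
  country          = rep(c("A", "B"), each = 3),
  year             = rep(c(1980, 2000, 2016), times = 2),
  infant_mortality = c(NA, 90, NA, 120, 85, NA)
)

# logical vector marking rows where infant_mortality is missing
is_missing <- is.na(toy$infant_mortality)

# the distinct years that appear among those rows
years_with_na <- sort(unique(toy$year[is_missing]))
years_with_na

# number of missing values per year, as a cross-check
table(toy$year[is_missing])

# keep only the rows for one year that has complete data
toy_2000 <- toy[toy$year == 2000, ]
str(toy_2000)
```

The table() call is useful beyond the list of years, because it also shows how many values are missing in each of them.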
More plotting
Let’s make the same plots as above again, this time only for the year 2000. Based on those plots, there seems to still be a negative correlation between infant mortality and life expectancy, and no noticeable correlation between population size and life expectancy. Let’s apply some statistical model to this data.
Simple model fits
Use the lm function and fit life expectancy as the outcome, and infant mortality as the predictor. Then repeat, now with the population size as the predictor variable. (Use the data from 2000 only.) Save the results from the two fits into two objects (e.g. fit1 and fit2) and apply the summary command to both, which will print various fit results to the screen. Add comments to your Quarto file to explain what you did and found.
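The general shape of such a fit, shown on invented numbers rather than the exercise data, might look like this. The data frame toy and its values are made up; only lm(), summary() and coef() are standard R.

```r
# made-up data (values invented) with a clearly decreasing relationship
toy <- data.frame(
  infant_mortality = c(120, 95, 80, 60, 45, 30),
  life_expectancy  = c(48, 52, 55, 60, 63, 68)
)

# fit a linear model: outcome on the left of ~, predictor on the right
fit1 <- lm(life_expectancy ~ infant_mortality, data = toy)

# print coefficients, standard errors, p-values and R-squared
summary(fit1)

# extract just the estimated intercept and slope
coef(fit1)
```

Passing the data frame through the data argument keeps the formula short and makes it easy to swap in a different predictor for the second fit.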
Sending updates to GitHub
Once you are done with your exercise, render/re-build your portfolio website. Make sure no error messages show up. A preview should show up. Check that the page for this exercise looks the way you want it to. Once you are happy with how everything looks, commit your changes and push to the remote repository on GitHub.com. Check your portfolio website online to make sure you can now see the newly created R exercise document (in addition to your previously created About page).
Handing it off to your classmate
Based on the group setup you did above, tell the classmate who will be working on your project that it’s ready and where to find it. Assuming you are M1, you would tell M2 that things are ready. At the same time, you should be notified by another classmate that their repository is ready for you (again, if you happen to be M1, it would be the last person in the group, say M4 or M5).
Part 2
Part 2 is due by Thursday.
Adding to each other’s work
Once you’ve done your first part, you’ll contribute to another group member’s project.
You will work on your classmate’s repository using what is called the fork and pull-request workflow. The basic idea is as follows. First, you make a copy of someone’s GitHub repository. In GitHub terminology, that is called doing a fork of their repository. You can do that for any public repository.
Next, you implement your improvements in the forked repository. Once you are done, you ask the owner of the original repository to incorporate the updates you made in the fork into their main repository. This last part is called issuing a pull request. You are requesting that the other person pull your changes into their repository, hence the at times confusing (at least for me) terminology. I prefer to think of them as merge requests or sync requests, i.e. you are requesting that they merge or sync your changes into their repository. You’ll find the terminology merge request is used at times. If the person who controls the main repository likes your changes, they will merge your fork into the main branch. And just like that, you have contributed to some project becoming better! We will practice this fork and pull flow now.
Find the repository of the team member you will contribute to and fork their repository on GitHub.com. This places a copy of their repository into your online GitHub account. This fork is now your own repository. You have it forever, even if the person who owns the original repository deletes theirs.
Next, clone your fork to your local machine, as you have done previously with your own repositories. Once you have finished cloning, open the repository in Positron and find the file for this exercise. Open it and make sure it runs/renders. Then add your part at the bottom of the file.
Start off by adding a comment that says something like This section contributed by YOURFULLNAME. This needs to be there for me to be able to grade your contribution.
More data exploration
Pick another dataset from dslabs, whichever one you want. Unfortunately, the dslabs package doesn’t have a nice website. But you can go to its official CRAN entry and click on Reference Manual. The pdf lists the various other datasets and provides a brief explanation for each.
Once you have chosen one of the datasets, write R code to go through similar steps as above. Specifically, do the following:
Explore the dataset.
Do any processing/cleaning you want to do.
Make a few exploratory figures. Optionally, also some tables.
Run some simple statistical model(s). Your choice.
Report the results from the model(s).
For each step, add plenty of comments to the code and explanatory Markdown text to the Quarto file.
Sending a pull request (PR)
Once you are done with your additions, make sure the whole website renders. Then commit and push your updates to your fork on GitHub. Note that this pushes to your fork (i.e. copy) of the repository.
Now it’s time to offer your contribution to your classmate to integrate into their repository.
This is the pull request step described earlier: you are asking the owner of the original repository to merge your changes into their repository.
To open a pull request, go to your fork of the repo on GitHub. At the top, you should see something like This branch is N commits ahead of NNN:main. and next to it a Contribute button. Click on that button and choose Open pull request.
You will be taken to a page where you can provide a title and description of your changes. Be clear about what you changed and why.
Hopefully, you’ll see a green check-mark that says able to merge. If your classmate made changes to the same files you did, it could have created a merge conflict. Hopefully, this won’t be the case. If it is, you might want to put the file that has been edited by both into a safe location outside of your repository, then pull the latest version of the original repository into your local copy, re-apply your changes, commit and push again.
In either case (merge conflict or not), you can click the green button to create the pull request. This should send a notification to the owner of the repository that a pull request was created.
Part 3
Part 3 is due by Friday.
Merging the PR
Once you receive a pull request notification from your teammate, go to the GitHub site for your repository. Click on Pull requests, then click on the request. Take a look. On the first page, it shows you their message and if there are conflicts with your version of the repository. Hopefully, you didn’t change things around while they did, so there shouldn’t be any conflicts. Click on the Files changed button, which will show you an overview of the code they changed. Removals are red, and additions are green.
On the main pull request site, you can do various things. If you don’t like the suggested edits, you can write a comment and close the pull request without merging their changes into your repository. If you like most of what they did, but there is something they need to adjust, write a comment and let them know. Close the pull request and ask them to send a new one. If you are ok with their changes (hopefully, this is the case here), you can merge the pull request and close it. Their updates are now part of your repository.
Once you have finished merging their updates into your repository online, make sure to pull the latest version to your local computer. Check that everything looks fine and runs.
Make any further needed changes/updates. Re-render everything. Once everything is good, push your updated portfolio repository to GitHub. Check the website to make sure everything looks good. I’ll be looking at the content shown on the website for assessment/grading purposes.
Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything. I know where to find it.