Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Exercise

This project asks you to write code to do a simple data analysis. This also involves group work, and of course GitHub.

The first part of the project is due by Wednesday, so your classmate can do their part before the Friday deadline.

Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything, I know where to find it. Therefore there is no exercise Slack channel for this module.

Group setup

Assign each member in your group an (arbitrary) number (I’m calling them M1, M2, …). For this exercise, everyone will first work on their own and finish this part by Wednesday. Then M1 will contribute to M2’s repository, M2 will work on M3, etc. The last person (M3/M4/M5, based on the number of people in our group), will work on M1’s repository. This way, everyone will work on their own and one group member’s repository. Details are given below.

Getting started

Open your portfolio website project in RStudio. Then open the file for the coding exercise, coding_exercise.qmd.

Documenting well is very important! Add lots of comments to your code/file. I suggest that your code should be more than half comments. For every block of code, you have a few lines of comments at the beginning explaining what the code block does, and then each line of code gets its separate line of comment with more details. Comment on both the how and why of your code. This much commentary might seem overkill initially. But as your code gets more complex, it will be very useful. Both your collaborators, and your future self looking at the code you wrote several weeks ago will be incredibly thankful for your comments!

If you write R code, your comments will be lines that start with #. For QUarto or R Markdown files, you can either add comments as Markdown text above/below your code, and/or add comments inside your R code chunks. Both is ideal. R Studio allows you to quickly turn sections of a document into comments or un-comment them (In the Code section of the R Studio menu). That can be useful for turning on/off code during testing, or hiding some parts of text that’s just meant for you but not for the reader.

Loading and checking data

We’ll look at and play with some data from the dslabs package. Write a code chunk using the library() function that loads the package (install the dslabs package first if you don’t have it yet).

It is good practice to load all packages at the beginning of your code. So if you are using some R package, instead of loading it with library just before you use it, place all your library commands at the beginning of your R script or Quarto file. Also, add a short comment explaining why you are loading a certain package. For a complex project, it might even make sense to list all packages you use in a readme file. Even better is to use something like the renv package which keeps track of all your packages and makes sure someone running your project at a later time gets exactly the same packages you use. While I’m not requiring renv for the course, I encourage you to check it out and if you are up for it, use it for the class exercises and the project.

All the code you write for this (and any other) project should be written into an R or Quarto file, not in the R console. The reason for that is that you want a permanent record of what you did, and the ability to modify and re-run your analysis easily. For this exercise, you can either write the code directly into the Quarto document, or if you prefer the setup of a separate R script and pulling code into the Quarto document (see the data analysis template for an example), you can do it that way too.

We’ll look at the gapminder dataset from dslabs. Write a code chunk using the help() function that pulls up the help page for the data to see what it contains. Then use the str() and summary() functions to take a look at the data. Use the class() function to check what type of object gapminder is.

To illustrate how that should look, you should have these lines of code so far for this exercise.

#load dslabs package
library("dslabs")
#look at help file for gapminder data
help(gapminder)
#get an overview of data structure
str(gapminder)
#get a summary of data
summary(gapminder)
#determine the type of object gapminder is
class(gapminder)

Processing data

You can accomplish the next steps (and pretty much anything) with just basic R commands and not using extra functionality from packages. However, things are often easier with packages. For data processing tasks, the packages from the tidyverse are very useful. You can do the following tasks with any commands/packages you like.

Write code that assigns only the African countries to a new object/variable called africadata. Run str and summary on the new object you created. You should now have 2907 observations, down from 10545. Depending on how you do this, you might also notice that all the different categories are still kept in the continent (and other) variables, but show 0. R does not automatically remove categories of what in R is called a factor variable (a categorical variable) even if they are empty. We don’t have to worry about that just now, but something to keep in mind, it can sometimes lead to strange behavior.

Take the africadata object and create two new objects (name them whatever you want), one that contains only infant_mortality and life_expectancy and one that contains only population and life_expectancy. You should have two new objects/variables with 2907 rows and two columns. Use the str, and summary commands to take a look at both.

I find it the least confusing to call things which store values in R objects (e.g., x is an object here: x <- 2 + 2) and reserve the word variable for a data variable, i.e., usually a column. However, it is common in programming to refer to an object as a variable as well. Because of that, I sometimes use that terminology (inadvertently) too. So if I talk about a variable, you need to determine from the context if I mean a certain variable in the data (e.g. height or weight), or a variable in R (e.g. x or result) that stores some content.

Plotting

Using the new variables you created, plot life expectancy as a function of infant mortality and as a function of population size. Make two separate plots. Plot the data as points. For the plot with population size on the x-axis, set the x-axis to a log scale.

You should see a negative correlation between infant mortality and life expectancy, which makes sense. You should also see a positive correlation between population size and life expectancy. In both plots, especially the second one, you will see ‘streaks’ of data that seem to go together. Can you figure out what is going on here? Take another look at the africadata data we generated, which should give you a hint of what’s happening.

More data processing

I’m sure you realized that the pattern we see in the data is due to the fact that we have different years for individual countries, and that over time these countries increase in population size and also life expectancy. Let’s pick only one year and see what patterns we find. We want a year for which we have the most data. You might have noticed that in africadata, there are 226 NA (i.e., missing values) for infant mortality. Write code that figures out which years have missing data for infant mortality. You should find that there is missing up to 1981 and then again for 2016. So we’ll avoid those years and go with 2000 instead. Create a new object by extracting only the data for the year 2000 from the africadata object. You should end up with 51 observations and 9 variables. Check it with str and summary.

More plotting

Let’s make the same plots as above again, this time only for the year 2000. Based on those plots, there seems to still be a negative correlation between infant mortality and life expectancy, and no noticeable correlation between population size and life expectancy. Let’s apply some statistics to this data.

A simple fit

Use the lm function and fit life expectancy as the outcome, and infant mortality as the predictor. Then use the population size as the predictor. (Use the data from 2000 only.) Save the result from the two fits into two objects (e.g. fit1 and fit2) and apply the summary command to both, which will print various fit results to the screen. Based on the p-values for each fit, what do you conclude?

Sending updates to Github

Once you are done with your exercise, re-build your website. Make sure no error messages show up. A preview should show up, check that the page for this exercise looks the way you want it to. Once you are happy with how everything looks, close RStudio. Go to GitKraken, commit your changes, and push to the remote repository. Check your portfolio website online to make sure you can now see the newly created R exercise document (in addition to your previously created About page). Of course, at any point, feel free to make further enhancements to your portfolio website.

Improving each other’s work

Once everyone has done their part (by Wednesday at the latest), you’ll contribute to another group member’s project.

Discussion

Describe one thing you learned in this module about R that you find rather annoying or confusing or dumb, and one thing about R that you find interesting or cool or awesome.

Then comment/reply to your classmates posts.