One of the strengths of R (and also a source of confusion) is that it is very flexible and almost always lets you do things in more than one way. R itself comes with some functionality. This is often referred to as base R. Even with just this basic functionality, there are often many ways to accomplish a task. But the real power of R comes from its many packages. Packages (also called libraries in some other programming languages) contain additional functions and functionality that let you fairly easily do things that would require a ton of coding effort if you tried to do them yourself. Someone basically wrote the functionality for you, and you can use it.
While there are tons of packages available, some are very commonly used. For data analysis tasks, the folks from RStudio have developed many packages that are very useful. One such set of packages, the most widely used set, is called the tidyverse. By using those packages, a lot of coding applied to data analysis becomes easier, more readable, and more powerful. We will use the tidyverse packages and their functionality a lot. That said, knowing some base R is very useful. In general, you can fairly easily mix and match.
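As a small illustration of doing the same task in more than one way, here is a sketch that subsets the built-in mtcars data first with base R and then with dplyr (one of the tidyverse packages). It assumes dplyr is installed; the cutoff of 25 mpg and the chosen columns are arbitrary and only serve as an example.

```r
library(dplyr)  # assumes dplyr is installed, e.g. via install.packages("tidyverse")

# Base R: keep cars with more than 25 mpg, then keep two columns
base_result <- mtcars[mtcars$mpg > 25, c("mpg", "cyl")]

# tidyverse (dplyr): the same task with filter() and select()
tidy_result <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, cyl)

# Mixing and matching: a base R function applied to the dplyr result
mean(tidy_result$mpg)
```

Both versions return the same rows and columns. Which one you find easier to read is partly a matter of taste and habit.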
The concept of the tidyverse can be traced back to the concept of tidy data, which Hadley Wickham introduced in this article. Give the article a quick read to get the overall idea. You will see it a lot, so it’s good to be familiar with it.
The R packages developed by Hadley and others were eventually grouped together and now go by the name “tidyverse”. If you want to learn more about the principles of those packages, you can read this short manifesto written by Hadley. Some of what he writes might not be fully understandable to you (e.g., functional programming or pipes), but you’ll get the overall idea. For our purpose, the important aspect to remember is that the tidyverse is a collection of R packages that are all structured similarly (from a user perspective), play nicely with each other, and help you in your various analysis tasks as you go from messy data to data that is tidied up and ready for formal analysis.
Note that there are many more R packages that are not part of the core tidyverse, but that still follow the same principles and work nicely with other tidyverse packages.
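To make the idea of tidy data a bit more concrete (each variable is a column, each observation is a row), here is a minimal sketch using pivot_longer() from the tidyr package. The data frame, its column names, and its values are made up for illustration and do not come from any real data set.

```r
library(tidyr)  # assumes tidyr (version 1.0.0 or later) is installed

# Hypothetical "messy" data: the years are spread across column names
messy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 25),
  check.names = FALSE  # keep "2019" and "2020" as column names
)

# Tidy version: one row per country-year combination
tidy <- pivot_longer(
  messy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

print(tidy)
```

In the tidy version, year and cases are each stored in their own column, which is the layout most tidyverse functions expect.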
To learn more and practice some of the tidyverse functionality, I suggest you go through the Work with Data and Tidy your Data sections of the RStudio primers. More or less the same content, presented a bit differently and non-interactively, can be found in the Tidy data chapter in R4DS. This might be a good reference for looking things up. If you want to practice and learn more, read the tidyverse, Introduction to data wrangling, Reshaping data, and Joining tables chapters of IDS. If you want some more data tidying practice, check out this tutorial by Garrett Grolemund or this short blog post by Joachim Goedhart.
These are a lot of resources, and I don’t expect you to work through all of them in detail. I do suggest you take a quick look at all of them, and then work through some of them based on your knowledge level. As we progress through the course and you are asked to do a lot of these tasks, you will likely want to re-visit these materials. And as always, Google is your friend.
As you’ll find out shortly, one feature of R code written in the tidyverse style is the heavy use of the magrittr pipe operator (the %>% symbol). For instance, this is the kind of code that you might see when doing data wrangling. Note that this code doesn’t run as written, since it only shows the chain of functions; for it to work, arguments would need to be provided to each function, i.e., there needs to be something inside each pair of parentheses:
data %>% filter() %>% select() %>% fct_lump() %>% mutate() %>% ...
The idea is that you pipe the results from one operation into the next, and thus potentially build a long chain of commands. That style of coding often makes it quite easy to understand what the code is doing. For instance, in the example code above, you first filter the data based on some row values, then select some columns, then lump together some levels of a factor variable, then mutate a variable into a new one, and so on.
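Here is one way the chain above could look once arguments are filled in. This is only a sketch: the data frame dat and its columns species, weight_kg, and owner are hypothetical and made up for illustration. Also note that fct_lump() works on a factor (a single column) rather than on a whole data frame, so in runnable code it is typically called inside mutate().

```r
library(dplyr)
library(forcats)

# Hypothetical example data, made up for illustration
dat <- data.frame(
  species = c("cat", "cat", "cat", "dog", "dog", "bird", "fish", NA),
  weight_kg = c(4.1, 3.8, 5.0, 20.5, 18.2, 0.1, 0.3, 2.0),
  owner = c("a", "b", "c", "d", "e", "f", "g", "h")
)

result <- dat %>%
  filter(!is.na(species)) %>%                     # keep rows with a known species
  select(species, weight_kg) %>%                  # keep only two columns
  mutate(species = fct_lump(species, n = 2)) %>%  # keep the 2 most common levels, lump the rest into "Other"
  mutate(weight_g = weight_kg * 1000)             # create a new variable from an existing one

print(result)
```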
The problem, especially when you start out, is that things can (and will) go wrong at some of those steps, and it’s difficult to figure out where the problem is. At least when starting out, it is in my opinion often better to save the result of each cleaning operation as a new intermediate variable. That lets you more easily check for bugs, see how the data changed from step to step, and confirm that each step does what you think it should. So instead of using a long chain of pipes, you can write the code like so:
dat_small1 <- data %>% filter()
dat_small2 <- dat_small1 %>% select()
dat_small3 <- dat_small2 %>% fct_lump()
...
This code is not quite as easy to read, and it creates all these additional variables that you might not want or need. But I think at least as you are learning the different tidyverse functions, it often helps to be able to inspect what happens at each step, and thus more easily spot when things go wrong. Once you get more comfortable with cleaning steps and coding in general, and make fewer mistakes, you can start chaining things together. But if you start out writing code that way, it’s much harder to follow along and find bugs.
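The step-by-step approach could look like the following sketch, with a quick check after each step. As before, the data frame dat and its columns are hypothetical, and the specific checks (nrow(), names(), table()) are just examples of things you might want to look at.

```r
library(dplyr)
library(forcats)

# Hypothetical example data, made up for illustration
dat <- data.frame(
  species = c("cat", "cat", "cat", "dog", "dog", "bird", "fish", NA),
  weight_kg = c(4.1, 3.8, 5.0, 20.5, 18.2, 0.1, 0.3, 2.0),
  owner = c("a", "b", "c", "d", "e", "f", "g", "h")
)

# Step 1: remove rows with a missing species, then check how many rows are left
dat_small1 <- dat %>% filter(!is.na(species))
nrow(dat_small1)

# Step 2: keep two columns, then check which columns remain
dat_small2 <- dat_small1 %>% select(species, weight_kg)
names(dat_small2)

# Step 3: lump rare species together, then check the new level counts
dat_small3 <- dat_small2 %>% mutate(species = fct_lump(species, n = 2))
table(dat_small3$species)
```

If one of the checks shows something unexpected, you know which step to look at, and the earlier intermediate objects are still available for inspection.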
There is also a nice project called Tidy Data Tutor that allows you to visualize the different steps in a data analysis chain of commands. Once you get used to chaining commands together in a long pipeline, or try to inspect someone else’s code, using that tool to look at each step can be quite useful. Give it a try.
There are lots of other useful R packages that make your coding life easier. We’ll make liberal use of them throughout this course. It is highly likely that for some of the course exercises or the course project, you will be using other packages that you find helpful. Being able to find packages that do what you need is important.
The quality of packages varies. In general, if they are on Bioconductor, they are reasonably stable. Packages that involve RStudio, or that are otherwise widely used and have many developers, tend to be tested fairly well. Other packages might be more experimental, and while they might work well, there might also be bugs. So always carefully inspect the results you get to make sure things are reliable.