Tidyverse and Friends

Author

Andreas Handel

Modified

2024-03-20

Overview

One of the strengths of R (and also a source of confusion) is that it is very flexible and almost always lets you do things in more than one way. R itself comes with some functionality. This is often referred to as base R. Even with just this basic functionality, there are often many ways to accomplish a task. But the real power of R comes from its many packages. Packages (also called libraries in some other programming languages) contain additional functions and functionality that lets you fairly easily do things that would require a ton of coding effort if you tried to do it yourself. Someone basically wrote the functionality for you, and you can use it.

While there are tons of packages available, some are very commonly used. For data analysis tasks, the folks from R Studio have developed many packages that are very useful. One such set of packages, the most widely used set, is called the tidyverse. By using those packages, a lot of coding applied to data analysis becomes easier, more readable, and more powerful. We will use the tidyverse packages and their functionality a lot. That said, knowing some base R is very useful. In general, you can fairly easily mix and match.

Learning Objectives

Know what the tidyverse is
Be able to apply tidyverse code to data wrangling problems
Be able to use R packages for specific needs

Tidy data

The concept of the tidyverse can be traced back to the concept of tidy data, which Hadley Wickham introduced in his Tidy Data paper. Give the article a quick read to get the overall idea. You will see the idea of tidy data show up repeatedly throughout this course.

Tidyverse philosophy

The R packages developed by Hadley and others were eventually grouped together and now go by the name tidyverse. If you want to learn more about the principles of those packages, you can read this short manifesto written by Hadley. Some of what he writes might not be fully understandable to you (e.g., functional programming or pipes), but you should get the overall idea. For our purpose, the important aspects to remember is that the tidyverse is a collection of R packages that are all structured similarly (from a user perspective). They play nice with each other, and help you in your various analysis tasks as you go from messy data to data that is tidied up and ready for formal analysis.

Note that there are many more R packages that are not part of the core tidyverse, but that still follow the same principles and nicely work with other tidyverse packages.

Learning the Tidyverse

To learn more and practice some of the tidyverse functionality, I suggest you look at the Tidy Data chapter in R4DS. More or less the same content, presented a bit differently, can be found in some of the entries in the Work with Data and Tidy your Data sections of the Posit Recipes.

If you want to practice and learn more, read the Tidyverse, Introduction to Data Wrangling, Reshaping Data and Joining Tables chapters of IDS. If you want some more data tidying practice, check out Garrett Grolemund’s tutorial or Joachim Goedhart’s short blog post.

These are a lot of resources, and I don’t expect you to work through all of them in detail. I do suggest you take a quick look at all of them, and then work through some of them based on your knowledge level. As we progress through the course and you are asked to do a lot of these tasks, you will likely want to re-visit these materials. And as always, Google is your friend.

Tidyverse concepts

As you’ll find out shortly, one feature of R code written in the tidyverse style is the heavy use of the pipe operator. The original pipe operator, the %>% symbol, was introduced in the magrittr package. Since then, base R got its own pipe operator, which is the symbol |>.

For instance, this is the kind of code that you might see when doing data wrangling (note that this code doesn’t work since it only shows the chain of functions, for it to work there would need to be arguments provided to each function, i.e. there needs to be something inside the ()):

data %>% filter() %>% 
         select() %>%
         fct_lump() %>%
         mutate() %>% ...

The idea is that you pipe the results from one operation into the next, and thus potentially build a long chain of commands. That style of coding makes it often quite easy to understand what the code is doing. For instance in the example code above, you first filter the data based on some row values, then select some columns, then combine some factor variable, then mutate a variable into a new one, and so on.

The problem, especially when you start out, is that things can (and will) go wrong at some of those steps, and it’s difficulty to figure out where the problem is. At least when starting out, it is in my opinion often better to save the result of some cleaning operation as a new intermediate variable. That lets you more easily check for bugs, and to see how the data changed from step to step and if it does what you think it should. So instead of using a long chain of pipes, you can write the code like so:

dat_small1 <- data %>% filter()
dat_small2 <- dat_small1 %>% select()
dat_small3 <- dat_small2 %>% fct_lump()
...

This code is not quite as easy to read, and it creates all these additional variables that you might not want or need. But I think at least as you are learning the different tidyverse functions, it often helps to be able to inspect what happens at each step, and thus more easily spot when things go wrong. Once you get more comfortable with cleaning steps and coding in general, and make few mistakes, you can start chaining things together, and make your chains longer. But if you start out writing code that way, it can be harder to follow along and find bugs.

A nice introduction to pipes is this section of the R4DS book.

There is also a nice project called Tidy Data Tutor that allows you to visualize the different steps in a data analysis chain of commands. Once you get used to chaining commands together in a long pipeline, or try to inspect someone else’s code, using that tool to look at each step can be quite useful. Give it a try.

Beyond the Tidyverse

There are lots of other useful R packages that make your coding life easier. We’ll make liberal use of them throughout this course. Later in the course, we’ll explore another set of packages called tidymodels. It is highly likely that for some of the course exercises or the course project, you will be using other packages that you find helpful. You are allowed and encourage to use whatever packages you find useful. Being able to find packages that do what you need is important.

Note about R packages (repeat)

As mentioned in one of the introductory pages, the quality of packages varies. In general, if they are on CRAN or bioconductor, they passed some quality checks. That does however not mean that the functions do the right thing, just that they run. Other packages might be more experimental, and while they might work well, there might also be bugs. In general, packages that are used by many people, packages that involve people who work at R-centric companies (e.g., Posit), and packages that have many developers/contributors and are actively maintained are good signs that it’s a stable and reliable package. That said, there are many packages that are developed by a single person and are only available from GitHub, and they are still very good packages. Ideally, for a new package, test it and see if it does things stably and correctly. If yes, you can start using it. Just always carefully inspect the results you get to make sure things are reliable. If at some point, you work with R in a non-academic setting and you might use R packages for jobs that need to run reliably for many years to come, choosing packages might be a bit more tricky and require more thought. For an academic/research setting, it’s usually ok to use almost any package, as long as it seems reliable and works.