R tidyverse, tidymodels and more

Overview

This unit briefly discusses collections of R packages that are especially suitable for modeling and data science work.

Goals

  • Be familiar with the tidyverse set of R packages.
  • Be familiar with the tidymodels set of R packages.
  • Know about other useful R packages for modeling and data science.

Reading

In the prior unit, we discussed R packages, what they are and how to install them. This unit focuses on a few collections of R packages that are widely used for modeling and data science work.

tidyverse R packages

Perhaps the most widely used collection of R packages is the tidyverse. Its packages share common design philosophies and are designed to work together seamlessly. If you want to learn more about the principles behind those packages, you can read this short manifesto written by Hadley Wickham. Some of what he writes might not be fully understandable to you yet (e.g., functional programming or pipes), but you should get the overall idea. For our purposes, the important point is that the tidyverse packages are all structured similarly from a user perspective. They play nicely with each other and help you with your various analysis tasks as you go from messy data to data that is tidied up and ready for formal analysis. In general, packages inside the tidyverse focus on data cleaning, processing and visualization.

To install all the core packages of the tidyverse, you can use the command install.packages("tidyverse"). We don’t recommend doing this, because you automatically install a lot of packages that you might never use. Instead, we recommend installing only the specific tidyverse packages that you need. This also makes it clearer to you which functionality comes from which package. Keeping package use to a minimum also reduces the chance of package conflicts and improves reproducibility.
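A minimal sketch of that targeted approach is shown below. The package names (dplyr, ggplot2) are only examples, and missing_packages() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper (not part of any package): report which of the
# requested packages are not yet installed on this machine.
# requireNamespace() returns FALSE instead of raising an error when a
# package is unavailable.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example choice of packages; replace with the ones you actually need.
needed <- c("dplyr", "ggplot2")
to_install <- missing_packages(needed)

# Only install in an interactive session, so that running a script
# never triggers an unexpected download.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Once a package is installed, you can load it with library(dplyr), or call a single function as dplyr::filter(), which makes it explicit which package the function comes from.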

tidyverse concepts

As you’ll find out shortly, one feature of R code written in the tidyverse style is the heavy use of the pipe operator. The original pipe operator, %>%, was introduced in the magrittr package. Since version 4.1.0, base R has its own pipe operator, |>.
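As a small illustration of what a pipe does, the sketch below computes the same quantity twice, once with nested function calls and once with the base pipe. It uses only base R and assumes R 4.1.0 or newer; the vector values is made up for the example.

```r
# Made-up example values.
values <- c(1, 4, 9, 16)

# Nested calls: read from the inside out.
nested <- round(mean(sqrt(values)), 1)

# Piped calls: read from left to right. Each result becomes the
# first argument of the next function.
piped <- values |> sqrt() |> mean() |> round(1)

# Both versions give the same result.
identical(nested, piped)
```

The magrittr pipe would be used the same way here, with %>% in place of |>, after loading a package that provides it.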

For instance, this is the kind of code that you might see when doing data wrangling. Note that this code doesn’t run as written, since it only shows the chain of functions. For it to work, each function would need arguments, i.e., there needs to be something inside the ():

data %>%
  filter() %>%
  select() %>%
  fct_lump() %>%
  mutate() %>% ...

The idea is that you pipe the result of one operation into the next, and thus potentially build a long chain of commands. That style of coding often makes it quite easy to understand what the code is doing. For instance, in the example code above, you first filter the data based on some row values, then select some columns, then lump together levels of a factor variable, then mutate a variable into a new one, and so on.
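To make the chain concrete, here is one way it could look with arguments filled in. This is a sketch under several assumptions: the data frame animals and its columns are invented for illustration, the dplyr and forcats packages are assumed to be installed, and fct_lump() is applied inside mutate() because it operates on a single factor column rather than on a whole data frame.

```r
library(dplyr)
library(forcats)

# Invented example data.
animals <- data.frame(
  species   = c("dog", "dog", "dog", "cat", "cat", "bird", "fish"),
  weight_kg = c(30, 25, NA, 4, 5, 0.1, 0.5),
  owner     = c("a", "b", "c", "d", "e", "f", "g")
)

cleaned <- animals %>%
  filter(!is.na(weight_kg)) %>%                  # drop rows with missing weight
  select(species, weight_kg) %>%                 # keep only the columns we need
  mutate(species = fct_lump(species, n = 2)) %>% # lump rare species into "Other"
  mutate(weight_g = weight_kg * 1000)            # derive a new variable

cleaned
```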

The problem, especially when you start out, is that things can (and will) go wrong at some of those steps, and it’s difficult to figure out where the problem is. At least when starting out, it is in my opinion often better to save the result of each cleaning operation as a new intermediate variable. That lets you more easily check for bugs, see how the data changed from step to step, and confirm that each step does what you think it should. So instead of using a long chain of pipes, you can write the code like so:

dat_small1 <- data %>% filter()
dat_small2 <- dat_small1 %>% select()
dat_small3 <- dat_small2 %>% fct_lump()
...

This code is not quite as easy to read, and it creates additional variables that you might not want or need. But at least while you are learning the different tidyverse functions, it often helps to be able to inspect what happens at each step, and thus more easily spot when things go wrong. Once you get more comfortable with cleaning steps and coding in general, and make fewer mistakes, you can start chaining things together and make your chains longer. If you write long chains from the start, it can be harder to follow along and find bugs.
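A runnable sketch of this step-by-step style, using the mtcars data set that ships with R, might look as follows. The variable names and the particular checks are illustrative choices, and dplyr is assumed to be installed.

```r
library(dplyr)

# Step 1: keep only the four-cylinder cars, then inspect the result.
cars_step1 <- mtcars %>% filter(cyl == 4)
nrow(cars_step1)                        # how many rows are left?
stopifnot(all(cars_step1$cyl == 4))     # did the filter do what we expect?

# Step 2: keep two columns and check which ones remain.
cars_step2 <- cars_step1 %>% select(mpg, wt)
names(cars_step2)

# Step 3: add a new column (wt is recorded in units of 1000 lbs).
cars_step3 <- cars_step2 %>% mutate(wt_kg = wt * 453.592)
head(cars_step3)
```

Because each intermediate result has its own name, you can rerun and inspect any single step without rerunning the whole chain.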

There is a nice project called Tidy Data Tutor that allows you to visualize the different steps in a data analysis chain of commands. Once you get used to chaining commands together in a long pipeline, or try to inspect someone else’s code, using that tool to look at each step can be quite useful.

tidymodels R packages

The tidymodels collection of R packages is very similar in concept and philosophy to the tidyverse set of packages, but it is designed for modeling and machine learning tasks. Similar to the tidyverse, the packages that make up tidymodels share common design philosophies and are intended to work together seamlessly.

The same recommendation applies here as for the tidyverse: Instead of installing the full tidymodels collection using install.packages("tidymodels"), we recommend installing only the specific tidymodels packages that you need.
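To give a flavor of the tidymodels style, the sketch below loads only two tidymodels packages, rsample for splitting data and parsnip for specifying and fitting a model. The choice of data set (mtcars), formula, split proportion and seed are assumptions made for illustration, not a recommended analysis.

```r
library(rsample)
library(parsnip)

# Split the data into training and test sets (seed chosen arbitrarily).
set.seed(123)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Specify a linear regression model, choose the "lm" engine, and fit it.
lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt, data = car_train)

# Predict on the held-out data; the result is a data frame with a
# column named .pred.
car_pred <- predict(lm_fit, new_data = car_test)
head(car_pred)
```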

Other useful R packages

In addition to the tidyverse and tidymodels packages, there are many other R packages that are useful for modeling and data science work. Some of these include:

  • data.table: A package for fast data manipulation and aggregation.
  • shiny: A package for building interactive web applications directly from R.
  • mlr3: A comprehensive framework for machine learning in R.

In general, it is useful to focus on well-documented and widely used packages, as they are more likely to be reliable and have a strong user community for support.

Coding style comment

The tidyverse way of writing code is different from base R coding style in many ways. The good news is that you can mix and match both styles. It’s generally a good idea to keep a consistent coding style within a project, but it is not strictly required. If deviations are necessary, for instance because you need a package with a different coding style philosophy, you can easily switch styles within the same code.
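As an illustration of mixing styles, the sketch below performs the same subsetting in base R and with dplyr, and then combines the two in one chain. It uses the built-in mtcars data and assumes dplyr is installed; the variable names are arbitrary.

```r
library(dplyr)

# Base R style: logical indexing with [rows, columns].
base_result <- mtcars[mtcars$cyl == 4, c("mpg", "wt")]

# tidyverse style: dplyr verbs chained with a pipe.
tidy_result <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt)

# Mixed style: dplyr verbs for subsetting, then a base R function
# for the column means.
mixed_means <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt) %>%
  colMeans()

# Both subsetting approaches select the same values.
all.equal(base_result$mpg, tidy_result$mpg)
```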

If you use AI to write code, it is often useful to specify which packages you want it to use, and whether you want it to follow a specific coding style.

Summary

In this unit, we briefly discussed the tidyverse and tidymodels collections of R packages, as well as other useful R packages for modeling and data science.

Further Resources

Test yourself

Why might you install only the specific tidyverse packages you need instead of the full bundle?

Installing just the packages you need keeps installs lighter and avoids extra dependencies while still letting you use tidyverse tools.

How does the base pipe |> relate to the magrittr pipe %>%?

Both pipes forward the left-hand result into the next function; %>% comes from magrittr, while |> is now available in base R.

What is the focus of the tidymodels collection?

Tidymodels is a set of packages for modeling and machine learning, designed to work with tidyverse-friendly syntax and conventions.

Practice

  • Install and load one tidyverse package you need (e.g., dplyr), then run ?mutate and try the first example.
  • Rewrite a simple data transformation twice: once with base R and once with the base pipe |> and tidyverse verbs; compare readability.
  • Use filter() and select() on a small data frame to practice chaining with |>.
  • Read the tidymodels.org “Get started” page and identify one modeling package you might try for a future project.