One of the strengths of R (and also a source of confusion) is that it is very flexible and almost always lets you do things in more than one way. R itself comes with some functionality, often referred to as base R. Even with just this basic functionality, there are often many ways to accomplish a task. But the real power of R comes from its many packages. Packages (also called libraries in some other programming languages) contain additional functions and functionality that let you fairly easily do things that would require a ton of coding effort if you tried to do them yourself. Someone basically wrote the functionality for you, and you can use it.
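As a small illustration of the "more than one way" point, here are two base R ways to pull the same subset out of the built-in mtcars dataset (this particular example is mine, just a sketch, not something from the resources linked below):

mtcars[mtcars$cyl == 4, c("mpg", "cyl")]        # indexing with [ ]
subset(mtcars, cyl == 4, select = c(mpg, cyl))  # the subset() function

Both lines return the same rows and columns; which one you prefer is largely a matter of taste.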
While there are tons of packages available, some are very commonly used. For data analysis tasks, the folks from RStudio (the company is now called Posit) have developed many packages that are very useful. One such set of packages, the most widely used one, is called the tidyverse. By using those packages, a lot of coding applied to data analysis becomes easier, more readable, and more powerful. We will use the tidyverse packages and their functionality a lot. That said, knowing some base R is very useful, and in general you can fairly easily mix and match base R and tidyverse code.
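For instance, once the tidyverse is installed (with install.packages("tidyverse"), done once) and loaded, base R and tidyverse code can sit right next to each other. This is just a quick sketch using the built-in mtcars data:

library(tidyverse)

mean(mtcars$mpg)                             # base R
mtcars %>% summarize(mean_mpg = mean(mpg))   # tidyverse (dplyr)

Both compute the average miles per gallon; the first returns a single number, the second a small data frame.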
(Figure: applying tidyverse code to data wrangling problems.)
The concept of the tidyverse can be traced back to the concept of tidy data, which Hadley Wickham introduced in this article. Give the article a quick read to get the overall idea. You will see it a lot, so it’s good to be familiar with it.
The R packages developed by Hadley and others were eventually grouped together and now go by the name “tidyverse”. If you want to learn more about the principles behind those packages, you can read this short manifesto written by Hadley. Some of what he writes might not be fully understandable to you (e.g., functional programming or pipes), but you’ll get the overall idea. For our purpose, the important aspect to remember is that the tidyverse is a collection of R packages that are all structured similarly (from a user perspective), play nice with each other, and help you in your various analysis tasks as you go from messy data to data that is tidied up and ready for formal analysis.
Note that there are many more R packages that are not part of the core tidyverse but still follow the same principles and work nicely with other tidyverse packages.
To learn more and practice some of the tidyverse functionality, I suggest you go through the Work with Data and Tidy your Data sections of the RStudio primers. More or less the same content, presented a bit differently and non-interactively, can be found in the Tidy data chapter in R4DS. This might be a good reference to look things up. If you want to practice and learn more, read the tidyverse, Introduction to data wrangling, Reshaping data and Joining tables chapters of IDS. If you want some more data tidying practice, check out this tutorial by Garrett Grolemund or this short blog post by Joachim Goedhart.
These are a lot of resources, and I don’t expect you to work through all of them in detail. I do suggest you take a quick look at all of them, and then work through some of them based on your knowledge level. As we progress through the course and you are asked to do a lot of these tasks, you will likely want to re-visit these materials. And as always, Google is your friend.
As you’ll find out shortly, one feature of R code written in the tidyverse style is the heavy use of the magrittr pipe operator (the %>% symbol). For instance, this is the kind of code you might see when doing data wrangling. (Note that this code doesn’t work as shown: it only illustrates the chain of functions. For it to run, each function would need arguments, i.e., there would need to be something inside the ().)
data %>%
  filter() %>%
  select() %>%
  fct_lump() %>%
  mutate() %>% ...
The idea is that you pipe the results from one operation into the next, and thus potentially build a long chain of commands. That style of coding often makes it quite easy to understand what the code is doing. For instance, in the example code above, you first filter the data based on some row values, then select some columns, then combine some levels of a factor variable, then mutate a variable into a new one, and so on.
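To make this a bit more concrete, here is a hedged, runnable version of such a chain, using the gss_cat survey data that ships with the forcats package. The column names and cutoffs are just examples I picked; also note that in practice fct_lump() operates on a factor column, so it usually sits inside mutate() rather than standing on its own in the chain:

library(tidyverse)

gss_cat %>%
  filter(year == 2014) %>%                     # keep only rows from 2014
  select(marital, relig, tvhours) %>%          # keep a few columns
  mutate(relig = fct_lump(relig, n = 5)) %>%   # lump rare religions into "Other"
  mutate(heavy_tv = tvhours > 4)               # create a new variable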
The problem, especially when you start out, is that things can (and will) go wrong at some of those steps, and it’s difficult to figure out where the problem is. At least when starting out, it is in my opinion often better to save the result of each cleaning operation as a new intermediate variable. That lets you more easily check for bugs, see how the data change from step to step, and confirm that each step does what you think it should. So instead of using a long chain of pipes, you can write the code like so:
dat_small1 <- data %>% filter()          # step 1: filter rows
dat_small2 <- dat_small1 %>% select()    # step 2: select columns
dat_small3 <- dat_small2 %>% fct_lump()  # step 3: lump factor levels
...
This code is not quite as easy to read, and it creates all these additional variables that you might not want or need. But I think, at least while you are learning the different tidyverse functions, it often helps to be able to inspect what happens at each step, and thus more easily spot when things go wrong. Once you get more comfortable with cleaning steps and coding in general, and make fewer mistakes, you can start chaining things together. But if you start out writing long chains right away, it’s much harder to follow along and find bugs.
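With real data, that step-by-step pattern might look something like the sketch below (again using the gss_cat data as a stand-in; object names are just illustrative). The point is that you can inspect each intermediate object, e.g. with glimpse() or summary(), before moving on:

library(tidyverse)

dat_recent <- gss_cat %>% filter(year == 2014)
glimpse(dat_recent)                             # check: did the filtering work?

dat_small <- dat_recent %>% select(marital, relig, tvhours)
summary(dat_small)                              # check: expected columns and values?

dat_lumped <- dat_small %>% mutate(relig = fct_lump(relig, n = 5))
glimpse(dat_lumped)                             # check: were rare levels lumped?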
There is also a nice project called Tidy Data Tutor that allows you to visualize the different steps in a data analysis chain of commands. Once you get used to chaining commands together in a long pipeline, or try to inspect someone else’s code, using that tool to look at each step can be quite useful. Give it a try.
There are lots of other useful R packages that make your coding life easier. We’ll make liberal use of them throughout this course. Later in the course, we’ll explore another set of packages called tidymodels. It is highly likely that for some of the course exercises or the course project, you will be using other packages that you find helpful. You are allowed and encouraged to use whatever packages you find useful. Being able to find packages that do what you need is an important skill.
As mentioned on one of the introductory pages, the quality of packages varies. In general, if they are on CRAN or Bioconductor, they have passed some quality checks. That does not, however, mean that the functions do the right thing, just that they run. Other packages might be more experimental, and while they might work well, there might also be bugs. In general, packages that are used by many people, packages that involve people who work at R-centric companies (e.g., Posit), and packages that have many developers/contributors and are actively maintained are good signs that a package is stable and reliable. That said, there are many packages that are developed by a single person and are only available from GitHub, and they are still very good packages. Ideally, for a new package, test it and see if it does things stably and correctly. If yes, you can start using it. Just always carefully inspect the results you get to make sure things are reliable. If at some point you work with R in a non-academic setting and use R packages for jobs that need to run reliably for many years to come, choosing packages might be a bit more tricky and require more thought. For an academic/research setting, it’s usually ok to use almost any package, as long as it seems reliable and works.
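For reference, here is a minimal sketch of how packages from those different sources are typically installed (the Bioconductor package is just an example, and the GitHub repository name is a placeholder, not a real package):

install.packages("tidyverse")                     # from CRAN

install.packages("BiocManager")                   # helper for Bioconductor packages
BiocManager::install("Biostrings")                # example Bioconductor package

install.packages("remotes")                       # helper for GitHub packages
remotes::install_github("username/somepackage")   # placeholder repository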