In this unit, you will learn the concept of reproducible research, why it is important and helpful to build your analyses in a reproducible manner, and what tools you can use to implement an automated, reproducible workflow.
A hallmark of proper scientific research is that findings can be replicated/reproduced. Unfortunately, it is often the case these days that results can’t be replicated/reproduced by independent investigators/labs. Sometimes, even the same labs can’t reproduce their previous findings. You have probably heard about the (supposed) Reproducibility Crisis in science. If not, do a quick online search, you’ll find lots of articles saying there is an increasing problem, while others saying that it’s not getting worse, we are just detecting more. While sometimes there is fraud, most often there are more benign reasons preventing reproducibility.This video provides a short discussion of some of the current problems with reproducibility in science:
It’s hard and expensive to replicate/reproduce a full study, including all data collection, thus not routinely possible. It is easier to make sure the analysis part can be reproduced. Making the analysis easily reproducible doesn’t ensure the analysis is correct. However, it allows others to take a look at analyses, redo them, and thus more quickly spot and correct potential problems.
To make a fully reproducible analysis, you have to provide all the data and code, and the generation of results (figures and tables) needs to be fully automated. The scientific community is moving toward more reproducibility and transparency (e.g., mandating public access to data, computer code, etc.). An increasing number of funding agencies and journals require access to data and code.
While there is a strong movement toward Open Access, providing all the data is not always possible. However, the full automation of data processing, analysis, and result generation is always possible, and we will use this approach.
In this video, Roger Peng goes into some more details of the concept of reproducible research:
Roger Peng has additional videos related to reproducible research, a playlist of those videos can be found here.
Most importantly and fundamentally, document everything.
Do all processing and analysis in a scripted and well-documented manner. That means no Excel, no manual copy & paste, and no manual figure or table generation. All of these actions are not scripted or documented and as such, are not reproducible.
Some further things to pay attention are the use of open standards (open data standards, open-source software), recording of software versions used, time-stamping data, and setting a random number seed in your code.
A reproducible analysis should also be practically reproducible, not just theoretically. By that I mean if you provide code, but the code only runs on some specific computer system you used, then it’s not reproducible for others. Providing all data and code is a good first step, but your goal should also be to make reproducibility easy. This is beneficial for both the original producer of the results and the persons trying to reproduce it.
A reproducible analysis is generally automated. That can save you a lot of time. Initially, it seems that doing your analysis in a reproducible and automated manner takes more time and is more difficult because you have to learn some new tools. That is true. However, once you are used to it, you will be much more efficient.
Consider the case where you had some data in Excel, moved it into SAS to do an analysis, and make some raw figures, opened them in Photoshop and spend a few hours making them look good. Then you or your collaborators realize that some of the data that should have been included in the analysis was not (or some data should not have been included). You need to re-create the raw figure and re-work it in Photoshop, likely spending a good bit of time.
If you had an automated analysis, you could just press one (or a few) buttons and re-run everything. Also, automation makes it less likely that errors occur from copying data and intermediate results between programs. The case-study in the introductory unit is such an example, where everything was fully automated.
Making an analysis reproducible also means you to document all your steps and what you do well. So it not only helps others, but future you will be very thankful. The importance of documenting the process increases, as analyses get more complex.
Creating a reproducible and automated analysis used to be a good bit of extra work, but not anymore. A large number of software tools are available that make it fairly easy to set up a reproducible workflow.
While there are different tools and programming languages that allow reproducible research (e.g. Jupyter notebooks in Python, Mathematica notebooks, Sweave, Latex), we will focus on one tool, namely the Quarto system. (GitHub is also part of this reproducible workflow, but we discuss that separately).
The folks from Posit (formerly Rstudio) previously developed R Markdown. This allowed you to combine R code with Markdown text. More recently, they developed Quarto, which is an improved version of the whole system that allows using different programming languages and creating many different types of outputs (papers, presentations, websites, etc.).
R Markdown is still around and functional. For instance, this course website is written in R Markdown and lives on GitHub. But in my opinion, Quarto is the future. We’ll therefore focus on using it. Eventually I’ll switch all my materials over to Quarto. The good thing is that the transition is easy.
The idea behind Quarto is that you have a system that allows you to combine code with text and simple formatting to easily and reproducibly turn them into a lot of different output formats, e.g., HTML, PDF, Word, or slides. You can apply layout and styling to those documents, which is done separately from the content. This means you can quickly switch between outputs. Quarto calls various other pieces of software to - almost magically - turn text and code input into a variety of different output formats. The good thing is, you generally don’t need to care what goes on under the hood, it all happens (almost always) smoothly behind the scenes.
The input (for us) into a Quarto document are
R code and
Markdown text. Markdown is a simple way of formatting text to get some
reasonably nice layouts with minimal effort. If you have no experience
with Markdown, I suggest you go through this nice, short interactive
tutorial.. A good reference to look up formatting for Markdown until
you have it memorized is this online
Quarto and R Markdown Note: Quarto is fairly new. It didn’t exist yet last time I taught the course. I’m switching to Quarto because I think it’s the future. But you might still find me talking about R Markdown in some places. If you do, let me know so I can update. In general, either R Markdown or Quarto will do the job, but Quarto is the future, so it’s best to use/learn it.
The whole Quarto system is already incredibly flexible and powerful (and continues to grow). Because it is very feature rich and you can do a ton, trying to read all about it does not make much sense. The best is a learn it as you need it approach. The Quarto website has great documentation, you’ll likely be going there often.
To get started, follow the get started guide on the Quarto website. Install Quarto and - if you haven’t already - R Studio. Then go through the 3 starter tutorials. There is already a lot going on in those tutorials, and there is no need to memorize it. The point is to get an idea of what you can do and play around a bit. You’ll repeatedly be coming back to the Quarto documentation as you try to figure out how to do specific things. So if some of the material in the tutorials is not quite clear on your first pass-through, that’s ok. Just get an overall idea and play around so you feel comfortable. You’ll be using Quarto in the exercise (and pretty much all the time from now on).
We’ll cover that more later, but it is important to not do things
that only work on your computer but not on others. The most important
and frequent mistake I see is someone using a local path. If you load
C:\Users\myname\mydata then it won’t work for me.
Always load/save data relative to the project folder. The
here R package and its
here() command in R is very useful. You’ll see
it show up and you should use it.
Another good idea is to start your R code with
use_blank_slate() from the
For some more background on this, see this
article by Jenny Bryan, co-author of both the