In this unit, you will learn the concept of reproducible research, why it is important and helpful to build your analyses in a reproducible manner, and what tools you can use to implement an automated, reproducible workflow.
A hallmark of proper scientific research is that findings can be replicated/reproduced. Unfortunately, it is often the case these days that results can’t be replicated/reproduced by independent investigators/labs. Sometimes, even the same labs can’t reproduce their previous findings. You have probably heard about the (supposed) Reproducibility Crisis in science. If not, do a quick online search. You’ll find lots of articles saying there is an increasing problem, while others say that it’s not getting worse and we are just detecting more. While sometimes there is fraud, most often there are more benign reasons preventing reproducibility. This video provides a short discussion of some of the current problems with reproducibility in science:
It’s hard and expensive to replicate/reproduce a full study, including all data collection, thus not routinely possible. It is easier to make sure the analysis part can be reproduced. Making the analysis easily reproducible doesn’t ensure the analysis is correct. However, it allows others to take a look at analyses, redo them, and thus more quickly spot and correct potential problems.
To make a fully reproducible analysis, you have to provide all the data and code, and the generation of results (figures and tables) needs to be fully automated. The scientific community is moving toward more reproducibility and transparency (e.g., mandating public access to data, computer code, etc.). An increasing number of funding agencies and journals require access to data and code.
While there is a strong movement toward Open Access, providing all the data is not always possible. However, the full automation of data processing, analysis, and result generation is always possible, and we will use this approach.
In this video, Roger Peng goes into some more details of the concept of reproducible research:
Note the concept of mixing text and code that Roger Peng talks about.
Roger Peng has additional videos related to reproducible research. A playlist of those videos can be found here.
Most importantly and fundamentally, document everything.
Do all processing and analysis in a scripted and well-documented manner. That means no Excel, no manual copy & paste, and no manual figure and table generation. None of these actions is scripted or documented, and as such they are not reproducible.
Some further things to pay attention to are the use of open standards (open data standards, open-source software), recording the software versions used, time-stamping data, and setting a random number seed in your code.
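As a minimal sketch of the last few points, the following R snippet sets a random number seed, records the software versions, and creates a time stamp. The seed value 123 and the variable names are arbitrary choices for illustration, not anything prescribed by this course.

```r
# Fix the random number seed so that random draws are the same on every run.
# The value 123 is arbitrary; any fixed integer works.
set.seed(123)
draws_first <- rnorm(5)

# Resetting the same seed reproduces exactly the same draws.
set.seed(123)
draws_second <- rnorm(5)
same_draws <- identical(draws_first, draws_second)

# Record the R version and the attached packages alongside your results.
r_version <- R.version.string
session_details <- sessionInfo()

# Time-stamp the run, for example to label an output file or a log entry.
run_time <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
```

Printing `session_details` at the end of a report is a simple way to document which package versions produced the results.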
A reproducible analysis should also be practically reproducible, not just theoretically. By that I mean if you provide code, but the code only runs on the specific computer system you used, then it’s not reproducible for others. Providing all data and code is a good first step, but your goal should also be to make reproducibility easy. This is beneficial for both the original producer of the results and the persons trying to reproduce them.
A reproducible analysis is automated. That can save you a lot of time. Initially, it seems that doing your analysis in a reproducible and automated manner takes more time and is more difficult because you have to learn some new tools. That is true. However, once you are used to it, you will be much more efficient. Consider the case where you had some data in Excel, moved it into SAS to do an analysis and make some raw figures, then opened the figures in Photoshop and spent a few hours making them look good. Then you or your collaborators realize that some of the data that should have been included in the analysis was not (or some data should not have been included). You need to re-create the raw figures and re-work them in Photoshop, likely spending a good bit of time. If you had an automated analysis, you could just press one (or a few) buttons and re-run everything. Also, automation makes it less likely that errors occur from copying data and intermediate results between programs. The case study in the introductory unit is such an example, where everything was fully automated.
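To make the idea of re-running everything a bit more concrete, here is a small sketch in R. The function `run_analysis()`, the toy data, and the file name are all made up for illustration. The point is only that processing, analysis, and figure generation are chained in code, so a change in the data requires nothing more than running the script again.

```r
# Hypothetical pipeline: every step from raw data to figure is done in code.
run_analysis <- function(raw_data, figure_file) {
  # Processing step: drop records with a missing outcome value.
  clean_data <- raw_data[!is.na(raw_data$outcome), ]

  # Analysis step: compute the mean outcome per group.
  group_means <- aggregate(outcome ~ group, data = clean_data, FUN = mean)

  # Figure step: write the plot straight to a file, with no manual touch-up.
  png(figure_file, width = 600, height = 400)
  barplot(group_means$outcome, names.arg = group_means$group,
          ylab = "Mean outcome")
  invisible(dev.off())

  group_means
}

# Toy data standing in for a real data file.
raw_data <- data.frame(
  group   = c("A", "A", "B", "B", "B"),
  outcome = c(1.0, 3.0, 2.0, NA, 4.0)
)

figure_file <- file.path(tempdir(), "group_means.png")
group_means <- run_analysis(raw_data, figure_file)
```

If a record needs to be added or removed, you change the data and call `run_analysis()` again. The table and the figure are then regenerated together.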
Making an analysis reproducible also means you have to document all your steps, and document them well. So it not only helps others, but future you will be very thankful. The importance of documenting the process increases as analyses get more complex.
Creating a reproducible and automated analysis used to be a good bit of extra work, but not anymore. R, GitHub, and related tools have made it fairly easy to set up a reproducible workflow. We discuss GitHub separately; see that document. Since GitHub controls and tracks any changes you make, and works nicely with collaborators, it is an excellent tool for reproducible work.
While there are different tools and programming languages that allow reproducible research (e.g., Jupyter notebooks in Python, Mathematica notebooks, Sweave, LaTeX), we will focus on one stack of tools, namely (R)markdown & Co.
Markdown is a language/platform that allows you to write nice-looking documents easily. The idea is that you write plain text documents with simple formatting, and then turn them into a lot of different output formats, e.g., HTML, PDF, Word, or slides. You can apply layout and styling to those documents, which is done separately from the content. This means you can quickly switch between outputs. In our flow, the software in the background that turns our text documents into different formats is called Pandoc. The good thing is, you don’t need to care; it all happens (almost always) behind the scenes. Markdown is rather easy to learn. If you have no experience with Markdown, I suggest you go through this nice, short interactive tutorial. A good reference to look up formatting for Markdown until you have it memorized is this online cheat sheet.
The folks from RStudio developed R Markdown. This allows you to combine R code with Markdown text. You can write a single R Markdown (Rmd) document which contains code and text. You then knit the document, which uses an R package called knitr, runs the code, produces results and turns everything into a Markdown (md) document. Then the rmarkdown package and pandoc turn this into some output format of your choice, e.g., an HTML, PDF, Word, or PowerPoint document. Again, most of the time, this happens without you needing to care about the details of this process. There are by now a lot of different output formats that R Markdown supports. As an example, this whole course is written in R Markdown and lives on GitHub. You can copy all the files and completely reproduce this course.
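The knitting step described above can also be triggered from R code instead of the Knit button. The sketch below assumes that the rmarkdown package and pandoc are installed (pandoc ships with RStudio). It writes a tiny, made-up Rmd document to a temporary folder and renders it to HTML with `rmarkdown::render()`. The file names and the document content are purely illustrative.

```r
# Assumes the rmarkdown package and pandoc are installed.
library(rmarkdown)

# Three backticks mark the start and end of a code chunk in an Rmd file.
fence <- strrep("`", 3)

# A tiny R Markdown document: YAML header, Markdown text, and one R chunk.
rmd_lines <- c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text that explains the analysis.",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
)

# Write the document to a temporary location (illustrative file name).
rmd_file <- file.path(tempdir(), "minimal-example.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit and convert in one call; render() returns the path of the output file.
output_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)
```

Changing `output_format` to, for example, `"word_document"` produces a Word file from the same source, which is the quick switching between outputs mentioned earlier.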
Note that if you want to produce pdf output, you need to have a (La)TeX system installed. It’s a free typesetting system that is a bit similar to Markdown, but more complicated and more powerful. I recommend MiKTeX for Windows and MacTeX for Mac. We don’t absolutely need it for the class, but it’s good to have and free, so I suggest you go ahead and install it.
One great thing about R Markdown is that it’s well documented. The RStudio R Markdown site and the R Markdown book are great resources. Another good source targeted at scientists is the online book R Markdown for Scientists. RStudio also has an R Markdown cheat sheet in their collection of very useful cheat sheets.
Since developing knitr and R Markdown, Yihui Xie and colleagues have developed several other versions of the tool. There is bookdown, which lets you nicely add references and write full books and scientific manuscripts (see the research example in the introductory lesson and a list of books written in bookdown on the bookdown website), and blogdown, which lets you make webpages (see e.g. my group webpage, which is done that way).
The whole R Markdown/Markdown/pandoc system has become incredibly flexible and powerful, and we’ll use it in this class.
Because R Markdown & Co are very feature-rich and you can do a ton with them, trying to read all about them does not make much sense. For now, to get some idea of what it is, skim through chapters 27, 29 and 30 in R for Data Science and chapter 2 of the R Markdown book. Chapter 40 of IDS is also worth a look. You don’t need to read these materials in much detail (and I won’t quiz you on them), but if you are new to R Markdown, try to get enough information so you understand what it’s all about and how to get started. The way you learn R Markdown is by having a specific task you want to accomplish or a product you want to produce, and then looking through the documents listed above to figure out how to do it. The exercise for this module will ask you to write some R Markdown, and you will be using it throughout this course, including for your class project.