In this unit, you will learn the concept of reproducible research, why it is important and helpful to build your analyses in a reproducible manner, and what tools you can use to implement an automated, reproducible workflow.
A hallmark of proper scientific research is that findings can be replicated/reproduced. Unfortunately, it is often the case these days that results can’t be replicated/reproduced by independent investigators/labs. Sometimes, even the same labs can’t reproduce their previous findings. You have probably heard about the (supposed) Reproducibility Crisis in science. If not, do a quick online search, you’ll find lots of articles saying there is an increasing problem, while others saying that it’s not getting worse, we are just detecting more. While sometimes there is fraud, most often there are more benign reasons preventing reproducibility.This video provides a short discussion of some of the current problems with reproducibility in science:
It’s hard and expensive to replicate/reproduce a full study, including all data collection, thus not routinely possible. It is easier to make sure the analysis part can be reproduced. Making the analysis easily reproducible doesn’t ensure the analysis is correct. However, it allows others to take a look at analyses, redo them, and thus more quickly spot and correct potential problems.
To make a fully reproducible analysis, you have to provide all the data and code, and the generation of results (figures and tables) needs to be fully automated. The scientific community is moving toward more reproducibility and transparency (e.g., mandating public access to data, computer code, etc.). An increasing number of funding agencies and journals require access to data and code.
While there is a strong movement toward Open Access, providing all the data is not always possible. However, the full automation of data processing, analysis, and result generation is always possible, and we will use this approach.
In this video, Roger Peng goes into some more details of the concept of reproducible research:
Roger Peng has additional videos related to reproducible research, a playlist of those videos can be found here.
Most importantly and fundamentally, document everything.
Do all processing and analysis in a scripted and well-documented manner. That means no Excel, no manual copy & paste, and no manual figure or table generation. All of these actions are not scripted or documented and as such, are not reproducible.
Some further things to pay attention are the use of open standards (open data standards, open-source software), recording of software versions used, time-stamping data, and setting a random number seed in your code.
A reproducible analysis should also be practically reproducible, not just theoretically. By that I mean if you provide code, but the code only runs on some specific computer system you used, then it’s not reproducible for others. Providing all data and code is a good first step, but your goal should also be to make reproducibility easy. This is beneficial for both the original producer of the results and the persons trying to reproduce it.
A reproducible analysis is generally automated. That can save you a lot of time. Initially, it seems that doing your analysis in a reproducible and automated manner takes more time and is more difficult because you have to learn some new tools. That is true. However, once you are used to it, you will be much more efficient.
Consider the case where you had some data in Excel, moved it into SAS to do an analysis, and make some raw figures, opened them in Photoshop and spend a few hours making them look good. Then you or your collaborators realize that some of the data that should have been included in the analysis was not (or some data should not have been included). You need to re-create the raw figure and re-work it in Photoshop, likely spending a good bit of time.
If you had an automated analysis, you could just press one (or a few) buttons and re-run everything. Also, automation makes it less likely that errors occur from copying data and intermediate results between programs. The case-study in the introductory unit is such an example, where everything was fully automated.
Making an analysis reproducible also means you to document all your steps and what you do well. So it not only helps others, but future you will be very thankful. The importance of documenting the process increases, as analyses get more complex.
Creating a reproducible and automated analysis used to be a good bit of extra work, but not anymore. R, GitHub, and related tools have made it fairly easy to set up a reproducible workflow. We discuss GitHub separately, see that document. Since it controls and tracks any changes you make, and works nicely with collaborators, it is an excellent tool for reproducible work.
While there are different tools and programming languages that allow reproducible research (e.g. Jupyter notebooks in Python, Mathematica notebooks, Sweave, Latex), we will focus on one stack of tools, namely (R)Markdown & Co.
Markdown is a language/platform that allows you to write nice-looking documents easily. The idea is that you write plain text documents with simple formatting, and then turn them into a lot of different output formats, e.g., HTML, PDF, Word, or slides. You can apply layout and styling to those documents, which is done separately from the content. This means you can quickly switch between outputs. In our flow, the software in the background that turns our text documents into different formats is called Pandoc. The good thing is, you don’t need to care, it all happens (almost always) behind the scenes. Markdown is rather easy to learn. If you have no experience with Markdown, I suggest you go through this nice, short interactive tutorial.. A good reference to look up formatting for Markdown until you have it memorized is this online cheat sheet.
The folks from Rstudio developed R Markdown. This allows you to
combine R code with Markdown text. You can write a single Rmarkdown
(Rmd) document which contains code and text. You then
the document, which uses an R package called
the code, produces results and turns everything into a markdown (md)
document, then using the
rmarkdown package and
pandoc turns this into some output format of your choice,
e.g. an html or pdf or Word or Powerpoint document. Again, most of the
time, this happens without you needing to care about the details of this
process. There are by now a lot of different
output formats that R Markdown supports. As an example, this whole
course is written in R Markdown and lives on GitHub.
You can copy all the files and completely reproduce this course.
Note that if you want to produce pdf output, you need to have a (La)TeX (pronounced like lay-tech or luh-tech, not like how it is spelled) system installed. It’s a free typesetting system that is a bit similar to Markdown, but more complicated and more powerful. The currently recommended LaTeX system is TinyTeX. Others are MiKTeX for Windows (which I generally use) and MacTeX for Mac. We don’t absolutely need LaTeX for the class, but it’s good to have if you ever want to create pdf output, and it is free, so I suggest you go ahead and install it.
One great thing about R Markdown is that it’s well documented. The R Studio R Markdown site and the R Markdown book are great resources. Another good source targeted at scientists is the online book R Markdown for Scientists. RStudio also has an R Markdown cheat sheet in their collection of very useful cheat sheets.
Since developing knitr and R Markdown, Yihui Xie and colleagues have developed several other versions of the tool. There is bookdown which lets you nicely add references and write full books and scientific manuscripts (see the research example in the introductory lesson and a list of books written in bookdown on the bookdown website) and blogdown which lets you make webpages (see e.g. my group webpage which is done that way).
The whole R Markdown/Markdown/pandoc system has become incredibly flexible and powerful, and we’ll use it in this class. Because R Markdown & Co are very feature rich and you can do a ton, trying to read all about it does not make much sense. The best is a learn it as you need it approach. Some good sources that you might want to glimpse at are the chapters in the Communicate section of the R for Data Science book and the beginning chapters of the R Markdown book. There’s also the newer R Markdown Cookbook, which is a good how do I do X resource.
Again, you don’t need to read these materials in much detail, but if you are new to R Markdown, try to get enough information so you understand what it’s all about and how to get started. The way you learn R Markdown is by having a specific task you want to accomplish or a product you want to produce, and then look at various of the above listed documents to figure out how to do it. You’ll get plenty of practice throughout this course. project.