READy workflow introduction

Author

Andreas Handel

Modified

2024-03-20

Overview

This unit provides an overview of the ideas for a data analysis that follows Reproducible, Efficient, Automated, and Documented (READy) workflow principles.

Learning Objectives

  • Understand the concept of reproducible research
  • Understand the importance of reproducible research for modern science

Introduction

The concept of reproducible research, or what I call here more generally the READy workflow1, is becoming increasingly important. This unit provides a few general ideas on the topic, with further units going into more detail on specific aspects that can help you to perform READy analyses.

Comic of two people in front of a chalkboard that has mathematical equations and the phrase 'then a miracle occurs' between equations. The person pointing at that phrase says 'I think you should be more explicit here in step two.'

Not a good example of reproducibility.

Reproducibility and Replicability

A hallmark of proper scientific research is that findings can be replicated and/or reproduced. The terms reproduciblity and replicability are sometimes used interchangably, sometimes they are considered slightly distinct.

If they are differentiated, replicability is generally considered broader. In one sentence, your research is replicable if someone else can follow your methods and materials to collect new data, analyze their data, and get the same results (or similar results, taking random variation into account).

Unfortunately, it is often the case these days that results can’t be replicated/reproduced by independent investigators/labs. Sometimes, even the same labs can’t reproduce their previous findings. You have probably heard about the (supposed) Reproducibility/Replication Crisis in science. If not, do a quick online search, you’ll find lots of articles saying there is an increasing problem, while others saying that it’s not getting worse, we are just detecting more. While sometimes there is fraud, most often there are more benign reasons preventing reproducibility.

This video provides a short discussion of some of the problems with reproducibility/replicability in science:

Reproducibility of Analyses

The term replicability is generally applied to a full study, including the data collection part. It is hard and expensive to replicate/reproduce a full study, including all data collection, thus not routinely possible. It is easier to make sure the analysis part can be reproduced. Making the analysis easily reproducible doesn’t ensure the analysis is correct. However, it allows others to take a look at analyses, redo them, and thus more quickly spot and correct potential problems.

To make a fully reproducible analysis, you have to provide all the data and code, and the generation of results (figures and tables) needs to be fully automated. The scientific community is moving toward more reproducibility and transparency (e.g., mandating public access to data, computer code, etc.). An increasing number of funding agencies and journals require access to data and code.

While there is a strong movement toward Open Access, providing all the data is not always possible. However, the full automation of data processing, analysis, and result generation is almost always possible.

In this video, Roger Peng goes into some more details of the concept of reproducible research:

Roger Peng has additional videos related to reproducible research in this playlist of videos

Why reproducibility is good for you

A reproducible analysis is generally automated. That can save you a lot of time. Initially, it seems that doing your analysis in a reproducible and automated manner takes more time and is more difficult because you have to learn some new tools. That is true. However, once you are used to it, you will be much more efficient.

Consider the case where you had some data in Excel, moved it into SAS to do an analysis, and make some raw figures, opened them in Photoshop and spend a few hours making them look good. Then you or your collaborators realize that some of the data that should have been included in the analysis was not (or some data should not have been included). You need to re-create the raw figure and re-work it in Photoshop, likely spending a good bit of time.

If you had an automated analysis, you could just press one (or a few) buttons and re-run everything. Also, automation makes it less likely that errors occur from copying data and intermediate results between programs. The case-study in the introductory unit is such an example, where everything was fully automated.

Making an analysis reproducible also means you to document all your steps and what you do well. So it not only helps others, but future you will be very thankful. The importance of documenting the process increases, as analyses get more complex.

How to do a reproducible analysis

For an analysis to be reproducible, it needs to be well-documented, generally also automated, and thus efficient. Therefore, those components go together and that’s why I combined them under the READy acronym.

Do all processing and analysis in a scripted and well-documented manner. That means no Excel or similar tools should be used that don’t leave a record of what is happening, no manual copy & paste, and no manual figure or table generation. These actions are not scripted or documented and as such, are not reproducible.

A reproducible analysis should also be practically reproducible, not just theoretically. By that I mean if you provide code, but the code only runs on some specific computer system you used, then it’s not reproducible for others. Providing all data and code is a good first step, but your goal should also be to make reproducibility easy. This is beneficial for both the original producer of the results and the persons trying to reproduce it.

Summary

While the READy concept might seem a bit strange, and maybe also sounds like a lot of work, you will quickly discover that in the long run, it saves you time and makes your analysis more robust and reliable. A READy-type approach is also more and more required, not just in industry or by agencies such as the FDA, but increasingly by federal funders of research such as NIH or NSF.

Further Resources

  • Reproducible analysis and Research Transparency are materials from a workshop on that topic. I didn’t read through it, and it’s already a bit old (e.g., pre-Quarto), but seems to contain some interesting and materials.

  • This course cover some of the same topics. I haven’t taken it, so I’m not sure what it contains. But it’s free to sign up, so you can always register and check it out.

Footnotes

  1. READy is an acronym I invented. If you look around online, you’ll find lots of discussions on the topic of reproducible or automated data analysis, and why documentation is important, but you’ll likely not find anything calling it READy.↩︎