Tools for a READy workflow
Overview
This unit covers some tools that can help with a READy workflow.
Learning Objectives
- Be familiar with some tools that can help with a READy workflow.
Introduction
There are many different ways to make a data analysis READy. The following is a brief discussion of some common tools.
Version control software
Version control software, such as Git/GitHub, is used for tracking project changes and is discussed in the Project Management unit.
Quarto
Quarto is tool that allows for easy scientific writing in a reproducible manner, by seamlessly integrating code with text. Quarto is further discussed in the Quarto unit.
Jupyter
Jupyter is another platform that supports combining code with text. I’m not very familiar with it. As far as I understand it, it doesn’t support the generation of different outputs that Quarto does. Since Quarto supports the same languages as Jupyter, I am not sure what the advantages are of using Jupyter. There might be some 😁.
Pluto
For the Julia programming language, there is Pluto.jl, which has similarities to Jupyter and Quarto. Again, I don’t have experience with it, so can’t say what – if any – advantages it has over Quarto. But if you do some coding in Julia, it might be worth checking out.
Sweave and such
There are some older tools, such as Sweave
that can also help in reproducible work. You might still occasionally run into old projects that use it. But in my opinion it is essentially obsolete and you should use Quarto or another of the tools listed above.
Pipeline tools
Quarto and similar programs generally chain together a series of commands to get from the initial result to the final product. For instance for a Quarto file, when you render it, it first runs any code, then processes the markdown file, then turns it into the final product.
More generally, tools that allow you to automatically run such chains of steps are often referred to as pipeline tools. In R, there is the targets
package. Outside of R, Unix/Linux type systems support Make
, which lets you combine multiple steps into often complicated workflows. If you find yourself needing a workflow that executes multiple different steps that go beyond what Quarto can execute, the targets
package might be worth checking out.
Other tools
In general, any software tool you use that allows you to document each step - usually in the form of code and comments - should allow others to reproduce your results. Many common software packages and languages such as Mathematica, Matlab, Julia, Python, etc. allow you to add documentation to your code and therefore support reproducibility.
Hall of shame
There is a long list of software that is commonly used for research and data analysis, but which one should stay away from. This encompasses any software that does not contain a full documentation of all steps taken to allow reproducibility.
Excel and similar spreadsheet software is a major culprit. Using such software for data entry and storage is perfectly fine, as long as one follows guidelines that make later analysis easy, as for instance described in this paper.
However, spreadsheet software is a poor choice for analysis tasks (though nevertheless widely used). The problem with standard spreadsheet software like Excel is that if you make a change to the data and save it, there is usually no track record of what happened. Therefore, any data processing that happens is not reproducible.
Statistical and data analysis software that relies on graphical user interfaces (GUI), such as e.g. Graphpad Prism or SPSS are equally problematic. Some of these software tools can record the commands you perform through a GUI and you can save those at the end. This does allow for reproducibility. But if that step is missing or not implemented, the analysis is again not reproducible.
Similarly, any software to produce graphics by hand (e.g., Photoshop) should not be used. It is ok to use it make schematics, but not to edit any results.
Similarly, results tables should be generated automatically and not produced by copying and pasting numbers from one location to another. Again, that means no Excel.
Summary
There are many useful tools that help you implement a READy workflow. The important part is that things are scripted, automated and repeatable, as well as documented. Any tools that support such a workflow are useful. Stay away from tools that require manual steps.
Further Resources
- Good enough practices in scientific computing is a paper that provides a nice discussion of the approaches, and some tools, that are useful for doing work in a way that follows READy guidelines.