In this unit, we’ll talk a bit more about reproducible research, and how that fits into the much larger concept of “open science.”
Way back at the beginning of the course, we briefly discussed reproducibility, and throughout the course we emphasized the importance of doing analyses in a reproducible (and thus generally automated/scripted) manner. In this module, we’ll discuss in a bit more detail what reproducible research means and entails, current trends toward open science, and why you will pretty much need a solid grasp of reproducible research strategies going forward.
While the concepts of reproducible research and open science have been around for a while, they have recently increased in prominence. Several US government organizations have said that 2023 is the Year of Open Science, and a recent Office of Science and Technology Policy memo has recommended that all federal agencies take serious steps towards open science practices (more on what exactly this means later).
Federal agencies like the NIH have been requiring data sharing and public release of papers (on PMC) for some time now, but these recommended measures go even further and are, in my opinion, a very positive direction for science. It is quite likely that if you work with federal funding in the near future, you will be required to share your data and code publicly as soon as you decide to publish. The exact implementation of these policies hasn’t been determined yet (nor has the degree to which the recommendations will become formal requirements), but the writing is on the wall, so to speak.
It is easier for you to get comfortable with reproducible research methods and open science now, rather than scrambling to learn them in the future!
There are many ways of defining and describing “open science”.
If you search online for “what is open science,” you will likely find various conceptual diagrams and descriptions. The papers Open science saves lives: lessons from the COVID-19 pandemic and From Open Data to Open Science provide pretty good conceptual discussions of open science, but I think the best schematic summary is from NASA.
NASA defines three components that are central to open science.
If you want a formal definition of “open science,” here’s the one from the OSTP, which is pretty comprehensive:
“The principle and practice of making research products and processes available to all, while respecting diverse cultures, maintaining security and privacy, and fostering collaborations, reproducibility, and equity.”
Open science is like spaghetti sauce: it has a lot of ingredients that have to be cooked together to get the best flavor. Let’s go into a bit more detail on each of these topics.
Open access publishing is a big topic, but for the purposes of this class I will just say that I think you should publish Open Access whenever possible. While you’re a student at UGA, our library has a fund that can help you with this. In the future, it seems likely that any federally funded project in the US will need to include funds for open access publishing in its budget, as publishing OA can often be prohibitively expensive. If you want to learn more, check out the Wikipedia page, or if you really want to learn more, see Peter Suber’s book on OA.
The main purpose of this section is to talk about data sharing. Data sharing used to be a pretty contentious topic, but fortunately it is more normalized now than it has ever been, and thanks to NIH (and other agency) mandates, it will likely be a routine part of science before too long.
Of course, just dumping messy data without any documentation into a repository is not great for accessibility and reproducibility. As some of you noticed during this course, even data from the CDC or other reputable organizations is often very poorly formatted and comes with little metadata or documentation to help you out. To address this issue, the FAIR data standards were created. FAIR stands for Findable, Accessible, Interoperable, and Reusable.
The FAIR standard was originally described in this 2016 publication. Another great resource is the GO FAIR initiative. It is worth taking a short look at both of these resources.
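One small, concrete step toward the “reusable” end of FAIR is to deposit data in a plain, non-proprietary format together with a data dictionary (codebook) that describes every variable. Below is a minimal sketch in R; the file names and variables are made up for illustration.

```r
# A minimal sketch of two small steps toward more reusable data:
# save the data in a plain, non-proprietary format and include a data
# dictionary describing every variable. (File and variable names are made up.)

library(readr)

# hypothetical analysis-ready data
mydata <- data.frame(
  id  = 1:3,
  age = c(34, 51, 29),
  bmi = c(22.1, 27.8, 24.5)
)

# a plain CSV can be read by essentially any software (interoperability)
write_csv(mydata, "mydata.csv")

# a simple data dictionary deposited alongside the data (reusability)
dictionary <- data.frame(
  variable    = c("id", "age", "bmi"),
  description = c("anonymized participant ID",
                  "age in years at enrollment",
                  "body mass index, kg/m^2")
)
write_csv(dictionary, "mydata_dictionary.csv")
```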
Of course, dealing with things like data use agreements and individual privacy is paramount, so achieving these goals 100% of the time is not feasible. Sometimes you might need to use synthetic data based on your real data, remove certain information from your data to protect privacy, or require a data use agreement before you can distribute the data to individuals. We should strive for data to be as FAIR as possible, but protecting individual privacy is equally important!
In general, a lot of authors still hide behind “I can’t share my data, it’s confidential” as an excuse not to share. Most of the time, it is possible to share de-identified data without confidentiality problems. Hopefully, data sharing will become much more common, and easier, in the near future.
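As a purely hypothetical illustration, basic de-identification can be as simple as dropping direct identifiers and coarsening quasi-identifiers before depositing the data. Real de-identification needs to follow your IRB approval and any data use agreements, and often requires more than this sketch shows.

```r
# A hypothetical, minimal sketch of basic de-identification before sharing.
# Real de-identification often requires more (coarsening dates, suppressing
# rare categories, etc.).

library(dplyr)

# made-up raw data containing direct identifiers
raw_data <- data.frame(
  name          = c("A. Smith", "B. Jones"),
  date_of_birth = as.Date(c("1980-02-11", "1995-07-30")),
  age           = c(44, 29),
  outcome       = c(1, 0)
)

shareable <- raw_data |>
  select(-name, -date_of_birth) |>   # drop direct identifiers
  mutate(age_group = cut(age, breaks = seq(0, 100, 10)),
         .keep = "unused")           # replace exact age with a 10-year band

write.csv(shareable, "shareable_data.csv", row.names = FALSE)
```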
We discussed reproducibility in this reading; revisit it if you need a refresher. In one sentence, your research is reproducible if another scientist can repeat your methods on your data and get the same results. That is, someone else can take your code and data, rerun all your code, and get the same answers that you got in your paper.
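A few small habits go a long way here. The sketch below shows some that we have used throughout the course; the seed value, file name, and model are arbitrary examples (reusing the toy CSV from the FAIR sketch above).

```r
# Small habits that make an analysis easier to reproduce (toy example).

set.seed(123)                    # fix the random number stream so that any
                                 # resampling/simulation results repeat exactly

dat <- read.csv("mydata.csv")    # read the raw data from a file; never edit
                                 # raw data by hand

fit <- lm(bmi ~ age, data = dat) # the full analysis lives in the script,
summary(fit)                     # not in untracked point-and-click steps

sessionInfo()                    # record R and package versions so others know
                                 # which software produced the results
```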
An important and related issue is replicability; you’ve probably heard of the “replication crisis” in science. If not, see for instance this Wikipedia page. In one sentence, your research is replicable if someone else can follow your methods and materials to collect new data, analyze those data, and get similar results. Of course, in social and semi-social sciences (like epidemiology), where results often depend on the target population of interest and are expected to vary over time, replicability can be a bit more nuanced than in an experimental field, where experiments should pretty much control for everything that you aren’t interested in.
Note that replicability is somewhat related to the topic of generalizability/portability of analyses that we discussed. The difference is that there, we focused on analysis approaches that would lead to replicable modeling/analysis results when applied to other data sources. More broadly, one wants to be able to replicate the overall findings if one were to repeat the full study, including the collection of new data. That is important, but for the purposes of the analysis, the data are considered given, and the focus is on reproducibility/replicability of the data analysis part.
The topic of inclusivity in data science is heavily tied to the idea of data ethics: critical thinking about the potential results and consequences of the products we create is crucial in data science. Many of you have some public health training and are therefore likely familiar with the Belmont Report. Regardless of your scientific background, when working with data (especially human subjects data), the guiding principles of respect for persons, beneficence, and justice should be critical considerations for all of our research.
For some additional reading on inclusivity and data science ethics, see for instance these resources:
It is all nice and good to tell people that their data needs to be FAIR and that their science needs to be open. But the question is, why do it? Most people might agree that there are societal benefits. For instance, having a resource like GenBank for genetic sequences has allowed many scientists beyond those who generated the original sequences to use the data and answer important scientific questions. However, there generally also need to be benefits for individuals to entice them to spend time following Open Science standards, and this is increasingly the case. On one hand, the tools for doing research in an Open Science framework keep getting better. For instance, the R + Quarto framework we’ve been using in class makes it rather easy to do things in an automated and reproducible way, and many other similar tools and resources are becoming available. Another potential benefit of Open Science is increased visibility: if others can use your data and models, they will likely cite your work, and they might even want to collaborate with you. It seems that for those individuals who are good at sharing, the benefits outweigh the perceived risks (such as “being scooped”). And the final, ever-increasing “benefit” is that publishers and funding agencies increasingly require it, so if you don’t follow Open Science standards, it might soon become hard to publish your work or get funding. That’s less of a “benefit” and more of an “avoid the penalties” thing, but regardless, it is an important practical consideration.
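For example, once an analysis lives in a Quarto document, re-running everything from raw data to finished report is a single command (the file name below is hypothetical):

```r
# Re-run all code and regenerate the full report in one step; the quarto R
# package wraps the Quarto command-line tool. (The file name is hypothetical.)
quarto::quarto_render("analysis.qmd")

# equivalently, from a terminal:
#   quarto render analysis.qmd
```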
Hopefully everything we discussed throughout this course, together with this short introduction, conveyed the message that Open Science is an important topic to think about and strive toward. Its importance has only been increasing in recent years, and it is useful to stay at least somewhat informed. Here are a few places where you can find more open science resources.