Open Research

Authors

Zane Billings

Andreas Handel

Modified

2025-01-06

Overview

This unit provides an introduction to the concept of open research/science.

Learning Objectives

Understand the general idea of open science
Explain what makes research reproducible and/or replicable
Know what FAIR data standards are

Introduction

While READy principles and openness are not necessarily the same, they are often connected.

The idea is that if you provide others access to your data, code and other components of analysis - the open part - it is generally only useful if this is done in a READy framework. Otherwise, it might become close to impossible for others to reproduce and further build on your work.

The general idea of open science

There are many ways of defining and describing “open science”.

If you search online for “what is open science,” you will likely find various conceptual diagrams and descriptions. The papers Open science saves lives: lessons from the COVID-19 pandemic and From Open Data to Open Science provide pretty good conceptual discussions of open science, but I think the best of schematic summary is from NASA.

A circular graphic with 'open science' in the center, showing several components of open science around it ([source](https://www.earthdata.nasa.gov/esds/open-science/oss-for-eso-workshops)). — Source: NASA

NASA defines three components that are central to open science.

Accessible – your research process and results need to be transparent. Open access, FAIR data, and code sharing all fall into this.
Reproducible – other people should be able to get the same results as you, and you should strive to make reusable products.
Inclusive – if your research is only accessible and reproducible for relatively well-off scientists at major institutions in the USA, it’s still not really open 🙂.

If you want a formal definition of “open science,” here’s the one from the OSTP, which is pretty comprehensive:

“The principle and practice of making research products and processes available to all, while respecting diverse cultures, maintaining security and privacy, and fostering collaborations, reproducibility, and equity.”

Federal open science requirements

While the concepts of reproducible research and open science have been around for a while, they have recently increased in prominence. Several US government organizations have said that 2023 is the Year of Open Science, and a recent Office of Science and Technology Policy memo has recommended that all federal agencies take serious steps towards open science practices (more on what exactly this means later).

Federal agencies like the NIH have been requiring data sharing and public releases of papers (on PMC) for some time now, but these recommended measures are even stronger, and in my opinion, a very positive direction for science. It is quite likely that if you work with federal funding in the near future, you will be required to share your data and code publicly as soon as you decide to publish. The exact implementation of these protocols hasn’t been determined yet (nor has the degree to which these recommendations will be formally implemented), but the writing is on the wall, so to speak.

It is easier for you to get comfortable with reproducible research methods and open science now, rather than scrambling to learn it in the future!

Accessibility and FAIR data

Open access publishing is a big topic, but for the purposes of this class I will just say that I think you should publish Open Access whenever possible. In the future, it seems likely that any federally funded projects in the US will require an open access fund as part of the budget, as publishing OA can often be prohibitively expensive. If you want to learn more, check out the Wikipedia page on Open Access, or if you really want to learn more, see Peter Suber’s book on OA.

The main purpose of this section is to talk about data sharing. Data sharing used to be a pretty contentious topic, but fortunately it is more normalized now than it has ever been before, and thanks to NIH (and other agency mandates), it will likely be a normal part of science before too long.

Of course just dumping messy data without any documentation into a depository is not great for accessibility and reproducibility. As some of you noticed during this course, even data from the CDC or other reputable organizations is often very poorly formatted and does not have a lot of metadata or documentation to help you out. To somewhat solve this issue, the FAIR data standards was created. FAIR stands for:

Findable: when someone goes to your paper or project, they should be able to easily get to the data source.
Accessible: users need to be able to access the data, and data are stored in a way that users can feasibly get them.
Interoperable: data should be in standardized formats, using standard vocabulary. You should be specific if your data is derived from another dataset or if the user will need other related datasets.
Reusable: data should be documented and version-controlled when you release it, with clear descriptions of what information is contained in the data, who can use the data, and for what purposes.

The FAIR standard was originally described in this 2016 publication. Another great resource is the GO FAIR initiative. It is worth taking a short look at both these resources.

Of course dealing with things like data use agreements and individual privacy is paramount, so 100% achieving these goals all the time is not feasible. Sometimes you might need to use synthetic data based on your real data, or remove certain information from your data to protect privacy, or require a data use agreement before you can distribute the data to individuals. We should strive for data to be as FAIR as possible, but protecting individual privacy is equally important!

In general, a lot of authors still hide behind “I can’t share my data, it’s confidential” as an excuse for not having to share. Most of the time, it is possible to share de-identified data without confidentiality problems. Hopefully, the near future will make data sharing much more common and also easier.

Inclusivity

The topic of inclusivity in data science is heavily tied into the idea of data ethics – critical thinking about the potential results and consequences of the products we create is crucial in data science. Many of you have some public health training, and are therefore likely familiar with the Belmont Report.

Regardless of your scientific background, when working with data (especially human subjects data), the guiding principles of respect for persons, beneficence, and justice should be critical considerations for all of our research.

For some additional reading on inclusivity and data science ethics, see for instance these resources:

Elaine O Nsoesie, Sandro Galea, Towards better Data Science to address racial bias and health equity, PNAS Nexus, Volume 1, Issue 3, July 2022, pgac120, https://doi.org/10.1093/pnasnexus/pgac120
O’Neil, Cathy. Weapons of math destruction. Crown, 2016. ISBN: 978-0553418811.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018. ISBN: 978-1479837243.
https://datapractices.org/manifesto/
Floridi L, Taddeo M. What is data ethics?. Philos Trans A Math Phys Eng Sci. 2016;374(2083):20160360. doi:10.1098/rsta.2016.0360

Practical considerations

It is all nice and good to tell people that their data needs to be FAIR, and that their science needs to be open. But the question is, why do it? Most people might agree that there are societal benefits. For instance having a resource like GenBank for genetic sequences has allowed many scientists beyond those who created the original sequence to use the data and answer important scientific questions. However, there generally need to also be benefits for individuals to entice them to spend time following Open Science standards. This is happening increasingly. On one hand, tools to do research in an Open Science framework are getting increasingly better. For instance the whole R + Quarto framework we’ve been using in class makes it rather easy to do things automated and reproducible. Many other similar tools and resources are becoming available. Another potential benefit of Open Science is increased visibility. If others can use your data and models, they will likely cite your work. They might even want to collaborate with you. It seems for those individuals who are good at sharing, the benefits outweigh the perceived risks (such as “being scooped”). And the final, ever increasing “benefit” is that publishers and funding agencies increasingly require it. So if you don’t follow Open Science standards, it might soon get hard to publish work or get funding. It’s less of a “benefit” and more of an “avoid the penalties” thing, but regardless, it is an important practical consideration.

Summary

While open research is not the same as READy principles, there is a lot of overlap. With the increasing requirements for research (data as well as code) to be made openly accessible and FAIR, it is important to ensure your analyses are done in a way that it is easy to publicly share your results and for others to build on it. The READy principles will ensure this is the case.

Further Resources

The Center for Open Science, a nonprofit dedicated to promoting open science practices and community change, maintains a blog with several thought provoking readings.
Statistician Andrew Gelman sometimes writes about open science, mostly from a statistical perspective and has some nice comments on replication. (1) (2)
FOSTER is a European organization with a set of free courses on open science principles.