Open Research

Authors

Zane Billings

Andreas Handel

Modified

2026-01-06

Overview

This unit provides an introduction to the concept of open research/science.

Learning Objectives

  • Understand the general idea of open science
  • Explain what makes research reproducible and/or replicable
  • Know what FAIR data standards are

Introduction

While READy principles and openness are not necessarily the same, they are often connected.

The idea is that giving others access to your data, code, and other components of an analysis (the open part) is generally only useful if it is done within a READy framework. Otherwise, it can become close to impossible for others to reproduce and build on your work.

The general idea of open science

There are many ways of defining and describing “open science”. If you search online for “what is open science,” you will likely find various conceptual diagrams and descriptions. For instance, the Wikipedia Open Science page has a pretty good overview and diagram.

The basic idea is that, as much as possible, data, code, and any other related products are provided publicly and free of charge, in a way that is easily findable, accessible, and usable by others.

In general, a reproducible workflow makes it easier for others to find, access, and re-use your work, hence the connection between reproducibility and open science.

Federal open science requirements

While the concepts of reproducible research and open science have been around for a while, they have recently gained prominence. Federal agencies like the NIH have required data sharing and public release of papers (on PMC) for some time now, but the more recently recommended measures go even further and are, in my opinion, a very positive direction for science. It is quite likely that if you work with federal funding in the near future, you will be required to share your data and code publicly as soon as you publish. The exact implementation of these protocols hasn’t been determined yet (nor has the degree to which the recommendations will be formally implemented), but the writing is on the wall, so to speak.

It is easier for you to get comfortable with reproducible research methods and open science now, rather than scrambling to learn it in the future!

Accessibility and FAIR data

Open access publishing is a big topic, but for the purposes of this class I will just say that I think you should publish Open Access whenever possible. In the future, it seems likely that any federally funded projects in the US will require an open access fund as part of the budget, as publishing OA can often be prohibitively expensive. If you want to learn more, check out the Wikipedia page on Open Access, or if you really want to learn more, see Peter Suber’s book on OA.

The main purpose of this section is to talk about data sharing. Data sharing used to be a pretty contentious topic, but fortunately it is more normalized now than it has ever been, and thanks to mandates from the NIH (and other agencies), it will likely be a normal part of science before too long.

Of course, just dumping messy data without any documentation into a repository is not great for accessibility and reproducibility. As some of you noticed during this course, even data from the CDC or other reputable organizations are often very poorly formatted and come with little metadata or documentation to help you out. To address this issue, the FAIR data standards were created. FAIR stands for:

  1. Findable: when someone goes to your paper or project, they should be able to easily get to the data source.
  2. Accessible: users need to be able to access the data, and data are stored in a way that users can feasibly get them.
  3. Interoperable: data should be in standardized formats, using standard vocabulary. You should be specific if your data is derived from another dataset or if the user will need other related datasets.
  4. Reusable: data should be documented and version-controlled when you release it, with clear descriptions of what information is contained in the data, who can use the data, and for what purposes.
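As a rough illustration of the four points above, here is a minimal Python sketch of a metadata record that could be stored next to a dataset. The field names, the grouping by FAIR letter, and the example values (including the placeholder DOI and URL) are all assumptions made for this example; they are not an official FAIR schema, and real repositories usually define their own required fields.

```python
import json

# Hypothetical minimum fields, grouped by the FAIR letter they loosely support.
# These names are illustrative assumptions, not an official FAIR schema.
REQUIRED_FIELDS = {
    "findable": ["title", "identifier"],
    "accessible": ["access_url"],
    "interoperable": ["file_format"],
    "reusable": ["license", "version", "variables"],
}


def missing_fields(metadata):
    """Return a sorted list of required fields that are absent or empty."""
    missing = []
    for fields in REQUIRED_FIELDS.values():
        for field in fields:
            if metadata.get(field) in (None, "", [], {}):
                missing.append(field)
    return sorted(missing)


# A made-up record; every value here is a placeholder.
example = {
    "title": "Example survey data (made up)",
    "identifier": "doi:10.0000/placeholder",
    "access_url": "https://example.org/data/survey.csv",
    "file_format": "text/csv",
    "license": "CC-BY-4.0",
    "version": "1.0.0",
    "variables": {"age_years": "Age at enrollment, in whole years"},
}

# The same record with the license removed, to show what the check reports.
incomplete = {k: v for k, v in example.items() if k != "license"}

print(missing_fields(example))
print(missing_fields(incomplete))

# The record could be saved as a JSON file alongside the data.
print(json.dumps(example, indent=2))
```

A check like this only confirms that documentation exists, not that it is good. Whether the variable descriptions are clear enough for someone else to reuse the data still needs a human reader.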

The FAIR standard was originally described in this 2016 publication. Another great resource is the GO FAIR initiative. It is worth taking a short look at both these resources.

Of course, respecting data use agreements and individual privacy is paramount, so fully achieving these goals all the time is not feasible. Sometimes you might need to use synthetic data based on your real data, remove certain information from your data to protect privacy, or require a data use agreement before you can distribute the data to individuals. We should strive for data to be as FAIR as possible, but protecting individual privacy is equally important!

In general, a lot of authors still hide behind “I can’t share my data, it’s confidential” as an excuse for not having to share. Most of the time, it is possible to share de-identified data without confidentiality problems. Hopefully, the near future will make data sharing much more common and also easier.
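To make "removing certain information" a bit more concrete, the following Python sketch drops direct identifiers from a record and replaces an exact age with a 10-year band. The column names (`name`, `email`, `phone`, `age`) and the band width are hypothetical choices for this example. Note that steps like these do not by themselves guarantee that data are de-identified; combinations of the remaining variables can still identify people, so treat this as a starting point rather than a complete procedure.

```python
# Hypothetical column names that directly identify a person.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def age_band(age, width=10):
    """Map an exact age to a coarse band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record):
    """Return a copy of the record without direct identifiers and with age coarsened."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        # Replace the exact age with a band to reduce re-identification risk.
        cleaned["age_band"] = age_band(cleaned.pop("age"))
    return cleaned


# A made-up record for illustration.
raw = {"name": "A. Person", "email": "a@example.org", "age": 37, "outcome": 1}
print(deidentify(raw))
```

The function returns a new dictionary and leaves the original record untouched, so the raw data can stay in a restricted location while only the cleaned version is shared.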

Inclusivity

The topic of inclusivity in data science is heavily tied into the idea of data ethics – critical thinking about the potential results and consequences of the products we create is crucial in data science. Many of you have some public health training, and are therefore likely familiar with the Belmont Report.

Regardless of your scientific background, when working with data (especially human subjects data), the guiding principles of respect for persons, beneficence, and justice should be critical considerations for all of our research.

For some additional reading on inclusivity and data science ethics, see for instance these resources:

  • Elaine O Nsoesie, Sandro Galea, Towards better Data Science to address racial bias and health equity, PNAS Nexus, Volume 1, Issue 3, July 2022, pgac120, https://doi.org/10.1093/pnasnexus/pgac120
  • O’Neil, Cathy. Weapons of Math Destruction. Crown, 2016. ISBN: 978-0553418811.
  • Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018. ISBN: 978-1479837243.
  • https://datapractices.org/manifesto/
  • Floridi L, Taddeo M. What is data ethics? Philos Trans A Math Phys Eng Sci. 2016;374(2083):20160360. doi:10.1098/rsta.2016.0360

Practical considerations

It is all well and good to tell people that their data needs to be FAIR and that their science needs to be open. But why do it? Most people would agree that there are societal benefits. For instance, a resource like GenBank for genetic sequences has allowed many scientists beyond those who generated the original sequences to use the data and answer important scientific questions.

However, there generally also need to be benefits for individuals to entice them to spend time following Open Science standards. This is increasingly the case. First, tools for doing research in an Open Science framework keep getting better. For instance, the whole R + Quarto framework we’ve been using in class makes it rather easy to work in an automated and reproducible way. Many other similar tools and resources are becoming available.

Another potential benefit of Open Science is increased visibility. If others can use your data and models, they will likely cite your work. They might even want to collaborate with you. It seems that for those individuals who are good at sharing, the benefits outweigh the perceived risks (such as “being scooped”).

The final, ever-increasing “benefit” is that publishers and funding agencies increasingly require it. So if you don’t follow Open Science standards, it might soon get hard to publish work or get funding. It’s less of a “benefit” and more of an “avoid the penalties” thing, but regardless, it is an important practical consideration.

Summary

While open research is not the same as the READy principles, there is a lot of overlap. With the increasing requirements for research (data as well as code) to be made openly accessible and FAIR, it is important to do your analyses in a way that makes it easy to share your results publicly and for others to build on them. Following the READy principles helps ensure this is the case.

Further Resources

  • The Center for Open Science, a nonprofit dedicated to promoting open science practices and community change, maintains a blog with several thought-provoking readings.
  • Statistician Andrew Gelman sometimes writes about open science, mostly from a statistical perspective, and has some nice comments on replication. (1) (2)