In this unit, we’ll talk a bit more about reproducible research, and how that fits into the much larger concept of “open science.”
Way back at the beginning of the course, we briefly discussed reproducibility, and throughout the course we emphasized the importance of doing analyses in a reproducible (and thus generally automated/scripted) manner. In this module, we’ll discuss in a bit more detail what reproducible research means and entails, current trends toward open science, and why you will pretty much need a solid grasp of reproducible research strategies going forward.
While the concepts of reproducible research and open science have been around for a while, they have recently increased in prominence. Several US government organizations have said that 2023 is the Year of Open Science, and a recent Office of Science and Technology Policy memo has recommended that all federal agencies take serious steps towards open science practices (more on what exactly this means later).
Federal agencies like the NIH have been requiring data sharing and public release of papers (on PMC) for some time now, but these recommended measures go even further and are, in my opinion, a very positive direction for science. It is quite likely that if you work with federal funding in the near future, you will be required to share your data and code publicly as soon as you decide to publish. The exact implementation of these policies hasn’t been determined yet (nor has the degree to which the recommendations will become formal requirements), but the writing is on the wall, so to speak.
It is easier for you to get comfortable with reproducible research methods and open science now, rather than scrambling to learn them in the future!
There are many ways of defining and describing “open science”.
If you search online for “what is open science,” you will likely find various conceptual diagrams and descriptions. The papers Open science saves lives: lessons from the COVID-19 pandemic and From Open Data to Open Science provide pretty good conceptual discussions of open science, but I think the best schematic summary is from NASA.
NASA defines three components that are central to open science.
If you want a formal definition of “open science,” here’s the one from the OSTP, which is pretty comprehensive:
“The principle and practice of making research products and processes available to all, while respecting diverse cultures, maintaining security and privacy, and fostering collaborations, reproducibility, and equity.”
Open science is like spaghetti sauce: it has a lot of ingredients that have to be cooked together to get the best flavor. Let’s go into a bit more detail on each of these topics.
Open access publishing is a big topic, but for the purposes of this class I will just say that I think you should publish Open Access whenever possible. While you’re a student at UGA, our library has a fund that can help you with this. In the future, it seems likely that any federally funded project in the US will need to include funds for open access publishing in its budget, as publishing OA can often be prohibitively expensive. If you want to learn more, check out the Wikipedia page, or if you really want to learn more, see Peter Suber’s book on OA.
The main purpose of this section is to talk about data sharing. Data sharing used to be a pretty contentious topic, but fortunately it is more normalized now than it has ever been, and thanks to NIH (and other agency) mandates, it will likely be a routine part of science before too long.
Of course, just dumping messy data without any documentation into a repository is not great for accessibility and reproducibility. As some of you noticed during this course, even data from the CDC or other reputable organizations is often very poorly formatted and comes with little metadata or documentation to help you out. To address this issue, the FAIR data standards were created. FAIR stands for Findable, Accessible, Interoperable, and Reusable.
The FAIR standard was originally described in this 2016 publication. Another great resource is the GO FAIR initiative. It is worth taking a short look at both of these resources.
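One small, concrete step toward the “reusable” end of FAIR is to deposit data in a plain, non-proprietary format together with a data dictionary (codebook) that describes every variable. Below is a minimal sketch in R; the file names and variables are made up for illustration.

```r
# A minimal sketch of two small steps toward more reusable data:
# save the data in a plain, non-proprietary format and include a data
# dictionary describing every variable. (File and variable names are made up.)

library(readr)

# hypothetical analysis-ready data
mydata <- data.frame(
  id  = 1:3,
  age = c(34, 51, 29),
  bmi = c(22.1, 27.8, 24.5)
)

# a plain CSV can be read by essentially any software (interoperability)
write_csv(mydata, "mydata.csv")

# a simple data dictionary deposited alongside the data (reusability)
dictionary <- data.frame(
  variable    = c("id", "age", "bmi"),
  description = c("anonymized participant ID",
                  "age in years at enrollment",
                  "body mass index, kg/m^2")
)
write_csv(dictionary, "mydata_dictionary.csv")
```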
Of course, dealing with things like data use agreements and individual privacy is paramount, so achieving these goals 100% of the time is not feasible. Sometimes you might need to use synthetic data based on your real data, remove certain information from your data to protect privacy, or require a data use agreement before you can distribute the data to individuals. We should strive for data to be as FAIR as possible, but protecting individual privacy is equally important!
In general, a lot of authors still hide behind “I can’t share my data, it’s confidential” as an excuse not to share. Most of the time, it is possible to share de-identified data without confidentiality problems. Hopefully, data sharing will become much more common, and easier, in the near future.
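As a purely hypothetical illustration, basic de-identification can be as simple as dropping direct identifiers and coarsening quasi-identifiers before depositing the data. Real de-identification needs to follow your IRB approval and any data use agreements, and often requires more than this sketch shows.

```r
# A hypothetical, minimal sketch of basic de-identification before sharing.
# Real de-identification often requires more (coarsening dates, suppressing
# rare categories, etc.).

library(dplyr)

# made-up raw data containing direct identifiers
raw_data <- data.frame(
  name          = c("A. Smith", "B. Jones"),
  date_of_birth = as.Date(c("1980-02-11", "1995-07-30")),
  age           = c(44, 29),
  outcome       = c(1, 0)
)

shareable <- raw_data |>
  select(-name, -date_of_birth) |>   # drop direct identifiers
  mutate(age_group = cut(age, breaks = seq(0, 100, 10)),
         .keep = "unused")           # replace exact age with a 10-year band

write.csv(shareable, "shareable_data.csv", row.names = FALSE)
```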
We discussed reproducibility in this reading; revisit it if you need a refresher. In one sentence, your research is reproducible if another scientist can repeat your methods on your data and get the same results. That is, someone else can take your code and data, rerun all your code, and get the same answers that you got in your paper.
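A few small habits go a long way here. The sketch below shows some that we have used throughout the course; the seed value, file name, and model are arbitrary examples (reusing the toy CSV from the FAIR sketch above).

```r
# Small habits that make an analysis easier to reproduce (toy example).

set.seed(123)                    # fix the random number stream so that any
                                 # resampling/simulation results repeat exactly

dat <- read.csv("mydata.csv")    # read the raw data from a file; never edit
                                 # raw data by hand

fit <- lm(bmi ~ age, data = dat) # the full analysis lives in the script,
summary(fit)                     # not in untracked point-and-click steps

sessionInfo()                    # record R and package versions so others know
                                 # which software produced the results
```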
An important and related issue is replicability; you’ve probably heard of the “replication crisis” in science. If not, see for instance this Wikipedia page. In one sentence, your research is replicable if someone else can follow your methods and materials to collect new data, analyze those data, and get similar results. Of course, in social and semi-social sciences (like epidemiology), where results often depend on the target population of interest and are expected to vary over time, replicability can be a bit more nuanced than in an experimental field, where experiments should pretty much control for everything that you aren’t interested in.
Note that replicability is somewhat related to the topic of generalizability/portability of analyses that we discussed. The difference is that there, we focused on analysis approaches that would lead to replicable modeling/analysis results when applied to other data sources. More broadly, one wants to be able to replicate the overall findings if one were to repeat the full study, including the collection of new data. That is important, but for the purposes of the analysis, the data are considered given, and the focus is on reproducibility/replicability of the data analysis part.
The topic of inclusivity in data science is heavily tied to the idea of data ethics: critical thinking about the potential results and consequences of the products we create is crucial in data science. Many of you have some public health training and are therefore likely familiar with the Belmont Report. Regardless of your scientific background, when working with data (especially human subjects data), the guiding principles of respect for persons, beneficence, and justice should be critical considerations for all of our research.
For some additional reading on inclusivity and data science ethics, see for instance these resources:
It is all nice and good to tell people that their data needs to be FAIR and that their science needs to be open. But the question is, why do it? Most people might agree that there are societal benefits. For instance, having a resource like GenBank for genetic sequences has allowed many scientists beyond those who generated the original sequences to use the data and answer important scientific questions. However, there generally also need to be benefits for individuals to entice them to spend time following Open Science standards, and this is increasingly the case. On one hand, the tools for doing research in an Open Science framework keep getting better. For instance, the R + Quarto framework we’ve been using in class makes it rather easy to do things in an automated and reproducible way, and many other similar tools and resources are becoming available. Another potential benefit of Open Science is increased visibility: if others can use your data and models, they will likely cite your work, and they might even want to collaborate with you. It seems that for those individuals who are good at sharing, the benefits outweigh the perceived risks (such as “being scooped”). And the final, ever-increasing “benefit” is that publishers and funding agencies increasingly require it, so if you don’t follow Open Science standards, it might soon become hard to publish your work or get funding. That’s less of a “benefit” and more of an “avoid the penalties” thing, but regardless, it is an important practical consideration.
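For example, once an analysis lives in a Quarto document, re-running everything from raw data to finished report is a single command (the file name below is hypothetical):

```r
# Re-run all code and regenerate the full report in one step; the quarto R
# package wraps the Quarto command-line tool. (The file name is hypothetical.)
quarto::quarto_render("analysis.qmd")

# equivalently, from a terminal:
#   quarto render analysis.qmd
```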
Hopefully everything we discussed throughout this course, together with this short introduction, conveyed the message that Open Science is an important topic to think about and strive toward. Its importance has only been increasing in recent years, and it is useful to stay at least somewhat informed. Here are a few places where you can find more open science resources.