In this unit, we’ll discuss how to structure a data analysis to make it as efficient as possible.
You want to set up your analysis in such a way that it makes sense to you and others and allows for a good and efficient workflow. The main components of your analysis will be data, code, results (tables, figures, etc.) and products (reports, interactive apps, slides, etc.). It is a good idea to have separate folders for each of those elements inside your main project folder. Your project folder could be a Github repository (a good idea) or not. Since you might not want to share your analysis publicly, sometimes using a private repository is useful. (Though don’t worry about getting scooped. Everyone is so busy with their own stuff, nobody really cares what you are working on 😁. In general, I recommend private repositories if you have data that can’t be made public.) Starting your project as an R project is also a good idea.
Inside your folders, you can have subfolders, e.g., separate folders for figures and tables. Or you could have subfolders for different types of analyses. There is no one correct way to set up things, but you should think of a logical and consistent structure before you start your project.
You will want different R and/or Quarto scripts for the cleaning/wrangling/exploring part and the analysis part. The number of scripts depends on your project and your preference. In general, keeping things modular is useful. If you had one file that did the analysis and created a report, that would be ok for a small project. But then if you wanted to make a set of slides based on your results, you’ll have to find a way to include the code in those slides. It would be easier to have code produce and save results such as figures, which can then be included in both a report/manuscript and slides.
Another consideration is computational time. For simple projects, your code will likely run fast. Once your analysis or data become large, parts of your code might run very long. You will then want to structure your scripts such that the computationally-intensive part is only run when absolutely needed. You definitely want to avoid a scenario where you have to wait minutes or hours as you play around with a figure to make it look the way you want.
As an example, and hopefully useful guide for your class project, I
created a public
Github repository called dataanalyis-template which is meant as a
template for doing a data analysis project. It has different folders for
organizing your project. Various README
files are provided
to explain what each folder should contain. The template also contains
several example files to show how the whole project workflow (or any
data analysis workflow for that matter) can work. This is, of course,
only one way to structure things. You are welcome to figure out your own
setup and structure. Overall, figure out what setup works best for you,
while keeping in mind that it should be easily
understandable/reproducible for a reader (or your colleague, PI,…).
You’ll be exploring this template as part of the exercise.
One problem that I encounter every time I teach a
course like this is the use of paths that are specific only to the
user’s computer and do not work on someone else’s machine. Do not set
paths or load files from paths that only exist on your machine! Instead,
you should only use relative paths. A relative path is
a file path that is relative to some directory. So what should
not be part of your code is the command
setwd()
(because your working directory is likely different
from everyone else) or anything that involves a full path name
(e.g. C:/myusername/mydesktop/myfolder/
) since nobody you
share the code with will have that folder.
For a short video tutorial, on relative paths, see this video. Note that
the video is several years old and he doesn’t use R Projects. I
strongly recommend using R Projects. By having any project you
work on as an R project, relative paths will always be relative to the
main directory in which your .Rproj
file is located. As
long as someone loads the project by clicking on the .Rproj
file, and you only use relative paths, things should work well on any
computer. While R projects solve most of the problem, some issues
remain, e.g. differences of how relative paths are treated in Rmd versus
R files.
To solve those issues, you should use the here
package,
together with R Projects. It helps solve issues with relative paths
being dealt with differently in an Rmd file versus an R file versus the
console. This
blog post is a great, short explanation of why one should use the
here
package and how to do it.
However you do things, make sure that for your exercises and especially final project, someone else can clone your repository (or otherwise copy your project if it’s not on Github) to their computer and run everything, without having to have exactly your setup of folders.
Several efforts to develop further tools to help improve reproducible research within the R system exist. The few I know about are listed below. I have not tried to use any of them, but feel free to try/use them as part of this class.
projects
R package - meant to provide a framework for
rather sophisticated projects.While there is no substitute for clear thinking and being careful in your analysis, things such as having a clear structure can help with this. The importance of good structure to help you achieve optimal results is well appreciated in many contexts. For a short discussion of this, see this article by Seth Godin. For a more in-depth discussion of this idea, the Checklist Manifesto by Atul Gawande is a great read.