In this unit, we’ll discuss how to structure a data analysis to make it as efficient as possible.
You want to set up your analysis in such a way that it makes sense to you and others and allows for a good and efficient workflow. The main components of your analysis will be data, code, results (tables, figures, etc.) and products (reports, interactive apps, slides, etc.). It is a good idea to have separate folders for each of those elements inside your main project folder. Your project folder could be a GitHub repository (a good idea) or not. If you do not want to share your analysis publicly, you can use a private repository. (Though don’t worry about getting scooped. Everyone is so busy with their own stuff, nobody really cares what you are working on 😁. In general, I recommend private repositories if you have data that can’t be made public.) Starting your project as an R project is also a good idea.
Inside your folders, you can have subfolders, e.g., separate folders for figures and tables. Or you could have subfolders for different types of analyses. There is no one correct way to set up things, but you should think of a logical and consistent structure before you start your project.
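As a minimal sketch of what such a structure could look like, the following base R code creates one possible set of folders. The folder names and the helper function create_project_folders() are made up for illustration; they are not a required layout, so adjust them to your project.

```r
# Hypothetical folder names; change them to whatever structure suits your project.
project_folders <- c(
  "data/raw-data",
  "data/processed-data",
  "code",
  "results/figures",
  "results/tables",
  "products/report",
  "products/slides"
)

# Hypothetical helper: create all folders below a given root folder.
create_project_folders <- function(root = ".") {
  paths <- file.path(root, project_folders)
  for (p in paths) {
    # recursive = TRUE also creates parent folders;
    # showWarnings = FALSE keeps the call quiet if a folder already exists.
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  # Return TRUE (invisibly) if every folder exists afterwards.
  invisible(all(dir.exists(paths)))
}

# Create the folders inside the current (project) directory.
create_project_folders(".")
```

Running this once at the start of a project gives you an empty but consistent skeleton that you can then fill with data, scripts, and results.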
You will want different R or R Markdown scripts for the cleaning/wrangling/exploring part and the analysis part. The number of scripts depends on your project and your preference. In general, keeping things modular is useful. If you had one file that did the analysis and created a report, that would be ok for a small project. But if you then wanted to make a set of slides based on your results, you would have to find a way to include the code in those slides. It is easier to have code that produces and saves results such as figures, which can then be included in both a report/manuscript and slides.
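Here is a minimal sketch of an analysis script that saves its results to files instead of only showing them on screen. It uses the built-in mtcars data so it can run anywhere; the folder and file names are assumptions for illustration.

```r
# Sketch of an analysis script. All folder and file names are hypothetical.
dir.create("results", showWarnings = FALSE)

# Fit a simple model and save the coefficient table for later use.
fit <- lm(mpg ~ wt, data = mtcars)
fit_table <- as.data.frame(coef(summary(fit)))
saveRDS(fit_table, file.path("results", "fit_table.rds"))

# Save a figure to a file so that a report and slides can both include it.
png(file.path("results", "mpg_vs_wt.png"), width = 600, height = 450)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)  # add the fitted regression line
dev.off()
```

A report or a set of slides can then load the saved table with readRDS() and include the saved figure, for instance with knitr::include_graphics(), without re-running the analysis.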
Another consideration is computational time. For simple projects, your code will likely run fast. Once your analysis or data become large, parts of your code might take a very long time to run. You will then want to structure your scripts such that the computationally intensive part is only run when absolutely needed. You definitely want to avoid a scenario where you have to wait minutes or hours as you play around with a figure to make it look the way you want.
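One common way to do this, shown here as a sketch and not the only option, is to save the result of the slow step to a file and reload it on later runs. The functions run_slow_analysis() and get_result() and the file name are hypothetical placeholders.

```r
# Hypothetical stand-in for a slow computation (e.g., fitting a large model).
run_slow_analysis <- function() {
  Sys.sleep(1)  # pretend this takes a long time
  lm(mpg ~ wt + hp, data = mtcars)
}

# Return the saved result if it exists; otherwise compute and save it.
# Set rerun = TRUE to force the computation, e.g., after the data changed.
get_result <- function(result_file, rerun = FALSE) {
  if (file.exists(result_file) && !rerun) {
    return(readRDS(result_file))
  }
  result <- run_slow_analysis()
  dir.create(dirname(result_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, result_file)
  result
}

# The first call is slow; later calls read the saved file and return quickly.
fit <- get_result(file.path("results", "slow_fit.rds"))
```

With this setup, the code that makes and polishes figures can call get_result() as often as needed. Keep in mind that a saved result can become outdated, so rerun the slow step whenever the data or the model changes.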
As an example, and hopefully useful guide for your class project, I created a public GitHub repository called dataanalyis-template which is meant as a template for doing a data analysis project. It has different folders for organizing your project. Various README files are provided to explain what each folder should contain. The template also contains several example files to show how the whole project workflow (or any data analysis workflow for that matter) can work. This is, of course, only one way to structure things. You are welcome to figure out your own setup and structure. Overall, figure out what setup works best for you, while keeping in mind that it should be easily understandable/reproducible for a reader (or your colleague, PI,…). You’ll be exploring this template as part of the exercise.
One problem that I encounter every time I teach a course like this is the use of paths that are specific to the user’s computer and do not work on someone else’s machine. Do not set paths or load files from paths that only exist on your machine! Instead, you should only use relative paths. A relative path is a file path that is relative to some directory. So what should not be part of your code is the command setwd() (because your working directory is likely different from everyone else’s) or anything that involves a full path name (e.g., C:/myusername/mydesktop/myfolder/) since nobody you share the code with will have that folder.
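To make the difference concrete, here is a small sketch in base R. The folder name data and the file name mydata.csv are made up for illustration, and the example writes a tiny file first so that it can run on its own.

```r
# Avoid: these lines only work on the computer of the person who wrote them.
# setwd("C:/myusername/mydesktop/myfolder/")
# mydata <- read.csv("C:/myusername/mydesktop/myfolder/data/mydata.csv")

# Prefer: a path relative to the main project folder.
# file.path() joins the pieces with a separator that works on any operating system.
data_file <- file.path("data", "mydata.csv")

# Only needed so this sketch runs on its own: write a small example file.
dir.create("data", showWarnings = FALSE)
write.csv(head(mtcars), data_file, row.names = FALSE)

# Load the data using the relative path.
mydata <- read.csv(data_file)
```

Anyone who copies the whole project folder gets the data folder along with the code, so the relative path keeps working on their machine.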
For a short tutorial on relative paths, see this video. Note that the video is several years old and the presenter doesn’t use R Projects. I strongly recommend using R Projects. If every project you work on is an R project, relative paths will always be relative to the main directory in which your .Rproj file is located. As long as someone loads the project by clicking on the .Rproj file, and you only use relative paths, things should work well on any computer. While R projects solve most of the problem, some issues remain, e.g., differences in how relative paths are treated in Rmd versus R files.
To solve those issues, you should use the here package, together with R Projects. It helps solve issues with relative paths being dealt with differently in an Rmd file versus an R file versus the console. This blog post is a great, short explanation of why one should use the here package and how to do it.
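A minimal sketch of how this can look in practice is shown below. It assumes the here package is installed; the folder and file names are hypothetical.

```r
library(here)

# here() with no arguments returns the project root it detected,
# e.g., the folder that contains your .Rproj file.
project_root <- here()

# Build paths starting from the project root, one piece per argument.
# Folder and file names are hypothetical.
data_file <- here("data", "raw-data", "exampledata.csv")
figure_file <- here("results", "figures", "myfigure.png")

# These paths can then be used the same way in an R script, an Rmd file,
# or the console, for example:
# mydata <- read.csv(data_file)
print(data_file)
```

Because every path is built from the project root, it does not matter in which subfolder the script or Rmd file that runs the code is stored.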
However you do things, make sure that for your exercises and especially your final project, someone else can clone your repository (or otherwise copy your project if it’s not on GitHub) to their computer and run everything, without needing exactly your setup of folders.
Several other tools exist that aim to improve reproducible research within the R ecosystem. The few I know about are listed below. I have not tried to use any of them, but feel free to try/use them as part of this class.
projects R package - meant to provide a framework for rather sophisticated projects.
workflowr R package - meant to provide a structure for reproducible data analysis projects.
While there is no substitute for clear thinking and being careful in your analysis, things such as having a clear structure can help with this. The importance of good structure to help you achieve optimal results is well appreciated in many contexts. For a short discussion of this, see this article by Seth Godin. For a more in-depth discussion of this idea, the Checklist Manifesto by Atul Gawande is a great read.