Project Management
Overview
For this unit, we will discuss ways to make sure working collaboratively on a project is efficient and that project progress is tracked.
Learning Objectives
- Know why tracking a project is useful
- Be able to use basic features of Git/GitHub for version control
Introduction
While the term Project Management is very broad, here I only mean management of your actual files and folders. Time- and people-management, which are obviously very important components of any project, are beyond the scope of this unit.
It is generally useful to have some software that can record and track your progress. This is useful even if you are the only one working on a project, since it allows you to to go back in history to older versions of the project. Thus, if you made some changes that you don’t like, or accidentally deleted some files, you can revert back.
These tracking tools become even more important if you work with collaborators. You want to avoid editing the same files at the same time and it should be clear who did what. Without a tracking system, this is hard to do.
There are different tools available to track your progress. They are often known as version control systems (coming from the versions of code in software development, where those tools originated.)
We’ll only consider a few of those systems. Wikipedia has a long list if you are interested in further such tools.
Git/GitHub
Git/GitHub is a very common method of working collaboratively on projects. A prior unit provides a brief introduction to Git/GitHub.
The following only mentions (again) some topics worth keeping in mind when using Git/GitHub for data analysis work.
File sizes
GitHub has a problem with large files. Anything above 100MB will lead to a failure which is annoying and tricky to resolve. In general, files above 50mb will work, be will be slow and give you a warning. Unfortunately, once you have a failed attempt to sync a large file, it can be quite tricky to fix the repository to work again. Therefore keep large (>100MB) files out of your GitHub repositories. If you have to work with such files, try to reduce them first before placing into the GitHub repository. As alternative, place those files in another sync service (e.g. Dropbox, OneDrive, GoogleDrive) and load them from there. Fortunately, there are some nice R packages that can usually make retrieving those files reproducible, even if they are in a file syncing service outside of your repo. Finally, your overall repo must be less than 5GB in size, so consider storing your huge data files and audio/video files somewhere else.
There is an option to have larger files with GitHub using Git Large File Storage. This can be a bit tricky, if you think it’s worth trying, see the documentation here. GitHub Pro users (including students using the Student Developer Kit) have free access to limited LFS features.
Confidentiality
By default, GitHub repositories are public. Since you might not want to share your analysis publicly, sometimes using a private repository is useful. I think you generally don’t need to worry about getting scooped. Everyone is so busy with their own stuff, nobody really cares what you are working on 😁. However, you might often have data or other information as part of your project that you don’t want to share publicly.
Therefore, for most data analysis projects, it is probably good practice to start with a private repository, so you can have potentially confidential information in the repo. You can add collaborators to the private repository. This way only those individuals who you want to give access will be able to see, and potentially contribute to the project.
Eventually, to ensure your project is publicly reproducible, you want to share some version of the data and the code. Be aware that if you make a repository public, others can also see the full history. This can be important if you want to make your final results public - after you made sure everything is anonymous enough for sharing. If at any time you worked in the repository with the full dataset and tracked it, others can go back in your GitHub/Git history and see it. The best option is likely to start a new public repository and copy only the elements you want to share into that repository, while keeping your original repository private.
Gitlab & others
As mentioned above, the underlying software is called Git. GitHub is essentially a hosting platform (owned by Microsoft). GitHub is the most popular Git hosting platform, but not the only one. A few alternatives are Gitlab, Bitbucket and others. While there are some differences between these platforms, as long as the underlying system is Git
, the workflows are comparable.
Subversion & others
Subversion (SVN) is an alternative to Git
. Since it is a different underlying tool to manage your project - not just another hosting platform - there are differences between Git
and SVN
that require getting used to. Depending on the level of use, differences might seem small (just different terms for procedures, e.g., instead of git pull
you do svn update
), but as you dig deeper, differences in approach can become more substantial.
Yet other tools exist, e.g., Mercurial and Fossil, though they are not very widely used.
In my opinion, since Git/GitHub are very common, have good documentation and user-friendly clients, most of the time those are probably your best choice. If you really get into working with large data files and need ways to track them, it might be worth looking around a bit. E.g., SVN often works better with large files.
Summary
You want to use some tool that tracks your project. This is useful for solo work since you can revert, and very useful for teamwork. Git/GitHub is the most common tool to use for such tracking purposes. Others exist but you likely do not need to consider them.
Further Resources
There’s a lot of information out there on version control in general, and more specifically Git and GitHub. The version control article on Wikipedia is a good starting point for learning some more of the concepts. The GitHub documentation has a lot of great resources for working with Git and GitHub, at varying levels of difficulty.