This document gives a brief explanation of Git, GitHub and related tools, and describes why and how we will use them for this course.
Git is software that implements what is called a version control system. The main idea is that as you (and your collaborators) work on a project, the software tracks, and records any changes made by anyone. GitHub is distinct from Git. GitHub is in some sense on online interface/website, and Git the underlying engine (a bit like RStudio and R). Since we will only be using Git through GitHub, I tend to not distinguish between the two. Throughout the course, I will usually refer to all of it as just GitHub. Note that other interfaces/platforms for Git exist, e.g., GitLab or Bitbucket.
You want to use GitHub to avoid this:
GitHub gives you an organized way to track your projects. It is very well suited for collaborative work. Historically, version control was used for software development. It has since become broader and is now used for many types of projects, including data science projects. If you want to learn a bit more about Git/GitHub and why it is a great tool for data analysis, check out this article by Jenny Bryan.
GitHub is ideal if you have a project with a fair number of files, most of those files are text files (such as code, LaTeX, (R)markdown, etc.) and different people work on different parts of the project.
GitHub is less useful if you have a lot of non-text files (e.g. Word or PowerPoint) and different team members might want to edit the same document at the same time. In that instance, a solution like Google Docs, Word+Dropbox, Word+Onedrive, etc. might be better.
GitHub is also not ideal if you have very large files (e.g., large data files). And if you have data, you need to be careful since by default, GitHub repositories (the GitHub name for your projects) are public. You can set them private, but need to be careful that you don’t accidentally expose confidential data to the public. It is in general not a good idea to have confidential data on GitHub. First anonymize your data, then you can place it in a public or private repository.
Git is a piece of software. It can be installed from here. If you use GitKraken as described below, you don’t need to install Git separately. However, it might still be worth doing so, in case at some point you need to do some Git operation outside of GitKraken.
GitHub is an online platform. You don’t need to install anything, but if you don’t have a GitHub account yet, you need to create one here. Note that GitHub is widely used professionally. You might use it beyond this class and for your future career. You might also want to allow other people to see your GitHub presence. You should therefore use a future-proof, professional sounding user name for your GitHub account. You can use your UGA or a different email for GitHub. Whichever email you use is where you will receive GitHub activity related notifications (including our GitHub class activities).
The most powerful and flexible way of using Git/GitHub is to open a command-line terminal (e.g., the one that comes with the Git install) and type a bunch of commands. This is often very confusing and we will not be using this approach.
The approach I recommend for this class, and the one I assume you will be using, is to interact with GitHub through a graphical interface. Specifically, for this class I recommend a software called GitKraken.
If you are already an experienced GitHub user and have your own preferred setup (e.g. the command line or some other 3rd party software), or for some other reason you don’t want to use GitKraken, you can instead use another graphical client. GitHub itself provides a graphical interface with basic functionality. RStudio also has Git/GitHub integration, for R projects. Other graphical interfaces exist. I assume if you decide to do it your way, you are experienced enough to translate my GitKraken-centric instructions to your setup. While my instructions throughout this course assume you use GitKraken, you can do everything without using it. But then you have to figure out on your own how to do it 😁.
Download and install GitKraken. The free version can do pretty much everything we need. One important limitation is that the free version does not allow you to access private repositories. As student, you can (and should) upgrade to the Pro version for free, see the GitHub developer pack section below on how to do it. By using GitKraken for your GitHub workflow, you do not need to go through the command line (unless things get so bad that you can’t fix them inside GitKraken). You also do not need to install Git separately, you get it with GitKraken. For this course, I assume you will be using the Pro version of GitKraken.
Once you have your GitHub account set up and GitKraken installed, make sure you connect the two.
I strongly recommend you get the GitHub student developer pack. This pack as a lot of nice free goodies that you might be interested in. I encourage you to explore all the different products and deals. For this class, we definitely want the following:
GitHub Pro account: This will give you your own free private repositories. We will mostly use public repositories, but there might be times you want to have a private repository, e.g. for data you don’t want to be public. Normally, if you want private repositories, you have to pay. But with the student pack (or as an academic), you get them for free.
GitKraken Pro: Free GitKraken Pro access. While the free version of GitKraken might work ok for this course, you can’t access private repositories with it. Often, being able to use private repositories is useful.
There are a lot of other goodies that come with the student pack. We won’t use them in the course, but some might be of interest to you.
Note that there is a similar offering for teachers/educators, which you can get here.
Git/GitHub has a lot of specialized terminology that takes getting used to. The GitHub folks posted a handy page with short descriptions of some of the important terms. Some of the terms you will encounter soon are Repository (also often called repo), User, Organization, Branch, Fork, Push, Pull, Commit and Stage. For some reason, the last term is not explained in the list linked to above. Staging is the step you need to do after you made changes and before you commit them. For pretty much all of my work, I find this a not too helpful extra step, but you need to know what it means and you need to do/use it as part of the GitHub workflow.
In some way, you can think of GitHub as a more controlled, a bit more complex alternative to systems such as Dropbox, OneDrive, GoogleDrive, etc. With the GitHub workflow, you can work locally on different machines, and everything is backed up in the cloud (on github.com). The (sometimes annoying) difference to Dropbox & Co is that the syncing between your local computer and github.com is not automatic. So don’t forget to pull before you start work on a repository and push once you are done!
I am a heavy user of both GitHub and Dropbox. I use Dropbox for a lot of regular files, e.g. MS Office documents, pdfs, images, etc. Some of my research projects, especially those where I collaborate with people that don’t use GitHub, are run through Dropbox. Dropbox or a similar service are convenient since all the syncing happens automatically. For anything where I want a more structured and organized approach, e.g. coding projects, some research projects or course materials like this site, I use GitHub.
You can do any setup you want. However, don’t mix and match. Because GitHub works similar to Dropbox, it is a bad idea to locally store your GitHub repositories in a folder that is synced by Dropbox. If you do, the Dropbox sync process and the GitHub sync process can conflict, potentially leading to syncing mess. For any GitHub repository, store it locally in a folder that is not synced across your computers by another software (such as Dropbox, OneDrive, Box, etc.)
Similarly, don’t store a GitHub repository inside another repository. While this is technically allowed, it can easily lead to a mess. Each repository should be separate and not connected to any other syncing service.
We will make heavy use of GitHub for this course. In the past, I created a dedicated organization for the course, but I decided we actually don’t need it. Instead, everyone will just be working through their own GitHub accounts. The advantage of this setup is that once the course ends, everything you did for the course automatically stays with you.
GitHub can be confusing. Start slow. Using the graphical interface (GitKraken) makes getting started fairly easy. I’m also trying to give specific instructions whenever we do something new with GitHub. Both the GitHub documentation and GitKraken support page are good resources for information.
Another source worth looking at for GitHub information is this chapter of IDS. There is also a lot of good beginner material online.