This document gives a brief explanation of Git, GitHub and related tools, and describes why and how we will use them for this course.
Git is software that implements what is called a version control system. The main idea is that as you (and your collaborators) work on a project, the software tracks, and records any changes made by anyone. GitHub is distinct from Git. GitHub is in some sense on online interface/website, and Git the underlying engine (a bit like RStudio and R). Since we will only be using Git through GitHub, I tend to not distinguish between the two. Throughout the course, I will usually refer to all of it as just GitHub. Note that other interfaces/platforms for Git exist, e.g., GitLab or Bitbucket.
You want to use GitHub to avoid this:
GitHub gives you an organized way to track your projects. It is very well suited for collaborative work. Historically, version control was used for software development. It has since become broader and is now used for many types of projects, including data science projects. If you want to learn a bit more about Git/GitHub and why it is a great tool for data analysis, check out this article by Jenny Bryan.
GitHub is ideal if you have a project with (possibly many) smallish files, and most of those files are text files (such as R code, LaTeX, Quarto/Markdown, etc.) and different people work on different parts of the project.
GitHub is less useful if you have a lot of non-text files (e.g. Word or PowerPoint) and different team members might want to edit the same document at the same time. In that instance, a solution like Google Docs, Word+Dropbox, Word+Onedrive, etc. might be better.
GitHub also has a problem with large files. Anything above around 100MB can lead to very slow syncing and sometimes outright failure. Unfortunately, once GitHub has chocked on a large file, it can be quite tricky to fix the repository to work again. Therefore keep large (>100MB) files out of your GitHub repositories. If you have to work with such files, try to reduce them first before placing into the GitHub repository. Or as alternative, place those files in another sync service (e.g. Dropbox, OneDrive, GoogleDrive) and load them from there.
Finally, if you have data, you need to be careful since by default, GitHub repositories (the GitHub name for your projects) are public and everyone can see them. You can set them private, but you need to be careful that you don’t accidentally expose confidential data to the public. It is in general not a good idea to have confidential data on GitHub. First anonymize your data enough that it is not at risk of identifiability, then you can place it in a private repository. If you put it public, be very careful that you only make things public that are ok to be made public.
Git is a piece of software. It can be installed from here. If you use GitKraken as described below, you don’t need to install Git separately. However, it might still be worth doing so, in case at some point you need to do some Git operation outside of GitKraken (e.g, through RStudio or a terminal).
GitHub is an online platform. You don’t need to install anything, but if you don’t have a GitHub account yet, you need to create one here. Note that GitHub is widely used professionally. You might use it beyond this class and for your future career. You might also want to allow other people to see your GitHub presence. You should therefore use a future-proof, professional sounding user name for your GitHub account. You can use your UGA or a different email for GitHub. Whichever email you use is where you will receive GitHub activity related notifications (including our GitHub class activities).
The most powerful and flexible way of using Git/GitHub is to open a command-line terminal (e.g., the one that comes with the Git install) and type a bunch of commands. This is often very confusing and we will not be using this approach.
The approach I recommend for this class, and the one I assume you will be using, is to interact with GitHub through a graphical interface. Specifically, for this class I recommend a software called GitKraken.
If you are already an experienced GitHub user and have your own preferred setup (e.g. the command line or some other 3rd party software), or for some other reason you don’t want to use GitKraken, you can instead use another graphical client. GitHub itself provides a graphical interface with basic functionality. RStudio also has Git/GitHub integration, for R projects. Other graphical interfaces exist. I assume if you decide to do it your way, you are experienced enough to translate my GitKraken-centric instructions to your setup. While my instructions throughout this course assume you use GitKraken, you can do everything without using it. But then you have to figure out on your own how to do it 😁.
Download and install GitKraken. The free version can do pretty much everything we need. One important limitation is that the free version does not allow you to access private repositories. As student, you can (and should) upgrade to the Pro version for free, see the GitHub developer pack section below on how to do it. By using GitKraken for your GitHub workflow, you do not need to go through the command line (unless things get so bad that you can’t fix them inside GitKraken). You also do not need to install Git separately, you get it with GitKraken. For this course, I assume you will be using the Pro version of GitKraken.
Once you have your GitHub account set up and GitKraken installed, make sure you connect the two.
I strongly recommend you get the GitHub student developer pack. This pack gives you a lot of nice free goodies that you might be interested in. I encourage you to explore all the different products and deals. For this class, we definitely want the following:
GitHub Pro account: This will give you your own free private repositories. We will mostly use public repositories, but there might be times you want to have a private repository, e.g. for data you don’t want to be public. Normally, if you want private repositories, you have to pay. But with the student pack (or as an academic), you get them for free.
GitKraken Pro: Free GitKraken Pro access. While the free version of GitKraken might work ok for this course, you can’t access private repositories with it. Often, being able to use private repositories is useful.
There are a lot of other goodies that come with the student pack. We won’t use them in the course, but some might be of interest to you.
Git/GitHub has a lot of specialized terminology that takes getting used to. The GitHub folks posted a handy page with short descriptions of some of the important terms. Some of the terms you will encounter soon are Repository (also often called repo), User, Organization, Branch, Fork, Push, Pull, Commit and Stage. For some reason, the last term is not explained in the list linked to above. Staging is the step you need to do after you made changes and before you commit them. For pretty much all of my work, I find this a not too helpful extra step, but you need to know what it means and you need to do/use it as part of the GitHub workflow.
Each repository can have multiple branches (think of it as
alternative versions of all files, allowing different people to work on
different aspects of a project with minimal interference). In this
course, we won’t really use branches. We will mainly just work with a
single branch. The name of the main branch can be anything, but these
days the default is
main. Until not too long ago, the main
branch was referred to as
master. This has mostly been
discontinued, but you might still see occasionally some Github
repositories where the main branch is called
might happen that in these course documents, I still occasionally refer
to the main branch by
master (let me know if you spot the
old terminology so I can update). Again, the name can be in principle
anything. Just be aware that
usually refer to the main branch.
In some way, you can think of GitHub as a more controlled (and more complex) alternative to systems such as Dropbox, OneDrive, GoogleDrive, etc. With the GitHub workflow, you can work locally on different machines, and everything is backed up in the cloud (on github.com). The (sometimes annoying) difference to Dropbox & Co is that the syncing between your local computer and github.com is not automatic. So don’t forget to pull before you start work on a repository and push once you are done!
I am a heavy user of both GitHub and Dropbox. I use Dropbox for a lot of regular files, e.g. MS Office documents, pdfs, images, etc. Some of my research projects, especially those where I collaborate with people that don’t use GitHub, are run through Dropbox. Dropbox or a similar service are convenient since all the syncing happens automatically. For anything where I want a more structured and organized approach, e.g. coding projects, some research projects or course materials like this site, I use GitHub.
You can do any setup you want. However, don’t mix and match. Because GitHub works similar to Dropbox, it is a bad idea to locally store your GitHub repositories in a folder that is synced by Dropbox. If you do, the Dropbox sync process and the GitHub sync process can conflict, potentially leading to syncing mess. For any GitHub repository, store it locally in a folder that is not synced across your computers by another software (such as Dropbox, OneDrive, Box, etc.)
Similarly, don’t store a GitHub repository inside another repository. While this is technically allowed, it can easily lead to a mess. Each repository should be separate and not connected to any other syncing service.
We will make heavy use of GitHub for this course. Instead, everyone will just be working through their own GitHub accounts. The advantage of this setup is that once the course ends, everything you did for the course automatically stays with you.
Privacy note: As part of this course, you will be asked to create several public GitHub repositories, including a public website, which at the end of the course will contain your portfolio of data analysis products. I consider this a good way of creating products that are public and which you can list on your resume and show potential future employers. However, as a student, under the Family Educational Rights and Privacy Act (FERPA) you have the right to have all your student-related information be kept private, including the fact that you are taking this class. Keeping everything private can be done using private GitHub repositories. But it means you won’t be able to create a public website/portfolio, and you can’t follow the general instructions in the exercises. Instead I’ll need to give you special instructions. Some of the group exercises also won’t work. If for some reason you want to keep the fact that you are taking this class a secret, and thus do not want to use public GitHub repositories/create a publicly accessible online portfolio please let me know. If I don’t hear from you, I consider this as you being consenting with producing public materials as part of this course, and thus make your participation in this course public.
Frustration note: Git/GitHub can be finicky, even for seasoned users. Sooner or later, you’ll run into some issues and strange error messages. If that happens, don’t panic. You’ll figure out how to fix things. The internet and Jenny Bryan’s book happygitwithr are great resources. And/or ask on Slack.
GitHub can be confusing. Start slow. Using the graphical interface (GitKraken) makes getting started fairly easy. I’m also trying to give specific instructions whenever we do something new with GitHub. Both the GitHub documentation and GitKraken support page are good resources for information.
Another source worth looking at for GitHub information is this chapter of IDS. There is also a lot of good beginner material online.