Git/GitHub Introduction
Overview
This document gives a brief explanation of Git, GitHub and related tools, and describes why and how we will use them for this course.
Learning Objectives
- Know what Git and GitHub are.
- Know why one might want to use them.
- Create and set up a GitHub account.
- Set up a Git/GitHub workflow using GitKraken or an equivalent setup.
What are Git and GitHub
Git is software that implements what is called a version control system. The main idea is that as you (and your collaborators) work on a project, the software tracks, and records any changes made by anyone. GitHub is distinct from Git. GitHub is in some sense on online interface/website, and Git the underlying engine (a bit like RStudio and R). Since we will only be using Git through GitHub, I tend to not distinguish between the two. Throughout the course, I will usually refer to all of it as just GitHub. Note that other interfaces/platforms for Git exist, e.g., GitLab or Bitbucket.
Why use Git/GitHub
You want to use GitHub to avoid this:
GitHub gives you an organized way to track your projects. It is very well suited for collaborative work. Historically, version control was used for software development. It has since become broader and is now used for many types of projects, including data science projects. If you want to learn a bit more about Git/GitHub and why it is a great tool for data analysis, check out this article by Jenny Bryan.
What to (not) use Git/GitHub for
GitHub is ideal if you have a project with (possibly many) smallish files, and most of those files are text files (such as R code, LaTeX, Quarto/Markdown, etc.) and different people work on different parts of the project.
GitHub is less useful if you have a lot of non-text files (e.g. Word or PowerPoint) and different team members might want to edit the same document at the same time. In that instance, a solution like Google Docs, Word+Dropbox, Word+Onedrive, etc. might be better.
GitHub also has a problem with large files. Anything above around 100MB can lead to very slow syncing and sometimes outright failure. Unfortunately, once you have a failed attempt to sync a large file, it can be quite tricky to fix the repository to work again. Therefore keep large (>100MB) files out of your GitHub repositories. If you have to work with such files, try to reduce them first before placing into the GitHub repository. Or as alternative, place those files in another sync service (e.g. Dropbox, OneDrive, GoogleDrive) and load them from there.
Finally, if you have data, you need to be careful since by default, GitHub repositories (the GitHub name for your projects) are public and everyone can see them. You can set them private, but you need to be careful that you don’t accidentally expose confidential data to the public. It is in general not a good idea to have confidential data on GitHub. First anonymize your data enough that it is not at risk of identifiability, then you can place it in a private repository. If you put it public, be very careful that you only make things public that are ok to be made public.
Getting Git/GitHub
Git is a piece of software. It can be installed from the Git website. If you use GitKraken as described below, you don’t need to install Git separately. However, it might still be worth doing so, in case at some point you need to do some Git operation outside of GitKraken (e.g, through RStudio or a terminal).
GitHub is an online platform. You don’t need to install anything, but if you don’t have a GitHub account yet, you need to create one on GitHub.com. Note that GitHub is widely used professionally. You might use it beyond this class and for your future career. You might also want to allow other people to see your GitHub presence. You should therefore use a future-proof, professional sounding user name for your GitHub account. You can use your UGA or a different email for GitHub. Whichever email you use is where you will receive GitHub activity related notifications (including our GitHub class activities).
How to use Git/GitHub
The most powerful and flexible way of using Git/GitHub is to open a command-line terminal (e.g., the one that comes with the Git install) and type a bunch of commands. This is often very confusing and we will not be using this approach.
The approach I recommend for this class, and the one I assume you will be using, is to interact with GitHub through a graphical interface. Specifically, for this class I recommend a software called GitKraken.
If you are already an experienced GitHub user and have your own preferred setup (e.g. the command line or some other 3rd party software), or for some other reason you don’t want to use GitKraken, you can instead use another graphical client. GitHub itself provides a graphical interface with basic functionality. RStudio also has Git/GitHub integration, for R projects. Other graphical interfaces exist. I assume if you decide to do it your way, you are experienced enough to translate my GitKraken-centric instructions to your setup. While my instructions throughout this course assume you use GitKraken, you can do everything without using it. But then you have to figure out on your own how to do it 😁.
GitKraken
Download and install GitKraken. The free version can do pretty much everything we need. One important limitation is that the free version does not allow you to access private repositories. As student, you can (and should) upgrade to the Pro version for free, see the GitHub developer pack section below on how to do it. By using GitKraken for your GitHub workflow, you do not need to go through the command line (unless things get so bad that you can’t fix them inside GitKraken). You also do not need to install Git separately, you get it with GitKraken. For this course, I assume you will be using the Pro version of GitKraken.
Once you have your GitHub account set up and GitKraken installed, make sure you connect the two.
GitHub developer pack
Please get the GitHub student developer pack. This pack gives you a lot of nice free goodies that you might be interested in. I encourage you to explore all the different products and deals. For this class, we most want the following:
GitHub Pro account: This will give you your own free private repositories. We will mostly use public repositories, but there might be times you want to have a private repository, e.g. for data you don’t want to be public. Normally, if you want private repositories, you have to pay. But with the student pack (or as an academic), you get them for free.
GitKraken Pro: Free GitKraken Pro access. While the free version of GitKraken might work ok for this course, you can’t access private repositories with it. Often, being able to use private repositories is useful.
GitHub Copilot: Which gives you AI support for writing code.
There are a lot of other goodies that come with the student pack. We won’t use them in the course, but some might be of interest to you.
Git/GitHub terminology
Git/GitHub has a lot of specialized terminology that takes getting used to. The GitHub folks posted a handy page with short descriptions of some of the important terms. Some of the terms you will encounter soon are Repository (also often called repo), User, Organization, Branch, Fork, Push, Pull, Commit and Stage. For some reason, the last term is not explained in the list linked to above. Staging is the step you need to do after you made changes and before you commit them. For pretty much all of my work, I find this a not too helpful extra step, but you need to know what it means and you need to do/use it as part of the GitHub workflow.
Each repository can have multiple branches (think of it as alternative versions of all files, allowing different people to work on different aspects of a project with minimal interference). In this course, we won’t really use branches. We will mainly just work with a single branch. The name of the main branch can be anything, but these days the default is main
. Until not too long ago, the main branch was referred to as master
. This has mostly been discontinued, but you might still see occasionally some Github repositories where the main branch is called master
. It might happen that in these course documents, I still occasionally refer to the main branch by master
(let me know if you spot the old terminology so I can update). Again, the name can be in principle anything. Just be aware that master
and main
usually refer to the main branch.
Git/GitHub and other cloud based sync options
In some way, you can think of GitHub as a more controlled (and more complex) alternative to systems such as Dropbox, OneDrive, GoogleDrive, etc. With the GitHub workflow, you can work locally on different machines, and everything is backed up in the cloud (on github.com). The (sometimes annoying) difference to Dropbox & Co is that the syncing between your local computer and github.com is not automatic. So don’t forget to pull before you start work on a repository and push once you are done!
I am a heavy user of both GitHub and Dropbox. I use Dropbox for a lot of regular files, e.g. MS Office documents, pdfs, images, etc. Some of my research projects, especially those where I collaborate with people that don’t use GitHub, are run through Dropbox. Dropbox or a similar service are convenient since all the syncing happens automatically. For anything where I want a more structured and organized approach, e.g. coding projects, some research projects or course materials like this site, I use GitHub.
You can do any setup you want. However, don’t mix and match. Because GitHub works similar to Dropbox, it is a bad idea to locally store your GitHub repositories in a folder that is synced by Dropbox. If you do, the Dropbox sync process and the GitHub sync process can conflict, potentially leading to syncing mess. For any GitHub repository, store it locally in a folder that is not synced across your computers by another software (such as Dropbox, OneDrive, Box, etc.)
Similarly, don’t store a GitHub repository inside another repository. While this is technically allowed, it can easily lead to a mess. Each repository should be separate and not connected to any other syncing service.
GitHub for this course
We will make heavy use of GitHub for this course. Instead, everyone will just be working through their own GitHub accounts. The advantage of this setup is that once the course ends, everything you did for the course automatically stays with you.
Privacy note: As mentioned on the Course Syllabus page, this course is set up such that you work “in the open” through public GitHub repositories. Please make sure to read the Privacy/FERPA section of the syllabus and let me know if you have any concerns. If I don’t hear from you, I consider this as you consenting with producing public materials as part of this course, and thus make your participation in this course public.
Frustration note: Git/GitHub can be finicky, even for seasoned users. Sooner or later, you’ll run into some issues and strange error messages. If that happens, don’t panic. You’ll figure out how to fix things. The internet and Jenny Bryan’s book happygitwithr are great resources. And/or ask on Slack.
Further Resources and help
GitHub can be confusing. Start slow. Using the graphical interface (GitKraken) makes getting started fairly easy. I’m also trying to give specific instructions whenever we do something new with GitHub. Both the GitHub documentation and GitKraken support page are good resources for information.
If the quick install instructions above are not detailed enough, follow these in Jenny’s paper or her great online book happygitwithr.
Another source worth looking at for GitHub information is this chapter of IDS. There is also a lot of good beginner material online.