Overview

This document gives a brief explanation of Github and how we will use it for this course.

Learning Objectives

  • Know what Git and Github are.
  • Know why one might want to use them.
  • Have created and set up a Github account.
  • Joining the class organization on Github.

What are Git and Github

Git is what is called a version control system. The main idea is that as you (and your collaborators) work on a project, the software tracks, and records any changes made by anyone. Technically Github is distinct from Git. Github is in some sense the interface and Git the underlying engine (a bit like RStudio and R). Since we will only be using Git through Github, I tend to not distinguish between the two. In the following, I refer to all of it as just Github. Note that other interfaces to Git exist, e.g., Bitbucket, but Github is the most widely used one.

Why use Git/Github

You want to use Github to avoid this:

Github gives you a clean way to track your projects. It is also very well suited to collaborative work. Historically, version control was used for software development. However, it has become broader and is now used for many types of projects, including data science projects.

To learn a bit more about Git/Github and why you might want to use it, read this article by Jenny Bryan. Note her explanation of what’s special with the README.md file on Github.

What to (not) use Git/Github for

Github is ideal if you have a project with a fair number of files, most of those files are text files (such as code, LaTeX, (R)markdown, etc.) and different people work on different parts of the project.

Github is less useful if you have a lot of non-text files (e.g. Word or Powerpoint) and different team members might want to edit the same document at the same time. In that instance, a solution like Google Docs, Word+Dropbox, Word+Onedrive, etc. might be better.

How to use Git/Github

Git and Github is fundamentally based on commands you type into the command line. Lots of online resources show you how to use the command line. This is the most powerful, but also most confusing way to use Github. I pretty much never use the command line. Instead, I use a graphical interface, and I recommend you do the same (at least initially). There are several options for such graphical interfaces. Github itself provides a grapical interface with basic functionality. RStudio also has Git/Github integration. Of course this only works for R project Github integration. There are also third party Github clients, which in my opinion provide the most powerful, flexible and best means to use Github. Those clients have many advanced features, most of which you won’t need initially but might eventually. I currently use and recommend (in general and for this course) a software called Gitkraken. You can get a free version of Gitkraken that can do pretty much everything we need. The one important limitation is that the free version does not allow you to access private repositories. As student, you can (and should) upgrade to the Pro version for free, see the Github developer pack section below on how to do it. By using Gitkraken for your Github workflow, you do not need to go through the command line (unless things get so bad that you can’t fix them inside Gitkraken). You also do not need to install Git separately, you get it with Gitkraken. For this course, I assume you will be using the Pro version of Gitkraken. If you are already an experienced Github user and have your own preferred setup (e.g. the command line or some other 3rd party software), you are certainly welcome to stick with your workflow. I assume if you are at that level of Github skills, you are experienced enough to translate my Gitkraken-centric instructions to your setup.

Git/Github terminology

Git/Github has a lot of specialized terminology that takes getting used to. The Github folks posted a handy page with short descriptions of some of the important terms. Some of the terms you will encounter soon are Repository (also often called repo), User, Organization, Branch, Fork, Push, Pull, Commit and Stage. For some reason, the last term is not explained in the list linked to above. Staging is the step you need to do after you made changes and before you commit them. For pretty much all of my work, I find this a not too helpful extra step, but you need to know what it means and you need to do/use it as part of the Github workflow.

Git/Github and other cloud based sync options

You should think of Github as a more controlled, a bit more complex alternative to systems such as Dropbox, OneDrive, GoogleDrive, Box, etc. With the Github workflow, you can work locally on different machines, and everything is backed up in the cloud (on github.com). The (sometimes annoying) difference to Dropbox & Co is that the syncing between your local computer and Github.com is not automatic. So don’t forget to pull before you start work on a repository and push once you are done!

Because Github works similar to Dropbox, I have heard/read people say that it is a bad idea to locally store your Github repositories in a folder that is synced by Dropbox. If you do, the Dropbox sync process and the Github sync process can conflict, leading to a mess. So for any Github repository, store it locally in a folder that is not synced across your computers by another software (such as Dropbox, OneDrive, Box, etc.)

I am a heavy user of both Github and Dropbox. I use Dropbox for a lot of regular files, e.g. MS Office documents, pdfs, images, etc. Some of my research projects, especially those where I collaborate with people that don’t use Github, are run through Dropbox. Dropbox is convenient since all the syncing happens automatically. For anything where I want a more structured and organized approach, e.g. coding projects, some research projects or course materials like this site, I use Github. It seems to be using both the Github workflow and a Dropbox/OneDrive/etc. workflow, depending on the type of project, is best. I just would try to avoid mix and match within the same project, especially don’t have Github repositories inside folders that are auto-synced by some other software.

Github developer pack

I strongly recommend you get the Github student developer pack. This will give you your own free private repositories (we have private repositories as part of the class organization, but that doesn’t apply to any projects you do outside the class). Normally, if you want private repositories, you have to pay. But with the student pack, you get them for free. Sometimes, having private repositories is a good idea, especially if you want to use Github for a project involving data that you don’t want to be publicly accessible.

With the Github student developer pack, you also get 1 year of free Gitkraken Pro access. While the free version of Gitkraken should work ok for this course, you can’t access private repositories with it. Often, being able to use private repositories is useful.

The Github student developer pack also comes with lots of other tools that we won’t need for this course, but that might be of interest to some of you and you could explore and use them if you want to get geeky with your data projects.

Github for this course

For this course, I created what Github calls an organization with the name epid8060fall2019. The advantage of this is that we can have repositories inside that organization that only members can see. Once you joined the group, you can create private repositories, i.e., can only be seen by members of the organization. There are several repositories in that organization that you will use at some point, and you will be asked to create our own at some point.

To get up and running, here are the steps I recommend you follow:

  1. If you do not already have a Github.com account, create one. Note that Github is widely used professionally, and you might want to allow other people to see your Github presence. I, therefore, recommend using a future-proof, professional user name. You can user your UGA or a different email for Github. Whichever email you use is where you will receive Github activity related notifications (including our Github class activities).
  2. Post your Github user name into the Github topics discussion group.
  3. Once I have your user name, I’ll invite you to our course organization, epid8060fall2019. Wait for the invitation (it will come to the email you used to sign up for Github) and join the organization.
  4. Download and install Gitkraken, link it with your Github account.
  5. Get the github developer pack and upgrade your Github account and Gitkraken to the Pro version.
  6. Do the Introducing Ourselves project for this course, for which you will be using Github.

Further Resources and help

Github can be confusing. Start slow. Using the graphical interface (Gitkraken) makes getting started fairly easy. I’m also trying to give specific instructions whenever we do something new with Github.

If the quick install instructions above are not detailed enough, follow these in Jeny’s paper or her great online book happygitwithr. Note that if you install Gitkraken, you do not have to install Git, since Gitkraken has it. It might still be a good idea to install Git in case you want to use Github from RStudio.

Another source worth looking at for Github information is chapter 39 of IDS. There is also a lot of good beginner material online. As mentioned, RStudio has built-in Github support. I prefer to use Gitkraken for all my Github push/pull/sync, etc. and then work on my project in RStudio separately. But if you prefer the workflow with the built-in Github, feel free to use it. Jenny’s book explains how to connect Github and RStudio (but that is not required for using Github with R projects).