This document gives a brief explanation of Github and how we will use it for this course.
Git is what is called a version control system. The main idea is that as you (and your collaborators) work on a project, the software tracks and records any changes made by anyone. Technically, Github is distinct from Git. Github is in some sense the interface and Git the underlying engine (a bit like RStudio and R). Since we will only be using Git through Github, I tend not to distinguish between the two. In the following, I refer to all of it as just Github. Note that other interfaces to Git exist, e.g., Bitbucket, but Github is the most widely used one.
You want to use Github to avoid this:
Github gives you a clean way to track your projects. It is also very well suited to collaborative work. Historically, version control was used for software development. However, it has become broader and is now used for many types of projects, including data science projects.
To learn a bit more about Git/Github and why you might want to use it, read this article by Jenny Bryan. Note her explanation of what’s special with the README.md file on Github.
Github is ideal if you have a project with a fair number of files, most of those files are text files (such as code, LaTeX, (R)markdown, etc.) and different people work on different parts of the project.
Github is less useful if you have a lot of non-text files (e.g. Word or PowerPoint) and different team members might want to edit the same document at the same time. In that instance, a solution like Google Docs, Word+Dropbox, Word+OneDrive, etc. might be better.
Git and Github are fundamentally based on commands you type into the command line. Lots of online resources show you how to use the command line. This is the most powerful, but also the most confusing, way to use Github. I pretty much never use the command line. Instead, I use a graphical interface, and I recommend you do the same (at least initially).
There are several options for such graphical interfaces. Github itself provides a graphical interface with basic functionality. RStudio also has Git/Github integration, though of course this only helps with R projects. There are also third-party Github clients, which in my opinion provide the most powerful and flexible way to use Github. Those clients have many advanced features, most of which you won’t need initially but might eventually.
I currently use and recommend (in general and for this course) a program called Gitkraken. You can get a free version of Gitkraken that can do pretty much everything we need. The one important limitation is that the free version does not allow you to access private repositories. As a student, you can (and should) upgrade to the Pro version for free; see the Github developer pack section below on how to do it. By using Gitkraken for your Github workflow, you do not need to go through the command line (unless things get so bad that you can’t fix them inside Gitkraken). You also do not need to install Git separately, since you get it with Gitkraken. For this course, I assume you will be using the Pro version of Gitkraken.
If you are already an experienced Github user and have your own preferred setup (e.g. the command line or some other third-party software), you are certainly welcome to stick with your workflow. I assume that if you are at that level of Github skills, you are experienced enough to translate my Gitkraken-centric instructions to your setup.
Git/Github has a lot of specialized terminology that takes getting used to. The Github folks posted a handy page with short descriptions of some of the important terms. Some of the terms you will encounter soon are Repository (also often called repo), User, Organization, Branch, Fork, Push, Pull, Commit and Stage. For some reason, the last term is not explained in the list linked to above. Staging is the step you need to do after you have made changes and before you commit them. For pretty much all of my work, I do not find this extra step very helpful, but you need to know what it means and you need to do it as part of the Github workflow.
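For those curious what staging and committing look like underneath a graphical client, here is a minimal Python sketch. It assumes Git is installed and on your PATH. The function name `stage_and_commit`, its `dry_run` parameter, and the file name in the usage note are made up for illustration. The sketch only wraps the standard `git add` and `git commit` commands and is not meant to describe how Gitkraken works internally.

```python
import subprocess


def stage_and_commit(repo_dir, files, message, dry_run=True):
    """Stage the given files, then commit them with a message.

    With dry_run=True (the default) nothing is executed; the function only
    returns the git commands it would run, so you can inspect them safely.
    """
    commands = [
        # Stage: mark the changes in these files for the next commit.
        ["git", "add", "--", *files],
        # Commit: record the staged changes together with a short message.
        ["git", "commit", "-m", message],
    ]
    if not dry_run:
        for cmd in commands:
            # check=True raises an error at the first command that fails.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands
```

Calling `stage_and_commit(".", ["analysis.R"], "Add analysis script")` returns the two commands without running them. Passing `dry_run=False` inside a real repository would execute them.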
You should think of Github as a more controlled, slightly more complex alternative to systems such as Dropbox, OneDrive, Google Drive, Box, etc. With the Github workflow, you can work locally on different machines, and everything is backed up in the cloud (on github.com). The (sometimes annoying) difference from Dropbox & Co is that the syncing between your local computer and github.com is not automatic. So don’t forget to pull before you start work on a repository and push once you are done!
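The "pull first, push last" habit can be sketched in Python as a small wrapper around a work session. This is only an illustration under some assumptions: Git is installed, the repository already has a remote on github.com, and the name `work_session` and its `run` parameter are hypothetical. The `run` parameter exists only so the example can be tried without touching a real repository.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def work_session(repo_dir, run=subprocess.run):
    """Pull before the work starts and push after it finishes."""
    # Bring the local copy up to date with what is on github.com.
    run(["git", "pull"], cwd=repo_dir, check=True)
    # Hand control back to the caller; this is where the actual work happens.
    yield
    # Upload local commits. Only committed changes are pushed, and this line
    # is skipped if the work inside the session raised an error.
    run(["git", "push"], cwd=repo_dir, check=True)
```

Used as `with work_session("path/to/repo"): ...`, the pull happens on entry and the push on exit, which mirrors clicking Pull and Push in a graphical client at the start and end of your work.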
Because Github works similarly to Dropbox, I have heard/read people say that it is a bad idea to locally store your Github repositories in a folder that is synced by Dropbox. If you do, the Dropbox sync process and the Github sync process can conflict, leading to a mess. So for any Github repository, store it locally in a folder that is not synced across your computers by other software (such as Dropbox, OneDrive, Box, etc.).
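If you want a quick check that a repository location is not inside an auto-synced folder, a rough Python sketch could look like the following. The folder names in `SYNCED_FOLDER_NAMES` are assumed defaults and the function name `synced_ancestor` is hypothetical. Your sync software may use a different folder name (for example one that includes an institution name), so treat this as a heuristic rather than a reliable test.

```python
from pathlib import Path

# Assumed default folder names; adjust to whatever your sync software uses.
SYNCED_FOLDER_NAMES = {"dropbox", "onedrive", "box", "google drive"}


def synced_ancestor(repo_path):
    """Return the first path component that looks like a synced folder.

    Returns None if no component of the path matches a known name.
    The comparison ignores case and requires an exact name match.
    """
    for part in Path(repo_path).parts:
        if part.lower() in SYNCED_FOLDER_NAMES:
            return part
    return None
```

For example, `synced_ancestor("/home/me/Dropbox/projects/myrepo")` flags the `Dropbox` component, while a path such as `/home/me/github/myrepo` returns `None`.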
I am a heavy user of both Github and Dropbox. I use Dropbox for a lot of regular files, e.g. MS Office documents, pdfs, images, etc. Some of my research projects, especially those where I collaborate with people who don’t use Github, are run through Dropbox. Dropbox is convenient since all the syncing happens automatically. For anything where I want a more structured and organized approach, e.g. coding projects, some research projects or course materials like this site, I use Github. It seems to me that using both the Github workflow and a Dropbox/OneDrive/etc. workflow, depending on the type of project, is best. I would just try to avoid mixing and matching within the same project. In particular, don’t keep Github repositories inside folders that are auto-synced by some other software.
I strongly recommend you get the Github student developer pack. This will give you your own free private repositories (we have private repositories as part of the class organization, but that doesn’t apply to any projects you do outside the class). Normally, if you want private repositories, you have to pay. But with the student pack, you get them for free. Sometimes, having private repositories is a good idea, especially if you want to use Github for a project involving data that you don’t want to be publicly accessible.
With the Github student developer pack, you also get 1 year of free Gitkraken Pro access. While the free version of Gitkraken should work ok for this course, you can’t access private repositories with it. Often, being able to use private repositories is useful.
The Github student developer pack also comes with lots of other tools that we won’t need for this course, but that might be of interest to some of you and you could explore and use them if you want to get geeky with your data projects.
For this course, I created what Github calls an organization with the name epid8060fall2019. The advantage of this is that we can have repositories inside that organization that only members can see. Once you have joined the organization, you can create private repositories, i.e., repositories that can only be seen by members of the organization. There are several repositories in that organization that you will use at some point, and you will be asked to create your own at some point.
To get up and running, here are the steps I recommend you follow:
Github topics discussion group.
Github can be confusing. Start slow. Using the graphical interface (Gitkraken) makes getting started fairly easy. I’m also trying to give specific instructions whenever we do something new with Github.
If the quick install instructions above are not detailed enough, follow the ones in Jenny’s paper or her great online book happygitwithr. Note that if you install Gitkraken, you do not have to install Git, since Gitkraken includes it. It might still be a good idea to install Git in case you want to use Github from RStudio.
Another source worth looking at for Github information is chapter 39 of IDS. There is also a lot of good beginner material online. As mentioned, RStudio has built-in Github support. I prefer to use Gitkraken for all my Github push/pull/sync, etc. and then work on my project in RStudio separately. But if you prefer the workflow with the built-in Github, feel free to use it. Jenny’s book explains how to connect Github and RStudio (but that is not required for using Github with R projects).