Additional Topics

Overview

In this unit, we discuss a few additional topics related to Git/GitHub that are worth mentioning in the context of modeling and data science work.

Goals

  • Be aware of the concept of merge conflicts and how to handle them.
  • Know how to deal with potential confidentiality issues when using Git/GitHub.

Reading

The sections below cover a few additional topics that are relevant when working with Git and GitHub in the context of modeling and data science work.

Merge conflicts

Sooner or later, as you work with GitHub, you will encounter a dreaded merge conflict. This can happen if the same file has been changed by multiple individuals, or by yourself on multiple computers if you forgot to pull before and push after each work session. A merge conflict occurs when Git cannot automatically figure out how to combine the changes made in the different versions of a file, typically because the same lines were edited in both versions. When this happens, Git marks the file as having a conflict and will not let you finish the push/pull/merge until the conflict is resolved.

In general, to minimize conflicts, it is good to commit and push/pull regularly. You should definitely do that any time you stop working on a project, but doing updates in between is often a good idea too. It is better to change a few files and work on just one topic, then commit and push. After that, start the next topic.

This is also true if you work with someone else and send them your updates as pull requests. By breaking them up into smaller units, it is more likely that conflicts are avoided or localized.

If you do end up with merge conflicts – we have all been there – here are a few ways of dealing with them.

Force push

If you are sure your local version is the correct one, you can perform a force push. A force push forcibly overwrites the remote repo with your local repo, so the remote ends up matching what is on your local computer. Any changes made to the remote since you began editing (e.g., a commit someone else pushed while you were working) will be destroyed by the force push, since they are not in your local repo. Use with care.

Discard changes

If you decide that the remote version is more up-to-date and what you did locally can be discarded, you can simply discard your local changes and pull the remote version again. This will overwrite your local files with the remote version, effectively discarding any local changes you made.

One option is to copy the parts you changed to a safe location outside the repository, then discard your local changes, pull the remote, and then re-apply your changes manually. This way, you don’t lose your work, but you can still get the remote version and resolve the merge conflict without much hassle.

Manually resolving the conflict

If you want to keep updates from both your local and the remote repository, you have to resolve the merge conflict.

GitHub Desktop and Positron provide tools to help you resolve conflicts. They show you the two versions of the document, and you decide which parts you want to keep. This works well for text files (code, Rmd, md, etc.). However, it doesn’t work well for other files (Word, Excel, images, Rds files, etc.).
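
For text files, it can help to know what a conflict looks like in the file itself. Git brackets the competing versions with marker lines that start with <<<<<<<, =======, and >>>>>>>. The Python sketch below checks whether some text still contains such markers, for example before you commit a file you think you have resolved. The function name and the example text are made up for illustration, and the check is deliberately simple: a line consisting only of ======= (as used for some Markdown headings) would also be flagged.

```python
def find_conflict_markers(text):
    """Return (line_number, marker) pairs for lines that look like Git conflict markers."""
    markers = ("<<<<<<<", "=======", ">>>>>>>")
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for marker in markers:
            # A marker line is either the bare marker or the marker followed by a label
            if line == marker or line.startswith(marker + " "):
                hits.append((number, marker))
    return hits


# Made-up example of a conflicted R script
example = """x <- 1
<<<<<<< HEAD
y <- 2
=======
y <- 3
>>>>>>> main
z <- 4
"""

print(find_conflict_markers(example))
```

An empty result suggests the markers are gone. It does not tell you whether the merged content is what you intended, so still read the file.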

Confidentiality

The main principle when working with GitHub is that by default everything you put in a Git repository is public and permanent. This can cause a problem for confidential or sensitive data. The next sections discuss best practices for dealing with confidential information while using GitHub.

Anonymizing data

Often, the parts of the data you need for modeling and analysis don’t need to contain confidential or identifying information, like names or dates of birth. It is often a good idea to process your raw data to remove or anonymize such information before putting it into a Git repository. Once you have done that, you can put the anonymized data into a repository. If you want to be extra safe, you might want to opt for a private repository (see next section). If you put it in a public repository, be very careful that you only make things public that are ok to be made public.
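
As a rough illustration of such a processing step, the Python sketch below drops identifying columns and replaces an ID column with a shortened salted hash. All column names, the example rows, and the salt are hypothetical. Note that hashing IDs is pseudonymization rather than full anonymization; whether it is sufficient depends on the rules that apply to your data. The salt itself should be kept out of the repository.

```python
import csv
import hashlib
import io


def anonymize_rows(rows, drop_columns, id_column, salt):
    """Drop identifying columns and replace the ID with a shortened salted hash."""
    cleaned = []
    for row in rows:
        # Keep only the columns that are not flagged as identifying
        new_row = {key: value for key, value in row.items() if key not in drop_columns}
        digest = hashlib.sha256((salt + row[id_column]).encode("utf-8")).hexdigest()
        new_row[id_column] = digest[:12]
        cleaned.append(new_row)
    return cleaned


# Made-up raw data for illustration only
raw_csv = (
    "patient_id,name,date_of_birth,weight_kg\n"
    "101,Ann Lee,1980-02-03,70\n"
    "102,Bo Chan,1975-11-30,82\n"
)
rows = list(csv.DictReader(io.StringIO(raw_csv)))
clean = anonymize_rows(
    rows,
    drop_columns={"name", "date_of_birth"},
    id_column="patient_id",
    salt="keep-this-secret",
)
print(clean)
```

You would run a step like this outside the repository and only place the cleaned output into it.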

Using a private repo

As mentioned, the default for GitHub is to have repositories that are publicly viewable by anyone. However, it is easy to set up a private repository. Private repositories are only viewable by you and people you explicitly give access to. This is a good option if you need to work with sensitive data that cannot be anonymized. Private repositories are included with free GitHub accounts, so you do not need a paid plan to use them. Paid plans such as GitHub Pro (which is free for students via the GitHub Student Developer Pack) add extra features. If you are working in an organization (e.g., your university or company), they might also have a GitHub Enterprise account that you can use for private repositories.

You can switch repositories from public to private and vice versa in the repository settings on GitHub. However, be careful when switching from private to public. Make sure that no sensitive data is in the repository before making it public.

Keeping files local

It might be that you do need to work with data that cannot be anonymized and that you cannot or don’t want to put into a private repository. In that case, you should keep such data files outside of your Git repository. You can still use GitHub for your code and non-sensitive files, but keep the sensitive files on your local machine only. You can then load the sensitive data from your local machine when running your code. This way, the sensitive data never enters the Git repository.

Git allows you to specify files and folders that should not be tracked, and thus not be pushed to your repository on GitHub.com. This is done via a special file called .gitignore placed in the root of your repository. You can add patterns to this file to exclude files or directories that contain sensitive information. For instance, if you have a raw data folder that you do not want to track with Git/GitHub, you would place this into your .gitignore file:

data/raw/

This would exclude the entire raw folder inside the data folder from being tracked by Git. You can add as many patterns as you want to the .gitignore file.

An important drawback of this approach is that if you work on multiple machines, or work with collaborators, everyone needs to make sure to have the same sensitive files outside of the repository on their local machines. This can be a logistical challenge.

A possible workaround is to place these files into a separate sync service that you consider to be safe, such as Dropbox, OneDrive, or Google Drive. You can then load the files from there when running your code. This way, the sensitive files are not in the Git repository, but you can still share them with collaborators via the sync service. (This approach also works well for large files that you can’t sync with Git/GitHub.)
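
One way to make this work across machines is to avoid hard-coding the location of the sensitive files. In the Python sketch below, each person sets an environment variable that points to their own copy of the folder (for example, inside their Dropbox), and the code looks files up from there. The variable name PROJECT_SENSITIVE_DIR and the function name are assumptions made for this example; the same idea can be used in R.

```python
import os
from pathlib import Path


def sensitive_data_path(filename, env_var="PROJECT_SENSITIVE_DIR"):
    """Return the path to a sensitive file stored outside the repository."""
    folder = os.environ.get(env_var)
    if folder is None:
        raise RuntimeError(
            f"Set the {env_var} environment variable to the folder holding the sensitive files."
        )
    path = Path(folder).expanduser() / filename
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; check that the file has been shared with you.")
    return path
```

In an analysis script, you would then call something like sensitive_data_path("patients.csv") (a hypothetical file name) and pass the result to whatever function reads your data. The script itself contains no machine-specific paths and can safely be tracked.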

Remember that Git tracks the full project history. So if at any time you added a file with sensitive information, even if you delete it later, it’s still in the history and can be found. If that happens, you basically need to completely delete the GitHub repository (after copying all important files to a safe location) and start over with a new repository, taking care to keep sensitive files out of this new repo.

Large files

GitHub is not suited for tracking large files. GitHub warns you when you push files larger than 50MB, and it rejects files larger than 100MB outright. Therefore, don’t try to track large files with GitHub! Large files are a major reason beginners run into problems with their GitHub repository.
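
To find out whether a project folder contains files that might cause trouble, you can scan it before committing. The Python sketch below lists files above a size limit, largest first, and skips Git's internal .git folder. The function name is made up for this example, and the default limit of 50 MiB is only an approximation of the threshold mentioned above.

```python
import os


def find_large_files(root, limit_bytes=50 * 1024 * 1024):
    """Return (path, size_in_bytes) for files under root above the limit, largest first."""
    large = []
    for folder, subfolders, files in os.walk(root):
        # Do not descend into Git's internal folder
        if ".git" in subfolders:
            subfolders.remove(".git")
        for name in files:
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue  # skip things like broken links
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling find_large_files(".") from the root of your repository returns the files you should consider reducing, ignoring, or storing elsewhere.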

If you need to work with large files, there are a few options.

Git LFS

One option is to use Git LFS (Large File Storage). Git LFS is an extension to Git that allows you to track large files without bloating your repository. Instead of storing the actual file in the repository, Git LFS stores a pointer file that references the large file stored elsewhere. When you clone the repository, Git LFS automatically downloads the large files for you.

Git LFS takes some effort to set up. It is only worth it if you know you’ll be regularly dealing with large files.

Reduce files

If you have large raw data files but won’t need all of the content for your project, try to reduce their size before putting them into the GitHub repository. You basically perform some cleaning and processing outside the repo, with the aim of getting the data you need for your project into a smaller size. Then copy that reduced dataset into your repo and use it as a starting point. You might also want to change the format of the data to something that is compressed/optimized. For instance, if you have raw data in CSV format, you can remove parts you don’t need for your project and save the rest as an Rds or other compressed format. That file might be small enough to be tracked by GitHub.
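
A reduction step like this could look roughly as follows. The Python sketch reads a CSV file row by row, keeps only the columns you name, and writes the result as a gzip-compressed CSV. The function name is hypothetical, and gzip is just one possible compressed format; in R, saving an Rds file plays a similar role.

```python
import csv
import gzip


def reduce_csv(source_path, target_path, keep_columns):
    """Copy only keep_columns from a CSV into a gzip-compressed CSV; return the row count."""
    count = 0
    with open(source_path, newline="", encoding="utf-8") as source, \
            gzip.open(target_path, "wt", newline="", encoding="utf-8") as target:
        reader = csv.DictReader(source)
        # extrasaction="ignore" silently drops every column not listed in keep_columns
        writer = csv.DictWriter(target, fieldnames=keep_columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```

Because the file is processed one row at a time, this also works for raw files that are too large to load into memory at once. You would run it on the raw file outside the repository and write the reduced file into the repository.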

Prevent large intermediate files

Some modeling workflows create large intermediate files (e.g., model output files). If these files are not needed for further analysis, make sure to keep them out of the Git repository by adding them to the .gitignore file. The only drawback is that if you need these files later, you’ll have to re-generate them from scratch, which could involve running the model again. Depending on model complexity, this could be time-consuming.

Keeping files local

You can also follow the steps outlined above in the Confidentiality section to keep some files out of the GitHub workflow.

For instance, you can place large files into a special folder in your repository (e.g., one called largefiles) and then add an entry to the .gitignore file to tell Git to ignore this folder. The problem is of course the same as mentioned above: if someone else wants to work on your project, they won’t automatically have those large files. If the files are generated by your code (e.g., they are the result of running a simulation), they can just re-run your code and get themselves a local copy of these files. If that’s not possible, either because the files are input (such as data) or because it takes too long to re-run the code, you will have to manually share these files or folders with them.

Starting over

Sometimes, you might have gotten yourself into a deep GitHub mess. For instance you just can’t properly resolve your merge conflicts. Or you tried to track large files, even though you were told not to do so. Or you accidentally placed a confidential file into a public repo and pushed it online.

Fear not, there is always an option to start over. Here is how you can do that.

First, move the local main repo folder and its contents to a safe location on your computer, and maybe rename it to be safe, e.g. call it myrepo-local-old. Then re-clone the repo from GitHub.com to your local computer and also move it to a safe location, again maybe renaming it to myrepo-remote-old. You now have local copies of all content in a safe location.

Next, delete the entire repository both locally and on GitHub.com. To do that on GitHub.com, go to your repo, then go to Settings (the gear icon), scroll all the way down to the bottom, and find the Delete this repository button in the Danger Zone section. Click on it and follow the instructions.

Now, recreate an empty repository on GitHub.com. You probably want to give it the same name as before, but don’t have to. Clone this new empty repository to your local computer. Then copy all the files you want to keep into this new local repository folder. Make sure to not copy over any files that caused problems before (e.g., large files, confidential files, etc.). Also make sure you have the right .gitignore file. You can also do this in bits and pieces, committing and pushing as you go along, to make sure everything works.
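
If there are many files, you can script the copy step so that problem files are left behind. The Python sketch below copies an old project folder into the freshly cloned one while skipping the old .git folder (so the old history does not overwrite the new repository), any file or folder names you list, and any file above a size limit. The function name, its parameters, and the default limit are assumptions made for this example. It relies on the dirs_exist_ok option of shutil.copytree, which requires Python 3.8 or newer.

```python
import os
import shutil


def copy_clean(old_folder, new_folder, skip_names=(), limit_bytes=50 * 1024 * 1024):
    """Copy old_folder into new_folder, leaving out .git, skip_names, and large files."""

    def ignore(folder, names):
        # copytree calls this for each folder; return the names that should not be copied
        skipped = []
        for name in names:
            path = os.path.join(folder, name)
            too_large = os.path.isfile(path) and os.path.getsize(path) > limit_bytes
            if name == ".git" or name in skip_names or too_large:
                skipped.append(name)
        return skipped

    shutil.copytree(old_folder, new_folder, ignore=ignore, dirs_exist_ok=True)
```

For example, copy_clean("myrepo-local-old", "myrepo", skip_names={"patients.csv"}) would copy everything except the old history, large files, and anything named patients.csv (a hypothetical confidential file). Names in skip_names are matched anywhere in the folder tree, so check the result before committing.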

Finally, once everything is in your new repo on your local machine, and everything runs/works/renders, push the new local repository to GitHub.com.

If everything works, you can now delete the old copies you saved on your computer. You might decide to keep them around for a while, since by creating a new repo you lost the history, so you can’t go back to prior versions in the new repo.

Summary

Using GitHub requires paying attention to common pitfalls like merge conflicts, confidentiality issues, and large files. As long as you have a plan for these, you can usually work around them and still use GitHub for your work.

Further Resources

Test yourself

What does a force push do?

Force push replaces the remote state with your local state and can destroy others’ changes.

How can you keep sensitive files out of a GitHub repository?

Sensitive files should stay outside the repo, and .gitignore prevents tracking.

What is a key risk if you ever commit sensitive data to a public repository?

Git keeps full history, so sensitive data can persist even if you delete the file later.

Practice