Guidance and Tips for R/RStudio/GitHub

Author

Andreas Handel

Modified

2024-03-20

Overview

This document is a collection of guides and tips related to getting started and using R, R Studio and GitHub, based on stumbling blocks that I have noticed students encounter somewhat regularly.

General Coding

Here are a few brief suggestions for good coding practices in R (and any other language.)

Write modular code. The idea is that you break things into logical components, and structure your code that way. For instance R Markdown chunks are little modules. Each chunk should do some actions that make sense to be grouped together. For instance one or a few data cleaning steps can be in a code chunk. Then for the next set of actions (e.g. making a plot), you start a new chunk. You should also add comments explaining not only each line of code, but also what those modules do. Structuring your codes into sections/modules that do specific things makes it easier to follow what is going on.

Write DRY code. DRY stands for Don’t Repeat Yourself. The idea is that if you have code that does the same thing more than once (or maybe more than a few times), you should rewrite your code in such a way that you write a single bit (module) of code that you call repeatedly. The most natural way this is done is with writing a function.

Write functions. Pretty much everything you use in R (or other languages) is a function. A function is a bit of code that you give some inputs (also called arguments), the function does something with those inputs, and usually returns something back to you (not always, sometimes the function might write a file to your computer). For instance if you define some object/variable x and send it to the function sum(), then sum(x) returns the sum of the elements in x. It likely also does a few other things, such as checking that everything contained in x is a number that can be summed. While writing your own functions is a bit more advanced, at some point you should do it. Functions are automatically modular and allow for DRY coding. They are therefore important components of most code. It takes a bit of practice, but once you understand how functions work, you should be able to write your own as part of your code. Functions can either live in separate files, or in the same file (usually at the top) from which you call it. We generally call code that just runs, without being passed in things and returning stuff, a script to differentiate it from a function. If you are completely new to functions, you might want to check out the Functions section in Hands-On Programming with R.

Have consistent style. There are often many ways to implement a certain operation with code. There are also many ways how you style your code (e.g., give variables names, use indentation). It’s important to write code that looks consistent and is written in a way that’s easy to read and comprehend by others and your future self. For R, the tidyverse has a style guide which is a good one to use. But you don’t have to follow it, you can have your own style.

R packages

Sometimes when you install or update a package, you might get the message Do you want to install from sources the packages which need compilation. What that means is that there is an R package available that has just been updated and exists as source but not yet turned into what’s called a binary format that’s pre-compiled for whatever operating system you use. There is almost never a reason to get the very latest source version. And if you want to install from source, you might need Rtools to do so. Therefore, I suggest that when you get this question/message, you check No.
Sometimes during install or updating of a package, you might get a package is loaded/in use message. You are usually offered to restart R. Agree to that. It might fix things, but not always. The best way to install new packages if you run into those problems is to completely shut down RStudio, then restart. Next, immediately go to the console (bottom left) and install the needed packages with install.packages('PACKAGENAME') without doing anything else. That should minimize any potential conflicts.
Sometimes when you try to install a package you get the message package ‘PACKAGENAME’ is not available for this version of R. This is a bit confusing, it sounds like you need to upgrade your R version. What really happens most of the time is that you mis-typed the package name, and that package just doesn’t exist (for no version of R). So first make sure you are spelling the package name correctly, including all capitalization being correct. That fixes things almost always. It is rare that there is a package that you want, but not for your version of R (unless you haven’t updated R in years.)

R Coding

Vectorize. In R, most things you do can be vectorized. What that means is that if you have multiple elements in an object, e.g., multiple entries in a vector, or many rows and columns in a data frame, you can use functions that process each entry at once. For instance if you use the filter function from dplyr like this: filter(mydata, age > 50), the function looks at the age for each row in the column called age and keeps those that are larger then 50. If you wanted to do the same in another programming language, like C/C++, you would need to write code that explicitly runs through each row of mydata and looks at age for each row and decides to keep it or not. That would be multiple lines of code. There are other languages that are vectorized like R, so it’s not just R. But it’s an important feature of R and whenever possible, you should try to apply your operations on all elements at the same time instead of writing for loops. (Though sometimes, loops are easier to read and code, so don’t feel like you can’t ever use them again).

Don’t mix and match too much. You have already or will likely figure out soon that there are often many ways to do something in R, especially once you start using R packages, such as the tidyverse set of packages. While it is perfectly ok to write some code that uses just the basic R commands, and other code that makes use of packages, you do need to be somewhat careful when you mix and match. As an example, there are the basic R commands cbind and rbind to combine data frames by rows or columns. There are also the commands bind_cols and bind_rows in the dplyr package. They do more or less the same thing, but have slightly different behavior. For instance the returned object might be of a different class (e.g., if you cbind two matrices, the result is a matrix, if you use bind_cols, you get a tibble/data frame.) Sometimes, these slight differences do not matter, but other times they might. It is therefore important to pay attention if you switch between different ways of doing things in R, and if possible stick with a given approach (e.g. all/most base R commands or mostly tidyverse).

GitHub

Avoid large files

GitHub is not suited for tracking large files. If you try to push/pull files larger than say 20MB, things might not work. Therefore, don’t try to track large files with GitHub! Large files are the #1 reason students have problems with their GitHub repository!

If you do have large files for a project, there are a few options:

Shrink the large file outside of GitHub. For instance if you have raw data in CSV format, you can remove parts you don’t need for your project and save the rest as an Rds or other compressed format. That file might be small enough to be tracked by GitHub.
You can use Git LFS to track them. But that’s a bit more advanced.
You can place the large file into a special folder in your GitHub repository (e.g. one called largefiles) and then add an entry to the ’.gitignore` file to tell GitHub to ignore this folder. The problem with that is that if someone else wants to work on your project, they won’t automatically have those large files. If the files are generated by your code (e.g., they are the result of running a simulation), they can just re-run your code and get themselves a local copy of these files. If that’s not possible, either because the files are input (such as data) or it takes too long to re-run the code, you will have to manually share these files/folder with them.

Resolve merge conflicts

At times, you will get merge conflicts. For instance you might have made changes to your local repository that you didn’t mean to. Or someone else (or you on a different computer) added changes to the Github.com repository and things don’t merge with your local repo. If you have some local updates that you want to keep, here is a method that often works.

At that stage, it gets a bit tricky. GitKraken has a good tool to help you resolve conflicts, and it works well for text files (code, Rmd, md, etc.). It doesn’t work well for other files (word, Excel, etc.). Start by using the GitKraken tools, compare files and decide which version of any conflicting ones to keep (or to combine things).

If the GitKraken tools don’t help, you can do some manual intervention as follows:

Move the files you changed and that create the problems to some “safe” location (outside of your repository).

2a. If you are using GitKraken and you haven’t committed your local changes, there is a red trash button symbol in the top left. If you click on it, you can discard your changes. once you have done that, you can sync with the remote. Your local and remote should be in sync again, and you should be able to move to step 3.

2b. If you already committed your changes, you can right-click in GitKraken on the latest version on the remote commit (those symbols with comments in the main window). Then pick reset main/master to this commit and then choose a hard. This overwrites anything you’ve done locally with the remote version on Github.com, so make sure any local changes you want to keep have been copied to a safe place (step 1). Your local and remote should be in sync again, and you should be able to move to step 3.

Once all is ok again and your local repository and the remote one are in sync, you can copy the files (or the content from those files) you placed in a safe location in step 1 back into your repository. Then continue with the usual work flow.

In general, to minimize conflicts, it is good to regularly create issues and push/pull. You should definitely do that any time you stop working on a project. But sometimes doing updates in-between is also good. It is better to change a few files and work on just one topic, then commit and push. After that, start the next topic.

This is also true if you work with someone else and send them your updates as pull requests. By breaking them up into smaller units, it is more likely that conflicts are avoided or localized.

Start over

Sometimes the previous fixes doesn’t work. For instance if you tried to track large files, even though you were just told not to do so, you might end up in a situation where there’s no simple fix, other than to start over. Here is how you can do that:

Rename the repo on GitHub.com to something like myrepo_old. You can do that in the settings of your repository. Also give your local repo/folder the same name.
Create a new repo on GitHub.com with the name of the prior one. Clone this new repo to your local computer.
Copy all the good stuff from your old repo into the new one. Make sure to not copy over the bits that cause the problem.
Push the updated local repo to GitHub.com.
Once certain that everything works, delete the old repo both online and on your local computer. You might decide to keep it around for a while, since by creating a new repo you lost the history, so you can’t go back to prior versions in the new repo.