Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Exercise

For this exercise, you will practice with GitHub and do parts of a very small data analysis. You will start doing group work.

Group setup

If you haven’t already, find your group. You can find that information in the pinned post in the announcements channel. Get in touch with your group members. You will need to exchange GitHub user names. Assign each group member an (arbitrary) number (I’m calling them M1, M2, …). You will start working in your portfolio repository and finish this part by Wednesday. Then M2 will contribute to M1’s repository, M3 will work on M2’s, etc. M1 will contribute to the repository of the last person in the group (M3/M4/M5, based on the number of people in your group). This way, everyone will work on their own and one group member’s repository.

Because there are multiple parts to this exercise, the due dates are adjusted; see below.

Part 1

Take a look inside the starter-analysis-exercise folder. You will find files and folders that provide a template for a data analysis.

The GitHub repository with the full template is called dataanalysis-template. You’ll be using it for your class project later in the course.

Go to the GitHub repository for the dataanalysis-template and read the information in the README.md file (shown at the bottom of the repository website). You can ignore the bits about renv; we are not using that just yet.

Then take a look at the different folders, files, and README comments that you find inside the starter-analysis-exercise folder. Feel free to run the various R and Quarto scripts. The idea is that you get some familiarity with the whole setup to prepare you for the next steps.

Now go ahead and open the exampledata.xlsx file (with Excel or a similar program) and add 2 more columns to the data. One column should be something numeric, the other can be something consisting of (a few) different categories. As a (boring) example, you could add eye color and waist size. Feel free to come up with more creative attributes/variables to add. Also add descriptions of your new variables to the Codebook sheet. Once done, save the new data as exampledata2.xlsx, then commit your changes (write a meaningful commit message) and push them to GitHub.com.

You need to have this done by Tuesday evening.

Part 2

Based on the group setup you did above, tell the classmate who will be working on your project that it’s ready and where to find it. Assuming you are M1, you would tell M2 that things are ready. At the same time, you should be notified by another classmate that their repository is ready for you (again, if you happen to be M1, it would be the last person in the group, say M4 or M5).

You will work on your classmate’s repository using what is called the Fork and pull-request workflow. You’ll be using that multiple times throughout this class.

The basic idea is as follows. First, you make a copy of someone’s GitHub repository. In GitHub terminology, that is called doing a fork of their repository. You can do that for any public repository.

Next, you implement your improvements in the forked repository. Once you are done, you ask the owner of the original repository to incorporate the updates you made in the fork into their main repository. This last part is called issuing a pull request. You are requesting that the other person pull your changes into their repository, hence the at times confusing (at least for me) terminology. I prefer to think of them as merge requests or sync requests, i.e. you are requesting that they merge or sync your changes into their repository. You’ll find the terminology merge request is used at times. If the person who controls the main repository likes your changes, they will merge your fork into the main branch. And just like that, you have contributed to some project becoming better! We will practice this fork and pull flow now.

Find the repository of the team member you will contribute to and fork their repository on GitHub.com. You can find links to everyone’s portfolio repository in the Introductions channel. Then clone your fork to your local machine, as you have done previously with your own repositories. In fact, this fork is now your own repository; you have it forever, even if the person who owns the original repository deletes theirs.

Go into the starter-analysis-exercise folder and look at the exampledata2.xlsx file and information in Codebook to understand what new variables your classmate created.

Find processingfile.qmd inside /code/processing_code/, make a copy of the file, call it processingfile2.qmd. Update the code in processingfile2.qmd such that it now loads the new data file called exampledata2.xlsx. Take a look at the new data. If necessary, add a few lines of code to clean the new data as needed. Have the code save the updated data to processeddata2.rds.
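The processing step could look roughly like the sketch below. Since the real column names depend on what your classmate added, the example builds a small stand-in data frame with hypothetical `EyeColor` and `WaistSize` columns (the boring example from Part 1) and saves to a temporary folder. In processingfile2.qmd you would instead read exampledata2.xlsx (for example with `readxl::read_excel()`) and save to processeddata2.rds in the template's processed-data folder. The cleaning rules, including the weight cutoff, are assumptions for illustration only.

```r
# Stand-in for the data that would be read from exampledata2.xlsx.
# Column names and values are hypothetical; adjust to your classmate's file.
rawdata <- data.frame(
  Height    = c(180, 175, NA, 160, 168, 172),
  Weight    = c(80, 70, 65, 55, 7000, 74),            # 7000 mimics a data-entry error
  EyeColor  = c("brown", " blue", "green", "Brown", "blue", "green"),
  WaistSize = c("90", "82", "78", "sixty", "74", "85"), # numeric column with a text entry
  stringsAsFactors = FALSE
)

clean_data <- function(d) {
  # Harmonize category labels (case, stray spaces) and store them as a factor
  d$EyeColor <- factor(tolower(trimws(d$EyeColor)))
  # Convert to numeric; entries that are not numbers become NA
  d$WaistSize <- suppressWarnings(as.numeric(d$WaistSize))
  # Drop rows with missing values or an implausible weight (assumed cutoff)
  d <- d[complete.cases(d) & d$Weight < 1000, ]
  rownames(d) <- NULL
  d
}

processeddata <- clean_data(rawdata)

# tempdir() stands in for the folder where processeddata2.rds should go
save_location <- file.path(tempdir(), "processeddata2.rds")
saveRDS(processeddata, file = save_location)
```

What exactly needs cleaning depends on the data your classmate entered, so take a look at it first (for example with `str()` or `summary()`) before deciding which of these steps you actually need.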

Next, open eda.qmd inside /code/eda_code (EDA stands for exploratory data analysis). Make a copy, edit the new file such that it creates a boxplot with the new categorical variable (whatever it is) on the x-axis, and height on the y-axis. Also create a scatterplot with weight on the x-axis and the new numerical variable on the y-axis. Save both figures to files.
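As a rough sketch of the two figures, assuming the ggplot2 package and the same hypothetical `EyeColor` and `WaistSize` columns as stand-ins for whatever your classmate added: the data frame, file names, and output folder below are made up so the example runs on its own. In your copy of eda.qmd you would load the processed data with `readRDS()` and save into the template's figures folder.

```r
library(ggplot2)

# Hypothetical processed data; replace with readRDS() of processeddata2.rds
mydata <- data.frame(
  Height    = c(180, 175, 160, 168, 172, 158),
  Weight    = c(80, 70, 55, 62, 74, 50),
  EyeColor  = factor(c("brown", "blue", "green", "brown", "blue", "green")),
  WaistSize = c(90, 82, 70, 76, 85, 68)
)

# Boxplot: new categorical variable on the x-axis, height on the y-axis
p_box <- ggplot(mydata, aes(x = EyeColor, y = Height)) +
  geom_boxplot()

# Scatterplot: weight on the x-axis, new numerical variable on the y-axis
p_scatter <- ggplot(mydata, aes(x = Weight, y = WaistSize)) +
  geom_point()

# Save both figures; tempdir() stands in for the template's figures folder
box_file     <- file.path(tempdir(), "height-eyecolor.png")
scatter_file <- file.path(tempdir(), "waistsize-weight.png")
ggsave(filename = box_file, plot = p_box, width = 5, height = 4)
ggsave(filename = scatter_file, plot = p_scatter, width = 5, height = 4)
```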

Once you have written your code, confirm that it runs without problems and errors, and produces the expected results. Once everything is good, commit and push your changes to the remote. Note that this pushes to your fork (i.e. copy) of the repository. Now it’s time to offer your contribution to your classmate to integrate into their repository.

Let’s contribute your code to your classmate’s project. Go to the GitHub.com website for your (forked) repository. In the top left, you should see the New Pull Request button. (Underneath, it should say something like this branch is N commits ahead of NNN). Click the New Pull Request button. A page comes up showing what you want to do, which is to send your changes to the original repository. Hopefully, you’ll see a green check-mark that says able to merge. If your classmate made changes to the same files you did, it could have created a merge conflict. Hopefully, this won’t be the case. In either case (merge conflict or not), you can click the green Create pull request button. You are presented with a form in which you should add a meaningful title, so the other person knows what you changed, and some more explanation in the text box. Then click Create pull request.

You need to have this done by Thursday evening.

Part 3

By Thursday evening, everyone should have received a pull request from the classmate who worked on their repository.

Go to the GitHub site and to your online portfolio repository. Click on Pull requests, then click on the request (there should only be one). Take a look. On the first page, it shows you their message and if there are conflicts with your version of the repository. Hopefully, you didn’t change things around while they did, so there shouldn’t be any conflicts. Click on the Files changed button, which will show you an overview of the code they changed. Removals are red, and additions are green.

On the main pull request site, you can do various things. If you don’t like the suggested edits, you can write a comment and close the pull request without merging their changes into your repository. If you like most of what they did, but there is something they need to adjust, write a comment and let them know. Close the pull request and ask them to send a new one. If you are ok with their changes (hopefully, this is the case here), you can merge the pull request and close it. Their updates are now part of your repository.

Once you have finished merging their updates into your repository online, pull the latest version to your local computer.

Next, find statistical_analysis.R inside /code/analysis_code. Make a copy. Edit the code such that it fits a third linear model with Height as outcome and the 2 new variables as predictors. Save the result into resulttable3.rds.
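The model fit could look something like the following sketch. It assumes the broom package and, again, hypothetical `EyeColor` and `WaistSize` columns with made-up values; the object names `lmfit3` and `lmtable3` are my own choice. In your copy of the analysis script you would load the processed data and save the table into the template's results folder instead of a temporary one.

```r
library(broom)

# Hypothetical processed data; replace with the data your script loads
mydata <- data.frame(
  Height    = c(180, 175, 160, 168, 172, 158, 165, 177),
  EyeColor  = factor(c("brown", "blue", "green", "brown",
                       "blue", "green", "brown", "blue")),
  WaistSize = c(90, 82, 70, 76, 85, 68, 74, 88)
)

# Third model: Height as outcome, the two new variables as predictors
lmfit3 <- lm(Height ~ EyeColor + WaistSize, data = mydata)

# Convert the fit into a data frame with one row per model term
lmtable3 <- broom::tidy(lmfit3)
print(lmtable3)

# tempdir() stands in for the folder where resulttable3.rds should go
table_file <- file.path(tempdir(), "resulttable3.rds")
saveRDS(lmtable3, file = table_file)
```

Because the categorical predictor is a factor, `lm()` creates one coefficient per category beyond the reference level, so the table has more rows than the model has variables.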

Now we are ready to include the new findings. Open Manuscript.qmd. Add code to display the two new figures and the table from your new model fit. Briefly explain what it shows and what it means.
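Here is a minimal sketch of the kind of R code that could go into code chunks in the manuscript, assuming the knitr package. The first few lines only create a placeholder figure and a placeholder table (with made-up numbers) in a temporary folder so the example runs on its own. In Manuscript.qmd those files already exist, and you would point to their actual locations in your repository.

```r
library(knitr)

# Placeholders so this sketch is self-contained. In the manuscript, these files
# were already produced by the EDA and analysis scripts.
figure_file <- file.path(tempdir(), "height-eyecolor.png")
png(figure_file)
plot(1:10)            # placeholder image, not a real result
dev.off()

table_file <- file.path(tempdir(), "resulttable3.rds")
saveRDS(data.frame(term = c("(Intercept)", "WaistSize"),
                   estimate = c(100.5, 0.9)),   # made-up numbers
        file = table_file)

# In one chunk: display a saved figure
knitr::include_graphics(figure_file)

# In another chunk: load the saved model table and show it as a formatted table
resulttable3 <- readRDS(table_file)
knitr::kable(resulttable3, digits = 2)
```

Loading saved figures and tables, rather than recomputing them inside the manuscript, keeps the manuscript in sync with whatever the analysis scripts last produced.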

Once all is done, fully rebuild your website. Commit and push your changes. Go to your website URL to make sure the updated version shows up.

You need to have this done by Friday evening. Since this will be part of your portfolio site, and you already posted a link to that previously, you don’t need to post anything; I know where to find it. I will assess the contributions of both the repository owner and the classmate who added to it.

Open the file that contains the coding exercise. It should look fairly similar to yours, though probably somewhat different since there are many ways to do things in R, and their comments might be different.

Add code and text below the part your classmate did. Add a heading to indicate where your section starts and also add your name. Specifically, have a heading that says # This section added by YOURFULLNAME. I need this so I can grade accordingly.

Write some more code to do some more analysis of the dataset. This is fairly open-ended. I want you to create a few more plots of the data, fit another model to some part of the data, and display the fitting results in a table. You can, for example, use the broom package to convert output from lm into a data frame you can show as a table. What exactly you explore, plot and fit is up to you.

I mentioned this before, but it bears repeating. At times, you will get merge conflicts. At that stage, it gets a bit tricky. GitKraken has a good tool to help you resolve conflicts, and it works well for text files (code, Rmd, md, etc.). It doesn’t work well for other files (Word, Excel, etc.). Sometimes you have to temporarily move one of the conflicting files out of the repository, then do the merge, then manually see how the files changed and do the merge yourself. To minimize conflicts, it is good practice to make multiple pull requests with small changes instead of a single large pull request. If you changed 20 files and 2 of them create a conflict, the person has to reject your complete pull request if they don’t want those conflicts. It is better to change a few files and work on just one topic, then issue a pull request. After that, start the next set, and issue another pull request. By breaking them up, it is more likely that conflicts are avoided or localized.

More information

Some more information on forks and pulls and what to do if things don’t go right can be found in happygitwithR. Note that a lot of the commands described for use on the command line (e.g. stash) can be applied graphically through GitKraken.

GitHub also has branches. Those are similar to forks but meant for more internal use. For instance, if you have a project and want to implement something new, but it might not work, you can create a branch, work in that branch, and once everything is ok, you can merge the branch into the main/master branch. This is useful if you write software that others are using, and you don’t want to break the whole thing. It is also helpful if you work with collaborators on a project. To be able to use branches, you need to be an owner or member of a repository. In contrast, you can fork any public repository.

Another fork and pull exercise

This is optional. You can do it at any time during this course (and more than once) 😁.

Help improve the course with your contributions! Find something wrong/unclear/worth improving with this course (e.g. a typo, something confusing, a broken link, a suggestion for a new reference, or anything else). Go to the GitHub repository for this course. Follow the steps outlined above: fork the course repository to your personal account, clone it to your local computer, implement your updates, push them back to GitHub, then initiate a pull request. I will get a notification of your pull request. If things look ok and no conflicts exist, I will merge your improvements into the course. And just like that, you have contributed to improving this course! (And of course, you will be listed in the Acknowledgments section of the main course page.)

Another option for helping to improve the course website is to file a GitHub Issue. Feel free to do so any time during the course to let me know of anything that needs fixing.

Discussion

Look online and find an example of a research project that provides (or claims to provide) all materials to allow reproduction of results, similar to Dr. Brian McKay’s project I shared with you. If you are able, download the materials and see if you can reproduce things. I suggest you focus on projects that are done with our set of tools (R/Quarto, etc.), but that’s not required. Report the project you found and your experience being able (or not) to reproduce it as a post in the Module 2 Discussion channel.

Then take a look at a few of your classmates’ postings and explore/comment on what they found.