We’ll get into the details of what data analysis is and how to do it very soon. But to start with, I want to provide a few hopefully inspiring and motivating examples of data analysis and use. You’ll be doing similar projects yourself soon! In fact, you will be fully reproducing several analyses below.
There are lots of good videos online discussing different aspects of the vast topic of data analysis. TED is an especially good source for such videos.
Arguably one of the best communicators of public health data was the late Hans Rosling. He gave several entertaining and enlightening TED talks. Here is one of them: A statistics talk that has been viewed millions of times!
In this entertaining and empowering talk, Talithia Williams argues for the utility of careful assessment of data in your personal life.
In this talk, Sebastian Wernicke gives some cautionary suggestions about trusting data too much and argues for critical assessment as an important part of any data analysis and decision process.
The TED website has a large number of data related talks. Feel free to explore as much as you like. If you find especially good ones, post them on the class discussion board.
It is now very common for individuals to post data analyses online in a reproducible manner, providing all the code and data needed to recreate them.
An interesting initiative started by Thomas Mock is TidyTuesday, where every Tuesday a new dataset is released here and individuals are encouraged to analyze it. Later in this course, you will be participating in some TidyTuesdays yourself. TidyTuesday has been running for more than 3 years.
Individuals who participate in TidyTuesday often post their projects online and link to them on social media. We will look at an example by David Robinson. David records screencasts that show him analyzing the data. Here is a recent example:
You don’t need to watch all of the almost 2-hour-long recording 😄, but do watch the first few minutes to see him coding and analyzing in real time. You might just get hooked and keep watching. You might not be familiar with some or most of the R coding, but you will still get the overall idea of how he does the analysis. His thinking out loud is informative as well. If you like being able to look over his shoulder while he does data analysis, he has a bunch more of those screencasts.
David does his work reproducibly and shares the code as R Markdown files. For specific files, see the links provided in the video description. For convenience, here is the link to download the code for his analysis. Just save the file after it opens in your browser. It should be named `2021_05_18_salary_survey.Rmd`. Click on it, and it should open in RStudio.
Once the file is open in RStudio, you might see at the top of the file a message suggesting that you need to install several packages to run the file. Do so. If you don’t get that auto-suggestion, you need to install the packages by hand. Every package that is called with the `library()` command needs to be installed first.
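As a rough sketch of what installing by hand can look like, the base R snippet below scans lines of code for simple `library()` calls and reports which of those packages are not yet installed. The helper name `find_library_calls` and the sample lines are made up for illustration, and the pattern only catches plain calls such as `library(dplyr)`; packages that a script needs without a `library()` call will not be found this way.

```r
# Hypothetical helper: pull package names out of simple library() calls.
find_library_calls <- function(lines) {
  pattern <- "library\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?\\s*\\)"
  # Keep only the lines that contain a library() call (first call per line).
  hits <- regmatches(lines, regexpr(pattern, lines))
  # Reduce each match to the package name and drop duplicates.
  unique(sub(pattern, "\\1", hits))
}

# Made-up sample lines standing in for the contents of a script.
example_lines <- c(
  "library(tidyverse)",
  "library(tidymodels)",
  "x <- 1 + 1",
  "library(\"lubridate\")",
  "library(tidyverse)"
)

needed <- find_library_calls(example_lines)

# Compare against what is already installed on this machine.
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
print(missing_pkgs)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

To check a real file instead of the sample lines, you could pass in the result of `readLines()` on that file.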
This is a good example of an almost-working reproducible example. It’s quite rare that things work completely without adjustments on different machines, and how best to ensure that they do is a huge topic in data analysis (e.g., for drug companies that need full reproducibility for licensing). In this example, I had to make the following adjustments:

* Change `output: html_output` to `output: html_document` in the YAML section.
* Install the `ranger` package, which is not listed under the `library` statements but is needed.
With those changes, I was able to run the whole script by hitting the `Knit` button in RStudio. If you are still missing a package, you will get an error message. If things work out, the whole script will run for a while (depending on the speed of your computer), and you should get a document/report which reproduces the complete analysis he did in the video! If things don’t work out, post your error message to Slack so someone can help troubleshoot. One potential problem could arise from the fact that David uses parallel computing to make things run faster. If you get an error message, you can try commenting out the line of code that says `doParallel::registerDoParallel(cores = 4)` and see if it works without it. It ran fine with that line on my Windows machine, but it might not always work. (If on Windows, you might be asked about firewall network permissions. That’s a quirk of how parallel computing works on a single machine. Just say “yes” to both public and private networks.)
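If the fixed `cores = 4` does not suit your machine, one alternative to commenting the line out is to choose the number of workers based on what your computer reports. The sketch below is only an illustration and is not part of David's script; `choose_cores` is a hypothetical helper, and the registration step is skipped when the `doParallel` package is not installed.

```r
# Hypothetical helper: leave one core free, never go below one worker.
choose_cores <- function(available, reserve = 1L) {
  # detectCores() can return NA on some platforms; fall back to 1 worker.
  if (is.na(available)) {
    return(1L)
  }
  max(1L, as.integer(available) - reserve)
}

n_cores <- choose_cores(parallel::detectCores())

if (n_cores > 1L && requireNamespace("doParallel", quietly = TRUE)) {
  # Same call as in the script, but with a machine-specific worker count.
  doParallel::registerDoParallel(cores = n_cores)
  message("Registered ", n_cores, " parallel workers.")
} else {
  # Without a registered parallel backend, the code runs sequentially.
  message("Running sequentially.")
}
```

Running sequentially is slower but produces the same kind of report, so it is a reasonable fallback while troubleshooting.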
Again, most or all of the R code might not make sense to you just yet. That’s ok. The main point is to show you a nice example of a data analysis done in a way that others (you) can reproduce with just a few clicks. As part of this class, you will be producing those kinds of reproducible R Markdown files.
Fully open and reproducible analyses are also becoming common in academic research, with authors publishing everything needed to reproduce their paper and results. In my research group, we try to adhere to this approach as much as possible. As an example, I’m sharing a recent project/paper by Brian McKay, a former grad student in my group. You can find the published paper here. All materials to reproduce the full project and all results are available as supplementary material here. Take a look at the `Usage Notes` section on the website.
Next, click on `Download Dataset`, and unzip the file. You should end up with a folder called `ProcSocB Supplemental Material`.
Inside that folder you can see subfolders containing all the data and code to fully reproduce Brian’s analysis and manuscript. Find and click on `Virulence_Trade-off.Rproj`. This should open in RStudio. To reproduce the manuscript, go into the `Manuscript` folder and open `Manuscript.Rmd`. You’ll see some code at the top which loads various datasets containing results. Further down, you’ll find the text of the manuscript, combined with simple R commands that dynamically place numbers and figures into the text. Re-create the manuscript by clicking on the `Knit` button. There might be error messages related to missing packages. If that happens, install those packages (e.g., `bookdown`). If you get stuck, post your problem to the discussion boards.
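To give a feel for what "dynamically placing numbers into the text" means, here is a small base R sketch. In an R Markdown file this is done with inline expressions written as `` `r expression` ``; the sketch imitates the idea with `sprintf()`. The data and the function name `summary_sentence` are invented for illustration and have nothing to do with Brian's actual data or code.

```r
# Made-up example data; not taken from Brian's project.
infections <- data.frame(
  group = c("A", "A", "B", "B"),
  days_sick = c(3, 5, 6, 8)
)

# Build a sentence whose numbers are computed from the data.
summary_sentence <- function(dat) {
  sprintf(
    "We analyzed %d records; the mean illness duration was %.1f days.",
    nrow(dat), mean(dat$days_sick)
  )
}

print(summary_sentence(infections))

# Suppose the last value was a data-entry error and is corrected to 10.
corrected <- infections
corrected$days_sick[4] <- 10

# Re-running the same code gives an updated sentence, with no manual edits.
print(summary_sentence(corrected))
```

Because the sentence is generated from the data, fixing the data and re-running is enough to update every number that depends on it.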
Feel free to explore further by opening various R scripts and running them. The `Usage Notes` section on the website where you downloaded the zip file and the `Supplementary-Material.docx` document explain what the different pieces do and how to use them.
As you can tell, this is more complex and complete than the single R Markdown file example above. Depending on the project you work on, a single file might work fine. For any more detailed data analysis, a structure like the one used in Brian’s project is a good idea. You will use a setup like this for the course project.
This example hopefully gives you an idea of how nice and easy it is to write a whole paper with just a few clicks of a button. It is a combination of R scripts and R Markdown files (in this case a version called bookdown), which allows fully automated generation of a complete data analysis and manuscript, including all figures, tables, references, etc. This is an incredibly efficient workflow and one you’ll learn and use in this class. For instance, if there were a mistake somewhere in the data, instead of going through the manuscript and manually changing numbers, Brian could just re-run the whole code, and it would produce new results based on the updated data in a fully automated manner. It’s a bit more work upfront until you get the hang of it, but once you have it set up, it can save an enormous amount of time and avoid potential copy & paste mistakes.