We’ll get into the details of what data analysis is and how to do it very soon. But to start out, I want to provide a few hopefully inspiring and motivating examples of data analysis and use. You’ll be doing similar projects yourself soon! In fact, you will be reproducing several analyses below.
There are lots of good videos online discussing different aspects of the vast topic of data analysis. TED is an especially good source for such videos.
Arguably one of the best communicators of public health data was the late Hans Rosling. He has several great entertaining and enlightening TED talks. Here is one of them: A statistics talk that has been viewed millions of times!
You can watch more TED talks by Hans Rosling here. The software he uses to illustrate data is called Gapminder and available online on the Gapminder website.
In this entertaining and empowering talk, Talithia Williams argues for the utility of careful assessment of data in your personal life.
In this talk, Sebastian Wernicke gives some cautionary suggestions about trusting data too much and argues for critical assessment as an important part of any data analysis and decision process.
The TED website has a large number of data related talks. Feel free to explore as much as you like. If you find especially good ones, post them on the class discussion board.
It is by now very common that individuals post data analyses online in a reproducible manner by providing all the code and data to recreate it.
An interesting initiative started by Thomas Mock is TidyTuesday, where every Tuesday a new dataset is released and individuals are encouraged to analyze it. Later in this course, you will be participating in some TidyTuesdays yourself. Tidy Tuesday has by now been running for many years.
Individuals who participate in TidyTuesday often post their projects online and link to it on social media. We will look at an example by David Robinson. David does screencast recordings showing him analyze the data.
You don’t need to watch all of the almost 2 hour long recording 😄, but do watch the first few minutes to see him coding/analyzing in real-time. You might just get hooked and keep watching. You might not be familiar with some or most of the R coding, but you will still get the overall idea of how he does the analysis. And his thinking out loud is informative as well. See for yourself:
If you like the ability to look over his shoulder while he does data analysis, he has a bunch more of those screencast recordings.
There are others who live-stream data analysis, including competitions, on platforms such as Twitch. Check out Jesse Mostipak, aka Kiersi or Nick Wan if interested.
Dave does his work reproducibly and shares the
code as R Markdown files. For specific files, see the links provided
in the video description. For convenience, here
is the link to download the code for his analysis. Just save the
file after it opens in your browser. It should be named
2021_05_18_salary_survey.Rmd
. Click on it, and it should
open in RStudio.
Once the file is open in RStudio, you might see at the top of the
file a message suggesting that you need to install several packages to
run the file. Do so. If you don’t get that auto-suggestion, you need to
install the packages by hand. Every package that is called with the
library()
command needs to be installed first.
This is a good example of an almost works reproducible
example. It’s quite rare that things work completely without adjustments
on different machines, and how to do that best is a huge topic in data
analysis (e.g., for drug companies that need full reproducibility for
licensing). In this example, I had to make the following adjustments: *
Change output: html_output
to
output: html_document
in the YAML section. * Install the
ranger
package, which is not listed under the
library
statements but is needed.
With those changes, I was able to run the whole script by hitting the
Knit
button in RStudio. If you are still missing a package,
you will get an error message. If things work out, the whole script will
run a while (depending on the speed of your computer), and you should
get a document/report which reproduces the complete analysis he did in
the video! If things don’t work out, post your error message to Slack so
someone can help troubleshoot. One potential problem could arise from
the fact that David uses parallel computing to make things run faster.
If you get an error message, you can try commenting out or deleting the
line of code that says
doParallel::registerDoParallel(cores = 4)
and see if it
works without it. It worked ok with that line of code on my windows
machine, but might not always work. (If on Windows, you might be asked
about firewall network permissions. That’s a quirk of how parallel
computing works on a single machine. Just say “yes” to both public and
private networks.)
Again, most or all of the R code might not make sense to you just yet. That’s ok. The main point is to show you a nice example of a data analysis done in a way that others (you) can reproduce with just a few clicks. As part of this class, you will be producing those kinds of reproducible R Markdown files (well, we’ll be doing Quarto files, which is the improved successor of R Markdown).
Fully open and reproducible analyses are also becoming common in
academic research, with authors publishing everything needed to
reproduce their paper and results. In my research group, we try to
adhere to this approach as much as possible. As an example, I’m sharing
a recent project/paper by Dr. Brian McKay, a former grad student in my
group. You can find the published paper here. All materials to
reproduce the full project and all results are available as
supplementary material here. Take a look at
the Usage Notes
section on the website.
Next, click on Download Dataset
, and unzip the file. You
should end up with a folder called
ProcSocB Supplemental Material
.
Inside that folder you can see subfolders containing all the data and
code to fully reproduce the analysis and manuscript. Find and click on
Virulence_Trade-off.Rproj
. This should open in RStudio. To
reproduce the manuscript, go into the Manuscript
folder,
open Manuscript.Rmd
. You’ll see some code at the top which
loads various datasets containing results. Further down, you’ll find the
text of the manuscript, combined with simple R commands that dynamically
place numbers and figures into the text. Re-create the manuscript by
clicking on the knit
button. There might be error messages
related to missing packages. If that happens, install those packages
(e.g. ‘bookdown’). If you get stuck, post your problem to the discussion
boards.
Feel free to explore further by opening various R scripts and running
them. The Usage Notes
section on the website where you
downloaded the zip file, or the Supplementary-Material.docx
document explain what the different pieces do and how to use them.
As you can tell, this is more complex and complete compared to the single R Markdown file example above. The details on how you structure your project and folders/files are up to you, but they should provide a way to organize things so it’s easy for you and others to understand what is going on where.
This example hopefully gives you an idea how nice and easy it is to write a whole paper with just a few clicks of a button. It is a combination of R scripts and R Markdown files (in this case a version called bookdown), which allow a fully automated generation of a complete data analysis and manuscript, including all figures, tables, references, etc. This is an incredibly efficient workflow and one you’ll learn and use in this class. For instance if there was a mistake somewhere in the data, instead of going through the manuscript and manually changing numbers, one could just re-run the whole code, and it will produce new results based on the updated data in a fully automated manner. It’s a bit more work upfront until you get the hang of it, but once you have it set up, it can save an enormous amount of time and avoid potential copy & paste mistakes.