# Overview

For this unit, we will discuss where to find data and how to get different types of data into R.

# Objectives

• Know ways of acquiring data
• Be familiar with some important online sources for data
• Understand how to get data into R

# Data Sources

An obvious way to get data to answer whatever question you have is to collect it yourself. This can lead to high-quality data tailored to your question. It is also likely to be expensive and time-consuming. If you do decide to collect data, it is a good idea to sketch out planned analyses as carefully as possible beforehand. It would be annoying to discover later that you forgot to obtain a piece of information that turns out to be crucial. The gold standard of data collection in human research is the clinical trial that seeks FDA approval. These trials have to pre-register the question(s), the analysis plan, and the data collection plan. This approach means everything is run at a high level of rigor and quality. Even if you do not seek FDA approval for your study, precisely specifying the analyses you plan to run is best practice and will minimize p-hacking and other sources of bias (if you are unfamiliar with the term p-hacking, see, e.g., the Pitfalls section on the General Resources page). Unfortunately, pre-registration is still very uncommon in most areas of research, and often not feasible.

If you cannot or do not want to collect data yourself, you could instead team up with someone who does. The advantage of working directly with a data collector is that you have a subject matter expert as a collaborator, and you can ask them questions.

Data that is publicly available, or that you can get after requesting it and being approved by the data owner, constitutes the largest pool of data these days. Going this route gives you access to a lot of different datasets. The drawback is that the data was not collected to answer your specific question, and there is usually nobody you can ask for clarification. Also, the quality of such data varies. As more and more data is collected on almost every aspect of our world, the number of such sources keeps increasing, and it is hard to keep track of places to get (good) data. I maintain a website with links to various resources, among them a list of data sources. There are likely tons of other good data sources. The tricky bit is sometimes getting the data and understanding enough about what the data is and how it was collected to allow for reasonable analysis.

You might have already noticed, or will soon notice, that there are datasets that come with R, and even more that come with R packages. For instance, this page lists what is likely only a small fraction. There is even a Reddit group dedicated to R datasets. The good and the bad about datasets that come with R packages is that they are often fairly clean/tidy. That’s unfortunately not what most “real world” datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. If you are lucky, you might get or find a dataset that is already fairly clean and use it to answer an actual question of interest. Most of the time, however, this does not happen and you need to spend a good bit of time getting your data into the right shape.
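As a quick illustration of the datasets that ship with R, here is a minimal sketch using only base R. It lists what the bundled `datasets` package provides and then looks at `mtcars`, which I picked as an example; it is not mentioned above.

```r
# List the datasets bundled with the 'datasets' package that ships with R.
# data() returns an object whose $results matrix has one row per dataset.
available <- data(package = "datasets")$results
n_datasets <- nrow(available)
head(available[, c("Item", "Title")])

# mtcars is one of these bundled datasets (chosen here only as an example).
# It is already clean: one row per car, every column numeric.
dim(mtcars)
str(mtcars)
```

Datasets that come with other installed packages can be listed the same way by changing the `package` argument.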

# Data to-dos

No matter the source, you should try to get the data as raw as possible so that you can be in control of as many cleaning and processing steps as possible. That is (a bit) more work for you, but it gives you more flexibility to decide what to do with the data as you process it. If, for instance, you get data with age in years, you can leave it as is, or decide to categorize it as young/old. Categorizing is generally a bad idea, but we’ll talk about that later. If you get data in an already processed form and someone already did this categorization for you, you no longer have that choice.
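The age example can be sketched in a few lines of base R. The data frame, the column names, and the cutoff of 50 years are all made up for illustration. The point is only that with raw ages you can derive a category (and change your mind later), while the reverse is impossible.

```r
# Hypothetical raw data: age recorded in years
raw <- data.frame(id = 1:5, age = c(23, 41, 67, 35, 80))

# Keep the raw column and add a derived one next to it.
# The cutoff of 50 is arbitrary; right = FALSE means 50 itself counts as "old".
raw$age_group <- cut(raw$age,
                     breaks = c(-Inf, 50, Inf),
                     labels = c("young", "old"),
                     right = FALSE)
raw
```

Because `age` is still there, a different cutoff, or no categorization at all, remains an option at analysis time.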

Wherever you get data from, document it as much as possible. When did you get it? Where and how? Did it come with other meta-information, e.g., a codebook? Where is it? Are there other things about the data that one should know about? Write down everything in some document.

Treat the raw data as you would a very fragile object. Ideally, do not touch it. Do not edit it. You want to only read the data into R, even if it is in a bad format, and then apply fixes/cleaning in R. If this is not possible, and you need to make edits to the data in whatever format you got it (e.g., Excel, SAS), make a copy of the data files and place those copies in a separate folder, AND ONLY EDIT THOSE COPIES. Also, write down the edits you made.
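One way to follow this advice from within R is sketched below. The folder layout (`data/raw`, `data/edited`), the file name, and the contents of the notes file are my assumptions, not a prescribed structure. Everything is created in a temporary directory so the example can be run safely.

```r
# Hypothetical project layout, created in a temporary directory for illustration
project  <- file.path(tempdir(), "my_project")
raw_dir  <- file.path(project, "data", "raw")
edit_dir <- file.path(project, "data", "edited")
dir.create(raw_dir,  recursive = TRUE, showWarnings = FALSE)
dir.create(edit_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw file you received
raw_file <- file.path(raw_dir, "survey.csv")
writeLines(c("id,age", "1,23", "2,41"), raw_file)

# Copy the raw file into the editing folder; overwrite = FALSE protects
# an existing edited copy from being replaced by accident
file.copy(raw_file, edit_dir, overwrite = FALSE)

# Write down where the data came from and what has been done to it
log_file <- file.path(project, "data", "README.txt")
writeLines(c(
  paste("File:", basename(raw_file)),
  paste("Obtained:", format(Sys.Date())),
  "Source: (fill in where and how you got it)",
  "Edits: none yet; edits go only into data/edited/"
), log_file)
```

Any manual edits would then be made to the file in `data/edited` only, with a line added to the notes file for each change.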

A large portion of data is entered and stored in spreadsheets, such as Excel. If you collaborate with others who produce and enter the data, especially if it is a frequent collaboration, having a discussion with them about best practices for data entry and storage could save you a lot of time. If this is a situation you find yourself in, I recommend that you read the excellent paper Data Organization in Spreadsheets by Broman and Woo, and then ask your collaborators to read it and implement its recommendations.

# Data import in R

Several functions in base R can be used for data import. (For example, read.table() and its variants are enough for many projects.) Some packages further expand the functionality for importing data into R. Since data come in so many types and forms, we will not be able to cover all of them. We will only look at a few examples. I am providing links to resources to get you started importing different kinds of data. If you have data of a different type, Google is your friend 😄.
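As a minimal sketch of base R import, the following writes a small made-up CSV file to a temporary location and reads it back with `read.csv()`, one of the `read.table()` variants. The file contents and column names are invented for the example.

```r
# Write a small stand-in file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,2.5", "2,b,NA", "3,a,4.0"), csv_path)

# read.csv() is read.table() with defaults suited to comma-separated files
dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# For this file, the same result spelled out with read.table()
dat2 <- read.table(csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

dat
```

The string `NA` in the file is read as a missing value by default; other codes for missing data can be declared through the `na.strings` argument.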

Here are some of the most prominent ones:

The packages I listed above are all from the tidyverse. There are many more packages with additional functionality, and many base R commands.
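For a taste of tidyverse-style import, here is a hedged sketch that assumes the readr package is installed. It is my choice for illustration, and the code skips the import if readr is missing. The commented lines at the end point to readxl and haven, which offer analogous readers for Excel and SAS files; the file names there are hypothetical.

```r
# Small stand-in file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,2.5", "2,b,NA"), csv_path)

if (requireNamespace("readr", quietly = TRUE)) {
  # read_csv() returns a tibble; col_types = cols() lets readr guess the
  # column types without printing its guess
  dat <- readr::read_csv(csv_path, col_types = readr::cols())
  print(dat)
} else {
  message("readr is not installed; install.packages('readr') would add it")
}

# Analogous readers in other packages (not run; file names are hypothetical):
# readxl::read_excel("data.xlsx", sheet = 1)
# haven::read_sas("data.sas7bdat")
```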

To learn some more, you can check out the Importing Data chapter in IDS. The Web scraping chapter in IDS provides additional information on how to pull online data into R.

# Checking the data

If you are lucky, your data reads in properly on the first try. Often, that is not the case. Getting data into R in a form that you can start using often requires a few tries.

Sometimes, the data might not load at all. You will get an error message and then have to figure out what to do to get the data into R. In some cases, you might not be able to get the data into R without editing it in its native format. For instance, reading data in proprietary formats (e.g., SAS or Excel) does not always work well. In those instances, it is sometimes better to export the data from those programs as a comma-separated or tab-separated values file (CSV/TSV) and then read it into R.
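The export-then-read route might look like the sketch below. A made-up tab-separated file stands in for an export from Excel or SAS. Wrapping the import in `tryCatch()` is my addition: it turns a failed import into a readable message instead of stopping the script.

```r
# Stand-in for a file exported from Excel or SAS as tab-separated text
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tscore", "1\tAnn\t3.5", "2\tBob\t4.0"), tsv_path)

# read.delim() is the read.table() variant for tab-separated files
dat <- tryCatch(
  read.delim(tsv_path, stringsAsFactors = FALSE),
  error = function(e) {
    message("Import failed: ", conditionMessage(e))
    NULL
  }
)

# A path that does not exist shows the failure branch: the result is NULL
failed <- tryCatch(
  suppressWarnings(read.delim(file.path(tempdir(), "no_such_file.tsv"))),
  error = function(e) NULL
)
```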

Another example is an Excel spreadsheet where the person used color to code some information. You might need to fix that and recode the color-coded information as regular cell values in the spreadsheet before reading it in. If you need to, make edits on the copies of the raw data until it is in a form that you can load into R.

If the data loads, it might do so with or without error or warning messages. In either case, you will want to look at the data to make sure what you expected to be there is there. Check if the data has the right number of rows and columns. Do other quick checks to ensure things look good enough to start working with the data. The str() function is handy for that, as is glimpse() from the dplyr package. As long as everything is there, no matter how messy, you are ok and can now use R to clean up and explore the data, which we will cover next.
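A few of these quick checks are sketched below on a small made-up data frame. The expected dimensions are assumptions you would replace with what you know about your own data. `glimpse()` is only called if dplyr is installed.

```r
# Made-up data standing in for something you just imported
dat <- data.frame(id = 1:3,
                  group = c("a", "b", "a"),
                  value = c(2.5, NA, 4),
                  stringsAsFactors = FALSE)

# Compare the dimensions against what you expected (values assumed here)
expected_rows <- 3
expected_cols <- 3
stopifnot(nrow(dat) == expected_rows, ncol(dat) == expected_cols)

# Structure: column names, types, and the first few values
str(dat)

# First rows, and a count of missing values per column
head(dat)
n_missing <- colSums(is.na(dat))
n_missing

# glimpse() gives a similar overview to str() if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(dat)
}
```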

# Examples of data import

If you want to see a few examples of live-coding which show how to read data into R, you can check out these videos by Jeff Leek.