For this unit, we will discuss where to find data and how to get different types of data into R.
An obvious way to get data to answer whatever question you have is to collect it yourself. This can lead to high-quality data tailored to your question, but it is also likely expensive and time-consuming. If you do decide to collect data, sketch out your planned analyses as carefully as possible beforehand. It would be annoying to discover later that you forgot to obtain a piece of information that turned out to be crucial. The gold standard of data collection in human research is the clinical trial seeking FDA approval. Such trials have to pre-register the question(s), the analysis plan, and the data collection plan, which means everything is run at a high level of rigor and quality. Even if you are not seeking FDA approval for your study, precisely specifying the analyses you plan to run is best practice and will minimize p-hacking and other sources of bias (if you are unfamiliar with the term p-hacking, see e.g., the Pitfalls section on the General Resources page). Unfortunately, pre-registration is still very uncommon in most areas of research, and often not feasible.
If you cannot or do not want to collect data yourself, you could instead team up with someone who does. The advantage of working directly with a data collector is that you have a subject matter expert as a collaborator, and you can ask them questions.
Data that is publicly available, or that you can obtain after requesting it and being approved by the data owner, constitutes the largest pool of data these days. Going this route gives you access to many different datasets. The drawbacks are that the data was not collected to answer your specific question, there is usually nobody you can ask for clarification, and the quality of such data varies. As more and more data is collected on almost every aspect of our world, these sources keep increasing, and it is hard to keep track of places to get (good) data. On a website I maintain with links to various resources, I have been collecting a list of data sources; there are likely tons of other good ones. The tricky bit is sometimes getting the data, and understanding enough about what it is and how it was collected to allow for a reasonable analysis.
You might have already noticed, or will soon notice, that there are datasets that come with R, and even more with R packages. For instance, this page lists what is likely only a small fraction. There is even a Reddit group dedicated to R datasets. The good and the bad about datasets that come with R packages is that they are often fairly clean/tidy. That is unfortunately not what most "real world" datasets look like. Getting dirty and messy datasets and wrangling them into a form that is suitable for statistical analysis is part of most workflows and often takes a lot of time. If you are lucky, you might get or find a dataset that is already fairly clean and use it to answer an actual question of interest. Most of the time, however, this does not happen and you need to spend a good bit of time getting your data into the right shape.
No matter the source, try to get the data as raw as possible so that you are in control of as many cleaning and processing steps as possible. That is (a bit) more work for you, but it gives you more flexibility to decide what to do with the data as you process it. If, for instance, you get data with age in years, you can leave it as is or decide to categorize it as young/old. (Categorizing is generally a bad idea, but we'll talk about that later.) If you get data in an already processed form, where someone has done this categorization for you, you no longer have that choice.
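To illustrate why raw data gives you more flexibility: from ages in years you can always derive categories later, but you cannot go the other way. A minimal base R sketch (the ages and the cutoff of 50 are made-up examples, not recommendations):

```r
# Raw ages in years (made-up example data)
age <- c(23, 47, 65, 31, 72, 55)

# Deriving a young/old category from the raw values is always possible...
age_cat <- cut(age, breaks = c(0, 50, Inf), labels = c("young", "old"))
table(age_cat)

# ...but if you only received the categories, the exact ages are gone for good.
```

If the data arrived already categorized, no amount of processing recovers the original years.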
Wherever you get data from, document it as much as possible. When did you get it? Where and how? Did it come with other meta-information, e.g., a codebook? Where is it? Are there other things about the data that one should know about? Write down everything in some document.
Treat the raw data as you would a very fragile object. Ideally, do not touch it. Do not edit it. You want to only read the data into R, even if it is in a bad format, and then apply fixes/cleaning in R. If this is not possible, and you need to make edits to the data in whatever format you got it (e.g., Excel, SAS), make a copy of the data files and place those copies in a separate folder, AND ONLY EDIT THOSE COPIES. Also, write down the edits you made.
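The copy-then-edit workflow can itself be scripted, which keeps it reproducible and documented. A minimal base R sketch (the folder and file names are hypothetical placeholders, using a temporary directory for the demo):

```r
# Hypothetical layout: pristine originals live in raw/, edits happen in edited/
raw_dir    <- file.path(tempdir(), "raw")
edited_dir <- file.path(tempdir(), "edited")
dir.create(raw_dir, showWarnings = FALSE)
dir.create(edited_dir, showWarnings = FALSE)

# Pretend this is a raw data file you received
writeLines(c("id,age", "1,23"), file.path(raw_dir, "study.csv"))

# Copy into the editing folder; all manual fixes happen on the copy only
file.copy(from = file.path(raw_dir, "study.csv"),
          to   = file.path(edited_dir, "study.csv"),
          overwrite = TRUE)
```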
A large portion of data is entered and stored in spreadsheets, such as Excel. If you collaborate with others who produce and enter the data, especially if it is a frequent collaboration, having a discussion with them about best practices for data entry and storage could potentially save you a lot of time. If this is a situation you find yourself in, I recommend you read, and then ask your collaborators to read and implement the excellent paper Data Organization in Spreadsheets by Broman and Woo.
Several functions in base R can be used for data import; for example, read.table() or one of its variants is enough for many projects. Various packages further expand the functionality for importing data into R. Since data come in so many types and forms, we will not be able to cover all of them. We will only look at a few examples, and I provide links to resources to get you started importing other kinds of data. If you have data of a different type, Google is your friend 😄.
Here are some of the most prominent ones:
- The readr package is good for importing CSV and similar spreadsheet-type data. For some introductory material on readr, see chapter 11 of R4DS.
- readr is not suitable for data in Excel format; for that, use the readxl package. If your data lives in Google Sheets, the googlesheets4 package can be used.
- The haven package is good for dealing with SPSS, SAS, and Stata data.

The packages listed above are all from the tidyverse. There are many more packages with additional functionality, as well as many base R commands.
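As a hedged sketch of what these package functions look like in use (each call assumes the package is installed; the CSV file is a throwaway created here so the example is self-contained, and the other file names are placeholders):

```r
# Create a small throwaway CSV so the example runs anywhere
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,15"), path)

# readr's read_csv() is the tidyverse analogue of base read.csv()
if (requireNamespace("readr", quietly = TRUE)) {
  d <- readr::read_csv(path)
} else {
  d <- read.csv(path)  # base R fallback if readr is not installed
}

# Similar one-liners exist for other formats (not run here):
# readxl::read_excel("data.xlsx")     # Excel
# haven::read_sas("data.sas7bdat")    # SAS
# haven::read_sav("data.sav")         # SPSS
# haven::read_dta("data.dta")         # Stata
```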
If you are lucky, your data reads in properly on the first try. Often, that is not the case, and getting data into R in a form that lets you start using it requires a few tries.
Sometimes, the data might not load at all. You will get an error message and then have to figure out how to get the data into R. In some cases, you might not be able to get the data into R without editing it in its native format. For instance, reading proprietary formats (e.g., SAS or Excel) doesn't always work well. In those instances, it is sometimes better to export the data from those programs as a comma-separated or tab-separated values file (CSV/TSV) and then read that into R.
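Once the data is in CSV form, reading it with base R is straightforward. A minimal sketch using a throwaway file (in practice, the file would come from your export step):

```r
# Write a small CSV to stand in for a file exported from Excel/SAS
path <- tempfile(fileext = ".csv")
writeLines(c("id,treatment,outcome",
             "1,A,5.2",
             "2,B,6.1",
             "3,A,4.8"), path)

# read.csv() is read.table() with CSV-friendly defaults (sep = ",", header = TRUE)
dat <- read.csv(path)
dim(dat)  # quick check: 3 rows, 3 columns
```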
Another example is if you get an Excel spreadsheet where the person used color to code some information. You might need to fix that and recode the color in the spreadsheet before reading it in. If you need to, make edits on the copies of the raw data until it is in a form that you can load into R.
If the data loads, it might do so with or without error or warning messages. In either case, you will want to look at the data to make sure what you expected to be there is there. Check if the data has the right number of rows and columns. Do other quick checks to ensure things look good enough to start working with the data. The str command is handy for that, as is glimpse from the dplyr package. As long as everything is there, no matter how messy, you are ok and can now use R to clean up and explore the data, which we will cover next.
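A quick sketch of such sanity checks in base R (a small stand-in data frame is used here; glimpse() would need dplyr installed, so only base functions are shown):

```r
# Stand-in dataset; in practice this would be your freshly loaded data
dat <- data.frame(id = 1:3,
                  age = c(23, 47, 65),
                  group = c("A", "B", "A"))

nrow(dat)     # right number of rows?
ncol(dat)     # right number of columns?
str(dat)      # overview of column types and first values
head(dat)     # eyeball the first rows
summary(dat)  # ranges and obvious oddities (e.g., impossible ages)
```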
If you want to see a few examples of live-coding which show how to read data into R, you can check out these videos by Jeff Leek.