Data formats in R
Overview
For this unit, we will briefly discuss different types of variables and data formats in R.
Learning Objectives
- Be familiar with the main data formats in R
- Know advantages and disadvantages of different R data/variable formats
Introduction
To efficiently work with data in R, you need to understand how different types of data are represented in and handled by R.
The following is a summary of the most important data types in R. Some are specific to R, but if you know other languages you will find that a lot of the formats are either the same or very similar. Thus, learning these format types is useful, no matter what language you are using to write your code.
Basic data types
Characters/strings
A string is a collection of characters, and you will often hear the labels “character” and “string” used interchangeably. Everything can be encoded as a character string. Unfortunately, you cannot do a lot of analysis with strings. For instance, if you have the numbers 2 and 3, you can add and subtract them. But if you code those numbers as the characters “2” and “3”, you can’t do arithmetic with them. Thus, transforming characters into other, more useful types (if applicable) is a common task. It is common that you read in some data and there is a variable which should be numeric, but some entries are not (e.g., the original spreadsheet shows something like “<10”). In this case, R reads the entire variable as character. You then have to go in, clean the “<10” values, and convert the variable to numeric. Sometimes you do want to work with strings directly. There are many tools and packages in R that are helpful, including base R commands. The stringr package is particularly useful.
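The cleaning step described above might look roughly like the following sketch. The vector raw and the rule of treating “<10” as missing are assumptions made for illustration; what to do with such entries depends on your data.

```r
# Hypothetical column as it might arrive from a spreadsheet
raw <- c("12", "<10", "15", "8")

# One non-numeric entry makes the whole vector character
class(raw)

# Assumed cleaning rule: treat "<10" as a missing value
cleaned <- raw
cleaned[cleaned == "<10"] <- NA

# Convert the remaining entries to numeric
values <- as.numeric(cleaned)
class(values)

# Numeric operations now work (ignoring the missing entry)
mean(values, na.rm = TRUE)
```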
If you do any kind of analysis of text, you’ll be dealing with strings/characters. But even if you are not planning to analyze text, it is quite likely that you will need to work with strings at some point during a data analysis, even if it is only to find specific values, clean up variable names, etc. Thus, learning more about this topic is a good idea. A very powerful, and also very confusing, way to deal with strings is to use what are called regular expressions (or regex for short). This concept applies to any programming language, not just R, so being at least somewhat familiar with it is useful. The rebus package is particularly useful for building regular expressions without knowing the highly specific syntax. However, regex syntax itself carries over to many different languages, so learning it gives you an incredibly powerful tool in your data science repertoire.
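As a small taste of regular expressions, here is a sketch that cleans up variable names using only base R functions (grepl(), gsub(), sub()). The names in var_names are made up for illustration, and the cleaning rules are just one possible choice.

```r
# Hypothetical messy variable names
var_names <- c("Age (years)", "body weight", "Height_cm", "ID#")

# grepl() checks each entry for a match; "\\s" matches any whitespace character
has_space <- grepl("\\s", var_names)

# gsub() replaces every match; "[^A-Za-z0-9]+" matches one or more
# characters that are neither letters nor digits
clean_names <- tolower(gsub("[^A-Za-z0-9]+", "_", var_names))

# sub() replaces the first match; "_$" matches an underscore at the end
clean_names <- sub("_$", "", clean_names)

clean_names
```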
If you have no experience manipulating strings, I suggest you work through the Strings chapter (14) of R4DS, and do the exercises. The string processing chapter (25) of IDS contains further good material that is worth working through. Another good source is the Character Vectors chapter in the STAT 545 book by Jenny Bryan. Take a look at those various sources, decide which one is right for your level and go through some of them. And/or consult them as needed.
Factors
Factor is the name that R gives to categorical variables. Factors can be ordered (ordinal) or not. You need to make sure variables that should be coded as factors are, and that those that shouldn’t be aren’t. For instance, you might have a variable with entries of 0, 1, and 2. Those could be numeric values, e.g., the number of siblings a person has. Or they could be a factor coding for 3 types of ethnicity (unordered), or 3 levels of socioeconomic status (ordered). You need to make sure the variable is coded as factor or numeric, based on what you know about it. An excellent package to work with factors is the forcats package.
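To make the 0/1/2 example concrete, the sketch below codes the same values three ways using base R. The labels ("A", "B", "C" and "low", "medium", "high") are placeholders chosen for illustration.

```r
codes <- c(0, 2, 1, 1, 0)

# Interpretation 1: a count (e.g., number of siblings), kept numeric
siblings <- codes
mean(siblings)

# Interpretation 2: an unordered categorical variable
group <- factor(codes, levels = c(0, 1, 2), labels = c("A", "B", "C"))
levels(group)

# Interpretation 3: an ordered categorical variable
ses <- factor(codes, levels = c(0, 1, 2),
              labels = c("low", "medium", "high"), ordered = TRUE)
table(ses)

# Comparisons are meaningful only for ordered factors
ses[1] < ses[2]
```

Taking the mean makes sense only for the numeric version, while comparisons such as ses[1] < ses[2] make sense only for the ordered factor.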
To learn some more about factors, you might want to go through the Factors chapter of R4DS, and do the exercises. Factors are the most complicated basic data format, so it is worth spending extra time making sure you understand a bit how factors work.
Logicals
You can think of a logical variable as a type of categorical variable with 2 categories, TRUE and FALSE. R also converts between logicals and numbers: when a number is used as a logical, 0 is interpreted as FALSE and 1 (in fact, any nonzero number) as TRUE, and when a logical is used as a number, FALSE becomes 0 and TRUE becomes 1. R will sometimes treat logicals as numbers, and sometimes as categorical variables, so it is often more straightforward to code these variables explicitly as either numeric or categorical to be safe. You will use logical values often when checking your data, e.g., if you want to see if your variable x is greater than 5, then the R command x > 5 will return either TRUE or FALSE, based on the value of x.
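A brief sketch of such a check, using a made-up vector x. It also shows how R treats logicals as numbers when you sum or average them.

```r
x <- c(2, 7, 5, 10)

# The comparison is applied to each element
above_5 <- x > 5
above_5
class(above_5)

# Used as numbers, TRUE counts as 1 and FALSE as 0
sum(above_5)    # how many values are above 5
mean(above_5)   # fraction of values above 5

# Logical vectors can also be used to select elements
x[above_5]
```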
Numeric values
Numeric values can be integers or general numeric values (double). You generally do not need to care too much how exactly your numeric values are coded. Often, you can treat integers as general numeric values. (In R, a general numeric variable is called double; there is no separate single-precision type.) You might rarely come across a case where some analytic method or other bit of code requires integers to be specified as such. In R, you can use the as.integer() function to convert general numeric values to integers. You don’t really need any other special packages in R to deal with numeric values. Note that when you type an integer value, e.g. x <- 2, into R, this is considered numeric (double) by default. If you want to make sure it is treated as integer, add an L, e.g. x <- 2L.
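The difference between the two kinds of numeric values can be checked directly, as in this short sketch:

```r
x <- 2    # stored as double by default
y <- 2L   # the L suffix requests an integer

class(x)
typeof(x)
class(y)

# The two still compare as equal in value
x == y

# as.integer() drops the fractional part (it does not round)
as.integer(3.9)
```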
One issue to be careful about is comparisons. For instance, if x and y each contain some number and you expect them to be the same, but there was some rounding going on, it could be that x == y gives you FALSE because you had x = 2 and y = 1.99999997, the latter not being exactly 2 because of rounding. If this is the case, the safer approach is to check abs(x - y) < 1e-10, with the tolerance replaced by whatever small number suits the size of the rounding error; alternatively, you can use the shortcut all.equal(x, y). Note that all.equal() returns TRUE or a description of the difference rather than FALSE, so inside an if statement wrap it as isTRUE(all.equal(x, y)). Basically, you are checking whether x and y are the same within some rounding error.
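A classic illustration of this rounding issue uses 0.1 + 0.2, which is not stored as exactly 0.3. The tolerance of 1e-10 below is an arbitrary choice for this sketch.

```r
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison fails because of floating-point rounding
x == y

# Comparison within a small tolerance succeeds
abs(x - y) < 1e-10

# all.equal() uses its own default tolerance; isTRUE() makes the
# result safe to use in an if statement
isTRUE(all.equal(x, y))
```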
Dates and times
While dates and times are a type of continuous numeric variable, you should assign a date class explicitly in R, which allows you to do more with them. Dates can be tricky to work with in base R (which stores calendar dates as Date objects and date-times as POSIX variables, i.e., POSIXct or POSIXlt). The lubridate package is a good package to work with dates, and is more user friendly. Others exist.
To learn some more about dates and times in R, check out the Dates and times chapter of R4DS as well as the Parsing Dates and Times chapter of IDS.
The nice thing about having explicit date formats is that you can do things like subtract one date from another to get the days between the two dates, and R will automatically account for things like leap years.
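Here is a minimal sketch of date arithmetic with base R's as.Date(). The specific dates are chosen only to show the leap-year handling.

```r
# February 2024 is in a leap year
start <- as.Date("2024-02-01")
end <- as.Date("2024-03-01")
class(start)

# Subtracting dates gives a time difference in days
days_2024 <- as.numeric(end - start)
days_2024

# The same calendar span in a non-leap year is one day shorter
days_2023 <- as.numeric(as.Date("2023-03-01") - as.Date("2023-02-01"))
days_2023
```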
Excel is notoriously bad at storing dates. It has some internal format that’s different than what you see in the spreadsheet. That means that at times, if you read an Excel file into R, the date is messed up. In my experience, the most robust option is to explicitly format the date (and really every column) in Excel as text. Then you can read this into R and re-code as date/factor/etc.
Special data types
R has several symbols that represent a special type of object. Here’s a quick summary.
TRUE and FALSE are words that are interpreted as logical values. The abbreviations T and F are also treated as logical values, but unlike TRUE and FALSE they are ordinary variables that can be overwritten (so you should avoid reusing those names, and writing out TRUE and FALSE is safer). R also allows you to use 0 and 1 as logical (1=TRUE, 0=FALSE) but that can be confusing.
NA (not available) is R’s default way of encoding a missing value. If something is missing, you can’t do useful things with it. For instance, if you have the vector x1 <- c(1,3,NA) and try to do mean(x1), you get NA again. For some functions, there is a setting to ignore missing values, e.g. mean(x1, na.rm=TRUE) removes the NA, then takes the mean. You can check if you have NA with is.na(x1). You’ll likely encounter NA frequently in your data analysis, and you need to be careful when you try to perform operations in the presence of NA.
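The NA behavior described above can be tried out directly:

```r
x1 <- c(1, 3, NA)

# Any missing value makes the mean missing
mean(x1)

# na.rm = TRUE drops missing values before computing
mean(x1, na.rm = TRUE)

# Locate and count missing values
is.na(x1)
sum(is.na(x1))
```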
Inf and -Inf can also show up when you do numerical operations, for instance when you divide a nonzero number by zero. If you perform further mathematical operations with infinity, you almost always end up with infinity again. Generally, if you end up with one of those, it’s an indication that you did some mathematical operation that you probably shouldn’t be doing (commonly dividing by zero, or taking the log of zero).
NaN (Not a Number) happens when you do some mathematical operation that can’t be resolved to any number or infinity. For instance, if you do 0/0 you’ll get NaN. Like Inf, NaN will generally mess up any further computations, so if you encounter it you should figure out where it came from and fix it.
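A few operations that produce these special values are sketched below. The call to suppressWarnings() is only there to keep the output tidy, since R warns when it produces NaN from log().

```r
# Dividing a nonzero number by zero gives infinity
1 / 0
-1 / 0

# The log of zero is negative infinity
log(0)

# Further arithmetic with infinity usually stays infinite
Inf + 1

# Operations with no meaningful result give NaN
0 / 0
Inf - Inf
suppressWarnings(log(-1))
```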
NULL indicates something that exists but is empty. For instance, if you want to set up a new list object to hold some future results, but want to start it empty, you can do x <- list() (note that x <- list(NULL) instead creates a list of length 1 whose single element is NULL). You can then later add elements to that list. You’ll likely not need to care too much about NULL initially, but if you start writing more code it is often a useful object.
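A sketch of starting with an empty list and filling it later. The element names fit1 and fit2 and the stored values are placeholders.

```r
# Start with an empty list
results <- list()
length(results)

# Add elements later, by name
results[["fit1"]] <- 0.42
results[["fit2"]] <- c(1, 2, 3)
length(results)

# Asking for an element that does not exist returns NULL
is.null(results[["fit3"]])

# NULL itself has length 0, but a list containing NULL has length 1
length(NULL)
length(list(NULL))
```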
For most of these special objects, there are functions is() and is.*(), where * is a specific data type, that let you check for them. For instance, is.nan(x) will tell you, element by element, which entries of x are NaN, and any(is.nan(x)) will tell you if any element of x is NaN.
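The following sketch applies several of these checking functions to one made-up vector. One detail worth noticing is that is.na() is also TRUE for NaN, while is.nan() is TRUE only for NaN.

```r
x <- c(1, NA, NaN, Inf)

# Element-by-element checks
is.na(x)        # TRUE for both NA and NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE only for Inf and -Inf

# Collapse to a single answer with any()
any(is.nan(x))

# NULL has its own check
is.null(NULL)
```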
Compound data structures in R
The basic data types in R are usually combined into larger objects. The following briefly describes the most important ones of these data structures.
Vectors
Vectors are a simple collection of elements in a single row or column. In R, the easiest way to create vectors is with the concatenate command, c(). An example is x1 <- c(3,12,5). A single vector can contain only one element type (e.g., all characters or all numeric). If you try to mix and match, R converts everything to the most general type present, so mixing numbers with characters makes everything a character. Type x1 <- c(3,12,5) and x2 <- c(6, 5, 'h') into R and apply the class() command to both x1 and x2 and note the difference. It is worth mentioning that vectors can be of length 1. (Unlike some other languages, there is no separate concept of a scalar in R; a scalar is just a vector of length one.) For instance, if you define x3 <- 1 and then check it with is(x3), you’ll see that it is classified as a numeric vector, and length(x3) shows that its length is 1.
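The vector examples from the text, written out as a short script. The extra line mixing TRUE with a number is an addition to show that conversion does not always end in character.

```r
x1 <- c(3, 12, 5)
x2 <- c(6, 5, "h")

class(x1)
class(x2)

# The numbers in x2 have been converted to strings
x2

# Mixing logical and numeric values gives a numeric vector
class(c(TRUE, 2))

# A single number is a vector of length 1
x3 <- 1
is.vector(x3)
length(x3)
```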
Matrices
A matrix is a collection of elements in rows and columns. A matrix can contain only one element type. You can think of a matrix as a collection of horizontal vectors stacked on top of each other or vertical vectors next to each other.
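Both ways of thinking about a matrix can be sketched with base R's matrix(), rbind(), and cbind(). The numbers are arbitrary.

```r
# matrix() fills column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)

# Element in row 2, column 3
m[2, 3]

# The same values built by stacking row vectors
by_rows <- rbind(c(1, 3, 5), c(2, 4, 6))

# ... or by placing column vectors side by side
by_cols <- cbind(c(1, 2), c(3, 4), c(5, 6))

all(m == by_rows)
all(m == by_cols)
```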
Data frames
A data frame has the same shape as a matrix, i.e., it is a collection of elements in rows and columns. The critical difference is that different columns of a data frame can contain elements of different types (within a single column, all elements have the same type). This makes it ideal for storing data, with each row an observation and each column a variable, and different columns potentially with different data types. E.g., column 1 could be age and numeric, and column 2 could be gender and categorical, etc. Data frames also have a name for each column. As the name might imply, a data frame is a very common way to store data. It is worth mentioning a version of a data frame called a tibble that is often used in the tidyverse. It is similar to, but not quite the same as, the basic data frame. Most of the time, you can use one format or the other. You can read more about tibbles on the tibble package website.
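A small base R data frame along the lines of the age/gender example might look like this. The values and the extra smoker column are invented for illustration.

```r
# Each column is a vector of one type; rows are observations
df <- data.frame(
  age = c(34, 51, 28),
  gender = factor(c("female", "male", "female")),
  smoker = c(FALSE, TRUE, FALSE)
)

# Overview of the structure and the type of each column
str(df)
col_classes <- sapply(df, class)
col_classes

# Columns are accessed by name
df$age
nrow(df)
ncol(df)
```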
Lists
Lists are the most flexible data structure in R. You can combine elements of different types, as in data frames. Further, each element can be of varying length. For instance, you could have the first list element contain a person’s name, the second list element their age, the third their address. You can even have other data structures inside lists, for instance, you could have a data frame as a list element containing the names and ages of the person’s parents. Lists are very flexible, and if you get deeper into data analysis, you’ll be working with them. The downside is that because they are more flexible, they can also be a bit more confusing to work with. With enough practice, you’ll figure it out. Also note that almost every function in R that returns something a bit more complicated to you (e.g., the result from a linear fit) returns it as a list.
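The person example can be sketched as follows, with made-up names and ages. The second part fits a linear model to the cars dataset that ships with R, simply to show that the returned object is a list.

```r
# A list mixing a string, a number, and a data frame
person <- list(
  name = "Alex",
  age = 42,
  parents = data.frame(name = c("Kim", "Sam"), age = c(70, 72))
)

# Access elements by name
person$name
person[["parents"]]$age

# The result of a linear fit is also a list
fit <- lm(dist ~ speed, data = cars)
is.list(fit)
names(fit)
fit$coefficients
```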
Other data structures
The ones described above are the most important ones, but there are many others. Often, specific R packages come with their own structures. The package documentation usually tells you how to use them, and they are generally variations of the structures just described.
Summary
Every programming language has certain types of “containers” in which you can store data. They might have different names in different languages, but they often look the same. For data analysis in R, the most important types are data frames (and their cousin, the tibble) and lists. And of course all the basic data types, which are generally the building blocks of your data frames and lists.
Further resources
For some more information, you can check out this video by Jeff Leek where he talks about the types of data and structures I described above. He also shows some R/coding examples and discusses the important concept of missing values and NA.
To learn more, check for instance the R Objects Chapter of Hands-On Programming with R or chapter 2 of IDS.