For this unit, we will discuss different types of data and how data type influences the analysis approach.
Broadly speaking, we can define data as anything that (potentially) contains information. Data can be images, sound, video, text, or a combination of any of these. You most likely encounter data in spreadsheets, with observations as rows and variables as columns. However, data is getting much more varied and complex. Data from fitness devices such as Fitbits, Tweets, Facebook posts, purchasing behavior, movement, etc. are all streams of data that can contain useful information.
The kind of data determines the amount of processing that needs to be done before analysis. Somehow you need to turn your data into something that you can analyze. While analysis of images, video, and text is undoubtedly interesting, it is not the focus of this course. But you are still welcome to use such sources of data for your project.
In this course, we focus on the data source that you are most likely to encounter in your analyses. And that data source is quite likely the “(messy) spreadsheet” type, containing bits of information collected on individuals. You are of course welcome to play around with other data types during this course, e.g. for your course project.
If you want to hear someone else’s definition and explanation of what data is, you can watch this video by Jeff Leek.
We usually refer to pieces of data/information (e.g., gender and age) as variables. Different types of variables exist, and depending on the type, the analysis will be different. The main categories are:
Quantitative: This data type, also called interval data, generally allows one to do certain mathematical operations, e.g., subtraction or addition. Different subcategories exist:
Qualitative: Broadly speaking, qualitative data are those that do not allow one to perform any mathematical operations such as subtraction or addition. Qualitative data which has no intrinsic order is also caled nominal (scale) data. Types of such data are:
Ordinal: This is usually considered a type of categorical variable, but it is worth thinking about it as something on its own. Ordinal data fall in between being strictly quantitative or strictly qualitative. For instance, if a question asks a person to rank their level of a pain on a scale from 1-10, a 7 is clearly higher than a 6, and a 6 higher than a 5. But it’s unclear if the difference between 5 and 6 is the same as 6 and 7. Thus it is not clear if one can do operations like subtraction (to get a difference of 1 in each case). Another example is level of education, which a survey might collect in categories of ‘no high school’, ‘high school’, ‘some college’, ‘college degree’, ‘graduate degree’. We could code that with numbers 1-5, and in some sense these items are ordered, but it’s unclear if one is justified in considering the difference between ‘high school (2)’ and ‘some college (3)’ the same as ‘some college (3)’ and ‘college degree (4)’.
The type of variables will influence the analysis approach. That’s especially true for the outcomes of interest, less so for the independent, predictor variables.
Methods applied to quantitative outcomes are usually referred to as regression approaches, with different variants depending on the subtype (e.g., linear regression for continuous, Poisson regression for discrete). Methods applied to categorical outcomes are usually referred to as classification approaches.1 If you have an ordinal outcome, you can use ordinal regression. Alternatively, you can treat the outcome as unordered categorical or as continuous (depending on how you code them, i.e., in R as a factor or numeric). There are no rules as to when it is ok to treat an ordinal variable as fully quantitative. It is often done but needs to be justified. You can always treat it as categorical, but then you lose some information, namely the ordering.
In the machine learning literature, supervised learning refers to cases when we have a specific outcome of interest. This kind of data is most common. For data where there is no clear outcome, analysis methods are usually referred to as clustering approaches and are also called unsupervised learning methods.
We will discuss and apply some of those methods in more detail when we begin our discussion of analysis methods.
To efficiently work with data in R, you need to understand how the types of data described above are represented in and handled by R.
The following is a summary of the most important data types in R. I’m also listing useful packages to deal with them. You have already seen some of the information I describe below. Now would be a good time to revisit the Types section of the RStudio programming basic primer and to revisit chapter 3 of IDS.
Characters/strings: A string is a collection of
characters. You will often hear the labels “character” and “string” used
interchangeably, though strictly speaking, a string is a collection of
characters. Everything can be encoded as a character string.
Unfortunately, you cannot do a lot of analysis with strings. For
instance, if you have the numbers 2 and 3, you can add and subtract. But
if you code those numbers as characters “2” and “3”, you can’t do much
with them. Thus, transforming characters into other, more useful
categories (if applicable) is a common task. It is common that you read
in some data and there is a variable which should be numeric, but some
entries are not (e.g., the original spreadsheet shows something like
“<10”). In this case, R reads all of these variables as characters.
You then have to go in, clean the “<10” value, and convert the rest
to numeric. Sometimes you do want to work with strings directly. There
are many tools and packages in R that are helpful, including base R
command. The stringr
package
is particularly useful.
It is quite likely that you will need to work with strings at some point during a data analysis, even if it is only to find specific values, clean up variable names, etc. Thus, learning more about this topic is a good idea. A very powerful, and also very confusing way to deal with strings is to use what are called regular expressions (or regex for short). This concept applies to any programming language, not just R. Being at least somewhat familiar with the concept of regular expressions is useful.
If you have no experience manipulating strings, I suggest you work through the Strings chapter (14) of R4DS, and do the exercises. The string processing chapter (25) of IDS contains further good material that is worth working through. Another good source is the Character Vectors chapter in the STAT 545 book by Jenny Bryan. Take a look at those various sources, decide which one is right for your level and go through some of them. And/or consult them as needed.
Factors: That’s what R calls categorical variables.
They can be ordered/ordinal or not. You need to make sure variables that
should be coded as a factor are, and that those that shouldn’t be
aren’t. For instance, you might have a variable with entries of 0, 1,
and 2. Those could be numeric values, e.g., the number of siblings a
person has. Or it could be a factor coding for 3 types of ethnicity
(unordered), or 3 levels of socioeconomic status (ordered). You need to
make sure it is coded as factor or numeric, based on what you know about
the variable. An excellent package to work with factors is the forcats
package.
To learn some more about factors, you might want to go through the Factors chapter of R4DS, and do the exercises.
Logical: You can think of a logical variable as a
type of categorical variable with 2 categories, TRUE and FALSE.
Alternatively, in R, 0 is interpreted as FALSE and 1 as TRUE (and vice
versa). You will use those logical values often when checking your data,
e.g., if you want to see if your variable x
is greater than
5, then the R command x > 5
will return either TRUE or
FALSE, based on the value of x
.
Numeric (double/integer): Numeric values that are
either integers or any other numeric value (double). You generally do
not need to care too much how exactly your numeric values are coded.
Often, you can treat integers as general numeric value. (In R, a general
numeric variable is called double
.) You might rarely come
across a case where some analytic method or other bits of code requires
integers to be specified as such. In R, you can use the
as.integer()
function to convert general numeric values to
integers. You don’t really need any other special packages in R to deal
with numeric values. Note that when you type an integer value,
e.g. x <- 2
, into R, this is considered numeric by
default. If you want to make sure it is treated as integer, add an
L
, e.g. x <- 2L
.
Date/time: While dates are a type of continuous
numeric variable, you should assign the date class explicitly in R,
which allows you to do more with them. Dates are quite difficult to work
with in base R (which usually calls them POSIX
variables).
The lubridate
package is a good package to work with dates, and is more user
friendly. Others exist.
To learn some more about dates and times in R, check out the Dates and times chapter of R4DS as well as the Parsing Dates and Times chapter of IDS.
The basic data types in R are usually combined into larger objects. The main ones in R are described in the following.
Vectors: vectors are a simple collection of elements
in a single row or column. In R, the easiest way to create vectors is
with the concatenate
command, c()
. An example
is x1 <- c(3,12,5)
. A single vector can contain only one
element type (e.g., all characters or all numeric). If you try to mix
and match, everything ends up as a character. Type the command for
x1
and x2 <- c(6, 5, 'h')
into R and apply
the class()
command to both x1
and
x2
and note the difference.
Matrices: A matrix is a collection of elements in
rows and columns. A matrix can contain only one element type. You can
think of a matrix as a collection of horizontal vectors
stacked on top of each other or vertical vectors
next to
each other.
Data frames: A data frame has the same shape as a matrix, i.e., it is a collection of elements in rows and columns. The critical difference is that each column of a data frame can contain elements of different types. This makes it ideal for storing data, with each row and observation and each variable in a column, and different columns potentially with different data types. E.g., column 1 could be age and numeric, and column 2 could be gender and be categorical, etc.
A list: Lists are the most flexible data types in R. You can combine different elements as in data frames. Further, each element can be of varying length. For instance, you could have the first list element contain a person’s name, the second list element their age, the third their address. You can even have other elements inside lists, for instance, you could have a data frame as a list element containing the names and ages of the person’s parents. Lists are very flexible, and if you get deeper into data analysis, you’ll be working with them. The downside is that because they are more flexible, they can also be a bit more confusing to work with. With enough practice, you’ll figure it out. Also note that almost every function in R that returns something a bit more complicated to you (e.g., the result from a linear fit), returns it as a list.
Other types of data structures exist; they are often introduced by
specific R packages. An important one to know is the tibble
which is a type of data frame used in the tidyverse
. It is
similar, but not exactly like a data frame. You can read more about
tibbles
on its
package website and in R4DS chapter 10.
For some more information, you can check out this video by Jeff Leek where he talks about the types
of data and structures I described above. He also shows some R/coding
examples and discusses the important concept of missing values and
NA
.
We don’t have the time to cover analysis of all the variety of data that exists. As stated before, our focus here was on the analysis of what is sometimes called rectangular data. We have observations (individual units of data) generally as rows, and variables (outcome and predictors) as columns. This is still the most common type of data, especially in public health and more generally the life sciences. However, more complex types of data are rapidly increasing in frequency and importance. If the data has a different structure, it will usually require different, specialized analysis approaches. The good news is that most of what we cover in this course still applies. The difference is often in the processing of the data, and the exact way data and models are combined.
At the end of the course, there is a page that briefly goes into other types of data (time-series, text, and some others) and provides very brief pointers for further resources, should you be interested in data of that type.
Logistic regression, which you might be familiar, is used for classification. However, the underlying model predicts a quantitative outcome (a value between 0 and 1 usually interpreted as a probability), which is then binned to make categorical predictions.↩︎