In this unit, we will discuss what data wrangling is and why it’s a very important part of any analysis.
Once you have the data loaded into your analysis software, such as R, you almost always need to process it to get it into a format suitable for statistical analysis. This messing around with the data is often called wrangling or tidying. Cleaning, munging, or processing are other terms that generally mean the same thing.
To figure out how to get your data into the right shape and what
cleaning steps you need to take, you also need to explore the data. R
provides a lot of functionality for data wrangling and exploration. We focus on a specific approach that makes use of a set of R packages collectively known as the tidyverse, which we’ll cover next. We will then use them to practice some data wrangling.
You have already done some data wrangling as part of previous exercises; here, we talk about it in some more detail.
As with any part of the data analysis process, wrangling should be automated, reproducible, and well-documented. No “fixing by hand”!
There are some types of problems that you encounter often. I briefly discuss a few important ones.
In almost any dataset, there are some missing entries. Data can be missing for different reasons. The information might not have been asked for or recorded, the person might have refused to provide it, or the variable might not be applicable (e.g., for a non-smoker, the variable “number of cigarettes per day” might be left blank). It is important to understand why data are missing and form an action plan based on that. Many analysis approaches do not allow for missing data. In this case, you need to reduce your dataset such that nothing is missing.
The easiest approach is to remove all observations with missing data (this is called “listwise deletion”). That can get problematic if you have a lot of variables and each variable has some missing values; you might be left with almost no observations that have complete data. Another approach is to exclude observations from analyses where they have missing data in the variables of interest, but include them in analyses where they have information for all variables being considered (this is called “pairwise deletion”). Both of these methods can lead to bias in different ways and should be used with careful consideration.
Another option is to remove all variables with missing data from further analysis. Unfortunately, it is common that at least some values are missing for each variable, which means you would be left with no variables. Also, if the variable is important for the question you are trying to answer, you obviously cannot remove it.
Another option is to use a combination of removing variables and observations. You could start by removing variables with missing values above some threshold, e.g., any variable that has more than 10% or 20% (or some value you pick) of its values missing. There is no rule for this, and you need to justify your choice. Then, once all variables with missing values above the threshold are removed, you remove any remaining observations that still contain missing data. This mix of removing variables and observations might preserve the largest amount of data.
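As a small illustration, here is a sketch of how such a threshold-based approach might look in R, using a made-up data frame and an arbitrary 20% cutoff (both are just assumptions for the example):

```r
library(dplyr)
library(tidyr)

# made-up example data with some missing values
mydata <- data.frame(
  age    = c(25, 31, NA, 47, 52),
  weight = c(70, NA, NA, 85, NA),
  smoker = c("no", "yes", "no", NA, "no")
)

# proportion of missing values per variable
miss_frac <- colMeans(is.na(mydata))
miss_frac

# keep only variables with at most 20% missing (the threshold is arbitrary)
keep_vars <- names(miss_frac)[miss_frac <= 0.2]
reduced <- select(mydata, all_of(keep_vars))

# then remove any remaining observations with missing values
complete_data <- drop_na(reduced)
complete_data
```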
There are two problems when removing data. One is the obvious fact that you lose data, and thus statistical power. The other problem is if the data are not missing at random. In that case, by removing those observations with missing data, you introduce bias into your dataset. Again, it is important to understand why and how data are missing so you can have an understanding of potential problems such as introducing bias.
Instead of removing rows and columns (observations and variables)
until you have no more missing entry, you can also impute the
missing values. Basically, you make an educated, data-driven guess as to
what the missing value might have been and stick that value into the
missing slot. In principle, any regression and classification method
that you can use to estimate and predict an outcome can be used by
temporarily thinking of the variable you want to impute as your outcome
and the other variables as predictors, and then predicting the missing
values. Methods such as k-nearest neighbors or random forest,
which we discuss later in the course, are useful for this.
Imputation adds uncertainty since you made guesses for the
missing values, and often, the estimated/guessed values are randomly
drawn from a distribution. Multiple imputation creates several different
imputed datasets, and you can then run your analysis on each of those
imputed datasets, hopefully with similar results for each. In R, the recipes
package, which is part of the fairly new set of tidy modeling tools
called tidymodels, and
others such as the mice package
allow for imputation. We’ll try some of those in a later unit.
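To give a flavor of what this can look like, here is a minimal sketch of k-nearest-neighbor imputation with the recipes package; the data frame and the choice of 3 neighbors are made up for illustration:

```r
library(dplyr)    # for the pipe
library(recipes)

# made-up data with missing values in some variables
mydata <- data.frame(
  age    = c(25, 31, NA, 47, 52, 38),
  weight = c(70, 81, 65, NA, 90, 74),
  height = c(170, 180, 165, 172, NA, 178)
)

# recipe that imputes every predictor using its 3 nearest neighbors
imp_rec <- recipe(~ ., data = mydata) %>%
  step_impute_knn(all_predictors(), neighbors = 3)

# estimate the imputation step and apply it to the data
imputed <- bake(prep(imp_rec), new_data = NULL)
imputed
```

The mice package follows a different workflow and produces several imputed datasets for multiple imputation; we’ll look at these tools more closely later.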
There are additional options for dealing with missing values. For instance, you could create new variables/predictors/features with fewer missing values. As an example, if you have data that record whether a person drinks beer (yes/no), wine (yes/no), or hard liquor (yes/no), and each of those variables has some missing values, maybe you can create a new variable labeled any alcohol and code it as yes/no. If a person has a yes for at least one of the 3 original variables, they would be coded as yes in the new one. If they have all no, they would be coded as no. For anyone left, you need to decide how to interpret missing values in the original variables, i.e., whether to treat them as yes or no. You could pick one, e.g., if you have additional knowledge suggesting that anyone who doesn’t have that value recorded is more likely a no. For categorical data, you could also treat missing values as their own category (this can be useful when missingness provides information about the value of the response). Or you can use imputation.
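Here is a brief sketch of how such a combined variable could be built with dplyr; the data frame is made up, and treating the leftover missing answers as “no” is just one of the possible choices discussed above:

```r
library(dplyr)

# made-up survey data with yes/no questions and some missing answers
drinks <- data.frame(
  beer   = c("yes", "no",  NA,   "no"),
  wine   = c("no",  "no",  "no", NA),
  liquor = c(NA,    "yes", "no", "no")
)

drinks <- drinks %>%
  mutate(
    any_alcohol = case_when(
      beer == "yes" | wine == "yes" | liquor == "yes" ~ "yes",
      beer == "no"  & wine == "no"  & liquor == "no"  ~ "no",
      # remaining rows contain only "no" and NA answers; here we assume "no",
      # but keeping them as NA, as their own category, or imputing are alternatives
      TRUE ~ "no"
    )
  )
drinks
```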
As you can tell, missing values can cause headaches. It is therefore very important if you design and collect data to try to do it in such a way as to minimize missing values. Of course, if you analyze data collected by someone else, there is not much you can do, and you have to decide how to deal with missing values. For that, having a good understanding of what the data mean and how they were collected is essential.
In R, missing values are coded as NA. When you read data into R and that data, for instance, codes missing values as 99, you should recode those entries to NA. NA in R is a bit tricky, since most operations on NA return NA. The tidyverse functions tend to be pretty good at dealing with NA, but for base R code you often have to be more careful. The function is.na() is often useful. Some functions, e.g., mean() and sum(), can deal with NA if you tell them what to do. Sometimes, functions just deal with NA in some built-in way. You need to check that this is what you want. Always perform careful checks when handling missing values! The tidyverse packages tidyr and dplyr have tools for dealing with missing values during the wrangling process.
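A few quick examples of these points; the small vector and data frame below are made up for illustration:

```r
library(dplyr)
library(tidyr)

x <- c(5, 3, NA, 8)
mean(x)                # returns NA
mean(x, na.rm = TRUE)  # tell mean() to ignore NA: 5.33
is.na(x)               # FALSE FALSE TRUE FALSE

# made-up data where 99 was used to code missing values
dat <- data.frame(id = 1:4, cigs_per_day = c(0, 99, 10, 99))
dat <- mutate(dat, cigs_per_day = na_if(cigs_per_day, 99))  # recode 99 to NA

drop_na(dat)                              # tidyr: remove rows with any NA
replace_na(dat, list(cigs_per_day = 0))   # tidyr: replace NA with a chosen value
```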
It is not uncommon to have values in your data that are strange. It could be that someone made a mistake at data entry. Or your codebook doesn’t properly explain the entries. Other times, it could be real data that is just an outlier. It can be hard to decide whether a value is strange but real or a mistake. Dealing with such entries is a judgment call. The best approach is to have a pre-written analysis plan that explains exactly what to do in any such case. Outside of clinical trials seeking FDA approval, that pretty much never happens. We usually have to decide what to do with strange entries when we run into them. Some good rules are to be consistent, to fully document and explain what you do, and, if you are able, to do it both ways (e.g., do the analysis with the values as they are, and then again with them removed). Of course, both ways can quickly turn into a million different ways, and at some point you probably have to stop. However, trying it more than one way can be reassuring if you get pretty much the same answer each time. If you do not get similar results, then you have to be more careful and should describe in detail in your report/paper why and how different approaches to your data cleaning lead to different results.
Sometimes data are censored, i.e., you know that the value is “X or
greater/less” but not the exact value. This is common in longitudinal
(survival or similar) data. It is also common in experimental data,
where you might have a measurement procedure with lower or upper limits
of detection. The way to deal with censored data depends on what kind of
censored data you have, and what you want to do with it. In R, packages
that are used to analyze survival data, e.g., the survival package, can deal with some kinds of censored data. Otherwise, you will likely need to google around to figure out how to deal with the censored data scenario you have and how to do it in R.
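As a minimal sketch, here is how censored survival times are represented and summarized with the survival package, using its built-in lung dataset (in which follow-up time is right-censored for patients who were still alive):

```r
library(survival)

# Surv() combines follow-up time with the censoring indicator
# (in the lung data, status = 1 means censored, status = 2 means death)
s <- Surv(lung$time, lung$status)
head(s)  # censored times are printed with a "+"

# Kaplan-Meier estimate of the survival curve, accounting for censoring
fit <- survfit(Surv(time, status) ~ 1, data = lung)
summary(fit, times = c(100, 365))
```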
At times, you might have categorical variables with many categories, and a lot of them have only a few observations. This can make the modeling process problematic. In that case, you might want to consider combining certain categories into larger ones. For instance, if you have a variable for jobs that has many different values, it might make sense to group the jobs into categories (e.g., manual labor, clerical, ...). You need to report what you did so readers can decide whether this is a reasonable approach. Sometimes you might also want to group all minor categories into an “other” category. For instance, if you have a dataset of nicotine users, your main categories might be cigarettes, cigars, chewing, and vaping, and everything else (whatever that might be, I don’t know) could be “other”. Note that the way I’m grouping things here might be really dumb; that just shows I’m not an expert on smoking. You can let the data decide the grouping by looking at the numbers in each category, but there is no substitute for some level of expert topical knowledge.
For tasks like combining categories, and other operations that have
to do with categorical (factor) variables, the forcats
R package
is a great tool.
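A short sketch of both kinds of regrouping with forcats; the job and nicotine categories, and the way they are grouped, are made up purely for illustration:

```r
library(forcats)

# made-up job variable with many levels
jobs <- factor(c("welder", "plumber", "secretary", "clerk",
                 "teacher", "nurse", "astronaut"))

# manually collapse levels into broader categories; unmentioned levels become "other"
jobs_grouped <- fct_collapse(jobs,
  manual_labor = c("welder", "plumber"),
  clerical     = c("secretary", "clerk"),
  other_level  = "other"
)
table(jobs_grouped)

# or keep the most common levels and lump the rest into "other"
nicotine <- factor(rep(c("cigarettes", "vaping", "cigars", "chewing", "snus", "pipe"),
                       times = c(4, 3, 2, 2, 1, 1)))
table(fct_lump_n(nicotine, n = 4, other_level = "other"))
```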
It is not uncommon that your raw data comes in more than one file. It
could be multiple spreadsheets of a single study, or it could be
different datasets from different sources (e.g., a dataset on asthma
cases combined with a different data source that records air pollution
levels). In such cases, you will likely need to combine the data. That
is often called joining. The dplyr
package has a
great set of _join()
functions that let you do different
types of joining of data. Other tools for R exist, e.g., the data.table package.
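As a small sketch of joining with dplyr, using two made-up data frames linked by a shared county column (all names here are just assumptions for the example):

```r
library(dplyr)

# made-up asthma case counts per county
asthma <- data.frame(
  county = c("A", "B", "C"),
  cases  = c(120, 85, 40)
)

# made-up air pollution measurements per county
pollution <- data.frame(
  county = c("A", "B", "D"),
  pm25   = c(12.1, 9.4, 15.3)
)

# keep all rows of asthma, add matching pollution values (NA where no match)
left_join(asthma, pollution, by = "county")

# keep only counties present in both datasets
inner_join(asthma, pollution, by = "county")
```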
As you do your data cleaning/wrangling, you need to explore your data to figure out what needs fixing. As such, certain aspects of exploratory analysis and cleaning go together. You need to explore your data to figure out how to clean it, and you need to clean it to allow further exploration. So remember that even though in a course like this we have to cover one topic at a time, in reality, a lot of these activities occur at the same time.
We will cover exploratory analysis in a separate, later module, and we will consider exploratory statistical analysis as part of the exploration process. There is no clear definition of exploratory analysis. Some people consider only the parts up to and including a descriptive analysis to be exploratory. I consider very simple bivariate statistical approaches to also be part of the exploration. To me, the exploratory part is anything that comes before you apply your full (usually multivariable) models to the data with the goal of producing “final” results. Just don’t get too hung up on the terminology. In practice, there is no clear division; e.g., people often plot exploratory figures and add a simple linear regression or another simple statistical model to the figure as part of the exploration process.
The whole Wrangle section of R4DS shows a lot of the topics discussed here in action. For a more interactive (but less comprehensive) source of information, check out the Work with Data and Tidy your Data sections of the RStudio Primers.
This paper provides a nice further discussion regarding outliers. It’s very non-technical (no equations) and easy to read.