Missing Data

Author

Andreas Handel

Modified

2024-03-20

Overview

For this unit, we will discuss missing data and what to do about it.

Learning Objectives

  • Be familiar with the problem of missing data.
  • Be able to apply approaches to handle missing data.

Introduction

In almost any dataset, there are some missing entries. Data can be missing for different reasons: the information might not have been asked for or recorded, the person might have refused to provide it, or the variable might not be applicable (e.g., for a non-smoker, the variable “number of cigarettes per day” might be left blank). Understanding why data are missing is important, so we can form an action plan based on that.

There are different ways to deal with missing data; here are some common ones.

Remove missing data

Many analysis approaches do not allow for missing data. In this case, you need to reduce your dataset such that nothing is missing.

The easiest approach is to remove all observations with missing data (this is called “listwise deletion”). That can get problematic if you have a lot of variables and each variable has some missing values; you might be left with almost no observations that have complete data. Another approach is to exclude observations from analyses where they have missing data in the variables of interest, but include them in analyses where they have information for all variables being considered (this is called “pairwise deletion”). Both of these methods can lead to bias in different ways and should be used with careful consideration.
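As a minimal sketch of both approaches, here is some base R code using a small hypothetical data frame `dat` (all variable names and values are made up for illustration):

```r
# Hypothetical example data with scattered missing values
dat <- data.frame(
  age    = c(34, 51, NA, 28),
  bmi    = c(22.1, NA, 25.3, 30.2),
  smoker = c("yes", "no", "no", NA)
)

# Listwise deletion: keep only observations that are complete for all variables
dat_complete <- dat[complete.cases(dat), ]

# Pairwise deletion: some functions can use all observations that are complete
# for each pair of variables, e.g. correlations
cor(dat$age, dat$bmi, use = "pairwise.complete.obs")
```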

Another option is to remove all variables with missing data from further analysis. Unfortunately, it is common that at least some values are missing for each variable, which means you would be left with no variables. Also, if the variable is important for the question you are trying to answer, you obviously cannot remove it.

Another option is to use a combination of removing variables and observations. You could start by removing variables with missing values above some threshold, e.g., any variable that has more than 10% or 20% (or some value you pick) missing. There is no firm rule for choosing this threshold, and you need to justify your choice. Then, once all variables with missing values above the threshold are removed, you remove any remaining observations that still contain missing data. This mix of removing variables and observations might preserve the largest amount of data.
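A sketch of this combined strategy, reusing the hypothetical `dat` from above and an arbitrary 20% threshold:

```r
# Step 1: drop variables whose proportion of missing values exceeds the
# threshold (20% here; the choice is arbitrary and needs justification)
threshold    <- 0.2
prop_missing <- colMeans(is.na(dat))
dat_reduced  <- dat[, prop_missing <= threshold, drop = FALSE]

# Step 2: drop any remaining observations that still contain missing values
dat_reduced <- dat_reduced[complete.cases(dat_reduced), ]
```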

There are two problems when removing data. One is the obvious fact that you lose data, and thus statistical power. The other problem arises if the data are not missing completely at random. In that case, by removing observations with missing data, you introduce bias into your dataset. Again, it is important to understand why and how data are missing so you can have an understanding of potential problems such as introducing bias.

Impute missing data

Instead of removing rows and columns (observations and variables) until no entries are missing, you can also impute the missing values. Basically, you make an educated, data-driven guess as to what the missing value might have been and stick that value into the missing slot. In principle, any regression or classification method that you can use to estimate and predict an outcome can be used here: you temporarily treat the variable you want to impute as your outcome and the other variables as predictors, and then predict the missing values. Methods such as k-nearest neighbors or random forest, which we discuss later in the course, are useful for this. Imputation adds uncertainty since you made guesses for the missing values, and often the estimated/guessed values are randomly drawn from a distribution. Multiple imputation creates several different imputed datasets, and you can then run your analysis on each of those imputed datasets, hopefully with similar results for each. In R, the recipes package, which is part of the set of tidy modeling tools called tidymodels, allows for imputation, as do other packages such as mice. We’ll try some of those in a later unit.
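As a rough preview, here is a sketch using the recipes and mice packages; it assumes a data frame `dat` with an outcome column and predictors that contain missing values (the variable names are placeholders, not from any real dataset):

```r
library(recipes)

# Impute missing predictor values with k-nearest neighbors as part of a
# preprocessing recipe
rec <- recipe(outcome ~ ., data = dat) |>
  step_impute_knn(all_predictors(), neighbors = 5)

dat_imputed <- rec |>
  prep() |>
  bake(new_data = NULL)

# Multiple imputation with mice: m = 5 creates five imputed datasets, and
# the analysis can then be repeated on each of them
library(mice)
imp <- mice(dat, m = 5, printFlag = FALSE)
dat_imp1 <- complete(imp, 1)  # extract the first imputed dataset
```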

Feature engineering

As the creating new data/variables unit discusses, it is possible to create new variables/data from the existing ones. This can sometimes help with missing values.

For instance, you could create new variables/predictors/features with fewer missing values. As an example, if you have data that record whether a person drinks beer (yes/no), wine (yes/no), or hard liquor (yes/no), and each of those variables has some missing values, you could create a new variable labeled any alcohol and code it as yes/no. If a person has a yes for at least one of the 3 original variables, they would be coded as yes in the new one. If they have no for all three, they would be coded as no. For anyone left, you need to decide how to interpret the missing values in the original variables, i.e., whether to treat them as yes or no. You could pick one of those, e.g., coding missing as no if you have some additional knowledge suggesting that anyone who doesn’t have that value recorded is more likely a no. For categorical data, you could also treat missing values as their own category (this can be useful when missingness provides information about the value of the response).
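A sketch of this recoding with dplyr, using made-up data; note how the leftover rows (at least one NA, no yes) are handled here by making missingness its own category:

```r
library(dplyr)

# Hypothetical yes/no drinking variables with some missing entries
drinks <- data.frame(
  beer   = c("yes", "no", NA, "no"),
  wine   = c("no", "no", "no", NA),
  liquor = c(NA, "no", "yes", "no")
)

drinks <- drinks |>
  mutate(any_alcohol = case_when(
    beer == "yes" | wine == "yes" | liquor == "yes" ~ "yes",
    beer == "no" & wine == "no" & liquor == "no"    ~ "no",
    # remaining rows have at least one NA and no yes; treat missingness
    # as its own category, one of the options discussed above
    TRUE ~ "missing"
  ))
```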

Keep missing data

While many standard statistical models, such as linear and generalized linear models, don’t work with missing values, some statistical algorithms can handle them. For instance, tree-based methods (e.g., random forest, boosted regression trees) can take predictors that have missing values. If you know that the method you plan on using can handle missing values, you can decide not to get rid of them but keep them in the data. It’s still useful, though, to think carefully about your missing data and possibly process it; you might get better results from your models, even from those that can handle missing values.
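For example, classification and regression trees fit with the rpart package handle missing predictor values through surrogate splits, so rows with missing entries can stay in the data. A minimal sketch, with a hypothetical formula and data frame:

```r
library(rpart)

# rpart uses surrogate splits to route observations with missing predictor
# values down the tree, so no prior imputation or deletion is needed
fit <- rpart(outcome ~ ., data = dat)
predict(fit, newdata = dat)
```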

Avoid missing data

As you can tell, missing values can cause headaches. If you design a study and collect the data yourself, it is therefore very important to do so in a way that minimizes missing values. Of course, if you analyze data collected by someone else, there is not much you can do, and you have to decide how to deal with the missing values. For that, having a good understanding of what the data mean and how they were collected is essential.

Missing data in R

In R, missing values are coded as NA. When you read data into R that code missing values in some other way, for instance as 99, you should recode those entries to NA. NA in R is a bit tricky, since most operations involving NA return NA. The tidyverse functions tend to be pretty good at dealing with NA, but for base R code you often have to be more careful. The function is.na() is often useful. Some functions, e.g., mean() and sum(), can deal with NA if you tell them what to do. Sometimes functions deal with NA in some built-in way; you need to check that this is what you want. Always perform careful checks when handling missing values! The tidyverse packages tidyr and dplyr have tools for dealing with missing values during the wrangling process.
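A few small illustrations of these points:

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA, because the default behavior propagates NA
mean(x, na.rm = TRUE)  # 2.33..., computed after removing the missing value
is.na(x)               # FALSE FALSE TRUE FALSE

# tidyr helpers for missing values during wrangling
library(tidyr)
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
drop_na(df)                                  # remove rows containing any NA
replace_na(df, list(a = 0, b = "unknown"))   # fill NAs with specified values
```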

Summary

In pretty much any real dataset, you’ll encounter missing data. You should understand as much as possible why these entries are missing. Based on that, and your analysis goals, you should formulate a good approach for handling the missing data.

Often, it’s good to do things more than one way and show that your findings are robust. For instance, if you remove the missing data in one analysis and impute it in another and you get different results, that means the missing data are influential, and you need to be very careful about the conclusions you draw. On the other hand, if you get more or less the same result, you can be somewhat reassured that the missing data might not matter too much.
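As a sketch of such a robustness check, assuming a hypothetical regression model and the complete-case and imputed datasets created in the examples above:

```r
# Fit the same (hypothetical) model on complete cases and on imputed data
fit_cc  <- lm(outcome ~ exposure + age, data = dat[complete.cases(dat), ])
fit_imp <- lm(outcome ~ exposure + age, data = dat_imputed)

# If the estimates are similar, the missing data are likely not very
# influential; large differences warrant caution
summary(fit_cc)
summary(fit_imp)
```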

Further Resources

  • There is a chapter on missing data in the book Feature Engineering and Selection: A Practical Approach for Predictive Models.