Processing Data

Authors

Andreas Handel

Megan Beaudry

Modified

2024-03-20

Overview

This unit provides a brief introduction to the topic of data processing.

Learning Objectives

Be familiar with the topic of data processing
Understand the importance of proper processing for data analysis

Introduction

I want to remind you of this diagram:

Double sided arrow points between Question and Data with the text 'Match needed' above the arrow. A spiral moves from Load, to Clean, to Explore, to Process. Below the spiral an arrow points from Model to Report. An rounded arrow points from Model and Report back up to the spiral with the word Repeat. — Data analysis workflow

Different data-analysis related tasks rarely happen in a linear sequence. Instead there is a general stage before statistical analysis where you mess with your data (that’s a technical term 😁) to get ready for the main formal analysis. And often, as you start that formal analysis, you’ll have to go back and do some more cleaning/exploration.

While one can distinguish the different tasks of Cleaning, Exploring and Processing, one can also more generally think of them as all being a part of data processing. Alternative terms often used instead of processing are data wrangling or tidying. Cleaning, munging.

As with any part of the data analysis process, processing should be automated, reproducible, and well-documented. No “fixing by hand”!

Data processing tasks

Almost all datasets require some processing before they are in a format that can be used with statistical models. For any data analysis, the majority of time is likely spent in the data processing and exploration steps before the main model fitting. Some folks suggest that this processing part is as much as 80% of the time spent on the whole project. In my experience, this is about right.

While every dataset is different, there are certain tasks that are common. Those involve missing data, recoding variables, and dealing with outliers. Further ones are merging and reshaping of the data. We will discuss some of them in further units.

Data processing and exploratory data analysis (EDA)

As you process your data, you need to explore your data to figure out what needs processing. As such, exploratory data analysis (EDA) and data processing generally go together.

EDA generally relies on making figures and tables, and we’ll talk about that in a separate module. There is no clear definition for exploratory analysis. Beyond figures and tables, it can sometimes involve simple statistical computations, for instance you might compute correlations among predictors.

You can think of EDA as any process that happens before you start building and fitting your main models. Again, it often happens concurrently with the processing of the data.

Summary

Data processing is a major and crucial part of any data analysis project. It can be tedious, but with practice and good tools, you might even reach a point where you enjoy the process somewhat 😁.

Further resources

Chapters 4 and 5 of The Art of Data science (ADS) discuss some processing and exploration tasks and examples. if you want to work along, see the Some practice section of the Data Analysis Overview page for some details on how to do it.
The whole R4DS book focuses on the early stages of data analysis, including processing and exploration. Definitely keep checking this book if you are looking for more information on a specific topic.