Thinking about data generation

Author

Andreas Handel

Modified

2024-03-20

Overview

We are generally in the business of analyzing data, not generating it (but see the synthetic data content). Nevertheless, it is very important to think about the processes that generated the data you are trying to analyze.

Learning Objectives

  • Understand why it’s important to think about the data-generating process
  • Be familiar with common features and structures in data

Introduction

Every dataset is generated by processes that can often be very complex. At times, seemingly similar data is generated by different processes. For instance, most of the text you read in this course was written by a single person. In contrast, a lot of the text you read in Wikipedia was co-authored and co-edited by many individuals. Thus, while both are pieces of text, the processes that generated them are different. Often, this difference in processes is reflected in patterns in the data, and needs to be accounted for. For any kind of analysis you do, it is useful to think about the likely processes that generated the data. This can influence your analysis approach.

Randomness

Many natural processes that generate data have some inherent randomness. For instance, if you want to analyze the effect of a drug on some health outcome, you expect to see variation among individuals. Some of the variation is due to individual characteristics (e.g., age, weight), but some is likely random. While one could debate whether true randomness exists, for the purpose of pretty much any analysis, randomness simply means any variation that comes from sources we did not or cannot measure in our system, and that we therefore cannot explain. Being aware of which processes are likely to be affected by more or less of such random variation is important. In the worst case, too much random variation swamps your signal and you won’t see any meaningful pattern in your data.
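To make the idea of noise swamping a signal concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the linear dose effect, the Gaussian noise, the noise levels, and the function names `simulate_outcome` and `pearson_r` are hypothetical and do not come from any real dataset.

```python
import random
import statistics


def simulate_outcome(doses, effect, noise_sd, rng):
    """Hypothetical model: outcome = effect * dose + Gaussian random variation."""
    return [effect * d + rng.gauss(0.0, noise_sd) for d in doses]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is reproducible
doses = [rng.uniform(0, 10) for _ in range(200)]

# Same underlying dose effect, two different amounts of random variation.
low_noise = simulate_outcome(doses, effect=1.0, noise_sd=1.0, rng=rng)
high_noise = simulate_outcome(doses, effect=1.0, noise_sd=30.0, rng=rng)

r_low = pearson_r(doses, low_noise)
r_high = pearson_r(doses, high_noise)
print(f"correlation with little noise: {r_low:.2f}")
print(f"correlation with a lot of noise: {r_high:.2f}")
```

The dose effect is identical in both simulated datasets. Only the amount of random variation differs, and with the larger noise level the dose-outcome relationship becomes hard to detect.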

Measurements

Measurements that lead to data always have inherent limitations. For instance, if you use a scale to measure someone’s weight, it might only report it to one decimal place. Similarly, if your scale has a minimum and maximum value it can record, anything lighter than the minimum or heavier than the maximum will not produce a meaningful numeric result, just an indication that it’s too light (the scale might show zero) or too heavy (the scale might show an error message or the maximum value). It is important to consider these kinds of measurement limitations and account for them if possible.
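The scale example can be sketched as a small function that turns a true weight into what the scale would report. The range limits (5 kg and 150 kg), the status labels, and the name `scale_reading` are hypothetical choices for illustration, not properties of any real device.

```python
def scale_reading(true_weight_kg, min_kg=5.0, max_kg=150.0, decimals=1):
    """Return (displayed value, status) for a hypothetical scale.

    Assumed behavior: below the range the scale shows zero, above the
    range it shows its maximum, and in between it rounds to `decimals`.
    """
    if true_weight_kg < min_kg:
        return 0.0, "below_range"
    if true_weight_kg > max_kg:
        return max_kg, "above_range"
    return round(true_weight_kg, decimals), "ok"


for weight in (2.0, 72.4567, 200.0):
    print(weight, "->", scale_reading(weight))
```

Keeping the status next to the displayed value lets an analysis distinguish a true reading from a censored one, instead of treating a displayed zero or maximum as a real measurement.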

Interactions

Often, the components in our system of interest interact. The simplest interaction is a direct one between a predictor and an outcome. Say you give some kind of cancer drug to a patient and see if it reduces the size of their tumor. This is often the scientific question of interest. However, there might be other factors interacting with the variables of interest. For instance, the drug might induce an immune response that could either help or hinder the ability of the drug to shrink the tumor. Let’s say the immune system synergistically helps the drug reduce the tumor in individuals who have blood type A-positive, and hinders the drug in everyone else. These might be important aspects that, if you know about them, need to inform your model.
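The blood-type scenario can be illustrated with a small simulation sketch. All numbers are assumptions made up for illustration (30% of patients A-positive, random treatment assignment, a drug effect of 3.0 units of tumor shrinkage in A-positive patients and 0.5 in everyone else, Gaussian noise), and the names `simulate_patient` and `mean_effect` are hypothetical.

```python
import random
import statistics


def simulate_patient(rng):
    """Return (a_positive, treated, tumor_shrinkage) for one hypothetical patient."""
    a_positive = rng.random() < 0.3  # assumed share of A-positive patients
    treated = rng.random() < 0.5     # assumed random treatment assignment
    drug_effect = 0.0
    if treated:
        # Assumption: the immune response helps in A-positive patients
        # and hinders the drug in everyone else.
        drug_effect = 3.0 if a_positive else 0.5
    shrinkage = drug_effect + rng.gauss(0.0, 1.0)
    return a_positive, treated, shrinkage


def mean_effect(patients):
    """Difference in mean shrinkage between treated and untreated patients."""
    treated = [s for _, t, s in patients if t]
    control = [s for _, t, s in patients if not t]
    return statistics.fmean(treated) - statistics.fmean(control)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
patients = [simulate_patient(rng) for _ in range(2000)]

pooled = mean_effect(patients)
effect_a = mean_effect([p for p in patients if p[0]])
effect_other = mean_effect([p for p in patients if not p[0]])
print(f"pooled estimate: {pooled:.2f}")
print(f"A-positive only: {effect_a:.2f}")
print(f"everyone else:   {effect_other:.2f}")
```

Under these assumptions, the pooled estimate lands between the two subgroup effects and describes neither group well, which is why a known interaction should be reflected in the model.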

The whole field of causal modeling tries, among other things, to address such situations and help you devise the best analysis approach given what you know or assume about the system. It’s worth learning causal modeling, but we can’t cover it in this class.

Data structures

Depending on the processes that generate the data, it is common for data to have certain structures. Thinking about those structures, and accounting for them in your analyses where appropriate, is important. The Complex Data unit discussed some important structures, such as temporal, spatial, and hierarchical ones. These structures will likely determine the details of your modeling approach, since you will want to use models that account for them.
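As one illustration of temporal structure, the sketch below compares a simulated series in which each value depends on the previous one with a series of independent values. The first-order autoregressive process, the coefficient of 0.8, and the names `ar1_series` and `lag1_autocorr` are assumptions chosen for illustration only.

```python
import random
import statistics


def ar1_series(n, phi, rng):
    """Simulate x[t] = phi * x[t-1] + Gaussian noise (an assumed AR(1) process)."""
    values = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        values.append(phi * values[-1] + rng.gauss(0.0, 1.0))
    return values


def lag1_autocorr(x):
    """Sample correlation between each value and the value right before it."""
    m = statistics.fmean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


rng = random.Random(7)  # fixed seed so the sketch is reproducible
temporal = ar1_series(1000, phi=0.8, rng=rng)
independent = [rng.gauss(0.0, 1.0) for _ in range(1000)]

auto_temporal = lag1_autocorr(temporal)
auto_independent = lag1_autocorr(independent)
print(f"lag-1 autocorrelation, temporally structured: {auto_temporal:.2f}")
print(f"lag-1 autocorrelation, independent values:    {auto_independent:.2f}")
```

A model that treats the structured series as independent observations ignores the fact that neighboring values carry overlapping information, which is one reason such structure should shape the choice of model.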

Summary

Occasionally, you might do an analysis that tries to explicitly model (some of) the processes that lead to the observed data. More often, you will want to discover patterns and you don’t directly care about the processes that led to the data. Nevertheless, thinking about them is important since it might influence the way you go about analyzing the data, the specific questions you might try to answer, and the conclusions you can draw from your analysis. Therefore, as a good analyst, you want to understand as much as possible about the system and processes that led to the data you are analyzing.

Further resources

  • A great source for causal modeling, and statistics in general, is Richard McElreath’s online course. Some of the materials are free, though his book is not.