# make sure the packages are installed
# Load required packages
library(dplyr)
library(purrr)
library(lubridate)
library(ggplot2)
library(here)
Generating synthetic data with R
Overview
In this unit, we look at a few examples of generating synthetic data with R
.
Learning Objectives
- Be able to generate synthetic data with
R
code.
Introduction
While there are R
packages available that help with the generation of synthetic data - and we will discuss them in a different unit, it is also fairly easy to use basic R
functionality to generate simulated data. We’ll do a few examples to show how this can be done.
AI tools are very useful for generating synthetic data. In fact, the code in the examples below was partially generated through back and forth with ChatGPT, together with some manual editing. More on using AI to generate data is provided in another unit of this module, and the more general topic of LLM AI tool use for coding is covered in the AI module.
Example 1
For the first example, we’ll generate a simple dataset of some hypothetical clinical trial for some treatment (say, a Cholesterol lowering medication) and its impact on outcomes such as Cholesterol levels and adverse events.
It is often useful to go slow and type your own code, or copy and paste the code chunks below and execute one at a time. Note that typing out the code yourself rather than pasting can help improve your muscle memory and retention of what you’re doing! However, if you are in a hurry, you can find the code for this example in this file.
In the code below, you might be wondering what the dplyr::glimpse()
and similar such notation is about. This notation indicates explicitly from which package the function comes. Once you loaded a package, that’s not necessary anymore, just using glimpse()
would work equally. But it is often nice to quickly see what package a function comes from. An important thing to note is that some packages share function names. In such cases, if a function appears in multiple packages in your environment, R
will default to using the one from the most recently loaded package. Using explicit notation is a good way to be sure you’re using the function you want! Thus, you’ll see me use the explicit notation often, though not always. It’s typically a matter of being more explicit versus typing less. I switch back and forth.
Setup
We’ll start by loading the packages used in the code below. While we will not use packages that are specifically meant to generate synthetic data, we will still use several common packages that make data generation tasks easier. That said, you could also do all of this with base R
and no additional packages.
# Set a seed for reproducibility
set.seed(123)
# Define the number of observations (patients) to generate
<- 100 n_patients
Generating data
# Create an empty data frame with placeholders for variables
<- data.frame(
syn_dat PatientID = numeric(n_patients),
Age = numeric(n_patients),
Gender = character(n_patients),
TreatmentGroup = character(n_patients),
EnrollmentDate = lubridate::as_date(character(n_patients)),
BloodPressure = numeric(n_patients),
Cholesterol = numeric(n_patients),
AdverseEvent = integer(n_patients)
)
# Variable 1: Patient ID
$PatientID <- 1:n_patients
syn_dat
# Variable 2: Age (numeric variable)
$Age <- round(rnorm(n_patients, mean = 45, sd = 10), 1)
syn_dat
# Variable 3: Gender (categorical variable)
$Gender <- purrr::map_chr(sample(c("Male", "Female"), n_patients, replace = TRUE), as.character)
syn_dat
# Variable 4: Treatment Group (categorical variable)
$TreatmentGroup <- purrr::map_chr(sample(c("A", "B", "Placebo"), n_patients, replace = TRUE), as.character)
syn_dat
# Variable 5: Date of Enrollment (date variable)
$EnrollmentDate <- lubridate::as_date(sample(seq(from = lubridate::as_date("2022-01-01"), to = lubridate::as_date("2022-12-31"), by = "days"), n_patients, replace = TRUE))
syn_dat
# Variable 6: Blood Pressure (numeric variable)
$BloodPressure <- round(runif(n_patients, min = 90, max = 160), 1)
syn_dat
# Variable 7: Cholesterol Level (numeric variable)
# Option 1: Cholesterol is independent of treatment
#syn_dat$Cholesterol <- round(rnorm(n_patients, mean = 200, sd = 30), 1)
# Option 2: Cholesterol is dependent on treatment
$Cholesterol[syn_dat$TreatmentGroup == "A"] <- round(rnorm(sum(syn_dat$TreatmentGroup == "A"), mean = 160, sd = 10), 1)
syn_dat$Cholesterol[syn_dat$TreatmentGroup == "B"] <- round(rnorm(sum(syn_dat$TreatmentGroup == "B"), mean = 180, sd = 10), 1)
syn_dat$Cholesterol[syn_dat$TreatmentGroup == "Placebo"] <- round(rnorm(sum(syn_dat$TreatmentGroup == "Placebo"), mean = 200, sd = 10), 1)
syn_dat
# Variable 8: Adverse Event (binary variable, 0 = No, 1 = Yes)
# Option 1: Adverse events are independent of treatment
#syn_dat$AdverseEvent <- purrr::map_int(sample(0:1, n_patients, replace = TRUE, prob = c(0.8, 0.2)), as.integer)
# Option 2: Adverse events are influenced by treatment status
$AdverseEvent[syn_dat$TreatmentGroup == "A"] <- purrr::map_int(sample(0:1, sum(syn_dat$TreatmentGroup == "A"), replace = TRUE, prob = c(0.5, 0.5)), as.integer)
syn_dat$AdverseEvent[syn_dat$TreatmentGroup == "B"] <- purrr::map_int(sample(0:1, sum(syn_dat$TreatmentGroup == "B"), replace = TRUE, prob = c(0.7, 0.3)), as.integer)
syn_dat$AdverseEvent[syn_dat$TreatmentGroup == "Placebo"] <- purrr::map_int(sample(0:1, sum(syn_dat$TreatmentGroup == "Placebo"), replace = TRUE, prob = c(0.9, 0.1)), as.integer)
syn_dat
# Print the first few rows of the generated data
head(syn_dat)
PatientID Age Gender TreatmentGroup EnrollmentDate BloodPressure Cholesterol
1 1 39.4 Female B 2022-08-25 152.0 179.7
2 2 42.7 Female B 2022-06-14 128.7 192.2
3 3 60.6 Female A 2022-04-17 153.4 150.6
4 4 45.7 Male B 2022-02-02 131.1 171.7
5 5 46.3 Female A 2022-03-24 119.6 160.5
6 6 62.2 Female A 2022-12-20 156.5 154.3
AdverseEvent
1 1
2 0
3 0
4 0
5 1
6 1
# Save the simulated data to a CSV and Rds file
write.csv(syn_dat, here("data","syn_dat.csv"), row.names = FALSE)
# if we wanted an RDS version
#saveRDS(syn_dat, here("data","syn_dat.Rds"))
Checking data
Take a peek at the generated data.
summary(syn_dat)
PatientID Age Gender TreatmentGroup
Min. : 1.00 Min. :21.90 Length:100 Length:100
1st Qu.: 25.75 1st Qu.:40.08 Class :character Class :character
Median : 50.50 Median :45.60 Mode :character Mode :character
Mean : 50.50 Mean :45.90
3rd Qu.: 75.25 3rd Qu.:51.92
Max. :100.00 Max. :66.90
EnrollmentDate BloodPressure Cholesterol AdverseEvent
Min. :2022-01-08 Min. : 91.3 Min. :129.6 Min. :0.00
1st Qu.:2022-04-06 1st Qu.:110.7 1st Qu.:160.7 1st Qu.:0.00
Median :2022-06-25 Median :130.8 Median :176.3 Median :0.00
Mean :2022-06-30 Mean :128.0 Mean :175.2 Mean :0.29
3rd Qu.:2022-10-04 3rd Qu.:147.4 3rd Qu.:188.7 3rd Qu.:1.00
Max. :2022-12-30 Max. :159.5 Max. :223.7 Max. :1.00
::glimpse(syn_dat) dplyr
Rows: 100
Columns: 8
$ PatientID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Age <dbl> 39.4, 42.7, 60.6, 45.7, 46.3, 62.2, 49.6, 32.3, 38.1, 4…
$ Gender <chr> "Female", "Female", "Female", "Male", "Female", "Female…
$ TreatmentGroup <chr> "B", "B", "A", "B", "A", "A", "B", "Placebo", "A", "Pla…
$ EnrollmentDate <date> 2022-08-25, 2022-06-14, 2022-04-17, 2022-02-02, 2022-0…
$ BloodPressure <dbl> 152.0, 128.7, 153.4, 131.1, 119.6, 156.5, 139.6, 118.9,…
$ Cholesterol <dbl> 179.7, 192.2, 150.6, 171.7, 160.5, 154.3, 172.8, 189.2,…
$ AdverseEvent <int> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
# Frequency table for adverse events stratified by treatment
table(syn_dat$AdverseEvent,syn_dat$TreatmentGroup)
A B Placebo
0 28 19 24
1 15 11 3
# ggplot2 boxplot for cholesterol by treatment group
ggplot(syn_dat, aes(x = TreatmentGroup, y = Cholesterol)) +
geom_boxplot() +
labs(x = "Treatment Group", y = "Cholesterol Level") +
theme_bw()
This concludes our first example. This was a very simple setup, with rectangular data. You often encounter these kind of data in clinical trials. As you generate your data, you can build in any dependencies between variables you want to explore. Then later, when you use the data to test your analysis code, you can see if your analysis code can detect the dependencies you built in. We’ll come back to that.
You could also add further complexities into your synthetic data, for instance you could set some values to be missing, or you could add some outliers. The goal is to generate data that has as the important features of your real dataset to allow you to test your analysis approach on data where you now the truth (since you generated it). If your analysis works on your generated data, there is hope it might also work on the real data (for which of course you don’t know the truth). We’ll define in a later unit what we mean by “your analysis works”. But basically, you want to be able to recover the patterns/dependencies you built into your data with your analysis methods.
Example 2
In this example, we’ll generate a dataset that has a more complex structure. We’ll generate a dataset that has a longitudinal structure, i.e. repeated measurements over time. We’ll also generate a dataset that has a hierarchical structure, i.e. measurements nested within groups.
We’ll walk slowly through the code chunks, you can find all the code in one file here.
Setup
The usual setup steps.
# make sure the packages are installed
# Load required packages
library(dplyr)
library(ggplot2)
library(here)
# Set seed for reproducibility
set.seed(123)
# Number of patients in each treatment group
<- 20
num_patients # Number of days and samples per patient
<- 7
num_days <- 1 num_samples_per_day
Generating data
# Treatment group levels
<- c("Low Dose", "High Dose")
treatment_groups
# Generate patient IDs
<- rep(1:num_patients, each = num_days)
patient_ids
# Generate treatment group assignments for each patient
<- rep(sample(treatment_groups, num_patients, replace = TRUE),
treatment_assignments each = num_days)
# Generate day IDs for each patient
<- rep(1:num_days, times = num_patients)
day_ids
# Function to generate drug concentrations with variability
<- function(day, dose_group, patient_id) {
generate_drug_concentrations <- ifelse(dose_group == "Low Dose", 8, 15)
baseline_concentration <- rnorm(1, mean = 0, sd = 1)
patient_variation <- exp(-0.1*day)
time_variation * time_variation + patient_variation
baseline_concentration
}
# Generate drug concentrations for each sample
<- mapply(generate_drug_concentrations,
drug_concentrations day = rep(day_ids, each = num_samples_per_day),
dose_group = treatment_assignments,
patient_id = rep(1:num_patients, each = num_days))
# Flatten the matrix to a vector
<- as.vector(drug_concentrations)
drug_concentrations
# Generate cholesterol levels for each sample
# (assuming a positive correlation with drug concentration)
<- drug_concentrations +
cholesterol_levels rnorm(num_patients * num_days * num_samples_per_day, mean = 0, sd = 5)
# Generate adverse events based on drug concentration
# (assuming a higher chance of adverse events with higher concentration)
# Sigmoid function to map concentrations to probabilities
<- plogis(drug_concentrations / 10)
adverse_events_prob <- rbinom(num_patients * num_days * num_samples_per_day,
adverse_events size = 1, prob = adverse_events_prob)
# Create a data frame
<- data.frame(
syn_dat2 PatientID = rep(patient_ids, each = num_samples_per_day),
TreatmentGroup = rep(treatment_assignments, each = num_samples_per_day),
Day = rep(day_ids, each = num_samples_per_day),
DrugConcentration = drug_concentrations,
CholesterolLevel = cholesterol_levels,
AdverseEvent = adverse_events
)
Checking data
Take a peek at the generated data.
# Print the first few rows of the generated dataset
print(head(syn_dat2))
PatientID TreatmentGroup Day DrugConcentration CholesterolLevel AdverseEvent
1 1 Low Dose 1 8.462781 12.401475 0
2 1 Low Dose 2 6.909660 10.754871 0
3 1 Low Dose 3 6.327317 7.988330 0
4 1 Low Dose 4 5.473243 0.431360 1
5 1 Low Dose 5 4.296404 3.699141 0
6 1 Low Dose 6 6.177406 4.775430 1
summary(syn_dat2)
PatientID TreatmentGroup Day DrugConcentration
Min. : 1.00 Length:140 Min. :1 Min. : 2.081
1st Qu.: 5.75 Class :character 1st Qu.:2 1st Qu.: 5.174
Median :10.50 Mode :character Median :4 Median : 7.056
Mean :10.50 Mean :4 Mean : 7.593
3rd Qu.:15.25 3rd Qu.:6 3rd Qu.: 9.738
Max. :20.00 Max. :7 Max. :14.933
CholesterolLevel AdverseEvent
Min. :-4.494 Min. :0.0000
1st Qu.: 3.844 1st Qu.:0.0000
Median : 7.234 Median :1.0000
Mean : 7.855 Mean :0.6571
3rd Qu.:11.259 3rd Qu.:1.0000
Max. :24.422 Max. :1.0000
::glimpse(syn_dat2) dplyr
Rows: 140
Columns: 6
$ PatientID <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ TreatmentGroup <chr> "Low Dose", "Low Dose", "Low Dose", "Low Dose", "Low…
$ Day <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4…
$ DrugConcentration <dbl> 8.462781, 6.909660, 6.327317, 5.473243, 4.296404, 6.…
$ CholesterolLevel <dbl> 12.4014754, 10.7548711, 7.9883301, 0.4313600, 3.6991…
$ AdverseEvent <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1…
Exploratory plot
# Plot drug concentrations over time for each individual using ggplot2
<- ggplot(syn_dat2, aes(x = Day, y = DrugConcentration,
p1 group = as.factor(PatientID), color = TreatmentGroup)) +
geom_line() +
labs(title = "Drug Concentrations Over Time",
x = "Day",
y = "Drug Concentration",
color = "TreatmentGroup") +
theme_minimal()
plot(p1)
<- ggplot(syn_dat2, aes(x = as.factor(AdverseEvent), y = DrugConcentration,
p2 fill = TreatmentGroup)) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.8), color = "black") +
geom_point(aes(color = TreatmentGroup), position = position_dodge(width = 0.8),
size = 3, shape = 16) + # Overlay raw data points
labs(
x = "Adverse Events",
y = "Drug Concentration",
title = "Boxplot of Drug Concentration by Adverse Events and Treatment"
+
) scale_color_manual(values = c("A" = "blue", "B" = "red")) + # Customize color for each treatment
theme_minimal() +
theme(legend.position = "top")
plot(p2)
This dataset has a bit more complicated structure than the previous one, but it isn’t much harder to generate it with code. You’ll want to generate your data such that it mimics the real data you plan on analyzing.
Save data
# Save the simulated data to a CSV or Rds file
write.csv(syn_dat2, here("data","syn_dat2.csv"), row.names = FALSE)
# if we wanted an RDS version
#saveRDS(syn_dat2, here("data","syn_dat2.Rds"))
We are saving the data as CSV and Rds files. CSV files are readable with pretty much any software and thus very portable. Rds files are R
-specific, thus not as flexible. The advantage of Rds files is that they are generally smaller, and they retain information about the variables, e.g. if they are factor or numeric variables. Either format works, sometimes one or the other might be better.
Summary
It is fairly easy to generate synthetic data with some basic R
commands, and it’s even easier if you use some LLM AI tool to help you write the code.
Since you generate the data, you can decide what dependencies between variables you want to add, and you can then test your models to see if they can find those patterns. We’ll get to that in another unit.
Further Resources
While the manual generation of data is generally fairly easy and gives you the most control, sometimes using R
packages specifically designed for this purpose can be useful. We’ll look at a few in a later unit.