library('dplyr')
library('ggplot2')
library('here')
Scrambling existing data - R
Overview
In this tutorial, we discuss how to scramble existing data to make it “new”.
Learning Objectives
- Be able to generate scrambled data with R based on existing data.
Introduction
If you want to be as close to the original data as possible, you can just take that data and scramble values such that observations become new.
Let’s say you have individuals with different characteristics such as age, height, gender, BMI, etc. By just randomly re-assigning values for each variable to individuals, you create new individuals. Those individuals are not real and thus you minimize potential problems with confidentiality.
Since you are using exactly the same values for each variable as in the original dataset, the distribution for each variable remains the same and is thus very close (in fact, identical) to the real data.
However, a problem with such scrambling is that while you preserve the distribution of each variable, you might break associations. For instance, males are generally taller than females. If you randomly scramble both gender and height without taking into account this potential association, you might end up with a dataset that has a distribution of heights among males and females that is the same. Depending on your goals, this might or might not be a problem.
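To make the broken-association problem concrete, here is a minimal sketch with simulated data. The data frame `toy` and its `sex` and `height` columns are invented for this illustration (they are not part of the data used later). The means of 178 and 165 are arbitrary assumptions chosen so that males are taller on average.

```r
set.seed(1)
n <- 1000

# simulate a toy dataset in which height depends on sex (assumed values)
sex <- sample(c("F", "M"), n, replace = TRUE)
height <- rnorm(n, mean = ifelse(sex == "M", 178, 165), sd = 7)
toy <- data.frame(sex, height)

# scramble each column independently of the other
toy_sc <- data.frame(sex = sample(toy$sex), height = sample(toy$height))

# the set of height values is unchanged, so the distribution is identical
same_heights <- identical(sort(toy$height), sort(toy_sc$height))

# difference in mean height (M minus F): large before, near zero after
diff_orig <- diff(tapply(toy$height, toy$sex, mean))
diff_sc <- diff(tapply(toy_sc$height, toy_sc$sex, mean))
print(c(original = unname(diff_orig), scrambled = unname(diff_sc)))
```

The per-variable distributions survive the scrambling, but the height difference between the two groups essentially disappears.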
Another drawback of scrambling data is that you can’t build associations between variables into the data generating process, so you don’t really know what your models should find when they look for patterns. Thus, a big advantage of synthetic data, namely the fact that you know exactly how it was generated, goes away.
In general, I’m not too big a fan of the scrambling approach, but there might be scenarios where this is what you want/need, therefore we should talk about it.
AI help
Since you are working with the real data, you probably don’t want to use AI for this, unless your AI tool operates in a secure environment (e.g., fully on your company’s servers).
Example
Time for a simple example. You can find the code shown below in this file.
Setup
First, we do the usual setup steps of package loading and other housekeeping steps.
# setting a random number seed for reproducibility
set.seed(123)
Data loading and exploring
We’ll look at some real data from this paper. As is good practice (and should be the standard), the authors (who include some of us) supplied the data as part of the supplementary materials, which can be found here.
If you want to work along, go ahead and download the supplement, which is a zip file. Inside the zip file, find the Clean Data folder and the SympAct_Any_Pos.Rda file. Copy that file to the location where you’ll be placing your R script.
First, we load the data. Note that the authors (that would be us 😏) used the wrong file ending: they called it an .Rda file, even though it is an .Rds file (for a discussion of the differences, see e.g. here).
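As a quick illustration of the difference between the two formats, here is a small sketch using base R only. The object `toy` and the temporary file names are made up for this example. An .Rds file stores a single object without its name and is read with `readRDS()`, while an .Rda file stores one or more named objects and is restored with `load()`.

```r
toy <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))

# hypothetical temporary files, just for this illustration
rds_file <- tempfile(fileext = ".rds")
rda_file <- tempfile(fileext = ".rda")

# Rds: one object, no name stored; you choose the name when reading
saveRDS(toy, rds_file)
toy_from_rds <- readRDS(rds_file)

# Rda: objects are stored together with their names
save(toy, file = rda_file)
toy_backup <- toy
rm(toy)
loaded_names <- load(rda_file) # re-creates `toy`, returns the names it loaded
print(loaded_names)
```

Note that `readRDS()` does not look at the file ending, which is why the mislabeled file in this tutorial can still be read with it.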
The data
# assuming your R script is in the same folder
# rawdat <- readRDS('SympAct_Any_Pos.Rda')
# this is for my setup
rawdat <- readRDS(here::here('data','SympAct_Any_Pos.Rda'))
Next, we take a peek.
dim(rawdat)
[1] 735 63
dplyr::glimpse(rawdat)
Rows: 735
Columns: 63
$ DxName1 <fct> "Influenza like illness - Clinical Dx", "Acute tonsi…
$ DxName2 <fct> NA, "Influenza like illness - Clinical Dx", "Acute p…
$ DxName3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Fever, unspecified"…
$ DxName4 <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Other fatigue", NA,…
$ DxName5 <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Headache", NA, NA, …
$ Unique.Visit <chr> "340_17632125", "340_17794836", "342_17737773", "342…
$ ActivityLevel <int> 10, 6, 2, 2, 5, 3, 4, 0, 0, 5, 9, 1, 3, 6, 5, 2, 2, …
$ ActivityLevelF <fct> 10, 6, 2, 2, 5, 3, 4, 0, 0, 5, 9, 1, 3, 6, 5, 2, 2, …
$ SwollenLymphNodes <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Y…
$ ChestCongestion <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y…
$ ChillsSweats <fct> No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, …
$ NasalCongestion <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y…
$ CoughYN <fct> Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, …
$ Sneeze <fct> No, No, Yes, Yes, No, Yes, No, Yes, No, No, No, No, …
$ Fatigue <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ SubjectiveFever <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes…
$ Headache <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes…
$ Weakness <fct> Mild, Severe, Severe, Severe, Moderate, Moderate, Mi…
$ WeaknessYN <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ CoughIntensity <fct> Severe, Severe, Mild, Moderate, None, Moderate, Seve…
$ CoughYN2 <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes…
$ Myalgia <fct> Mild, Severe, Severe, Severe, Mild, Moderate, Mild, …
$ MyalgiaYN <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ RunnyNose <fct> No, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No…
$ AbPain <fct> No, No, Yes, No, No, No, No, No, No, No, Yes, Yes, N…
$ ChestPain <fct> No, No, Yes, No, No, Yes, Yes, No, No, No, No, Yes, …
$ Diarrhea <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No,…
$ EyePn <fct> No, No, No, No, Yes, No, No, No, No, No, Yes, No, Ye…
$ Insomnia <fct> No, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, Y…
$ ItchyEye <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes,…
$ Nausea <fct> No, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Y…
$ EarPn <fct> No, Yes, No, Yes, No, No, No, No, No, No, No, Yes, Y…
$ Hearing <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No,…
$ Pharyngitis <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, …
$ Breathless <fct> No, No, Yes, No, No, Yes, No, No, No, Yes, No, Yes, …
$ ToothPn <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, N…
$ Vision <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
$ Vomit <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, Yes, N…
$ Wheeze <fct> No, No, No, Yes, No, Yes, No, No, No, No, No, Yes, N…
$ BodyTemp <dbl> 98.3, 100.4, 100.8, 98.8, 100.5, 98.4, 102.5, 98.4, …
$ RapidFluA <fct> Presumptive Negative For Influenza A, NA, Presumptiv…
$ RapidFluB <fct> Presumptive Negative For Influenza B, NA, Presumptiv…
$ PCRFluA <fct> NA, NA, NA, NA, NA, NA, Influenza A Not Detected, N…
$ PCRFluB <fct> NA, NA, NA, NA, NA, NA, Influenza B Not Detected, N…
$ TransScore1 <dbl> 1, 3, 4, 5, 0, 2, 2, 5, 4, 4, 2, 3, 2, 5, 3, 5, 1, 5…
$ TransScore1F <fct> 1, 3, 4, 5, 0, 2, 2, 5, 4, 4, 2, 3, 2, 5, 3, 5, 1, 5…
$ TransScore2 <dbl> 1, 2, 3, 4, 0, 2, 2, 4, 3, 3, 1, 2, 2, 4, 2, 4, 1, 4…
$ TransScore2F <fct> 1, 2, 3, 4, 0, 2, 2, 4, 3, 3, 1, 2, 2, 4, 2, 4, 1, 4…
$ TransScore3 <dbl> 1, 1, 2, 3, 0, 2, 2, 3, 2, 2, 0, 1, 1, 3, 1, 3, 1, 3…
$ TransScore3F <fct> 1, 1, 2, 3, 0, 2, 2, 3, 2, 2, 0, 1, 1, 3, 1, 3, 1, 3…
$ TransScore4 <dbl> 0, 2, 4, 4, 0, 1, 1, 4, 3, 3, 2, 2, 2, 4, 3, 4, 0, 4…
$ TransScore4F <fct> 0, 2, 4, 4, 0, 1, 1, 4, 3, 3, 2, 2, 2, 4, 3, 4, 0, 4…
$ ImpactScore <int> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9…
$ ImpactScore2 <int> 6, 7, 13, 11, 10, 11, 7, 6, 9, 6, 12, 16, 10, 12, 8,…
$ ImpactScore3 <int> 3, 4, 9, 7, 6, 7, 3, 3, 6, 4, 7, 11, 6, 8, 4, 4, 5, …
$ ImpactScoreF <fct> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9…
$ ImpactScore2F <fct> 6, 7, 13, 11, 10, 11, 7, 6, 9, 6, 12, 16, 10, 12, 8,…
$ ImpactScore3F <fct> 3, 4, 9, 7, 6, 7, 3, 3, 6, 4, 7, 11, 6, 8, 4, 4, 5, …
$ ImpactScoreFD <fct> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9…
$ TotalSymp1 <dbl> 8, 11, 18, 17, 11, 14, 10, 12, 14, 11, 15, 20, 13, 1…
$ TotalSymp1F <fct> 8, 11, 18, 17, 11, 14, 10, 12, 14, 11, 15, 20, 13, 1…
$ TotalSymp2 <dbl> 8, 10, 17, 16, 11, 14, 10, 11, 13, 10, 14, 19, 13, 1…
$ TotalSymp3 <dbl> 8, 9, 16, 15, 11, 14, 10, 10, 12, 9, 13, 18, 12, 16,…
So it looks like these are 735 individuals (rows) and 63 variables (columns). A lot of them have names of symptoms and are coded as Yes/No. Some variables are harder to understand. For instance, without some meta-data/explanation, it is impossible to guess what TransScore3F stands for. Hopefully, your data came with some codebook/data dictionary/information sheet that explains what exactly everything means. For this specific data set, you can look through the supplementary materials to learn more. We won’t delve into it now, and just pick out a few variables to illustrate the data scrambling process.
Data processing
For simplicity, let’s assume we are interested in just a few of these variables, namely ActivityLevel, Sneeze, Nausea, and Vomit. We’ll select those and look at the first 10 entries.
dat <- rawdat |> dplyr::select("ActivityLevel","Sneeze","Nausea","Vomit")
head(dat,10)
ActivityLevel Sneeze Nausea Vomit
1 10 No No No
2 6 No No No
3 2 Yes Yes No
4 2 Yes Yes No
5 5 No Yes No
6 3 Yes Yes No
7 4 No No Yes
8 0 Yes No No
9 0 No Yes No
10 5 No Yes No
Data Scrambling
Now we’ll scramble the data. I’m doing this here with a simple loop. I’m looping through each variable, and I sample from the old values without replacement, which basically just rearranges them. There are computationally faster and more concise ways of doing this, but the loop makes it hopefully very clear what’s going on.
# define a new data frame that will contain scrambled values
dat_sc <- dat
Nobs <- nrow(dat) # number of observations
# loop over each variable, reshuffle entries
for (n in 1:ncol(dat))
{
  dat_sc[,n] <- sample(dat[,n], size = Nobs, replace = FALSE)
}
head(dat_sc,10)
ActivityLevel Sneeze Nausea Vomit
1 5 No No No
2 0 No No No
3 3 Yes No No
4 5 No Yes No
5 5 No Yes No
6 1 Yes No No
7 3 Yes No No
8 8 Yes No No
9 6 Yes No No
10 4 Yes Yes No
The first 10 entries look different, so that’s promising.
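For reference, one of the more concise ways of doing the same thing is sketched below. To keep it self-contained, it uses a small made-up data frame `toy` that mimics a few rows of the selected variables. It relies on the fact that `sample()` called on a vector without a `size` argument returns a random permutation of that vector.

```r
set.seed(123)

# small made-up data frame that mimics the structure of the selected variables
toy <- data.frame(
  ActivityLevel = c(10, 6, 2, 2, 5, 3),
  Sneeze = factor(c("No", "No", "Yes", "Yes", "No", "Yes")),
  Nausea = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# lapply() visits each column; sample() permutes it independently of the others
toy_sc <- as.data.frame(lapply(toy, sample))
print(toy_sc)
```

If you prefer the tidyverse, something like `dplyr::mutate(toy, dplyr::across(dplyr::everything(), sample))` should give an equivalent result.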
Comparing old and new data
Now let’s see if things worked. First, we summarize both the old and the new data. We should see that they are the same, since we just re-arranged the values across individuals. This is indeed the case.
summary(dat)
ActivityLevel Sneeze Nausea Vomit
Min. : 0.000 No :340 No :477 No :656
1st Qu.: 3.000 Yes:395 Yes:258 Yes: 79
Median : 4.000
Mean : 4.463
3rd Qu.: 6.000
Max. :10.000
summary(dat_sc)
ActivityLevel Sneeze Nausea Vomit
Min. : 0.000 No :340 No :477 No :656
1st Qu.: 3.000 Yes:395 Yes:258 Yes: 79
Median : 4.000
Mean : 4.463
3rd Qu.: 6.000
Max. :10.000
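Instead of comparing the summaries by eye, one can also check programmatically that each scrambled column contains exactly the same values as the original one. The sketch below does this on a small made-up data frame `toy` (not the real data), scrambled with the same kind of loop as above.

```r
set.seed(123)

# made-up stand-in for the real data
toy <- data.frame(
  ActivityLevel = c(10, 6, 2, 2, 5, 3, 4, 0),
  Vomit = factor(c("No", "No", "No", "No", "No", "No", "Yes", "No"))
)

# scramble column by column
toy_sc <- toy
for (n in seq_len(ncol(toy))) {
  toy_sc[, n] <- sample(toy[, n])
}

# after sorting, each column should be identical to its original version
same_values <- mapply(
  function(old, new) identical(sort(old), sort(new)),
  toy, toy_sc
)
print(same_values)
```

If every entry of `same_values` is `TRUE`, the scrambling only re-arranged values and did not change any of the per-variable distributions.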
We can also look at correlations between variables. Here is where we run into the above-mentioned problems: correlations that exist in the original data can be wiped out. We see that here. In the original data, about 72% of individuals (approximately 63% + 9%) reported either absence of both or presence of both nausea and vomiting. In the scrambled data, this dropped to around 62% (58% + 4%). We would expect these 2 symptoms to be somewhat related, and the scrambling removed that relationship. Similarly, the original data showed lower activity levels for those with vomiting as a symptom. This pattern is gone in the scrambled data.
# cross-tabulation of 2 symptoms
tb1=table(dat$Nausea,dat$Vomit)
prop.table(tb1)*100 #as percentage
No Yes
No 62.993197 1.904762
Yes 26.258503 8.843537
tb2=table(dat_sc$Nausea,dat_sc$Vomit)
prop.table(tb2)*100
No Yes
No 58.095238 6.802721
Yes 31.156463 3.945578
# looking at possible correlation between activity level and Vomit
p1 <- dat |> ggplot(aes(x=Vomit,y=ActivityLevel)) + geom_boxplot()
plot(p1)
p2 <- dat_sc |> ggplot(aes(x=Vomit,y=ActivityLevel)) + geom_boxplot()
plot(p2)
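To put a number on the lost nausea/vomiting association, here is a small worked sketch. The counts are back-calculated from the percentages printed above, assuming 735 individuals (for instance, 62.993197% of 735 is 463). The object and function names are made up for this illustration. The sketch compares the share of individuals with matching answers for both symptoms to what one would expect if the two symptoms were completely independent.

```r
# counts reconstructed from the printed percentages (rows: Nausea, columns: Vomit)
tb_orig <- matrix(c(463, 193, 14, 65), nrow = 2,
                  dimnames = list(Nausea = c("No", "Yes"), Vomit = c("No", "Yes")))
tb_sc <- matrix(c(427, 229, 50, 29), nrow = 2,
                dimnames = list(Nausea = c("No", "Yes"), Vomit = c("No", "Yes")))

# percentage of individuals on the diagonal (No/No or Yes/Yes)
concordance <- function(tb) sum(diag(prop.table(tb))) * 100

conc_orig <- concordance(tb_orig) # about 71.8
conc_sc <- concordance(tb_sc)     # about 62.0

# expected table if the two symptoms were independent, built from the margins
expected <- outer(rowSums(tb_orig), colSums(tb_orig)) / sum(tb_orig)^2
conc_indep <- sum(diag(expected)) * 100 # about 61.7

print(c(original = conc_orig, scrambled = conc_sc, independent = conc_indep))
```

The scrambled value is very close to the value expected under independence, which is what independent shuffling of each column should produce on average.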
That means any statistical conclusions based on the scrambled data are not valid. This kind of data is only useful for testing the overall workflow and making sure everything can run, but one can’t conclude anything from it.
It is of course possible to try to scramble while preserving potential correlations, but that gets tricky, and at that stage one might as well re-create the data based on some of the concepts discussed in the previous unit.
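One simple way to keep some associations is to shuffle groups of related variables together, so that values within a group stay attached to each other. The sketch below is only one possible approach, shown on simulated data. The data frame `toy`, the probabilities used to generate it, and the grouping in `blocks` are all assumptions made for this illustration.

```r
set.seed(123)
n <- 500

# simulated data in which vomiting is more common among those with nausea
nausea <- sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.65, 0.35))
vomit <- ifelse(
  nausea == "Yes",
  sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.75, 0.25)),
  sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.97, 0.03))
)
toy <- data.frame(
  ActivityLevel = sample(0:10, n, replace = TRUE),
  Nausea = nausea,
  Vomit = vomit
)

# variables in the same block are shuffled with the same row permutation
blocks <- list("ActivityLevel", c("Nausea", "Vomit"))

toy_sc <- toy
for (b in blocks) {
  idx <- sample(nrow(toy))                # one permutation per block
  toy_sc[b] <- toy[idx, b, drop = FALSE]  # move the block's columns together
}

# the Nausea/Vomit cross-tabulation is unchanged by construction
print(table(toy$Nausea, toy$Vomit))
print(table(toy_sc$Nausea, toy_sc$Vomit))
```

The association within a block is preserved exactly, while associations between blocks (here, between activity level and the two symptoms) are still broken. Keep in mind that the larger a block gets, the more of each real individual's record stays intact, which works against the confidentiality goal.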
Summary
Scrambling the original data is generally fairly easy. It can be useful for sharing with others or with some AI system, with reduced issues of confidentiality. One can use it to test if the whole analysis workflow runs. However, one cannot test methods since one doesn’t know what patterns the models should detect, and any statistical conclusions based on the scrambled data are not meaningful.
Further Resources
The synthpop R package might be useful. It doesn’t quite preserve the original data; it’s more similar to the approach discussed in the unit Generating synthetic data based on existing data. But you can get data that is very close to the original, which might often give you what you were after when you considered the scrambling approach.