R packages to generate synthetic data

Author

Andreas Handel

Modified

2024-03-20

Overview

This unit provides brief introductions of different R packages that can be used to generate synthetic/simulated/artificial/fake data.

Learning Objectis

  • Be familiar with several R packages that can be used to generate synthetic data.

Introduction

While it is generally not that difficult to write a bit of R code to generate your own data, sometimes it can be useful to reach for dedicated packages. R has a number of such packages. This is just a quick summary of some of those packages.

R packages to generate synthetic data

  • The coxed package allows you - among other things - to generate longitudinal survival (time-to-event) data using its sim.survdata() function.

  • The admiral package, which is part of the pharmaverse, allows users to generate data in ADaM format. ADaM is a common data format in the pharmaceutical area.

  • The simstudy package allows users to define variables and relations between them, and then have synthetic data generated based on those specifications. It can also create data with potentially complex structures, such as longitudinal or hierarchical data.

  • The synthpop package can produce synthetic data that is very similar to the original data, including potential patterns/correlations.

  • The mice package is most often used to impute missing data. But one can also use it to generate synthetic data. This tutorial provides a good introduction.

Some R helper packages

  • The CRAN Task View: Probability Distributions lists packages that can be used to generate data following various probability distributions. This can be useful if you want to generate data that has a certain shape/distribution.

Further Resources

None that come to mind. If you find other R packages that are useful for generating synthetic data, please let me know.