Generating high-fidelity synthetic time-to-event datasets to improve data transparency and accessibility

Smith, Aiden; Lambert, Paul C; Rutherford, Mark J

journal contribution

posted on 2022-07-08, 12:20 authored by Aiden Smith, Paul C Lambert, Mark J Rutherford

Background

The lack of data and statistical code published alongside journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the dissemination of individual-level data to accompany published manuscripts. Realistic, high-fidelity synthetic time-to-event data can help accelerate methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to those on which the methods were developed.

Methods

We present methods to accurately emulate the covariate patterns and survival times found in real-world datasets using synthetic data techniques, without compromising patient privacy. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which survival times are generated conditional on each individual's covariate pattern. We recreate the administrative censoring mechanism using the last observed follow-up date in the original dataset. We also present metrics for evaluating the accuracy of the synthetic data and the non-identifiability of individuals from the original dataset.
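The three steps described above (sequential conditional covariate models, survival times conditional on covariates, and administrative censoring) can be illustrated with a minimal sketch. It is not the authors' implementation: a Weibull proportional hazards model is swapped in for the flexible parametric survival model to keep the example short, and every covariate, coefficient, and date in it (`STUDY_END`, `simulate_covariates`, `simulate_survival_time`, `simulate_dataset`) is a hypothetical value chosen for illustration rather than an estimate from the colon cancer data.

```python
import math
import random

STUDY_END = 10.0  # hypothetical administrative censoring date, in years since study start


def simulate_covariates(rng):
    """Draw covariates one at a time, each conditional on those already drawn."""
    age = rng.gauss(70.0, 10.0)  # age at diagnosis (hypothetical marginal model)
    # Sex given age: logistic regression with made-up coefficients.
    p_female = 1.0 / (1.0 + math.exp(-(-0.1 + 0.01 * (age - 70.0))))
    female = 1 if rng.random() < p_female else 0
    # Stage given age and sex: logistic regression with made-up coefficients.
    p_late = 1.0 / (1.0 + math.exp(-(-0.4 + 0.02 * (age - 70.0) - 0.2 * female)))
    late_stage = 1 if rng.random() < p_late else 0
    return age, female, late_stage


def simulate_survival_time(age, female, late_stage, rng):
    """Inverse-transform draw from a Weibull proportional hazards model.

    S(t | x) = exp(-lam * t**gam * exp(xb)), so
    t = (-log(u) / (lam * exp(xb))) ** (1 / gam) for u uniform on (0, 1].
    """
    lam, gam = 0.1, 0.8  # hypothetical baseline scale and shape
    xb = 0.03 * (age - 70.0) - 0.1 * female + 0.9 * late_stage
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return (-math.log(u) / (lam * math.exp(xb))) ** (1.0 / gam)


def simulate_dataset(n, seed=2022):
    """Generate n synthetic records with administrative censoring at STUDY_END."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age, female, late_stage = simulate_covariates(rng)
        entry = rng.uniform(0.0, 5.0)  # hypothetical diagnosis date within a 5-year window
        t = simulate_survival_time(age, female, late_stage, rng)
        max_follow_up = STUDY_END - entry  # time available before the study closes
        rows.append({
            "age": age,
            "female": female,
            "late_stage": late_stage,
            "entry": entry,
            "time": min(t, max_follow_up),
            "event": 1 if t <= max_follow_up else 0,
        })
    return rows
```

Calling `simulate_dataset(9064)` would return one dictionary per synthetic patient. In a real application the regression coefficients would be estimated from the original data before any values are drawn, which is the part this sketch leaves out.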

Results

We create a synthetic version of an example colon cancer dataset of 9064 patients. The synthetic dataset shows good similarity to the original data in both covariate distributions and survival times, without containing any exact information from the original data, and can therefore be published openly alongside research.

Conclusions

We evaluate the effectiveness of the methods for constructing synthetic data and provide evidence that there is minimal risk that a given patient from the original data could be identified from their individual unique patient information. Synthetic datasets produced using this methodology could be made available alongside published research without breaching data privacy protocols, allowing data and code to accompany methodological or applied manuscripts and greatly improving the transparency and accessibility of medical research.

Funding

Cancer Research UK (Grant Number C41379/A27583)

National Institute for Health Research (NIHR) Applied Research Collaboration East Midlands (ARC EM)

History

Author affiliation

Department of Health Sciences, University of Leicester

Version

VoR (Version of Record)

Published in

BMC Medical Research Methodology

Volume

22

Issue

1

Pagination

176

Publisher

Springer Science and Business Media LLC

issn

1471-2288

eissn

1471-2288

Acceptance date

2022-06-06

Copyright date

2022

Available date

2022-07-08

Publisher DOI

https://doi.org/10.1186/s12874-022-01654-1

Spatial coverage

England

Language

eng

Publisher version

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01654-1

Keywords

Humans; Survival Analysis; Reproducibility of Results; Biomedical Research

Licence

CC BY 4.0
