Please read the National COVID Cohort Collaborative’s article in medRxiv titled “Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).”
COVID-19 has illustrated the need to disseminate accurate, timely, and useful epidemiologic public health data, especially data related to ongoing pandemics or pandemic preparedness. It has also highlighted the need to protect the privacy of individuals. The National COVID Cohort Collaborative (N3C) was created to share and harmonize individual-level electronic health record (EHR) data into a single data set. The N3C has received, ingested, harmonized, and characterized data from across the United States (US). To balance data access and privacy, N3C created two levels of data sets: (1) the limited data set (LDS), which has 16 HIPAA Privacy Rule direct identifiers removed but retains dates and zip codes, and (2) synthetic data, which are computationally derived from the LDS to mimic its statistical distributions, covariance, and higher-order interactions. Synthetic data generation can potentially protect privacy because synthetic data rows are not directly tied to the original source data. Pending a pilot study and privacy validation, synthetic data sets are the only data under consideration to be shared outside of the N3C enclave. The full article is cited below.
Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). Thomas JA, Foraker RE, Zamstein N, Payne PRO, Wilcox AB; N3C Consortium. medRxiv. 2021 Jul 8:2021.07.06.21259051. PMID: 34268525; PMCID: PMC8282114; DOI: 10.1101/2021.07.06.21259051