Dataset Selection

What is Dataset Selection?

Dataset selection refers to the process of determining the appropriate data type and source. It also covers using the proper instrumentation and processes to gather the data.

Researchers are often tasked with narrowing down the information in a dataset to what is useful and pertinent for their specific project. In essence, they must balance the time, resources, and money needed to sift through datasets to find that information.

Examples of datasets include:

  • CMS National Medicare and Medicaid health care claims and enrollment data
  • New Jersey Medicaid claims data
  • State administrative data (e.g., Department of Corrections)
  • Commercial pharmacy data (IQVIA)
  • Publicly available community-level datasets (e.g., HRSA (Health Resources and Services Administration) Area Health Resources Files)
  • Mathematica

Mechanisms for Working with Big Data Sets

When working with large datasets, time is of the essence. This applies not only to the time needed to sort through the data but, even more, to data use agreements. Certain datasets must be requested in advance under a data use agreement, a contract governing the transfer of data between government, non-profit, or private institutions. The data may be sensitive and restricted in who may access it, so investigators should request a data use agreement early enough that the data is available when the study needs it.

Another common challenge is the storage of sensitive data, which raises questions of institutional security, server access, and even IT support. Once the data has been received, it must be integrated safely into the system, with access restricted to authorized individuals for the purpose of research. Once the data has been safely stored, it can be modified to create extracts for further analysis.

Not only is it important to condense large datasets; it is also important to check the data for accuracy and applicability to the research question. Data validation is an integral step of research. Issues to watch for when checking data include:

  • Duplicated individuals and
  • Blanks and invalid
  • Out of range data
  • Incomplete data
  • Source variables regarding inclusion and exclusion
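
Several of the checks above can be sketched in a few lines of Python. This is only an illustration: the field names (person_id, age, state), the sample rows, and the 0 to 120 age range are assumptions made for the example, not part of any real dataset.

```python
from collections import Counter

# Hypothetical extract; field names and values are illustrative only.
records = [
    {"person_id": "A1", "age": 34, "state": "NJ"},
    {"person_id": "A2", "age": None, "state": "NJ"},  # blank age
    {"person_id": "A1", "age": 34, "state": "NJ"},    # duplicated individual
    {"person_id": "A3", "age": 212, "state": ""},     # out-of-range age, blank state
]

REQUIRED_FIELDS = ("person_id", "age", "state")
AGE_RANGE = (0, 120)  # assumed plausible bounds for this example


def validate(rows):
    """Return a dict mapping each check name to the row indexes that fail it."""
    id_counts = Counter(row.get("person_id") for row in rows)
    issues = {"duplicate_id": [], "blank": [], "out_of_range": []}
    for i, row in enumerate(rows):
        # Duplicated individuals: the same identifier appears more than once.
        if id_counts[row.get("person_id")] > 1:
            issues["duplicate_id"].append(i)
        # Blanks: a required field is missing, None, or an empty string.
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            issues["blank"].append(i)
        # Out-of-range data: a numeric age outside the assumed bounds.
        age = row.get("age")
        if isinstance(age, (int, float)) and not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            issues["out_of_range"].append(i)
    return issues


report = validate(records)
print(report)
```

Reporting row indexes rather than silently dropping rows leaves the decision about each flagged record (correct, exclude, or keep) with the researcher.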

Once data has been checked, it can be linked with existing data. Linking allows researchers to look at connections between datasets in order to develop more complex research questions and hypotheses. It gives researchers access to a wider range of information that may advance their study, and it can offer a different perspective on the data, leading to conclusions that might not have emerged had the data not been linked. However, there are challenges to be aware of, including missing data, differing variable formats across datasets, and different measurement periods.
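
As a rough illustration of linking, the sketch below joins two small extracts on a shared identifier and reports the records that could not be matched. The datasets, the person_id key, and the normalization rule are all hypothetical, and the sketch does not address differing measurement periods.

```python
# Hypothetical extracts; names and values are illustrative only.
claims = [
    {"person_id": "A1", "total_paid": 1200.0},
    {"person_id": " a2 ", "total_paid": 310.5},  # same person, different ID format
    {"person_id": "A4", "total_paid": 95.0},     # no match in the second dataset
]
area_data = [
    {"person_id": "A1", "county": "Essex"},
    {"person_id": "A2", "county": "Camden"},
]


def normalize_id(raw):
    """Strip whitespace and upper-case an ID so ' a2 ' and 'A2' compare equal."""
    return str(raw).strip().upper()


def link(left_rows, right_rows, key="person_id"):
    """Join left_rows to right_rows on key; return (linked rows, unmatched left IDs)."""
    right_by_id = {normalize_id(row[key]): row for row in right_rows}
    linked, unmatched = [], []
    for row in left_rows:
        match = right_by_id.get(normalize_id(row[key]))
        if match is None:
            unmatched.append(row[key])  # missing data: keep track, do not guess
        else:
            merged = {**match, **row}
            merged[key] = normalize_id(row[key])
            linked.append(merged)
    return linked, unmatched


linked, unmatched = link(claims, area_data)
print(linked)
print(unmatched)
```

Returning the unmatched identifiers alongside the linked rows makes the amount of missing data visible before any analysis is run on the linked file.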

CMS Data Analysis Considerations

Several obstacles must be overcome when using datasets from the Centers for Medicare and Medicaid Services (CMS). These include lengthy application processes and the need to implement proper data security protocols and compliance measures. CMS data often lacks detailed clinical and social information, leaving gaps that researchers may find hard to bridge in parts of their study. Additionally, statewide data can involve many different racial and ethnic variables in how the data is reported, which may make findings much harder to generalize. An important resource for investigators is ResDAC (the Research Data Assistance Center), an online resource for finding, requesting, and using CMS data that helps streamline data analysis.

Supplemental datasets that researchers can integrate with other datasets, including CMS data, include:

  • IPUMS (Integrated Public Use Microdata Series) census microdata from the University of Minnesota
  • Area Health Resources Files from the Health Resources & Services Administration
  • Behavioral Risk Factor Surveillance System Survey Data
  • National Survey on Drug Use and Health (NSDUH)
  • National Health and Nutrition Examination Survey (NHANES)
  • National Epidemiologic Survey on Alcohol and Related Conditions III (NESARC-III)
  • National Health Interview Survey (NHIS)
  • National Ambulatory Medical Care Survey (NAMCS)
  • Medical Expenditure Panel Survey (MEPS)
  • Medicare Part D Prescribers

Federal Regulations Regarding Data Retention and Record Keeping

HHS regulations require investigators to retain research-related records. The institution must maintain research-related records, as well as records of IRB activities, for at least three years following completion of the research (45 CFR 46.115(b)). Examples of research-related records that must be retained for three years include documentation of informed consent (signed informed consent form, short form, or signed assent form) and project-related materials such as the written research summary. If the IRB waives the requirement for informed consent or for documentation of informed consent, the standard three-year retention of consent documentation may not apply (45 CFR 46.117). However, depending on the type of dataset and the applicable regulatory requirements, the dataset may need to be retained for a longer period.

If investigators are required to retain certain records on behalf of HHS and the IRB, the records must be retained in an appropriate manner. Potential methods of retention include hard copy, electronic, or other media (e.g., secure cloud storage or an encrypted hard drive). The records must be accessible for inspection and copying by the IRB or an authorized representative of HHS.

HIPAA requirements state that research records containing Protected Health Information (PHI), such as signed consent forms, must be retained for six years from the date on which the subject signed the consent form or the date on which it was last in effect, whichever is later.
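
The two retention periods described above reduce to simple date arithmetic, sketched below. The function names are hypothetical, and the sketch simplifies the rules as stated in this section; it computes only the earliest end of each retention period, and institutional or sponsor requirements may extend it.

```python
from datetime import date


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Raised only when d is Feb 29 and the target year is not a leap year.
        return d.replace(year=d.year + years, month=2, day=28)


def hhs_retention_end(completion_date):
    """Three years after completion of the research (45 CFR 46.115(b))."""
    return add_years(completion_date, 3)


def hipaa_retention_end(signed_date, last_in_effect_date):
    """Six years after the later of the signing date and the last-in-effect date."""
    return add_years(max(signed_date, last_in_effect_date), 6)


# Example with made-up dates.
print(hhs_retention_end(date(2021, 6, 30)))
print(hipaa_retention_end(date(2020, 1, 15), date(2022, 3, 1)))
```

When both rules apply to the same record, the later of the two dates is the one that governs how long it is kept.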