Linked Data Answers America’s Complex Health Questions

Connected datasets deliver new insights that can improve people’s health

Linking data from multiple sources enables scientists and policymakers to answer complex health questions relevant to everyone in America. Below are two highlights from CDC’s recent data linkage work.

Informing the future of data linkage

Through its innovative data linkage program, CDC’s National Center for Health Statistics (NCHS) has been hard at work making connections between a wide array of data sources. This work is already providing a more complete picture of health to address health disparities and delivering new resources to address emerging public health issues.

For example, data linkage projects are enhancing algorithms through data science tools like machine learning. Recently, the data linkage program conducted the first-ever linkage between two very different data sources:

  • NCHS’ survey data
  • Department of Veterans Affairs (VA) administrative data

Because the administrative data were not designed for research purposes, NCHS and VA worked closely together to develop new methods for merging the different types of files. By enhancing the linkage algorithms, they produced a linked dataset that can now be analyzed to answer key health-related questions that neither source could answer alone.
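To make the idea of merging differently structured files more concrete, here is a minimal sketch of score-based record matching. It is an illustration only, not the method NCHS and VA developed: the field names, weights, threshold, and sample records are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and cutoff, chosen only for illustration.
WEIGHTS = {"last_name": 0.4, "birth_date": 0.4, "zip_code": 0.2}
MATCH_THRESHOLD = 0.85


def field_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity after trimming whitespace and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0  # a missing value contributes no evidence of a match
    return SequenceMatcher(None, a, b).ratio()


def match_score(survey_rec: dict, admin_rec: dict) -> float:
    """Weighted average of per-field similarities."""
    return sum(
        weight * field_similarity(survey_rec.get(field, ""), admin_rec.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def link_records(survey: list, admin: list) -> list:
    """Pair each survey record with its best-scoring administrative record,
    keeping the pair only if the score clears the threshold."""
    links = []
    for s in survey:
        best = max(admin, key=lambda a: match_score(s, a), default=None)
        if best is not None and match_score(s, best) >= MATCH_THRESHOLD:
            links.append((s["survey_id"], best["admin_id"]))
    return links


# Made-up records; the administrative file formats the name differently.
survey = [
    {"survey_id": "S1", "last_name": "Johnson", "birth_date": "1950-03-14", "zip_code": "30301"},
    {"survey_id": "S2", "last_name": "Nguyen", "birth_date": "1962-11-02", "zip_code": "98101"},
]
admin = [
    {"admin_id": "A9", "last_name": "JOHNSON ", "birth_date": "1950-03-14", "zip_code": "30301"},
    {"admin_id": "A7", "last_name": "Smith", "birth_date": "1971-07-30", "zip_code": "10001"},
]

links = link_records(survey, admin)
print(links)
```

Normalizing fields before comparing them is what lets the differently formatted name still match, while the threshold leaves the second survey record unlinked rather than forcing a poor match.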

This linkage expands the research potential to study critically important aspects of veterans’ health and health outcomes. It provides a unique opportunity to examine the factors that influence disability, chronic disease, and healthcare utilization and expenditures among veterans enrolled in the VA healthcare system.

Importantly, the project also created an infrastructure that can be used for future linkages, enabling the government to move quickly to combine data sources whenever a public health threat arises.

Shining a light on how COVID-19 vaccines protect Americans’ health

Linking data in new ways is helping us understand how COVID-19 vaccines protect people’s health in the real world.

Before COVID-19, vaccine effectiveness studies required extensive follow-up with patients and their providers to obtain vaccine histories, a process that was both time- and labor-intensive. The COVID-19 Public Health Emergency and CDC Provider Agreement drastically increased the quality of data in state and local vaccine registries, also known as immunization information systems (IIS). Providers, including large healthcare systems and electronic health record (EHR) systems, set up bi-directional linkages with state and local IIS, allowing data on doses received to be automatically transferred between the systems.

To better understand how COVID-19 vaccines work, CDC found automated ways to improve the quality of the data. For example, CDC partnered with large healthcare systems that had bi-directional linkage with state and local IIS. With these connections, we were able to routinely assess vaccine effectiveness of the primary series and monovalent and bivalent boosters. The scale and timeliness of this effort were only possible due to linkages between IIS and EHR systems.

In addition, many state and local health departments can link case surveillance data to their IIS data and identify the vaccination status of people who test positive for COVID-19. CDC has partnered with these jurisdictions, which represent a large proportion of the U.S. population and all regions of the country, to monitor rates of COVID-19 cases and deaths by vaccination status.

Findings from these efforts have shown that:

  • People who were vaccinated have an overall lower risk of dying from COVID-19 and of testing positive for SARS-CoV-2 compared with people who were unvaccinated.
  • People who have been vaccinated with an updated (bivalent) booster dose have lower rates of dying from COVID-19, and slightly lower rates of testing positive for COVID-19, compared with people who were vaccinated but had not received an updated booster dose.
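Comparisons like these rest on rates computed separately for each vaccination group. The sketch below shows the basic arithmetic with made-up counts; it is not CDC's analysis, and a real comparison would also need to account for factors such as age and time since vaccination. The function names and numbers are hypothetical.

```python
def rate_per_100k(events: int, population: int) -> float:
    """Crude rate of events per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events * 100_000 / population


def rate_ratio(events_a: int, pop_a: int, events_b: int, pop_b: int) -> float:
    """Ratio of group A's rate to group B's rate."""
    rate_b = rate_per_100k(events_b, pop_b)
    if rate_b == 0:
        raise ValueError("comparison group has a zero rate")
    return rate_per_100k(events_a, pop_a) / rate_b


# Made-up counts for illustration only; not real surveillance data.
unvaccinated = {"deaths": 120, "population": 400_000}
vaccinated = {"deaths": 60, "population": 1_000_000}

unvaccinated_rate = rate_per_100k(unvaccinated["deaths"], unvaccinated["population"])
vaccinated_rate = rate_per_100k(vaccinated["deaths"], vaccinated["population"])
ratio = rate_ratio(
    unvaccinated["deaths"], unvaccinated["population"],
    vaccinated["deaths"], vaccinated["population"],
)
print(unvaccinated_rate, vaccinated_rate, ratio)
```

A ratio above 1 means the first group has the higher rate. Dividing by each group's population is the step that depends on linked data, because it requires knowing the vaccination status of people who did and did not become cases.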

CDC has also been able to use privacy-preserving record linkage (PPRL) technologies to link de-identified vaccination data reported by U.S. pharmacies participating in the Federal Retail Pharmacy Program. This linkage enabled the use of de-identified clinical data to study the association between COVID-19 vaccination status and health outcomes. Studies such as this Clinical Infectious Diseases article, which examined the relative effectiveness of three different COVID-19 vaccination series, demonstrate the utility of PPRL for combining disparate de-identified healthcare datasets to advance clinical and public health research.
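One common idea behind PPRL is that each data holder replaces identifying fields with a keyed hash, so records can be joined on the resulting token without either side sharing names or birth dates. The sketch below illustrates that idea under stated assumptions; it is not the technology CDC used, and the key, field choices, and records are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret. A real deployment would manage keys securely
# and would likely use a more elaborate tokenization scheme.
SHARED_KEY = b"example-key-not-for-production"


def make_token(first_name: str, last_name: str, dob: str) -> str:
    """Build a keyed SHA-256 hash from normalized identifiers."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, dob)
    )
    return hmac.new(SHARED_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_by_token(left: list, right: list) -> list:
    """Join two de-identified record lists on their shared token."""
    right_by_token = {rec["token"]: rec for rec in right}
    linked = []
    for rec in left:
        match = right_by_token.get(rec["token"])
        if match is not None:
            linked.append({**rec, **match})
    return linked


# Each data holder tokenizes locally and shares only the token plus
# non-identifying fields. All values here are made up.
pharmacy = [
    {"token": make_token("Ana", "Lopez", "1980-01-02"), "doses": 3},
    {"token": make_token("Raj", "Patel", "1975-06-30"), "doses": 2},
]
clinical = [
    {"token": make_token(" ana ", "LOPEZ", "1980-01-02"), "hospitalized": False},
]

linked = link_by_token(pharmacy, clinical)
print(linked)
```

Because both sides normalize identifiers the same way before hashing, the same person yields the same token in each dataset, and the linked record contains no names or birth dates.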

CDC continues to monitor how well the vaccines are working in the real world and uses these data to inform vaccine policy. However, the end of the Public Health Emergency and the beginning of COVID-19 vaccine commercialization will affect the number of vaccination providers who enter data into the IIS, and therefore our ability to access timely data to understand vaccine effectiveness.