Frequently Asked Questions: Data Pipeline Pilot Project
CDC and the United States Digital Service (USDS) co-led a pilot project with the Virginia Department of Health from January to September of 2022. The project led to the creation of a prototype data processing pipeline that validates, ingests, and links data across multiple data streams so it can be used for timely public health action.
To read the full report, visit A Prototype of Modernized Public Health Infrastructure for All: Findings from a Virginia Pilot (November 2022).
This work is part of the Data Modernization Initiative that CDC is spearheading to help state, territorial, local, and tribal (STLT) health departments reduce the significant manual effort needed to access clean, analysis-ready data for public health action across multiple data sources and use cases.
The purpose of the initial six-month pilot was three-fold:
- Co-develop a prototype of a modern data processing pipeline with the Virginia Department of Health that would help them use the lab, case, and vaccine data they are already receiving to answer urgent COVID-19 public health questions with less manual effort than is currently needed.
- Use the context of this project to explore new approaches to storing, processing, and linking different incoming data streams to yield robust, enriched, analysis-ready data and actionable insights.
- Test how reusable, modular tools (a.k.a. “Building Blocks”), such as a geocoding service, could aid in more targeted and informed public health action.
This effort resulted in a working prototype of a customizable, cloud-based data pipeline composed of a “quick start” set of Building Blocks: fundamental tools that automatically process raw datasets (lab results, case reports, and vaccine records) in a single place. Within this system, data is standardized, geocoded, deduplicated, and linked to better facilitate patient- and case-level analyses. The prototype saves time and manual effort, increases data processing speed, creates a single source of truth for incoming data, and removes the need for duplicative processes.
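To make the four processing steps concrete, here is a minimal, illustrative sketch in Python. This is not the pilot's actual code; all function names, the patient-matching key, and the tiny geocoding lookup table are hypothetical assumptions used only to show how standardizing, geocoding, deduplicating, and linking can chain together.

```python
# Illustrative sketch only -- not the pilot's implementation.
# Each step below stands in for one stage described above.

def standardize(record):
    """Normalize name and phone formats so records are comparable."""
    return {
        **record,
        "name": record["name"].strip().upper(),
        "phone": "".join(ch for ch in record.get("phone", "") if ch.isdigit()),
    }

def geocode(record, lookup):
    """Attach coordinates from a (hypothetical) address lookup table."""
    record["coords"] = lookup.get(record.get("address"))
    return record

def deduplicate(records):
    """Keep one record per stream per (name, phone) pair."""
    seen = {}
    for r in records:
        seen.setdefault((r["stream"], r["name"], r["phone"]), r)
    return list(seen.values())

def link(records):
    """Group lab, case, and vaccine records that share a patient key."""
    patients = {}
    for r in records:
        patients.setdefault((r["name"], r["phone"]), []).append(r["stream"])
    return patients

# Toy inputs: the same patient arriving via two streams, with one duplicate.
lookup = {"123 MAIN ST": (37.54, -77.44)}  # hypothetical geocoding table
raw = [
    {"stream": "lab",  "name": " jane doe ", "phone": "(804) 555-0100", "address": "123 MAIN ST"},
    {"stream": "case", "name": "JANE DOE",   "phone": "8045550100",     "address": "123 MAIN ST"},
    {"stream": "lab",  "name": "jane doe",   "phone": "804-555-0100",   "address": "123 MAIN ST"},
]

processed = [geocode(standardize(r), lookup) for r in raw]
deduped = deduplicate(processed)   # duplicate lab result collapses to one
linked = link(deduped)             # lab and case records link to one patient
```

The point of the sketch is the composition: each step is a small, single-purpose function, so any one of them can be swapped out (for example, replacing the toy lookup table with a real geocoding service) without touching the rest of the pipeline.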
Though the pilot focused on Virginia’s needs, the project team used lessons learned to create reusable solutions that other state, territorial, local, or tribal (STLT) partners can use to solve similar public health data-related challenges. This approach follows CDC’s blueprint for making public health data work better, also known as the North Star Architecture.
This pilot project is a first step in the data modernization journey. There is a lot more for us to try, discover, and understand as we implement best practices at each step along the data journey from patient to public health and back. We welcome your feedback and continued participation as we work together to refine processes, develop tools and resources, and figure out next steps with the purpose of providing data for action across public health.
Piloting, Research, and Development
- Continue research to understand the needs of STLTs across the technical maturity spectrum, including how Building Blocks can improve data pipelines, and inform infrastructure recommendations.
- Design Building Blocks to be easily portable across STLTs’ systems with minimal lift.
- Test the ability of a FHIR Server to support seamless interjurisdictional data exchange alongside a Building Block-based data ingestion pipeline.
- Stand up a marketplace to house the Building Blocks as they become production-ready.
To maximize the benefit of Building Blocks, the team recommends the following foundational infrastructure tenets to all STLTs as they begin to modernize.
- Conduct a data and systems inventory to identify priority candidates for cloud migration, and begin the migration process.
- Explore options for consolidated data hosting (e.g., data lakes) and use them to consolidate and replace siloed systems.
- Develop performance monitoring for their data ingestion pipeline to better identify problems and troubleshoot them in real time.
- Maintain long-term access to raw, unprocessed data that has not yet been ingested into surveillance systems.
- Increase the use of modern data processing and analytics tools that use open technologies (e.g., open source, standards, and architecture), such as
- Open-source, SQL-based relational database management systems
- Data science and engineering languages like R and Python
- Data processing, querying, and visualization tools (e.g., Power BI, Tableau, Azure Synapse, and others)
The data pipeline is constructed from discrete “Building Blocks”: modular software services that each accomplish one specific task but can be combined into larger data processing and analysis pipelines. One might think of them as complementary ingredients that can be used in recipes from simple to complex. Using the cloud-based data pipeline allowed us to bring disparate data streams together into a single database in the cloud, standardize data elements, convert them to the Fast Healthcare Interoperability Resources (FHIR) standard, and then process all the data at once, further upstream in the data pipeline.
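The conversion step mentioned above can be illustrated with a short sketch: mapping a flat incoming demographics row onto the shape of a FHIR R4 Patient resource. This is an assumption-laden illustration, not the pilot's conversion Building Block; the input field names (`first_name`, `dob`, etc.) are hypothetical, though the output keys follow the published FHIR Patient resource structure.

```python
# Illustrative sketch -- not the pilot's code. Maps a flat record onto
# the FHIR R4 Patient resource shape (resourceType, name, birthDate, address).

def to_fhir_patient(row):
    """Convert a flat demographics row into a minimal FHIR Patient dict."""
    return {
        "resourceType": "Patient",
        "name": [{"family": row["last_name"], "given": [row["first_name"]]}],
        "birthDate": row["dob"],  # FHIR dates use ISO 8601 (YYYY-MM-DD)
        "address": [{
            "line": [row["street"]],
            "city": row["city"],
            "state": row["state"],
            "postalCode": row["zip"],
        }],
    }

# Hypothetical incoming row, e.g. parsed from a lab or case report feed.
row = {"first_name": "Jane", "last_name": "Doe", "dob": "1980-01-02",
       "street": "123 Main St", "city": "Richmond", "state": "VA", "zip": "23219"}
patient = to_fhir_patient(row)
```

Because every stream is converted to the same FHIR shape before further processing, downstream steps like deduplication and linkage only ever have to understand one data model.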
The prototype uses software architecture and design practices that make it easier to:
- Integrate with STLTs’ existing systems,
- Be shared and reused by many jurisdictional public health authorities,
- Keep up to date with security patches, feature improvements, and bug fixes, and
- Modify individual data transformation processes without having to upgrade one’s entire data ingestion system.
The data streams included in the pilot project were electronic diagnostic lab results reporting, case reporting, and immunizations.
While some of the work from this prototype is specific to Virginia and the infrastructure the team worked in (i.e., Azure), the learnings have informed larger architectural recommendations that can be applied to other STLTs in the early stages of upgrading or modernizing their infrastructure. During the next phase of work, the team will apply the learnings from the Virginia prototype to prioritize, develop, and scale modular Building Blocks with a wide range of STLT partners to solve similar public health data-related challenges.