Proof of Concept and Technology for Hosting Bio-Surveillance Systems in the Amazon Infrastructure Cloud
Project Name: Proof of Concept and Technology for Hosting Bio-Surveillance Systems in the Amazon Infrastructure Cloud
Project Status: Completed
Point of Contact: Yury Khudyakov
Center: National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention
Keywords: Outbreak, Surveillance, NGS, Cloud, HPC, Scalability
Project Description: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood and blood products are difficult to detect and investigate because HCV infections can remain asymptomatic in >70% of infected persons for years, even decades. Thus, effective HCV outbreak investigation requires comprehensive viral hepatitis surveillance and robust case investigation. The Division of Viral Hepatitis (DVH) has recently developed the GHOST website (Global Hepatitis Outbreak and Surveillance Technology), which hosts our validated pipeline for the Advanced Molecular Detection (AMD) of hepatitis C outbreaks.
Briefly, this pipeline allows rapid and cost-effective identification of transmission clusters by integrating epidemiological evidence, next generation sequencing (NGS), and data analysis. The analytical methods generate an output in the form of a simple transmission network that shows plainly which cases are linked by transmission and identifies the source of the outbreak, if the source was sampled. However, all three steps of our pipeline (Quality Control, Analysis, and Visualization) are extremely computationally intensive, so it is imperative that we adapt the pipeline to a High-Performance Computing (HPC) environment.
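As an illustration only, the idea of a transmission network can be sketched as a graph in which cases are nodes and each detected transmission link is an edge, so that a transmission cluster is a connected group of cases. The sketch below assumes the links have already been determined by the analysis step. The function name, case identifiers, and links are hypothetical and are not taken from the pipeline.

```python
from collections import defaultdict


def transmission_clusters(cases, links):
    """Group cases into clusters, i.e. connected components of the link graph.

    `links` is a list of (case_a, case_b) pairs that an upstream analysis
    (not shown here) has already judged to be linked by transmission.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    clusters = []
    for case in cases:
        if case in seen:
            continue
        # Depth-first walk over everything reachable from this case.
        stack = [case]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        clusters.append(sorted(component))
    return clusters


# Hypothetical example: five suspected cases, two detected links.
cases = ["P1", "P2", "P3", "P4", "P5"]
links = [("P1", "P2"), ("P2", "P3")]
clusters = transmission_clusters(cases, links)
linked = [c for c in clusters if len(c) > 1]
```

In this example P1, P2, and P3 form one cluster, while P4 and P5 remain unlinked. Identifying the source of an outbreak would need directional information, which this sketch does not model.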
The goal of the proposal is to migrate and adapt this pipeline to the cloud. This would make it available to public health laboratories, enabling them to identify outbreaks by simply uploading viral sequences to an online tool provided by CDC. Pathogen outbreaks usually occur in bursts, with calm periods of inactivity followed by frantic emergencies that suddenly demand substantial laboratory and computational capacity. This pattern is well suited to the cloud, where running the pipeline in parallel can deliver a fast response to public health laboratories while reducing infrastructure and maintenance costs.
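A minimal sketch of what running the pipeline in parallel could look like, assuming each uploaded sample can be processed independently of the others. The `process_sample` function is a hypothetical stand-in for a full pipeline run: here it only counts reads that pass a toy filter. A thread pool is used to keep the example small. Compute-heavy steps would more likely be spread across processes or separate cloud instances.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Hypothetical stand-in for one pipeline run on one uploaded sample.

    Keeps reads that are at least 5 bases long and contain no ambiguous
    base 'N', and returns how many reads survived.
    """
    name, reads = sample
    kept = [r for r in reads if len(r) >= 5 and "N" not in r]
    return name, len(kept)


def run_batch(samples, max_workers=4):
    """Process independent samples concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_sample, samples))


# Hypothetical uploads from two laboratories.
samples = [
    ("lab_A_01", ["ACGTAC", "ACGNAC", "ACG"]),
    ("lab_A_02", ["TTGACA", "TTGACC"]),
    ("lab_B_01", []),
]
results = run_batch(samples)
```

Because no sample depends on another, the number of workers can grow during an outbreak and shrink again afterwards without changing the per-sample code.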
Objectives:
- Making use of standard Amazon Cloud Services, adapt current algorithms for rapid data processing
  o Develop and document repeatable processes to adapt existing application code to make best use of cloud-specific capabilities, specifically including load-balancing and auto-scaling, to improve performance and reduce costs
  o Demonstrate the ability to operate systems in accordance with existing security and governance processes in the cloud environment:
    - Patching
    - Monitoring
    - Availability/recovery
- Use of cloud-hosted HPC capabilities and parallel algorithm code to achieve improved response times over other operational environments
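To make the auto-scaling objective concrete, the following sketch shows one simple sizing rule: choose the number of workers from the number of pending analysis jobs, bounded by a minimum and a maximum. It is an assumption for illustration, not the project's configuration, and it does not call any Amazon API. The function name, parameters, and default values are all hypothetical.

```python
import math


def desired_workers(pending_jobs, jobs_per_worker=10, min_workers=1, max_workers=50):
    """Return how many workers to run for the current backlog.

    All thresholds are illustrative assumptions: one worker is expected to
    handle `jobs_per_worker` pending jobs, and the result is clamped to the
    range [min_workers, max_workers].
    """
    needed = math.ceil(pending_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


# Quiet period, moderate load, and a burst during an outbreak.
quiet = desired_workers(0)
moderate = desired_workers(95)
burst = desired_workers(5000)
```

With these assumed defaults, a quiet period keeps a single worker running, 95 pending jobs call for 10 workers, and a burst of 5000 jobs is capped at 50 workers. The lower bound keeps the service responsive between outbreaks, and the upper bound limits cost.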
Background: Hepatitis C is a major public health problem in the United States and worldwide. HCV infects approximately 3.5 million persons in the United States and 180 million persons worldwide. Outbreaks of hepatitis C are mostly associated with unsafe injection practices, drug diversion, and other exposures to blood and blood products. They are difficult to detect and investigate because HCV infection remains asymptomatic in >70% of infected persons for years, even decades, after initial infection. DVH has recently developed and validated a pipeline for AMD of hepatitis C outbreaks. Briefly, this pipeline allows for the rapid and cost-effective identification of transmission clusters by integrating epidemiological evidence, NGS, and data analysis. The pipeline includes novel experimental and analytical methods that improve the accuracy of transmission detection and allow for a 50- to 100-fold reduction in cost per specimen, a 4- to 8-fold reduction in processing time per specimen, and a 20- to 50-fold increase in the number of specimens tested. The analytical methods generate an output in the form of a simple graph that shows plainly which cases (out of the suspected cases analyzed) are linked by transmission and identifies the source of the outbreak, if the source was sampled.
Public Health Impact: Previously, hepatitis outbreak detection was an expensive, slow, and CDC-centralized procedure. It required significant expertise in bioinformatics and computer science, which made these methods difficult for public health practitioners to use. Our current pipeline integrates all steps of this procedure into a single analytical tool that requires no experience in bioinformatics. The availability of tools for detecting HCV transmission will foster deeper involvement of public health researchers and practitioners in hepatitis C outbreak investigation in the United States and worldwide. Simple and efficient experimental and analytical tools reduce the cost of molecular testing, making outbreak investigations affordable to many laboratories. Improved molecular detection capacity will also increase the rate at which transmissions are detected in the United States, providing an opportunity for rapid and effective response to outbreaks of hepatitis C. Finally, detecting the direction of transmission with these tools improves the accuracy of outbreak investigations by indicating the potential source of infections.
Expected results: This project improves CDC’s capability to efficiently detect outbreaks of hepatitis C using NGS and novel computational tools and to respond to them rapidly, enabling fast identification of transmission clusters and coordination among public health laboratories. Beyond hepatitis C, the modular framework has two important properties that make it readily adaptable to other pathogens: (i) It incorporates a module for NGS quality control and data processing, a step shared by most applications of NGS technology. This module can be readily used by any public health laboratory, regardless of its specific task. (ii) Our analytical framework can be applied to any heterogeneous virus. For instance, we are currently applying the same framework to hepatitis A virus, hepatitis B virus, and hepatitis E virus, which are completely different viruses. Although each Center will need to create new modules incorporating its relevant back-end computations, our proposed framework will have a place for such modules and will provide a starting workflow for rapid response.