Natural Language Processing Workbench Web Services

Overview of Project Activities
The NLP Workbench project will include five steps: environmental scan; stakeholder engagement, requirements gathering, and technical design; prototype development; pilot testing; and release.

In the United States, central cancer registries collect cancer data from sources such as hospitals, laboratories, physicians’ offices, and independent diagnostic and treatment centers. While Meaningful Use and other activities have increased the use of standardized electronic health record (EHR) systems, some parts of medical records, laboratory reports, and other clinical reports are still free-form, unstructured (narrative) text.

Computers cannot process narrative text automatically; human intervention is required to extract the critical pieces of information needed to complete a cancer case report. Similarly, trained abstractors must retrieve and code the appropriate elements from the clinical narrative text submitted to the U.S. Food and Drug Administration’s (FDA’s) spontaneous reporting systems for drugs, vaccines, and blood products.

Abstracting these data manually is labor-intensive and expensive. In addition, a diminishing workforce and an increased demand for timely and accurate data stored in narrative text create challenges. The unstructured narrative text in pathology, post-market, biomarker, and EHR reports contains data that researchers need to study overall population health and quality of patient care.

The use of natural language processing (NLP) will increase the completeness, timeliness, and accuracy of data while reducing the level of human intervention needed to identify critical data in narrative text.

The Assistant Secretary for Planning and Evaluation’s Patient-Centered Outcomes Research (PCOR) Trust Fund funded FDA and CDC for two years to develop an NLP Workbench on a shared Web service platform for PCOR researchers, as well as public health agencies at all levels. The NLP Workbench will provide free access to NLP and machine learning tools to develop and share language models and other algorithms that convert unstructured clinical text to coded data. The NLP Workbench will consist of open-source architectures and tools that any public health agency can use to develop NLP services, and will be hosted initially on CDC’s Innovation Research and Development lab.
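As a rough illustration of what "converting unstructured clinical text to coded data" can mean, the sketch below maps terms found in a narrative pathology sentence to codes using a small lookup table. It is a minimal, rule-based example and not the NLP Workbench's method; the function names, the `CodedReport` structure, the sample sentence, and the lookup tables are all assumptions made for illustration, and the ICD-O-3-style code values are included only as examples.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables, for illustration only. A real service would rely
# on a trained model or a complete terminology, not a hand-written dictionary.
SITE_CODES = {
    "breast": "C50.9",
    "prostate": "C61.9",
    "lung": "C34.9",
}
HISTOLOGY_CODES = {
    "infiltrating duct carcinoma": "8500/3",
    "adenocarcinoma": "8140/3",
}


@dataclass
class CodedReport:
    """Structured output for one narrative report (hypothetical shape)."""
    site: Optional[str]
    histology: Optional[str]


def _first_match(text: str, table: dict) -> Optional[str]:
    """Return the code for whichever term in `table` appears earliest in `text`."""
    best = None
    for term, code in table.items():
        match = re.search(r"\b" + re.escape(term) + r"\b", text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), code)
    return best[1] if best else None


def code_report(narrative: str) -> CodedReport:
    """Convert a narrative sentence into coded fields via dictionary lookup."""
    text = narrative.lower()
    return CodedReport(
        site=_first_match(text, SITE_CODES),
        histology=_first_match(text, HISTOLOGY_CODES),
    )


report = "Left breast, core biopsy: infiltrating duct carcinoma, grade 2."
print(code_report(report))
```

A dictionary lookup like this fails on negation ("no evidence of carcinoma"), misspellings, and synonyms, which is part of why shared language models and machine learning tools are needed in place of hand-written rules.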

A diagram shows the suggested architecture of the NLP Workbench, which will serve two types of users.
This pilot project will use machine learning techniques to improve extraction and automatic coding of cancer data from pathology reports, using a shared, open-source model that can be expanded.
Clinical and temporal information will be extracted from medical safety reports using natural language processing.
More Information

Stakeholder Meeting

CDC and FDA hosted the first quarterly web call on April 26, 2017, to share progress and gather input from interested stakeholders. Please see the list of questions and answers [PDF-24KB] that were discussed during the call.

Related Research

Kreimeyer K, Foster M, Pandey A, Arya N, Halford G, Jones SF, Forshee R, Walderhaug M, Botsis T. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. Journal of Biomedical Informatics 2017.

Contact Us

For more information about this project or to join the NLP stakeholder meetings, please send e-mail to