Natural Language Processing Workbench Web Services

Overview of Project Activities
The NLP Workbench project included five steps: environmental scan; stakeholder engagement, requirements gathering, and technical design; prototype development; pilot testing; and release.

This pilot project converts narrative clinical text to coded data automatically.

Central cancer registries gather cancer-related data from a variety of sources. Some parts of these data—like medical records, laboratory reports, and other clinical reports—are unstructured or narrative text.

This text is reviewed manually to identify reportable cancer cases and code key data elements. As a result, there is a delay between the time when cancer is diagnosed and when this information is available to the cancer registry.

This project focused on how CDC is using natural language processing (NLP) strategies to automate this manual review. CDC completed the 2-year project in 2019.

Workbench: A framework that supports the production of software by integrating a variety of activities to meet a specific need while limiting or eliminating the need for multiple programming languages.

Natural language processing: The technology used to help computers understand human language. In this case, it refers to the words in a laboratory report or medical record.

Machine learning: An application of artificial intelligence that allows information-processing systems to learn and improve automatically by using previous results, without being specifically programmed by a person.

Dictionary-Based Approach

In the past, CDC’s eMaRC Plus software used a dictionary-based NLP approach that required inclusion of every possible term, abbreviation, or representation of a reportable cancer before the software could find a case. Unfortunately, the number of possibilities for these terms and representations is nearly infinite. In addition, common errors like misspelling, transposed letters, or extra white space could prevent the software from recognizing and processing a case.

Keeping the dictionary updated so the software would not miss valid pathology reports was a challenge.
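
The limitation described above can be illustrated with a minimal sketch of dictionary-based matching. This is not the eMaRC Plus implementation. The term list, the function names (`normalize`, `find_reportable_terms`), and the sample report text are all hypothetical and chosen only to show why exact dictionary lookup is brittle.

```python
import re

# Hypothetical, tiny term dictionary for illustration only.
# A real dictionary would need many thousands of terms and variants.
REPORTABLE_TERMS = {"adenocarcinoma", "carcinoma", "melanoma", "ca in situ"}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def find_reportable_terms(report, terms=REPORTABLE_TERMS):
    """Return the dictionary terms that occur as whole words or phrases."""
    text = normalize(report)
    found = []
    for term in sorted(terms):
        # \b anchors keep "carcinoma" from matching inside "adenocarcinoma".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.append(term)
    return found


# An exact spelling is found.
print(find_reportable_terms("Invasive adenocarcinoma of the colon."))
# Transposed letters: nothing in the dictionary matches.
print(find_reportable_terms("Invasive adenocarcinmoa of the colon."))
# A stray space inside a word also defeats the lookup.
print(find_reportable_terms("Malignant mela noma, left arm."))
```

Normalizing whitespace between words rescues some inputs, but every misspelling or split word would still need its own dictionary entry, which is the maintenance burden the text describes.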

Machine-Learning Approach

Another method is a statistical NLP approach using supervised machine learning to account for the complexity and variation of electronic pathology reports. This approach works best if there is a large volume of valid training documents from many laboratories and pathologists in the United States.
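
As a rough illustration of supervised learning on report text, the sketch below trains a small bag-of-words naive Bayes classifier. Naive Bayes is used here only because it is compact. The source does not say which statistical model the project used. The class name, the labels, and the six training sentences are invented for this example, and a real system would need the large, varied training set the paragraph above calls for.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesReportClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, reports, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)      # label -> number of reports
        for report, label in zip(reports, labels):
            self.word_counts[label].update(tokenize(report))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, report):
        tokens = tokenize(report)
        total_reports = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.label_counts[label] / total_reports)
            for tok in tokens:
                if tok in self.vocabulary:  # tokens never seen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data; real training needs many labeled reports.
TRAIN = [
    ("invasive ductal carcinoma identified in breast biopsy", "reportable"),
    ("malignant melanoma present at margin", "reportable"),
    ("adenocarcinoma of the colon, moderately differentiated", "reportable"),
    ("benign nevus, no evidence of malignancy", "not_reportable"),
    ("normal colonic mucosa, no dysplasia seen", "not_reportable"),
    ("chronic inflammation, negative for tumor", "not_reportable"),
]

model = NaiveBayesReportClassifier().fit(*zip(*TRAIN))
print(model.predict("biopsy shows invasive carcinoma"))
print(model.predict("benign mucosa, no malignancy"))
```

Unlike the dictionary lookup, the model scores a report from the overall evidence in its words, so a phrasing that never appeared verbatim in training can still be classified. How well this works depends on the volume and variety of the training documents.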

The HHS Assistant Secretary for Planning and Evaluation’s Patient-Centered Outcomes Research (PCOR) Trust Fund funded CDC and the US Food and Drug Administration (FDA) to develop an NLP Workbench.

The resulting Clinical Language Engineering Workbench (CLEW) provides a web service platform that the health care community can use to develop and share NLP pipelines, language models, and other algorithms that convert unstructured clinical text to coded data. It also supports community-driven feedback that can be used to train and improve the models in the future.