Natural Language Processing for Cancer Surveillance

At a glance

CDC is using natural language processing strategies to automate the process of identifying reportable cancer cases from narrative text.


Central cancer registries gather information about cancer cases from a variety of sources. Some of these sources—like medical records, laboratory reports, and other clinical reports—are unstructured or narrative text.

Cancer registrars review this text to identify reportable cancer cases and enter data into an abstract. The abstract contains information from the patient's medical record that has been transferred to the registry's computer software using standardized codes. Because this review and data entry are done by hand, there is a delay between the time when a cancer is diagnosed and when information about the cancer is available to the cancer registry.

CDC is using natural language processing (NLP) strategies to automate this process.


Natural language processing: The technology used to help computers understand human language. In this case, the human language is the text of a laboratory report or medical record.

Machine learning: Artificial intelligence that allows computers to use previous results to improve automatically.

Dictionary-based approach

CDC's eMaRC Plus software uses a dictionary of terms, abbreviations, and other representations of reportable cancers. It follows rules to compare pathology reports to the dictionary to identify reportable cancers. Then it creates abstracts with information about the cancers filled in automatically. This approach produces good results, but keeping the dictionary up to date requires considerable effort.
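
To make the general idea concrete, here is a minimal sketch of dictionary-based matching in Python. It is not how eMaRC Plus is implemented. The dictionary entries, the negation rule, and the names REPORTABLE_TERMS, find_reportable, and make_abstract are all assumptions made up for illustration, and a real abstract contains far more fields than the two shown.

```python
import re

# Hypothetical dictionary: terms and abbreviations mapped to a normalized
# label. These entries are illustrative only, not an actual registry list.
REPORTABLE_TERMS = {
    "adenocarcinoma": "adenocarcinoma",
    "squamous cell carcinoma": "squamous cell carcinoma",
    "scc": "squamous cell carcinoma",
    "ductal carcinoma in situ": "ductal carcinoma in situ",
    "dcis": "ductal carcinoma in situ",
}

# Hypothetical rule: skip a term when a negation cue appears earlier in the
# same sentence, within 40 characters of the term.
NEGATION = re.compile(r"\b(no|negative for|without)\b[^.]{0,40}$", re.IGNORECASE)


def find_reportable(report_text):
    """Return the sorted normalized labels of non-negated dictionary terms."""
    found = set()
    for term, label in REPORTABLE_TERMS.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(report_text):
            preceding = report_text[:match.start()]
            if not NEGATION.search(preceding):
                found.add(label)
    return sorted(found)


def make_abstract(report_text):
    """Build a toy abstract: a reportable flag plus the matched diagnoses."""
    labels = find_reportable(report_text)
    return {"reportable": bool(labels), "diagnoses": labels}


if __name__ == "__main__":
    print(make_abstract("Biopsy shows SCC of the skin. No evidence of DCIS."))
```

The sketch also shows why upkeep is costly: every new abbreviation, spelling variant, or phrasing has to be added to the dictionary or the rules by hand.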

Machine-learning approach

Another method is a statistical NLP approach. It uses supervised machine learning to account for the variation in pathology reports. This approach works best when a large number of pathology reports from many laboratories and pathologists are available for training.
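
As a rough illustration of supervised learning on report text, the sketch below trains a small multinomial naive Bayes classifier using only the Python standard library. Naive Bayes is chosen here for brevity. The source does not say which model CDC uses, and the six training sentences, the labels, and the names NaiveBayes, tokenize, and TRAIN are invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, hand-labeled examples standing in for real training reports.
TRAIN = [
    ("invasive ductal carcinoma identified in breast biopsy", "reportable"),
    ("adenocarcinoma of the colon moderately differentiated", "reportable"),
    ("malignant melanoma present in skin excision", "reportable"),
    ("benign fibroadenoma no malignancy identified", "not_reportable"),
    ("hyperplastic polyp negative for dysplasia", "not_reportable"),
    ("chronic inflammation benign tissue no tumor seen", "not_reportable"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])

if __name__ == "__main__":
    print(model.predict("poorly differentiated adenocarcinoma in colon biopsy"))
```

Unlike the dictionary, nothing here is written by hand per term: the model learns word weights from labeled examples. With only six examples it will misclassify any report whose wording differs from them, which is why a large and varied training set matters.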

The US Department of Health and Human Services funded CDC and the US Food and Drug Administration to develop an NLP Workbench under the Office of the Assistant Secretary for Planning and Evaluation's Patient-Centered Outcomes Research (PCOR) Trust Fund. The NLP Workbench was developed as a platform for members of the health care community to develop and share NLP pipelines, language models, and other algorithms that convert unstructured clinical text to coded data.

NPCR’s strategy

CDC’s National Program of Cancer Registries uses dictionary-based and cloud-based statistical NLP approaches to process pathology reports. Both approaches address challenges that laboratories and registries face when collecting, processing, and reporting cancer data.