NLP Workbench Web Services Cancer Domain Pilot Project
Two pilot projects will be completed to demonstrate how Natural Language Processing (NLP) Workbench Web Services can be used to meet specific requirements. They will focus on cancer (described below) and safety surveillance.
About 90% of all cancer cases require pathological confirmation of the diagnosis. Pathology reports have been mostly text-based narrative reports, which are time-consuming to process. The College of American Pathologists (CAP) accreditation program requires laboratories to use the CAP Cancer Protocols, and standard templates have been developed to capture key cancer data in electronic cancer checklists for pathology and biomarker outcomes.
CDC developed eMaRC Plus, software that uses NLP methods to process the narrative reports. The purpose of this pilot project is to use machine learning techniques to improve extraction and automatic coding, and to use a shared, open-source model that can be expanded.
Challenges with collection and use of pathology and biomarker reports include—
- CAP checklists are not required for biomarkers.
- Laboratories are not required to store or transmit cancer data in discrete data elements.
- Terminologies, test names, and data included in the biomarker reports are inconsistent among laboratories.
- Report organization and reporting in Health Level Seven (HL7) messages are inconsistent.
Project specifications include—
- Collecting de-identified data from at least four national laboratories for breast, lung, prostate, and colorectal cancers.
- Collecting histopathology cases from several states.
- Collecting 125 cases per cancer site from each laboratory, for a total of at least 2,000 cases.
- Completing double annotation by certified tumor registrars, with a master reviewer.
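The case total in the specifications follows from simple arithmetic. The short sketch below works it through, assuming every laboratory contributes the full 125 cases for each of the four cancer sites; the constant and function names are illustrative and not part of the project.

```python
# Figures taken from the project specifications above.
CASES_PER_SITE_PER_LAB = 125
CANCER_SITES = ["breast", "lung", "prostate", "colorectal"]
MIN_LABS = 4  # "at least four national laboratories"


def minimum_cases(labs=MIN_LABS):
    """Total cases if each lab supplies 125 cases for every cancer site (assumption)."""
    return CASES_PER_SITE_PER_LAB * len(CANCER_SITES) * labs


print(minimum_cases())   # 125 x 4 sites x 4 labs = 2000
print(minimum_cases(5))  # a fifth laboratory would raise the total to 2500
```

Because four is the minimum number of laboratories, 2,000 is a floor rather than a fixed total.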
Use cases to be addressed include—
- Identifying case reportability before and after negation.
- Extracting histology, primary site, behavior, laterality, and grade.
- Coding cancer data items to a nationally adopted coding system (International Classification of Diseases for Oncology, 3rd Edition [ICD-O-3]).
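The use cases above can be illustrated with a minimal rule-based sketch. It is not the pilot's machine learning model: the term list, negation cues, and function names are assumptions chosen for illustration, and a real system would need far broader vocabulary and context handling. The morphology codes shown are standard ICD-O-3 codes, in which the digit after the slash is the behavior code (3 indicates malignant, primary site).

```python
import re

# Illustrative lookup: histology term -> ICD-O-3 morphology code (histology/behavior).
HISTOLOGY_CODES = {
    "infiltrating duct carcinoma": "8500/3",
    "squamous cell carcinoma": "8070/3",
    "adenocarcinoma": "8140/3",
}

# Hypothetical negation cues; whole-word matching avoids hits inside other words.
NEGATION_PATTERN = re.compile(r"\b(no evidence of|negative for|no|without)\b")


def find_histology(sentence):
    """Return (term, code, negated) for the first known term in the sentence, or None."""
    text = sentence.lower()
    for term, code in HISTOLOGY_CODES.items():
        pos = text.find(term)
        if pos == -1:
            continue
        # A term counts as negated if a cue appears earlier in the same sentence.
        negated = NEGATION_PATTERN.search(text[:pos]) is not None
        return term, code, negated
    return None


def is_reportable(report):
    """Reportable if any sentence contains a known term that is not negated."""
    for sentence in re.split(r"(?<=[.;])\s+", report):
        hit = find_histology(sentence)
        if hit is not None and not hit[2]:
            return True
    return False


print(is_reportable("Left breast, core biopsy: infiltrating duct carcinoma, grade 2."))
print(is_reportable("Prostate, needle biopsy: negative for adenocarcinoma."))
```

The first report is flagged as reportable and the second is not, because the only cancer term in it follows a negation cue. Checking for the term before and after negation handling, as in the first use case, shows how many reports would be misclassified by term matching alone.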
The diagram above illustrates the use cases for the cancer domain.
- Laboratory information systems transmit unstructured text in the form of a Health Level Seven (HL7) version 2.5.1 Observation Result (ORU) message to the NLP web service, which returns a reportability determination. Reportability is determined using the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM).
- Narrative pathology reports are sent to the cancer registry’s eMaRC Plus system in the form of an HL7 version 2.5.1 ORU message. eMaRC Plus transmits the unstructured text to the NLP web service, which returns structured data. Structured data include primary site, histology, laterality, behavior, and grade.
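To make the message flow concrete, the sketch below shows how the narrative text might be pulled out of an HL7 version 2.5.1 ORU message before it is passed to an NLP service. The sample message, the application and facility names in it, and the StructuredResult container are invented for illustration; the source does not specify the web service's request or response format, so no service call is shown. The parsing relies only on standard HL7 v2 conventions: segments end with a carriage return, fields are separated by "|", MSH-9 holds the message type, MSH-12 the version, OBX-2 the value type, and OBX-5 the observation value.

```python
from dataclasses import dataclass
from typing import Optional

# Invented, de-identified sample message for illustration only.
SAMPLE_ORU = "\r".join([
    "MSH|^~\\&|LAB_LIS|LAB_A|NLP_SERVICE|REGISTRY|20200101120000||ORU^R01^ORU_R01|MSG00001|P|2.5.1",
    "PID|1||DEIDENTIFIED",
    "OBR|1|||PATH^Pathology report",
    "OBX|1|TX|FINAL_DX^Final diagnosis||Left breast, core biopsy:",
    "OBX|2|TX|FINAL_DX^Final diagnosis||Infiltrating duct carcinoma, grade 2.",
])


@dataclass
class StructuredResult:
    """Hypothetical container for the five data items named in the text."""
    primary_site: Optional[str] = None
    histology: Optional[str] = None
    laterality: Optional[str] = None
    behavior: Optional[str] = None
    grade: Optional[str] = None


def parse_segments(message):
    """Split a message into segments, and each segment into its fields."""
    return [line.split("|") for line in message.split("\r") if line]


def message_info(message):
    """Return (message code, version) from MSH-9 and MSH-12."""
    msh = parse_segments(message)[0]
    # MSH-1 is the field separator itself, so MSH-n sits at list index n - 1.
    return msh[8].split("^")[0], msh[11]


def extract_narrative(message):
    """Join the OBX-5 values of all text-typed (TX or FT) OBX segments."""
    lines = []
    for fields in parse_segments(message):
        if fields[0] == "OBX" and len(fields) > 5 and fields[2] in ("TX", "FT"):
            lines.append(fields[5])
    return "\n".join(lines)


print(message_info(SAMPLE_ORU))
print(extract_narrative(SAMPLE_ORU))
```

In practice an established HL7 parsing library would handle escape sequences, repetitions, and components more robustly than plain string splitting; the manual approach is used here only to keep the example self-contained.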