Centers for Disease Control and Prevention

Using Categorization of Reason-for-Visit Strings as the Basis for an Outbreak Detection System --- Minnesota, 2002--2003

Sivakumaran Raman, J. Levin, D. Hall, K. Frey
Children's Hospitals and Clinics, Roseville, Minnesota

Corresponding author: Sivakumaran Raman, Children's Hospitals and Clinics, 2900 Centre Pointe Drive, Roseville, MN 55113. Telephone: 651-855-2055; Fax: 651-855-2075; Email:

Disclosure of relationship: The contributors of this report have disclosed that they have no financial interest, relationship, affiliation, or other association with any organization that might represent a conflict of interest. In addition, this report does not contain any discussion of unlabeled use of commercial products or products for investigational use.


Introduction: The syndromic surveillance system used by Children's Hospitals and Clinics (CHC) to detect disease outbreaks uses reason-for-visit (e.g., chief complaint) information recorded when patients register at CHC facilities.

Objectives: A machine-learning approach that employs text categorization of reason-for-visit fields from patient encounters was used to assess whether daily counts of disease syndromes based on International Classification of Diseases, Ninth Revision (ICD-9) diagnostic codes correlated with daily counts of text categorizer-assigned syndromes.

Methods: Reason-for-visit text strings collected from CHC emergency departments and general pediatrics clinics were used to define a set of pediatric-focused disease syndromes. Text-categorization programs based on common algorithms available as modules in the Perl programming language (open source software) were given previously categorized data from 2002. Data for 2003 were used to evaluate the agreement between CHC syndromes assigned by ICD-9 codes and syndromes assigned by text categorizers. Spearman's rank correlation coefficients were calculated to permit examination of the association between daily counts of ICD-9--assigned syndromes and categorizer-assigned syndromes. Receiver operating characteristic (ROC) curves were plotted for certain categorized data to examine the performance of the text categorizer.
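
The categorization step described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl modules the authors used, and it is not their code: the class name, the whitespace tokenizer, the add-one smoothing, and the training strings are all assumptions made up for illustration. It shows only the general idea of training a naive Bayesian categorizer on previously labeled reason-for-visit strings and then assigning a syndrome to a new string.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a reason-for-visit string and split it on whitespace."""
    return text.lower().split()


class NaiveBayesCategorizer:
    """Multinomial naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self):
        self.doc_counts = Counter()              # training strings per syndrome
        self.word_counts = defaultdict(Counter)  # token counts per syndrome
        self.vocab = set()                       # all tokens seen in training

    def train(self, labeled_strings):
        for text, syndrome in labeled_strings:
            self.doc_counts[syndrome] += 1
            for tok in tokenize(text):
                self.word_counts[syndrome][tok] += 1
                self.vocab.add(tok)

    def scores(self, text):
        """Return an unnormalized log-probability score for each syndrome."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        result = {}
        for syndrome, n_docs in self.doc_counts.items():
            counts = self.word_counts[syndrome]
            total_tokens = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                # Smoothed log likelihood of this token under the syndrome
                score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
            result[syndrome] = score
        return result

    def categorize(self, text):
        """Return the syndrome with the highest score."""
        s = self.scores(text)
        return max(s, key=s.get)


# Invented training strings, for illustration only (not CHC data)
TRAINING = [
    ("cough and wheezing", "RESPIRATORY"),
    ("trouble breathing", "RESPIRATORY"),
    ("vomiting and diarrhea", "GASTROINTESTINAL"),
    ("stomach pain vomiting", "GASTROINTESTINAL"),
    ("fell off bike arm pain", "INJURY"),
    ("cut on hand", "INJURY"),
]

clf = NaiveBayesCategorizer()
clf.train(TRAINING)
print(clf.categorize("wheezing cough"))
```

A real system would need far more training data and more careful tokenization (abbreviations, misspellings), which this sketch does not attempt.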

Results: From 2003 data, 102,435 reason-for-visit strings were classified into syndromes by using the associated principal ICD-9 codes and running Perl programs for the Support Vector Machine (SVM) and naïve Bayesian categorizers. Spearman's rank correlation coefficient values for daily counts of categorizer-assigned syndromes and ICD-9--based syndromes demonstrated a correlation between the two. Spearman's coefficients for the counts for SVM versus ICD-9 syndromes were 0.754 for the EENT (eyes, ears, nose, and throat) syndrome, 0.722 for the FEVER syndrome, 0.843 for the GASTROINTESTINAL syndrome, 0.923 for the INJURY syndrome, and 0.913 for the RESPIRATORY syndrome. Correlation was also seen between EENT-ICD-9 and RESPIRATORY-SVM syndromes and EENT-ICD-9 and FEVER-SVM syndromes. Similar correlation results were obtained for the naïve Bayesian categorizer. ROC curves drawn for the naïve Bayesian categorizer-assigned scores (used as the test) against the ICD-9--assigned scores (used as the standard) provide evidence of the categorizer's high performance. Areas under ROC curves for the 13 syndromes ranged from 0.966 (for the INJURY CHC syndrome) to 0.701 (for the EENT CHC syndrome).
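
For readers unfamiliar with the two evaluation measures reported above, the following sketch shows how a Spearman rank correlation between two series of daily counts, and an area under the ROC curve for categorizer scores against a reference label, might be computed. It is a plain-Python illustration, not the authors' analysis code; the function names and all numbers in the example are invented, and the sketch assumes each series has some variation and that both label classes are present.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def roc_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented daily counts for one syndrome over a week (not CHC data)
icd9_daily = [12, 15, 9, 20, 18, 11, 14]
categorizer_daily = [10, 16, 8, 19, 21, 9, 13]
rho = spearman(icd9_daily, categorizer_daily)

# Invented categorizer scores, with 1 where the ICD-9 code assigns the syndrome
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(example_scores, example_labels)

print(round(rho, 3), round(auc, 3))
```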

Conclusion: Text categorization of reason-for-visit strings gives robust results and can be combined with a statistical or algorithmic method of detection of extraordinary events to create an outbreak detection system. The code used is available for public use at
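
The abstract does not say which statistical or algorithmic detection method would be paired with the categorizer. As one hypothetical possibility, the sketch below flags days on which a syndrome's daily count rises well above a trailing baseline. The function name, the 7-day window, the threshold of 3 standard deviations, the floor on the standard deviation, and the counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_unusual_days(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds a trailing baseline.

    The baseline is the mean of the previous `window` days. A day is flagged
    when its count is more than `threshold` standard deviations above it.
    """
    flagged = []
    for day in range(window, len(daily_counts)):
        baseline = daily_counts[day - window:day]
        mu = mean(baseline)
        # Floor sigma at 1.0 so a nearly flat baseline does not flag tiny changes
        sigma = max(stdev(baseline), 1.0)
        if daily_counts[day] > mu + threshold * sigma:
            flagged.append(day)
    return flagged


# Invented daily counts of one categorizer-assigned syndrome (not CHC data)
counts = [10, 12, 11, 9, 10, 13, 11, 12, 10, 30, 11]
print(flag_unusual_days(counts))
```

A simple rule like this ignores day-of-week and seasonal patterns, so it should be read only as a starting point for the kind of detection layer the conclusion describes.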

Use of trade names and commercial sources is for identification only and does not imply endorsement by the U.S. Department of Health and Human Services.

References to non-CDC sites on the Internet are provided as a service to MMWR readers and do not constitute or imply endorsement of these organizations or their programs by CDC or the U.S. Department of Health and Human Services. CDC is not responsible for the content of pages found at these sites. URL addresses listed in MMWR were current as of the date of publication.

Date last reviewed: 8/5/2005


Morbidity and Mortality Weekly Report
Centers for Disease Control and Prevention
1600 Clifton Rd, MailStop E-90, Atlanta, GA 30333, U.S.A
