STANDARDIZED OCCUPATION & INDUSTRY CODING
NOTE: This page is archived for historical purposes and is no longer being maintained or updated.
Newer web-based Industry and Occupation (I&O) coding tool: NIOSH Industry & Occupation Computerized Coding System (NIOCCS)
What is SOIC?
Many data sets contain narrative text for industry and occupation. These include vital records systems, cancer registries, workers' compensation systems, and healthcare records. Manually assigning industry and occupation (I&O) codes can be expensive, time-consuming, and inconsistent. Furthermore, because some industry and occupation titles are rare or include infrequently used synonyms, even experienced coders have great difficulty reaching agreement.
To decrease the number of cases a manual coder must review and to create national consistency, NIOSH led the development of the Standardized Occupation and Industry Coding (SOIC) software. The development of SOIC was a collaborative effort that included the National Association for Public Health Statistics and Information Systems, the National Center for Health Statistics (NCHS), the Bureau of Labor Statistics (BLS), the National Center for Chronic Disease Prevention and Health Promotion, and the Bureau of the Census (BOC).
SOIC codes occupation and industry narratives according to the 1990 BOC Alphabetical Index of Industries and Occupations, supplemented with special codes for non-paid workers, non-workers, and the military as defined in the NCHS Instruction Manual, Part 19. This website provides the current version (SOIC 1.5) and its documentation for download. The SOIC software may be downloaded free of charge. Minimum system requirements include:
- 90 MHz Pentium with 32 MB of RAM
- Windows® 98, NT, ME, or 2000
- Minimum 30 MB of free disk space
The SOIC system client was written using the Microsoft Visual Basic programming language and the Microsoft Access database management system. SOIC data tables and data files are stored as Access tables and files. SOIC offers several data access features: the main window can be used for data entry; text or ASCII files can be imported and exported; and files in Microsoft Access, dBase, or FoxPro formats can be opened directly in the software. Microsoft conventions for Windows applications were used wherever possible. The software has an easy-to-use interface whose data entry screen is modeled on the U.S. standard death certificate, and an extensive system menu with options for opening and saving files, editing or finding records, and coding a single record or an entire file.
The Coding Process
To assign industry and occupation codes, the software uses a stepwise series of increasingly complex coding modules. Narrative information is processed through each module until an industry or occupation code is assigned or the narrative is determined to be uncodable.
Auto-Spell: corrects some misspellings and expands fused words, acronyms, and abbreviations.
Lookup Tables: assigns codes based on exact matches to various I&O narrative combinations.
- Paired-phrase matching: commonly occurring I&O narratives.
- Company matching: a limited list of state-specific industry names.
- Idiom matching: misleading industry narratives.
Knowledge Base: assigns codes based on static handwritten coding rules that emulate the logic that a manual coder would typically apply (e.g., performs “fuzzy” matching on word fragments). The 2,055 rules comprise 848 industry rules and 1,207 occupation rules.
Word-to-Code: predicts codes based on word patterns observed in data used to develop the software.
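The stepwise process described above can be illustrated with a minimal Python sketch. This is not SOIC's actual implementation (the software was written in Visual Basic with Access tables); every table, function name, and code value here (for example `IND-1` and `OCC-1`) is a hypothetical placeholder, not a real 1990 Census code. The sketch also simplifies by having each module return both codes at once, and its Word-to-Code stand-in is a plain word match rather than a prediction learned from data.

```python
from typing import Optional, Tuple

Codes = Tuple[str, str]  # (industry code, occupation code), placeholders only

# Hypothetical stand-ins for SOIC's tables; none of these values are real.
ABBREVIATIONS = {"rn": "registered nurse", "hosp": "hospital"}
PAIRED_PHRASES = {("hospital", "registered nurse"): ("IND-1", "OCC-1")}
FRAGMENT_RULES = [("school", "teach", ("IND-2", "OCC-2"))]
WORD_CODES = {("trucking", "driver"): ("IND-3", "OCC-3")}


def auto_spell(text: str) -> str:
    """Lowercase, normalize whitespace, and expand known abbreviations."""
    words = text.lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def lookup_tables(industry: str, occupation: str) -> Optional[Codes]:
    """Exact match on a paired industry/occupation phrase."""
    return PAIRED_PHRASES.get((industry, occupation))


def knowledge_base(industry: str, occupation: str) -> Optional[Codes]:
    """Handwritten rules that match on word fragments."""
    for ind_fragment, occ_fragment, codes in FRAGMENT_RULES:
        if ind_fragment in industry and occ_fragment in occupation:
            return codes
    return None


def word_to_code(industry: str, occupation: str) -> Optional[Codes]:
    """Simplified stand-in: match whole words associated with a code pair."""
    for (ind_word, occ_word), codes in WORD_CODES.items():
        if ind_word in industry.split() and occ_word in occupation.split():
            return codes
    return None


# Modules run in order of increasing complexity; the first hit wins.
MODULES = [
    ("lookup_tables", lookup_tables),
    ("knowledge_base", knowledge_base),
    ("word_to_code", word_to_code),
]


def code_record(industry: str, occupation: str):
    """Return (codes, module name), or (None, 'uncodable') if no module hits."""
    industry, occupation = auto_spell(industry), auto_spell(occupation)
    for name, module in MODULES:
        codes = module(industry, occupation)
        if codes is not None:
            return codes, name
    return None, "uncodable"


print(code_record("Hosp", "RN"))
print(code_record("High School", "Teacher"))
print(code_record("Trucking company", "Truck driver"))
print(code_record("unknown", "unknown"))
```

Ordering the modules from cheapest and most exact to most speculative means a record is only passed to a less certain method when the more certain ones have failed, and anything left over is flagged for a manual coder.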
NIOSH compared SOIC-assigned codes with an expert's manually assigned codes for 48,067 cases from a death-certificate-based surveillance system. The number of software-assigned codes that matched the expert manual coder's codes is shown below. In this test there was no adjudication of the results; that is, the mismatched cases were not reviewed to determine whether the SOIC autocoder or the manual coder was actually correct. These results are provided as an illustration. Coding results will vary and depend upon overall data quality. The software does not perform well on narratives with company names and other ambiguous information.
Number of SOIC-assigned codes that matched manually assigned codes (48,067 cases)
|Codes compared|Cases matched|
|Industry codes|36,376 (76%)|
|Occupation codes|36,207 (75%)|
|Both occupation and industry codes|30,389 (63%)|
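The percentages above follow directly from the case counts. As a quick check, the short Python sketch below recomputes each match rate from the 48,067 cases and the matched counts reported on this page; the variable and function names are illustrative only.

```python
TOTAL_CASES = 48_067

# Matched counts as reported in the comparison above.
MATCHED = {
    "industry": 36_376,
    "occupation": 36_207,
    "both": 30_389,
}


def match_rate(matched: int, total: int) -> float:
    """Percentage of cases where the SOIC code equaled the manual code."""
    return 100.0 * matched / total


for label, matched in MATCHED.items():
    rate = match_rate(matched, TOTAL_CASES)
    print(f"{label}: {matched:,} of {TOTAL_CASES:,} ({rate:.1f}%)")
```

Because the mismatches were not adjudicated, these figures measure agreement with one expert coder, not accuracy.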
The current SOIC software version available for download is v. 1.5 and is based on the 1990 BOC industry and occupation coding scheme. The software is provided as a resource tool for injury and illness researchers where uniform coding of industry and occupation is beneficial to prevention efforts. No further revisions will be made and user support is limited. Assistance for the current version may be requested by contacting the NIOSH SOIC group.
Although no further revisions will be made to SOIC, NIOSH is currently developing new coding software called the NIOSH Industry and Occupation Computerized Coding System (NIOCCS), which will be available in late 2012. Like SOIC, NIOCCS will be able to translate free-text industry and occupation narratives found on employment and health records into standardized I&O codes, including the 2002 Bureau of the Census I&O codes. It will also be able to crosswalk data from the 1990 Census I&O codes to the 2002 Census codes. For additional information, view the NIOSH Industry and Occupation Coding and Support page.
- Page last reviewed: February 4, 2016 (archived document)
- Content source: National Institute for Occupational Safety and Health, Division of Safety Research