STANDARDIZED OCCUPATION & INDUSTRY CODING
SOIC Frequently Asked Questions
On this Page
- About SOIC
- Coding Results
- What is the overall performance of SOIC?
- How well does the software perform on industry narratives?
- How well does the software perform on occupation narratives?
- What are the limitations to using an automated system?
- What can be done to data to improve the software’s performance?
- Should I perform quality control on coded data?
About SOIC
What is SOIC?
The Standardized Occupation and Industry Coding (SOIC) system
is a standalone Windows-based software package that assigns 3-digit numerical
codes to narrative industry and occupation descriptions. Data can be
imported from existing files or entered directly through the SOIC
data entry screen, and the software can code one record at a time or all
records in a file. The software assigns codes based on the 1990 Bureau
of the Census Alphabetical Index of Industries and Occupations.
How much does the software cost?
Because SOIC was created in the public domain, the software can
be downloaded free of charge.
Where can industry and occupation information be found?
Many data sets contain narrative text for industry and occupation,
including:
- Vital records systems
- Cancer registries
- Workers’ compensation systems
- Healthcare records
Why collect industry and occupation information?
The collection of industry and occupation (I&O) information
serves many purposes:
- To associate specific health outcomes (e.g., a particular cancer or cause of death) with certain industries and/or occupations
- To identify areas in need of further research
- To assess socioeconomic status and identify persons who may be at high risk of disease or injury
Public health workers, industrial organizations, employers, and others can use this information to provide the best possible hazard abatement and control, as well as safety and health programs for workers.
What is the advantage of an automated coding system?
Manually assigning industry and occupation (I&O) codes is
often expensive, time-consuming, and inconsistent. An automated
system like SOIC can reduce the number of cases a coder must review
and improve consistency across a records system.
SOIC offers flexible data access features. A user can:
- Enter data directly into SOIC
- Operate directly on external data tables in dBase, FoxPro, and Access
- Import and export text files
Will the software be updated?
The current version of SOIC assigns codes based on the 1990 Bureau of the Census (BOC) coding scheme. No further revisions will be made to SOIC. However, NIOSH is currently developing new coding software called the NIOSH Industry and Occupational Computerized Coding System (NIOCCS). Like SOIC, NIOCCS will translate free-text industry and occupation narratives found on employment and health records into standardized I&O codes, including the 1990 and 2000 Bureau of the Census I&O codes as well as NAICS. For additional information, view the Industry and Occupation Coding Support page.
Coding Results
What is the overall performance of SOIC?
NIOSH compared SOIC-assigned codes with codes manually assigned by an expert
for 48,067 cases from a death certificate-based surveillance
system. The number of software-assigned codes that matched the expert’s
manually assigned codes is shown below. In this test there was no adjudication of
the results; that is, the mismatched cases were not reviewed to determine
whether the SOIC autocoder or the manual coder was actually correct. These
results are provided as an illustration. Coding results will vary and
depend upon overall data quality. The software does not perform well on
narratives with company names and other ambiguous information.
| Industry codes matched | 36,376 cases (76%) |
| Occupation codes matched | 36,207 cases (75%) |
| Both occupation and industry codes matched | 30,389 cases (63%) |
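The percentages above follow directly from the reported counts. As a minimal sketch (the variable and function names here are illustrative and are not part of SOIC), the same arithmetic can be reproduced for your own comparison file:

```python
# Counts reported in the NIOSH comparison above.
TOTAL_CASES = 48_067
matched = {
    "industry": 36_376,
    "occupation": 36_207,
    "both": 30_389,
}


def match_rate(count: int, total: int = TOTAL_CASES) -> int:
    """Return the share of matching cases as a whole-number percentage."""
    return round(100 * count / total)


rates = {name: match_rate(count) for name, count in matched.items()}
for name, pct in rates.items():
    print(f"{name}: {pct}%")
```

Because the mismatches were not adjudicated, these rates measure agreement with the manual coder rather than accuracy.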
How well does the software perform on industry narratives?
To identify and categorize SOIC errors, unique industry and occupation
narratives were reviewed separately. Using Bureau of the Census industry
divisions as a guide, SOIC and manually assigned codes were compared.
Of the 48,067 total cases, 16,096 contained unique industry narratives.
All duplicates were removed; for example, the term "construction roofing"
was kept only once in the industry narrative field, however many times it
appeared in the original file of 48,067 cases. More than half (9,262)
were coded correctly. The software incorrectly coded 3,296 narratives and could not code 3,538 narratives.
The 3,296 incorrectly coded narratives are characterized below.
| Codes were in different division categories but were unrelated (e.g., construction vs. hospital) | 1,159 (35%) |
| Codes were in different division categories but were related (e.g., food manufacturing vs. retail grocery) | 1,032 (31%) |
| Codes were in the same division category | 825 (25%) |
| Codes were in different division categories and one code was an unclassified code | 248 (8%) |
| Codes were in different division categories and one code was a non-worker code | 32 (1%) |
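The source does not say how duplicate narratives were identified. As a hedged sketch, the function below (a hypothetical helper, not part of SOIC) assumes two narratives are duplicates when they are equal after lowercasing and collapsing whitespace:

```python
def unique_narratives(narratives):
    """Return distinct narratives in first-seen order.

    Assumption: narratives are duplicates when they are equal after
    lowercasing and collapsing runs of whitespace.
    """
    seen = set()
    unique = []
    for text in narratives:
        key = " ".join(text.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


sample = [
    "construction roofing",
    "Construction  Roofing",
    "hospital",
    "construction roofing ",
]
print(unique_narratives(sample))  # ['construction roofing', 'hospital']
```

A stricter or looser matching rule would change the number of unique narratives, so the counts above should not be expected to reproduce exactly under this assumption.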
How well does the software perform on occupation narratives?
To identify and categorize SOIC errors, unique industry and occupation
narratives were reviewed separately. Using Bureau of the Census occupation
divisions as a guide, SOIC and manually assigned codes were compared.
Of the 48,067 total cases, 9,808 contained unique occupation narratives.
All duplicates were removed; for example, the term "carpenter" was kept
only once in the occupation narrative field, however many times it
appeared in the original file of 48,067 cases. Almost half (4,852)
were coded correctly. The software incorrectly coded 2,020 narratives and could not code 2,936 narratives.
The 2,020 incorrectly coded narratives are characterized below.
| Codes were in different division categories but were unrelated (e.g., nurse vs. machine operator) | 789 (39%) |
| Codes were in the same division category | 748 (37%) |
| Codes were in different division categories but were related (e.g., machine operators vs. operating engineers) | 388 (19%) |
| Codes were in different division categories and one code was an unclassified code | 60 (3%) |
| Codes were in different division categories and one code was a non-worker code | 35 (2%) |
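To put the industry and occupation results side by side, the reported counts can be converted into shares of the unique narratives. This is a small sketch using only the figures quoted in this FAQ; the dictionary layout and function name are illustrative choices:

```python
# Outcome counts for unique narratives, as reported in this FAQ.
outcomes = {
    "industry": {"correct": 9_262, "incorrect": 3_296, "not coded": 3_538},
    "occupation": {"correct": 4_852, "incorrect": 2_020, "not coded": 2_936},
}
unique_totals = {"industry": 16_096, "occupation": 9_808}


def shares(counts):
    """Return each outcome as a percentage of all outcomes, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for field, counts in outcomes.items():
    # The three outcomes should account for every unique narrative.
    assert sum(counts.values()) == unique_totals[field]
    print(field, shares(counts))
```

In both fields about one in five unique narratives was coded incorrectly; the larger gap is in narratives the software could not code at all.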
What are the limitations to using an automated system?
Data quality often poses challenges for an automated industry and
occupation (I&O) coding system:
- Multiple ways of reporting industries or occupations
  - Industry: Hauling/Transporting/Trucking
  - Occupation: Teacher/Tutor/Instructor
- State-specific business names
  - Industry: Bar Church Key Pub/Harper’s County Ham
- Ambiguous I&O narratives
  - Industry: Food/Computers
  - Occupation: Healthcare/Assistant
What can be done to data to improve the software’s performance?
Data quality plays an integral part in software performance.
The cleaner and more straightforward your data, the better the software performs.
Several steps can be taken to ensure well-coded data. Before coding, review
the industry and occupation (I&O) narratives in your data and modify
as needed:
- Spell out abbreviated words and acronyms
  Example:
  - AFB = Air Force Base
  - ED = Emergency Department
- Use business type instead of business name
  Example:
  - Construction instead of Smith & Sons Contracting
- Make sure that industry and occupation narratives are in appropriate fields
- Delete non-essential words
  Example:
  - a teacher
  - works as a lawyer
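Two of the steps above, spelling out abbreviations and deleting non-essential words, lend themselves to a simple pre-processing pass before the file is loaded into SOIC. The following is a minimal sketch, not a SOIC feature: the lookup table, the filler patterns, and the function name are all hypothetical and would need to be extended for real data.

```python
import re

# Hypothetical lookup table; abbreviations are matched in exact upper case
# so that ordinary words and names (e.g. "Ed") are left alone.
ABBREVIATIONS = {"AFB": "Air Force Base", "ED": "Emergency Department"}

# Hypothetical leading filler phrases, applied in order.
FILLER_PATTERNS = [r"^works as\s+", r"^(a|an|the)\s+"]


def clean_narrative(text: str) -> str:
    """Collapse whitespace, strip leading filler, expand known abbreviations."""
    text = " ".join(text.split())
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(clean_narrative("works as a lawyer"))  # lawyer
print(clean_narrative("a teacher"))          # teacher
print(clean_narrative("Travis AFB"))         # Travis Air Force Base
```

Replacing a business name with a business type and moving narratives into the correct field generally require human judgment, so this sketch does not attempt them.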
Should I perform quality control on coded data?
After data have been coded, we strongly encourage performing
quality control to ensure that there are no systematic errors:
- Randomly sample electronically coded cases
- Independently code the sample manually
- Compare records and adjudicate mismatches
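The sampling and comparison steps can be scripted outside SOIC. The sketch below assumes the coded data have been exported to a mapping from record ID to code; the function names, record IDs, and code values are placeholders for illustration and are not real Census codes.

```python
import random


def draw_qc_sample(record_ids, sample_size, seed=0):
    """Randomly pick record IDs for manual recoding (seeded for repeatability)."""
    ids = list(record_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


def find_mismatches(auto_codes, manual_codes):
    """Return IDs of manually coded records whose automated code differs."""
    return sorted(
        rid for rid, code in manual_codes.items() if auto_codes.get(rid) != code
    )


# Placeholder data: record ID -> 3-digit code.
auto_codes = {"r1": "111", "r2": "222", "r3": "333", "r4": "444"}
sample_ids = draw_qc_sample(auto_codes, sample_size=2)

# In practice a coder assigns these independently for the sampled records.
manual_codes = {"r1": "111", "r3": "999"}

to_adjudicate = find_mismatches(auto_codes, manual_codes)
agreement = 1 - len(to_adjudicate) / len(manual_codes)
print(to_adjudicate, f"{agreement:.0%}")  # ['r3'] 50%
```

The mismatched records still need to be adjudicated by a person, since either the automated code or the manual code may be the wrong one.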
Contact Us:
- National Institute for Occupational Safety and Health (NIOSH)
- Centers for Disease Control and Prevention
- 800-CDC-INFO (800-232-4636)
- TTY: (888) 232-6348
- Hours of Operation: 8am-8pm ET, Monday-Friday; closed holidays
- cdcinfo@cdc.gov


