
STANDARDIZED OCCUPATION & INDUSTRY CODING

SOIC Frequently Asked Questions

NOTE: This page is archived for historical purposes and is no longer being maintained or updated.

About SOIC

What is SOIC?
The Standardized Occupation and Industry Coding (SOIC) system is a standalone Windows-based software package that assigns 3-digit numerical codes to narrative industry and occupation descriptions. Using data that have been imported from existing files or entered directly through the SOIC data entry screen, the software can code one record at a time or all records in a file. The software assigns codes based on the 1990 Bureau of the Census Alphabetical Index of Industries and Occupations.

How much does the software cost?
Because SOIC was created in the public domain, the software can be downloaded free of charge.

Where can industry and occupation information be found?
Many data sets contain narrative text for industry and occupation, including:

  • Vital records systems
  • Cancer registries
  • Workers’ compensation systems
  • Healthcare records

Why collect industry and occupation information?
The collection of industry and occupation (I&O) information serves many purposes:

  • To associate specific health outcomes (e.g., a particular cancer or cause of death) with certain industries and/or occupations
  • To identify areas in need of further research
  • To assess socioeconomic status and identify persons who may be at high risk of disease or injury

This type of information can be used by public health workers, industrial organizations, employers, and others to provide the best possible hazard abatement and control, and safety and health programs for workers.

What is the advantage of an automated coding system?
Manually assigning industry and occupation (I&O) codes can be expensive, time consuming, and inconsistent. An automated system like SOIC can reduce the number of cases a coder must review and improve consistency across a records system.

What types of data are accepted by SOIC?
SOIC offers flexible data access features. A user can:
  • Enter data directly into SOIC
  • Operate directly on external data tables in dBase, FoxPro, and Access
  • Import and export text files

Will the software be updated?
The current version of SOIC assigns codes based on the 1990 Bureau of the Census (BOC) coding scheme. No further revisions will be made to SOIC. However, NIOSH is currently developing new coding software called the NIOSH Industry and Occupation Computerized Coding System (NIOCCS). As SOIC did, the new NIOCCS will have the ability to translate free text industry and occupation narratives found on employment and health records to standardized I&O codes including the 2002 Bureau of the Census I&O codes as well as NAICS. It will also have the ability to crosswalk data from the 1990 Census I&O codes to the 2002 Census codes. For additional information, view the NIOSH Industry and Occupation Coding and Support page.

How does SOIC differ from NIOCCS?
NIOCCS is a web-based system with newer coding structures, whereas SOIC is a standalone software application limited to 1990 BOC codes.

Coding Results

What is the overall performance of SOIC?
NIOSH compared SOIC-assigned codes with codes assigned manually by an expert coder for 48,067 cases from a death-certificate-based surveillance system. The number of software-assigned codes that matched the expert's codes is shown below. The results were not adjudicated; that is, mismatched cases were not reviewed to determine whether the SOIC autocoder or the manual coder was actually correct. These results are provided as an illustration. Coding results will vary with overall data quality. The software does not perform well on narratives that contain company names or other ambiguous information.

  • Industry codes matched: 36,376 cases (76%)
  • Occupation codes matched: 36,207 cases (75%)
  • Both occupation and industry codes matched: 30,389 cases (63%)
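
The match percentages above follow directly from the reported counts. A minimal sketch of the arithmetic is shown here; the variable and function names are illustrative only and are not part of SOIC.

```python
# Counts reported in the comparison above (mismatches were not adjudicated).
TOTAL_CASES = 48_067
MATCHES = {
    "industry": 36_376,
    "occupation": 36_207,
    "both": 30_389,
}


def agreement_rate(matched: int, total: int) -> float:
    """Return the percentage of cases where SOIC matched the manual coder."""
    return 100.0 * matched / total


rates = {name: agreement_rate(n, TOTAL_CASES) for name, n in MATCHES.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

Rounded to whole numbers, these rates agree with the percentages listed above.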

How well does the software perform on industry narratives?
To identify and categorize SOIC errors, unique industry and occupation narratives were reviewed separately. Using Bureau of the Census industry divisions as a guide, SOIC and manually assigned codes were compared. Of the 48,067 total cases, 16,096 cases contained unique industry narratives (all duplicates were removed, e.g., the file contained the term "construction roofing" in the industry narrative field only once compared to the numerous times it may have appeared in the original file of 48,067 cases). More than half (9,262) were coded correctly. The software incorrectly coded 3,296 narratives and could not code 3,538 narratives. Problems that were identified are characterized below.

  • Codes were in different division categories but were unrelated (e.g., construction vs. hospital): 1,159 (35%)
  • Codes were in different division categories but were related (e.g., food manufacturing vs. retail grocery): 1,032 (31%)
  • Codes were in the same division category: 825 (25%)
  • Codes were in different division categories and one code was an unclassified code: 248 (8%)
  • Codes were in different division categories and one code was a non-worker code: 32 (1%)
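
The review above worked from unique narratives, with duplicates removed. One way to build such a list is sketched below. This is not the procedure NIOSH used: ignoring letter case and extra whitespace is an assumption, and the function name and sample data are hypothetical.

```python
from collections import Counter


def unique_narratives(narratives):
    """Count each distinct narrative, ignoring case and extra whitespace."""
    # Assumption: narratives that differ only in case or spacing are duplicates.
    return Counter(" ".join(text.lower().split()) for text in narratives)


# Made-up sample data for illustration.
sample = [
    "Construction roofing",
    "construction  roofing",
    "Hospital",
    "construction roofing ",
]
counts = unique_narratives(sample)
for narrative, n in counts.items():
    print(f"{narrative}: appears {n} time(s)")
```

The keys of the counter form the unique list to review, and the counts show how many original records each narrative represents.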

How well does the software perform on occupation narratives?
To identify and categorize SOIC errors, unique industry and occupation narratives were reviewed separately. Using Bureau of the Census occupation divisions as a guide, SOIC and manually assigned codes were compared. Of the 48,067 total cases, 9,808 cases contained unique occupation narratives (all duplicates were removed, e.g., the file contained the term "carpenter" in the occupation narrative field only once compared to the numerous times it may have appeared in the original file of 48,067 cases). Almost half (4,852) were coded correctly. The software incorrectly coded 2,020 narratives and could not code 2,936 narratives. Problems that were identified are characterized below.

  • Codes were in different division categories but were unrelated (e.g., nurse vs. machine operator): 789 (39%)
  • Codes were in the same division category: 748 (37%)
  • Codes were in different division categories but were related (e.g., machine operators vs. operating engineers): 388 (19%)
  • Codes were in different division categories and one code was an unclassified code: 60 (3%)
  • Codes were in different division categories and one code was a non-worker code: 35 (2%)
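
The counts reported for unique industry and occupation narratives can be expressed as shares of each outcome (coded correctly, coded incorrectly, not coded). The sketch below uses only the figures given on this page; the names are illustrative.

```python
# Outcome counts for unique narratives, as reported on this page.
REVIEW = {
    "industry": {"correct": 9_262, "incorrect": 3_296, "not coded": 3_538},
    "occupation": {"correct": 4_852, "incorrect": 2_020, "not coded": 2_936},
}


def outcome_shares(outcomes):
    """Return each outcome as a percentage of all unique narratives."""
    total = sum(outcomes.values())
    return {name: 100.0 * n / total for name, n in outcomes.items()}


shares = {field: outcome_shares(outcomes) for field, outcomes in REVIEW.items()}
for field, result in shares.items():
    summary = ", ".join(f"{name} {pct:.1f}%" for name, pct in result.items())
    print(f"{field}: {summary}")
```

The three outcomes sum to the reported totals of 16,096 unique industry narratives and 9,808 unique occupation narratives.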

What are the limitations to using an automated system?
Data quality problems often pose challenges for an automated industry and occupation (I&O) coding system:

  • Multiple ways of reporting industries or occupations
    • Industry: Hauling/Transporting/Trucking
    • Occupation: Teacher/Tutor/Instructor
  • State-specific business names
    • Industry: Bar Church Key Pub/Harper’s County Ham
  • Ambiguous I&O narratives
    • Industry: Food/Computers
    • Occupation: Healthcare/Assistant

What can be done to data to improve the software’s performance?
Data quality plays an integral part in software performance: the cleaner and more straightforward your data, the better the software performs. Several steps can be taken to ensure well-coded data. Before coding, review the industry and occupation (I&O) narratives in your data and modify them as needed:

  • Spell out abbreviated words and acronyms
    Example:
    • AFB = Air Force Base
    • ED = Emergency Department
  • Use business type instead of business name
    Example:
    • Construction instead of Smith & Sons Contracting
  • Make sure that industry and occupation narratives are in appropriate fields
  • Delete non-essential words
    Example:
    • "a teacher" becomes "teacher"
    • "works as a lawyer" becomes "lawyer"
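
Some of the cleanup steps above can be scripted before the data are loaded into a coding tool. The sketch below is a minimal illustration, not a feature of SOIC. The abbreviation table and the list of leading filler words are hypothetical and would need to be extended for real data.

```python
import re

# Hypothetical lookup table; the two entries come from the examples above.
ABBREVIATIONS = {"AFB": "Air Force Base", "ED": "Emergency Department"}

# Assumed list of non-essential leading words; adjust for your own data.
LEADING_FILLER = re.compile(r"^(?:works as an?|an?|the)\s+", re.IGNORECASE)


def clean_narrative(text: str) -> str:
    """Expand known abbreviations and drop leading non-essential words."""
    text = " ".join(text.split())        # collapse stray whitespace
    text = LEADING_FILLER.sub("", text)  # e.g. "works as a lawyer" -> "lawyer"
    # Abbreviations are matched case-sensitively so "Ed" (a name) is kept.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


for raw in ["a teacher", "works as a lawyer", "mechanic at AFB"]:
    print(f"{raw!r} -> {clean_narrative(raw)!r}")
```

Replacing business names with business types and moving narratives into the correct fields usually require manual review and are not covered by this sketch.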

Should I perform quality control on coded data?
After data have been coded, we strongly encourage performing quality control to ensure that there are no systematic errors.

  • Randomly sample electronically coded cases
  • Independently code sample manually
  • Compare records and adjudicate mismatches
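
The quality control steps above can be sketched as follows. This is an illustration under stated assumptions: records are identified by an ID, codes are stored as 3-digit strings, and the function names, record IDs, and codes shown are made up.

```python
import random


def draw_qc_sample(record_ids, sample_size, seed=0):
    """Randomly select record IDs for independent manual coding."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    ids = list(record_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def find_mismatches(auto_codes, manual_codes):
    """Return sampled records whose automated and manual codes differ."""
    return {
        rec_id: (auto_codes[rec_id], manual_code)
        for rec_id, manual_code in manual_codes.items()
        if auto_codes[rec_id] != manual_code
    }


# Made-up record IDs and codes, for illustration only.
auto_codes = {1: "060", 2: "831", 3: "842", 4: "060"}
sample_ids = draw_qc_sample(auto_codes, sample_size=2)

# Manual codes would come from an independent coder; these are placeholders.
manual_codes = {1: "060", 3: "831"}
mismatches = find_mismatches(auto_codes, manual_codes)
agreement = 100.0 * (len(manual_codes) - len(mismatches)) / len(manual_codes)
print(f"Agreement: {agreement:.0f}%; to adjudicate: {sorted(mismatches)}")
```

Each mismatch still needs to be adjudicated by a reviewer, since either the automated code or the manual code may be the incorrect one.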
 