INDUSTRY AND OCCUPATION CODING


About the NIOSH Industry & Occupation Computerized Coding System (NIOCCS)

NIOCCS is a web-based software tool designed to translate industry and occupation (I&O) text to standardized I&O codes. It is used by occupational researchers, federal government agencies, state health departments and other organizations that collect and/or evaluate information using I&O. Its purpose is to provide a tool that reduces the high cost of manually coding I&O information while simultaneously improving uniformity of the codes.

NIOCCS is available free of charge and requires only internet access and a web browser for use. Users are required to register for a NIOCCS account if they wish to upload files of records for coding.


  • Single Record Coding
  • Batch File Processing
  • Computer-Assisted Coding for records not automatically coded
  • I&O Coding Classification Scheme options:
    • Census 2010 / NAICS 2007 / SOC 2010
    • Census 2002 / NAICS 2002 / SOC 2000
    • Census 2000 / NAICS 1997 / SOC 2000
  • Census Industry and Occupation Alphabetical Index Lookup
  • Crosswalk Coding forward or backward using Census, NAICS/SIC, or SOC coding classifications
    Crosswalk coding is the mapping of a code from one I&O classification coding scheme to another I&O classification coding scheme or to a different code within the same I&O coding scheme for a different year.
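
As a rough illustration of forward and backward crosswalk coding, the sketch below builds both lookup directions from one list of code links. The codes (`OLD-01`, `NEW-01`, and so on) are placeholders invented for this example and are not real Census, NAICS, SIC, or SOC codes; the function name `build_crosswalk` is also hypothetical and does not describe how NIOCCS stores its crosswalks.

```python
from collections import defaultdict

def build_crosswalk(links):
    """Build forward and backward lookups from (source_code, target_code) links."""
    forward, backward = defaultdict(list), defaultdict(list)
    for source, target in links:
        forward[source].append(target)
        backward[target].append(source)
    return forward, backward

# Placeholder links: one source code may map to several target codes.
LINKS = [("OLD-01", "NEW-01"), ("OLD-02", "NEW-02"), ("OLD-02", "NEW-03")]

forward, backward = build_crosswalk(LINKS)
print(forward["OLD-02"])   # forward crosswalk, one-to-many
print(backward["NEW-03"])  # backward crosswalk
```

Keeping a list of targets per code, rather than a single value, reflects that a code in one scheme or year does not always correspond to exactly one code in another.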

Overview of NIOCCS V3.0 Enhancements:

  • Improved autocoding rates by 10-25% depending on quality of data input.
  • Industry input can be text or NAICS codes.
  • Simplified upload process (upload and autocode in one step).
  • Increased upload file size limit from 2.5MB to 30MB.
  • Removed option for High and Medium confidence level coding; added a ‘Suggest Review’ flag on complex autocoded records.
  • Detailed NAICS and SOC codes included in output files along with Census Industry and Occupation (I&O) codes.
  • Automatic email message to user when autocoding job is completed.
  • Crosswalk coding enhancements:
    • Crosswalk from 2000 Census to 2010 Census in one step.
    • NAICS crosswalks added.
    • SOC crosswalks added.
    • SIC to NAICS crosswalks added.
  • Enhanced computer-assisted coding features:
    • Bureau of Labor Statistics (BLS) data available to assist with coding decisions.
    • Quick Google search buttons for industry, occupation, and employer text.
    • Ranked I&O pair candidates provided.
    • Census, NAICS, and SOC titles viewable with mouse hover.
    • Hot keys added to reduce mouse use.
    • Improved Alpha Index searching.

NIOCCS is available free of charge and requires only internet access and a web browser for use.  Many features of the system do not require a NIOCCS user account, such as Single Record Coding and the Census Industry and Occupation Alphabetical Index Lookup.  To perform coding of an entire file of records, a user account is required.

The diagram and steps below outline the process for using NIOCCS to code a file of records.

NIOCCS System
  1. User uploads file to be coded.  Minimum fields required:  Record ID, Industry text, Occupation text.
  2. Data is processed by the NIOCCS coding engine.
  3. Records are flagged as autocoded or needing manual coding.
  4. Using the tools in the Computer-Assisted Coding Interface of NIOCCS, user selects codes for records needing manual coding.
  5. User downloads coded file.  Output file contains input fields plus the Census industry code, Census occupation code, NAICS code, SOC code, and flags indicating which records were autocoded.
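
Step 1 can be sketched as follows: a minimal input file holding only the three required fields. The column names (`record_id`, `industry`, `occupation`), the header row, and the comma delimiter are assumptions made for illustration; the actual layout is defined in the NIOCCS Input File Format documentation.

```python
import csv
import io

# Hypothetical column names; the real layout is set by the NIOCCS input file format.
FIELDS = ["record_id", "industry", "occupation"]

def write_minimal_input(records, handle):
    """Write only the three minimum fields, ignoring any extra keys."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)

records = [
    {"record_id": "1", "industry": "MAIL DELIVERY", "occupation": "POSTAL TRUCK DRIVER"},
    {"record_id": "2", "industry": "BEER", "occupation": "DRIVER", "employer": "not written"},
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_minimal_input(records, buffer)
print(buffer.getvalue())
```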

 

Autocoding Process Overview

The NIOCCS industry and occupation (I&O) coding process is based upon the U.S. Census Bureau I&O Alphabetical Indexes supplemented with special codes developed by CDC/NIOSH for non-paid workers, non-workers, and the military (see NIOSH I&O coding documentation for more information).

The Census Alphabetical Index of Industries and Occupations lists industry and occupation titles used most often in the economy.  These indexes were developed by the U.S. Census Bureau for use in classifying a respondent's industry and occupation as reported in Census Bureau demographic surveys.  They list over 21,000 industry and 31,000 occupation titles in alphabetical order.  Each title has been assigned a Census Industry Code or Census Occupation Code, and the associated North American Industry Classification System (NAICS) code or Standard Occupational Classification (SOC) code is also provided for each title.  For more detailed information about the Census Alphabetical Indexes, go to the U.S. Census Bureau website at:  https://www.census.gov/topics/employment/industry-occupation/guidance/indexes.html

NIOCCS codes input I&O narratives (or NAICS codes supplied in place of industry narratives) to Census I&O codes for the user-specified target Census I&O classification year.  Once a record is coded, the NAICS and SOC codes associated with the Census codes in the alphabetical indexes are included in the output results.

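A high-level view of the NIOCCS autocoding process follows: I&O inputs are standardized, candidate index lines are generated and scored, index restrictions are applied, and a final I&O pair is selected where possible.

During standardization of the I&O narrative inputs:
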
  • Punctuation is removed.
  • Acronyms are organized into a standard format.
  • White space characters are cleaned up.
  • Stop words, such as articles and prepositions, are removed from the input.
  • Words within the narrative inputs are stemmed, removing suffixes such as –s, –ing and –er.
Standardized I&O narrative inputs are used by the NIOCCS candidate generation process to produce lists of candidate lines from their respective alphabetic indexes.
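
The standardization steps listed above can be sketched roughly as below. This is an illustration only, not the NIOCCS implementation: the stop-word list, the dotted-acronym rule, and the naive suffix stripper (`naive_stem`) are assumptions chosen to keep the example short.

```python
import re

# Tiny illustrative stop-word list; the list NIOCCS uses is not given here.
STOP_WORDS = {"A", "AN", "THE", "OF", "IN", "FOR", "AT", "ON", "AND"}
# Suffixes named in the text; this naive stripper stands in for a real stemmer.
SUFFIXES = ("ING", "ER", "S")

def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def standardize(narrative):
    """Return the list of standardized words for one narrative input."""
    text = narrative.upper()
    # Put dotted acronyms such as "U.S." into one format ("US").
    text = re.sub(r"\b(?:[A-Z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    # Remove the remaining punctuation.
    text = re.sub(r"[^A-Z0-9\s]", " ", text)
    # split() also cleans up repeated or leading/trailing white space.
    words = text.split()
    return [naive_stem(word) for word in words if word not in STOP_WORDS]

print(standardize("  Driver of the U.S. Mail trucks!! "))
print(standardize("Postal truck-driving"))
```

Under these assumptions, "DRIVER" and "DRIVING" both reduce to the same stem, which is what lets differently worded narratives match the same index titles.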

For inputs using NAICS codes rather than industry narratives, the input NAICS codes are validated and automatically crosswalked to the equivalent NAICS codes for the specified NAICS published year. The resulting crosswalked NAICS codes serve as the industry candidate lines in this form of I&O input.

In the NIOCCS I&O candidate generation process:

  1. The alphabetic index dictionary is searched for possible matches using words in the standardized I&O narrative inputs.
  2. Matches between words in the narrative inputs and words in the index titles are used to select I&O candidate lines from the Census alphabetic indexes.
  3. The words in the narrative inputs are compared to the words in the index titles for the selected I&O candidate lines.
  4. The presence or absence of words in the comparison are used to score these selected I&O candidate lines, and low scoring candidate lines are dropped from consideration.
  5. All industry candidate lines are paired with all occupation candidate lines to form a list of possible I&O pairs. I&O pair scores are determined by combining industry candidate line scores with occupation candidate line scores.
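
The candidate generation steps above can be sketched as follows. The scoring formula (shared words minus a penalty for words found on only one side), the cutoff `min_score`, and the index lines and codes are all assumptions invented for illustration; the text does not give the actual NIOCCS scoring rules.

```python
from itertools import product

def score_line(input_words, title_words, penalty=0.5):
    """Reward shared words; penalize words present on only one side (assumed formula)."""
    inp, title = set(input_words), set(title_words)
    return len(inp & title) - penalty * len(inp ^ title)

def candidates(input_words, index, min_score=0.0):
    """Return (score, title, code) for index lines sharing a word, dropping low scores."""
    scored = []
    for title, code in index:
        title_words = title.split()
        if set(input_words) & set(title_words):
            score = score_line(input_words, title_words)
            if score >= min_score:
                scored.append((score, title, code))
    return sorted(scored, reverse=True)

def pair_candidates(industry_cands, occupation_cands):
    """Pair every industry candidate with every occupation candidate; sum the scores."""
    pairs = [(i[0] + o[0], i, o) for i, o in product(industry_cands, occupation_cands)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

# Placeholder index lines and codes, not taken from the Census indexes.
INDUSTRY_INDEX = [("MAIL DELIVERY", "IND-1"), ("MAIL ORDER HOUSE", "IND-2"), ("FOOD DELIVERY", "IND-3")]
OCCUPATION_INDEX = [("TRUCK DRIVER", "OCC-1"), ("TAXI DRIVER", "OCC-2"), ("POSTAL CLERK", "OCC-3")]

ind = candidates(["MAIL", "DELIVERY"], INDUSTRY_INDEX)
occ = candidates(["POSTAL", "TRUCK", "DRIVER"], OCCUPATION_INDEX)
pairs = pair_candidates(ind, occ)
print(pairs[0])
```
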
Codes in the Census Alphabetical Indexes sometimes have restriction rules associated with them.  These rules ensure that the occupation codes selected for a given industry are valid, and vice versa for industry codes.

The index restrictions are examined for each I&O candidate pair.  When a pair violates an index restriction rule, it is dropped from consideration for autocoding.
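
A minimal sketch of this restriction check is shown below. It assumes each index line is a dictionary that may carry a set of partner codes it allows (`allowed_industry_codes` or `allowed_occupation_codes`); these field names, the codes, and the representation are hypothetical and do not reflect how restrictions are encoded in the Census indexes.

```python
def violates_restriction(industry_line, occupation_line):
    """True when either line restricts its partner and the partner's code is not allowed."""
    allowed_ind = occupation_line.get("allowed_industry_codes")
    allowed_occ = industry_line.get("allowed_occupation_codes")
    if allowed_ind is not None and industry_line["code"] not in allowed_ind:
        return True
    if allowed_occ is not None and occupation_line["code"] not in allowed_occ:
        return True
    return False

def apply_restrictions(pairs):
    """Keep only the (industry_line, occupation_line) pairs that violate no restriction."""
    return [(i, o) for i, o in pairs if not violates_restriction(i, o)]

# Placeholder lines and codes for illustration only.
mail = {"title": "MAIL DELIVERY", "code": "IND-1"}
light_truck = {"title": "TRUCK DRIVER", "code": "OCC-1", "allowed_industry_codes": {"IND-1", "IND-3"}}
semi_truck = {"title": "TRUCK DRIVER", "code": "OCC-2", "allowed_industry_codes": {"IND-9"}}

remaining = apply_restrictions([(mail, light_truck), (mail, semi_truck)])
print([(i["code"], o["code"]) for i, o in remaining])
```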

Learn more about the Census Alphabetical Index I&O restrictions, or review the I&O Coding Instruction Manuals found on the NIOSH I&O Coding page.

If a single remaining I&O pair has the highest remaining pair score, then the autocoder selects that I&O pair.

If more than one of the remaining I&O pairs share the highest remaining score, then tiebreakers will be used where possible to select one of these highest scoring I&O pairs.

Tiebreaker #1:  Census I&O coding rules are applied. One example is the set of rules for the default selection of occupation index lines with “own business not incorporated” (OBNI) or “private company” (PR) industry restrictions for certain occupation titles; see the Census I&O Coding Instruction Manual (http://www.cdc.gov/niosh/topics/coding/nioccsuserdocumentation.html).

Tiebreaker #2:  Bureau of Labor Statistics (BLS) occupation employment totals for each pair’s NAICS/SOC combination are examined. A pair whose employment total is much larger than those of the other best scoring pairs may be selected under some circumstances.

I&O inputs with an I&O pair selected are autocoded. Otherwise, I&O candidate lines and I&O pairs are saved for review and possible selection during manual coding using the computer-assisted coding features of the system.
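
The selection logic described above can be sketched as follows. The function names, the `ratio` threshold used to decide when one employment total is "much larger" than another, and the employment figures are assumptions made for this example (the figures are invented and are not BLS data). Tiebreaker #1 is omitted because the Census coding rules are not spelled out here.

```python
def select_pair(scored_pairs, tiebreakers=()):
    """scored_pairs: list of (score, pair). Return one pair, or None for manual coding."""
    if not scored_pairs:
        return None
    best = max(score for score, _ in scored_pairs)
    tied = [pair for score, pair in scored_pairs if score == best]
    for tiebreak in tiebreakers:
        if len(tied) == 1:
            break
        tied = tiebreak(tied) or tied
    return tied[0] if len(tied) == 1 else None

def employment_tiebreaker(employment, ratio=10.0):
    """Pick the pair whose employment total dominates the runner-up (assumed rule)."""
    def tiebreak(pairs):
        ranked = sorted(pairs, key=lambda p: employment.get(p, 0), reverse=True)
        top, runner_up = employment.get(ranked[0], 0), employment.get(ranked[1], 0)
        return [ranked[0]] if top > 0 and top >= ratio * runner_up else pairs
    return tiebreak

# Placeholder codes and made-up employment totals.
scored = [(3.5, ("IND-1", "OCC-1")), (3.5, ("IND-1", "OCC-2")), (1.5, ("IND-3", "OCC-1"))]
employment = {("IND-1", "OCC-1"): 50000, ("IND-1", "OCC-2"): 1200}

print(select_pair(scored))                                       # tie, so left for manual coding
print(select_pair(scored, [employment_tiebreaker(employment)]))  # tie broken by employment
```

Returning `None` rather than guessing mirrors the behavior described in the text: when no single pair can be selected, the candidates are saved for computer-assisted manual coding.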

Autocoding Example #1

Industry = MAIL DELIVERY
Occupation = POSTAL TRUCK DRIVER

The autocoder selects all industry lines containing “MAIL” and all industry lines containing “DELIVERY”. All lines that do not contain both “MAIL” and “DELIVERY” are penalized. There is a “MAIL DELIVERY” line in the industry index, and it receives the highest score.

The autocoder selects all occupation lines containing “POSTAL”, all occupation lines containing “TRUCK” and all occupation lines containing “DRIVER”. There isn’t a “POSTAL TRUCK DRIVER” line in the occupation index, so all selected occupation lines are penalized. There are several “TRUCK DRIVER” occupation lines in the occupation index, and these lines receive the highest score. The “TRUCK DRIVER” lines are for different types of truck drivers, including drivers of semi-trucks and drivers of light trucks used for delivery, such as postal delivery trucks. These “TRUCK DRIVER” lines differ from each other by the occupation codes they categorize and by their industry restrictions.

All selected industry lines are paired with all selected occupation lines, and the pairs are scored by combining their industry line score and their occupation line score. The highest scoring pairs are the “MAIL DELIVERY” and “TRUCK DRIVER” pairs. Restrictions are applied to the highest scoring pairs, and pairs associated with “TRUCK DRIVER” occupation lines for drivers of semi-trucks are eliminated because their industry restrictions do not include the “MAIL DELIVERY” industry line’s industry code.

Once restrictions are applied to the highest scoring pairs, only one pair remains, and the I&O narrative inputs are autocoded to the I&O codes represented by that pair.

Autocoding Example #2

Industry = BEER
Occupation = DRIVER

The autocoder selects all industry lines containing “BEER”. All lines containing more words than just “BEER”, such as “ROOT BEER”, are penalized.  There are 3 “BEER” lines in the industry index: beer manufacturing, beer wholesale and beer retail. These 3 “BEER” lines all receive the highest score.

The autocoder selects all occupation lines containing “DRIVER”. All lines containing more words than just “DRIVER”, such as “TRUCK DRIVER”, are penalized. There are several “DRIVER” occupation lines in the occupation index, and these lines receive the highest score.  Similar to “TRUCK DRIVER” in example #1, the “DRIVER” lines are for different types of drivers, including drivers of semi-trucks, drivers of light trucks used for delivery, and drivers of taxicabs. These “DRIVER” lines differ from each other by the occupation codes they categorize and by their industry restrictions.

All selected industry lines are paired with all selected occupation lines, and the pairs are scored by combining their industry line score and their occupation line score. Restrictions are applied to the highest scoring pairs.

Once restrictions are applied, only 3 pairs remain. There isn’t enough information for the autocoder to select between beer manufacturing, beer wholesale and beer retail. Therefore, the I&O narrative inputs are not autocoded, and the selected I&O index lines and I&O pairs are saved for review and possible selection during manual coding using the computer-assisted coding features of NIOCCS.

Performance

Baseline performance measures for NIOCCS autocoding are based on the accuracy of the records autocoded by the system.  The accuracy threshold for NIOCCS is an error rate of 10% or less among autocoded records.  NIOCCS accuracy rates are continually monitored for all files submitted by NIOSH to ensure the accuracy threshold is met.  For NIOCCS V3.0, accuracy was tested using large sets of records that were coded and verified by NIOSH-trained I&O coders.

Production rates are determined by calculating the percent of records coded automatically by NIOCCS. It is important to note that the quality of data input for coding can result in very different autocoding production rates. Using the benchmarks set for coding accuracy, the average NIOCCS production rates for autocoding over time are shown below.  These percentages are based on internal NIOSH data submissions to NIOCCS using the High Confidence Level option in Versions 1.0 and 2.0.

NIOCCS Average Autocoding Rates

Data Type              2013-2014 (V1.0)   2015-2017 (V2.0)   2018 (V3.0)
Death Certificates     65%                71%                87%
Survey Data            42%                52%                75%
Workers Compensation                      56%*               87%

* NAICS codes converted to industry text by NIOSH
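
The two performance measures described in this section reduce to simple percentages, sketched below. The record fields (`autocoded`, `assigned`, `verified`), the codes, and the sample data are hypothetical; the only figure taken from the text is the 10% error-rate threshold.

```python
ERROR_THRESHOLD = 10.0  # percent, from the accuracy threshold stated in the text

def production_rate(records):
    """Percent of all records that were coded automatically."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if r["autocoded"]) / len(records)

def error_rate(records):
    """Percent of autocoded records whose code differs from the verified code."""
    auto = [r for r in records if r["autocoded"]]
    if not auto:
        return 0.0
    wrong = sum(1 for r in auto if r["assigned"] != r["verified"])
    return 100.0 * wrong / len(auto)

# Made-up sample: 8 of 10 records autocoded, 1 of those 8 coded incorrectly.
sample = (
    [{"autocoded": True, "assigned": "X", "verified": "X"}] * 7
    + [{"autocoded": True, "assigned": "X", "verified": "Y"}]
    + [{"autocoded": False, "assigned": None, "verified": "Z"}] * 2
)

print(production_rate(sample))                # percent autocoded
print(error_rate(sample))                     # percent of autocoded records in error
print(error_rate(sample) <= ERROR_THRESHOLD)  # whether the threshold is met
```

Note that the error rate is computed over autocoded records only, so a higher production rate does not by itself indicate better accuracy.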

Coding results will vary and depend upon overall quality of the source data.  Structured and detailed data sources will have higher accuracy and production rates than data sources with long and complex text, unintelligible text, or insufficient information.

NIOCCS uses only the industry and occupation text to assign codes. Records that contain employer name and/or job duties will not code at the same rate of accuracy as records containing only industry and occupation. This is because the additional pieces of information (employer and job duties) can conflict and/or provide more detailed information that could alter the I&O codes assigned. Including this information can be helpful, however, when using the computer-assisted coding module to ensure that appropriate codes are assigned manually.

Limitations

Speed

User internet bandwidth will significantly affect the interactivity of computer-assisted coding.
Autocoding may take a significant amount of time when the volume of data to be coded is large. Turnaround time may also depend on the number of coding jobs in the queue.

File Size Limitations

Upload file size is currently (as of December 2017) limited to 30MB. The number of records this equates to will vary depending on how many of the optional fields in the input file format are used. For files that use the slim file format (ID, Industry, Occupation), the limit equates to approximately 300,000 records.  It is recommended, however, to limit file submissions to no more than 100,000 records at a time; otherwise, the performance of the computer-assisted coding user interface will be diminished.
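
The figures above imply roughly 100 bytes per slim-format record (30MB divided by 300,000 records, taking 1MB as 1,000,000 bytes). One way to stay within the recommended submission size is to split a large set of records into batches before upload, as in the sketch below; the function name `split_batches` is hypothetical, and the integers stand in for real rows.

```python
MAX_RECORDS = 100_000  # recommended maximum records per submission, from the text

def split_batches(rows, max_records=MAX_RECORDS):
    """Yield consecutive slices of rows, each with at most max_records entries."""
    for start in range(0, len(rows), max_records):
        yield rows[start:start + max_records]

rows = list(range(250_000))  # stand-in for 250,000 input records
batches = list(split_batches(rows))
print([len(batch) for batch in batches])

# Rough per-record size implied by the stated limits.
print(30_000_000 / 300_000)
```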

NIOSH (2018). NIOSH Industry and Occupation Computerized Coding System (NIOCCS). U.S. Department of Health and Human Services, Public Health Service, Centers for Disease Control and Prevention, National Institute for Occupational Safety and Health, Division of Surveillance, Hazard Evaluation and Field Studies, Surveillance Branch. <website address> Date accessed._________.

NIOCCS is a tool used to code Industry and Occupation. NIOCCS will autocode most records, but the remaining records will require manual coding using the computer-assisted features of the system.

The computer-assisted feature of NIOCCS requires trained I&O coders with the knowledge needed to use the system for selecting the appropriate I&O codes. Therefore, we strongly recommend that users be trained in I&O coding prior to using the computer-assisted feature of the NIOCCS system. NIOCCS is not intended to take the place of trained I&O coders. For more information on the computer-assisted features, please see the NIOCCS User Manual.

We provide I&O coding training classes a few times a year.  See our Training and Consultation page for more information.

NIOCCS User Manual, Training Guides, and Input File Format information (as well as documents on I&O coding classification systems and I&O Crosswalks).

The NIOCCS project team continually works to identify adjustments that can be made to the system to improve usability and autocoding capability. User feedback is welcome and is used to identify and prioritize improvements to be made to the system.

For comments, questions, or problems using NIOCCS, send an email to NIOCCS@cdc.gov

Page last reviewed: March 6, 2019