INDUSTRY AND OCCUPATION CODING
The NIOSH Industry and Occupation Computerized Coding System (NIOCCS) is a web-based software tool designed to efficiently, accurately, and uniformly translate industry and occupation (I&O) text found on employment, vital statistics, and health records to standardized I&O codes. The system was developed by NIOSH and released for public use in December 2012.
This system is used by occupational researchers, federal government agencies, state health departments (vital statistics, cancer registries, etc.), and other organizations that collect and/or evaluate information using I&O. Its purpose is to provide a tool that reduces the high cost of manually coding I&O information while simultaneously improving uniformity of the codes.
NIOCCS is available free of charge and requires only internet access and a web browser for use. Users are required to register for a NIOCCS account if they wish to upload files of records for coding.
NIOCCS Primary System Features:
- Industry and Occupation Coding
- Single Record or Batch File Processing
- Automatic and Computer-Assisted Coding
- Selection of Census I&O Classification Scheme
- NIOCCS codes text to the Census Industry and Occupation classification schemes
- Census 2000 or 2002 options are available as of December 2012
- Census 2010 to be added in calendar year 2014
- Associated NAICS and/or SOC codes can be included in the coded output
- Selection of Confidence Level for Automatic Assignment of I&O Codes
- Crosswalk I&O Coding
- Single Record or Batch File Processing
- Automatic and Computer-Assisted Coding
- Selection of input and output target Census I&O coding schemes
- NIOCCS crosswalks from one Census I&O classification scheme to another
- Crosswalks include Census 1990, 2000, 2002, 2010 as of August 2013
- Associated NAICS and/or SOC codes can be included in the crosswalked output
- Ability to crosswalk forward or backward
- File History Reporting
- User Support
- NIOCCS User Manual and supporting documentation
- Frequently Asked Questions
- Industry and Occupation Support website
- NIOCCS Email Contact for Questions and to Provide Feedback
NIOSH strongly recommends that users be trained in I&O coding prior to using the NIOCCS system. NIOCCS is not intended to take the place of trained I&O coders. Using the computer-assisted features of this system will still require trained I&O coders with the knowledge needed to use the system for selecting the appropriate I&O codes.
NIOSH provides I&O coding training classes several times a year. Requests for training can be made on the NIOSH I&O Coding website at: http://www.cdc.gov/niosh/topics/coding/training.html
If attending a training class is not possible, it is recommended that a copy of the instruction manuals for using 2000 or 2002 Census coding schemes be reviewed (also found on the above website). The instruction manuals were developed for the I&O training class and can be used as a guide for determining industry and occupation codes when using the NIOCCS computer-assisted feature.
For more information about NIOCCS, users can contact the NIOCCS Support Team in one of three ways:
- Submit a question or suggestion using the following NIOCCS form: http://wwwn.cdc.gov/niosh-nioccs/ContactNotLoggedIn.aspx
- Send an email to NIOCCS@CDC.gov
- Contact one of the following NIOSH staff:
NIOCCS codes industry and occupation text according to the Bureau of Census Industries and Occupations Classification System (Code Lists) (http://www.census.gov/people/io/) supplemented with special codes for non-paid workers, non-workers, and the military.
The NIOCCS Coding Engine uses multiple coding processes and two coding paths, Autocoding and Computer-Assisted coding, chosen with the help of user-selected confidence level settings. The NIOCCS design includes processes that cover phrase-based and word-based matching, exact and proximity matching, and weighted and unweighted matching. Each process is best suited to particular kinds of input, so combining them improves both accuracy and production rates.
A high level view of the NIOCCS coding engine is illustrated in the diagram below.
The NIOCCS Knowledgebase (KB) is designed to handle common industry and occupation combinations and common misspellings. It is the first process in the coding engine. Input records that have an exact match in the KB will be automatically coded and will not need to be processed through further coding algorithms. The NIOCCS KB was developed using one million records coded by the Bureau of Census on Census surveys and 260,000 death certificate records coded by NIOSH. These records were reviewed by expert NIOSH I&O coders for inclusion in the KB. The initial NIOCCS KB has approximately 40,000 records.
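The exact-match step described above can be pictured as a dictionary lookup. The following is a minimal sketch, not the NIOCCS implementation: the `normalize` helper, the `KNOWLEDGEBASE` contents, and the placeholder codes are all assumptions made for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


# Hypothetical knowledgebase: (industry text, occupation text) -> (industry code, occupation code).
# The codes are placeholders, not real Census codes.
KNOWLEDGEBASE = {
    ("hospital", "registered nurse"): ("IND-A", "OCC-A"),
    # A common misspelling stored as its own entry.
    ("hosptial", "registered nurse"): ("IND-A", "OCC-A"),
}


def kb_lookup(industry_text: str, occupation_text: str):
    """Return the stored code pair on an exact match, or None to fall through
    to the later coding processes."""
    key = (normalize(industry_text), normalize(occupation_text))
    return KNOWLEDGEBASE.get(key)
```

A record such as `kb_lookup("Hosptial", "Registered Nurse")` would be coded directly from the table, while an unmatched record returns `None` and would continue to the remaining coding processes.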
NIOCCS makes use of Confidence Levels (CL) to decide the coding path, i.e., Autocoding or Computer-Assisted coding. Records that meet the user-specified autocode confidence level setting will be automatically coded. Records that fall below the confidence level setting are made available in the computer-assisted coding module.
|Confidence Level (CL) Setting options|
|If records are processed using the HIGH confidence level setting, then only matched candidates where NIOCCS has 90% or greater confidence of accuracy will be automatically coded.|
|If records are processed using the MEDIUM confidence level setting, then only matched candidates where NIOCCS has 70% or greater confidence of accuracy will be automatically coded.|
|If records are processed using the LOW confidence level setting, then only matched candidates where NIOCCS has 30% or greater confidence of accuracy will be automatically coded.|
NOTE: A higher confidence level (CL) setting will normally result in higher accuracy of the coded results; however, it may reduce the number of records automatically coded. See Chapter 5 in the NIOCCS User Manual for more information about the NIOCCS Autocoding Confidence Levels.
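The routing rule behind the three settings can be sketched as a simple threshold comparison. The thresholds come from the table above; the function name `route` and the representation of confidence as a fraction between 0 and 1 are assumptions for illustration only.

```python
# Minimum confidence required for automatic coding under each setting
# (90%, 70%, and 30%, as listed in the table above).
THRESHOLDS = {"HIGH": 0.90, "MEDIUM": 0.70, "LOW": 0.30}


def route(candidate_confidence: float, setting: str) -> str:
    """Return the coding path for a matched candidate under the chosen setting."""
    threshold = THRESHOLDS[setting.upper()]
    if candidate_confidence >= threshold:
        return "autocoded"
    return "computer-assisted"
```

For example, a candidate with 75% confidence would be autocoded under the MEDIUM setting but sent to the computer-assisted module under HIGH.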
The I&O Restriction Filter is an inter-dependency arbitrator. Industry and occupation codes are sometimes inter-dependent: one industry title may map to more than one industry code, and the most accurate one can be decided only by considering the occupation information; likewise, one occupation title may map to more than one occupation code, and only the industry code can narrow them down to the most appropriate one. NIOCCS therefore assigns the industry code first and then the occupation code, because in most cases the occupation codes are restricted by industry codes. If more than one set of industry and occupation codes remains and cannot be screened further, all possible candidates are output together with their confidence levels. See Chapter 18.104.22.168 in the NIOCCS User Manual for more information on industry restriction rules.
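One way to picture the restriction step is as a filter over candidate pairs. This is only a sketch under stated assumptions: the `ALLOWED` table, the placeholder codes, and the choice to multiply the two confidences into a pair score are all hypothetical and are not taken from the NIOCCS documentation.

```python
# Hypothetical restriction table: industry code -> occupation codes permitted in that industry.
ALLOWED = {
    "IND-SCHOOL": {"OCC-TEACHER", "OCC-JANITOR"},
    "IND-HOSPITAL": {"OCC-NURSE", "OCC-JANITOR"},
}


def restrict(industry_candidates, occupation_candidates):
    """Keep only compatible (industry, occupation) pairs, best score first.

    Each candidate list holds (code, confidence) tuples. The pair score is the
    product of the two confidences, which is an assumption made for this sketch.
    """
    pairs = []
    for ind_code, ind_conf in industry_candidates:      # industry is considered first
        permitted = ALLOWED.get(ind_code, set())
        for occ_code, occ_conf in occupation_candidates:
            if occ_code in permitted:                    # occupation restricted by industry
                pairs.append((ind_code, occ_code, ind_conf * occ_conf))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

When several pairs survive the filter, the function returns all of them with their scores, mirroring the behavior described above of outputting every remaining candidate.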
Crosswalk Coding Engine
Crosswalk coding is the mapping of a code from one I&O classification coding scheme to another I&O classification coding scheme or to a different code within the same I&O coding scheme for a different year.
The crosswalk coding engine uses stored tables that have the code mappings for each year and each scheme. Only exact match processing is used.
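A table-driven, exact-match crosswalk might look like the sketch below. The scheme labels, the placeholder codes, and the one-to-many mapping shape are assumptions for illustration; the actual NIOCCS tables are not reproduced here.

```python
# Hypothetical mapping tables keyed by (source scheme, target scheme).
# A code may map to more than one target code, so values are lists.
CROSSWALKS = {
    ("census2000", "census2002"): {
        "C00-001": ["C02-001"],
        "C00-002": ["C02-002", "C02-003"],
    },
    ("census2002", "census2000"): {
        "C02-001": ["C00-001"],
        "C02-002": ["C00-002"],
        "C02-003": ["C00-002"],
    },
}


def crosswalk(code: str, source: str, target: str) -> list:
    """Return the target codes for an exact match on `code`, or an empty list."""
    table = CROSSWALKS.get((source, target))
    if table is None:
        raise KeyError(f"no crosswalk table from {source} to {target}")
    return table.get(code, [])
```

Because only exact matching is used, a code that differs in any character (for example, in letter case) finds no mapping; storing separate tables per direction is one simple way to support both forward and backward crosswalks.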
Benchmarks for production and accuracy rates were established during the system requirements phase of the NIOCCS project. Accuracy rates are the percentage of correctly assigned I&O codes among all the paired phrases automatically assigned I&O codes by the system. Accuracy rates were determined by comparing the computer generated I&O codes from NIOCCS with the codes assigned in validated test data sample sets. The production rate was determined by calculating the percent of records coded automatically by the system.
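The two rates defined above can be computed as follows. This is a minimal sketch; the record layout, a pair of (automatically assigned code or `None`, validated reference code), is an assumption made for illustration.

```python
def rates(records):
    """Compute (production rate, accuracy rate) as fractions.

    `records` is a list of (auto_code, reference_code) pairs, where auto_code
    is None when the system did not code the record automatically.
    """
    total = len(records)
    autocoded = [(auto, ref) for auto, ref in records if auto is not None]
    # Production rate: share of all records that were coded automatically.
    production = len(autocoded) / total if total else 0.0
    # Accuracy rate: share of automatically coded records that match the reference.
    correct = sum(1 for auto, ref in autocoded if auto == ref)
    accuracy = correct / len(autocoded) if autocoded else 0.0
    return production, accuracy
```

Note that the accuracy denominator is the number of automatically coded records, not the total number of records, so a stricter confidence setting can raise accuracy while lowering production.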
The accuracy and production rates from NIOSH’s predecessor system, the Standardized Occupation and Industry Coding (SOIC) system, were used as the minimum acceptable benchmark targets for the NIOCCS system performance. Based on information from the SOIC website (http://www.cdc.gov/niosh/soic) and from SOIC user feedback, the average production rate for SOIC is 80% with an average accuracy rate of approximately 75% for automatically coded death certificate records.
Death Certificate data
High Confidence Level setting: 90% accuracy rate
Medium Confidence Level setting (SOIC): 80% production rate, 75% accuracy rate
NIOSH used 50,696 cases from a death certificate based surveillance system to compare the autocoded results from NIOCCS with the manually assigned codes of NIOSH expert coders. The manually assigned codes included a 100% quality control verification process. NIOCCS accuracy and production rates are shown below.
|Death Certificate Data|
|Confidence Level Setting||Production Rate||Matched manually assigned codes on both I&O||Matched manually assigned codes on industry only||Matched manually assigned codes on occupation only|
Coding results will vary and depend upon overall quality of the source data. Different data sources may render significantly different accuracy and production rates. Structured and detailed data sources will have higher accuracy and production rates than data sources with liberal text, insufficient information, or numbers or symbols included in the text.
The first release of NIOCCS uses only the industry and occupation text to assign codes. Records that contain employer name and/or job duties will not code at the same rate of accuracy or production as records containing only industry and occupation. This is because the additional pieces of information (employer and job duties) can conflict and/or provide more detailed information that could alter the I&O codes assigned. Including this information can be helpful, however, when using the computer-assisted coding module to ensure that appropriate codes are assigned manually. Future releases of NIOCCS will incorporate the use of job duties and employer information in the autocoding process.
NIOSH compared coded results using two surveys coded by NIOSH coders, each containing data for employer and job duties along with industry and occupation text. Both surveys were coded with 100% quality control verification. The accuracy and production rates using NIOCCS to autocode these two surveys are shown below.
|Survey Data||Confidence Level||Production Rate||Matched manually assigned codes on both I&O||Matched manually assigned codes on industry only||Matched manually assigned codes on occupation only|
|Multi-Ethnic Study of Atherosclerosis (MESA)|
Number of records: 8,163
|Reasons for Geographic And Racial Differences in Stroke (REGARDS)|
Number of records: 9,550
Internet bandwidth will significantly affect the interactivity of the computer-assisted coding.
The Autocoding process may take a significant amount of time when the volume of the data is large. The turnaround time for autocoding may also depend on the traffic in the queue of coding jobs.
File Size Limitations
Upload file size is currently limited to 1 MB. The number of records this equates to will vary depending on how many of the optional fields on the input file format are used. For files that use only the required fields, 1 MB should equate to approximately 10,000 records. This limit may be adjusted over time depending on server performance and/or improvements in the system or architecture.
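Taken at face value, the figures above imply roughly 100 bytes per record when only the required fields are used (about 1,000,000 bytes divided by 10,000 records). Larger data sets would need to be divided into several uploads. The helper below is a sketch of one way to do that; the function name, the reading of the limit as 1,000,000 bytes, and the one-record-per-line layout are assumptions, and header rows are not handled.

```python
LIMIT_BYTES = 1_000_000  # assumption: the one-megabyte limit read as 10**6 bytes


def split_into_uploads(lines, limit=LIMIT_BYTES):
    """Group record lines into batches whose encoded size stays within `limit`."""
    batches, current, size = [], [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_bytes > limit:
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        batches.append(current)
    return batches
```

With 100-byte records and the default limit, each batch would hold 10,000 records, which is consistent with the estimate given above.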
Coding directly to NAICS and SOC
NIOCCS coding is based on the Bureau of Census I&O Classification schemes. NAICS and SOC codes can be obtained through NIOCCS; however, they are limited to the detail provided in the Census Alphabetic Indexes. Users cannot code directly to NAICS and SOC codes.
Coding to Census 1990
Although the original system requirements state that NIOCCS will include coding to the Census 1990 classification scheme, NIOSH decided to drop this capability primarily due to limitations on time in obtaining large quantities of records coded and verified in the Census 1990 scheme for the knowledgebase. Priority was focused on the Census 2000 & 2002 classification schemes in order to meet project deadlines. Additionally, it was determined that those who still want to code in Census 1990 could use the NIOSH Standardized Occupation & Industry Coding (SOIC) software to do so. (Visit the SOIC website for more information at: http://www.cdc.gov/niosh/soic/)
Coding only Industry or only Occupation
Due to the complex inter-dependency between industry and occupation (see Chapter 1.7 of the NIOCCS User Manual, the I&O Restriction Filter of the Coding Engine, for more information), coding results will not be as productive when coding only industry text or only occupation text. NIOSH has added special rules to handle situations where only industry or only occupation text is provided, but testing revealed that more work needs to be done in this area.
The NIOCCS system will be continually improved over time. The NIOCCS project team will continue to test and identify adjustments that can be made and user feedback will be key in identifying and prioritizing improvements. NIOCCS system architecture was developed to enable the following types of ongoing system improvements:
The NIOCCS KB will be continually evaluated as NIOSH coding and IT staff analyze more coded data to identify the refinements that could be made to the knowledgebase to improve accuracy and efficiency.
As more data are processed and studied, the internal parameters (such as the weight of each process, the weight of keywords, etc.) will be adjusted toward optimal values to increase accuracy and production.
Special Coding Rules
Specific rules for unique industry or occupation titles will be added as needed to improve coding accuracy. Each rule will be tested and approved by expert coders before being added to the system and will be periodically validated so that invalid or obsolete rules are removed.
The following specific enhancements are top priorities for future releases of NIOCCS:
- Autocoding to the Census 2010 Classification Coding Scheme
- Improve coding only industry or only occupation
- Allow input of NAICS Codes for Industry