Creating the Data Dictionary
The data dictionary is an essential part of the RDC proposal. It outlines the data the researcher will provide and the data being requested. During the proposal process, it is used to assess the disclosure risk of the project. Once the proposal is approved, it helps the RDC Analyst merge the data and confirm that the data provided are the data approved for the project.
The Data Dictionary
There are three parts to the data dictionary: public data, restricted data, and non-NCHS data.
- Public Data – Please select only the variables from the public data that are necessary to answer your research question. We will not merge the entire public file to restricted data.
- Restricted Data – Many of the restricted variables are listed (link to restricted variables); however, these lists are not exhaustive. Reviewing the questionnaire will help determine exactly what data are available. If you need additional help after consulting those resources, please contact the RDC.
- Non-NCHS Data – If you wish to have variables added from another data source, please provide a list of those variables. Please do not exceed 100 variables.
The Data Dictionary Formats:
- Please list or provide in a table format the following information:
  - File the data is coming from (e.g., NHIS 2000 person file)
  - Variable Names
  - Variable Descriptions
- Please use the examples below to help you construct your data dictionary.
- Highlight the variables in each dictionary that will be used in the merge. These variables must be formatted consistently between data sets for the merge to go smoothly and cost-efficiently.
- The data dictionary in your proposal gives the Review Committee an idea of what your data set will contain for the purpose of assessing the disclosure risk. When your RDC Analyst creates your data set, the Analyst may need additional information (SAS set-up statements, ASCII data files, etc.). Be prepared to discuss the actual merge with your RDC Analyst if approved.
- We strongly encourage you to work with the public data prior to submitting a proposal. If you have already compiled the public use and non-NCHS data sets for the project, you can use your statistical program to run lists of the variables (for example, a SAS proc contents) and submit those as your data dictionary.
- NHDS users: Instead of providing separate public and restricted data dictionaries, please provide one data dictionary that lists the variables necessary for your research from the NHDS Restricted Variables Codebook.
- If you forgot a variable during the proposal process and want to add it before the merge happens, please contact your Analyst.
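As a rough illustration of the table format described above, the following Python sketch writes a data dictionary listing with the source file, variable name, variable description, and a flag for the variables used in the merge. The second file label, the variable names other than PUBLICID, and all descriptions are made-up placeholders rather than actual NCHS content; a variable listing from your statistical program (such as a SAS proc contents) would serve the same purpose.

```python
import csv
import io

# Hypothetical entries: (source file, variable name, description, used in merge?)
# All names and descriptions are placeholders for illustration only.
ENTRIES = [
    ("NHIS 2000 person file", "PUBLICID", "Public use identifier", True),
    ("NHIS 2000 person file", "AGE", "Age of respondent", False),
    ("Researcher file (hypothetical)", "PUBLICID", "Public use identifier", True),
    ("Researcher file (hypothetical)", "EXPOSURE", "Exposure measure", False),
]


def build_dictionary(entries):
    """Return the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(
        ["File", "Variable Name", "Variable Description", "Merge Variable"]
    )
    for source_file, name, description, is_merge_key in entries:
        writer.writerow(
            [source_file, name, description, "Yes" if is_merge_key else "No"]
        )
    return buffer.getvalue()


def merge_variables_by_file(entries):
    """Group the flagged merge variables by the file they come from."""
    keys = {}
    for source_file, name, _description, is_merge_key in entries:
        if is_merge_key:
            keys.setdefault(source_file, []).append(name)
    return keys


print(build_dictionary(ENTRIES))
```

Grouping the merge variables by file makes it easy to see at a glance whether every file in the dictionary carries the same key before the merge is discussed with the Analyst.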
Data Dictionary Examples
Because all data systems are slightly different, the data dictionary may come in a variety of styles. Please see the various examples for the data systems.
For NHIS, the PUBLICID is a compound variable that can be used to link files of different levels (household, family, and person). Be sure to retain the component variables, and follow the variable names, formats, and lengths specified in the documentation for that data system when you are creating your public use subset files.
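To illustrate why the component variables and their lengths matter, here is a minimal Python sketch that assembles a compound identifier from fixed-width components and compares the keys of two files. The component names (HOUSEHOLD, FAMILY, PERSON) and their widths are assumptions chosen for illustration; the actual names, formats, and lengths are those given in the documentation for the data system.

```python
# Hypothetical component names and widths; consult the data system's
# documentation for the real variable names, formats, and lengths.
COMPONENT_WIDTHS = {"HOUSEHOLD": 6, "FAMILY": 2, "PERSON": 2}
ALL_LEVELS = ("HOUSEHOLD", "FAMILY", "PERSON")


def build_compound_id(record, levels=ALL_LEVELS):
    """Concatenate zero-padded components into one fixed-width identifier."""
    parts = []
    for name in levels:
        width = COMPONENT_WIDTHS[name]
        value = str(record[name]).strip()
        if not value.isdigit() or len(value) > width:
            raise ValueError(f"{name}={value!r} does not fit width {width}")
        parts.append(value.zfill(width))
    return "".join(parts)


def unmatched_keys(left_records, right_records, levels=ALL_LEVELS):
    """Return the keys found only in the left file and only in the right file."""
    left = {build_compound_id(r, levels) for r in left_records}
    right = {build_compound_id(r, levels) for r in right_records}
    return sorted(left - right), sorted(right - left)


person = {"HOUSEHOLD": 123, "FAMILY": "1", "PERSON": " 2"}
print(build_compound_id(person))                  # person-level key
print(build_compound_id(person, ("HOUSEHOLD",)))  # household-level key
```

Because the components are kept separately, the same record can produce a household-level, family-level, or person-level key. Padding every component to a fixed width means that values stored as numbers in one file and as text in another still yield identical keys.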