Policies to Protect Geographic, Temporal, and Perturbed/Masked Information
Policies to Protect Geographic Information
During the proposal process, decisions about how geographic variables will be used are specified. The RDC Analyst will apply the following techniques when creating analytic data sets.
- True Geography Remains in the Data: If the geographic variables are being used to make estimates at a lower level of geography than is publicly available, true geography can remain in the data set. However, little NCHS data is representative at geographic levels not available on the public files; this is therefore the least common use of geography in the RDC.
- Coarsening Geography: If lower levels of geography can be grouped into larger areas, the RDC Analyst will create the coarsened variable and not provide access to the underlying lower level of geography. Analysts may request that the researcher write the code for how they want the variables created; researchers are always welcome to provide and review the code for created variables.
- Removing All Geography: If the geographic variables are being used to merge NCHS data to another source of data and are not needed for analysis, the geographic variables are removed.
- Randomizing Geography: If the geographic variables are being used for the merge and for analytic purposes, random versions of the variables are substituted.
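The coarsening and randomizing techniques above can be sketched in code. The following is a minimal illustration only, not the RDC's actual procedure: the field names (`county_fips`, `region`, `area_id`), the lookup table, and both function names are hypothetical, and "random versions" is read here as replacing each true area code with a randomly assigned pseudo-ID so that records from the same area still group together.

```python
import random

# Hypothetical county-to-region lookup; real groupings would come from the proposal.
COUNTY_TO_REGION = {"01001": "South", "36061": "Northeast", "06037": "West"}


def coarsen_geography(records, lookup, fine_key="county_fips", coarse_key="region"):
    """Return copies of the records with the fine geography replaced by a coarse one."""
    out = []
    for rec in records:
        # Copy every field except the fine-level geography.
        new = {k: v for k, v in rec.items() if k != fine_key}
        new[coarse_key] = lookup[rec[fine_key]]
        out.append(new)
    return out


def randomize_geography(records, fine_key="county_fips", pseudo_key="area_id", seed=None):
    """Return copies of the records with true area codes swapped for random pseudo-IDs."""
    rng = random.Random(seed)
    codes = sorted({rec[fine_key] for rec in records})
    pseudo_ids = list(range(1, len(codes) + 1))
    rng.shuffle(pseudo_ids)
    # One pseudo-ID per true code, so within-area grouping is preserved.
    mapping = dict(zip(codes, pseudo_ids))
    out = []
    for rec in records:
        new = {k: v for k, v in rec.items() if k != fine_key}
        new[pseudo_key] = mapping[rec[fine_key]]
        out.append(new)
    return out


if __name__ == "__main__":
    sample = [
        {"id": 1, "county_fips": "01001"},
        {"id": 2, "county_fips": "36061"},
        {"id": 3, "county_fips": "01001"},
    ]
    print(coarsen_geography(sample, COUNTY_TO_REGION))
    print(randomize_geography(sample, seed=2024))
```

In both functions the true geographic code is absent from the returned records, and the input records are left unmodified. In this sketch the code-to-pseudo-ID mapping exists only inside the function and is never returned.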
Policies to Protect Temporal Information
Temporal variables not included on the public files pose similar challenges and require similar safeguards.
- Coarsening Dates: If coarsened dates (e.g. year, month, or quarter, not the exact date) are needed for merging or analysis, the RDC Analyst should create the coarsened variable and never provide access to the underlying exact date. Analysts may request that the researcher write the code for how they want the variables created; researchers are always welcome to provide and review the code for created variables.
- Creating Variables of Time from Dates: If exact dates are being used to calculate time (e.g. exact length of life calculated from DOB and DOD), the exact dates should be used only for data management, and the resulting variables (e.g. length of time) should be used in analysis. ANDRE users will be required to submit the data management code that creates those values, and the exact dates will be dropped before their data set is loaded on the system. This may also be enforced on site.
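As an illustration of the two temporal techniques above, the sketch below coarsens an exact date to year, quarter, and month, and derives a length-of-life variable before dropping the exact dates. It is a minimal example under stated assumptions: the field names (`dob`, `dod`, `days_lived`) and function names are hypothetical, and measuring length of life in days is an arbitrary choice for the example.

```python
from datetime import date


def coarsen_date(d):
    """Reduce an exact date to year, quarter, and month; the day is discarded."""
    return {"year": d.year, "quarter": (d.month - 1) // 3 + 1, "month": d.month}


def add_length_of_life(record, dob_key="dob", dod_key="dod"):
    """Derive days lived from the exact dates, then drop the exact dates."""
    # Copy every field except the exact dates of birth and death.
    new = {k: v for k, v in record.items() if k not in (dob_key, dod_key)}
    new["days_lived"] = (record[dod_key] - record[dob_key]).days
    return new


if __name__ == "__main__":
    print(coarsen_date(date(2015, 8, 17)))
    person = {"id": 1, "dob": date(2000, 1, 1), "dod": date(2000, 1, 31)}
    print(add_length_of_life(person))
```

The returned record carries only the derived duration, so the analytic data set can be built without the exact dates ever appearing in it.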
Policies to Protect Perturbed and Masked Information
Perturbation and masking are methods that change potentially sensitive information so that it can be made public. If the method is revealed, the public files could be compromised, so every effort must be made to protect "changed" variables. The following are examples of "changed" public variables that cannot be included in restricted data sets that contain their restricted counterparts.
- Public and restricted mortality variables
- Pseudo/masked and True PSU and Strata variables for NHANES
- Public Use and In-House PSU and STRATUM variables for the NHIS
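A rule of this kind can be checked mechanically before a data set is released to a researcher. The sketch below flags any data set whose columns include both the public and the restricted version of a variable. It is only an illustration: the variable names in `CHANGED_PAIRS` and the function name are hypothetical placeholders, not actual NCHS variable names.

```python
# Hypothetical (public, restricted) variable-name pairs; real names will differ.
CHANGED_PAIRS = [
    ("mortstat_public", "mortstat_restricted"),
    ("psu_masked", "psu_true"),
    ("stratum_public", "stratum_inhouse"),
]


def find_conflicts(columns, pairs=CHANGED_PAIRS):
    """Return every pair whose public and restricted versions both appear in columns."""
    present = set(columns)
    return [(pub, res) for pub, res in pairs if pub in present and res in present]


if __name__ == "__main__":
    requested = ["id", "psu_masked", "psu_true", "mortstat_restricted"]
    for pub, res in find_conflicts(requested):
        print(f"Conflict: {pub} and {res} cannot appear together")
```

An empty result means no listed pair appears together; it says nothing about pairs missing from the list, so the list itself would need to be maintained by whoever knows the full set of changed variables.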