Public-use Data Files Program Balances Data Demand and Confidentiality
December 11, 2013
NCHS' public-use data file service gives researchers access to datasets, documentation, and questionnaires from NCHS surveys and data collection systems. Free and downloadable from the NCHS website, public-use data files allow researchers to manipulate the data in a format appropriate for their analyses.
Dr. Eve Powell-Griner, NCHS Confidentiality Officer, says public-use data files are central to helping NCHS fulfill its dual missions: to disseminate the nation's health information as widely as possible and to protect NCHS respondents' confidential information. Carrying out these sometimes-competing mandates has always been a balancing act, but recent changes in data technology make striking that balance more complicated.
The history of public-use data files reads almost like a history of data dissemination technology. NCHS traditionally disseminated data results in print form. In 1971, the Division of Vital Statistics began receiving data in tape format, allowing NCHS to offer files in reel-to-reel format soon after. Tape gave way to floppy disks, and disks gave way to CD-ROMs and DVDs. Today, public-use files have moved to the cloud as free digital files, available to anyone with a computer and a web browser.
As technology made public-use files more accessible, NCHS increased the number of files available. Seventy-nine files were available in 1975; by 1990, according to that year's catalog of electronic data products, the total had grown to more than 500 public-use data files. That number has grown substantially since then. The number of public-use files approved for release by NCHS is now close to 100 per year.
The key word is "approved." As Confidentiality Officer, Dr. Powell-Griner chairs the Center's Disclosure Review Board, which is composed of representatives from each of NCHS' data divisions and offices. The Disclosure Review Board is charged with reviewing all files before approving them for release as public-use files. Dr. Powell-Griner says ensuring confidentiality "is crucial to keeping us in business."
To that end, NCHS has grown more conservative about what is released in public-use files. One particular area of concern is geography, such as county or state of residence. Geographic location of respondents is often of great interest to researchers, and that interest will continue to grow with the introduction of the Affordable Care Act. However, Dr. Powell-Griner says geography also poses one of the greatest disclosure risks, since a respondent's geographical location can be combined with other information to potentially reveal the respondent's identity.
Another area of concern is what is generally referred to as "big data." Today's desktops have more computing power than the university mainframes that ran punch cards in the early days of digital information. This increase in power makes it possible for researchers to access more and more data—and to merge NCHS data with data from other, non-NCHS sources—thus increasing the risk of disclosure.
Although users downloading public-use files are required to comply with data use restrictions to ensure that the information will be used solely for statistical analysis or reporting purposes, Dr. Powell-Griner says that once NCHS releases the data, it has little control over how the data are ultimately used. To keep NCHS data secure, the Center's various programs monitor datasets available from other sources that are similar to the data collected by NCHS. By reviewing these alternate datasets prior to the files' release, NCHS is better able to minimize the disclosure risks of its publicly released data.
Dr. Powell-Griner points out that while technology has provided additional access to data through remote access and expansion of the number of Research Data Centers (RDCs), security concerns in the post-9/11 world pose an unforeseen challenge for some users. While scientists and researchers continue to be among the most frequent users of NCHS data, students working on course work and dissertations represent a growing user component. With changes in rules about access to federal buildings, non-U.S. students and other researchers who want access to restricted data through an RDC face increasing challenges in getting clearance. Noncitizen researchers residing in the U.S. who are affiliated with U.S. institutions can often rely on remote access to analyze restricted NCHS data, but others will have to rely on public-use files, which provide a solid source of current health data but do not have the level of detail and scope of variables offered in an RDC.
While technology can be both a boon and a challenge, it has yet to overcome a basic human trait: the tendency to skip the instructions. The focus on easy data availability, Dr. Powell-Griner says, can unfortunately obscure the need to use the important, and equally easily available, documentation that accompanies the data. Every data file has basic descriptive information that explains how the data were collected, whom the data pertain to, and the quality of the variables. Public-use data files are very large and complex, and they are not easy for the untrained user to make sense of. When people have trouble using NCHS data, she says, it's usually not because the data are faulty; it's more likely because they did not pay close enough attention to the documentation.