II. Standards and Suggested Practices
F. Data management and cleaning
Once CTR data have been collected, transferred, and integrated into a health department’s central database, the data must then be cleaned for analysis and submission to CDC. Many CTR data errors can be prevented or resolved during the data collection or data entry stage. Because it is impossible to eliminate all data errors during the data collection and entry phases, systematic error checks must be completed during the data management and cleaning phase.
Figure 10 illustrates the data management and cleaning phase of the CTR data life cycle, beginning with the central CTR database. The central CTR database consists of data that have been collected and transferred by service providers and then integrated at the health department level. These data should always be retained and accessible in their original form. As new (or updated) CTR data are transferred by service providers, they should be appended to the central database.
Data-cleaning procedures include identifying and resolving data errors, recoding data, and computing new variables. Health departments should develop standard methods for handling missing, inaccurate, or out-of-range data and should include a rationale for, and description of, these procedures in their protocol. Once data have been cleaned, additional data procedures—including recoding or reformatting variables and computing new variables—may also be necessary to prepare the database for analysis, reporting, and submission to CDC. Throughout the data-cleaning process, the data manager or analyst should write clear explanations of all data errors, reconciliation decisions, recodes, and computation procedures. This documentation should be included in the protocol.
Figure 10. Health department data management and cleaning stage
Standard F1: Health departments should develop and maintain a data dictionary that includes variable names, value labels, and definitions of all CTR data variables.
A data dictionary is a standardized document that provides information about, and descriptions of, each variable and its values. A data dictionary serves as a reference document during data management for generating lists of common data errors, writing data-cleaning programs, and providing a common language for data managers and analysts. See the following list of the elements (and their definitions). You are encouraged to use this CDC document as a resource for developing a data dictionary for your health department and your service providers.
Data elements of a data dictionary
Variable name: Name of the variable used to provide information about what is being collected and is often restricted to 8 characters or less (e.g., HIVCTR1)
Variable label:Provides more detailed information on the variable and can contain variable name (i.e., HIVCTR1.TestingandReferrals)
Definition: Description of the variable
Value: Description of the allowable numeric values (e.g., 1, 0, 9)
Value label: Descriptions of allowable values (e.g., 1 is “Yes,” 0 is “No,” and 9 is “Not checked”)
Format/length: Denotes the type of variable—date, numeric, or alphanumeric (format). The
length is an indication of the number of allowable characters for each variable.
Requirement: Denotes the requirement classification of each variable. For example, CDC uses the following categories:
- Program-required: A variable that answers national monitoring and evaluation questions or enables the monitoring of program announcements and cooperative agreements.
All program-required variables are reported to CDC.
- System-required: A variable required for system functions (e.g., to populate other parts of the system, to link data within the system or within CDC’s national database, or to save a
record). All system-required variables are reported to CDC.
- Optional: A non-required variable that, if completed, is reported to CDC.
- Not reported: A non-required variable that, if completed, is not reported to CDC.
Notes: Document common errors or issues associated with the variable, skip pattern instructions, and limitations of using a data variable. For example, if the value options of a variable change over time (e.g. a priority population is added), this must be considered during analysis and presentation of data.
- NHME data set (CDC) [cited previously as Resource 27]‡
Complete list of National HIV Prevention Program Monitoring and Evaluation (NHME) variables (variable number, variable name, value choices [if applicable], definition, instructions, and CDC’s reporting requirements for each variable).
Standard F2: Health departments should implement data screening procedures that will identify aggregate-level and record-level data errors in the central CTR database.
Health department staff should develop and maintain a list of all common data errors associated with the collection and entry of their jurisdiction’s CTR data variables and develop standard computerized screening programs to identify those errors. Common errors identified during data screening are missing or incomplete data, out-of-range values, internally inconsistent values, duplicate records, and transcription errors. Individual variables and sets of variables should routinely be analyzed to identify invalid combinations of data values. These analyses are followed by visual examination of output to identify aggregate-level errors. Analytical techniques that are commonly used to identify these data errors include the following.
Simple frequency check: Running simple frequency checks on individual variables and reviewing the output can help to identify invalid and missing values that have not been found during QA procedures for data collection and entry. For example, the frequency counts of the “site ID” variable allow the identification of invalid site IDs (e.g., IDs that do not correspond to sites within the health department’s jurisdiction) or missing values. Also, simple descriptive information about the variable’s distribution (e.g., mean, median, mode, minimum and maximum values) or plots of data values that allow for a graphical display of data values (e.g., stem leaf, normal probability) can be checked to determine whether they fall outside an acceptable or expected range of values. For example, in Table 1, the frequency of HIV tests by age range reveals a higher than expected percentage of tests among clients who are 80 years of age and older. In this instance, further examination of these specific cases may uncover an underlying data error.
Table 1. Frequency distribution of HIV tests, by age range
|Age (years)||HIV tests||Percentage of total|
|Younger than 13||46||0.2|
|80 or older*||203*||1.1*|
*Indicates unusually high numbers.
Logical relationship check: Using cross-tabulated frequency counts to check for logical relationships between variables is a straightforward strategy for identifying invalid combinations of data values. In Table 2, the variable “Previous HIV Test?” is cross-tabulated with “Self-Reported HIV Test Result.” The cross-tabulation of the variables “Not Previously Tested” and “Self-Reported HIV Test Result” reveals internally inconsistent data—five invalid values that need to be resolved. That is, if a client reports having never received an HIV test, any value reflecting a prior test result would be inconsistent with that response. A simple cross-tabulation of these two CTR variables reveals these internally inconsistent values and prompts data verification and reconciliation.
Table 2. Cross-tabulation identifying internally inconsistent values
|Self-reported HIV test result|
|Previous HIV test||0
*Indicates inconsistent values.
Monitoring changes: By observing data on critical variables over time, health departments can compare observed findings with expected findings or compare observed findings with program goals. Large variations from expected findings may indicate data errors or programmatic changes (e.g., in funding). For example, in Table 3, the numbers of HIV cases during 2006 and 2007 are compared by risk factor (increases and decreases of more than 25% are highlighted). When changes of more than 25% are observed, health department staff may need to contact service provider staff to determine what accounts for the increases (e.g., additional staff hired, venues added) and the decreases (e.g., staff turnover, lack of testing supplies, delays in data entry). This will help the health department staff interpret the changes over time, which may result from changes in program activities or errors in the data.
Table 3. Risk factors associated with HIV-positive test results, 2006 and 2007
|Risk factor||HIV-positive tests||Percent difference|
|Male-to-male sexual contact and injection drug use||288||190||-34.03*|
|Male-to-male sexual contact||3,462||2,624||-24.21|
|Injection drug use||1,871||1,335||-28.65*|
|High-risk heterosexual contacta||10,633||9,027||-15.10|
*Indicates a large decrease.
†Indicates a large increase.
a Includes persons who had sex with partner at risk, who had a diagnosis of a sexually transmitted disease, who exchanged sex for drugs or money, who had sex while using noninjection drugs, who were victims of sexual assault, or whose only risk factor was high-risk heterosexual contact.
b Includes persons who did not report a risk factor, whose record contained no data on risk factors, or who reported other risk factors (i.e., perinatal exposure, hemophilia, receipt of blood transfusion, or health care exposure).
Locating record-level data errors: Once aggregate-level data errors have been identified, record-level errors can be resolved. The next data-cleaning task in resolving these errors is to determine the specific records in which invalid, duplicate, or missing values occur. This process involves generating a list of client ID numbers for the selected variables with errors and then locating the records associated with each error. For example, in Table 4, the HIV test date for Client 0009 is invalid (i.e., 07/06/70 rather than 07/06/07), and the two entries for Client 0003 on the same date are duplicates
Table 4. Record-level error check: out-of-range values or duplicate records
|Client ID||HIV test (date)||HIV test results provided (date)|
†Indicates a duplicate record.
*Indicates an out-of-range value.
- Cody’s data cleaning techniques using SAS Software‡
Cody, R. (1999). Cody’s data cleaning techniques using SAS software. Cary, NC: SAS Institute, Inc.
Step-by-step instructions for developing data cleaning programs and macros; includes validation checks (e.g., character data, numeric data, missing values, and date values), searching for duplicate records, double data entry and verification by using the COMPARE procedure, SQL-based solutions, and creating data sets by using general-purpose cleaning and validation programs.
- Example of a CTR data error report [cited previously as Resource 37]
Example of an error report from the health department to the service provider (including errors identified on each form to assist in correction and resubmission) and a cover letter to service providers.
Standard F3: Health departments should implement data-cleaning procedures that will resolve errors and result in a clean CTR database.
Once data errors have been identified through screening procedures, they should be resolved according to decision rules that have been developed by data management staff. Because of limited resources and time, health departments may have to prioritize data variables for error resolution. (For a list of common types of data errors, and possible resolutions, see Appendix C.) Health departments should send service providers standard error reports that include information about the specific record-level errors that have been identified by screening procedures so that these errors can be resolved.
For example, if a report indicates that a “male” was “pregnant,” how can staff determine which client characteristic is incorrect? The discrepancy cannot be resolved without additional information. Health department staff may need to contact service providers to check alternate sources of information, such as medical records or other client forms. In settings where names are not collected or in the absence of an alternative source that can resolve the discrepancy, health department staff may need to establish how errors should be corrected (e.g., accept “male” but not “pregnant”). Inevitably, CTR program policies and procedures change, so it is important to periodically revisit data management plans to ensure that they are current and appropriate.
In addition, health departments should send service providers regular summary reports that provide information about error patterns and provide technical assistance, as needed, to resolve continuing issues. Once errors have been resolved, a final aggregate-level check (e.g., frequency check, cross-tabulation, trend analysis) should be performed to ensure that errors have been addressed.
Standard F4: Health departments should routinely document all decisions and procedures related to data management and cleaning.
CTR data staff should routinely document details of changes to the original data (e.g., recodes, statistical procedures, data transformations, imputation or exclusion of outliers or missing data). Documenting the decisions and procedures concerning data management and cleaning is critical because it ensures standardization; procedures can be replicated consistently by all data management staff. Without clear guidance, it may be difficult to resolve errors associated with the interpretation of specific variables.
Documentation of decisions and procedures should include the following:
- List and description of possible data errors associated with CTR data variables
- Data reports that summarize aggregate-level frequencies
- Error reports that list records by type of error (e.g., duplicate ID numbers; missing, incomplete, out-of-range values)
- Decisions about how data errors were resolved and the rationale for each decision
- Algorithms for recoding variables or computing new variables
- Computer programs that identify errors by the use of queries, frequency counts, cross-tabulations, along with comments to describe what each program does
- Computer programs that resolve errors (e.g., by using if-then statements)