Linking Data From Health Surveys and Electronic Health Records: A Demonstration Project in Two Chicago Health Center Clinics

Introduction: Monitoring and understanding population health requires conducting health-related surveys and surveillance. The objective of our study was to assess whether data from self-administered surveys could be collected electronically from patients in urban, primary-care, safety-net clinics and subsequently linked and compared with the same patients’ electronic health records (EHRs).

Methods: Data from self-administered surveys were collected electronically from a convenience sample of 527 patients at 2 Chicago health center clinics from September through November 2014. Survey data were linked to EHRs.

Results: A total of 251 (47.6%) patients who completed the survey consented to having their responses linked to their EHRs. Consenting participants were older, more likely to report fair or poor health, and took longer to complete the survey than those who did not consent. For 8 of 18 categorical variables, overall percentage of agreement between survey data and EHR data exceeded 80% (sex, race/ethnicity, pneumococcal vaccination, self-reported body mass index [BMI], diabetes, high blood pressure, medication for high blood pressure, and hyperlipidemia), and of these, the level of agreement was good or excellent (κ ≥0.64) except for pneumococcal vaccination (κ = 0.40) and hyperlipidemia (κ = 0.47). Of 7 continuous variables, agreement was substantial for age and weight (concordance coefficients ≥0.95); however, with the exception of calculated survey BMI and EHR–BMI (concordance coefficient = 0.88), all other continuous variables had poor agreement.

Conclusions: Self-administered and web-based surveys can be completed in urban, primary-care, safety-net clinics and linked to EHRs. Linking survey and EHR data can enhance public health surveillance by validating self-reported data, filling gaps in patient data, and extending sample sizes obtained through current methods. This approach will require promoting and sustaining patient involvement.


Introduction
Monitoring and understanding population health requires conducting health-related surveys and surveillance. The Behavioral Risk Factor Surveillance System (BRFSS), for example, is a state-based system of telephone surveys that collect data on health-risk behaviors, chronic conditions, use of preventive services, and health-related quality of life (HRQoL) of adults (1). BRFSS can be modified to assess emerging and urgent health issues and provides data on measures typically unrecorded in the clinical setting (eg, exercise, HRQoL, health attitudes, awareness, health knowledge) (2,3). Searching for new data sources is important, however, because population-based surveys can be costly and time-consuming and may produce biased results that are hard to generalize (1,4-12).
Expanded use of electronic health records (EHRs), complete with appropriate protection of patient confidentiality, can help improve the design and delivery of public health interventions and clinical care; data in EHRs can be used to help find new causes of infectious disease and to address outbreaks by triggering public health alerts, providing recommendations to clinicians, and enhancing communications between public health practitioners and clinical organizations (11-13). Additionally, EHRs can help identify patients needing medical care, disease management, preventive health services, and behavioral counseling (2,3,14-17). EHRs can also help control rising health care costs by eliminating unnecessary tests, procedures, and prescriptions (17).
EHRs may help improve patient care and population health when linked to survey data and other information about health-related behavior, HRQoL, and details about working and living conditions (2,3,18). For people managing a chronic illness, for example, the EHR can validate responses, because survey answers can be linked to recorded clinical events. Likewise, behaviors (eg, exercise) recorded in a recent survey could trigger alerts and recommendations back through the EHR. Inclusion of patient-reported measures in EHRs can enhance patient-centered care, patient health, and capacity to conduct population-based research (2,3).
The objective of this study was to explore the feasibility of electronically collecting self-administered patient survey data in urban, primary-care, safety-net clinics and subsequently linking and comparing those data with patients' EHR data.

Methods
Alliance of Chicago Community Health Services (Alliance; http://alliancechicago.org/) is a federal Health Center Controlled Network. Alliance approached 4 of its network health centers about project participation, selecting them for their large patient volume, diverse geographic locations, distinct and diverse patient populations, and history of participation in new initiatives. Although 3 health centers approved the project, only 1 was able to participate in the project's timeframe. We implemented our study in 2 of that health center's clinics, and it was approved by that clinic's research review committee.
We recruited clinic patients aged 18 years or older by using fliers and announcements in waiting areas and in check-in procedures. Survey administrators used standardized scripts to summarize the survey's goals for interested patients. Participants reviewed an electronic consent form and received a hard copy of the form; they provided separate informed consent for survey participation and for subsequent survey-EHR linkage. Each survey participant received a modest incentive (regardless of consent to EHR linkage). Survey administrators were available to assist patients throughout data collection.
From September 2014 through November 2014, a convenience sample of 527 patients completed the self-administered, web-based survey on various brands of electronic tablets, desktop computers, and cellular phones. Tablet data plans were purchased to minimize impact on health center resources and to minimize data connectivity issues.
Questions from the Illinois BRFSS (http://app.idph.state.il.us/brfss/) were used to collect information on patients' sociodemographic characteristics, health behaviors, chronic conditions, receipt of preventive care services, and medical care. Questions related to chronic conditions were selected on the basis of their ability to be matched to data available in the EHR. Questions on medication use, laboratory findings, and blood pressure readings helped us compare data on self-reported chronic conditions with EHR content. The number of questions each participant was asked was determined by sex (eg, sex-specific preventive care services), age (eg, age-specific cancer screenings), and survey responses that determined question branching. The survey took an average of 20 to 30 minutes to complete and was hosted by using the Survey Analytics Online Survey Platform (Survey Analytics LLC).
Of 527 survey participants, 47% (n = 251) consented to have their survey responses linked to their EHR; 99% (n = 248) of these consenting patients had an EHR. At the end of the survey and EHR extraction, 2 de-identified analytic data sets were created: 1) a set that contained only the survey data of patients who did not consent to the EHR linkage and 2) a set that contained the survey and EHR data of patients who consented to EHR linkage.
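The split into 2 analytic data sets described above can be illustrated with a minimal Python sketch. This is not the study's extraction code: the field names (`patient_id`, `consent_linkage`) and the choice to keep consenting patients without an EHR in the linked set with an empty EHR entry are assumptions made for illustration only.

```python
def build_analytic_sets(surveys, ehr_by_patient):
    """Split survey records into a survey-only set and a linked survey-EHR set.

    `surveys` is a list of dicts with hypothetical keys "patient_id" and
    "consent_linkage"; `ehr_by_patient` maps a patient id to EHR fields.
    """
    survey_only, linked = [], []
    for rec in surveys:
        # Drop the identifier and consent flag so the output is de-identified.
        answers = {k: v for k, v in rec.items()
                   if k not in ("patient_id", "consent_linkage")}
        if rec["consent_linkage"]:
            # Assumption: a consenting patient with no EHR stays in the
            # linked set with ehr=None and is filtered out before any
            # concordance analysis.
            ehr = ehr_by_patient.get(rec["patient_id"])
            linked.append({"survey": answers, "ehr": ehr})
        else:
            survey_only.append({"survey": answers})
    return survey_only, linked


# Consent and match rates reported in the text.
consent_rate = 100 * 251 / 527   # share of survey participants who consented
match_rate = 100 * 248 / 251     # share of consenting patients with an EHR
```

Under these assumptions, concordance analyses would use only the linked records whose `ehr` entry is not `None`, which corresponds to the 248 patients described in the text.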
When possible, differences in categorical variable construction between survey data and EHR data were resolved by collapsing the original categories to form a common metric. Continuous variables except for blood pressure were constructed similarly in the survey instrument and the EHR. Patients reporting that a health care professional said they had high blood pressure (HBP) were asked to enter their systolic and diastolic blood pressure. For the EHR abstraction, the last 3 systolic and diastolic blood pressure readings were taken, and the mean systolic and diastolic pressures were calculated.

Self-reported weight and height were assessed using 2 survey questions: "About how much do you weigh without shoes?" and "About how tall are you without shoes?" Patients were classified as underweight (body mass index [BMI, kg/m²] <18.5), normal weight (BMI 18.5-<25), overweight (BMI 25-<30), or obese (BMI ≥30). Self-classified BMI was assessed with the survey question, "Would you classify your weight as low (underweight), normal weight, overweight, or obese?"

We calculated the distribution of the study population by survey duration, sociodemographic characteristics, and self-rated health status, overall and by consent to EHR linkage. For categorical variables, we used the χ² test to assess significant differences between those who agreed to survey-EHR linkage and those who did not. For continuous variables, we used the t test to assess significant mean differences between the 2 patient groups.

To assess concordance between survey data and EHR data, we examined 248 patients who consented to the EHR linkage and for whom an EHR record was found. For categorical variables, we applied Cohen's (19) κ coefficient with 4 predefined agreement levels: excellent agreement (κ ≥0.9), good agreement (κ ≥0.6 to κ <0.9), fair agreement (κ ≥0.3 to κ <0.6), and poor agreement (κ <0.3).
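As a rough illustration of the BMI categories and the κ agreement levels defined above, the following Python sketch classifies BMI and computes Cohen's κ between two sets of category labels. The study itself used SAS; the function names are hypothetical, and the sketch assumes weight and height have already been converted to kilograms and meters.

```python
from collections import Counter


def bmi_category(weight_kg, height_m):
    """Classify BMI (kg/m^2) using the cut points given in the text."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal weight"
    if bmi < 30:
        return "overweight"
    return "obese"


def cohen_kappa(survey_labels, ehr_labels):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(survey_labels)
    observed = sum(s == e for s, e in zip(survey_labels, ehr_labels)) / n
    survey_counts, ehr_counts = Counter(survey_labels), Counter(ehr_labels)
    categories = survey_counts.keys() | ehr_counts.keys()
    # Chance agreement from the marginal distribution of each source.
    expected = sum(survey_counts[c] * ehr_counts[c] for c in categories) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both sources use one category")
    return (observed - expected) / (1 - expected)


def kappa_level(kappa):
    """Map kappa to the 4 predefined agreement levels."""
    if kappa >= 0.9:
        return "excellent"
    if kappa >= 0.6:
        return "good"
    if kappa >= 0.3:
        return "fair"
    return "poor"
```

In practice, one list would hold each patient's category from self-reported height and weight and the other the category from EHR height and weight, in the same patient order.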
Because we observed some cases that may belong to the κ paradox (20), we also calculated overall agreement in percentage (= 100 × the number of concordant counts/the total sample size). For continuous variables, we applied Lin's (21,22) concordance correlation coefficient (ρc) with 4 predefined agreement levels: almost perfect (ρc >0.99), substantial (ρc ≥0.95 to ρc ≤0.99), moderate (ρc ≥0.90 to ρc <0.95), and poor (ρc <0.90) (23). For all analyses, P < .05 was considered significant, and data were analyzed in SAS version 9.3 (SAS Institute, Inc).
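The overall percentage of agreement and Lin's concordance correlation coefficient described above can be sketched in Python as follows. This is an illustrative sketch, not the SAS code used in the study; the function names are hypothetical, and the coefficient uses the standard definition ρc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²) with variances and covariance divided by n.

```python
def percent_agreement(survey_labels, ehr_labels):
    """Overall agreement: 100 x concordant counts / total sample size."""
    concordant = sum(s == e for s, e in zip(survey_labels, ehr_labels))
    return 100 * concordant / len(survey_labels)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalizes systematic over- or under-reporting.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


def ccc_level(rho_c):
    """Map the coefficient to the 4 predefined agreement levels."""
    if rho_c > 0.99:
        return "almost perfect"
    if rho_c >= 0.95:
        return "substantial"
    if rho_c >= 0.90:
        return "moderate"
    return "poor"
```

Unlike a Pearson correlation, this coefficient falls below 1 when one source is shifted by a constant relative to the other, which is why it suits comparisons such as self-reported versus recorded weight.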

Results
Participant ages ranged from 18 to 87 years (mean, 43.4 y; standard deviation [SD], 14.7 y) (Table 1). Most participants were non-Hispanic black (90.4%), female (70.4%), and never married or a member of an unmarried couple (61.2%); most spoke English as their primary language (96.3%) and had Medicaid or Medicare as primary health insurance coverage (69.1%). More than 70% reported their health status as excellent, very good, or good, and 62.7% reported no disability. Most had annual household incomes less than $20,000, rented their primary residence, and had no children in the household.
Seven health behaviors of the convenience sample were distributed as follows: always or nearly always wearing a seat belt (90.0%), watching or reducing sodium intake (60.2%), consuming one or more drinks of alcohol in the past 30 days (59.6%), engaging in leisure-time physical activity (58.3%), consuming 5 or more servings daily of fruits and vegetables (42.2%), currently smoking cigarettes (35.4%), and increasing medication use in the past 30 days without the advice of a health care professional (8.9%) (Figure). Compared with patients who did not agree to having their survey results linked to their EHRs, those who agreed were older (mean 45.4 y vs 41.5 y, P = .003), more likely to report fair or poor health (32% vs 24%, P = .03), and took longer to complete the electronic survey (27.8 minutes vs 21.3 minutes, P < .01) (Table 1).

Discussion
This study compared results from a self-administered web-based survey with de-identified patient data from EHRs in an urban primary-care setting. We found a satisfactory degree of concordance between survey data and EHR data for nonmodifiable demographic characteristics and for some health-related measures: diabetes, HBP, HBP medication, weight, and calculated categorical and continuous BMI. We found lower levels of concordance for modifiable sociodemographic characteristics, pneumococcal vaccination, hyperlipidemia, self-classified BMI, hemoglobin A1c among patients reporting diabetes, and blood pressure among patients reporting hypertension, especially diastolic pressure. EHR data on self-reported health-risk behaviors were unavailable for comparison; data on tobacco use screening were available.
Fewer than half the surveyed patients gave EHR linkage consent; those consenting showed significant differences from those who did not. Similar to other researchers' findings (24), those consenting were older and more likely to report fair or poor health. Unlike other research findings (24), however, we did not find significant differences by sex, employment status, or type of health insurance coverage. Further investigation into what factors may increase consent or enhance patient engagement could aid project sustainability and representativeness of the patient population.
Generally, our concordance findings were consistent with studies that have used similar methods (7,25,26). Our level of agreement was similar to that of previous research assessing data quality between ambulatory medical record data and patient survey data for diabetes and BMI, but we had a higher level of concordance for HBP, HBP medication, and hyperlipidemia and a lower level of concordance for lipid-lowering medication (26). Additionally, we had substantial agreement for weight and, in contrast, poor agreement for height. We also found good overall agreement for BMI based on self-reported height and weight (86%) but poor overall agreement for self-classified BMI (20%).

Studies show people generally overestimate their height and underestimate their weight and BMI (6). This reporting bias varies, however, by the demographic characteristics of the study population (eg, sex, race, age). For example, men are more likely to exaggerate their height than women are. Our convenience sample was predominantly female, non-Hispanic black, and aged 45 to 64 years. Differences between self-report and direct measures may also be due to the respective population's sociocultural perceptions of body weight and may be biased by social desirability (6). Our results demonstrate the need for direct measures that validate self-reported data, because patients were more likely to perceive themselves as in a lower BMI category than their calculated BMI category showed. Our results may also reflect patients' lack of awareness of their BMI and, consequently, a greater risk of poor health outcomes. Further research is needed to fully understand and address this finding (eg, improved patient-provider communication, obesity screening and intervention). Because our results are not generalizable to the health center's patient population or to other patient populations (convenience sample and apparent selection bias), they should be interpreted with care.
Survey and EHR data may have poor concordance for many reasons and may show where each data source can help improve the accuracy and completeness of patient and population data. When survey and EHR clinical measures are not concordant, EHR data tend to be more accurate than survey data because biases associated with self-report vary (5-7). For example, correctly remembering the date of one's last tetanus shot or hemoglobin A1c test result is difficult. For modifiable sociodemographic characteristics, however, self-reported data are likely to be more accurate than EHR data, because busy health centers have few resources or incentives to update nonclinical data elements. Institutional incentives also may influence poor concordance, as when a sliding fee scale could encourage under-reporting of income or private health care coverage or when health insurance plans charge higher premiums to consumers who smoke (27).

Our study has several limitations. First, we used a convenience sample of patients from 2 Chicago health center clinics. This sample selection bias limits our ability to make inferences to the health center's patient population across all its clinic sites and its comparability to other patient populations in the area; the sample was predominantly female, non-Hispanic black, unmarried, and low income; patients had public insurance coverage and were more likely to have a cellular telephone or an email address than a landline telephone. As a result, for public health surveillance, multiple data collection modes and data sources may be needed to effectively reach and ensure the representativeness of data for population subgroups. Moreover, public health professionals and policy makers must be aware of subpopulations that are unconnected to the health care system and whose members have limited health records or lack them entirely (4). Second, fewer than 50% of the patients surveyed consented to EHR linkages. Further analysis of the factors associated with consent, and which are amenable to modification, is needed to access the wealth of data available in EHRs. Third, analysis of the linked data found some variables with low prevalence that prevented further assessment of agreement. Fourth, some variables had good agreement but low κ scores, suggesting that agreement may have occurred by chance alone. Finally, neither data source may be considered a gold standard for all items measured. For example, survey data may have inherent biases, and EHR data and the data extraction process may have complexities that are not fully known or accounted for. Nevertheless, these limitations may change over time with meaningful use of EHRs, advancements in health information technologies, and emphasis on quality and patient-centered care as well as implementing new methods that integrate lifestyle measures into prescribed health care (eg, prescribed physical activity) (2-4,28).
These limitations notwithstanding, a symbiotic relationship exists between survey data and clinical data. Self-reported data are needed to augment clinical data for medical services (eg, immunizations, screenings, behavioral counseling), imaging and other diagnostics, and medications obtained outside of the patient's health center (2,7). Self-reported measures, although subject to biases, are vital to providing a complete picture of patient health, because many health-related measures may not be in the EHR (eg, behaviors, HRQoL, health attitudes, awareness, knowledge) or up-to-date (eg, modifiable sociodemographic characteristics) (2,17,29,30). At the same time, EHRs can be used to validate self-reported clinical measures and facilitate the development of correction factors that can be applied to self-reported data in the absence of physical measurement, which is often costly or not possible (6). In unison, the 2 data sources have the potential to improve disease management, reduce costs, and enhance two-way data exchange between public health and clinical organizations.
As health systems and their information technologies continue to evolve, researchers should continue the search for high-quality patient health data. Doing so can help health practitioners, public health professionals, and policy makers successfully evaluate and reduce existing health disparities. Furthermore, public health policy and practice can be guided by data science methods (including predictive analytics) by using combined data sources. Population-based surveys, EHRs, and other data sources all have a role in providing a more complete picture of the health of all Americans, while improving their health and access to care. To this end, this project demonstrated the feasibility of computer-assisted collection of consumer survey data and matching it to EHR data. This approach can enhance health information from unique, often underrepresented populations with health disparities, increase efficiency and breadth of surveillance activities, and improve validity of objective measures. More research is needed to promote and sustain patient involvement in their health and health records, which is vital to the success and sustainability of this approach.

Tables
Abbreviations: BMI, body mass index; CI, confidence interval; EHR, electronic health record; GED, general equivalency degree; HBP, high blood pressure; HPV, human papillomavirus; N, number of eligible patients included in item-level analysis.
a Defined as the number of concordant counts (both answered yes or both answered no in 2 sources) divided by the total sample size and expressed as a percentage. κ ≥0.9 = excellent agreement, κ ≥0.6 to κ <0.9 = good agreement, κ ≥0.3 to κ <0.6 = fair agreement, and κ <0.3 = poor agreement.
b Includes respondents who reported their ethnicity as non-Hispanic and their race as American Indian or Alaska Native, Asian or Asian American, Native Hawaiian or Pacific Islander, mixed race, or some other race. In EHR data, patients coded as Hispanic or Latino did not have a race code. Similarly, patients with a race value did not have an ethnicity code.
c Includes unemployed, homemaker, and unable to work. Patients coded as unemployed in EHR are categorized as other.
d Patients who responded "Yes, through my school" or "Yes, I purchased on my own" on survey were not included in this analysis because EHR data did not have equivalent categories.
e Self-reported BMI was based on 2 survey questions: "About how much do you weigh without shoes?" and "About how tall are you without shoes?" and compared with the EHR's BMI based on the EHR's height and weight variables. Patients who responded "Don't know/not sure" to either question or who were missing an EHR value were not included in this analysis.
f Self-classified BMI was based on the survey question, "Would you classify your weight as: low (underweight), normal, overweight, or obese?" and compared with calculated BMI based on the EHR's height and weight variables. Patients who responded "Don't know/not sure" or who were missing an EHR value were not included in this analysis.
Abbreviations: BMI, body mass index; BP, blood pressure; CI, confidence interval; EHR, electronic health record; SD, standard deviation.
a Number of eligible patients included in item-level analysis.
b Substantial agreement = ρc ≥0.95 to ρc ≤0.99; poor agreement = ρc <0.90.
c Self-reported BMI was based on 2 survey questions ("About how much do you weigh without shoes?" and "About how tall are you without shoes?") and compared with EHR's BMI based on EHR's height and weight variables. Patients who responded "Don't know/not sure" to either question or who were missing an EHR value were not included in this analysis.
d Last hemoglobin A1c among patients who reported being told by a health professional that they had diabetes.
e Among patients who reported being told by a health professional that they had high blood pressure.