Rochester Epidemiology Project Data Exploration Portal

Introduction
The goal of this project was to develop an interactive, web-based tool to explore patterns of prevalence and co-occurrence of diseases using data from the expanded Rochester Epidemiology Project (E-REP) medical records-linkage system.

Methods
We designed the REP Data Exploration Portal (REP DEP) to include summary information for people who lived in a 27-county region of southern Minnesota and western Wisconsin on January 1, 2014 (n = 694,506; 61% of the entire population). We obtained diagnostic codes of the International Classification of Diseases, 9th edition, from the medical records-linkage system in 2009 through 2013 (5 years) and grouped them into 717 disease categories. For each condition or combination of 2 conditions (dyad), we calculated prevalence by dividing the number of persons with a specified condition (numerator) by the total number of persons in the population (denominator). We calculated observed-to-expected ratios (OERs) to test whether 2 conditions co-occur more frequently than would be expected as a result of chance alone.

Results
We launched the first version of the REP DEP in May 2017. The REP DEP can be accessed at http://rochesterproject.org/portal/. Users can select 2 conditions of interest, and the REP DEP displays the overall prevalence, age-specific prevalence, and sex-specific prevalence for each condition and dyad. Also displayed are OERs overall and by age and sex and maps of county-specific prevalence of each condition and OER.

Conclusion
The REP DEP draws upon a medical records-linkage system to provide an innovative, rapid, interactive, free-of-charge method to examine the prevalence and co-occurrence of 717 diseases and conditions in a geographically defined population.



Introduction
Changes in health information technology during the last decade and an increasing demand for data sharing and transparency have increased public access to health-related data. In particular, several web-based tools have been developed to share local, state, and national health data with audiences ranging from the general public to public health agencies and epidemiologic researchers (1-10) (Table 1). These websites and interactive tools are intended to help communities throughout the United States understand the health of their county or state and to prioritize interventions. For example, County Health Rankings and America's Health Rankings summarize and display data on factors important to health and health management (2,5). However, the data used by these sites are collected from cross-sectional surveys of various groups of people at single points in time (eg, the Behavioral Risk Factor Surveillance System [BRFSS]) (11,12). These sites have data that can be summarized individually, by demographic characteristics, and by geographic region, but exploration of associations between different conditions or across data sets is not possible because the data are rarely linked to identifiable individuals. In addition, survey data are self-reported, and the number and type of health conditions included are limited.
Other interfaces are available from the Centers for Medicare & Medicaid Services (CMS Data Navigator) (7), the Dartmouth Atlas of Health Care (8), and the Agency for Healthcare Research and Quality's Healthcare Cost and Utilization Project website (HCUPnet) (9). These websites allow users to summarize Medicare and Medicaid claims information (7,8) as well as information from state and national inpatient and emergency department databases (13). The CMS Data Navigator provides access to published reports on various topics (7), and users can explore the prevalence and co-occurrence of 20 chronic conditions (through the Medicare Chronic Conditions Dashboard) (14,15). The Dartmouth Atlas website tools also aggregate and summarize Medicare data but allow for more customized queries focused on health care utilization and outcomes (8). However, these sites provide limited information on specific diseases and conditions, particularly those that are rare. In addition, Medicare predominantly serves the population aged 65 years or older. Therefore, these sites are of limited use for understanding the health of younger populations.
Finally, the HCUPnet website allows users to explore detailed data available from State and National Inpatient Databases, the State Ambulatory Surgery and Services Database, and State and National Emergency Department Databases (9). Information is available across all ages and both sexes, and on all conditions that occur during an inpatient or emergency department visit. However, these data sets lack information on outpatient visits, and the interactive tools do not allow users to explore associations across conditions. Our objective was to develop an interactive, web-based tool, the Rochester Epidemiology Project (REP) Data Exploration Portal (DEP), to display data on the prevalence and co-occurrence of 717 conditions from the expanded REP medical records-linkage system, which collects data from participating health care providers in a 27-county region of southern Minnesota and western Wisconsin (16-18).

Methods
Development of the REP DEP took place from May 2016 through May 2017. The Mayo Clinic and Olmsted Medical Center institutional review boards approved this project. We designed the tool to allow users to access summary information on the health conditions of people in a 27-county region of southern Minnesota and western Wisconsin. First, we coded information on 717 conditions. We then calculated 1) the prevalence of each selected condition, 2) the prevalence of combinations of 2 conditions (dyads), and 3) observed-to-expected ratios (OERs) to measure the excess co-occurrence of dyads.

Data source
From 1966 through 2010, the REP focused on the health of the Olmsted County, Minnesota, population (16,17). In 2010, the REP expanded to encompass people living in a 27-county region of southern Minnesota and western Wisconsin (18). The expanded REP (E-REP) captures data on all health conditions that come to medical attention at the participating health care providers in this region. The data are electronically available at the person-level for community members of all ages and are collected from all health care providers in the REP system, from inpatient records, outpatient records, and emergency departments (17,18).
We used REP DEP summary data for all people who lived in this region on January 1, 2014, and who were identified by using the E-REP infrastructure (18,19). The REP DEP includes data for nearly 700,000 persons (61% of the entire population in the region) (Table 2). Characteristics of the REP DEP population are similar to those of the entire 27-county region and of Minnesota and Wisconsin (Table 3). The age and sex distribution is also largely similar to that of the entire US population (20); however, people living in the 27 counties, compared with the entire US population, have a higher level of education and are less likely to be of a nonwhite race or Hispanic ethnicity (Table 3).
The REP DEP includes information only for persons who have given permission for their medical records to be used for research purposes (91% of the sample population) (18). All information is available in aggregate summary form only, and the REP DEP reports values only when an age, sex, and/or county stratum contains 11 or more people.
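The small-cell rule described above can be illustrated with a minimal sketch. It assumes that the threshold of 11 applies to the number of people in the stratum being reported; the function name and the use of `None` to mark a suppressed value are hypothetical choices for illustration and are not taken from the portal itself.

```python
from typing import Optional

# Minimum stratum size stated in the text; smaller strata are not reported.
MIN_STRATUM_SIZE = 11


def reportable_prevalence(cases: int, stratum_size: int) -> Optional[float]:
    """Return the stratum prevalence, or None when the stratum is suppressed.

    Assumption: suppression is based on the number of people in the stratum.
    """
    if stratum_size < MIN_STRATUM_SIZE:
        return None  # too few people: value withheld
    return cases / stratum_size


# A stratum of 10 people is suppressed; a stratum of 12 people is reported.
small = reportable_prevalence(3, 10)
large = reportable_prevalence(3, 12)
```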

Medical conditions
We developed the REP DEP to offer information on 717 conditions. We obtained diagnosis codes of the International Classification of Diseases, 9th edition (ICD-9) from patient medical records from January 1, 2009, through December 31, 2013, and we grouped codes by using 2 coding systems. First, we grouped ICD-9 codes into categories defined by the Agency for Healthcare Research and Quality as part of the Healthcare Cost and Utilization Project (21). We used the Clinical Classifications Software (CCS) to define a total of 690 conditions: 283 main-level, 376 sub-level, and 31 sub-sub-level code groupings (22,23). Second, we created 20 additional groupings by using diagnosis code categories defined by the US Department of Health and Human Services for studying multiple chronic conditions (24); we also added anxiety disorders to this list for a total of 21 chronic condition groupings.
Finally, we identified a series of 6 mental and neurological conditions that were well characterized by a single ICD-9 code, and we created a REP-defined sublevel grouping (Alzheimer's disease; dementia with Lewy bodies; Huntington's chorea; restless legs syndrome; amyotrophic lateral sclerosis; and mild cognitive impairment). The complete list of ICD-9 codes defining each of the 717 conditions is available on the REP DEP website: http://rochesterproject.org/portal/.

Prevalence
We calculated the prevalence of each condition and the prevalence of dyads. A person was determined to have a condition if the medical record showed one or more diagnostic codes from the corresponding code grouping in the 5-year period before January 1, 2014. For each condition, we calculated prevalence by dividing the number of people with a specified condition (numerator) by the total number of people in the population (denominator). We calculated prevalence overall for a single condition and for dyads, and in strata by age, sex, and county.
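The calculation above can be sketched as follows. This is a minimal illustration only: the toy population, the representation of each person as a set of condition labels, and the function name are assumptions made for the example and do not describe the REP DEP implementation.

```python
from typing import Iterable, Optional, Set


def prevalence(population: Iterable[Set[str]], a: str, b: Optional[str] = None) -> float:
    """Proportion of people who have condition a (and also b, for a dyad)."""
    people = list(population)
    wanted = {a} if b is None else {a, b}
    # Numerator: people whose recorded conditions include every wanted condition.
    cases = sum(1 for conditions in people if wanted <= conditions)
    # Denominator: everyone in the population, with or without the condition.
    return cases / len(people)


# Toy population of 5 people (hypothetical data).
toy_population = [
    {"asthma", "depression"},
    {"asthma"},
    {"depression"},
    set(),
    {"asthma", "depression"},
]

single = prevalence(toy_population, "asthma")              # 3 of 5 people
dyad = prevalence(toy_population, "asthma", "depression")  # 2 of 5 people
```

Stratified prevalence follows the same pattern: the population is first restricted to one age, sex, or county stratum, and the same function is applied to that subset.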

Observed-to-expected ratios (OERs)
We calculated observed-to-expected ratios (OERs) to measure the excess co-occurrence of dyads (25,26). We divided the number of observed people with 2 conditions by the expected number of people with both conditions under the assumption of conditional independence. We computed the expected numbers of people at the single-year-of-age level. For example, the expected number of people with both conditions for the age stratum 0 to 20 years was calculated for single years of age from 0 to 20 and then summed. An OER of less than 1.0 indicates that fewer people were observed with co-occurring conditions than would be expected under the assumption of conditional independence. An OER greater than 1.0 indicates that more people with co-occurring conditions were observed than would be expected under the assumption of conditional independence.
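A minimal sketch of this calculation is shown below. It assumes that independence is conditional on single year of age, so that the expected number with both conditions at a given age is the number with condition A multiplied by the number with condition B, divided by the number of people of that age. The data layout, the toy counts, and the function name are hypothetical.

```python
from typing import Dict, Tuple

# age in years -> (people, people with A, people with B, people with both)
AgeCounts = Dict[int, Tuple[int, int, int, int]]


def observed_to_expected(counts: AgeCounts) -> float:
    """OER for an age stratum, with expected counts summed over single years of age."""
    observed = 0
    expected = 0.0
    for n, n_a, n_b, n_ab in counts.values():
        if n == 0:
            continue  # no people of this age, nothing to add
        observed += n_ab
        # Expected count under independence within this single year of age.
        expected += n_a * n_b / n
    if expected == 0.0:
        raise ValueError("expected count is zero; the OER is undefined")
    return observed / expected


# Toy stratum covering ages 40 and 41 (hypothetical counts).
toy_counts = {
    40: (100, 20, 10, 4),
    41: (200, 40, 20, 8),
}
oer = observed_to_expected(toy_counts)
```

In this toy stratum the expected count is 2 at age 40 and 4 at age 41, and 12 people are observed with both conditions, so the two conditions co-occur twice as often as expected.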
We determined whether the OER differed significantly from 1.0 by calculating 95% confidence intervals directly from the Poisson distribution using Daly's method (27). We used ColorBrewer version 2.0 to illustrate the range of OERs in color (28). Prevalence and OER values for each county were directly standardized by age and sex to the total 2010 US Decennial Census population (Appendix) to facilitate comparison across counties while accounting for differences in age and sex distributions (20).
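The interval calculation can be sketched as follows. This is not a reproduction of the procedure in reference 27: the sketch computes standard exact Poisson limits for the observed count by bisection and divides them by the expected count, and it is an assumption that this matches the published method in every detail. All function names are hypothetical.

```python
import math
from typing import Tuple


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), with terms computed in log space."""
    if k < 0:
        return 0.0
    if mu <= 0.0:
        return 1.0
    log_mu = math.log(mu)
    total = sum(math.exp(i * log_mu - mu - math.lgamma(i + 1)) for i in range(k + 1))
    return min(total, 1.0)


def exact_poisson_ci(x: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided confidence limits for the mean of an observed count x."""

    def solve(k: int, target: float) -> float:
        # The CDF decreases as the mean grows, so bisect on the mean.
        lo, hi = 0.0, x + 10.0 * math.sqrt(x + 1.0) + 20.0
        for _ in range(200):
            mid = (lo + hi) / 2.0
            if poisson_cdf(k, mid) > target:
                lo = mid  # CDF still too high: the mean must be larger
            else:
                hi = mid
        return (lo + hi) / 2.0

    lower = 0.0 if x == 0 else solve(x - 1, 1.0 - alpha / 2.0)
    upper = solve(x, alpha / 2.0)
    return lower, upper


def oer_with_ci(observed: int, expected: float, alpha: float = 0.05) -> Tuple[float, float, float]:
    """OER with confidence limits, treating the expected count as fixed."""
    lower, upper = exact_poisson_ci(observed, alpha)
    return observed / expected, lower / expected, upper / expected


# Hypothetical dyad: 10 people observed with both conditions, 5 expected.
ratio, ci_low, ci_high = oer_with_ci(10, 5.0)
significant = ci_low > 1.0 or ci_high < 1.0
```

An OER is treated as significantly different from 1.0 when the interval excludes 1.0. In this hypothetical example the interval includes 1.0, so the OER of 2.0 would not be flagged.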

Results
The first version of the REP DEP was launched in May 2017 and can be accessed at http://rochesterproject.org/portal/. To search for a condition, users can click on the box "Characteristic A selection" and start typing the text of the condition of interest. The selection list will narrow to include conditions matching the typed text. The second condition, Characteristic B, is selected in the same way. Users can display results by using the "Prevalence" tab and the "Geography" tab.

Prevalence tab
The prevalence tab for 2 selected conditions shows the prevalence of each condition as a line graph, by sex and overall, across 5 age groups (Figure 1). The tab also shows a graph of the prevalence of the 2 conditions co-occurring, by sex and overall, across 5 age groups. In addition, the tab shows a table of OERs by sex and age group. OERs are not calculated if fewer than 11 persons with both conditions are observed in a group. Similarly, for conditions affecting only one sex (eg, cancer of ovary), "NA [not applicable]" is reported in the table of OERs in the column for the unaffected sex and in the column for both sexes ("Total"). OER values that are significantly different from 1.0 are shaded with purple (OER < 1.0) and orange (OER > 1.0). OER values are not shaded if the OER is not significantly different from 1.0. For example, ovarian cancer and anxiety disorders can never co-occur in men, but they do co-occur more frequently than expected in women aged 40 to 64 years (Figure 1).

Geography tab
Users can also display the prevalence or OER for a selected condition by county and by sex (Figure 2). The standardized prevalence and OERs are displayed in a pop-up box when the cursor hovers over a selected county. The map in the sample screenshot indicates that the age-standardized prevalence of ovarian cancer varies across the 27-county region and is highest in Martin County.
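The county values shown on the map are directly standardized, as described in the Methods. A minimal sketch of direct standardization is given below: each stratum-specific prevalence in a county is weighted by the share of the standard population in that stratum. The strata, the toy numbers, and the function name are hypothetical; the portal uses the 2010 US Decennial Census population as the standard.

```python
from typing import Dict, Hashable


def directly_standardized(
    stratum_prevalence: Dict[Hashable, float],
    standard_population: Dict[Hashable, int],
) -> float:
    """Weighted average of stratum prevalences, weighted by the standard population."""
    total = sum(standard_population.values())
    return sum(
        stratum_prevalence[stratum] * count / total
        for stratum, count in standard_population.items()
    )


# Hypothetical county with 2 age strata for one sex.
county_prevalence = {"0-39": 0.01, "40+": 0.05}
# Hypothetical standard population counts for the same strata.
standard_counts = {"0-39": 600, "40+": 400}

standardized = directly_standardized(county_prevalence, standard_counts)
```

Because every county is weighted by the same standard population, differences between standardized county values are not driven by differences in age and sex structure.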

Discussion
We developed the interactive, web-based REP DEP to display the prevalence and co-occurrence of 717 diseases and conditions recorded in the E-REP records-linkage system. We expect the REP DEP to be useful to local residents, health care practitioners, and local administrators in understanding patterns of disease in this Midwestern region. The data may also serve as a benchmark for other communities and may provide a cost-effective way for researchers to explore whether an association between 2 conditions exists before conducting a full epidemiologic study.
First, the REP DEP includes data on all conditions that come to medical attention, regardless of whether the care was delivered in the outpatient, inpatient, or emergency department setting. As such, it overcomes limitations of other websites that include only a limited number of conditions or only data from inpatient or emergency department settings (2-6,13), and it allows users to obtain prevalence estimates on both common and rare conditions and to include both inpatient and outpatient diagnoses. Second, the REP DEP includes data for all age groups, overcoming the limitations of websites that rely predominantly on Medicare data (7,8). We expect REP DEP prevalence estimates to be particularly useful for public health and care delivery organizations in this 27-county region in ranking their most urgent community health priorities. For example, tax-exempt hospitals must conduct a community health needs assessment every 3 years in compliance with the Patient Protection and Affordable Care Act, and they must develop a community health improvement plan to address the most urgent priorities (29). The REP DEP can identify the prevalence of a wide array of medical conditions, and, in the future, will provide a way to determine whether the prevalence of key conditions changes over time.
REP DEP data are also linked at the person level, which allows users to explore associations between conditions. This type of data exploration is not possible on other websites that aggregate deidentified data from different sources and populations (2-6). In addition, the underlying data included in the REP DEP are linked to patient identifiers through the E-REP research infrastructure (18). With appropriate approvals, the E-REP can be leveraged for recruiting study participants, and these participants may be followed via their linked medical records to cost-effectively assess outcomes that come to medical attention. Therefore, the REP DEP offers a method for determining whether a given community includes a sufficient number of potential participants for a community-based clinical trial (30).
This study has limitations. First, data are available for 61% of the population residing in the 27-county region. Participants may differ from nonparticipants, and prevalence estimates may be biased. Conditions that are diagnosed and treated at health care providers that do not participate in the E-REP may be missed, and the true prevalence of some conditions may be underestimated. The age and sex distribution of the population included in the REP DEP is similar to US Census estimates for the 27-county region, but participants may differ from nonparticipants in other factors that influence health (eg, socioeconomic status).
Second, we informally compared REP DEP prevalences for 5 common chronic conditions with 2015 prevalence estimates for the state of Minnesota from the BRFSS (31); however, we did not perform formal statistical testing for the differences. Prevalence estimates were similar for asthma (REP DEP, 8% vs BRFSS, 7%) and depression (REP DEP, 14% vs BRFSS, 19%). However, REP DEP estimates were higher for diabetes (REP DEP, 14% vs BRFSS, 8%), and lower for arthritis (REP DEP, 15% vs BRFSS, 22%) and hyperlipidemia (REP DEP, 24% vs BRFSS, 32%). These discrepancies highlight the fact that different data collection methods are likely to yield different prevalence estimates. Methodologic differences between the BRFSS and the REP DEP preclude a more formal comparison. The BRFSS estimates were obtained from adult participants reporting whether they had ever been told that they had the condition of interest. By contrast, the REP DEP prevalence estimates were obtained from data on participants of all ages whose medical record had at least one ICD-9 code of interest in a 5-year time frame. The underlying ICD-9 codes were obtained from billing data and were not validated. Therefore, the prevalences and OERs generated by the REP DEP may deviate from the truth. This limitation is common to all publicly accessible databases. In addition, the sensitivity and specificity of a single ICD-9 code for a condition of interest vary (32,33). Therefore, further validation studies may be necessary, depending on the condition of interest. Finally, the BRFSS estimates were for the adult population of the entire state of Minnesota, whereas the REP DEP estimates were for persons of all ages residing in a region that includes southern Minnesota and western Wisconsin. Inclusion of children in the estimates will underestimate the prevalence of chronic diseases that predominantly affect adults. However, variability in prevalence estimates may also reflect true prevalence differences between the REP DEP and BRFSS populations.
Third, ICD-9 codes were grouped into larger categories. Specific diagnoses may have been overly aggregated, resulting in the inability to test for associations of interest. For example, the diagnostic codes for Alzheimer's disease are part of the larger category of "delirium, dementia, and amnestic and other cognitive disorders." However, Alzheimer's disease is a major research focus for many investigators; therefore, we included Alzheimer's disease as an option in our search tool. In the second release of the REP DEP (January 2018), we included a series of more specific conditions. Finally, once we have 3 full years of data accumulation (2014-2016), we will add trend graphs to explore increases or decreases in the prevalence of conditions across calendar years.
The REP DEP covers a geographically defined Midwestern population, and the prevalence of medical conditions will be different in other United States communities, depending on the characteristics of the underlying population. However, the REP DEP data may still serve as a useful benchmark for other communities. In particular, it is often difficult to obtain baseline prevalence data for rare conditions. The REP DEP provides prevalence estimates for all conditions in this population, and it offers a free, rapid way to obtain comparison data. The REP DEP also provides an example of how other communities might leverage and display their own data to inform local planning efforts.
Finally, the underlying biological processes that lead to the development and co-occurrence of diseases and conditions are less likely to vary from community to community. Therefore, the OERs that can be obtained through the REP DEP are likely generalizable to other populations. As such, these data provide an avenue for researchers to determine whether 2 conditions are associated before conducting a larger, resource-intensive epidemiologic study.
The REP DEP provides a rapid, interactive, free-of-charge method to examine the prevalence and co-occurrence of 717 diseases and conditions in a large, Midwestern population. The REP DEP will be useful to local communities for understanding the prevalence of virtually all conditions in this region. In addition, these data may serve as a benchmark for other communities, particularly for rare conditions. The REP DEP can provide preliminary data for investigators who are considering further studies of the co-occurrence of diseases or are assessing the feasibility of a community-based clinical trial.
In January 2018, we released a new version of the REP DEP. This updated version of the portal allows users to choose from among 1,376 characteristics, including diagnosis-based medical conditions, procedures and surgeries, prescription medications, and demographic characteristics (eg, race, ethnicity, smoking status, overweight and obesity categories). In addition, users may now choose to define a characteristic as occurring in either the 5-year period before the prevalence date or the 1-year period before the prevalence date. These updates to the REP DEP give users more flexibility to explore the relationships between characteristics. Complete details can be found in the updated REP DEP User Manual on the portal Documentation tab (http://www.rochesterproject.org/portal/).