Forecasting Participants in the All Women Count! Mammography Program

Introduction: The All Women Count! (AWC!) program is a no-cost breast and cervical cancer screening program for qualifying women in South Dakota. Our study aimed to identify counties with similar socioeconomic characteristics and to estimate the number of women who will use the program for the next 5 years.
Methods: We used AWC! data and sociodemographic predictor variables (eg, poverty level [percentage of the population with an annual income at or below 200% of the Federal Poverty Level], median income) and a mixture of Gaussian regression time series models to perform clustering and forecasting simultaneously. Model selection was performed by using Bayesian information criterion (BIC). Forecasting of the predictor variables was done by using an autoregressive integrated moving average model.
Results: By using BIC, we identified 5 clusters showing the groups of South Dakota counties with similar characteristics in terms of predictor variables and the number of participants. The mixture model identified groups of counties with increasing or decreasing trends in participation and forecast averages per cluster.
Conclusion: The mixture of regression time series model used in this study allowed for the identification of similar counties and provided a forecasting model for future years. Although several predictors contributed to program participation, we believe our forecasting analysis by county may provide useful information to improve the implementation of the AWC! program by informing program managers on the expected number of participants in the next 5 years. This, in turn, will help in data-driven resource allocation.


Introduction
An estimated 1 in 8 women will be diagnosed with breast cancer at some point in their lives (1). In 2014, breast cancer was the second leading cause of cancer death in women in South Dakota, and 608 women were newly diagnosed with the disease (2). More than half of the new breast cancers were diagnosed and reported at a localized (early) stage. Since 1997, South Dakota has administered mammograms and Papanicolaou (Pap) smears to women who qualify under the All Women Count! (AWC!) program (3). From 2012 through 2016, the AWC! program screened more than 3,914 eligible women for breast cancer (4).
AWC! is part of the National Breast and Cervical Cancer Early Detection Program (NBCCEDP), which has been the subject of ample research and reporting (5)(6)(7)(8). The Journal of Cancer Causes and Control published an entire issue dedicated to the effects of NBCCEDP (7), and several studies have reported on the program, including the proportion of women reached and the program's impact on breast cancer mortality rates among low-income women (annual incomes at or below 200% of the Federal Poverty Level). Other articles did not discuss NBCCEDP but discussed disparities in cancer screening among various groups (9,10) with some specific to breast cancer screenings (11)(12)(13). To our knowledge, however, forecasting participation in NBCCEDP has not been done. Forecasting participation in AWC!, an NBCCEDP program, would assist with planning and resource allocation and thereby increase access to timely breast cancer screening among underserved women in South Dakota. The goals of our project were to forecast the number of participants for South Dakota's 66 counties and to identify county clusters within the state that share similar socioeconomic characteristics and that are rural with low populations. We fit a model within each cluster simultaneously by using a finite mixture of Gaussian regression time series models (14).

Data source
The AWC! data set consisted of patient sociodemographic information, residential information, date of visit to health care provider, and medical screenings from 1997 through 2017. Our analysis focused on breast cancer screening, both mammography and clinical breast examination (CBE). The data set did not include counts from other programs operating in South Dakota that offered free or reduced-cost mammograms, including those of the Indian Health Service. We counted only the first mammography visit per year. If a woman had abnormal results and required an additional mammogram, only the first mammogram was counted. Additional data sources were used to gather further predictors. Locations of participating mammography clinics in the state over a 10 year period (2005-2015) were also provided to us by the South Dakota Department of Health. We used the US Census Bureau's Small Area Income and Poverty Estimates (SAIPE) database to obtain the median income of residents and the percentage of the population with annual incomes at or below 200% of the Federal Poverty Level (hereinafter poverty percentage) for each county (15) and the census bureau's Population and Housing Unit Estimates database (16) for estimates by sex, race, and age group. We extracted the population of women aged 40 to 64 from the latter database at the county level by year and used this for our analysis.
The initial AWC! program screening data set contained 63,990 rows with 26,988 unique participants. At the time of our analysis, 2017 data were not complete and were removed, reducing our row count by 2,139 rows. Our analysis was concerned only with breast cancer screening (mammography and CBE), so all participants who received either a CBE or a mammogram were kept. Next, we removed all participants from outside South Dakota, because our analysis included only South Dakota residents. Women aged 30 to 39 are eligible under NBCCEDP to receive a CBE but not a mammogram; this age group was outside the scope of our study and was therefore excluded, reducing the data by 15,783 rows. Ninety-one percent of these removed rows contained data on women aged 30 to 39 who received only CBE. We defined AWC! participants as women aged 40 to 64 who received a CBE or mammogram at least once during a given year, regardless of their number of clinic visits in that year. We then obtained counts of the number of participants per year by county. This yielded a total of 37,922 CBE or mammogram visits from 1997 through 2016. The SAIPE and Population and Housing Unit Estimates data sets containing the 3 predictors (ie, population, poverty percentage, and median income) were joined to these counts on the basis of year and county.
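The counting and joining steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code (the analysis was done in R): the record layout, the field order, the function names, and all values below are hypothetical and chosen only to exercise each exclusion rule.

```python
from collections import defaultdict

# Hypothetical visit records: (participant_id, county, year, age, state, service).
visits = [
    ("A1", "Minnehaha", 2015, 52, "SD", "mammogram"),
    ("A1", "Minnehaha", 2015, 52, "SD", "mammogram"),  # repeat visit, same year
    ("A1", "Minnehaha", 2016, 53, "SD", "CBE"),
    ("B2", "Ziebach", 2015, 35, "SD", "CBE"),          # aged 30 to 39: excluded
    ("C3", "Ziebach", 2015, 47, "SD", "CBE"),
    ("D4", "Ziebach", 2015, 60, "ND", "mammogram"),    # non-resident: excluded
    ("E5", "Minnehaha", 2017, 50, "SD", "mammogram"),  # incomplete year: excluded
    ("F6", "Minnehaha", 2015, 44, "SD", "Pap"),        # not a breast screening
]

BREAST_SERVICES = {"mammogram", "CBE"}


def count_participants(records, first_year=1997, last_year=2016):
    """Count distinct women aged 40 to 64 per (county, year)."""
    seen = defaultdict(set)
    for pid, county, year, age, state, service in records:
        if state != "SD":
            continue
        if not first_year <= year <= last_year:
            continue
        if not 40 <= age <= 64:
            continue
        if service not in BREAST_SERVICES:
            continue
        # A set keeps each woman once per county-year, however many visits.
        seen[(county, year)].add(pid)
    return {key: len(ids) for key, ids in seen.items()}


# Hypothetical predictors keyed by (county, year):
# (population of women aged 40 to 64, poverty percentage, median income).
predictors = {
    ("Minnehaha", 2015): (30000, 25.0, 55000),
    ("Minnehaha", 2016): (30500, 24.5, 56000),
    ("Ziebach", 2015): (400, 60.0, 30000),
}


def join_predictors(counts, predictor_table):
    """Attach predictors to each county-year count (inner join on the key)."""
    return {
        key: (n,) + predictor_table[key]
        for key, n in counts.items()
        if key in predictor_table
    }


counts = count_participants(visits)
joined = join_predictors(counts, predictors)
```

Using a set per county-year is one simple way to honor the "at least once during a given year" definition without sorting visits by date.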

Statistical analysis
Some South Dakota counties are very rural and thus had a small number of participants. By grouping these counties together into clusters, we increased the amount of data used to build our forecasting model, thereby increasing the model's robustness. The advantages of this approach were twofold. First, we could identify similar counties for future program modifications. Second, we took into account that the numbers of participants over the years for a given county were autocorrelated and not independent. These 2 procedures can be done simultaneously by using a finite mixture model of Gaussian regression time series. The model is given by
$$f(y_i \mid X_i) = \sum_{k=1}^{K} \tau_k \, \phi_T(y_i; X_i \beta_k, \Sigma_k),$$
where the $\tau_k$, for $k = 1, \ldots, K$, are mixing proportions that satisfy $0 < \tau_k \le 1$ and sum to 1; $\phi_T$ is a T-variate Gaussian density; and $X_i \beta_k$ and $\Sigma_k$ are the mean vector and covariance matrix of that Gaussian distribution. Here $y_i$ is a T-variate response vector and $X_i$ is a $T \times m$ matrix of predictor variables, where m is the number of predictor variables in the regression model. We model $y_i - X_i \beta_k$ as a zero-mean autoregressive-moving average, ARMA(p, q), time series. The model parameters were estimated by using the Expectation Maximization algorithm (17). The Expectation (E) step identifies groups of similar counties that exist in the data, and the Maximization (M) step provides the parameter estimates within each group identified in the E step. These 2 steps are iterated until a convergence criterion is met. More details on this model are available (14). This mixture model is used to find similar counties and to build a single regression ARMA model within each cluster. In our work, the optimal number of clusters was determined by using the Bayesian information criterion (BIC) (18).
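To make the E step and the BIC comparison concrete, the following Python sketch computes cluster membership probabilities for each county and a BIC value for a fitted model. It is an illustration under a strong simplifying assumption, not the model used in the study: the errors are treated as independent with a common variance within each cluster, in place of the ARMA covariance structure. All function names and toy numbers are hypothetical, and the BIC is written in the "smaller is better" convention, which may differ in sign from the convention in reference (18).

```python
import math


def mean_vector(X_i, beta):
    """Regression mean for one county: the matrix-vector product X_i * beta."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X_i]


def log_density(y, mean, sigma2):
    """Gaussian log-density assuming independent errors with variance sigma2.

    Simplification: the covariance is sigma2 times the identity, not ARMA.
    """
    T = len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, mean))
    return -0.5 * (T * math.log(2.0 * math.pi * sigma2) + sse / sigma2)


def e_step(Y, X, betas, sigma2s, taus):
    """Posterior probability that each county belongs to each cluster."""
    resp = []
    for y_i, X_i in zip(Y, X):
        log_w = [
            math.log(tau) + log_density(y_i, mean_vector(X_i, beta), s2)
            for tau, beta, s2 in zip(taus, betas, sigma2s)
        ]
        top = max(log_w)  # subtract the maximum to avoid numerical underflow
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        resp.append([v / total for v in w])
    return resp


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'smaller is better' form: -2 log L + (parameters) log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Toy data: 2 counties, T = 3 years, m = 2 predictors (intercept and year index).
design = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
Y = [[10.0, 12.0, 14.0], [100.0, 90.0, 80.0]]
X = [design, design]
betas = [[10.0, 2.0], [100.0, -10.0]]  # one coefficient vector per cluster

resp = e_step(Y, X, betas, sigma2s=[1.0, 1.0], taus=[0.5, 0.5])
labels = [row.index(max(row)) for row in resp]  # most probable cluster
```

In a full EM fit, these probabilities would weight the county-level data in the M step, and the fit would be repeated for several candidate numbers of clusters and compared by BIC.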
Once models were trained on the currently available data, forecasting was carried out. All the variables used as predictors in the models needed to be available for the forecast period. To accomplish this, we used a simple autoregressive integrated moving average, ARIMA(p, d, q), model: an ARMA model in which the series $y_t$ is first differenced d times (the "integrated" part) to create a stationary time series. The model is given by
$$\left(1 - \sum_{j=1}^{p} \alpha_j B^j\right)(1 - B)^d \, y_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t,$$
where the $\alpha_j$ and $\theta_j$ are the autoregressive and moving average coefficients, B is the backshift operator such that $B^j(y_t) = y_{t-j}$, and $\varepsilon_t$ is assumed to be white noise. The optimal orders for this model, p, q, and d, were found by using the Akaike information criterion (19). This model was fitted by using the R package forecast (20). After obtaining the best model for each county and forecasting the predictors, we forecast participant counts for the next 5 years. Model assessment was done through the validation set approach. The data set was split into training and validation sets by year. The first 17 years of data were used for the training set, and the remaining 3 years were used for the validation set. The validation mean squared error (MSE) was calculated to assess the accuracy of our forecasting algorithm. The MSE was calculated as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2,$$
where n is the total number of forecasts, $Y_i$ is the observed count, and $\hat{Y}_i$ is the forecast for $i = 1, \ldots, n$. All analysis was completed using R version 3.4.2 (R Corporation).
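The differencing step and the validation set approach can be illustrated with a short Python sketch. It is a simplified stand-in, not the study's procedure: the forecasts here come from a naive "carry the last training value forward" rule in place of the fitted ARIMA and mixture models, and the series, function names, and values are hypothetical.

```python
def difference(series, d=1):
    """Apply the (1 - B)^d operator: difference the series d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def split_by_year(counts_by_year, n_train=17):
    """Put the earliest n_train years in training, the rest in validation."""
    years = sorted(counts_by_year)
    train = {y: counts_by_year[y] for y in years[:n_train]}
    valid = {y: counts_by_year[y] for y in years[n_train:]}
    return train, valid


def mse(observed, forecast):
    """Mean squared error between observed counts and forecasts."""
    if not observed or len(observed) != len(forecast):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((o - f) ** 2 for o, f in zip(observed, forecast)) / len(observed)


# Hypothetical county series for 1997 through 2016 (20 years).
counts_by_year = {1997 + i: 20 + i for i in range(20)}

train, valid = split_by_year(counts_by_year)  # 1997-2013 and 2014-2016

# Stand-in forecast (assumption): repeat the last training-year count.
last_value = train[max(train)]
observed = [valid[y] for y in sorted(valid)]
forecast = [last_value] * len(observed)

validation_mse = mse(observed, forecast)
```

Splitting by year rather than at random keeps the validation years strictly after the training years, which matches how the forecasts are used in practice.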

Results
The number of AWC! participants increased steadily from 1997 through 2011 and then sharply decreased in 2012. Since then, participation steadily decreased. Some counties had similar sociodemographic characteristics (average number of participants, median income, poverty percentage, and population) for the 1997-2016 time period (Figure 1). Minnehaha and Pennington were the most populated counties and had the largest number of AWC! participants (Table 1). Corson, Dewey, Buffalo, and Ziebach all had populations with a low median income and a high poverty percentage. Finally, Brown, Codington, and Lincoln had low poverty percentages and high populations. The mixture model described above was used and the optimal number of clusters was determined to be 5. We summarized the characteristics that pertain to the identified clusters and predictors and the number of participants for 2016 for the 5 clusters (Table 2). Cluster 1 contained the 3 counties with the largest populations: Minnehaha, Pennington, and Hughes. This cluster had the smallest average poverty percentage and the highest median income. Cluster 2 had the highest poverty percentage, an average of almost 25%. It also had the lowest median income of the clusters. The overall average number of participants for this cluster was the second largest even though its average population size was the third largest. Cluster 3 was very similar to Cluster 1 in regard to poverty percentage and median income but had a much smaller population than Cluster 1. Cluster 4's predictors were in the middle of the other clusters. It had the third-largest poverty percentage and median income of the clusters and the second-smallest population. It contained the second-largest number of counties, 19. Cluster 5 had the smallest average population at only 500. It also had the second-highest number of people living in poverty as indicated by the higher poverty percentage and lower median income than the other clusters. In addition, it had the smallest number of participants, an average of 7 participants in the last 20 years, and the largest number of counties (N = 30), fewer than half of the 66 counties in South Dakota.
Analysis of forecasts over the next 5 years shows all 5 clusters with an increase in participants (Table 3). Cluster 2 is the only cluster with an expected decrease for a year, occurring in 2018. Cluster 1 is forecasted to have more than 1,000 participants in 2021. Individual county forecasts identified only 6 counties with an expected decrease in the number of participants, one county staying flat, and the rest of the counties with an increased number of participants. Sixteen counties had an observed decrease in participants over the last 5 years, but our model predicted them to have increased participation in the future.
Only two-thirds of clinics in these counties participated for all 10 years. Geographic patterns in county clusters varied (Figure 2). Most of South Dakota's population is concentrated in the eastern and western parts of the state with the central part sparsely populated. Cluster 1 contained the 2 largest counties on the east and west sides of the state. The counties in cluster 4 appear in groups of 2 to 3, mostly on the eastern side of the state. Most of the low population counties of the central and northwestern parts of the state belong to cluster 5, which had the widest scatter. Also, most of the counties in cluster 5 do not have a participating clinic in their county. Our forecast of the average trend in AWC! participation for the identified clusters for the next 5 years, 2017 through 2021 (Table 3), showed that, if all the circumstances stay the same (eg, insurance coverage, policy, advertisement of the AWC! program), participation on average will increase at the cluster level. At the individual county level, forecasts showed that participation will increase in some counties and decrease in others. The MSE of training data for years 1997 through 2013 for the state was 21.54, and for the validation data for years 2014 through 2016 was 43.80. The increased validation MSE was expected because our training data contained only the first year of decrease from 2012 through 2013. However, when building the final forecasting model, all 20 years of data were used; therefore, we expect the error rate on the forecasts to be less than the validation MSE reported above.

Discussion
Our data contained only AWC! screening results. However, 2016 Behavioral Risk Factor Surveillance System (BRFSS) data (21) and Small Area Health Insurance Estimates (SAHIE) (22) data can be used to provide a general context to the AWC! program participation rate. BRFSS results showed that 68% of women in South Dakota aged 40 or older received a mammogram in 2015 and 2016. This is approximately 139,806 women. Of these, we estimated that about 1,926 women aged 40 to 64, about 2.17%, used the AWC! program during those 2 years. Based on the estimates provided in SAHIE data, this is approximately 33% of eligible women in South Dakota.
An exploratory analysis showed an initial increase followed by a decrease in the number of AWC! participants. This may be related to the termination of the WISEWOMAN program, a heart disease screening program that worked in conjunction with AWC! to perform mammography screenings, and the implementation of the Affordable Care Act (ACA). ACA led to an increase in the diagnosis of early-stage cancers, specifically colon and breast cancers, because of an increase in affordability and accessibility of cancer screening (23). Analysis of the effects of these programs or other possible factors on participation needs to be addressed in future work. For example, cluster 5 had a large poverty percentage but low AWC! participation, which may need additional analysis to determine why eligible women were not using the program. Cluster 1 had increased participation for 2016, and further analysis is needed to determine why. Likewise, counties with a large proportion of eligible women screened should be studied to determine factors that possibly contributed to this success. Finally, the model identifies the counties with increasing and decreasing expected participation.
Most forecasting articles we found were on drug use and prescription drug spending. Most of these carried out linear regression analyses. One performed linear regression analysis to aggregate sales data and forecast expected drug expenditures for a hospital (24). Four years of data were used to make predictions for the next 2 years. Similarly, we found another article forecasting resources for US Army health care (25). That study used ordinary least squares estimation, ridge regression, and robust regression and concluded that, although all the models produced nearly the same estimates, ordinary least squares was desirable because it had the simplest interpretations. We considered linear regression for our data. It was, however, too difficult to analyze 66 individual forecasting models for South Dakota counties. Moreover, a forecast for South Dakota as a whole did not provide enough granularity.
In contrast, a mixture of Gaussian regression time series models allowed us to identify, group, and fit models for groups of similar counties. Clustering, as opposed to evaluating single counties, enabled us to use more data when creating forecasts. In our study, we created 5 models from the available data set, as opposed to 66 models by county. If necessary, we can still obtain individual forecasts for each county for more granular analysis.
Several individual county forecasts displayed counter-intuitive trends. Some of these trends may be attributed to an expected increase in forecasted predictors, such as population or median income. This result may also be caused by other predictors, such as advertisement budget, participation through the WISEWOMAN program, or the start of ACA. Future work investigating travel time to the mammography clinic and how that affects participation could also be conducted. Forecasting efforts would also benefit from more comprehensive data sets that include data related to other state programs (eg, Indian Health Service) to show the total number of women participating in screening programs. Forecasting projects for similar cancer screening programs in other states will help both to validate our methodology and to improve models for screening programs in general.
The identification of county clusters may assist the South Dakota Department of Health to allocate and manage resources more effectively. The results of our study indicate which counties may see an increase in number of AWC! participants. Hence resource allocation decisions could be tailored on the basis of need, which would lead ultimately to an increase in breast cancer screening rates and early detection of breast cancer. In addition, our results may help the South Dakota Department of Health determine which counties would benefit more from a mobile mammography unit, which would reduce barriers to mammography screening, reach underserved populations, and thus address breast cancer disparities in rural areas. This aligns with the goals and objectives of AWC! and the South Dakota Comprehensive Cancer Control State Plan (26).
Our study identified clusters and forecasted the trend in AWC! participation for the next 5 years. According to our model, the number of participants will increase in some counties and decrease in others. Forecasting is a complex analysis; though our analysis was limited by the number of predictors, this is the first forecasting study among cancer screening programs. Our work provides information for AWC! managers engaged in budgeting and planning strategies to increase screening rates among underserved women in South Dakota.