History Bias, Study Design, and the Unfulfilled Promise of Pay-for-Performance Policies in Health Care

Introduction
The ongoing flip-flopping of research findings about the effects of medical or health policies weakens the credibility of health science among the general public, clinicians, members of Congress, and the National Institutes of Health (1–3). Even worse, poorly designed studies, combined with widespread reporting on those studies by the news media, can distort the decisions of policy makers, leading them to fund ineffective, costly, or even harmful policies. Several reports in top medical journals in 2015 (4–6) pronounced that economic incentives in Pioneer Accountable Care Organizations saved medical costs, but the reports did not control for major biases created by unfairly comparing selected high-performing organizations with less-experienced control organizations (7). The result? The US Centers for Medicare & Medicaid Services cited the findings as a reason for expanding the program nationwide.
Building on an earlier article in Preventing Chronic Disease (8), this article focuses on a widely accepted but questionably effective (9) health policy that compensates physicians for meeting certain quality-of-care standards, such as measuring or treating high blood pressure. Policy makers often believe that such financial incentives motivate physicians to improve their performance to maintain or increase their incomes, thereby improving patient outcomes (10). Health care systems in the United States, Canada, Germany, Israel, New Zealand, Taiwan, and the United Kingdom have committed billions of dollars to this approach in the hope that such incentives will improve the quality of health care (11). Although this monetary approach sounds good theoretically, international scientific reviews overwhelmingly find little evidence to support it (12). Giving physicians small incremental payments to do things they already do routinely (eg, measuring blood pressure) may be counterproductive and even insulting, may divert their attention from more critical concerns, and does not increase quality of care (13). Some studies even find that such compensation encourages unethical behavior by incentivizing doctors to "cherry-pick" healthy, active, wealthy patients over "costly" sick patients who are less likely to reach the performance targets. Nevertheless, this financial-incentive policy is entrenched in many components of the Patient Protection and Affordable Care Act (colloquially known as Obamacare), including Accountable Care Organizations, patient-centered medical homes, and health information technology (14).
In this article, our aim is to help the public and policy makers understand how a pervasive bias can undermine the results of poorly designed studies of pay-for-performance programs published in even the world's leading medical journals. We also point to observational study designs and systematic reviews of the total body of evidence to find more trustworthy conclusions on the efficacy of pay-for-performance (12). Although randomization is frequently not feasible for evaluating such public policies (15), we also present an example of a randomized controlled trial that supports the conclusions drawn from strong observational study designs.

The Threat: History Bias
The most pervasive threat to the credibility of studies of pay-for-performance (and many other health interventions) is history bias. History biases are simple to understand: they are events unrelated to the policy under study that occur before or during the implementation of that policy and that may have a greater effect on the policy's hoped-for outcome than the policy itself. These events call into question the conclusions of studies evaluating the policy. They can be any event, such as a concurrent improvement in physician practice that successfully identifies and treats patients with high blood pressure, or widespread news media coverage of a new drug or new national guidelines supporting a life-saving treatment (for example, β blockers to prevent acute myocardial infarction [16]). The American Heart Association's physician educational campaign Get With the Guidelines led to a gradual improvement in hypertension management (17). If a study of a pay-for-performance program targeting hypertension management is launched after or in the middle of such a campaign and does not account for that campaign's effect on any improvements, the pay-for-performance program may take credit for the success, even though the blood pressure improvements really resulted from the unmeasured historical changes in physician practices (18).
Weak pre-post designs that did not control for history bias
Many studies have evaluated the United Kingdom's pay-for-performance program, a national policy that provides financial incentives to physicians to improve patient care. The largest pay-for-performance program of its kind, this national policy, introduced in April 2004, offered family physicians up to an additional 25% of salary for meeting certain performance standards. Figure 1 shows data from a study that did not protect against history bias when it evaluated the United Kingdom's pay-for-performance program (19,20). The objective of this study was to determine whether the program led to improvements in a set of quality indicators, which are measurable elements of practice that indicate quality of care (in our example, target total cholesterol levels). This study had data only for the same month in which the policy was implemented and 2 points afterwards.
Figure 2 shows data from another study that evaluated the United Kingdom's pay-for-performance policy. This study also did not account for secular trends, yet it was published in a major medical journal and is highly cited (21). The authors' assessment included only 2 points in time before and only 2 points after program implementation.
The key problem with both of these studies, which purport to show the positive effects of the national pay-for-performance policy, is the use of only 2 data points during a long period before program implementation and 2 data points afterwards. From so few data, it is impossible to know what was happening to affect diabetes scores unrelated to the pay-for-performance policy before that policy was implemented. So we do not know if any small changes after policy implementation resulted from the pay-for-performance program or from some other changes in physicians' practice.
If anything, it appears that improvements before implementation (from 1998 to 2003), to the extent these are detectable by examining only 2 data points, may have lessened or flattened, not increased, after implementation of pay-for-performance. These studies are examples of a simple pre-post design. With only 1 or 2 observations (points) before and after pay-for-performance, such a design cannot control for secular trends (history) before program implementation. The pre-existing trajectory of quality of care both before and after the intervention is unknown, and it is impossible to know whether the policy had any effect on this trend.
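The weakness of the simple pre-post design can be illustrated with a small simulation. The sketch below is purely hypothetical: the quality scores, the 2-point yearly improvement, and the choice of observation years are invented for illustration and are not data from any study cited here. The simulated outcome improves steadily with no policy effect at all, yet a comparison of 2 points before with 2 points after still yields an apparent "effect," whereas projecting the pre-policy trend forward does not.

```python
# Hypothetical outcome: percentage of patients meeting a quality target.
# It rises by 2 points per year because of a secular trend; the policy
# itself contributes nothing (true effect = 0).
def quality(year):
    return 50.0 + 2.0 * (year - 1998)


def pre_post_estimate(before_years, after_years):
    """Naive pre-post 'effect': mean after the policy minus mean before."""
    before = [quality(y) for y in before_years]
    after = [quality(y) for y in after_years]
    return sum(after) / len(after) - sum(before) / len(before)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def trend_adjusted_estimate(pre_years, after_years):
    """Mean gap between observed post-policy values and the projected pre-policy trend."""
    slope, intercept = fit_line(pre_years, [quality(y) for y in pre_years])
    gaps = [quality(y) - (intercept + slope * y) for y in after_years]
    return sum(gaps) / len(gaps)


# Two points before and two points after a policy assumed to start in 2004.
naive = pre_post_estimate([1998, 2003], [2005, 2007])
# Six yearly pre-policy points establish the trend before projecting it forward.
adjusted = trend_adjusted_estimate(list(range(1998, 2004)), [2005, 2007])

print(f"naive pre-post 'effect': {naive:.1f} points")
print(f"trend-adjusted effect:   {abs(adjusted):.1f} points")
```

In this invented example the naive comparison credits the policy with the entire secular improvement, which is the pattern of error that history bias produces.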

PREVENTING CHRONIC DISEASE
Strong interrupted time-series design that controls for history bias
Figure 3 illustrates a result of one of the most convincingly negative studies, which showed that the United Kingdom's pay-for-performance program had no detectable effects on quality of care for patients with hypertension. Using a strong interrupted time-series design and 7 years of monthly data (84 time points) for 400,000 patients before and after the program's implementation, Serumaga et al showed that the pay-for-performance program started in the middle of a slight rise in the percentage of patients who began blood pressure treatment (22). Every figure in the article shows flat or slightly improving treatment over many years and no effect of the $2 billion program that links family physicians' income to measures of health care quality. The existence of a long prepolicy trend, established by data for
www.cdc.gov/pcd/issues/2016/16_0133.htm • Centers for Disease Control and Prevention
trend in health outcomes over many years. The stronger study (interrupted time-series design) showed that the United Kingdom's pay-for-performance program had no effects, whereas the 2 weak studies (pre-post design) contributed only to false or exaggerated hopes.
Figure 4 shows a remarkably similar negative result of paying hospitals for their measured performance. Hospital pay-for-performance programs and physician pay-for-performance programs are developed under similar assumptions: linking pay to certain hospital outcomes is expected to motivate hospital leaders to meet targets to maintain or increase their incomes, thereby improving patient outcomes. A study by Jha et al (23) used an interrupted time-series design with a comparison series to investigate differences in patient mortality rates between hospitals participating in a pay-for-performance program and hospitals not participating. The study compared data for 4 groups of patients: those with acute myocardial infarction, congestive heart failure, or pneumonia, and those who underwent coronary artery bypass grafting. The 7-year trends in 30-day mortality for the pay-for-performance and non-pay-for-performance hospitals almost completely overlapped, leaving little doubt that the program had no detectable effect on long-term mortality (Figure 4).
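One common way to analyze an interrupted time series is segmented regression, which estimates the pre-policy level and slope and then asks whether either changed when the policy began. The following is a minimal sketch under stated assumptions, not a reproduction of the analyses in the studies cited above: the 2 monthly series, the policy start at month 36, and the function names are all hypothetical. The first series follows an unbroken trend (no policy effect); the second has an artificial 5-point jump added so that the method can be seen to detect a real change when one exists.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def segmented_regression(series, start):
    """Least-squares fit of y = b0 + b1*t + b2*post + b3*(t - start)*post.

    b2 is the change in level and b3 the change in slope at the policy start.
    """
    rows = []
    for t in range(len(series)):
        post = 1.0 if t >= start else 0.0
        rows.append([1.0, float(t), post, (t - start) * post])
    k = 4
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, series)) for i in range(k)]
    b0, b1, b2, b3 = solve(xtx, xty)
    return {"baseline_level": b0, "baseline_slope": b1,
            "level_change": b2, "slope_change": b3}


MONTHS = 84        # 7 years of monthly observations
POLICY_START = 36  # hypothetical month in which the policy begins

# Hypothetical series: steady improvement of 0.1 points per month, no policy effect.
no_effect = [60.0 + 0.1 * t for t in range(MONTHS)]
# Same trend plus an artificial 5-point jump at the policy start.
with_jump = [y + (5.0 if t >= POLICY_START else 0.0) for t, y in enumerate(no_effect)]

fit_null = segmented_regression(no_effect, POLICY_START)
fit_jump = segmented_regression(with_jump, POLICY_START)
for name, fit in (("no effect", fit_null), ("with jump", fit_jump)):
    print(name, {key: round(value, 3) for key, value in fit.items()})
```

Because the long pre-policy series fixes the baseline slope, an ongoing improvement is absorbed by the trend term and is not mistaken for a policy effect. Real analyses would also need to handle noise, autocorrelation, and seasonality, which this sketch omits.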
It is worth noting, however, that hospitals in the pay-for-performance program opted into that program and could have already had better outcomes than non-participating hospitals before the pay-for-performance program began (ie, they were anticipating the financial rewards). But the equivalent trends and pre-program improvements in mortality for both study groups make it less likely that this bias would have changed the conclusion.
Interestingly, a more recent study of pay-for-performance effects on 30-day in-hospital mortality rates among patients with pneumonia, heart failure, or acute myocardial infarction in one region of the United Kingdom was also compromised by already occurring declines in mortality rates and a lack of clear differences in mortality rates between the study and comparison groups (24). These short-term declines in mortality rates were not maintained in the long term (25).

Strongest designs for controlling for history bias: randomized controlled trials
The strongest design for evaluating policies is a randomized controlled trial (RCT). In such study designs, random allocation of participants into intervention and control groups increases the likelihood that the only difference between the group receiving the pay-for-performance intervention and the control group (the one not participating in pay-for-performance) is the intervention itself. In a recent RCT, physicians randomized to a pay-for-performance intervention were eligible to receive up to $1,024 per patient who met target cholesterol levels, whereas physicians in the control groups received no economic incentives for achieving better outcomes (26).
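The value of random allocation can be illustrated by comparing how well 2 allocation methods balance the groups before any intervention occurs. In this hypothetical sketch, the baseline quality scores, the number of physicians, and the opt-in rule (the highest-scoring half joins the program) are all invented assumptions, not features of the trial cited above.

```python
import random

# Hypothetical baseline quality scores for 1,000 physicians (0.0 to 99.9).
baseline = [i / 10 for i in range(1000)]


def mean(values):
    return sum(values) / len(values)


def baseline_gap(intervention, control):
    """Difference in mean baseline score between intervention and control groups."""
    return mean(intervention) - mean(control)


# Opt-in allocation (assumed for illustration): the 500 physicians with the
# highest baseline scores choose to join the incentive program.
ranked = sorted(baseline)
opt_in_gap = baseline_gap(ranked[500:], ranked[:500])

# Random allocation: shuffle the physicians, then split them in half.
rng = random.Random(2016)  # fixed seed so the run is repeatable
shuffled = baseline[:]
rng.shuffle(shuffled)
random_gap = baseline_gap(shuffled[:500], shuffled[500:])

print(f"baseline gap with opt-in allocation: {opt_in_gap:.1f}")
print(f"baseline gap with random allocation: {random_gap:.1f}")
```

With opt-in allocation the groups differ substantially before the program even starts, so any later difference in outcomes cannot be attributed to the incentive. With random allocation the baseline gap is expected to be small, leaving the intervention as the most plausible explanation for any difference that emerges.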
Studies with strong, trustworthy designs, such as this RCT, suggest that paying physicians according to their measured performance on quality metrics (eg, reduction in low-density lipoprotein levels) does not improve outcomes (Figure 5). Physician payments did not produce any meaningful changes in quality of care compared with an equivalent group receiving no incentives.
Rigorous systematic reviews of the entire body of pay-for-performance studies: the most trustworthy evidence
It is important to reiterate that no study is perfect, and no single study can determine the truth, whatever that may be. The accumulation of knowledge over time is the best way to assess a health care treatment or policy. However, given that most studies do not control for history or other biases, it is essential to single out the most rigorous systematic reviews, that is, literature syntheses that eliminate weakly designed studies (the simple pre-post study designs illustrated in Figure 1 and Figure 2 or studies that simply correlate pay-for-performance with quality of care at one point in time) and summarize the remaining evidence.

One international systematic review (Box) (12) found little evidence that pay-for-performance improves the quality of medical care; in addition, some studies found that it sometimes had the unintended consequence of discouraging doctors from treating the sickest patients.
there was no convincing evidence that the quality of care increased at a faster rate in the 3 years after P4P implementation than before.
[T]he current evidence for P4P targeting individual practitioners is insufficient to recommend wholesale adoption in health care systems at this time.
In addition to the international systematic review, other recent, well-conducted systematic reviews also questioned the efficacy of pay-for-performance and advised against its widespread implementation, which has occurred despite the negative evidence (27). For example, Dutch researchers conducted an "umbrella review" (a review of all systematic reviews on pay-for-performance policies) to consider the totality of the evidence (28). They found that most systematic reviews unequivocally concluded that evidence showing effectiveness for pay-for-performance policies is weak, mixed, and inconclusive; many studies failed to find a meaningful effect attributable to the policy. As we have illustrated in this article, studies with weak designs that do not control for biases found more positive results than those with strong designs (29).

Closing Comments
Despite its unfulfilled promise and discouraging evidence, this costly and ineffective approach to improving health care is a widespread component of current national and international health care policies. It is entrenched in many policies created by the Affordable Care Act (14). Part of the problem is the explosion in statistical techniques that attempt to "adjust for" or "correct" unquestionably dissimilar study and comparison groups rather than graphing the actual data over time so policy makers (who appreciate simple graphical displays) can actually look at the size of effects. Exaggeration of the effects of government programs through such "black box" statistics by the news media further widens the divide between reality and perception of policy effects (30). Weakly designed studies may also facilitate the proliferation of policies that encourage physicians to achieve "target rates" for health care procedures even when these procedures may be harmful for some patients. Oftentimes, what's measured is what matters, and quality may deteriorate in areas that are not incentivized (31). Pay-for-performance programs may even result in collateral damage (13), diverting resources from under-resourced facilities such as safety net hospitals that provide care for vulnerable and high-need populations (32).
If we wish to encourage efficiency in medicine when our government and private health care programs are consuming almost one-fifth (17%) of the gross national product, it may be time to insist on strong experimental and quasi-experimental research designs (such as RCTs, interrupted time-series designs, and systematic reviews) in pilot tests of expensive policies. Investments of private and taxpayer funds should be based on solid evidence of safety and efficacy. The alternative, the present system, relies on weak and uncontrolled research designs, misleads policy makers and the public, and will ultimately lead to perverse effects, such as unsustainable costs, unhappy clinicians, and policies that may damage rather than improve the quality of medical care (30).
Dr Naci received no financial support for developing this article. We are indebted to Dr Sumit Majumdar for his outstanding contributions to an earlier version of the article. We are grateful to Caitlin Lupton for editorial assistance and graphic design. The Commonwealth Fund is a national, private foundation in New York City that supports independent research on health care issues and makes grants to improve health care practice and policy. The views presented here are those of the authors and not necessarily those of The Commonwealth Fund, its directors, officers, or staff.