Data Science Reveals Suicide Trends
When a team of four scientists at the National Center for Health Statistics—working in different divisions and offering different expertise—came together to examine suicide data, they made a leap in data analysis so innovative it earned them CDC’s highest honor for scientific work.
Over months of long hours and many attempts, the team refined a method that allows us to estimate trends in suicide rates at the county level faster than ever before. The new method not only solves longstanding problems around creating stable rates from small numbers, but also paves the way for insights on other big challenges like drug overdoses, teen birth, and deaths from rare causes.
The Statistical Challenge of Small Numbers
With suicide rates continuing to rise, communities need information to better target their prevention efforts. Data that capture how suicide is changing over time at the local level can help decision-makers understand what’s working, or what isn’t, for the people they serve. However, from a data scientist’s perspective, producing this kind of information is a complex process full of scientific hurdles.
Although suicide is the 10th leading cause of death in America, the number of suicides each year in many individual counties is small. For example, in 2015, 84 percent of counties in the US reported fewer than 20 suicides for the year.
It’s challenging for statisticians to calculate rates accurately from small numbers because such rates are considered scientifically “unstable”: they vary too much to yield a usable estimate. In fact, because of this instability, CDC does not publish death rates for counties where fewer than 20 deaths are reported.
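The instability can be made concrete with a rough calculation. The sketch below assumes death counts behave like Poisson counts, in which case the relative standard error of a rate is approximately 1 divided by the square root of the number of deaths. The function name and the example counts are illustrative assumptions, not part of CDC's published method.

```python
import math


def relative_standard_error(deaths: int) -> float:
    """Approximate relative standard error of a death rate.

    Assumes the death count is Poisson distributed, so the standard
    deviation of the count is sqrt(deaths) and the relative error of
    the resulting rate is 1 / sqrt(deaths).
    """
    if deaths <= 0:
        raise ValueError("deaths must be a positive count")
    return 1.0 / math.sqrt(deaths)


# Illustrative counts only: a very small county, the 20-death threshold,
# and a larger county.
for deaths in (5, 20, 100):
    rse = relative_standard_error(deaths)
    print(f"{deaths:>3} deaths -> relative standard error about {rse:.0%}")
```

Under this assumption, a rate based on 20 deaths has a relative standard error of roughly 22 percent, and one based on 5 deaths roughly 45 percent, which is why rates from very small counts swing so much from year to year.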
“The challenge is that we have reason to believe the death rates in some rural counties are high, and we want to show that but are not able to directly calculate a rate with any precision, so we can’t say how high relative to other counties,” says Margaret Warner, senior epidemiologist. “This happens not only when we study death rates in rural areas, but also when we look at deaths from rare causes.”
Looking “Around” for Answers
So how do statisticians solve this problem? Previously, there were two main ways to generate a stable estimate. Both ways involved combining smaller, “unstable” numbers to create bigger, more stable ones.
The first way involved grouping data from several counties into larger geographic units, like states or regions. The second was to aggregate the data over longer time periods, for example across multiple years. However, both of these methods can mask important geographic and time trends that may be happening at the individual county level.
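A small made-up example shows how pooling trades detail for stability. The county names, death counts, and populations below are hypothetical and chosen only to illustrate the masking effect.

```python
# Hypothetical counties: name -> (deaths, population). Values are invented.
counties = {
    "A": (4, 20_000),
    "B": (6, 60_000),
    "C": (30, 300_000),
}


def rate_per_100k(deaths: int, population: int) -> float:
    """Crude death rate per 100,000 population."""
    return deaths / population * 100_000


# Rate for each county on its own (A and B rest on very few deaths).
county_rates = {
    name: rate_per_100k(deaths, population)
    for name, (deaths, population) in counties.items()
}

# Rate for the three counties pooled into one larger region.
pooled_deaths = sum(deaths for deaths, _ in counties.values())
pooled_population = sum(population for _, population in counties.values())
pooled_rate = rate_per_100k(pooled_deaths, pooled_population)

for name, rate in county_rates.items():
    print(f"County {name}: {rate:.1f} per 100,000")
print(f"Pooled region: {pooled_rate:.1f} per 100,000 ({pooled_deaths} deaths)")
```

In this invented case the pooled rate rests on 40 deaths and is about 10.5 per 100,000, but it hides the fact that county A's crude rate is about twice that of its neighbors.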
Benefits of the New Method
The new method takes a different approach, combining information from many different datasets to get a better look at the problem. It uses information from mortality data and dozens of other factors known to affect suicide rates—such as demographic and economic characteristics, divorce rates, and urbanization levels—to create a reliable “approximation” of county-level suicide rates, all across the country.
The specific technique is called “Integrated Nested Laplace Approximation,” or INLA, and the team used R-INLA, its implementation in R software. The model estimates each county’s suicide rate using information from nearby counties together with a variety of risk factors drawn from multiple datasets. This technique produces estimates that are less variable from year to year, which can help better monitor trends in places where deaths are occurring in small numbers.
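The idea of stabilizing a county's estimate with information from nearby counties can be illustrated with a much simpler stand-in. The sketch below is not INLA and is not the team's model: it is a basic gamma-Poisson shrinkage estimator that ignores the risk factors entirely. The function name, the `prior_weight` parameter, and all numbers are assumptions made for illustration.

```python
def shrunken_rate(
    deaths: int,
    population: int,
    neighbor_deaths: int,
    neighbor_population: int,
    prior_weight: float,
) -> float:
    """Posterior mean death rate per 100,000 under a Gamma prior.

    The prior is centered on the neighbors' pooled rate. prior_weight is
    the prior's "equivalent population": the larger it is, the harder the
    estimate is pulled toward the neighbors' rate. With prior_weight = 0
    the result is the county's own crude rate.
    """
    neighbor_rate = neighbor_deaths / neighbor_population
    # Gamma(shape, rate) prior updated with a Poisson count:
    # shape gains the observed deaths, rate gains the observed population.
    posterior_shape = neighbor_rate * prior_weight + deaths
    posterior_rate = prior_weight + population
    return posterior_shape / posterior_rate * 100_000


# Hypothetical small county (4 deaths among 20,000 people) surrounded by
# neighbors with 36 deaths among 360,000 people.
crude = shrunken_rate(4, 20_000, 36, 360_000, prior_weight=0)
smoothed = shrunken_rate(4, 20_000, 36, 360_000, prior_weight=20_000)
print(f"Crude rate:    {crude:.1f} per 100,000")
print(f"Smoothed rate: {smoothed:.1f} per 100,000")
```

With these invented numbers, the crude rate of about 20 per 100,000 is pulled to about 15, halfway toward the neighbors' rate of 10. A full spatial model such as the one fitted with R-INLA chooses the amount of smoothing from the data and also uses covariates, which this sketch does not attempt.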
The process is computationally faster, too. Previous modeling approaches would take about 8 weeks to produce results, but R-INLA can do the same work in less than 24 hours. “The previous method for doing this kind of work via Markov Chain Monte Carlo techniques in the software WinBUGS would take many weeks,” says team member Diba Khan, mathematical statistician.
The longer processing times created issues for the team, some unexpected. “One time,” says Diba, “we were in the middle of analysis when a snowstorm hit, and we lost power. It wiped everything out, and we had to start again from scratch.”
The results of the new method make it possible to look at regional differences without regard for state boundaries. “For example, we can identify the higher suicide rates in Appalachia,” explains Holly Hedegaard, CDC medical officer. “In addition, we can look at differences within a state, for example, higher suicide rates in northern California compared to southern California.”
Insights on Major Health Challenges
“This analysis of suicide data has given us our most precise picture yet of where, and how rapidly, suicide rates have been rising all across the country,” notes Lauren Rossen, senior health statistician. Just as important is the potential to apply this method to other data in the future. Adaptations of the method have already provided insights on other critical public health challenges like drug overdoses, teen birth rates, and deaths from rare causes.
For example, an analysis of teen birth rates from 2003-2015 across 3,137 counties yielded new interactive maps that allow a look at county-level trends. The new method has also been applied to capture drug overdose death rates in counties with small population sizes or small numbers of deaths. Dashboards allow users to look at a series of heat maps, tile grids, and trend-lines of model-based county-level estimates of drug-overdose death rates from 2003-2017.
Collectively, these efforts have provided detailed information about what is happening at the county level across a variety of important outcomes, and they are getting that information into the hands of people who can use it to improve public health in their communities.
- Read more about the work:
- Learn about the National Center for Health Statistics
- See how R-INLA is being used to solve many real-world problems