Correlation between “Deaths of Despair” in White America & Trump Victory

“Deaths of Despair” is an underreported phenomenon that has killed nearly 100,000 middle-aged white Americans through depression-related causes (suicide and addiction) since 1999, an epidemic comparable to AIDS in its scope. I was curious to find out whether it has any correlation with Trump’s victory in the 2016 election. So I collected CDC mortality data for each county from 1999 to 2015, and then collected county-level results for the last three elections, viz. 2008, 2012 and 2016.

The analysis is done at the county level, looking at the percentage swing in votes for Trump in the 2016 election for each county. It finds that the counties that have seen the largest increase in depression-related mortality rates have also seen a remarkable rise in votes for Trump. The analysis also shows, during the period 2008-2016, a high concentration of rising white mortality in three states, viz. Michigan, Wisconsin and Pennsylvania (which were key to Trump’s success), alongside their shift to Trump from a traditionally Democratic voter base. It also finds a striking correlation: counties with higher “Death of Despair” rates are more likely to vote for Trump, and this shift in votes to the GOP increases as the death rate increases (see below).

While “anger” has long been offered as the hidden variable explaining Trump’s electoral victory, this analysis can be read as an attempt to quantify that anger by linking it to depression (or, to be specific, deaths of despair). Elections are dominated by voter sentiment, and it is natural for that sentiment to be affected when voters are surrounded by deaths of despair in their neighborhood.

This analysis can help future electoral models by incorporating a region’s despair-related mortality rates as one of the key features affecting voters. The psychological impact of depression may also help explain the appeal of Trump’s simplistic, “non-intellectual” narrative to his core base.

Technical Details of Analysis 

 

Introduction

A groundbreaking study published in 2015 claimed that something unusual and troubling was happening in America, almost unnoticed. It concluded that middle-aged white Americans were suddenly dying at increasing rates, even as life expectancy kept improving for other racial and ethnic groups in America and for whites in other advanced countries. The authors called these “Deaths of Despair”: middle-aged white Americans killing themselves, either by suicide or through addiction-related deaths. This epidemic has been on the rise since 1999 and has been linked to changes in a number of factors such as education, the labor market, and the collapse of family and religion. [1]

On the other hand, Trump’s surprising victory in the 2016 election remains an enigma, since it contradicted most predictions made by traditional electoral analysis. A link between right-wing politics and middle-aged white mortality has been postulated by some but has not yet been studied. This data mining case study looks at the correlation between increasing mortality rates from 1999 to 2015 and the swing in favor of the GOP, and asks whether the locations that saw the most despair do indeed correspond to a shift of voters from Democrats to the GOP compared with the previous elections of 2008 and 2012.

The analysis is done at the county level, looking at the death rates of middle-aged white Americans in each county and how that county voted in the 2016 election. More specifically, we looked at the percentage swing in votes for the GOP in the 2016 election for each county and found that the counties that have seen the largest increase in mortality rates have also seen a remarkable rise in votes for the GOP. The maps also show, during the period 2008-2016, a high concentration of rising white mortality in three states, viz. Michigan, Wisconsin and Pennsylvania (which were key to Trump’s success), alongside their shift to the GOP from a traditionally Democratic voter base. We also found a striking correlation: counties with higher “Death of Despair” rates are more likely to vote for the GOP, and this shift in votes to the GOP increases as the death rate increases.

While “anger” has long been offered as the hidden variable explaining Trump’s electoral victory, this analysis can be read as an attempt to quantify that anger by linking it to depression (or, to be specific, deaths of despair). Elections are dominated by voter sentiment, and it is natural for that sentiment to be affected when voters are surrounded by deaths of despair in their neighborhood. This analysis can help future electoral models by incorporating a region’s despair-related mortality rates as one of the factors affecting voters.

 

1.    Problem Understanding

Nobel laureate Angus Deaton and Anne Case published their groundbreaking paper in 2015, titled “Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century”, documenting a previously unknown phenomenon in America: a marked increase in the all-cause mortality of middle-aged white non-Hispanic men and women in the United States between 1999 and 2013. This rise in mortality is due to increasing death rates from drug and alcohol poisonings, suicide, and chronic liver diseases and cirrhosis, together termed “Deaths of Despair”. The trend is unique to white non-Hispanic Americans, as mortality rates for other races continue to decline. More importantly, it is also unique to America, as it is not observed in other advanced countries. The two charts below are borrowed from the paper to give a better view of its conclusions. [1]

The first shows that rising mortality is unique to the US, and the second shows that the trend is unique to white non-Hispanics within the US. It must be reiterated that mortality for all groups declined steadily in the US from 1900 to 1999 (due to improvements in overall life expectancy and quality of life). Now we suddenly see a reversal of that trend: not only has the decline stopped for one particular group, its members are dying younger and at a faster rate.

 

The paper highlights the enormity of this disturbing trend by calling it an epidemic comparable to AIDS, but one that has received no attention. To quote from the paper: “If the white mortality rate for ages 45−54 had held at their 1998 value, 96,000 deaths would have been avoided from 1999–2013, 7,000 in 2013 alone. If it had continued to decline at its previous (1979‒1998) rate, half a million deaths would have been avoided in the period 1999‒2013, comparable to lives lost in the US AIDS epidemic through mid-2015.”[1]

 

1.2 Proposition

Donald Trump’s unexpected victory in the 2016 US election has been explained from many angles so far. The oft-used phrase is that the 2016 election tapped into the “anger in America”. However, no quantitative definition has so far been proposed for measuring that anger. It is conceivable that despair is another manifestation of anger. It may not be a total coincidence that a strong base of the GOP also happens to be middle-aged white Americans. Putting these two aspects together is the purpose of this study.

This data mining case study builds on the paper mentioned above and tries to find out whether there is a pattern between rising deaths of despair in America and the swing in votes to the GOP from the 2012 to the 2016 election. The proposition is to see whether the rise in votes for Trump is correlated with the percentage increase in middle-aged white mortality rates in those areas. The analysis is done at the county level for both votes and mortality rates. The aim is to test whether the variability in the rise in votes for the GOP can be explained in part by statistically significant mortality variables.

Very specifically, this case study calculates the death rate of middle-aged white Americans in each county and tries to determine its relationship with the votes for Trump in that county, as well as with the swing in favor of Trump in that county relative to previous elections.

1.3 Source

The mortality data for this project is provided by the CDC (Centers for Disease Control and Prevention). The data was downloaded year by year for 1999-2015 (due to size limits on downloads from the CDC website). This dataset includes deaths and population for each county in the USA throughout this period. [2][3]

 

The 2016 election dataset is scraped from the Guardian newspaper’s website. The scraped version of the data includes county-level votes as well as additional files with the 2008 and 2012 election results for comparison. [5][6]

2.    Data Preparation

In this section the raw data is converted into a dataset that is relevant to this project and in a consumable format.

2.1 Prepare Mortality data

Below are the steps performed to prepare the mortality data.

  1. The CDC provides a handbook for understanding the columns of the dataset [4]. It is used to interpret the data.
  2. Limit the dataset to the middle-aged population, viz. ages 45-54, as in the reference study.
  3. The data files downloaded for each year from 1999-2015 contain information about {County, Year, Race, Hispanic Origin, Gender, Deaths and Population}. Some additional columns are discarded at this stage.
  4. The single-year mortality files are combined and loaded into one dataframe.
  5. The next step is to clean the data: rows where population is NA are dropped.
  6. After this we aggregate the data by year, county and race. This segments deaths by race for each county in every year.
  7. Next we calculate the mortality rate by race in each county. The mortality dataset is now ready for deriving feature vectors.
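The steps above can be sketched roughly as follows with pandas. This is only an illustration under assumptions: the column names (fips, year, race, gender, deaths, population), the toy numbers and the per-100,000 scaling are hypothetical and are not taken from the CDC files or the original notebook.

```python
import pandas as pd

# Hypothetical single-year extracts, already limited to ages 45-54.
# Column names and values are assumptions made for this sketch.
year_1999 = pd.DataFrame({
    "fips": ["00001", "00001", "00001", "00002"],
    "year": [1999, 1999, 1999, 1999],
    "race": ["White", "White", "Black", "White"],
    "gender": ["M", "F", "M", "M"],
    "deaths": [70, 50, 40, 30],
    "population": [50000, 50000, 20000, None],  # last row has no population
})
year_2000 = pd.DataFrame({
    "fips": ["00001", "00002"],
    "year": [2000, 2000],
    "race": ["White", "White"],
    "gender": ["M", "M"],
    "deaths": [65, 33],
    "population": [50000, 30000],
})


def prepare_mortality(frames):
    """Combine yearly files, drop rows without population, aggregate, add a rate."""
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.dropna(subset=["population"])
    grouped = combined.groupby(
        ["fips", "year", "race"], as_index=False
    )[["deaths", "population"]].sum()
    # Deaths per 100,000 people (the scaling is an assumption of this sketch).
    grouped["death_rate"] = grouped["deaths"] / grouped["population"] * 100_000
    return grouped


mortality = prepare_mortality([year_1999, year_2000])
print(mortality)
```

County codes are kept as strings here so that leading zeros survive when the files are combined.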

2.2 Prepare Voting dataset

  1. Collect the dataset for election results of 2008, 2012 and 2016 at county level.
  2. Combine the datasets at county level to create a comprehensive file.

 

3.    Feature Engineering

We now have clean data that we can use to define feature vectors. Preparing data so that it can be used in models is a time-consuming activity. It goes back and forth many times, and some data issues are discovered only after running models. For the sake of simplicity, only the final data preparation methods, and the features found to be relevant after a lot of analysis, are described below.

3.2 Mortality Data

  1. Using the race code (for each race, viz. White, Black etc.) and the Hispanic origin code, we segment the dataset to derive a population identifier for White-NH (white non-Hispanics) and Non-White (all other races and Hispanics). We continue further analysis for the white population only.
  2. The data is aggregated by this new race descriptor (white vs. non-white) and the death rate for each county is derived.
  3. We next take care of missing values. Data for some intermittent years can be missing for a county. Overall, instances of missing data are very few. Since death rates don’t change drastically from year to year, it is reasonable to fill in the previous year’s death rate wherever one is missing.
  4. We now derive some additional features. We first calculate the average death rate for each county. This is calculated over different time periods to line up with the election years. So we get
    • Average Death Rate between 1999 and 2007
    • Average Death Rate between 2008 and 2011
    • Average Death Rate between 2012 and 2015
  5. We also calculate some additional cross-period, long-term average death rates.
  6. Next we calculate the percentage shift in death rates between these periods. These correspond directly to the periods above. So we get
    • Percentage Rise in Death Rate from 1999 to 2015
    • Percentage Rise in Death Rate from the 2008-2011 average to the 2012-2015 average
    • Percentage Rise in Death Rate from the 1999-2007 average to the 2008-2011 average

We additionally take the log of the absolute death rates to handle the skew in the data.
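A minimal sketch of these mortality features is shown below. It assumes a long table of white non-Hispanic death rates with hypothetical columns fips, year and death_rate, and it assumes the 1999-to-2015 rise is computed from the two single-year rates; the original notebook may differ on both points.

```python
import numpy as np
import pandas as pd

# Hypothetical rates for one county; the year 2005 is deliberately missing.
years = [y for y in range(1999, 2016) if y != 2005]
rates = pd.DataFrame({
    "fips": ["00001"] * len(years),
    "year": years,
    "death_rate": [400.0 if y <= 2007 else 500.0 if y <= 2011 else 600.0
                   for y in years],
})


def mortality_features(rates):
    """Forward-fill missing years, then derive period averages, rises and a log."""
    wide = rates.pivot(index="fips", columns="year", values="death_rate")
    # Make sure every year has a column, then carry the last known rate forward.
    wide = wide.reindex(columns=range(1999, 2016)).ffill(axis=1)

    out = pd.DataFrame(index=wide.index)
    out["avg_1999_2007"] = wide.loc[:, 1999:2007].mean(axis=1)
    out["avg_2008_2011"] = wide.loc[:, 2008:2011].mean(axis=1)
    out["avg_2012_2015"] = wide.loc[:, 2012:2015].mean(axis=1)
    out["pct_rise_1999_2015"] = (wide[2015] / wide[1999] - 1) * 100
    out["pct_rise_0811_to_1215"] = (out["avg_2012_2015"] / out["avg_2008_2011"] - 1) * 100
    out["pct_rise_9907_to_0811"] = (out["avg_2008_2011"] / out["avg_1999_2007"] - 1) * 100
    out["log_rate_2015"] = np.log(wide[2015])
    return out


features = mortality_features(rates)
print(features.T)
```

Forward filling only helps when an earlier year exists for the county; a county whose first year is missing would still carry a gap at the start.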

3.3 Election Data

  1. The election dataset prepared earlier does not contain percentage vote shares for 2008. So the first feature to derive is the GOP and Democratic percentage share for the 2008 election.
  2. Next we derive the percentage swing in votes for a given county in favor of the GOP from 2008 to 2012, 2012 to 2016 and 2008 to 2016.
  3. We also derive flags to identify which counties were won by Democrats and which by the GOP.
  4. Furthermore, we calculate flags to identify the counties that flipped to the GOP in the 2016 election after voting Democratic in 2008 and 2012. That is, these counties were originally won by Democrats but switched to the GOP in 2016.

Finally we combine the mortality dataset and the election dataset on the county codes (called FIPS codes in the notebook).
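The election features and the final join could look something like the sketch below. All column names and numbers are hypothetical, and the swing is assumed here to be the difference in GOP vote share in percentage points, which may not match the exact definition used in the notebook.

```python
import pandas as pd

# Hypothetical county-level results; 2008 comes as raw counts, later years as shares.
votes = pd.DataFrame({
    "fips": ["00001", "00002"],
    "gop_2008": [4500, 2000],
    "dem_2008": [5500, 8000],
    "total_2008": [10000, 10000],
    "gop_pct_2012": [47.0, 20.0],
    "dem_pct_2012": [52.0, 78.0],
    "gop_pct_2016": [54.0, 15.0],
    "dem_pct_2016": [42.0, 80.0],
})
# Hypothetical mortality features; county 00003 has no election row.
mortality_features = pd.DataFrame({
    "fips": ["00001", "00002", "00003"],
    "log_rate_2015": [6.2, 5.6, 6.0],
})


def add_election_features(votes):
    out = votes.copy()
    # Vote shares for 2008 have to be derived from the raw counts.
    out["gop_pct_2008"] = out["gop_2008"] / out["total_2008"] * 100
    out["dem_pct_2008"] = out["dem_2008"] / out["total_2008"] * 100
    # Swing towards the GOP, in percentage points (an assumption of this sketch).
    for start, end in [(2008, 2012), (2012, 2016), (2008, 2016)]:
        out[f"swing_{start}_{end}"] = out[f"gop_pct_{end}"] - out[f"gop_pct_{start}"]
    # Winner flags per election year.
    for year in (2008, 2012, 2016):
        out[f"gop_won_{year}"] = out[f"gop_pct_{year}"] > out[f"dem_pct_{year}"]
    # Counties won by Democrats in 2008 and 2012 that went to the GOP in 2016.
    out["flipped_to_gop_2016"] = (
        ~out["gop_won_2008"] & ~out["gop_won_2012"] & out["gop_won_2016"]
    )
    return out


election = add_election_features(votes)
# Inner join keeps only counties present in both datasets.
combined = election.merge(mortality_features, on="fips", how="inner")
print(combined[["fips", "swing_2012_2016", "flipped_to_gop_2016", "log_rate_2015"]])
```

An inner join silently drops counties that are missing from either side, which is one way the final analysis can end up with fewer counties than the raw files.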

4.    Modeling & Analysis

The data prepared with feature vectors is exported to a CSV file. We first go through the process of spotting relationships between the variables, detecting outliers and, if required, taking logs of variables to handle non-linearity in the data. We then build an ordinary least squares linear regression model to verify whether the features are statistically significant.

4.1 Analysis

  1. We first use a scatter matrix to check for any trends in the relationships and also for multicollinearity. We then dig deeper into the relationship between pairs of variables.
  2. The plot of death rate against swing shows a skew, so we use the log of the variables.
  3. Along the way we see a few outliers that distort the graphs. To see the correlation clearly it is important to get rid of them, so we remove them accordingly. (Details are in the attached notebooks.)
  4. Looking at the above charts we can say that there is a clear relationship between the death rate of a county and how that county shifted in terms of votes for the GOP in the 2016 election compared with both the 2008 and 2012 elections.
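To make the log transform and outlier removal concrete, here is a small sketch on synthetic numbers. The post does not say which outlier rule was used, so the 1.5 x IQR fence below is purely an assumption, as are the variable names and the data.

```python
import numpy as np


def iqr_mask(values, k=1.5):
    """True for points inside the [Q1 - k*IQR, Q3 + k*IQR] fence."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return (values >= q1 - k * spread) & (values <= q3 + k * spread)


# Synthetic counties: swing rises with the log of the death rate, plus small noise.
death_rate = np.exp(np.linspace(5.5, 6.5, 50))
swing = 6.0 * (np.log(death_rate) - 6.0) + 0.3 * np.sin(np.arange(50))
# One artificial outlier with an implausibly large swing.
death_rate = np.append(death_rate, np.exp(6.0))
swing = np.append(swing, 80.0)

log_rate = np.log(death_rate)              # log handles the skew in the raw rates
keep = iqr_mask(log_rate) & iqr_mask(swing)

r_before = np.corrcoef(log_rate, swing)[0, 1]
r_after = np.corrcoef(log_rate[keep], swing[keep])[0, 1]
print(f"kept {keep.sum()} of {len(keep)} points")
print(f"correlation before: {r_before:.2f}, after: {r_after:.2f}")
```

Dropping points changes the sample, so it is worth reporting how many counties were removed alongside any correlation quoted afterwards.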


 

4.2 Regression Model

Various models are built using OLS linear regression to regress the votes for the GOP in 2016, and the swing in votes in favor of the GOP in the 2016 election relative to 2008 and 2012.

The model regressing the vote swing in percentage terms from 2012 to 2016 found that, keeping everything else constant, every unit increase in ln(Death Rate of 2015) is associated with a 6-point increase in the swing towards the GOP in the 2016 election compared with 2012. (Please note the results are on a logged scale.) The model has an F-statistic of 101, which means that at least one variable in the model is definitely related to the swing; the variables also have low p-values, indicating that they are statistically significant.
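As a rough illustration of what such a fit and its logged-scale coefficient mean, the sketch below fits a one-variable OLS model with NumPy on synthetic data. The data, the true slope of 6 built into it and the helper names are assumptions for the example; it does not reproduce the models in the notebook, which used more variables.

```python
import numpy as np


def fit_ols(x, y):
    """One-predictor OLS: returns intercept, slope, R-squared and F-statistic."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sse = float(residuals @ residuals)            # unexplained variation
    sst = float(((y - y.mean()) ** 2).sum())      # total variation
    n, p = len(y), 1
    r_squared = 1.0 - sse / sst
    f_stat = ((sst - sse) / p) / (sse / (n - p - 1))
    return beta[0], beta[1], r_squared, f_stat


def swing_change(coef, pct_rise):
    """Swing implied by a relative rise in death rate when the regressor is ln(rate)."""
    return coef * np.log(1.0 + pct_rise)


# Synthetic data generated with a slope of 6 on the log death rate.
rng = np.random.default_rng(0)
log_rate = rng.uniform(5.5, 6.5, size=500)
swing = -30.0 + 6.0 * log_rate + rng.normal(0.0, 2.0, size=500)

intercept, slope, r_squared, f_stat = fit_ols(log_rate, swing)
print(f"slope={slope:.2f}  R^2={r_squared:.2f}  F={f_stat:.0f}")
print(f"swing for a 10% higher death rate: {swing_change(6.0, 0.10):.2f} points")
```

Because the regressor is logged, one unit of ln(rate) corresponds to multiplying the death rate by about 2.7. With a coefficient of 6, a 10% higher death rate corresponds to roughly 6 x ln(1.1), or a little under 0.6 points of swing.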

A further model relates the swing in votes to the GOP in 2016 to the percentage increase in mortality rates over various periods. After taking all factors into account, the relationship is still positive: the larger the increase in mortality rates, the more votes shifted towards the GOP in 2016 vs. 2012.

Model diagnostics were run on the model; no evidence of heteroskedasticity was found, and the ggplot also looks okay, with a fairly linear relationship.

Special Note on R-squared: The R-squared for these models is low (15-30%) even though the variables have very low p-values. This combination is not uncommon, especially in social science or psychology studies, as it is not possible to completely understand a behavior and thereby explain all of the variability in the data. Apart from that, some statisticians don’t consider R-squared a good measure of model performance, with some arguing it should never be used. [8] Considering this, the results can be considered valid (even though the model can certainly be improved further!).
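One way to see why a low R-squared and a very low p-value can coexist is the identity that holds for a regression with a single predictor: t² = R²(n − 2) / (1 − R²). The sketch below evaluates it for a small sample and for a sample of about 3,000 observations, a hypothetical figure chosen to be of the same order as the number of US counties. It is a simplification, since the models here have several predictors.

```python
import math


def slope_t_statistic(r_squared, n):
    """t-statistic of the slope in a one-predictor OLS fit with the given R-squared."""
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


# Same R-squared of 15%, very different sample sizes (both sizes are hypothetical).
t_small = slope_t_statistic(0.15, 30)
t_large = slope_t_statistic(0.15, 3000)
print(f"n=30:   t = {t_small:.1f}")
print(f"n=3000: t = {t_large:.1f}")
```

With 30 observations the t-statistic is only a little above 2, while with 3,000 it is about 23, so the same modest R-squared becomes overwhelmingly significant once the sample is large.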

5.    Results

5.1 We first look at the distribution of death rates for middle-aged white Americans as of 2015, and then at the long-term average death rate between 1999 and 2015. We then also map the percentage rise in death rates by county.

 


As we can see from the above, there are counties, especially in the Midwest (with a focus on some counties of the upper Midwest, such as Michigan) and the South, where there is an epidemic of “Deaths of Despair”. The two plots above confirm that there is a regional flavor to this spread of middle-aged white Americans killing themselves, with some regions more adversely affected than others. The pattern is also consistent over a long period.

5.2 We now map the counties that have seen the highest percentage rise in death rates for middle-aged white Americans from 1999 to 2015. Please note that mortality among all other races is dropping (as it is in other advanced countries), so a county seeing such a consistent rise is truly alarming, almost like a failing dystopian county.

We then map the percentage swing in favor of the GOP from the 2012 to the 2016 election. This is the share of votes, in percentage terms, that can be said to have shifted from Democrats to the GOP.

 


Quite clearly, there is a remarkable similarity between the two maps. We can note that the central Midwest, the upper Midwest and also Maine, where the GOP vote share rose by more than 10%, were instrumental in Trump’s election victory. These are the same counties that had seen a remarkable rise in mortality and morbidity for white Americans. The surprise 2016 results in traditionally Democratic states (like Michigan) may be explained by this newly found correlation.

5.3 We now map the percentage swing in favor of the GOP from the 2008 to the 2016 election (votes that shifted from Democrats to the GOP), and then also map the percentage rise in white mortality over the same period (2007 to 2015).

 


The correlation between the counties is remarkable and scary. To reiterate, in just 8 years some counties saw a “rise” of over 20 percent in deaths for white Americans, while mortality for blacks, Hispanics and whites in other countries was falling.

Again, the huge swing in Trump’s favor came from the very same regions: counties in Michigan, Wisconsin and Pennsylvania (which decided the election in favor of Trump) and elsewhere in the central Midwest.

Also note the absence of this “death of despair” phenomenon in the East Coast region (from Boston to Charlotte) and in some parts of the West Coast (which remained Democratic). Lastly, note the increase in mortality rates and the swing in favor of Trump in less-discussed regions such as Maine and the upper western part of the country. These regions may have the potential for an election upset in the future.

6.    Conclusion

Key Findings

  • Based on the profiling maps, it is reasonable to conclude that the mortality rate of a given county explains part of the swing in favor of Trump in the 2016 election relative to the 2012 and 2008 elections. The models find a statistically significant correlation between the two. The feature measuring the percentage rise in average mortality in each county from 2008 to 2016 is strongly correlated with votes for the GOP, so the swing in favor of the GOP can be partly explained by it.
  • In fact, one of the most talked-about events of the election was pollsters missing the mood in Michigan, Wisconsin and Pennsylvania. If one looks closely at the maps, something is definitely going on in Michigan, Wisconsin and Pennsylvania, and this may be the missing piece of the puzzle in explaining the swing in favor of the GOP.
  • Lastly, this study does not explain the root cause of the “Deaths of Despair” phenomenon, only how it can affect political outcomes and how being unaware of this impact can lead to surprising results like the 2016 election.

Next Steps

Predicting or analyzing election results is an art that requires reading the minds of millions of people. This case study never attempted to claim that “Deaths of Despair” is the only factor that affected the outcome of the election, as thousands of other factors go into it.

  • Because of limited time for the project, many factors could not be considered, which leaves the results open to confounding.
  • Education was not considered while looking for the correlation. The education levels of voters play a crucial role in explaining variability in voter percentage swings, and including them may give more accurate models. Adjusting for other important variables may also be a good idea. However, data for education is difficult to find.
  • Because of some missing data values, a simpler method was used for handling missing data and for tackling outliers. A more thorough analysis could involve collecting data for all counties from sources other than the CDC (like NVSS and Mortality.org) to fill in the gaps. Overall, the final analysis had a number of counties missing; better insight could have been obtained with a richer dataset. Cook’s distance could also be used for handling outliers.
  • Because of lack of time, only regression models were tried, and only with OLS. Other models may fit the data better and may give stronger insights and accuracy.
  • Lastly, including mortality rates in traditional election models to predict voter swings would be a proper next step for testing the effectiveness of this study.
References

 

  1. http://wws.princeton.edu/faculty-research/research/item/rising-morbidity-and-mortality-midlife-among-white-non-hispanic
  2. https://wonder.cdc.gov/mortsql.html
  3. https://www.cdc.gov/nchs/data_access/VitalStatsOnline.html
  4. https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm
  5. https://github.com/tonmcg/County_Level_Election_Results_12-16
  6. https://simonrogers.net/2016/11/16/us-election-2016-how-to-download-county-level-results-data/
  7. http://www.nber.org/data/vital-statistics-mortality-data-multiple-cause-of-death.html
  8. http://data.library.virginia.edu/is-r-squared-useless/