Analyze the Asian American community to discover whether it is a monolithic group or whether there are hidden patterns of distinct social profiles, based on socio-economic demographics, life priorities, immigration status and views of the USA, that can explain their social behavior.
Asian Americans form one of the largest immigrant groups in the USA, and their significance in social, cultural and economic terms normally outranks their actual share of the population. Based on popular statistics, with the average income of an Asian American much higher than that of the average white American and about twice as many Asian Americans holding a graduate degree or more, they are arguably called the most privileged group in America. However, typecasting all Asian Americans under a single umbrella may not be correct, since Asia itself is too diverse to have any monolithic identity or behavior. We are essentially referring to a group drawn from a pool of 4 billion people with hundreds of different cultures and religions, a collage of ethnicities and races, and a very troubled history, not to mention widespread poverty throughout millennia. In fact, many of these Asian Americans’ native nations are still openly hostile to one another and involved in arms races.
In this regard, it would be interesting to analyze the social behavior traits and mindset of Asian Americans: are there nuances within their single identity, and do those differences run along lines of region, religion or race? How well are they integrated with the rest of America? What are the different segments within Asian Americans based on their thinking and behavior? What traits do they share that can be predicted from basic demographics? What are the possible explanations behind that social behavior? Are all Asian Americans well off, or is there disparity within the group? How did Asians, typically coming from poor countries, manage to move past white Americans on the socio-economic ladder so soon? These are some of the questions we can explore through segmentation analysis of the population.
The analysis is based on a survey conducted by the Pew Research Center. No clustering of this data on socio-economic parameters appears to have been done so far, so this project aims to fulfill that objective for the first time.
To understand the mindset of any community, we need to begin by collecting what each individual thinks. The behavior of a group is driven by many complex factors working simultaneously: personal demographics, motives, beliefs, historical context, what people want in life versus what they have, and social status. This information can be captured by conducting social surveys.
The current survey data captures demographic details, origin details prior to immigration to the USA, ties back to the native country, what respondents want in life, their perception of the USA versus their home country, their perception of discrimination and racism in the USA, their relationships with other races (viz. Blacks, Whites and Latinos), their political ideology and background information.
For this project, unsupervised learning methods are used to perform social analytics that profile Asian Americans and help understand their behavior. K-means clustering is the algorithm chosen for this segmentation. We use the standard distance-based error calculation to compare the various clustering solutions.
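Assuming the "standard distance method" here means the within-cluster sum of squared Euclidean distances (the WSSE error reported later in this report), the quantity compared across clustering solutions can be written as below. The symbols x_i (a respondent's feature vector), C_j (the j-th cluster) and mu_j (its centroid) are notation introduced here for illustration only.

```latex
\mathrm{WSSE}(k) \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Because k-means only converges to a local optimum, the value obtained for a given k can differ between runs, which is one reason to compare several runs per setting.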
Choosing the variables for unsupervised learning is always a subjective decision, especially in social analytics. Humans are defined not by who they are but by what they want to be; self-defined identity and aspirations are stronger indicators of one’s thinking than socially defined identity. In this case our objective is to understand how Asian Americans’ behavior varies within the group and what underlying patterns, if any, can be discovered to understand them. The drivers below typically influence how a person thinks in terms of community identity, and they are used as input for the clustering algorithm.
- Basic Demographic information
- Advanced Demographics like Race, Religion , Immigration status , family details & background
- What they want in life/priorities of life?
- How do they compare the USA to their native country?
The hypothesis proposed is that Asian Americans are not a single group but are too diverse to be covered by any single term.
2. Social Analytics Results/ Model Evaluation
- The details of how the data is prepared and how the modelling is done are given below. This section shares the key findings.
Cluster 1 – “The Baby Boomers”
- This cluster is dominated by “Indians”, who make up the majority (~60%) of the group, along with other South Asians. The second most prominent ethnicity in the group is Chinese.
- This cluster can be classified as an affluent section of American society (if not “privileged”). The largest share of people (~40%) make over $100k per year, and as many as 60% of this group have an advanced degree (Post Graduate or PhD). Overall, almost 80% of this group have a Graduate Degree or above, which is far higher than the American average of some 26%, and on both of these parameters the group is well above white Americans. (1) It mirrors the original baby boomer generation of the USA.
- This cluster is also dominated by first-generation immigrants, with over ~90% of respondents born outside the USA and newly arrived. Even though this group is highly successful in financial terms, it considers marriage and family (being good parents) the most important priority in life (as compared to career).
- This group thinks that conditions for freedom (political or religious) in the USA are the same as in their own country; however, they consider the USA a better place than their home country for raising kids, for education and for equality. Tied back to the earlier point, this can explain why they live in the USA and the motivation behind it. This is also a very aspirational and ambitious group, since they want their kids to have a much better standard of living than they have currently.
- This group also feels socially integrated with American culture: ~90% of them say they have never felt discrimination or racism in the USA, and almost the same percentage feel satisfied with their life (which correlates directly with their income level).
- This group would always choose to come to the USA again and seems committed to living there, even though a unique social paradox of this cluster is the unusually high number of people who think their home country is better than the USA in terms of morals and values.
Cluster 2 – “In the Shadows”
- This cluster is dominated by East and Southeast Asians, viz. Koreans and Filipinos, along with some Vietnamese.
- Most of the people in this cluster have been in the USA for a long time, with an average of ~50 years, and the cluster has a higher concentration of the elderly than any other.
- This group considers having a home, a career and financial stability the most important things in life, which, when contrasted with their age, is a very curious finding.
- This aspiration is perhaps explained by their economic profile. The majority of this group is poorly educated (High School or below), and the majority is also in the Poor income bracket (i.e. income of less than $20k). However, they also think the USA is a better country in terms of treating the poor and would come to the USA again.
- One interesting observation is that this group considers religion the most important thing in life but at the same time considers the USA a better country than home in terms of morals. This is in sharp contrast to the rich cluster 1, whose members, despite being successful in the USA, consider their home country to have better values.
- There is also an unusually high concentration of women in this cluster compared with any other. I could not offer any explanation for this.
- Another unique feature is the unusually high number of Republican supporters and conservatives in this group. All other clusters of Asian Americans are typically dominated by moderates or liberals leaning towards the Democratic party.
Cluster 3 – “Stuck In the Middle”
- This cluster is dominated by a concentration of Japanese and “the other Asians”.
- This cluster is dominated by second-generation Asian Americans (as compared to the first generation in the earlier two clusters) and has a predominantly younger population.
- This cluster can be distinguished by a majority with middle income (~$20-75k) and moderate education (High School and above).
- This cluster can be identified as pessimistic in outlook: the majority think their kids will have the same standard of living as they have now, and a significant portion say they would move to a country other than the USA if they could turn back time.
- This nostalgic behavior is a very interesting observation, especially considering the younger age profile of this cluster. (We would normally expect the older people who dominate cluster 2 to be nostalgic rather than aspirational, but the data reveals something else.)
- There is also no clear life priority that can be distinguished in this group; they consider all the things in life (career, religion or family) somewhat important rather than making a clear choice.
- An inexplicable feature of this cluster is that an overwhelming majority (~85%) of people did not disclose their citizenship status. Perhaps this can explain the nostalgia and pessimism of this group.
Cluster 4 – “Freedom Guys”
- This cluster is dominated by a concentration of Chinese with some Koreans and Japanese.
- They are mostly first-generation immigrants to the USA (as compared to the Japanese and others of the previous cluster, who were second generation) and in the middle age bracket.
- This cluster is dominated by people with very little permanent employment and income levels in the poor or lower middle class range (~$10k-50k). Majority unemployment is a distinguishing feature of this cluster, as it has not been observed in any other cluster. It may be explained by their coming from a different educational background within their native country, which separates them from their counterparts.
- Even though this group considers its financial status less than fair, it thinks the USA is a better country than home in terms of opportunity and treating the poor.
- There is no clear life priority that can be distinguished in this group, but the group uniquely considers the USA a better country than home in terms of religious or political freedom and overall freedom of expression, and that may explain the motivation of this cluster.
- This group also considers family ties its primary reason for living in the USA; curiously, though, they consider the USA better than their home country in terms of opportunities and would like to come back again.
- One interesting observation in this cluster is the very low importance given to religion, with the majority considering religion not at all important in their life.
- This is also the group in which the majority of people are not citizens of the USA. That perhaps explains the high rate of unemployment in this cluster.
Cluster 5 – “The Optimist”
- This cluster is dominated by Vietnamese mixed with Chinese and other Asian groups.
- This is a middle-aged group that has been in the US for quite a while and is still first generation. Perhaps what makes this cluster unique is the combination of these characteristics with a majority of members being citizens of the USA; this combination has not been observed in any other cluster.
- This is a middle-income group of fully employed people with education levels distributed across the whole range and no clear skew. This is an interesting phenomenon considering they have been in the USA for a while. It may signify that, over a period of time, the so-called ethnic advantages fizzle out.
- This group considers religion a priority in life and thinks the USA is a better country than home in terms of morals and values.
- This is also the cluster in which the priority in life for the majority is marriage and being a good parent, and whose members also think the USA is a great place to raise kids.
- They are also optimistic about the prospect of a better standard of living for their kids (despite moderate income levels as compared to cluster 1) and would like to come back to the USA if born again.
3. Social Profiling Conclusion
- Based on the profiling reported above, it is tempting to conclude that the Asian American community is segmented along lines of race or nationality, but that would be an incorrect conclusion to draw.
- It is very important to note that there is some mix of communities in all clusters; the defining factor for a cluster is not race or nationality but “immigration status” and education level. When they arrived in the USA (first or second generation) and with what background they came are crucial factors for understanding their subsequent economic status and social behavior. This is not to say race relations do not matter; they do, at the level of perception and on the surface. However, the segregation of behavior traits could not be linked directly to any race, but rather to immigration and economic status, which varies hugely.
- Perhaps the most important conclusion of the Asian American profiling can be explained in terms of what has not been observed. As mentioned in the abstract, it is very important to note that the so-called “Asian Americans” are drawn from a collage of not just the largest but also the most complex groups on earth in terms of sheer variety. Asia is a land where religion is a predominant factor. However, although religion served as an input, the observed segmentation did not run along those lines, implying that the behavior and thinking of the various groups within Asian Americans are divided not by religion but by other factors.
- It can be concluded that as the immigration status of these communities changes over time, they adapt and become very distinct from their native counterparts who have newly arrived.
- We may term this phenomenon “Social Darwinism”, whereby groups that are isolated or away from their mainland adapt to the new land for survival but end up thriving as a transformed and much better species than the one from which they originally came.
- This can easily be explained in terms of actual Darwinism with the example of the Galapagos. These are an isolated group of small volcanic islands far away from mainland South America. With an area smaller than the state of Massachusetts, they have an astonishing diversity of plant and animal species, unparalleled by any other place on the planet. It is all the more interesting that the ancestors of these species did not originate on the islands but were carried over from the mainland by accident, and then adapted so thoroughly that they became unrecognizable from the species of their native land.
- While such adaptation is common, what makes it unique in the Galapagos is that it happened at incredibly fast speed (in evolutionary terms) and is still happening, so that we can observe adaptation experimentally within a few years (not a few million years). This rapid integration and adaptation is explained by the lack of mainland competition, which made these species not just thrive but become exceptionally successful and distinct from others.
- A similar explanation can be offered for the thriving Asian American community and for the change in the traits that defined its members in their native countries versus now in the USA, which has always been a melting pot for ethnicities worldwide and has been utilized quite successfully by the community.
- At the same time, it is important to note that there are other groups within the community (definable not by ethnicity but by social ladder) that are wildly distinct from each other, which impacts their mindset and social behavior.
The hypothesis stated in the abstract, that the Asian American community is not a monolithic entity but one with measurable social and economic variations and behavioral traits, can thus be accepted based on the results of this project.
Data Engineering Section
4. Data Understanding
- The dataset is downloaded as an SPSS file for data understanding. SPSS is installed as per the appendix mentioned below.
- The dataset comes with a codebook that explains the column definitions and serves as the data dictionary. This is used for conducting the data analysis. To start with, there are a total of 268 fields. Understanding all these columns and preparing the data for the model is the most difficult and time-consuming part of any analytics project.
- The dataset contains many metadata fields. As we do not need these sampling metadata fields, they are removed while reading the dataset.
- As the next step, the columns are renamed to easily recognizable headers as per the codebook. For example, the column previously named q9_1 was renamed to q9_1_coded_which_country_superpower_, and so on for all columns. This helps with further data understanding.
- As the next step of data understanding, we filter out additional columns. These are columns created as duplicates to facilitate specific analyses of this dataset, for example for someone who wants to analyze only one group, viz. Korean Americans. As the scope of this project is all Asian Americans, we do not require these additional columns.
- Since time spent in the USA is an important feature, we derive it from the field holding the year of arrival in the USA. (The survey was conducted in 2012, so that year is used in the expression.)
  if (to_number(2012-q59_when_came_to_usa)>0) then to_number(2012-q59_when_came_to_usa) else -1 endif
  We then inspect the distribution of this field to choose a binning method, and bin the field accordingly. The method used for binning the variable is fixed-width.
- A similar step is then performed for income. The source data has 10 subgroups of income ranges, which are simplified to 4 groups as listed below. The resulting field is then used as a bin.
- 1 Less than $10,000 = 1 Poor
- 2 $10,000 to under $20,000 = 1 Poor
- 3 $20,000 to under $30,000 = 2 Lower Middle
- 4 $30,000 to under $40,000 = 2 Lower Middle
- 5 $40,000 to under $50,000 = 2 Lower Middle
- 6 $50,000 to under $75,000 = 3 Higher Middle
- 7 $75,000 to under $100,000 = 3 Higher Middle
- 8 $100,000 to under $150,000 = 4 Rich
- 9 $150,000 or more = 4 Rich
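The two derived fields described above (years in the USA and the four-level income band) can be sketched in plain Python as follows. This is only an illustration of the logic, not the SPSS stream used in the project: the function names, the handling of a missing arrival year and the 10-year bin width in the usage note are assumptions.

```python
SURVEY_YEAR = 2012  # year of the survey, as stated in the text

def years_in_usa(arrival_year):
    """Mirror the expression above: positive tenure, otherwise the sentinel -1."""
    if arrival_year is None:  # assumption: a missing arrival year also maps to -1
        return -1
    tenure = SURVEY_YEAR - arrival_year
    return tenure if tenure > 0 else -1

# Source income codes 1-9 mapped to the four simplified bands listed above.
INCOME_BAND = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4}
BAND_LABEL = {1: "Poor", 2: "Lower Middle", 3: "Higher Middle", 4: "Rich"}

def income_band(code):
    """Return the band (1-4) for a source income code, or None if the code is not listed."""
    return INCOME_BAND.get(code)

def fixed_width_bin(value, width, start=0):
    """Return the 0-based index of the fixed-width bin that contains value."""
    if value < start:
        return None  # e.g. the -1 sentinel falls outside every bin
    return int((value - start) // width)
```

For example, a respondent who arrived in 1990 has years_in_usa(1990) equal to 22, which falls in bin index 2 when a hypothetical width of 10 years is used.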
- The age field is binned next, again using the fixed-width binning method. The results are as below.
- Now we can run a Data Audit on all the fields to generate statistics on the data. For detecting outliers, I used the standard method of flagging values more than 3 standard deviations from the mean. Below are the key findings of the Data Audit.
- The overall quality of the data is 85%, which means we can use the data reliably.
- The data is clean based on the statistics generated (means and standard deviations) and the completeness analysis in the data audit report.
- The columns with less than 30% completeness are dropped from further analysis.
- The advanced statistics do not detect any outliers or skewness issues, so we can use the data without transformations.
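The audit checks described above were run with the SPSS Data Audit tool. As a rough illustration of the two rules involved (a completeness threshold and a 3-standard-deviation outlier test), a minimal plain-Python sketch could look like this. The function names and the use of None for missing values are assumptions, and the 0.30 threshold follows the completeness figure quoted in the list above.

```python
from statistics import mean, pstdev

def completeness(values):
    """Share of non-missing (non-None) entries in a column."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

def drop_sparse_columns(table, min_completeness=0.30):
    """Keep only the columns whose completeness reaches the threshold."""
    return {name: col for name, col in table.items()
            if completeness(col) >= min_completeness}

def outliers_3sigma(values, n_sigma=3.0):
    """Return the values lying more than n_sigma standard deviations from the mean."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return []
    mu = mean(present)
    sigma = pstdev(present)  # population standard deviation
    if sigma == 0:
        return []
    return [v for v in present if abs(v - mu) > n_sigma * sigma]
```

Here table is assumed to be a dictionary mapping each column name to its list of values; a column such as [None, None, None, 1] has 25% completeness and would be dropped.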
5. Data Preparation
3.1 We now have clean data available that we can use to define feature vectors. Preparing data so that it can be used in models is also a time-consuming activity. It goes back and forth many times, and sometimes data issues are discovered only after running the models. For the sake of simplicity, only the final data preparation methods are mentioned below.
3.2 As a first step of data preparation, we need to check for multicollinearity across the fields so that collinear fields can be omitted. We will not use advanced methods like VIF or heteroskedasticity tests, since we are running clustering as opposed to classification models. The results for overall multicollinearity are included below and do not show anything suspicious.
The obviously suspect columns, like geographical region and geographical division, will not both be included as features.
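A simple way to screen for collinear fields without VIF is to compute pairwise Pearson correlations and flag pairs above a cut-off. The sketch below illustrates that idea in plain Python; it is not the SPSS procedure used in the project, and the function names, the 0.9 cut-off and the column names in the usage note are assumptions.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(table, threshold=0.9):
    """List (column_a, column_b, r) for every pair with |r| at or above the threshold."""
    pairs = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs
```

Applied to a table with hypothetical columns region, division and age, a pair such as (region, division) would be flagged if the two encode nearly the same information, and one of them would then be left out of the feature set.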
3.3. After this we define the features for the model. This is based on the synopsis given in the business statement. The drivers for defining features for clustering are
- Basic Demographic information
- Advanced Demographics like Race, Religion, Immigration status, family details & background
- What they want in life/priorities of life
- How do they compare the USA to their native country
3.4 Below are the columns that we can define. Overall we get 37 features and an ID field.
The data is then reordered in a deductive order for easier collation.
3.5 Special Note: There are two advanced demographic fields included in the features, “how important religion is for you” and “do you believe in god”. These two fields may seem collinear, but correlation analysis revealed the interesting result that they actually have a very weak correlation. This may imply that those who believe in God do not necessarily consider religion an important aspect of their life (at least in the case of Asian Americans). Since religion or belief in God can influence how people think, both of these fields are included in the analysis. See the correlation results below.
3.6 Now we can conduct the actual data preparation steps, such as removing nulls or missing values or replacing them with the mean. For this we can use the utility provided by SPSS. We define the settings below, which basically just replace missing values with means. This also converts all categorical fields to encoded numeric fields of the same data type (float/real). Any columns with more than 30% missing values are dropped from the analysis, as mentioned above.
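The preparation described in 3.6 was done with the SPSS utility. Purely as an illustration of what mean replacement and numeric encoding do, a minimal plain-Python sketch follows. The function names and the use of None for missing values are assumptions, and the encoding by order of first appearance is one simple choice that need not match what SPSS produces.

```python
def impute_mean(values):
    """Replace missing (None) entries with the mean of the observed values, as floats."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing observed, so there is no mean to fill in
    fill = sum(present) / len(present)
    return [float(fill if v is None else v) for v in values]

def encode_categories(values):
    """Map each distinct category to a float code in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v is None:
            encoded.append(None)  # leave missing entries for the imputation step
            continue
        if v not in codes:
            codes[v] = float(len(codes))
        encoded.append(codes[v])
    return encoded, codes
```

Note that filling a missing value of an encoded categorical field with the mean yields a code that may not correspond to any real category; whether that is acceptable depends on the field.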
3.7 Now we can export the results of our data preparation step to conduct the modeling.
- The prepared data with the feature vectors is exported to a csv file.
- The k-means script for creating the model using Spark-Python can also be run directly from the SPSS console. The method for doing so is mentioned in the appendix below.
- However, for this project the data is exported to a data file and the model is run directly from the spark-python console using a python script. The python script contains the solution steps below. (Please see the attached script and the appendix below.)
- First we import the pyspark libraries as required. We are using Spark MLlib for machine learning, so we import it as well.
- This is followed by setting the spark context and then loading the data from the file into the spark context. We then create an RDD for the data.
- The data is then parsed using a lambda map and printed for verification.
- We then build the k-means model. The model is trained with max iterations set to 100, runs set to 50 and random seed initialization. The number of clusters is a variable that is tuned based on the error rate.
- The model is evaluated using the within-set sum of squared errors (WSSE). We calculate the distance of each point from the centroid of its cluster. The mean error is printed for evaluating the models.
- The results predicted by the model are saved and exported to a file for further analysis.
- After conducting multiple runs and varying the number of clusters across 4, 5 and 6, we find that the error rate is lowest when the number of clusters is 5; hence we consider that the stabilized model and choose it for analysis. Please see the error rate for the various runs below. The measures of cohesion and separation are averaged over the whole clustering and not calculated for individual clusters. The file is extracted as the output. The summary of the clusters observed, the behavior traits analyzed and what makes each population segment different from the others is given in the Social Analytics Results section.
PS: For easier consumption, cluster 0 from Python is relabeled as cluster 1, and so on. The overall cluster distribution is below. For simplicity, some labels have been added manually. Please also see the appendix for all the visualizations generated from Python and for the code details.
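The attached Spark script is not reproduced here. As a stand-in, the following plain-Python sketch illustrates the same modelling steps described above: random initialization, up to 100 iterations, keeping the best of 50 runs, scoring by the within-set sum of squared errors, and comparing several values of k. It deliberately avoids Spark so that it runs anywhere; all function names are hypothetical and it is not the project's actual implementation.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))

def kmeans(points, k, max_iterations=100, seed=None):
    """One run of Lloyd's algorithm starting from k randomly sampled points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append([sum(m[d] for m in members) / len(members)
                                for d in range(dim)])
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids

def wssse(points, centroids):
    """Within-set sum of squared errors for the given centroids."""
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)

def best_of_runs(points, k, runs=50, max_iterations=100, seed=0):
    """Repeat k-means with different random starts and keep the lowest-error run."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centroids = kmeans(points, k, max_iterations, seed=rng.randrange(2 ** 32))
        error = wssse(points, centroids)
        if best is None or error < best[0]:
            best = (error, centroids)
    return best

def compare_k(points, candidates=(4, 5, 6), runs=50):
    """Error of the best run for each candidate number of clusters."""
    return {k: best_of_runs(points, k, runs)[0] for k in candidates}

def cluster_label(point, centroids):
    """1-based cluster label, matching the relabeling of cluster 0 as cluster 1."""
    return nearest(point, centroids) + 1
```

Calling compare_k on the prepared feature vectors (a list of equal-length numeric tuples) returns one error value per candidate k, which is the kind of comparison described in the list above.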
6. Next Steps
It is interesting to note that one of the few communities that has been largely immune from the current political polarization is the Asian American community, despite the fact that, in general terms, it has been more successful socio-economically than any other race in America, including white Americans. This is perhaps an indicator of the successful integration of the community. There are many data points in this survey that were outside the scope of this project but that can be analyzed further to understand the integration of this community with the larger USA population.
Similar data is available from Pew for other races, viz. Blacks and Latinos, and it would be interesting to run a similar segmentation analysis and compare the results with this one to understand what explains the inclusive integration of this community versus others. Perhaps a kind of “social integration index” could also be calculated and used to compare various segments to get a bird’s eye view of American social dynamics.
Profiling the population is normally the first step in machine learning use cases. Typically, in customer marketing analytics, such profiling results can then be used by classification algorithms to predict an individual’s propensity to respond to certain offers. Profiling also serves as input for campaign management and direct marketing methods. All of these are possible next steps: using this data as input to marketing analytics that maps segments to their preferences and then targets each segment accordingly.
The data for this project is provided by the Pew Research Center (http://www.pewresearch.org/). Pew, along with Gallup, is a trusted source of data for population-based analytics throughout the world. The source data links are included below, and the methods used by Pew, along with the scientific methods of population sampling, are described at the link.
The survey data is provided with a codebook that includes details of all the questions asked in the survey, along with the categorical codes, which makes processing this data easier. The data is provided in only two file formats, SPSS and Stata. For this project the SPSS source file is used.
Below is an extract from the Pew site describing the data collection method for this dataset. The original report referenced above is available at http://www.pewresearch.org/
The Pew Research Center 2012 Asian-American Survey is based on telephone interviews conducted by landline and cell phone with a nationally representative sample of 3,511 Asian adults ages 18 and older living in the United States. The survey was conducted Jan. 3-March 27, 2012, in all 50 states, including Alaska and Hawaii, and the District of Columbia. The survey was conducted using a probability sample from multiple sources. The data are weighted to produce a final sample that is representative of Asian adults in the United States. Survey interviews were conducted under the direction of Abt SRBI, in English and Cantonese, Hindi, Japanese, Korean, Mandarin, Tagalog and Vietnamese. Respondents who identified as “Asian or Asian American, such as Chinese, Filipino, Indian, Japanese, Korean, or Vietnamese” were eligible to complete the survey interview, including those who identified with more than one race and regardless of Hispanic ethnicity. The question on racial identity also offered the following categories: white, black or African American, American Indian or Alaska Native, and Native Hawaiian or other Pacific Islander. U.S. Asian groups, subgroups, heritage groups and country-of-origin groups are used interchangeably to reference respondents’ self-classification into “specific Asian groups.” This self-identification may or may not match respondents’ country of birth or their parents’ country of birth. Self-classification is based on responses to an open-ended question asking for a respondent’s “specific Asian group.” Asian groups named in this open-ended question were “Chinese, Filipino, Indian, Japanese, Korean, Vietnamese, or of some other Asian background.” Respondents self-identified with more than 22 specific Asian groups.
Those who identified with more than one Asian group were classified based on the group with which “they identify most.” Many questions on the survey used question wording customized to match the respondent’s self-identification into country-of-origin groups. A full description of the sampling design and methodology is provided in an appendix to the “Rise of Asian Americans” and “Asian Americans: A Mosaic of Faiths” reports. These reports also include a detailed topline of the survey’s results. The English language questionnaire is provided as a further point of reference.