Happiness Score¶
Description¶
The World Happiness Report is a landmark survey of the state of global happiness. The dataset was obtained from Kaggle and trimmed to the 2015-2017 reports for simplicity and to avoid missing values.
Objective¶
The goal of this data analysis is to identify the factors that contribute most to a country's happiness.
Content¶
The column descriptions are as follows:
Country - Name of the country
Happiness Rank - Rank of the country based on the Happiness Score.
Happiness Score - A metric measured in 2015 by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest."
Economy - The extent to which GDP contributes to the calculation of the Happiness Score. (Economic production)
Family - The extent to which Family contributes to the calculation of the Happiness Score. (Social support)
Health - The extent to which Life Expectancy contributes to the calculation of the Happiness Score. (Life expectancy)
Freedom - The extent to which Freedom contributes to the calculation of the Happiness Score.
Trust - The extent to which the Perception of Corruption contributes to the calculation of the Happiness Score. (Absence of corruption)
Generosity - The extent to which Generosity contributes to the calculation of the Happiness Score.
Dystopia Residual - The extent to which the Dystopia Residual contributes to the calculation of the Happiness Score.
Year - Year the survey result was collected
Dystopia is an imaginary country whose values equal the world's lowest national averages for each of the six factors, i.e. the lowest GDP, the lowest social support, the lowest life expectancy, and so on.
The purpose of Dystopia is to provide a benchmark, because no country can perform worse than Dystopia.
The residual, or unexplained component, reflects the extent to which the six variables either over- or under-explain a country's average score.
The formula to calculate the Happiness Score is:
$\text{Happiness Score} = \text{Economy} + \text{Family} + \text{Health} + \text{Freedom} + \text{Trust} + \text{Generosity} + \text{Dystopia Residual}$
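As a quick sanity check of the formula above, the component columns can be summed and compared with the reported score. The sketch below uses a tiny made-up table with shortened, hypothetical column names (the headers in the actual CSV may differ), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table; all values are invented for illustration
toy = pd.DataFrame({
    "Happiness Score": [7.5, 3.0],
    "Economy": [1.4, 0.3],
    "Family": [1.3, 0.4],
    "Health": [0.9, 0.2],
    "Freedom": [0.6, 0.3],
    "Trust": [0.4, 0.1],
    "Generosity": [0.3, 0.2],
    "Dystopia Residual": [2.6, 1.5],
})

components = ["Economy", "Family", "Health", "Freedom",
              "Trust", "Generosity", "Dystopia Residual"]

# Sum the seven components row by row
reconstructed = toy[components].sum(axis=1)

# Compare with the reported score, allowing for rounding in the data
matches = np.isclose(reconstructed, toy["Happiness Score"], atol=1e-3)
print(matches.all())
```

On the real data the same check could be run against `df`, after adjusting the column names to match the file.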
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('World_Happiness_2015_2017.csv')
print("Packages imported, data read")
Exploring the Data:
- Display first 5 rows
- Concise summary of data
- Statistical summary
- Check null values
def overview(df):
    print("-----------------------------First 5 Rows-----------------------------")
    # Render as a rich table in IPython/Jupyter
    display(df.head())
    print("-----------------------------Concise Summary-----------------------------")
    # info() prints its output directly, so display() is not needed
    df.info()
    print("-----------------------------Statistical Summary-----------------------------")
    display(df.describe())
    print("-----------------------------Null Values-----------------------------")
    # Percentage of missing values per column, formatted to two decimals
    percent = (df.isnull().mean().sort_values(ascending=False)*100).map(lambda row: '%.2f' % row)
    # Sum to get the counts
    total = df.isnull().sum().sort_values(ascending=False)
    # Create a single data frame to present the missing-value results
    display(pd.concat([total,percent], axis = 1, keys=['Total','%']))
overview(df)
- First 5 rows: The data looks clean; the feature columns are already numeric, so no conversion is needed.
- Concise summary: All columns have the correct data types.
- Statistical summary: There are 470 rows, one per country per year across the three years.
- Null values: There are no null values in this data set.
Pairwise Plot¶
The pairwise plot shows:
- Histogram: distribution of a single feature
- Scatter plot: relationship between two features
g = sns.pairplot(df)
g.fig.suptitle('Pairwise Plot', fontsize=20)
g.fig.subplots_adjust(top=0.9)
The plot above summarizes the relationship between each pair of features in this data.
The diagonal shows the distribution of the data for a specific feature.
For example, "Year" only takes the values 2015, 2016, and 2017, so its distribution plot looks unusual, with three roughly equal-size vertical bars.
The other panels give the big picture of how the attributes relate. For example, the Economy row against the Happiness Score column shows a positive correlation: countries with a stronger economy tend to have a higher happiness score.
# Check the relationship between happiness score and other attributes
# Store necessary columns (the data frame loaded above is named df)
attribute = df.loc[:,'Happiness Score':"Dystopia Residual"].copy()
# Calculate r and sort values in descending order
r = attribute.corr()['Happiness Score'].sort_values(ascending=False)
# Calculate r squared
r_squared = r**2
# Concatenate to create a single table
r_table = pd.concat([r,r_squared],axis=1, keys=['R','R^2'])
# Conditional background formatting
# set_precision was removed in pandas 2.0; format(precision=3) is the replacement
r_table.style.background_gradient(cmap="coolwarm",subset=['R','R^2']).format(precision=3)
# factors = ['Happiness Score','Economy (GDP per Capita)','Family','Health (Life Expectancy)','Freedom','Trust (Government Corruption)','Generosity','Dystopia Residual']
# corr_data = data[factors]
# corr_data.corr()
These are the $R$ and $R^2$ values, representing the correlation between the happiness score and the other features.
The $R$ value alone is hard to interpret, which is why the $R^2$ value, the share of the variance in the happiness score associated with a feature, gives a better understanding.
For example, Economy accounts for roughly twice as much of the variance in the happiness score as Freedom does.
Since the Dystopia Residual serves as the baseline for these values, Trust and Generosity, whose correlations fall below it, are not reliable indicators of the happiness score.
But why do government corruption and generosity appear to have so little effect on people's happiness? Anecdotally, many people do not rely heavily on the government, so whether or not it is corrupt may matter little to them. Likewise, someone being generous to you does not pay the rent or fill your stomach. Money, by contrast, is quite a big factor in happiness, because having it removes many everyday problems. Someone living paycheck to paycheck is likely more concerned about their next payment than about whether the government is going to help.
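To make the link between the $R$ and $R^2$ values discussed above concrete, here is a minimal sketch on synthetic data (not the happiness data set). It assumes a simple one-feature linear fit and shows that the squared correlation coefficient equals the fraction of variance explained by that fit.

```python
import numpy as np

# Synthetic data: y depends linearly on x plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Least-squares line and the variance it leaves unexplained
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared_from_fit = 1 - residuals.var() / y.var()

# For a one-feature linear fit the two quantities agree
print(np.isclose(r**2, r_squared_from_fit))
```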
$R$ and $R^2$: Bar Graph¶
# Create a function to plot r and r^2
def barplot(data,title):
    plt.figure(figsize=(15,5))
    g = sns.barplot(y=data.index, x=data.values)
    g.set_title(title)
barplot(r,'R-value vs. Features')
barplot(r_squared,'R^2-value vs. Features')
The two graphs above visualize how the $R$ and $R^2$ values compare across features.
$R$ and $R^2$: Heat Map¶
# By default the heat map draws both triangles, which is redundant because the correlation matrix is symmetric
# sns.heatmap(attribute.corr(), annot=True, cmap="coolwarm")
# Hence, we use a mask to cover the upper triangle
# ones_like returns an array of ones with the same shape as the input; with a boolean dtype it is filled with True
# triu then keeps the upper-right triangle (including the diagonal) as True and sets the lower-left triangle to False
# When the mask is passed to heatmap, True cells are hidden and False cells are displayed
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent
mask = np.triu(np.ones_like(attribute.corr(), dtype=bool))
sns.heatmap(attribute.corr(), annot=True, mask=mask, cmap="coolwarm").set_title("R value")
plt.figure()
sns.heatmap(attribute.corr()**2, annot=True, mask=mask, cmap="coolwarm").set_title("R^2 value")
The heat maps above show not only how the happiness score relates to the other features, but also how the features relate to one another.
As we can see, the more readily quantifiable features, such as Economy and Health, have high $R^2$ values.
I was surprised that Family still has quite a high $R^2$ value, because social support is difficult to quantify. As expected, Freedom, Trust, and Generosity all show low correlations, possibly because they are measured more vaguely.
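The upper-triangle mask used for the heat maps above can be checked on a small matrix. The sketch below uses a made-up 3x3 correlation matrix and shows which cells the mask hides; `np.where` stands in for what the heat map does with the mask.

```python
import numpy as np

# Made-up symmetric correlation matrix (illustrative values)
corr = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])

# True on and above the diagonal, False strictly below it
mask = np.triu(np.ones_like(corr, dtype=bool))

# Emulate the masking: hidden cells become NaN, visible cells keep their value
visible = np.where(mask, np.nan, corr)

print(mask)
print(visible)
```

Only the three cells below the diagonal remain visible, so each pair of features appears exactly once.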
Regression¶
x_vars = attribute.iloc[:,1:].columns.tolist()
y_vars = 'Happiness Score'
# Create a grid of plots, with y_vars on the y axis and x_vars on the x axis
g = sns.PairGrid(df, y_vars=[y_vars], x_vars=x_vars)
# Draw a regression plot in each cell, and shrink the scatter markers via scatter_kws
# map applies sns.regplot to every cell of g and forwards the scatter_kws dictionary to it
g.map(sns.regplot, scatter_kws={'s':0.8})
# Add a suptitle; the default position overlaps the plots, so move it up slightly with y
g.fig.suptitle("Happiness Score Regression", y =1.1)
# y_vars
The regression plots above confirm the correlation between the happiness score and Economy, Family, Health, and Freedom.
Regression: Yearly Analysis¶
# Create a grid of plots, with y_vars on the y axis and x_vars on the x axis, colored by year
g = sns.PairGrid(df, y_vars=[y_vars], x_vars=x_vars, hue="Year")
# Draw a regression plot in each cell, and shrink the scatter markers via scatter_kws
g.map(sns.regplot, scatter_kws={'s':0.8})
g.add_legend()
# Add a suptitle; the default position overlaps the plots, so move it up slightly with y
g.fig.suptitle("Happiness Score Regression", y =1.1)
# y_vars
The graph above shows the happiness score against the other features, separated by year.
- There is no significant difference between years in the Economy, Health, and Freedom categories.
- The Trust and Generosity regression lines shift in slope from year to year.
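One way to put numbers on a slope shift like the one noted above is to fit a separate line per year and compare the slopes. The following is a rough sketch on invented data with a hypothetical `Trust` column name; it does not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd

# Invented data: two years built with different underlying slopes
toy = pd.DataFrame({
    "Year": [2015] * 4 + [2016] * 4,
    "Trust": [0.1, 0.2, 0.3, 0.4] * 2,
    "Happiness Score": [4.0, 4.5, 5.0, 5.5, 4.0, 4.2, 4.4, 4.6],
})

# Fit a degree-1 polynomial per year; the first coefficient is the slope
slopes = {
    year: np.polyfit(group["Trust"], group["Happiness Score"], 1)[0]
    for year, group in toy.groupby("Year")
}

for year, slope in slopes.items():
    print(year, round(slope, 3))
```

A large difference between the per-year slopes would correspond to the visibly diverging regression lines in the plot.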
Summary¶
- The feature most strongly correlated with the happiness score is Economy, in terms of both the $R$ and $R^2$ values.
  - If a country's GDP is high, its standard of living should also be high, resulting in greater happiness.
- The feature with the second-highest correlation with the happiness score is Health, or life expectancy.
  - If life expectancy is high, people have fewer external factors to worry about, such as famine, war, or inadequate health care, resulting in greater happiness.