An analytics case study: understand which factors contribute most to employee turnover at a company and predict the likelihood that a given employee will leave.
"I quit..." This is the last thing anybody wants to hear from their employees. In a sense, it’s the employees who make the company. It’s the employees who do the work. It’s the employees who shape the company’s culture.
A high rate of employee turnover can lead to huge monetary losses. Recognizing and understanding which factors are associated with turnover will allow companies and individuals to limit it from happening, and may even increase employee productivity and growth.
These predictive insights give managers the opportunity to take corrective steps to build and preserve their successful business.
- To understand what factors contributed most to employee turnover.
- To perform clustering to find any meaningful patterns among employee traits.
- To create a model that predicts the likelihood that a given employee will leave the company.
- To create or improve retention strategies aimed at targeted employees.
Implementing this model will allow management to make better-informed decisions.
One of the most common problems at work is turnover.
According to the Center for American Progress, replacing a worker earning about $50,000 costs the company about $10,000, or 20% of that worker's yearly income.
Replacing a high-level employee can cost a multiple of that.
Costs include:
- Cost of off-boarding
- Cost of hiring (advertising, interviewing, hiring)
- Cost of onboarding a new person (training, management time)
- Lost productivity (a new person may take 1-2 years to reach the productivity of an existing person)
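As a quick back-of-the-envelope check, the figures above translate directly into code; the headcount and annual turnover rate below are illustrative assumptions, not company data.
# Illustrative replacement-cost arithmetic (figures from the text above)
salary = 50000
replacement_rate = 0.20   # ~20% of yearly income, per the Center for American Progress
cost_per_exit = salary * replacement_rate
# Hypothetical company of 1,000 employees losing ~24% a year (roughly this dataset's rate)
employees = 1000
annual_turnover = 0.24
print("Cost per exit: $%.0f" % cost_per_exit)
print("Estimated annual replacement cost: $%.0f" % (cost_per_exit * employees * annual_turnover))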
# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('HR_comma_sep.csv.txt')
# Examine the dataset
df.head()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
# Check to see if there are any missing values in our dataset
df.isnull().any()
satisfaction_level False
last_evaluation False
number_project False
average_montly_hours False
time_spend_company False
Work_accident False
left False
promotion_last_5years False
sales False
salary False
dtype: bool
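Nothing is missing here, but if any column had returned True, a minimal sketch of the usual remedies (assuming the same df) would look like this:
# Not needed for this dataset -- shown only as a sketch for handling missing values
df_dropped = df.dropna()                              # drop rows with any missing value
df_filled = df.fillna(df.median(numeric_only=True))   # or fill numeric gaps with the column median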
# Rename Columns
# Renaming certain columns for better readability
df = df.rename(columns={'satisfaction_level': 'satisfaction',
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales': 'department',
                        'left': 'turnover'
                        })
df.head(3)
| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
# Check the type of our features. Are there any data inconsistencies?
df.dtypes
satisfaction float64
evaluation float64
projectCount int64
averageMonthlyHours int64
yearsAtCompany int64
workAccident int64
turnover int64
promotion int64
department object
salary object
dtype: object
# How many employees are in the dataset?
df.shape
(14999, 10)
# Calculate the turnover rate of our company's dataset. What's the rate of turnover?
turnover_rate = df.turnover.value_counts() / 14999
turnover_rate
0 0.761917
1 0.238083
Name: turnover, dtype: float64
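Hard-coding the row count works, but value_counts can normalize directly, which is equivalent and less brittle:
# Equivalent calculation without hard-coding the number of rows
df.turnover.value_counts(normalize=True)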
# Display the statistical overview of the employees
df.describe()
| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
# Display the mean summary of employees (turnover vs. non-turnover). What do you notice between the groups?
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean()
| turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion |
|---|---|---|---|---|---|---|---|
| 0 | 0.666810 | 0.715473 | 3.786664 | 199.060203 | 3.380032 | 0.175009 | 0.026251 |
| 1 | 0.440098 | 0.718113 | 3.855503 | 207.419210 | 3.876505 | 0.047326 | 0.005321 |
# Create a correlation matrix. What features correlate the most with turnover? What other correlations did you find?
corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
| | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | turnover | promotion |
|---|---|---|---|---|---|---|---|---|
| satisfaction | 1.000000 | 0.105021 | -0.142970 | -0.020048 | -0.100866 | 0.058697 | -0.388375 | 0.025605 |
| evaluation | 0.105021 | 1.000000 | 0.349333 | 0.339742 | 0.131591 | -0.007104 | 0.006567 | -0.008684 |
| projectCount | -0.142970 | 0.349333 | 1.000000 | 0.417211 | 0.196786 | -0.004741 | 0.023787 | -0.006064 |
| averageMonthlyHours | -0.020048 | 0.339742 | 0.417211 | 1.000000 | 0.127755 | -0.010143 | 0.071287 | -0.003544 |
| yearsAtCompany | -0.100866 | 0.131591 | 0.196786 | 0.127755 | 1.000000 | 0.002120 | 0.144822 | 0.067433 |
| workAccident | 0.058697 | -0.007104 | -0.004741 | -0.010143 | 0.002120 | 1.000000 | -0.154622 | 0.039245 |
| turnover | -0.388375 | 0.006567 | 0.023787 | 0.071287 | 0.144822 | -0.154622 | 1.000000 | -0.061788 |
| promotion | 0.025605 | -0.008684 | -0.006064 | -0.003544 | 0.067433 | 0.039245 | -0.061788 | 1.000000 |
# Plot the distributions of Employee Satisfaction, Evaluation, and Average Monthly Hours. What story can you tell?
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))
# Graph Employee Satisfaction
sns.distplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')
# Graph Employee Evaluation
sns.distplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')
# Graph Employee Average Monthly Hours
sns.distplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')
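The distributions above pool everyone together; splitting satisfaction by the turnover label makes the story clearer. A minimal sketch using the same df:
# Overlay satisfaction densities for employees who stayed vs. left
plt.figure(figsize=(10, 5))
sns.kdeplot(df.loc[df['turnover'] == 0, 'satisfaction'], label='Stayed', shade=True)
sns.kdeplot(df.loc[df['turnover'] == 1, 'satisfaction'], label='Left', shade=True)
plt.title('Satisfaction Distribution by Turnover')
plt.xlabel('Satisfaction Level')
plt.legend()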
Apply get_dummies() to the categorical variables. Separate the categorical and numeric variables, then combine them. Passing drop_first=True drops one dummy level per variable to avoid perfectly collinear columns.
cat_var = ['department','salary','turnover','promotion']
num_var = ['satisfaction','evaluation','projectCount','averageMonthlyHours','yearsAtCompany', 'workAccident']
categorical_df = pd.get_dummies(df[cat_var], drop_first=True)
numerical_df = df[num_var]
new_df = pd.concat([categorical_df,numerical_df], axis=1)
new_df.head()
| | turnover | promotion | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | salary_low | salary_medium | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.11 | 0.88 | 7 | 272 | 4 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.72 | 0.87 | 5 | 223 | 5 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.37 | 0.52 | 2 | 159 | 3 | 0 |
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
# Create the X and y set
X = new_df.iloc[:,1:]
y = new_df.iloc[:,0]
# Define train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
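Because stratify=y is passed, both splits should preserve the ~24% turnover rate; a quick sanity check:
# Verify that stratification preserved the class balance in both splits
print("Train turnover rate: %.3f" % y_train.mean())
print("Test turnover rate: %.3f" % y_test.mean())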
%%time
# Check the accuracy of the Logistic Regression model
from sklearn.linear_model import LogisticRegression
# Define the Logistic Regression Model
lr = LogisticRegression(class_weight='balanced')
# Fit the Logistic Regression Model to the train set
lr.fit(X_train, y_train)
print ("Logistic accuracy is %2.2f" % accuracy_score(y_test, lr.predict(X_test)))
Logistic accuracy is 0.77
Wall time: 110 ms
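Keep in mind that ~76% of employees stayed, so a model that always predicts "stayed" scores about 0.76 accuracy on its own. A sketch of that majority-class baseline, assuming the same split:
# Majority-class baseline -- accuracy alone is a weak metric on imbalanced data
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print("Baseline accuracy is %2.2f" % accuracy_score(y_test, dummy.predict(X_test)))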
%%time
from sklearn import model_selection
# Define the 10-Fold Cross Validation
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
# Define the Logistic Regression Model
lrCV = LogisticRegression()
# Define the evaluation metric
scoring = 'roc_auc'
# Train the Logistic Regression Model on the 10-Fold Cross Validation
lr_results = model_selection.cross_val_score(lrCV, X_train, y_train, cv=kfold, scoring=scoring)
Wall time: 628 ms
# Print out the 10 scores from cross-validation. Notice the spread of scores compared to a single train/test evaluation
lr_results
array([0.79845385, 0.8371952 , 0.82284329, 0.8179427 , 0.80693377,
0.83157279, 0.82354362, 0.82073686, 0.80722612, 0.83976854])
Let's use AUC as a general baseline to compare our models' performance. After comparing, we can select the best one and look at its precision and recall.
# Print out the mean and standard deviation of the training score
lr_auc = lr_results.mean()
print("The Logistic Regression AUC: %.3f and the STD is (%.3f)" % (lr_auc, lr_results.std()))
The Logistic Regression AUC: 0.821 and the STD is (0.013)
from sklearn.metrics import roc_auc_score
print ("\n\n ---Logistic Regression Model---")
lr_auc = roc_auc_score(y_test, lr.predict(X_test))
print ("Logistic Regression AUC = %2.2f" % lr_auc)
print(classification_report(y_test, lr.predict(X_test)))
---Logistic Regression Model---
Logistic Regression AUC = 0.78
precision recall f1-score support
0 0.92 0.76 0.83 1714
1 0.50 0.80 0.62 536
avg / total 0.82 0.77 0.78 2250
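The precision_recall_curve imported earlier can visualize the trade-off behind these numbers; a sketch, assuming we mainly care about recall on leavers (class 1):
# Show how precision and recall on the turnover class move with the decision threshold
probs = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
plt.figure()
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Decision Threshold')
plt.ylabel('Score')
plt.title('Logistic Regression Precision-Recall Trade-off')
plt.legend(loc='best')
plt.show()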
Notice how the random forest classifier takes longer to run on the dataset. That is one downside of the algorithm: it is computationally expensive. But it performs better than simpler models like logistic regression.
%%time
from sklearn.ensemble import RandomForestClassifier
# Random Forest Model
rf = RandomForestClassifier(
class_weight="balanced"
)
# Fit the RF Model
rf = rf.fit(X_train, y_train)
Wall time: 321 ms
%%time
rf_results = model_selection.cross_val_score(rf, X_train, y_train, cv=kfold, scoring=scoring)
rf_results
Wall time: 1.6 s
# Print out the mean and standard deviation of the training score
rf_auc = rf_results.mean()
print("The Random Forest AUC: %.3f and the STD is (%.3f)" % (rf_auc, rf_results.std()))
The Random Forest AUC: 0.988 and the STD is (0.004)
from sklearn.metrics import roc_auc_score
print ("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print ("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
---Random Forest Model---
Random Forest AUC = 0.99
precision recall f1-score support
0 0.99 1.00 0.99 1714
1 0.99 0.98 0.98 536
avg / total 0.99 0.99 0.99 2250
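These results come from the forest's default hyperparameters; a small GridSearchCV sketch could check whether tuning helps (the grid values below are illustrative assumptions, not tuned choices):
# Illustrative hyperparameter search -- grid values are assumptions
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(class_weight='balanced'),
                    param_grid, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV AUC: %.3f" % grid.best_score_)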
%%time
from sklearn.svm import SVC
svclassifier = SVC(kernel='rbf', probability=True)
svc = svclassifier.fit(X_train,y_train)
Wall time: 26.7 s
%%time
svc_result = model_selection.cross_val_score(svc, X_train, y_train, cv=kfold, scoring=scoring)
svc_result
Wall time: 46.2 s
# Print out the mean and standard deviation of the training score
svc_auc = svc_result.mean()
print("The Supper Vector Classifier AUC: %.3f and the STD is (%.3f)" % (rf_auc, rf_results.std()))
The Support Vector Classifier AUC: 0.988 and the STD is (0.004)
from sklearn.metrics import roc_auc_score
print ("\n\n ---Support Vector Model---")
svc_roc_auc = roc_auc_score(y_test, svc.predict(X_test))
print ("Support Vector Classifier AUC = %2.2f" % svc_roc_auc)
print(classification_report(y_test, svc.predict(X_test)))
---Support Vector Model---
Support Vector Classifier AUC = 0.99
precision recall f1-score support
0 0.99 1.00 0.99 1714
1 0.99 0.98 0.98 536
avg / total 0.99 0.99 0.99 2250
# Create ROC Graph
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
svc_fpr, svc_tpr, svc_thresholds = roc_curve(y_test, svc.predict_proba(X_test)[:,1])
plt.figure()
# Plot Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lr_auc)
# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_auc)
# Plot Support Vector Classifier ROC
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier (area = %0.2f)' % svc_auc)
# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
# Get Feature Importances
feature_importances = pd.DataFrame(rf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances = feature_importances.reset_index()
feature_importances
| | index | importance |
|---|---|---|
| 0 | satisfaction | 0.279718 |
| 1 | yearsAtCompany | 0.240698 |
| 2 | averageMonthlyHours | 0.178100 |
| 3 | evaluation | 0.129985 |
| 4 | projectCount | 0.119583 |
| 5 | workAccident | 0.013300 |
| 6 | salary_low | 0.011167 |
| 7 | department_technical | 0.005552 |
| 8 | department_sales | 0.004075 |
| 9 | salary_medium | 0.003387 |
| 10 | department_support | 0.003291 |
| 11 | promotion | 0.002225 |
| 12 | department_hr | 0.002103 |
| 13 | department_management | 0.001688 |
| 14 | department_accounting | 0.001502 |
| 15 | department_RandD | 0.001363 |
| 16 | department_marketing | 0.001333 |
| 17 | department_product_mng | 0.000928 |
sns.set(style="whitegrid")
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(13, 7))
# Plot the feature importances
sns.set_color_codes("pastel")
sns.barplot(x="importance", y='index', data=feature_importances,
label="Total", color="b")
Since this model is being used to make decisions about people, we should refrain from relying solely on its output. Instead, we can use its probability output and design our own system to treat each employee accordingly.
- Safe Zone (Green) – Employees within this zone are considered safe.
- Low Risk Zone (Yellow) – Employees within this zone should be monitored for potential turnover. This is more of a long-term track.
- Medium Risk Zone (Orange) – Employees within this zone are at risk of turnover. Action should be taken and monitored accordingly.
- High Risk Zone (Red) – Employees within this zone are considered to have the highest chance of turnover. Action should be taken immediately.
rf.predict_proba(X_test)[175:200,]
array([[1. , 0. ],
[0. , 1. ],
[1. , 0. ],
[0. , 1. ],
[0.8, 0.2],
[0. , 1. ],
[1. , 0. ],
[0. , 1. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[0.9, 0.1],
[1. , 0. ],
[0.4, 0.6],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[1. , 0. ],
[0. , 1. ]])
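A sketch of the zoning system applied to these probabilities; the cut-off values below are illustrative assumptions, not calibrated thresholds:
# Bin each employee's turnover probability into a risk zone
# (threshold values are assumptions chosen for illustration)
turnover_prob = rf.predict_proba(X_test)[:, 1]
zones = pd.cut(turnover_prob,
               bins=[0, 0.2, 0.4, 0.6, 1.0],
               labels=['Safe (Green)', 'Low Risk (Yellow)',
                       'Medium Risk (Orange)', 'High Risk (Red)'],
               include_lowest=True)
print(pd.Series(zones).value_counts())
HR could then work through the list starting from the red zone, where immediate action matters most.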