A case study in prescriptive analytics: understanding why employees leave a company and applying machine learning models to predict who will leave next

An analytics case study to understand which factors contribute most to employee turnover in a company and to predict the likelihood that an employee will leave. The complete guide can be found here


"I quit..." This is the last thing anybody wants to hear from their employees. In a sense, it’s the employees who make the company. It’s the employees who do the work. It’s the employees who shape the company’s culture.

A high rate of employee turnover can cause a company large monetary losses. Recognizing and understanding the factors associated with employee turnover allows companies and individuals to limit it, and may even increase employee productivity and growth.

These predictive insights give managers the opportunity to take corrective steps to build and preserve their successful business.

HR Analytics


  • To understand what factors contributed most to employee turnover.

  • To perform clustering to find any meaningful patterns of employee traits.

  • To create a model that predicts the likelihood that a given employee will leave the company.

  • To create or improve different retention strategies on targeted employees.

Implementing this model will help management make better-informed decisions.

The Problem:

One of the most common problems at work is turnover.

Replacing a worker earning about 50,000 dollars costs the company about 10,000 dollars, or 20% of that worker’s yearly income, according to the Center for American Progress.

Replacing a high-level employee can cost a multiple of that.

Costs include:

  • Cost of off-boarding
  • Cost of hiring (advertising, interviewing, hiring)
  • Cost of onboarding a new person (training, management time)
  • Lost productivity (a new person may take 1-2 years to reach the productivity of an existing person)
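The 20% rule of thumb quoted above can be turned into a quick back-of-the-envelope estimate. The sketch below is only illustrative: the function names `replacement_cost` and `yearly_turnover_cost`, the default ratio, and the example company are assumptions, not figures from the dataset.

```python
def replacement_cost(annual_salary, cost_ratio=0.20):
    """Estimate the cost of replacing one employee as a share of salary.

    cost_ratio=0.20 mirrors the roughly 20% figure quoted above; it is an
    assumption and will differ by role and seniority.
    """
    return annual_salary * cost_ratio


def yearly_turnover_cost(headcount, turnover_rate, avg_salary, cost_ratio=0.20):
    """Rough yearly cost: expected number of leavers times cost per leaver."""
    expected_leavers = headcount * turnover_rate
    return expected_leavers * replacement_cost(avg_salary, cost_ratio)


# One worker earning 50,000 dollars
print(replacement_cost(50_000))

# Hypothetical company: 1,000 employees, 10% turnover, 50,000 average salary
print(yearly_turnover_cost(1_000, 0.10, 50_000))
```

The estimate ignores lost productivity during ramp-up, so it is better read as a lower bound than as a full cost model.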

Import Packages

# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

Read the Data

df = pd.read_csv('HR_comma_sep.csv.txt')
# Examine the dataset
df.head()
   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  sales  salary
0                0.38             0.53               2                   157                   3              0     1                      0  sales     low
1                0.80             0.86               5                   262                   6              0     1                      0  sales  medium
2                0.11             0.88               7                   272                   4              0     1                      0  sales  medium
3                0.72             0.87               5                   223                   5              0     1                      0  sales     low
4                0.37             0.52               2                   159                   3              0     1                      0  sales     low

Data Quality Check

# Check whether there are any missing values in our data set
df.isnull().any()
satisfaction_level       False
last_evaluation          False
number_project           False
average_montly_hours     False
time_spend_company       False
Work_accident            False
left                     False
promotion_last_5years    False
sales                    False
salary                   False
dtype: bool
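Since the real dataset has no missing values, the check above only ever prints False. As a minimal sketch of what the same check looks like when something is missing, here is a toy frame with hypothetical values (the name `toy` and its contents are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing satisfaction score
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],
    "number_project": [2, 5, 7],
    "salary": ["low", "medium", "medium"],
})

# True for every column that contains at least one missing value
has_missing = toy.isnull().any()

# Number of missing values per column
missing_counts = toy.isnull().sum()

print(has_missing)
print(missing_counts)
```

Counting with `isnull().sum()` is often more useful than `any()` because it shows how much data would be lost by dropping incomplete rows.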
# Rename Columns
# Renaming certain columns for better readability
df = df.rename(columns={'satisfaction_level': 'satisfaction', 
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales' : 'department',
                        'left' : 'turnover'
                        })
df.head(3)
   satisfaction  evaluation  projectCount  averageMonthlyHours  yearsAtCompany  workAccident  turnover  promotion  department  salary
0          0.38        0.53             2                  157               3             0         1          0       sales     low
1          0.80        0.86             5                  262               6             0         1          0       sales  medium
2          0.11        0.88             7                  272               4             0         1          0       sales  medium
# Check the type of our features. Are there any data inconsistencies?
df.dtypes
satisfaction           float64
evaluation             float64
projectCount             int64
averageMonthlyHours      int64
yearsAtCompany           int64
workAccident             int64
turnover                 int64
promotion                int64
department              object
salary                  object
dtype: object

Exploratory Data Analysis

# How many employees are in the dataset?
df.shape
(14999, 10)
# Calculate the turnover rate of our company's dataset. What's the rate of turnover?
turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate
0    0.761917
1    0.238083
Name: turnover, dtype: float64
# Display the statistical overview of the employees
df.describe()
satisfaction evaluation projectCount averageMonthlyHours yearsAtCompany workAccident turnover promotion
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000
# Display the mean summary of Employees (Turnover V.S. Non-turnover). What do you notice between the groups?
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean(numeric_only=True)
satisfaction evaluation projectCount averageMonthlyHours yearsAtCompany workAccident promotion
0 0.666810 0.715473 3.786664 199.060203 3.380032 0.175009 0.026251
1 0.440098 0.718113 3.855503 207.419210 3.876505 0.047326 0.005321
# Create a correlation matrix. What features correlate the most with turnover? What other correlations did you find?
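# Restrict to numeric columns (department and salary are strings)
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr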
satisfaction evaluation projectCount averageMonthlyHours yearsAtCompany workAccident turnover promotion
satisfaction 1.000000 0.105021 -0.142970 -0.020048 -0.100866 0.058697 -0.388375 0.025605
evaluation 0.105021 1.000000 0.349333 0.339742 0.131591 -0.007104 0.006567 -0.008684
projectCount -0.142970 0.349333 1.000000 0.417211 0.196786 -0.004741 0.023787 -0.006064
averageMonthlyHours -0.020048 0.339742 0.417211 1.000000 0.127755 -0.010143 0.071287 -0.003544
yearsAtCompany -0.100866 0.131591 0.196786 0.127755 1.000000 0.002120 0.144822 0.067433
workAccident 0.058697 -0.007104 -0.004741 -0.010143 0.002120 1.000000 -0.154622 0.039245
turnover -0.388375 0.006567 0.023787 0.071287 0.144822 -0.154622 1.000000 -0.061788
promotion 0.025605 -0.008684 -0.006064 -0.003544 0.067433 0.039245 -0.061788 1.000000
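Each cell in the matrix above is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough sketch of what that number means, the coefficient can be computed by hand; the helper name `pearson` and the toy values below are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: lower satisfaction goes with turnover = 1
satisfaction = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
turnover = [0, 0, 0, 1, 1, 1]

r = pearson(satisfaction, turnover)
print(round(r, 3))
```

A negative value, as with satisfaction and turnover in the real matrix, means the two quantities tend to move in opposite directions. It says nothing about causation.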


# Plot the distribution of Employee Satisfaction, Evaluation, and Average Monthly Hours. What story can you tell?

# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# Graph Employee Satisfaction
sns.distplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

# Graph Employee Evaluation
sns.distplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

# Graph Employee Average Monthly Hours
sns.distplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')



Apply get_dummies() to the categorical variables. Separate the categorical and numeric variables, encode the categorical ones, then combine them.

cat_var = ['department','salary','turnover','promotion']
num_var = ['satisfaction','evaluation','projectCount','averageMonthlyHours','yearsAtCompany', 'workAccident']
categorical_df = pd.get_dummies(df[cat_var], drop_first=True)
numerical_df = df[num_var]

new_df = pd.concat([categorical_df,numerical_df], axis=1)
new_df.head()
turnover promotion department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium satisfaction evaluation projectCount averageMonthlyHours yearsAtCompany workAccident
0 1 0 0 0 0 0 0 0 1 0 0 1 0 0.38 0.53 2 157 3 0
1 1 0 0 0 0 0 0 0 1 0 0 0 1 0.80 0.86 5 262 6 0
2 1 0 0 0 0 0 0 0 1 0 0 0 1 0.11 0.88 7 272 4 0
3 1 0 0 0 0 0 0 0 1 0 0 1 0 0.72 0.87 5 223 5 0
4 1 0 0 0 0 0 0 0 1 0 0 1 0 0.37 0.52 2 159 3 0
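The output above has no salary_high column because drop_first=True drops the first category of each encoded variable. The following minimal sketch shows the effect on a toy frame; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["salary"])

# drop_first=True removes the first category in sorted order ("high"),
# which then acts as the implicit baseline
reduced = pd.get_dummies(toy, columns=["salary"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models such as logistic regression. A row with zeros in both remaining columns is a high-salary employee.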

Split Train/Test Set

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
# Create the X and y set
X = new_df.iloc[:,1:]
y = new_df.iloc[:,0]

# Define train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
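The stratify=y argument keeps the share of leavers roughly the same in the train and test sets, which matters when only about a quarter of the rows belong to the positive class. Here is a small sketch on synthetic labels; the arrays below are placeholders and not the HR data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 24 leavers out of 100, close to the 24% rate in the data
y = np.array([1] * 24 + [0] * 76)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

# With stratification the class share is preserved in both parts
print("overall:", y.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```

Without stratify, a small test set can end up with noticeably more or fewer leavers than the full data, which makes metrics such as recall harder to compare between runs.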

Train Logistic Regression Model


# Check accuracy of the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Define the Logistic Regression Model
lr = LogisticRegression(class_weight='balanced')

# Fit the Logistic Regression Model to the train set
lr.fit(X_train, y_train)
print ("Logistic accuracy is %2.2f" % accuracy_score(y_test, lr.predict(X_test)))
Logistic accuracy is 0.77
Wall time: 110 ms

Apply 10-Fold Cross Validation for Logistic Regression

from sklearn import model_selection

# Define the 10-Fold Cross Validation (no shuffling; random_state only applies when shuffle=True)
kfold = model_selection.KFold(n_splits=10)

# Define the Logistic Regression Model
lrCV = LogisticRegression()

# Define the evaluation metric 
scoring = 'roc_auc'

# Train the Logistic Regression Model on the 10-Fold Cross Validation
lr_results = model_selection.cross_val_score(lrCV, X_train, y_train, cv=kfold, scoring=scoring)
Wall time: 628 ms
# Print out the 10 scores from the training. Notice how the scores vary across folds, unlike the single number you get from one training run
lr_results
array([0.79845385, 0.8371952 , 0.82284329, 0.8179427 , 0.80693377,
       0.83157279, 0.82354362, 0.82073686, 0.80722612, 0.83976854])
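The ten scores come from ten different train/validation partitions of X_train. As a rough sketch of how unshuffled k-fold splitting partitions the row indices, the hypothetical helper `kfold_indices` below reproduces the idea in plain Python; it is not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row is used for validation exactly once, so the mean of the fold scores is a steadier estimate than a single split, and their standard deviation shows how sensitive the model is to the choice of training rows.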

Average Score

Let's use AUC as a general baseline to compare our models' performance. After comparing, we can then select the best one and look at its precision and recall.

# Print out the mean and standard deviation of the training score
lr_auc = lr_results.mean()
print("The Logistic Regression AUC: %.3f and the STD is (%.3f)" % (lr_auc, lr_results.std()))
The Logistic Regression AUC: 0.821 and the STD is (0.013)

Logistic Regression AUC (0.78)

from sklearn.metrics import roc_auc_score

print ("\n\n ---Logistic Regression Model---")
lr_auc = roc_auc_score(y_test, lr.predict(X_test))
print ("Logistic Regression AUC = %2.2f" % lr_auc)
print(classification_report(y_test, lr.predict(X_test)))
 ---Logistic Regression Model---
Logistic Regression AUC = 0.78
             precision    recall  f1-score   support

          0       0.92      0.76      0.83      1714
          1       0.50      0.80      0.62       536

avg / total       0.82      0.77      0.78      2250
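Note that the AUC printed above is computed from hard 0/1 predictions (lr.predict) rather than from probabilities. In that case the ROC curve has a single interior point and the area reduces to the average of the two recalls, which is consistent with the report: (0.76 + 0.80) / 2 = 0.78. The sketch below illustrates the difference on made-up scores; `auc_pairwise` is a hypothetical helper, and in scikit-learn the probability-based version would be roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]).

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of leaving
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
hard = [1 if p >= 0.5 else 0 for p in proba]

# Recall of each class under the 0.5 threshold
tpr = sum(h for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(1 - h for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)

auc_from_proba = auc_pairwise(y_true, proba)
auc_from_labels = auc_pairwise(y_true, hard)

print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
print("Mean of the two recalls:", (tpr + tnr) / 2)
```

The label-based figure depends on the 0.5 threshold, while the probability-based figure summarises ranking quality across all thresholds, so the two should not be compared directly.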

Train Random Forest Classifier Model

Notice how the random forest classifier takes a while to run on the dataset. That is one downside of the algorithm: it requires a lot of computation. But it performs better than simpler models like Logistic Regression.


from sklearn.ensemble import RandomForestClassifier

# Random Forest Model (default hyperparameters assumed; any original arguments are not shown)
rf = RandomForestClassifier()

# Fit the RF Model
rf = rf.fit(X_train, y_train)
Wall time: 321 ms

Apply 10-Fold Cross Validation for Random Forest

rf_results = model_selection.cross_val_score(rf, X_train, y_train, cv=kfold, scoring=scoring)
Wall time: 1.6 s

Average Score

# Print out the mean and standard deviation of the training score
rf_auc = rf_results.mean()
print("The Random Forest AUC: %.3f and the STD is (%.3f)" % (rf_auc, rf_results.std()))
The Random Forest AUC: 0.988 and the STD is (0.004)

Random Forest AUC (0.99)

from sklearn.metrics import roc_auc_score

print ("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print ("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
 ---Random Forest Model---
Random Forest AUC = 0.99
             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1714
          1       0.99      0.98      0.98       536

avg / total       0.99      0.99      0.99      2250
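The precision, recall and f1-score columns in these reports all derive from three counts for a given class: true positives, false positives and false negatives. Here is a minimal sketch of the formulas; the counts are hypothetical and are not taken from either model above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)   # of those flagged as leavers, how many left
    recall = tp / (tp + fn)      # of those who left, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for the "left" class (1)
tp, fp, fn = 80, 20, 40
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print("precision: %.2f  recall: %.2f  f1: %.2f" % (precision, recall, f1))
```

For a retention use case, recall on class 1 is usually the number to watch first: a missed leaver (false negative) tends to cost more than an unnecessary check-in with someone who was going to stay.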

Support Vector Classifier


from sklearn.svm import SVC 

svclassifier = SVC(kernel='rbf', probability=True)  

svc = svclassifier.fit(X_train, y_train)
Wall time: 26.7 s

svc_result = model_selection.cross_val_score(svc, X_train, y_train, cv=kfold, scoring=scoring)
Wall time: 46.2 s
# Print out the mean and standard deviation of the training score
svc_auc = svc_result.mean()
print("The Support Vector Classifier AUC: %.3f and the STD is (%.3f)" % (svc_auc, svc_result.std()))
from sklearn.metrics import roc_auc_score

print ("\n\n ---Support Vector Model---")
svc_roc_auc = roc_auc_score(y_test, svc.predict(X_test))
print ("Support Vector Classifier AUC = %2.2f" % svc_roc_auc)
print(classification_report(y_test, svc.predict(X_test)))

ROC Graph

# Create ROC Graph
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
svc_fpr, svc_tpr, svc_thresholds = roc_curve(y_test, svc.predict_proba(X_test)[:,1])


# Plot Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lr_auc)

# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_auc)

# Plot Support Vector Classifier ROC
plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier (area = %0.2f)' % svc_auc)

# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
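Each curve in the ROC graph is traced by sweeping a decision threshold over the predicted probabilities and recording the false positive rate and true positive rate at each step. The sketch below shows that construction on a handful of made-up scores; `roc_points` and `trapezoid_area` are hypothetical helpers, not the scikit-learn functions used above.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by thresholding at each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nobody is flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities of leaving
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = trapezoid_area(points)

for fpr_i, tpr_i in points:
    print("FPR = %.2f, TPR = %.2f" % (fpr_i, tpr_i))
print("area = %.3f" % area)
```

The diagonal base-rate line corresponds to a model that ranks employees at random, so the further a curve bows toward the top-left corner, the better the model separates leavers from stayers.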


Random Forest Feature Importances

# Get Feature Importances
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances = feature_importances.reset_index()
feature_importances
index importance
0 satisfaction 0.279718
1 yearsAtCompany 0.240698
2 averageMonthlyHours 0.178100
3 evaluation 0.129985
4 projectCount 0.119583
5 workAccident 0.013300
6 salary_low 0.011167
7 department_technical 0.005552
8 department_sales 0.004075
9 salary_medium 0.003387
10 department_support 0.003291
11 promotion 0.002225
12 department_hr 0.002103
13 department_management 0.001688
14 department_accounting 0.001502
15 department_RandD 0.001363
16 department_marketing 0.001333
17 department_product_mng 0.000928

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(13, 7))

# Plot the feature importances
sns.barplot(x="importance", y='index', data=feature_importances,
            label="Total", color="b")


Retention Plan

Since this model is being used for people, we should refrain from solely relying on the output of our model. Instead, we can use its probability output and design our own system to treat each employee accordingly.

  1. Safe Zone (Green) – Employees within this zone are considered safe.
  2. Low Risk Zone (Yellow) – Employees within this zone should be kept in mind as potential turnover. This is more of a long-term track.
  3. Medium Risk Zone (Orange) – Employees within this zone are at risk of turnover. Action should be taken and monitored accordingly.
  4. High Risk Zone (Red) – Employees within this zone are considered to have the highest chance of turnover. Action should be taken immediately.
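The four zones can be sketched as a simple mapping from the predicted probability of leaving to a label. The cut-offs below (0.20, 0.60, 0.90) and the function name `risk_zone` are hypothetical; the zone boundaries are a business decision and are not defined in this case study.

```python
def risk_zone(p_leave, thresholds=(0.20, 0.60, 0.90)):
    """Map a predicted probability of leaving to one of the four zones.

    thresholds holds assumed cut-offs for low, medium and high risk;
    they should be tuned with HR rather than taken as given.
    """
    low, medium, high = thresholds
    if p_leave < low:
        return "Safe Zone (Green)"
    if p_leave < medium:
        return "Low Risk Zone (Yellow)"
    if p_leave < high:
        return "Medium Risk Zone (Orange)"
    return "High Risk Zone (Red)"


# Second column of a predict_proba output is the probability of class 1 (leaving)
for p in [0.0, 0.2, 0.6, 1.0]:
    print(p, "->", risk_zone(p))
```

Working with zones instead of a single yes/no prediction lets managers scale the response to the level of risk rather than acting on a hard label.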


# Example of the probability output, one row per employee: [P(stays), P(leaves)]
array([[1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0.8, 0.2],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [0.9, 0.1],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ]])

