McCurr Health Consultancy is an MNC that has thousands of employees spread across the globe. The company believes in hiring the best talent available and retaining them for as long as possible. A huge amount of resources is spent on retaining existing employees through various initiatives. The Head of People Operations wants to bring down the cost of retaining employees. For this, he proposes limiting the incentives to only those employees who are at risk of attrition. As a recently hired Data Scientist in the People Operations Department, you have been asked to identify patterns in characteristics of employees who leave the organization. Also, you have to use this information to predict if an employee is at risk of attrition. This information will be used to target them with incentives.
The data contains information on employees' demographic details, work-related metrics, and attrition flag.
In the real world, you will not find definitions for some of your variables. Figuring out what they might mean is part of the analysis.
Kindly do not run the code cells containing Hyperparameter Tuning using GridSearchCV during the session, since they take considerable time to run.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report,recall_score,precision_score, accuracy_score
# For tuning the model
from sklearn.model_selection import GridSearchCV
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Loading the dataset
df = pd.read_excel('HR_Employee_Attrition_Dataset.xlsx')
# Looking at the first 5 records
df.head()
| | EmployeeNumber | Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Yes | 41 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 2 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 2 | No | 49 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 3 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 3 | Yes | 37 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 4 | No | 33 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 4 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 5 | No | 27 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 34 columns
# Let us see the info of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   EmployeeNumber            2940 non-null   int64
 1   Attrition                 2940 non-null   object
 2   Age                       2940 non-null   int64
 3   BusinessTravel            2940 non-null   object
 4   DailyRate                 2940 non-null   int64
 5   Department                2940 non-null   object
 6   DistanceFromHome          2940 non-null   int64
 7   Education                 2940 non-null   int64
 8   EducationField            2940 non-null   object
 9   EnvironmentSatisfaction   2940 non-null   int64
 10  Gender                    2940 non-null   object
 11  HourlyRate                2940 non-null   int64
 12  JobInvolvement            2940 non-null   int64
 13  JobLevel                  2940 non-null   int64
 14  JobRole                   2940 non-null   object
 15  JobSatisfaction           2940 non-null   int64
 16  MaritalStatus             2940 non-null   object
 17  MonthlyIncome             2940 non-null   int64
 18  MonthlyRate               2940 non-null   int64
 19  NumCompaniesWorked        2940 non-null   int64
 20  Over18                    2940 non-null   object
 21  OverTime                  2940 non-null   object
 22  PercentSalaryHike         2940 non-null   int64
 23  PerformanceRating         2940 non-null   int64
 24  RelationshipSatisfaction  2940 non-null   int64
 25  StandardHours             2940 non-null   int64
 26  StockOptionLevel          2940 non-null   int64
 27  TotalWorkingYears         2940 non-null   int64
 28  TrainingTimesLastYear     2940 non-null   int64
 29  WorkLifeBalance           2940 non-null   int64
 30  YearsAtCompany            2940 non-null   int64
 31  YearsInCurrentRole        2940 non-null   int64
 32  YearsSinceLastPromotion   2940 non-null   int64
 33  YearsWithCurrManager      2940 non-null   int64
dtypes: int64(25), object(9)
memory usage: 781.1+ KB
Observations:
Let's check the unique values in each column
# Checking the count of unique values in each column
df.nunique()
EmployeeNumber              2940
Attrition                      2
Age                           43
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
Over18                         1
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StandardHours                  1
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBalance                4
YearsAtCompany                37
YearsInCurrentRole            19
YearsSinceLastPromotion       16
YearsWithCurrManager          18
dtype: int64
Observations:
Let's drop the columns mentioned above and define lists for numerical and categorical columns to explore them separately.
# Dropping the columns
df = df.drop(['EmployeeNumber', 'Over18', 'StandardHours'] , axis = 1)
# Creating numerical columns
num_cols = ['DailyRate', 'Age', 'DistanceFromHome', 'MonthlyIncome', 'MonthlyRate', 'PercentSalaryHike', 'TotalWorkingYears',
'YearsAtCompany', 'NumCompaniesWorked', 'HourlyRate', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager', 'TrainingTimesLastYear']
# Creating categorical variables
cat_cols = ['Attrition', 'OverTime', 'BusinessTravel', 'Department', 'Education', 'EducationField', 'JobSatisfaction', 'EnvironmentSatisfaction',
'WorkLifeBalance', 'StockOptionLevel', 'Gender', 'PerformanceRating', 'JobInvolvement', 'JobLevel', 'JobRole', 'MaritalStatus', 'RelationshipSatisfaction']
We explored this data in detail earlier, in the machine learning course case study. Here, we will only do some basic univariate analysis and data preprocessing before moving on to model building.
# Checking summary statistics
df[num_cols].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| DailyRate | 2940.0 | 802.485714 | 403.440447 | 102.0 | 465.0 | 802.0 | 1157.0 | 1499.0 |
| Age | 2940.0 | 36.923810 | 9.133819 | 18.0 | 30.0 | 36.0 | 43.0 | 60.0 |
| DistanceFromHome | 2940.0 | 9.192517 | 8.105485 | 1.0 | 2.0 | 7.0 | 14.0 | 29.0 |
| MonthlyIncome | 2940.0 | 6502.931293 | 4707.155770 | 1009.0 | 2911.0 | 4919.0 | 8380.0 | 19999.0 |
| MonthlyRate | 2940.0 | 14313.103401 | 7116.575021 | 2094.0 | 8045.0 | 14235.5 | 20462.0 | 26999.0 |
| PercentSalaryHike | 2940.0 | 15.209524 | 3.659315 | 11.0 | 12.0 | 14.0 | 18.0 | 25.0 |
| TotalWorkingYears | 2940.0 | 11.279592 | 7.779458 | 0.0 | 6.0 | 10.0 | 15.0 | 40.0 |
| YearsAtCompany | 2940.0 | 7.008163 | 6.125483 | 0.0 | 3.0 | 5.0 | 9.0 | 40.0 |
| NumCompaniesWorked | 2940.0 | 2.693197 | 2.497584 | 0.0 | 1.0 | 2.0 | 4.0 | 9.0 |
| HourlyRate | 2940.0 | 65.891156 | 20.325969 | 30.0 | 48.0 | 66.0 | 84.0 | 100.0 |
| YearsInCurrentRole | 2940.0 | 4.229252 | 3.622521 | 0.0 | 2.0 | 3.0 | 7.0 | 18.0 |
| YearsSinceLastPromotion | 2940.0 | 2.187755 | 3.221882 | 0.0 | 0.0 | 1.0 | 3.0 | 15.0 |
| YearsWithCurrManager | 2940.0 | 4.123129 | 3.567529 | 0.0 | 2.0 | 3.0 | 7.0 | 17.0 |
| TrainingTimesLastYear | 2940.0 | 2.799320 | 1.289051 | 0.0 | 2.0 | 3.0 | 3.0 | 6.0 |
Observations:
# Creating histograms
df[num_cols].hist(figsize = (14, 14))
plt.show()
Observations:
# Printing the proportion of each sub-category within every categorical column
for i in cat_cols:
print(df[i].value_counts(normalize = True))
print('*' * 40)
No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64
****************************************
No     0.717007
Yes    0.282993
Name: OverTime, dtype: float64
****************************************
Travel_Rarely        0.709524
Travel_Frequently    0.188435
Non-Travel           0.102041
Name: BusinessTravel, dtype: float64
****************************************
Research & Development    0.653741
Sales                     0.303401
Human Resources           0.042857
Name: Department, dtype: float64
****************************************
3    0.389116
4    0.270748
2    0.191837
1    0.115646
5    0.032653
Name: Education, dtype: float64
****************************************
Life Sciences       0.412245
Medical             0.315646
Marketing           0.108163
Technical Degree    0.089796
Other               0.055782
Human Resources     0.018367
Name: EducationField, dtype: float64
****************************************
4    0.312245
3    0.300680
1    0.196599
2    0.190476
Name: JobSatisfaction, dtype: float64
****************************************
3    0.308163
4    0.303401
2    0.195238
1    0.193197
Name: EnvironmentSatisfaction, dtype: float64
****************************************
3    0.607483
2    0.234014
4    0.104082
1    0.054422
Name: WorkLifeBalance, dtype: float64
****************************************
0    0.429252
1    0.405442
2    0.107483
3    0.057823
Name: StockOptionLevel, dtype: float64
****************************************
Male      0.6
Female    0.4
Name: Gender, dtype: float64
****************************************
3    0.846259
4    0.153741
Name: PerformanceRating, dtype: float64
****************************************
3    0.590476
2    0.255102
4    0.097959
1    0.056463
Name: JobInvolvement, dtype: float64
****************************************
1    0.369388
2    0.363265
3    0.148299
4    0.072109
5    0.046939
Name: JobLevel, dtype: float64
****************************************
Sales Executive              0.221769
Research Scientist           0.198639
Laboratory Technician        0.176190
Manufacturing Director       0.098639
Healthcare Representative    0.089116
Manager                      0.069388
Sales Representative         0.056463
Research Director            0.054422
Human Resources              0.035374
Name: JobRole, dtype: float64
****************************************
Married     0.457823
Single      0.319728
Divorced    0.222449
Name: MaritalStatus, dtype: float64
****************************************
3    0.312245
4    0.293878
2    0.206122
1    0.187755
Name: RelationshipSatisfaction, dtype: float64
****************************************
Data Description:
Observations from EDA:
- EmployeeNumber is an identifier that is unique for each employee; we can drop this column as it will not add any value to our analysis.
- Over18 and StandardHours have only 1 unique value each. These columns will not add any value to our model, so we can drop them.
- The Age distribution is close to a normal distribution, with the majority of employees between the ages of 25 and 50.
- DistanceFromHome has a right-skewed distribution, meaning most employees live close to work but a few live much further away.
- MonthlyIncome and TotalWorkingYears are skewed to the right, indicating that the majority of workers are in entry/mid-level positions in the organization.
- The YearsAtCompany distribution shows a good proportion of workers with 10+ years, indicating a significant number of loyal employees at the organization.

Now that we have explored our data, let's build the model.
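As a quick, optional check of the skewness claims above, we can look at the skewness coefficients of the numerical columns (a small sketch; positive values indicate right-skewed distributions):

# Skewness of the numerical columns; large positive values confirm the right-skewed distributions noted above
df[num_cols].skew().sort_values(ascending = False)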
Creating dummy variables for the categorical variables
# Creating a list of columns for which we will create dummy variables
to_get_dummies_for = ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'MaritalStatus', 'JobRole']
# Creating dummy variables
df = pd.get_dummies(data = df, columns = to_get_dummies_for, drop_first = True)
# Mapping overtime and attrition
dict_OverTime = {'Yes': 1, 'No': 0}
dict_attrition = {'Yes': 1, 'No': 0}
df['OverTime'] = df.OverTime.map(dict_OverTime)
df['Attrition'] = df.Attrition.map(dict_attrition)
Separating the independent variables (X) and the dependent variable (Y)
# Separating the target variable and other variables
Y = df.Attrition
X = df.drop(['Attrition'], axis = 1)
Splitting the data into 70% train and 30% test set
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 1, stratify = Y)
The model can make two types of wrong predictions:

1. Predicting an employee will attrite when the employee actually stays (a false positive).
2. Predicting an employee will stay when the employee actually attrites (a false negative).

Which case is more important?

A false negative is costlier: an at-risk employee is missed, receives no incentive, and the company loses them, whereas a false positive only means spending an incentive on someone who would have stayed anyway.

How do we reduce this loss, i.e., reduce false negatives?

By maximizing recall for the attrition class (class 1): the higher the recall, the fewer at-risk employees the model misses. We will therefore use recall as the scoring metric when tuning the models and pass class weights to account for the class imbalance.
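To make the link between false negatives and recall concrete, here is a small illustrative sketch with made-up labels (not from our dataset): recall for class 1 is TP / (TP + FN), so every missed attrition case directly lowers it.

# Toy example: 4 employees actually attrite (label 1), and the model misses 2 of them
toy_actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(toy_actual, toy_predicted).ravel()
print('False negatives:', fn)                                          # 2 missed attrition cases
print('Recall for class 1:', recall_score(toy_actual, toy_predicted))  # 2 / (2 + 2) = 0.5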
Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
# Creating metric function
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize = (8, 5))
    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Attrite', 'Attrite'], yticklabels = ['Not Attrite', 'Attrite'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predicting using the independent variables
pred = model.predict(predictors)
recall = recall_score(target, pred,average = 'macro') # To compute recall
precision = precision_score(target, pred, average = 'macro') # To compute precision
acc = accuracy_score(target, pred) # To compute accuracy score
# Creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Precision": precision,
"Recall": recall,
"Accuracy": acc,
},
index = [0],
)
return df_perf
# Building decision tree model
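# Class weights are set roughly inversely proportional to the class frequencies (~16% attrition),
# so mistakes on the minority 'Yes' class are penalized more heavily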
dt = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# Fitting decision tree model
dt.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
Let's check the performance of the decision tree model.
# Checking performance on the training dataset
y_train_pred_dt = dt.predict(x_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 1.00 1.00 1.00 1726
1 1.00 1.00 1.00 332
accuracy 1.00 2058
macro avg 1.00 1.00 1.00 2058
weighted avg 1.00 1.00 1.00 2058
Observation:
The decision tree achieves perfect precision and recall on the training data, which suggests it is overfitting; let's check its performance on the test data.
# Checking performance on the test dataset
y_test_pred_dt = dt.predict(x_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.97 0.95 0.96 740
1 0.77 0.85 0.81 142
accuracy 0.93 882
macro avg 0.87 0.90 0.88 882
weighted avg 0.94 0.93 0.94 882
dtree_test = model_performance_classification(dt,x_test,y_test)
dtree_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.869464 | 0.898211 | 0.93424 |
Observations:
Let's plot the feature importance and check the most important features.
# Plot the feature importance
importances = dt.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(importance_df.Importance,importance_df.index)
<AxesSubplot:xlabel='Importance'>
Observations:
The tree assigns some importance to many features, including DailyRate, DistanceFromHome, JobInvolvement, and PercentSalaryHike.

Let's try to tune the model and check if we can improve the results.
Some important hyperparameters of the decision tree that can be tuned:

criterion {“gini”, “entropy”}
The function used to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
# Choose the type of classifier
dtree_estimator = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 7),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv = 10)
# Fitting the grid search on the train data
gridCV = gridCV.fit(x_train, y_train)
# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_
# Fit the best estimator to the data
dtree_estimator.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, max_depth=2,
min_samples_leaf=5, random_state=1)
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(x_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 0.92 0.72 0.81 1726
1 0.32 0.68 0.43 332
accuracy 0.71 2058
macro avg 0.62 0.70 0.62 2058
weighted avg 0.82 0.71 0.75 2058
Observation:
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(x_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.91 0.71 0.80 740
1 0.29 0.61 0.39 142
accuracy 0.70 882
macro avg 0.60 0.66 0.59 882
weighted avg 0.81 0.70 0.73 882
dtree_tuned_test = model_performance_classification(dtree_estimator,x_test,y_test)
dtree_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.597186 | 0.661743 | 0.695011 |
Observations:
Let's look at the feature importance of this model and try to analyze why this is happening.
importances = dtree_estimator.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(importance_df.Importance, importance_df.index)
<AxesSubplot:xlabel='Importance'>
Observations:
OverTime, TotalWorkingYears, and MonthlyIncome are the three most important features describing why an employee leaves the organization, which might imply that employees doing overtime feel their remuneration is not enough for their efforts.

Let's plot the tree and check whether this assumption about overtime and income holds.
A decision tree keeps growing until its nodes are homogeneous, i.e., contain only one class, and this dataset has a lot of features, so it would be hard to visualize the whole tree. Therefore, we visualize the tree only up to max_depth = 4.
features = list(X.columns)
plt.figure(figsize = (30, 20))
tree.plot_tree(dt, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)
plt.show()
Blue nodes represent attrition, i.e., y[1], and orange nodes represent non-attrition, i.e., y[0]. The purer a node (the more its samples belong to a single class), the darker its color.
Observations:
NumCompaniesWorked also seems to be an important variable in predicting whether an employee is likely to attrite.

Random Forest is a bagging algorithm in which the base models are decision trees. Bootstrap samples are drawn from the training data, a decision tree is built on each sample, and each tree makes its own prediction.
The results from all the decision trees are combined and the final prediction is made using voting (for classification problems) or averaging (for regression problems).
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
rf_estimator.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
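To see the voting idea in action, here is an optional sketch that polls each fitted tree in rf_estimator for a single test-set employee. Note that scikit-learn's predict() actually averages the trees' class probabilities (soft voting), so the hard-vote tally below is only an approximation of the ensemble's decision.

# Polling the individual trees of the fitted forest for one employee
one_employee = x_test.iloc[[0]].values                                            # a single test-set row as an array
tree_votes = [int(t.predict(one_employee)[0]) for t in rf_estimator.estimators_]  # each tree's prediction
print('Trees voting for attrition:', sum(tree_votes), 'out of', len(tree_votes))
print('Forest prediction:', rf_estimator.predict(x_test.iloc[[0]])[0])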
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(x_train)
metrics_score(y_train, y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 1726
1 1.00 1.00 1.00 332
accuracy 1.00 2058
macro avg 1.00 1.00 1.00 2058
weighted avg 1.00 1.00 1.00 2058
Observation:
Like the untuned decision tree, the Random Forest fits the training data perfectly, so we need to check the test data for overfitting.
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(x_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support
0 0.96 0.99 0.98 740
1 0.97 0.79 0.87 142
accuracy 0.96 882
macro avg 0.96 0.89 0.92 882
weighted avg 0.96 0.96 0.96 882
rf_estimator_test = model_performance_classification(rf_estimator,x_test,y_test)
rf_estimator_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.963176 | 0.891663 | 0.961451 |
Observations:
Let's check the feature importance of the Random Forest
importances = rf_estimator.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(importance_df.Importance, importance_df.index)
<AxesSubplot:xlabel='Importance'>
Observations:
MonthlyIncome, Age, and OverTime are the most important features.

Let's tune the Random Forest. Some important hyperparameters that can be tuned:

n_estimators: The number of trees in the forest.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
max_features{“auto”, “sqrt”, “log2”, 'None'}: The number of features to consider when looking for the best split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# Grid of parameters to choose from
params_rf = {
"n_estimators": [100, 250, 500],
"min_samples_leaf": np.arange(1, 4, 1),
"max_features": [0.7, 0.9, 'auto'],
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
rf_estimator_tuned.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, max_features=0.9,
min_samples_leaf=3, n_estimators=250, random_state=1)
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)
metrics_score(y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 1.00 1.00 1.00 1726
1 0.99 1.00 1.00 332
accuracy 1.00 2058
macro avg 1.00 1.00 1.00 2058
weighted avg 1.00 1.00 1.00 2058
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)
metrics_score(y_test, y_pred_test_rf_tuned)
precision recall f1-score support
0 0.97 0.98 0.97 740
1 0.88 0.83 0.86 142
accuracy 0.95 882
macro avg 0.92 0.90 0.91 882
weighted avg 0.95 0.95 0.95 882
rf_estimator_tuned_test = model_performance_classification(rf_estimator_tuned, x_test, y_test)
rf_estimator_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.924256 | 0.904682 | 0.954649 |
Observations:
# Plotting feature importance
importances = rf_estimator_tuned.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(importance_df.Importance, importance_df.index)
<AxesSubplot:xlabel='Importance'>
Observations:
OverTime, MonthlyIncome, Age, TotalWorkingYears, and DailyRate are the most important features.

# Installing the xgboost library using the 'pip' command
!pip install xgboost
Collecting xgboost
Downloading xgboost-1.6.1-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl (1.7 MB)
|████████████████████████████████| 1.7 MB 1.3 MB/s eta 0:00:01
Requirement already satisfied: scipy in /Users/rija/opt/anaconda3/lib/python3.9/site-packages (from xgboost) (1.7.3)
Requirement already satisfied: numpy in /Users/rija/opt/anaconda3/lib/python3.9/site-packages (from xgboost) (1.21.5)
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1
# Importing the AdaBoostClassifier and GradientBoostingClassifier [Boosting]
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# Importing the XGBClassifier from the xgboost library
from xgboost import XGBClassifier
# Adaboost Classifier
adaboost_model = AdaBoostClassifier(random_state = 1)
# Fitting the model
adaboost_model.fit(x_train, y_train)
# Model Performance on the test data
adaboost_model_perf_test = model_performance_classification(adaboost_model,x_test,y_test)
adaboost_model_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.814518 | 0.709136 | 0.884354 |
# Gradient Boost Classifier
gbc = GradientBoostingClassifier(random_state = 1)
# Fitting the model
gbc.fit(x_train, y_train)
# Model Performance on the test data
gbc_perf_test = model_performance_classification(gbc, x_test, y_test)
gbc_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.882845 | 0.728483 | 0.902494 |
# XGBoost Classifier
xgb = XGBClassifier(random_state = 1, eval_metric = 'logloss')
# Fitting the model
xgb.fit(x_train,y_train)
# Model Performance on the test data
xgb_perf_test = model_performance_classification(xgb,x_test,y_test)
xgb_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.959975 | 0.911439 | 0.965986 |
Hyperparameter tuning helps us find a model with well-chosen parameters, but it comes at a cost: as the size of the data or of the search grid grows, the computation time needed for tuning increases quickly.
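When the grid or the dataset gets large, a common alternative is RandomizedSearchCV, which evaluates only a fixed number of randomly sampled parameter combinations instead of the full grid. Below is a sketch that reuses the random forest grid and the scorer defined earlier; the n_iter budget is an illustrative assumption, and the fit call is left commented out like the other long-running tuning cells.

from sklearn.model_selection import RandomizedSearchCV

# Sample 10 random parameter combinations instead of searching the full grid
rf_random_search = RandomizedSearchCV(
    RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1),
    param_distributions = {
        "n_estimators": [100, 250, 500],
        "min_samples_leaf": np.arange(1, 4, 1),
        "max_features": [0.7, 0.9, 'auto'],
    },
    n_iter = 10,
    scoring = scorer,
    cv = 5,
    random_state = 1,
)
# rf_random_search.fit(x_train, y_train)  # uncomment to run (slow)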
1. Adaboost
Some important hyperparameters that can be tuned:
base_estimator object, default = None: The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is a DecisionTreeClassifier initialized with max_depth = 1.

n_estimators int, default = 50: The maximum number of estimators at which boosting is terminated. In the case of a perfect fit, the learning procedure is stopped early.

learning_rate float, default = 1.0: Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier.

For a better understanding of each parameter in the AdaBoost classifier, please refer to this source.
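A possible tuning sketch for AdaBoost over the hyperparameters above. The grid values are illustrative assumptions, not recommendations, and the fit is left commented out since it would be as slow as the other GridSearchCV cells.

# Illustrative AdaBoost grid search; parameter values are assumptions
abc_tuned = AdaBoostClassifier(random_state = 1)

params_abc = {
    "base_estimator": [DecisionTreeClassifier(max_depth = 1), DecisionTreeClassifier(max_depth = 2)],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}

grid_abc = GridSearchCV(abc_tuned, params_abc, scoring = scorer, cv = 5)
# grid_abc.fit(x_train, y_train)
# abc_tuned = grid_abc.best_estimator_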
2. Gradient Boosting Algorithm
Some important hyperparameters that can be tuned:
n_estimators: The number of boosting stages that will be performed.
max_depth: Limits the number of nodes in the tree. The best value depends on the interaction of the input variables.
min_samples_split: The minimum number of samples required to split an internal node.
learning_rate: How much the contribution of each tree will shrink.
loss: The loss function to optimize.
For a better understanding of each parameter in the Gradient Boosting classifier, please refer to this source.
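A similar illustrative sketch for the Gradient Boosting classifier (again, the grid values are assumptions and the fit is left commented out):

# Illustrative Gradient Boosting grid search; parameter values are assumptions
gbc_tuned = GradientBoostingClassifier(random_state = 1)

params_gbc = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],
    "learning_rate": [0.05, 0.1],
}

grid_gbc = GridSearchCV(gbc_tuned, params_gbc, scoring = scorer, cv = 5)
# grid_gbc.fit(x_train, y_train)
# gbc_tuned = grid_gbc.best_estimator_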
3. XGBoost Algorithm
Some important hyperparameters that can be tuned:
booster [default = gbtree ] Which booster to use. Can be gbtree, gblinear, or dart; gbtree and dart use tree-based models while gblinear uses linear functions.
min_child_weight [default = 1]
The minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process will give up further partitioning. The larger min_child_weight is, the more conservative the algorithm will be.
For a better understanding of each parameter in the XGBoost Classifier, please refer to this source.
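And an illustrative sketch for the XGBoost classifier. scale_pos_weight is an additional XGBoost parameter for class imbalance; the value 5 roughly reflects the ~5:1 ratio of non-attrition to attrition cases, and all grid values here are assumptions. The fit is left commented out.

# Illustrative XGBoost grid search; parameter values are assumptions
xgb_tuned = XGBClassifier(random_state = 1, eval_metric = 'logloss')

params_xgb = {
    "n_estimators": [100, 200],
    "min_child_weight": [1, 5, 10],
    "learning_rate": [0.05, 0.1],
    "scale_pos_weight": [1, 5],   # weight on the positive (attrition) class
}

grid_xgb = GridSearchCV(xgb_tuned, params_xgb, scoring = scorer, cv = 5)
# grid_xgb.fit(x_train, y_train)
# xgb_tuned = grid_xgb.best_estimator_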
models_test_comp_df = pd.concat(
[
dtree_test.T, dtree_tuned_test.T,rf_estimator_test.T,
rf_estimator_tuned_test.T, adaboost_model_perf_test.T,
gbc_perf_test.T, xgb_perf_test.T
],
axis = 1,
)
models_test_comp_df.columns = [
"Decision Tree classifier",
"Tuned Decision Tree classifier",
"Random Forest classifier",
"Tuned Random Forest classifier",
"Adaboost classifier",
"Gradientboost classifier",
"XGBoost classifier"
]
print("Test performance comparison:")
Test performance comparison:
models_test_comp_df
| | Decision Tree classifier | Tuned Decision Tree classifier | Random Forest classifier | Tuned Random Forest classifier | Adaboost classifier | Gradientboost classifier | XGBoost classifier |
|---|---|---|---|---|---|---|---|
| Precision | 0.869464 | 0.597186 | 0.963176 | 0.924256 | 0.814518 | 0.882845 | 0.959975 |
| Recall | 0.898211 | 0.661743 | 0.891663 | 0.904682 | 0.709136 | 0.728483 | 0.911439 |
| Accuracy | 0.934240 | 0.695011 | 0.961451 | 0.954649 | 0.884354 | 0.902494 | 0.965986 |
Observations:
The XGBoost classifier gives the highest recall and accuracy on the test set, with the default and tuned Random Forest models close behind (the default Random Forest has slightly higher precision), while the tuned decision tree performs the worst on all three metrics.