This repository contains a classification analysis project aimed at predicting employee attrition, i.e., whether an employee will leave the organization or stay. The goal is to build a classification model that predicts attrition from employee characteristics such as demographics, job satisfaction, and performance metrics.
The dataset used in this project is related to employee data and contains features such as:
- Age
- Job role
- Salary
- Years at the company
- Satisfaction levels
- Distance from home, etc.
The target variable is Attrition, which indicates whether the employee left the organization (Yes) or stayed (No).
- Age: Age of the employee.
- Attrition: Target variable indicating employee attrition.
- BusinessTravel: Frequency of business travel.
- DailyRate: Daily rate of the employee.
- DistanceFromHome: Distance of the employee’s residence from the office.
- Education: Level of education.
- EmployeeCount: Number of employees in the company.
- JobSatisfaction: Job satisfaction level of the employee.
- YearsAtCompany: Number of years the employee has worked at the company.
- Handle class imbalance using class weights.
- Normalize numerical features.
- Split the dataset into training and testing sets.
- Visualize feature relationships with the target variable (a minimal sketch of these exploratory steps follows this list).
- Identify key features influencing employee attrition.
- Detect and address outliers.
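A minimal sketch of these exploratory steps, using columns from the data dictionary above and the CSV file name used in the usage steps below (the specific plot and the IQR threshold are illustrative choices, not the project's exact analysis):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('employee_attrition.csv')

# Visualize a feature's relationship with the target
sns.boxplot(data=df, x='Attrition', y='YearsAtCompany')
plt.title('YearsAtCompany by Attrition')
plt.show()

# Rough indicator of influential features: correlation of numeric columns
# with the encoded target
numeric_df = df.select_dtypes(include='number').copy()
numeric_df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
print(numeric_df.corr()['Attrition'].sort_values(ascending=False))

# Flag potential outliers in a numeric column with the 1.5*IQR rule
q1, q3 = df['DistanceFromHome'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['DistanceFromHome'] < q1 - 1.5 * iqr) | (df['DistanceFromHome'] > q3 + 1.5 * iqr)
print(f'{mask.sum()} potential outliers in DistanceFromHome')
```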
- Support Vector Machine (SVM)
- Logistic Regression
- Decision Trees
- Random Forest
- Accuracy
- Precision
- Recall
- F1 score
- ROC-AUC
To run this project locally, follow these steps:
- Clone the repository:

```bash
git clone <repository_url>
```

- Navigate to the project directory:

```bash
cd Employee_Attrition
```

- Install the required libraries:

```bash
pip install pandas numpy seaborn matplotlib statsmodels scikit-learn
```

This project requires:
- Python 3.x
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- Import the dataset:

```python
import pandas as pd

df = pd.read_csv('employee_attrition.csv')
```
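Optionally, take a quick look at the data and the class balance (an illustrative check, not a required step):

```python
print(df.shape)
print(df.head())
print(df['Attrition'].value_counts())
```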
- Perform preprocessing and feature encoding:

```python
# Change target column 'Attrition' to numerical values
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
```
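The remaining categorical columns (e.g. BusinessTravel, job role) also need numeric encoding before modelling. A minimal sketch using one-hot encoding, assuming every remaining object-typed column is categorical:

```python
# One-hot encode every remaining object-typed (categorical) column
categorical_cols = df.select_dtypes(include='object').columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```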
- Split the data into training and testing sets:

```python
from sklearn.model_selection import train_test_split

X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
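Because attrition is an imbalanced target, you may also want to pass `stratify=y` so both splits keep the same class ratio; the `test_size` below is just an illustrative choice:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```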
- Train and evaluate models (e.g., SVM):

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# SVM model with balanced class weights
svm_model = SVC(kernel='linear', probability=True, class_weight='balanced')
svm_model.fit(X_train, y_train)

# Make predictions for evaluation
y_pred = svm_model.predict(X_test)
```
- Generate evaluation metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

print(f'Accuracy:  {accuracy_score(y_test, y_pred):.3f}')
print(f'Precision: {precision_score(y_test, y_pred):.3f}')
print(f'Recall:    {recall_score(y_test, y_pred):.3f}')
print(f'F1 Score:  {f1_score(y_test, y_pred):.3f}')
print(f'ROC-AUC:   {roc_auc_score(y_test, svm_model.predict_proba(X_test)[:, 1]):.3f}')
```
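The other classifiers listed above can be trained and scored the same way; a minimal sketch with default-style hyperparameters (not the project's tuned settings), reusing the scaled `X_train`/`X_test` from the SVM step:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

# class_weight='balanced' mirrors the SVM setup above
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'Decision Tree': DecisionTreeClassifier(class_weight='balanced', random_state=42),
    'Random Forest': RandomForestClassifier(class_weight='balanced', random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    print(f'{name}: F1 = {f1_score(y_test, y_pred):.3f}, '
          f'ROC-AUC = {roc_auc_score(y_test, y_proba):.3f}')
```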