The project aims to predict water potability based on water quality attributes using machine learning techniques.
Data Loading and Exploration:
- Loaded the dataset from Water Quality and Potability on Kaggle and explored its features.
- Checked for missing values and explored the distribution of the target variable.
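The exploration steps above can be sketched roughly as follows. The column names (`ph`, `Sulfate`, `Potability`) and the file name in the comment are assumptions rather than details confirmed by this README, and a small synthetic frame stands in for the Kaggle CSV so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV. With the real file you would use
# something like: df = pd.read_csv("water_potability.csv")  # assumed file name
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, 200),
    "Sulfate": rng.normal(330.0, 40.0, 200),
    "Potability": rng.integers(0, 2, 200),
})
# Blank out 10% of the pH values so the missing-value check has something to report
df.loc[df.sample(frac=0.1, random_state=0).index, "ph"] = np.nan

# Missing values per column
missing_per_column = df.isna().sum()

# Share of each class in the target variable
target_share = df["Potability"].value_counts(normalize=True)

print(df.describe())
print(missing_per_column)
print(target_share)
```

Looking at `target_share` early helps decide whether class imbalance needs attention before modeling.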
Data Preprocessing:
- Handled missing values through imputation.
- Checked for outliers and decided to ignore them for the initial analysis.
- Split the data into training and testing sets.
- Scaled numerical features.
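A minimal sketch of these preprocessing steps, assuming scikit-learn, median imputation, and standard scaling (the README does not say which imputation strategy or scaler was used). Unlike the order listed above, this sketch splits the data first, so that the imputation and scaling statistics are learned from the training set only. The data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine water-quality features (assumption)
X, y = make_classification(n_samples=500, n_features=9, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# Split first so the test set does not influence imputation or scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

imputer = SimpleImputer(strategy="median")  # median is an assumed choice
scaler = StandardScaler()

# Fit on the training data, then apply the same transforms to the test data
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prep = scaler.transform(imputer.transform(X_test))

print(X_train_prep.shape, X_test_prep.shape)
```

Tree-based models such as Random Forest do not strictly need scaled inputs, but scaling keeps the pipeline reusable if a distance-based or linear model is added later.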
Modeling:
- Trained individual models (Random Forest and Gradient Boosting).
- Explored feature importance for model interpretation.
- Created an ensemble model using the VotingClassifier.
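The modeling steps could look roughly like the sketch below. The hyperparameters, the soft-voting setting, and the synthetic data are assumptions for illustration, not the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=500, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Individual models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(random_state=42)
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)

# Impurity-based feature importances; each array sums to 1
for name, model in [("rf", rf), ("gb", gb)]:
    print(name, model.feature_importances_.round(3))

# Soft voting averages the predicted class probabilities (assumed setting)
ensemble = VotingClassifier(estimators=[("rf", rf), ("gb", gb)], voting="soft")
ensemble.fit(X_train, y_train)  # fits fresh clones of rf and gb internally
ensemble_pred = ensemble.predict(X_test)
```

Impurity-based importances can be biased toward high-cardinality features, so `sklearn.inspection.permutation_importance` is a common cross-check.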
Evaluation:
- Evaluated the models using accuracy, precision, recall, and F1-score.
- Compared the performance of individual models and the ensemble.
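As an illustration of the comparison, the sketch below computes the four metrics for two sets of predictions. The helper name `evaluate`, the model labels, and the toy labels are all hypothetical; the numbers are not results from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate(y_true, y_pred):
    """Return the four metrics, with class 1 (potable) as the positive class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy labels and predictions, made up for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
predictions = {
    "model_a": [0, 1, 1, 1, 0, 0, 1, 0],
    "model_b": [0, 0, 1, 1, 0, 0, 0, 0],
}

results = {name: evaluate(y_true, pred) for name, pred in predictions.items()}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

In this toy case both models reach the same accuracy, while `model_b` trades recall for precision, which is why it helps to report all four metrics side by side.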
Key Findings:
- pH level and sulfate concentration were identified as the most influential features.
- The ensemble model showed improved accuracy and precision for predicting potable water compared with the individual models.
Future Work:
- Tune hyperparameters for the individual models.
- Explore feature engineering techniques.
- Experiment with different ensemble strategies.
- Revisit class imbalance handling techniques.
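One possible starting point for the tuning and class-imbalance items is a small grid search, sketched below. The parameter grid, the F1 scoring choice, and the synthetic imbalanced data are assumptions rather than a plan stated by this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic stand-in (roughly 60/40), not the real data
X, y = make_classification(
    n_samples=400, n_features=9, weights=[0.6, 0.4], random_state=0
)

# Small illustrative grid; class_weight="balanced" reweights the minority class
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F1 on the positive class instead of plain accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search from favoring models that mostly predict the majority class.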
The dataset used in this project is sourced from Water Quality and Potability on Kaggle.
Feel free to contribute or reach out for further collaboration!