Predicting Water Potability with an End-to-End MLOps Workflow

Hi there! 👋 This project is a hands-on example of applying MLOps practices to a real-world problem: predicting whether water is safe to drink. We use MLflow to track model training and DVC to manage data and model versions, then wrap it all up in a user-friendly desktop app built with Tkinter.

What’s This About?

Challenge: Predict whether water is potable from its chemical properties.
Goal: Set up an efficient pipeline that tracks every experiment, keeps data organized, and lets us easily ship the final model in an app.

Step-by-Step Workflow

Initial Setup: We kick off the project with a Cookiecutter template and put it under version control with Git.
Experiment Logging: All training runs and metrics are saved via MLflow on DagsHub.
Data and Pipeline Management: DVC helps us keep track of datasets and the steps in our machine learning pipeline.
Model Management: The best model gets registered in MLflow and is ready to be used anywhere.
App Deployment: A simple desktop app is created with Tkinter, pulling the latest model directly from MLflow for instant predictions.

Project Organization at a Glance

Getting Started

Cookiecutter helps create the initial project layout.
GitHub is used to store and collaborate on code.

Keeping Track of Experiments

Using MLflow + DagsHub:
- Each run is logged with its settings and results.
- We can compare models and see which works best.
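
As a rough sketch, logging one run and picking the best of several might look like the code below. It assumes the `mlflow` package is installed; the experiment name `water-potability`, the helper names `log_run` and `best_run`, and the `tracking_uri` argument are illustrative assumptions, not code from this repository.

```python
from typing import Any, Mapping, Optional, Sequence


def log_run(
    run_name: str,
    params: Mapping[str, Any],
    metrics: Mapping[str, float],
    tracking_uri: Optional[str] = None,
) -> None:
    """Log one training run (its settings and results) to MLflow."""
    # Imported lazily so best_run() below works even without MLflow installed.
    import mlflow

    if tracking_uri is not None:
        # Assumption: for DagsHub, pass the MLflow URL shown on your own repo page.
        mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("water-potability")  # hypothetical experiment name
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(dict(params))
        mlflow.log_metrics(dict(metrics))


def best_run(runs: Sequence[Mapping[str, Any]], metric: str = "accuracy") -> Mapping[str, Any]:
    """Return the run whose metrics[metric] value is highest."""
    return max(runs, key=lambda run: run["metrics"][metric])
```

Each entry passed to `best_run` is expected to be a dict with a `"metrics"` mapping, for example the same values that were handed to `log_run`.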
Try Different Approaches:
- Start with Random Forest as the baseline.
- Test different algorithms (Logistic Regression, XGBoost).
- Compare how mean vs. median imputation affects results.
- Tune Random Forest parameters to squeeze out better accuracy.
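
To see why the mean vs. median choice can matter, here is a small standard-library sketch. The `impute` helper and the sample readings are made up for illustration and are not taken from the project's dataset or code.

```python
import statistics
from typing import List, Optional


def impute(column: List[Optional[float]], strategy: str = "mean") -> List[float]:
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in column]


# Made-up readings with one outlier (14.0), purely for illustration.
readings = [7.0, 7.2, None, 6.8, 14.0]
mean_filled = impute(readings, "mean")      # gap filled with about 8.75
median_filled = impute(readings, "median")  # gap filled with about 7.1
```

The outlier pulls the mean fill upward while the median fill stays close to the typical readings, which is why it is worth comparing both strategies in tracked experiments.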

Building the Pipeline with DVC

Versioning Data:
- DVC keeps versions of our datasets so we always know what was used when.
Stages in the Pipeline:
- Data Gathering: Bring in the dataset and organize it.
- Preprocessing: Fill in missing values with the column mean.
- Training: Train the Random Forest model.
- Evaluation: Log the results in MLflow.
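
DVC works out the order of stages from what each one depends on and what it produces. The sketch below mirrors the `cmd` / `deps` / `outs` layout of a `dvc.yaml` file as a Python dict and derives the run order from it. The mapping of stages to scripts, all file paths, and the `execution_order` helper are assumptions for illustration; the real definitions live in this repository's `dvc.yaml`. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Any, Dict, List

# Hypothetical stage definitions, deliberately listed out of order.
STAGES: Dict[str, Dict[str, Any]] = {
    "evaluate": {
        "cmd": "python src/models/predict_model.py",
        "deps": ["models/model.pkl", "data/processed"],
        "outs": ["reports/metrics.json"],
    },
    "train": {
        "cmd": "python src/models/train_model.py",
        "deps": ["data/processed"],
        "outs": ["models/model.pkl"],
    },
    "gather": {
        "cmd": "python src/data/make_dataset.py",
        "deps": [],
        "outs": ["data/raw"],
    },
    "preprocess": {
        "cmd": "python src/features/build_features.py",
        "deps": ["data/raw"],
        "outs": ["data/processed"],
    },
}


def execution_order(stages: Dict[str, Dict[str, Any]]) -> List[str]:
    """Order stages so each runs after the stages that produce its deps."""
    # Map every output path to the stage that creates it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    sorter: TopologicalSorter = TopologicalSorter()
    for name, spec in stages.items():
        upstream = {producer[dep] for dep in spec["deps"] if dep in producer}
        sorter.add(name, *upstream)
    return list(sorter.static_order())


order = execution_order(STAGES)
```

With these placeholder definitions, `order` comes out as gather, preprocess, train, evaluate, matching the stage list above.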

Registering the Model

The best model is saved in the MLflow Model Registry.
From there it can be served with FastAPI or Streamlit or, as in this project, loaded into a desktop app.
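
A minimal sketch of pulling a registered model back out, assuming `mlflow` is installed and the tracking/registry URI is already configured. The helper names and the model name `water-potability-rf` used in the note below are hypothetical; check the registry for the name actually used.

```python
def registry_uri(model_name: str, version: str = "latest") -> str:
    """Build an MLflow Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{model_name}/{version}"


def load_registered_model(model_name: str, version: str = "latest"):
    """Load a registered model as a generic MLflow pyfunc model."""
    # Imported lazily so registry_uri() can be used without MLflow installed.
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(registry_uri(model_name, version))
```

For example, `load_registered_model("water-potability-rf")` would fetch the newest registered version, and the returned object exposes a `predict` method that any front end can call.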

The Desktop App 🖥️

The Tkinter app is super lightweight.
It pulls the latest model automatically and lets users input their own data to see if the water is drinkable.
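
The general shape of such an app can be sketched as follows. This is not the project's `GUI.py`: the field names in `FIELDS`, the helpers `verdict` and `build_app`, and the convention that a prediction of `1` means potable are all assumptions for illustration. `predict` is any callable that takes the list of input values and returns a label.

```python
from typing import Callable, Dict, Sequence

# Hypothetical input fields; replace with the columns the model was trained on.
FIELDS = ("ph", "hardness", "turbidity")


def verdict(raw: Dict[str, str], predict: Callable[[Sequence[float]], int]) -> str:
    """Parse text-box input, run the predictor, and return a message."""
    try:
        values = [float(raw[name]) for name in FIELDS]
    except (KeyError, ValueError):
        return "Please enter a number in every field."
    # Assumption: the model returns 1 for potable water and 0 otherwise.
    return "Potable" if predict(values) == 1 else "Not potable"


def build_app(predict: Callable[[Sequence[float]], int]):
    """Create the window: one entry per field, a button, and a result label."""
    # Imported here so verdict() stays usable on machines without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Water Potability")
    entries = {}
    for row, name in enumerate(FIELDS):
        tk.Label(root, text=name).grid(row=row, column=0, sticky="w")
        entries[name] = tk.Entry(root)
        entries[name].grid(row=row, column=1)
    result = tk.StringVar(value="")

    def on_click() -> None:
        raw = {name: entry.get() for name, entry in entries.items()}
        result.set(verdict(raw, predict))

    tk.Button(root, text="Predict", command=on_click).grid(
        row=len(FIELDS), column=0, columnspan=2
    )
    tk.Label(root, textvariable=result).grid(
        row=len(FIELDS) + 1, column=0, columnspan=2
    )
    return root
```

To launch it, call `build_app(my_predict).mainloop()` with your own predictor. Keeping the parsing and decision logic in `verdict` means it can be tested without opening a window.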

What Did We Find?

Top Performer: Random Forest with mean imputation for missing values.
Best Settings: n_estimators=1000, max_depth=None.

File and Folder Structure

├── LICENSE
├── Makefile             <- Commands for building, training, etc.
├── README.md            <- Project overview and setup guide
├── data/
│   ├── raw              <- Original dataset
│   ├── external         <- External sources
│   ├── interim          <- Intermediate processing
│   └── processed        <- Cleaned data for modeling
├── docs/                <- Project documentation
├── models/              <- Saved models and outputs
├── notebooks/           <- Jupyter notebooks (e.g., 1.0-jd-initial-checks)
├── references/          <- Notes, sources, references
├── reports/
│   └── figures          <- Charts and graphs
├── requirements.txt     <- Dependencies
├── setup.py             <- Installable package setup
├── src/
│   ├── __init__.py
│   ├── data/
│   │   └── make_dataset.py
│   ├── features/
│   │   └── build_features.py
│   ├── models/
│   │   ├── train_model.py
│   │   └── predict_model.py
│   └── visualization/
│       └── visualize.py
└── tox.ini              <- Testing config
