An end-to-end production ML pipeline that predicts industrial machine failures before they happen.
Optimizes for total business cost in dollars — not accuracy, not F1.
Try it now — no setup, no API key required:
👉 https://predictive-maintenance-deep-shah.streamlit.app/
The live application allows you to:
- Adjust real-time sensor sliders and watch the failure probability gauge update instantly
- Upload a CSV of machine readings and get a full fleet risk assessment in seconds
- Explore the business dashboard — compare reactive vs preventive vs AI-driven maintenance costs
- Drag the decision threshold slider and watch FP/FN counts and total cost update live
18.9% failure probability — SAFE. H-type machine with 25 minutes of tool wear, 2000 RPM, and 30 Nm torque. The gauge, risk badge, and cost analysis update in real time as sliders are adjusted. Cost if ignored: $1,891. Preventive maintenance: $500. Model recommendation: Save $1,391.
Physics-derived features visible at the bottom — Temp Differential: 9.50 K (above the 8.6 K Heat Dissipation threshold), Mechanical Power: 60,842 W (within safe operating range), Force Ratio: 0.01509 (well below the 0.035 Overstrain threshold). All three engineered features confirm the machine is operating within healthy parameters.
81.5% failure probability — DANGER. L-type machine with Temp_Diff of 7.40 K (below the 8.6 K Heat Dissipation threshold), 1,208 RPM (low), and 30 Nm torque. DANGER badge fires immediately. The model identifies Heat Dissipation Failure (HDF) as the active failure mode — the thermal gradient has collapsed, signaling imminent thermal failure.
Expected cost if ignored: $8,151. Preventive maintenance cost: $500. Model recommendation: Save $7,651 — act immediately. Physics features confirm the failure signal: Temp Differential at 7.40 K (below the 8.6 K threshold), Mechanical Power at 36,602 W. The 20:1 cost asymmetry ($10,000 failure vs $500 inspection) makes the maintenance decision unambiguous.
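The dollar figures in the two live-prediction examples are consistent with a simple expected-cost rule: failure probability multiplied by the $10,000 failure cost, compared against the $500 inspection cost. The following is a minimal sketch of that arithmetic, assuming the app computes it this way; the function name `cost_analysis` is hypothetical, and the probabilities 0.1891 and 0.8151 are back-derived from the displayed dollar amounts.

```python
FAILURE_COST = 10_000    # cost of an unplanned failure (from the 20:1 asymmetry)
INSPECTION_COST = 500    # cost of one preventive inspection


def cost_analysis(failure_probability: float) -> dict:
    """Compare the expected cost of ignoring a machine with the cost of inspecting it."""
    expected_cost = round(failure_probability * FAILURE_COST, 2)
    return {
        "expected_cost_if_ignored": expected_cost,
        "preventive_cost": INSPECTION_COST,
        "saving_if_maintained": round(expected_cost - INSPECTION_COST, 2),
    }


safe = cost_analysis(0.1891)     # the SAFE example
danger = cost_analysis(0.8151)   # the DANGER example
print(safe)
print(danger)
```

Under this rule, inspection pays for itself whenever the failure probability exceeds $500 / $10,000 = 5%.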
12 machines analyzed in one CSV upload. Fleet summary: 2 CRITICAL (16.7%), 3 MONITOR, 7 SAFE — $39,965 total cost at risk. The failure probability distribution chart separates the healthy cluster (left, below MONITOR threshold) from the at-risk machines (right, past the DANGER threshold line). Maintenance teams get an immediate prioritized action list.
Machines sorted by failure probability descending. MACHINE-003 and MACHINE-004 flagged DANGER in red (81.2% and 80.9% — both L-type with tool wear 240+ minutes). Three MONITOR machines follow in orange. Color-coded Risk_Level column and Expected_Cost_$ give maintenance teams an immediate dollar-ranked action list. Full results downloadable as CSV.
1,000-machine fleet simulation: Reactive maintenance costs $340,000/year. Full preventive costs $500,000/year. This model costs $79,000/year — catching 32 of 34 failures (94% recall). Savings vs reactive: $261,000 (76.8%). Savings vs full preventive: $421,000 (84.2%). Fleet size, failure rate, and cost parameters are all adjustable via the Adjust Assumptions panel.
LightGBM selected as champion via 5-fold cross-validated F1 mean (0.7857) — not by test-set score. CatBoost ranks second with lower CV std (0.051 vs 0.063), indicating more stable folds. Champion selection by CV score prevents the model selection bias that occurs when the test set is used to pick between models. All 9 models benchmarked under identical CV conditions.
- The Business Problem
- What Makes This Different
- System Architecture
- Technical Decisions & Rationale
- Results
- Business Impact
- Repository Structure
- Quickstart
- Streamlit App
- FastAPI — REST Endpoints
- Docker Deployment
- Drift Detection & Monitoring
- Running Tests
- Dataset
Every hour of unplanned downtime in heavy manufacturing costs between $10,000 and $250,000 depending on the industry. Yet the two standard maintenance strategies are both fundamentally broken:
| Strategy | What Goes Wrong | Hidden Cost |
|---|---|---|
| Reactive | Wait for failure, then fix it | Emergency repair + full production halt |
| Preventive (fixed schedule) | Service everything on a calendar | Replacing healthy components, unnecessary labor |
Predictive maintenance is the only strategy that is neither wasteful nor dangerous. It uses real-time sensor data to generate a maintenance alert only when a specific machine is genuinely showing signs of imminent failure — catching the failure before it happens, touching nothing that doesn't need attention.
This project builds a full production-structured ML pipeline on the AI4I 2020 Predictive Maintenance Dataset (UCI / Kaggle) — a realistic simulation of CNC machine sensor telemetry across 10,000 operating cycles with a 97:3 healthy-to-failure class ratio.
The majority of ML classification projects optimize for accuracy. Accuracy is the wrong metric for this problem. On a factory floor, errors are not symmetric:
- A missed failure (False Negative) = unplanned downtime, possible safety incident → $10,000
- A false alarm (False Positive) = a technician dispatched unnecessarily → $500
That is a 20:1 cost asymmetry. Every decision in this pipeline flows from that single insight.
| What a standard ML project does | What this pipeline does |
|---|---|
| Optimize accuracy or generic F1 | Optimize total dollar cost: (FP × $500) + (FN × $10,000) |
| Single train/test split | 3-way stratified split — train (60%) / val (20%) / test (20%) |
| Decision threshold fixed at 0.5 | Threshold searched on validation set, reported on test set |
| GridSearchCV on F1 | GridSearchCV on a custom business-cost scorer |
| SMOTE applied to the full dataset | SMOTE inside CV folds only — no synthetic leakage |
| Pick champion by test-set F1 | Pick champion by 5-fold cross-validated F1 mean |
| No unit tests | 14 pytest unit tests covering all core functions |
| Notebook only | Streamlit app + FastAPI + Docker + drift monitoring |
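The 60/20/20 stratified split from the table can be sketched with two chained calls to scikit-learn's `train_test_split`. This is an illustration on synthetic data, not the repository's `feature_engineering.py`; the array shapes and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 10,000 rows with a 3.4% failure rate.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = np.zeros(10_000, dtype=int)
y[:340] = 1

# Step 1: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, failure rate={labels.mean():.3f}")
```

Passing `stratify` at both steps is what keeps the failure rate near 3.4% in every subset.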
Raw CSV (Google Drive / local cache)
                   │
                   ▼
┌─────────────────────────────────────┐
│  data_ingestion.py                  │
│  Download → Schema validation       │
│  Deduplication → Null audit         │
│  Target column sanity check         │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  feature_engineering.py             │
│  Physics feature creation           │
│  Drop leakage columns               │
│  3-way stratified split (60/20/20)  │
└───────┬─────────────┬───────────────┘
        │             │
     X_train     X_val, X_test
     y_train     y_val, y_test
        │             │
        ▼             │
┌─────────────────────────────────────┐
│  modeling.py                        │
│  9-model zoo benchmarked via        │
│  5-fold StratifiedKFold CV          │
│                                     │
│  Each fold pipeline:                │
│    preprocessor (fit on fold only)  │
│    → SMOTE (train fold only)        │
│    → classifier                     │
│                                     │
│  Champion = highest CV_F1_Mean      │
│  GridSearchCV on business-cost      │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  evaluation.py                      │
│  optimize_threshold(X_val, y_val)   │  ← val set ONLY
│  Final report on (X_test, y_test)   │  ← test set, first touch here
│  Confusion matrix · ROC · Features  │
│  Save model → artifacts/models/     │
└─────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│                        PRODUCTION SYSTEM                         │
│                                                                  │
│  ┌─────────────────────┐     ┌──────────────────────────────┐    │
│  │  streamlit_app.py   │     │  api/main.py (FastAPI)       │    │
│  │                     │     │                              │    │
│  │  Tab 1: Live Pred.  │     │  POST /predict               │    │
│  │  Tab 2: Batch       │     │  POST /predict-batch         │    │
│  │  Tab 3: Dashboard   │     │  GET  /health                │    │
│  └──────────┬──────────┘     └──────────────┬───────────────┘    │
│             └─────────────┬─────────────────┘                    │
│                           ▼                                      │
│              ┌────────────────────────┐                          │
│              │  lightgbm_champion.pkl │                          │
│              └────────────────────────┘                          │
│                           ▼                                      │
│              ┌────────────────────────┐                          │
│              │  monitoring.py         │                          │
│              │  KS drift detection    │                          │
│              │  → drift_alerts.csv    │                          │
│              └────────────────────────┘                          │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │            Docker / docker-compose (port 8000)            │   │
│  └───────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘
Three features were engineered from first principles of thermodynamics and rotational mechanics rather than feeding raw sensor readings directly into the model.
| Feature | Formula | Physical Interpretation |
|---|---|---|
| Temp_Diff | Process Temp − Air Temp | Thermal gradient: a falling value (below roughly 8.6 K) signals poor heat dissipation preceding thermal failure |
| Power | Torque [Nm] × RPM | Mechanical power input to spindle: sustained peaks accelerate tool wear |
| Force_Ratio | Torque / (RPM + ε) | Load per revolution: high ratio at low speed indicates heavy cutting conditions |
The ε = 1e-5 guard in Force_Ratio prevents division by zero. In the feature importance chart, Power ranks 2nd and Temp_Diff 3rd, behind only Tool wear [min] and above every other raw sensor reading, so the domain-driven features carry more signal than most raw inputs.
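The three formulas in the table can be written out directly. The following is a plain-Python sketch for a single reading; the function name `physics_features` and its argument names are hypothetical, and the repository's own implementation may operate on whole DataFrame columns instead.

```python
EPSILON = 1e-5  # guard against division by zero when RPM is 0


def physics_features(air_temp_k, process_temp_k, rpm, torque_nm):
    """Derive the three physics features from one raw sensor reading."""
    return {
        "Temp_Diff": process_temp_k - air_temp_k,      # thermal gradient in K
        "Power": torque_nm * rpm,                      # torque x rotational speed
        "Force_Ratio": torque_nm / (rpm + EPSILON),    # load per revolution
    }


features = physics_features(air_temp_k=302.0, process_temp_k=309.0,
                            rpm=1200, torque_nm=65.0)
print(features)

# A stalled spindle (0 RPM) still yields a finite ratio thanks to EPSILON.
stalled = physics_features(air_temp_k=300.0, process_temp_k=310.0,
                           rpm=0, torque_nm=40.0)
print(stalled["Force_Ratio"])
```

Note that Torque × RPM is a proxy proportional to mechanical power; converting to watts would require the extra factor 2π/60.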
The dataset is 96.6% healthy machines and 3.4% failures. Three decisions handle this correctly:
- Stratified splits preserve the 3.4% failure rate across all three subsets.
- SMOTE inside CV folds via imblearn.pipeline.Pipeline ensures synthetic minority samples are generated from training data only. The common mistake of applying SMOTE before CV inflates CV metrics by leaking synthetic copies of validation samples into training folds.
- The business-cost scorer explicitly encodes the 20:1 cost asymmetry into the hyperparameter search.
If the decision threshold were optimised on the test set and then reported on the same set, the reported cost would be the minimum achievable on that specific sample — overly optimistic and non-generalising. The validation set is used exclusively for threshold search. The test set is touched exactly once — in evaluation.py — for the final unbiased report.
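A threshold search of this kind can be sketched in a few lines. The helper names `total_cost` and `search_threshold`, the candidate grid, and the toy validation data are all illustrative assumptions; the repository's `optimize_threshold` may differ in signature and search strategy.

```python
FP_COST, FN_COST = 500, 10_000


def total_cost(y_true, y_prob, threshold):
    """Dollar cost of the errors made when flagging every p >= threshold."""
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    return fp * FP_COST + fn * FN_COST


def search_threshold(y_val, val_prob):
    """Return the lowest-cost threshold on a 0.01 grid (first one wins on ties)."""
    candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda th: total_cost(y_val, val_prob, th))


# Toy validation labels and predicted failure probabilities.
y_val = [0, 0, 0, 0, 1, 1, 0, 1]
val_prob = [0.05, 0.10, 0.40, 0.20, 0.35, 0.90, 0.02, 0.60]

best = search_threshold(y_val, val_prob)
print(best, total_cost(y_val, val_prob, best))
print(total_cost(y_val, val_prob, 0.5))  # cost at the default 0.5 threshold
```

In this toy case the default 0.5 threshold misses the failure scored at 0.35, which costs far more than the single false alarm accepted at the lower threshold. The chosen threshold would then be frozen and applied once to the test set.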
Selecting the champion model by test-set score is model selection bias. Once you use the test set to make a decision, it is no longer a clean estimate of generalisation. All 9 models are ranked by 5-fold cross-validated F1 mean. The test set is only used for the final report after both champion and threshold are locked in.
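The selection rule itself is a one-liner over the cross-validation leaderboard. A small sketch using the top four rows of the results table; the dictionary layout is an assumption, not the format of `model_leaderboard.csv`.

```python
# CV statistics only: test-set columns are deliberately absent from the decision.
leaderboard = [
    {"model": "LightGBM", "cv_f1_mean": 0.7857, "cv_f1_std": 0.0626},
    {"model": "CatBoost", "cv_f1_mean": 0.7758, "cv_f1_std": 0.0512},
    {"model": "XGBoost", "cv_f1_mean": 0.7543, "cv_f1_std": 0.0615},
    {"model": "Random Forest", "cv_f1_mean": 0.7346, "cv_f1_std": 0.0522},
]

champion = max(leaderboard, key=lambda row: row["cv_f1_mean"])
most_stable = min(leaderboard, key=lambda row: row["cv_f1_std"])

print(f"Champion by CV F1 mean: {champion['model']}")
print(f"Lowest CV F1 std:       {most_stable['model']}")
```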
GridSearchCV minimizes (FP × $500) + (FN × $10,000) via a custom make_scorer with greater_is_better=False. The tuner directly searches for the configuration that saves the most money — not the one that maximises an abstract metric.
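A business-cost scorer along these lines can be built with scikit-learn's `make_scorer`. This is a sketch under stated assumptions: the function name `business_cost` is hypothetical, and the `DummyClassifier` exists only to demonstrate the sign convention.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, make_scorer

FP_COST, FN_COST = 500, 10_000


def business_cost(y_true, y_pred):
    """Total dollar cost of false alarms and missed failures."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * FP_COST + fn * FN_COST


# greater_is_better=False makes scikit-learn negate the value, so a search
# that maximises the score is minimising the dollar cost.
business_cost_scorer = make_scorer(business_cost, greater_is_better=False)

y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1, 0])            # one false alarm, one missed failure
cost = business_cost(y_true, y_pred)
print(cost)

# A model that always predicts "healthy" misses both failures.
X = np.zeros((5, 1))
always_healthy = DummyClassifier(strategy="constant", constant=0).fit(X, y_true)
score = business_cost_scorer(always_healthy, X, y_true)
print(score)
```

The scorer object can be passed as `scoring=business_cost_scorer` to `GridSearchCV`, which then ranks parameter settings by negated cost.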
Type encodes a genuine quality tier: L (Low) < M (Medium) < H (High). OrdinalEncoder with categories=[['L', 'M', 'H']] preserves this ordering as integers (0, 1, 2). OneHotEncoder would discard the ordinal structure. The handle_unknown='use_encoded_value', unknown_value=-1 guard ensures the pipeline never crashes on unseen categories at inference time.
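The encoder configuration described above behaves as follows on known and unseen categories. A minimal sketch; the unseen value "X" is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[["L", "M", "H"]],          # explicit order: L < M < H
    handle_unknown="use_encoded_value",
    unknown_value=-1,                      # unseen categories map to -1
)
encoder.fit(np.array([["L"], ["M"], ["H"]]))

# "X" was never seen during fit; it is encoded as -1 instead of raising.
encoded = encoder.transform(np.array([["H"], ["L"], ["X"]]))
print(encoded.ravel().tolist())
```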
| Rank | Model | CV F1 Mean | CV F1 Std | CV AUC | Test F1 | Test AUC |
|---|---|---|---|---|---|---|
| 🥇 | LightGBM | 0.7857 | 0.0626 | 0.9707 | 0.7808 | 0.9847 |
| 🥈 | CatBoost | 0.7758 | 0.0512 | 0.9709 | 0.7200 | 0.9782 |
| 🥉 | XGBoost | 0.7543 | 0.0615 | 0.9638 | 0.7125 | 0.9799 |
| 4 | Random Forest | 0.7346 | 0.0522 | 0.9698 | 0.7355 | 0.9727 |
| 5 | Gradient Boosting | 0.6227 | 0.0217 | 0.9726 | 0.5957 | 0.9794 |
| 6 | Decision Tree | 0.5953 | 0.0370 | 0.8653 | 0.6067 | 0.8826 |
| 7 | SVC | 0.4972 | 0.0263 | 0.9621 | 0.4917 | 0.9731 |
| 8 | Logistic Regression | 0.2857 | 0.0147 | 0.9191 | 0.3021 | 0.9316 |
| 9 | Gaussian NB | 0.2654 | 0.0200 | 0.9075 | 0.2821 | 0.9038 |
LightGBM vs CatBoost: LightGBM wins on CV F1 mean (0.786 vs 0.776). CatBoost has lower CV std (0.051 vs 0.063) — more stable across folds. In production, an ensemble of both would be the natural next step.
Threshold optimized on validation set: 0.32
              precision    recall  f1-score   support

           0     0.9977    0.9063    0.9498      1932
           1     0.2612    0.9412    0.4089        68

    accuracy                         0.9075      2000
   macro avg     0.6295    0.9238    0.6794      2000
weighted avg     0.9652    0.9075    0.9320      2000
The model catches 64 of 68 actual failures (94.1% recall). 4 failures missed. 181 false alarms — a deliberate trade-off given a missed failure costs 20× more than a false alarm.
Confusion Matrix
64 failures correctly flagged. 4 missed at $10,000 each ($40,000). 181 false alarms at $500 each ($90,500). Total projected test-set cost: $130,500.
ROC Curve
AUC = 0.9847. The curve immediately reaches ~80% True Positive Rate at near-zero False Positive Rate.
Feature Importance
Tool wear [min] ranks first. Power and Temp_Diff, both engineered features, rank 2nd and 3rd, above every other raw sensor reading. Domain engineering validated.
| Outcome | Count | Unit Cost | Total |
|---|---|---|---|
| False Negatives — missed failures | 4 | $10,000 | $40,000 |
| False Positives — unnecessary inspections | 181 | $500 | $90,500 |
| Total projected cost | | | $130,500 |
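The reported precision, recall, accuracy, and total cost all follow from the three error counts and the test-set size of 2,000. A short check of that arithmetic, using only numbers stated in this section:

```python
TEST_SIZE = 2000
tp, fn, fp = 64, 4, 181
tn = TEST_SIZE - tp - fn - fp     # healthy machines correctly left alone

precision = tp / (tp + fp)        # share of alerts that were real failures
recall = tp / (tp + fn)           # share of real failures that were caught
accuracy = (tp + tn) / TEST_SIZE
total_cost = fp * 500 + fn * 10_000

print(f"TN={tn}, precision={precision:.4f}, recall={recall:.4f}")
print(f"accuracy={accuracy:.4f}, total cost=${total_cost:,}")
```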
| Strategy | Failures Caught | Annual Cost | Saving vs Reactive |
|---|---|---|---|
| Reactive — wait for breakdown | 0% | $340,000 | — |
| Preventive — fixed schedule | 100% | $500,000 | −$160,000 |
| This Model — LightGBM, threshold 0.32 | 94% | $79,000 | $261,000 (76.8%) |
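The comparison in the table can be reproduced from the fleet assumptions. In this sketch the function name `strategy_costs` is hypothetical and the model's $79,000 annual cost is taken as given from the table rather than derived, since the split between inspections and missed failures behind it is not itemised here.

```python
def strategy_costs(fleet_size, failure_rate, model_cost,
                   failure_cost=10_000, inspection_cost=500):
    """Annual cost of each maintenance strategy for a simulated fleet."""
    expected_failures = round(fleet_size * failure_rate)
    reactive = expected_failures * failure_cost    # every failure hits production
    preventive = fleet_size * inspection_cost      # every machine gets serviced
    return {
        "reactive": reactive,
        "preventive": preventive,
        "model": model_cost,
        "saving_vs_reactive": reactive - model_cost,
        "saving_vs_reactive_pct": round(100 * (reactive - model_cost) / reactive, 1),
        "saving_vs_preventive": preventive - model_cost,
        "saving_vs_preventive_pct": round(100 * (preventive - model_cost) / preventive, 1),
    }


costs = strategy_costs(fleet_size=1000, failure_rate=0.034, model_cost=79_000)
for key, value in costs.items():
    print(f"{key}: {value}")
```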
predictive-maintenance-engine/
│
├── assets/
│   └── screenshots/
│       ├── 01_live_prediction_safe.png
│       ├── 02_cost_analysis_safe.png
│       ├── 03_live_prediction_danger.png
│       ├── 04_cost_analysis_danger.png
│       ├── 05_batch_analysis_summary.png
│       ├── 06_batch_analysis_table.png
│       ├── 07_business_dashboard.png
│       └── 08_model_leaderboard.png
│
├── artifacts/                      # Auto-generated, gitignored
│   ├── graphs/
│   │   ├── confusion_matrix.png
│   │   ├── roc_curve.png
│   │   └── feature_importance.png
│   └── model_leaderboard.csv
│
├── api/
│   ├── __init__.py
│   └── main.py                     # /predict, /predict-batch, /health
│
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── data_ingestion.py
│   ├── feature_engineering.py
│   ├── modeling.py
│   └── evaluation.py
│
├── tests/
│   └── test_pipeline.py            # 14 pytest unit tests
│
├── main_execution.ipynb            # Training pipeline (Colab)
├── run_pipeline.py                 # Training pipeline (local)
├── streamlit_app.py                # Streamlit dashboard
├── monitoring.py                   # KS drift detection
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
Visit https://predictive-maintenance-deep-shah.streamlit.app/ directly in your browser.
1. Upload the project to Google Drive:
MyDrive/
└── predictive-maintenance-engine/
    ├── src/
    ├── api/
    ├── tests/
    ├── streamlit_app.py
    ├── monitoring.py
    └── requirements.txt
2. Open main_execution.ipynb in Google Colab and run all cells.
The pipeline mounts Drive, downloads the dataset automatically via gdown, trains all 9 models, tunes the champion, and saves every artifact back to Drive.
git clone https://github.com/DeepShah111/predictive-maintenance-engine.git
cd predictive-maintenance-engine
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Mac/Linux
pip install -r requirements.txt
python run_pipeline.py

streamlit run streamlit_app.py
# → http://localhost:8501

| Tab | What it does |
|---|---|
| ⚡ Live Prediction | Sensor sliders → real-time failure probability gauge + risk level + cost impact |
| 📂 Batch Analysis | Upload CSV → ranked fleet risk table + distribution chart + downloadable results |
| 📊 Business Dashboard | Strategy cost comparison + live threshold slider with FP/FN/cost update |
Live deployment: https://predictive-maintenance-deep-shah.streamlit.app/
uvicorn api.main:app --reload --port 8000
# → http://localhost:8000/docs

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Model loaded status, threshold, version |
| POST | /predict | Single reading → probability + risk level + recommended action |
| POST | /predict-batch | List of readings → predictions + fleet summary |
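For callers who prefer Python over the command line, a request to POST /predict can be built with the standard library alone. This is a sketch: the helper names `build_predict_request` and `predict` are hypothetical, the field names mirror the request body documented in this section, and the URL assumes the local uvicorn server started above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # assumes the API is running locally


def build_predict_request(reading: dict) -> urllib.request.Request:
    """Wrap one sensor reading in a JSON POST request for /predict."""
    body = json.dumps(reading).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(reading: dict) -> dict:
    """Send the reading and return the decoded JSON response (needs a live server)."""
    with urllib.request.urlopen(build_predict_request(reading), timeout=10) as response:
        return json.load(response)


reading = {
    "machine_type": "L",
    "air_temperature_K": 302.0,
    "process_temperature_K": 309.0,
    "rotational_speed_rpm": 1200,
    "torque_Nm": 65.0,
    "tool_wear_min": 240,
    "machine_id": "MACHINE-001",
}
request = build_predict_request(reading)
print(request.get_method(), request.full_url)
```

With the server running, `predict(reading)` would return the prediction as a dictionary.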
Example — Single Prediction:
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "machine_type": "L",
    "air_temperature_K": 302.0,
    "process_temperature_K": 309.0,
    "rotational_speed_rpm": 1200,
    "torque_Nm": 65.0,
    "tool_wear_min": 240,
    "machine_id": "MACHINE-001"
  }'

Expected response:
{
  "machine_id": "MACHINE-001",
  "failure_probability": 0.812,
  "failure_probability_pct": 81.2,
  "risk_level": "DANGER",
  "recommended_action": "IMMEDIATE maintenance required. Take machine offline.",
  "expected_cost_if_ignored": 8120.0,
  "physics_features": {
    "Temp_Diff": 7.0,
    "Power": 78000.0,
    "Force_Ratio": 0.054167
  },
  "model_name": "Lightgbm",
  "threshold_used": 0.32
}

# Build and run
docker compose up --build
# → API available at http://localhost:8000
# Stop
docker compose down

The artifacts/ directory is mounted as a read-only volume so the container always uses the latest trained model without a rebuild.
The monitoring.py module detects covariate shift between training and production data using the Kolmogorov-Smirnov test (α = 0.05).
from monitoring import DriftMonitor
import pandas as pd
monitor = DriftMonitor()
alerts = monitor.check_drift(pd.read_csv("new_readings.csv"), tag="production_batch_1")
if alerts:
    for a in alerts:
        print(f"DRIFT: {a['feature']} — shift {a['mean_shift_pct']:.1f}%")

CLI usage:
python monitoring.py --csv new_sensor_data.csv --tag production_jan_2025

All alerts logged to artifacts/drift_alerts.csv with timestamp, KS statistic, p-value, and mean shift percentage.
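The per-feature check behind such an alert can be sketched with SciPy's two-sample Kolmogorov-Smirnov test. This is an illustration, not the internals of `monitoring.py`: the function name `detect_drift`, the returned keys, and the synthetic torque samples are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.05  # significance level stated for the drift test


def detect_drift(reference, current, alpha=ALPHA):
    """Compare one feature's training and production distributions."""
    result = ks_2samp(reference, current)
    ref_mean = np.mean(reference)
    mean_shift_pct = 100 * (np.mean(current) - ref_mean) / abs(ref_mean)
    return {
        "ks_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "mean_shift_pct": float(mean_shift_pct),
        "drift": bool(result.pvalue < alpha),
    }


rng = np.random.default_rng(0)
train_torque = rng.normal(40.0, 10.0, size=2000)    # reference distribution
shifted_torque = rng.normal(46.0, 10.0, size=500)   # production batch, mean moved up

unchanged = detect_drift(train_torque, train_torque)
shifted = detect_drift(train_torque, shifted_torque)
print("unchanged:", unchanged["drift"])
print("shifted:  ", shifted["drift"], f"({shifted['mean_shift_pct']:.1f}% mean shift)")
```

Because the KS test compares whole distributions, it can also flag changes in spread or shape that leave the mean untouched.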
python -m pytest tests/ -v

collected 14 items
tests/test_pipeline.py::test_physics_features_columns_created PASSED
tests/test_pipeline.py::test_physics_features_temp_diff_value PASSED
tests/test_pipeline.py::test_physics_features_power_value PASSED
tests/test_pipeline.py::test_physics_features_no_infinities PASSED
tests/test_pipeline.py::test_leakage_cols_dropped_after_split PASSED
tests/test_pipeline.py::test_get_preprocessor_returns_column_transformer PASSED
tests/test_pipeline.py::test_clean_data_removes_duplicates PASSED
tests/test_pipeline.py::test_clean_data_index_is_contiguous PASSED
tests/test_pipeline.py::test_build_features_and_split_returns_six_objects PASSED
tests/test_pipeline.py::test_build_features_and_split_sizes PASSED
tests/test_pipeline.py::test_build_features_and_split_class_balance PASSED
tests/test_pipeline.py::test_total_cost_metric_correct_value PASSED
tests/test_pipeline.py::test_total_cost_metric_degenerate_returns_inf PASSED
tests/test_pipeline.py::test_schema_validation_raises_on_missing_columns PASSED
14 passed in ~18s
AI4I 2020 Predictive Maintenance Dataset
| Property | Value |
|---|---|
| Source | UCI ML Repository · Kaggle |
| Rows | 10,000 |
| Features used | 9 (5 raw numerical + 1 categorical + 3 physics-derived) |
| Target | Machine failure (binary: 0 = healthy, 1 = failure) |
| Class distribution | 96.6% healthy / 3.4% failure |
| Leakage columns dropped | UDI, Product ID, TWF, HDF, PWF, OSF, RNF |
The leakage columns (TWF through RNF) are individual failure-mode sub-flags set to 1 only when Machine failure is also 1. Keeping them would let the model read the answer directly — they are dropped before any modelling step. The dataset downloads automatically on first run via gdown.
Built as a portfolio project demonstrating production ML engineering practices.
Structured for clarity, correctness, and interview-readiness.
🚀 Live Demo | 📁 GitHub










