Lightweight end-to-end example that trains and serves a Random Forest model to predict heart disease from tabular patient data.
This repository includes:
- Training script (`train.py`) that performs K-Fold validation and saves a model + encoder.
- A pre-trained model file (`rf_model_40_trees_depth_10_min_samples_leaf_1.bin`).
- A Flask-based prediction endpoint (`predict.py`) and a small test client (`predict_test.py`).
- Dataset: `Data/heart.csv` (patient features + `HeartDisease` target).
- Model: RandomForestClassifier with a saved One-Hot encoder (DictVectorizer-style) and classifier stored together in a `.bin` file.
- API: POST `/predict` accepts a single patient JSON and returns `heart_disease_probability` and `heart_disease` (bool).
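The request/response contract described above can be sketched as a small helper. This is an illustration only, not the code in `predict.py`: the function name `predict_single`, the 0.5 decision threshold, and the assumption that the encoder exposes `transform` and the classifier exposes `predict_proba` (as scikit-learn objects do) are all assumptions.

```python
def predict_single(dv, model, patient, threshold=0.5):
    """Score one patient dict and build a JSON-friendly response.

    dv        -- fitted encoder with a transform() method (assumed DictVectorizer-like)
    model     -- fitted classifier with predict_proba() (assumed RandomForestClassifier)
    patient   -- dict of feature name -> value for a single patient
    threshold -- hypothetical cut-off; the real service may use a different one
    """
    X = dv.transform([patient])          # one-row feature matrix
    # Column 1 holds the probability of the positive class (heart disease).
    probability = float(model.predict_proba(X)[0][1])
    return {
        "heart_disease_probability": probability,
        # Cast to plain bool so the value serializes cleanly to JSON.
        "heart_disease": bool(probability >= threshold),
    }
```

Inside a Flask view, a dict like this could be passed to `jsonify` to produce the response body.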
- Use Docker (recommended, ensures correct Python and deps):
docker build -t heart-predict:latest .
docker run -p 9696:9696 heart-predict:latest
- Test the running server (from another shell):
python predict_test.py
Or use curl with a JSON body (example below).
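For a quick manual check from Python, a client along the following lines may help. It is a sketch, not the contents of `predict_test.py`: it uses only the standard library, the URL assumes the server is reachable on `localhost` at the port published above, and the patient record is hypothetical. Replace its keys and category values with the exact ones found in `Data/heart.csv`.

```python
import json
import urllib.request

# Assumed address: port 9696 matches the docker command above; the host is a guess.
URL = "http://localhost:9696/predict"

# Hypothetical patient record. These keys and categories are assumptions and
# must be replaced with the column names and values used in Data/heart.csv.
patient = {
    "Age": 54,
    "Sex": "M",
    "ChestPainType": "ASY",
    "RestingBP": 140,
    "Cholesterol": 239,
    "FastingBS": 0,
    "RestingECG": "Normal",
    "MaxHR": 160,
    "ExerciseAngina": "N",
    "Oldpeak": 1.2,
    "ST_Slope": "Flat",
}


def build_request(url, payload):
    """Build a POST request carrying the payload as a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(url, payload, timeout=10):
    """Send the payload and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_request(url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `send(URL, patient)` should return a dict containing `heart_disease_probability` and `heart_disease`.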
The project uses a Pipfile (Python 3.12). To install with pipenv:
pip install pipenv
pipenv install --deploy --system
Alternatively, create a virtualenv and install the dependencies listed in the Pipfile.
Start the app locally:
gunicorn --bind=0.0.0.0:9696 predict:app
Then run `python predict_test.py` to send a sample request.
Run full training and save a new model file:
python train.py
What happens:
- Reads `Data/heart.csv` and auto-detects categorical vs numerical columns.
- Runs K-Fold cross-validation and prints per-fold accuracy.
- Retrains on the full training set and writes the encoder + model to a `.bin` file.
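The steps above can be sketched with pandas and scikit-learn as follows. This is not the code in `train.py`. The hyperparameters are inferred from the shipped model's filename, while the function names, the dtype-based column detection, the 0.5 threshold, the fixed random seed, and the `(encoder, model)` pickle layout are assumptions made for illustration.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Inferred from the filename rf_model_40_trees_depth_10_min_samples_leaf_1.bin.
PARAMS = {"n_estimators": 40, "max_depth": 10, "min_samples_leaf": 1}


def split_columns(df, target):
    """Assumed detection rule: numeric dtype -> numerical, anything else -> categorical."""
    features = [c for c in df.columns if c != target]
    numerical = [c for c in features if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in features if c not in numerical]
    return categorical, numerical


def fit(df_part, target, params):
    """Fit the encoder and the forest on one slice of the data."""
    records = df_part.drop(columns=[target]).to_dict(orient="records")
    dv = DictVectorizer(sparse=False)  # one-hot encodes strings, passes numbers through
    X = dv.fit_transform(records)
    model = RandomForestClassifier(random_state=1, **params)
    model.fit(X, df_part[target].values)
    return dv, model


def positive_proba(df_part, target, dv, model):
    """Probability of the positive class for every row in df_part."""
    records = df_part.drop(columns=[target]).to_dict(orient="records")
    return model.predict_proba(dv.transform(records))[:, 1]


def cross_validate(df, target, params, n_splits=5):
    """Run K-Fold validation, print per-fold accuracy, and return the scores."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    scores = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(df)):
        df_train, df_val = df.iloc[train_idx], df.iloc[val_idx]
        dv, model = fit(df_train, target, params)
        preds = (positive_proba(df_val, target, dv, model) >= 0.5).astype(int)
        acc = accuracy_score(df_val[target].values, preds)
        print(f"fold {fold}: accuracy={acc:.3f}")
        scores.append(acc)
    return scores


def train_and_save(df, target, params, path):
    """Validate, retrain on all rows, and pickle (encoder, model) together."""
    categorical, numerical = split_columns(df, target)
    print(f"categorical={categorical} numerical={numerical}")
    cross_validate(df, target, params)
    dv, model = fit(df, target, params)
    with open(path, "wb") as f:
        pickle.dump((dv, model), f)
    return dv, model
```

A call such as `train_and_save(pd.read_csv("Data/heart.csv"), "HeartDisease", PARAMS, "model.bin")` would run the whole flow; the output filename here is a placeholder.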
- `Data/heart.csv`: dataset.
- `train.py`: training + CV script.
- `predict.py`: Flask app (loads the `.bin` file and serves `/predict`).
- `predict_test.py`: example client that POSTs a sample patient.
- `rf_model_40_trees_depth_10_min_samples_leaf_1.bin`: included saved model.
- `Dockerfile`, `Pipfile`, `Pipfile.lock`: runtime and packaging.
- Ensure the JSON keys and categorical values you send to `/predict` are identical (names & categories) to those used during training; otherwise the encoder may produce different feature vectors or raise an error.
- `train.py` contains a small LabelEncoder snippet that is unused in the training flow; if you modify preprocessing, make it explicit and keep it consistent between training and inference.
- There are no automated tests; consider adding a small unit test that loads the `.bin` and checks prediction output types/shape.
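A starting point for such a test might look like the sketch below. It assumes the `.bin` file is a pickle holding an `(encoder, model)` pair in that order, and that the model is a binary classifier exposing `predict_proba`. Both helper names are hypothetical. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import pickle


def load_artifacts(path):
    """Load the pickled pair; the (encoder, model) order is an assumption."""
    with open(path, "rb") as f:
        dv, model = pickle.load(f)
    return dv, model


def check_prediction_output(dv, model, patient):
    """Assert the shape and range of the prediction for a single patient dict."""
    X = dv.transform([patient])
    proba = model.predict_proba(X)
    assert len(proba) == 1, "expected one row of probabilities for one patient"
    assert len(proba[0]) == 2, "expected probabilities for two classes"
    positive = float(proba[0][1])
    assert 0.0 <= positive <= 1.0, "probability must lie in [0, 1]"
    return positive
```

A pytest function could call `load_artifacts` on the shipped `.bin` file and pass the result, together with a sample patient dict, to `check_prediction_output`.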