A clean, end-to-end machine learning system for decision intelligence using simulated real-world data.
This repository is designed as a portfolio-grade ML systems project. It emphasizes the engineering discipline expected in production ML teams: modular architecture, reproducibility, testing, and lightweight governance.
What reviewers should notice quickly:
- End-to-end ML workflow with explicit data simulation, feature engineering, training, and artifact outputs
- Reusable package-first implementation in src/decision_engine/
- Notebook as a consumer layer, not an implementation layer
- Baseline model + validation + documentation artifacts (model_card, experiment template)
This project demonstrates how to:
- Build structured datasets from multiple signals
- Engineer features for predictive modeling
- Train and evaluate a baseline classification model
- Simulate decision-making workflows in a reproducible way
- data/ – simulated datasets
- src/decision_engine/ – modular, reusable ML package
- notebooks/ – EDA and modeling
- models/ – saved artifacts
- docs/ – model governance artifacts
- templates/ – reusable experiment logging templates
- sagemaker/ – optional SageMaker job entrypoints only (no AWS SDK in requirements.txt)
Show a production-style approach to building ML systems — not just notebooks.
- Clear separation of concerns (simulation, features, model, IO, pipeline orchestration)
- Reusable Python package under src/decision_engine/
- Thin notebook layer that consumes package functions instead of embedding core logic
- Configurable command-line entrypoint for reproducible runs
- Basic test suite for feature and pipeline validation
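To illustrate the kind of feature validation the test suite is meant to cover, here is a minimal pytest-style sketch. The function add_engagement_rate and its column names are hypothetical stand-ins chosen for the example; they are not taken from src/decision_engine/features/transform.py.

```python
"""Sketch of a feature-validation test; all names here are hypothetical."""


def add_engagement_rate(rows):
    """Return new rows with clicks / sessions added as `engagement_rate`.

    Rows with zero sessions get a rate of 0.0 instead of raising.
    """
    out = []
    for row in rows:
        sessions = row["sessions"]
        rate = row["clicks"] / sessions if sessions > 0 else 0.0
        out.append({**row, "engagement_rate": rate})
    return out


def test_engagement_rate_preserves_rows_and_handles_zero_sessions():
    rows = [
        {"user_id": 1, "sessions": 4, "clicks": 2},
        {"user_id": 2, "sessions": 0, "clicks": 0},
    ]
    result = add_engagement_rate(rows)

    # Same number of rows in, same number out.
    assert len(result) == len(rows)
    # Ordinary case: 2 clicks over 4 sessions.
    assert result[0]["engagement_rate"] == 0.5
    # Edge case: zero sessions must not raise ZeroDivisionError.
    assert result[1]["engagement_rate"] == 0.0
    # Input rows are left unmodified.
    assert "engagement_rate" not in rows[0]
```

Keeping the transform a pure function (rows in, new rows out) is what makes a test like this cheap to write and fast to run.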
flowchart LR
A["CLI / Notebook Entry\nsrc/main.py or notebook"] --> B["Pipeline Orchestration\nsrc/decision_engine/pipeline.py"]
B --> C["Data Simulation\nsrc/decision_engine/data/simulator.py"]
C --> D["Raw Dataset\ndata/raw/simulated_decision_data.csv"]
D --> E["Feature Engineering\nsrc/decision_engine/features/transform.py"]
E --> F["Feature Table\ndata/processed/features.csv"]
F --> G["Baseline Training\nsrc/decision_engine/models/baseline.py"]
G --> H["Model Artifact\nmodels/baseline_logreg.joblib"]
G --> I["Evaluation Metrics\nmodels/metrics.json"]
J["Governance and Experimentation"] --> K["docs/model_card.md"]
J --> L["templates/experiment_log_template.md"]
Core orchestration is centralized in src/decision_engine/pipeline.py.
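The flow in the diagram can be sketched as a single orchestration function that calls each stage in order and persists its output. This is an assumption-laden illustration using only the standard library: the stage functions, column names, and the majority-class placeholder used in place of real training are hypothetical, and only the output paths are taken from the diagram.

```python
"""Sketch of a pipeline orchestrator; stage functions are hypothetical."""
import csv
import json
import random
from pathlib import Path


def simulate(n_users, random_state):
    """Generate synthetic rows deterministically from a seed."""
    rng = random.Random(random_state)
    rows = []
    for user_id in range(n_users):
        sessions = rng.randint(1, 20)
        clicks = rng.randint(0, sessions)
        # Label is a noisy function of click rate (purely illustrative).
        converted = int(clicks / sessions + rng.gauss(0, 0.2) > 0.5)
        rows.append({"user_id": user_id, "sessions": sessions,
                     "clicks": clicks, "converted": converted})
    return rows


def transform(rows):
    """Add a derived feature without mutating the input rows."""
    return [{**r, "click_rate": r["clicks"] / r["sessions"]} for r in rows]


def evaluate_majority_baseline(rows):
    """Placeholder for the training stage: score a majority-class predictor."""
    positives = sum(r["converted"] for r in rows)
    majority = int(positives * 2 >= len(rows))
    accuracy = sum(r["converted"] == majority for r in rows) / len(rows)
    return {"n_rows": len(rows), "accuracy": accuracy}


def write_csv(rows, path):
    """Write a list of dicts to CSV, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def run_pipeline(root, n_users=100, random_state=42):
    """Run each stage in order and persist its output under `root`."""
    root = Path(root)
    raw = simulate(n_users, random_state)
    write_csv(raw, root / "data" / "raw" / "simulated_decision_data.csv")

    features = transform(raw)
    write_csv(features, root / "data" / "processed" / "features.csv")

    metrics = evaluate_majority_baseline(features)
    metrics_path = root / "models" / "metrics.json"
    metrics_path.parent.mkdir(parents=True, exist_ok=True)
    metrics_path.write_text(json.dumps(metrics, indent=2))
    return metrics
```

Because every stage takes plain inputs and returns plain outputs, both the CLI and the notebook can call run_pipeline (or any single stage) without duplicating logic.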
flowchart LR
A["Source + Batch Data\nAmazon S3"] --> B["Feature Processing\nSageMaker Processing Jobs"]
B --> C["Feature Store\nSageMaker Feature Store (optional)"]
B --> D["Training Pipeline\nSageMaker Pipelines + Training Jobs"]
C --> D
D --> E["Model Registry\nSageMaker Model Registry"]
E --> F["Approval Workflow\nManual or automated gate"]
F --> G["Deployment\nSageMaker Endpoint / Batch Transform"]
G --> H["Monitoring\nModel Monitor + CloudWatch"]
H --> I["Retraining Trigger\nEventBridge + Pipelines"]
I --> D
J["Experiment Tracking\nSageMaker Experiments"] --> D
K["Artifacts + Metrics\nS3 + CloudWatch + Reports"] --> E
How this repo maps to SageMaker components
- src/decision_engine/data/simulator.py -> processing job input generation prototype
- src/decision_engine/features/transform.py -> feature transformation step in Processing/Pipelines
- src/decision_engine/models/baseline.py -> training script entry for SageMaker Training Jobs
- docs/model_card.md -> model governance input for approval in Model Registry
- templates/experiment_log_template.md -> experiment traceability aligned with SageMaker Experiments
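As a rough idea of what a training script entry looks like, the sketch below parses the inputs a SageMaker script-mode container typically provides: hyperparameters as command-line flags, and data and model locations through the SM_CHANNEL_TRAIN and SM_MODEL_DIR environment variables. It needs no AWS SDK. The flag names and the assumption that the input channel is called "train" are illustrative and not taken from the scripts in sagemaker/.

```python
"""Sketch of a script-mode training entrypoint; not the repo's actual script."""
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Training entrypoint sketch")
    # Hyperparameters arrive as command-line flags in script mode.
    parser.add_argument("--random-state", type=int, default=42)
    parser.add_argument("--test-size", type=float, default=0.25)
    # Container paths come from environment variables, with the
    # conventional /opt/ml locations as fallbacks.
    parser.add_argument(
        "--train-dir",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)


def describe_run(args):
    """Summarize where a run would read features and write its artifact."""
    return (
        f"train_dir={args.train_dir} model_dir={args.model_dir} "
        f"random_state={args.random_state} test_size={args.test_size}"
    )
```

A real entrypoint would follow parse_args with a call into the package's training code and save the fitted model under args.model_dir, so the same training function serves both local and managed runs.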
make setup
make run
make check

Or run directly:

source .venv/bin/activate
python src/main.py --n-users 3000 --random-state 42 --test-size 0.25

Notebook:

jupyter notebook notebooks/01_decision_engine_baseline.ipynb

.venv/bin/python -m black src tests sagemaker
.venv/bin/python -m ruff check src tests sagemaker
.venv/bin/python -m pytest -q

Configuration lives in pyproject.toml.
- Deterministic simulation and train/test split via random_state
- All outputs are generated from code (no hidden manual data steps)
- Single command execution path through src/main.py and make run
- Experiment capture template provided at templates/experiment_log_template.md
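The reproducibility points above can be illustrated with a small sketch: one seed, passed on the command line, drives both the simulation and the split, so repeated runs produce identical results. The flag names match the example command in this README; the defaults simply mirror that command and are an assumption, as are the helper functions, which are not the actual contents of src/main.py.

```python
"""Sketch of a reproducible CLI run; helper names are hypothetical."""
import argparse
import random


def build_parser():
    parser = argparse.ArgumentParser(description="Decision engine run (sketch)")
    parser.add_argument("--n-users", type=int, default=3000)
    parser.add_argument("--random-state", type=int, default=42)
    parser.add_argument("--test-size", type=float, default=0.25)
    return parser


def simulate_scores(n_users, random_state):
    """Draw one synthetic score per user from a seeded generator."""
    rng = random.Random(random_state)
    return [rng.random() for _ in range(n_users)]


def split_indices(n_rows, test_size, random_state):
    """Shuffle row indices with a seeded generator, then cut once."""
    rng = random.Random(random_state)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


def main(argv=None):
    """Parse flags, simulate data, and split it, all from one seed."""
    args = build_parser().parse_args(argv)
    scores = simulate_scores(args.n_users, args.random_state)
    train_idx, test_idx = split_indices(
        len(scores), args.test_size, args.random_state
    )
    return scores, train_idx, test_idx
```

Each function creates its own seeded generator rather than touching global random state, so calling stages in a different order, or from a notebook, does not change their output.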
- Recommended Python: 3.10+
- Install dependencies from requirements.txt
- CI runs on Ubuntu with Python 3.10
- Dataset is synthetic and intended for demonstration, not production risk decisions
- Baseline model is scaled logistic regression by design; no extensive model selection included
- Fairness, calibration, and drift monitoring are noted in the model card but not fully implemented
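For readers unfamiliar with the term, "scaled logistic regression" can be sketched as a scikit-learn pipeline that standardizes features before fitting the classifier. The example below assumes NumPy and scikit-learn are installed; the synthetic features, coefficients, and function names are invented for illustration and do not reproduce src/decision_engine/models/baseline.py.

```python
"""Sketch of a scaled logistic regression baseline on synthetic data."""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic(n_users=3000, random_state=42):
    """Two features on very different scales plus a noisy binary label."""
    rng = np.random.default_rng(random_state)
    spend = rng.gamma(shape=2.0, scale=50.0, size=n_users)  # tens to hundreds
    click_rate = rng.beta(2.0, 5.0, size=n_users)           # between 0 and 1
    logit = 0.01 * spend + 4.0 * click_rate - 2.5
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return np.column_stack([spend, click_rate]), y


def train_baseline(X, y, test_size=0.25, random_state=42):
    """Fit scaler + logistic regression and report held-out metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Scaling lives inside the pipeline so it is fit on training rows only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": float(accuracy_score(y_test, model.predict(X_test))),
        "roc_auc": float(roc_auc_score(y_test, proba)),
    }
    return model, metrics
```

Bundling the scaler and classifier into one pipeline object means a single saved artifact carries both, which avoids applying mismatched scaling at prediction time.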
- Add calibration and threshold optimization workflows
- Add feature drift checks and retraining criteria
- Add model comparison framework (tree ensembles vs baseline linear model)
- Model card: docs/model_card.md
- Experiment template: templates/experiment_log_template.md
- SageMaker migration guide: docs/sagemaker_migration.md
- SageMaker starter scripts: sagemaker/README.md