End-to-end Computer Vision capstone on AWS SageMaker — Baseline Training → HPO → Distributed Training (DDP) → Real-Time Endpoint → Lambda Integration → Defensive Cleanup
Built with:
Python | PyTorch | Torchvision | Amazon SageMaker | boto3 | CloudWatch | AWS Lambda
In distribution centers, bins must contain an expected number of items for reliable fulfillment. Manual bin counting is slow, costly, and does not scale.
This project builds an image classification model that predicts a bin item count class (1–5) from an input image, then deploys it as a real-time SageMaker endpoint and exposes inference via AWS Lambda.
This repository demonstrates production-style ML engineering on AWS:
- Data preparation and deterministic splits
- Baseline training (single instance)
- Hyperparameter tuning (HPO)
- Multi-instance distributed training (DDP)
- Real-time deployment and smoke testing
- Lambda-triggered inference (HTTP URL / S3 URI / API Gateway payloads)
- Defensive cleanup to avoid ongoing charges
Inventory Management/
├── Documentation/
│ ├── Capstone Project Report.pdf
│ ├── Lambda Test Events Response.txt
│ └── proposal.pdf
├── Images/
│ ├── All_Training_Jobs.png
│ ├── Best_Model_Metrics.png
│ ├── CloudWatch_SS1.png
│ ├── CloudWatch_SS2.png
│ ├── CostExplorer_SS1.png
│ ├── CostExplorer_SS2.png
│ ├── Final_model.png
│ ├── Final_S3.png
│ ├── Hyperparameter_jobs.png
│ ├── Hyperparameter_training_jobs.png
│ ├── Lambda_API_Gateway_Test.png
│ ├── Lambda_Architecture.png
│ ├── Lambda_code.png
│ ├── Lambda_Function.png
│ ├── Lambda_HTTP_URL_Test.png
│ ├── Lambda_S3_URI_Test.png
│ ├── Project_Architecture.png
│ └── Sagemaker_models.png
├── local_eval/
│ ├── model/
│ │ ├── metrics.json
│ │ ├── model.pth
│ │ └── model.tar.gz
│ ├── confusion_matrix.png
│ └── metrics.json
├── file_list.json
├── final_inference.py
├── inference.py
├── Lambda.py
├── README.md
├── sagemaker.ipynb
├── train.py
└── train1.py
- sagemaker.ipynb
The end-to-end orchestration notebook for: splitting → training → HPO → DDP → endpoint deploy → Lambda integration → cleanup.
- train.py
Entry point for single-instance training (baseline + non-distributed workflow).
- train1.py
Entry point for multi-instance distributed training (DDP, 2 instances). Includes distributed-safe logging and checkpointing patterns.
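The distributed-safe patterns in train1.py come down to gating writes on the node's rank. SageMaker exposes the cluster layout through environment variables such as SM_HOSTS (a JSON list of host names) and SM_CURRENT_HOST, documented in the SageMaker training toolkit. A minimal sketch of a primary-node guard built from them (helper names are illustrative, not the actual functions in train1.py):

```python
import json
import os

def is_primary_node() -> bool:
    """True only on the first host of the SageMaker training cluster.

    SM_HOSTS is a JSON list of host names and SM_CURRENT_HOST is this
    node's name; both are set by the SageMaker training container.
    Defaults make the guard a no-op for single-instance runs.
    """
    hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
    current = os.environ.get("SM_CURRENT_HOST", "algo-1")
    return sorted(hosts)[0] == current

def save_checkpoint_if_primary(state: dict, path: str) -> bool:
    """Write a checkpoint only on the primary node to avoid clobbering.

    In the real training script this would call torch.save; here only
    the guard itself is illustrated.
    """
    if not is_primary_node():
        return False
    # torch.save(state, path)  # actual write in the training script
    return True
```

The same guard applies to logging: every rank trains, but only the primary node emits metrics and artifacts.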
- inference.py
Inference handler used in the initial deployment tests.
- final_inference.py
Final inference handler used for the final endpoint (and Lambda integration). Returns a consistent JSON schema: predicted_label, predicted_index, confidence, probabilities, class_labels.
- Lambda.py
AWS Lambda handler that:
- Accepts requests as an HTTP image URL, an S3 URI, or an API Gateway-style payload
- Fetches image bytes
- Invokes SageMaker endpoint (invoke_endpoint)
- Returns prediction JSON response
Given a bin image x, predict the item-count label y ∈ {1,2,3,4,5}, where y represents EXPECTED_QUANTITY.
This is a supervised multi-class classification problem:
- Input: JPG image
- Output: one of five count classes (1–5)
This project tracks standard multi-class classification metrics:
- Accuracy
- Macro F1-score (primary)
- Macro Precision / Macro Recall
- Confusion Matrix
Macro metrics are emphasized because they treat each class equally and help reveal performance issues under class imbalance.
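As a concrete illustration of why macro averaging is the primary metric, here is macro F1 on toy labels (not project data) via scikit-learn — the rare class contributes a full third of the score:

```python
from sklearn.metrics import f1_score

# Toy predictions over three classes; class 3 appears only once but
# still carries equal weight in the macro average.
y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 3]

macro_f1 = f1_score(y_true, y_pred, average="macro")
# Per-class F1: class 1 -> 2/3, class 2 -> 2/3, class 3 -> 1.0
# Macro F1 averages them equally: (2/3 + 2/3 + 1) / 3 = 7/9 ≈ 0.778
```

Plain accuracy on the same toy data would be 3/4 and would barely move if the rare class were always wrong; macro F1 drops sharply in that case.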
| Stage | What it proves | Key Evidence |
|---|---|---|
| Baseline (Single Instance) | End-to-end training + test evaluation works | Test Acc: 0.3203, Test Macro-F1: 0.3168 |
| HPO (Hyperparameter Tuning) | Objective improved via systematic search | Best objective (val_macro_f1): 0.387539 |
| DDP (2 instances) | Multi-instance training produces deployable artifact | Status: Completed, InstanceCount: 2, Artifact: s3://.../cbc-ddp-251225173021/output/model.tar.gz |
| Real-time Deploy | Endpoint is healthy and returns structured JSON | Endpoint InService + successful smoke test response |
| Lambda Integration | Serverless invocation works for multiple input types | HTTP URL / S3 URI / API Gateway payloads validated |
| Cleanup | No lingering inference resources | Endpoint + EndpointConfig deleted (defensive cleanup) |
- Baseline training (train.py)
- HPO tuning job (objective metric: val_macro_f1)
- Distributed training on 2 GPU instances (train1.py)
- Real-time endpoint deployment + smoke test (final_inference.py)
- Lambda → SageMaker endpoint integration (Lambda.py)
- Defensive cleanup (endpoint, endpoint config, model objects)
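The HPO objective above reaches SageMaker through the training logs: train.py prints a parseable line, and the tuner is configured with a matching regex in its metric_definitions. A minimal sketch of that contract (the exact log format and regex used in the notebook may differ):

```python
import re

# Hypothetical log line emitted by train.py after validation.
log_line = "val_macro_f1: 0.387539"

# metric_definitions entry the tuning job would use to scrape the
# objective value out of CloudWatch logs.
metric_definition = {
    "Name": "val_macro_f1",
    "Regex": r"val_macro_f1: ([0-9\.]+)",
}

match = re.search(metric_definition["Regex"], log_line)
objective = float(match.group(1))  # value the tuner maximizes
```

If the printed format and the regex drift apart, the tuning job silently reports no objective, so it pays to keep them next to each other in code review.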
Managed Spot Training was attempted to reduce cost, but the experiment was not a like-for-like comparison because:
- The baseline used a GPU instance (ml.g4dn.xlarge)
- The spot run used a CPU instance (ml.m5.2xlarge), which is dramatically slower for CNN training (ResNet50)
- The job hit MaxRuntime and was stopped (MaxRuntimeExceeded)
How to implement Spot Training properly:
- Use Spot with the SAME FAMILY of GPU instances as baseline (e.g., g4dn.xlarge) so runtime comparisons are meaningful
- Enable checkpointing (already done in this project) so interrupted spot runs can resume
- Increase MaxWaitTimeInSeconds and MaxRuntimeInSeconds appropriately for spot variability
- Keep training code resilient to restarts (idempotent data loading, safe checkpoint writes)
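In the SageMaker Python SDK, those recommendations map onto a handful of Estimator arguments (use_spot_instances, max_run, max_wait, checkpoint_s3_uri). Shown here as a plain dict so the invariant max_wait ≥ max_run is explicit; the instance choice and bucket path are placeholders, not this project's exact configuration:

```python
# Estimator keyword arguments for a like-for-like spot run (sketch).
spot_kwargs = {
    "instance_type": "ml.g4dn.xlarge",   # same GPU family as the baseline
    "use_spot_instances": True,
    "max_run": 2 * 60 * 60,              # MaxRuntimeInSeconds: 2h of actual training
    "max_wait": 4 * 60 * 60,             # MaxWaitTimeInSeconds: slack for spot interruptions
    "checkpoint_s3_uri": "s3://<bucket>/checkpoints/",  # placeholder bucket
}

# SageMaker requires max_wait >= max_run when spot instances are enabled.
assert spot_kwargs["max_wait"] >= spot_kwargs["max_run"]
```

With checkpointing enabled, an interrupted spot run restarts from the latest checkpoint in the S3 prefix rather than from epoch zero.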
- Python 3.10+
- AWS account with permissions for:
- SageMaker (training, HPO, deployment)
- S3 (read/write artifacts)
- CloudWatch Logs
- Lambda (create + invoke)
- IAM roles (execution role for SageMaker and Lambda)
- Recommended environment: SageMaker Studio / Udacity workspace configured for SageMaker
If you run locally:
- pip install sagemaker boto3 torch torchvision Pillow numpy pandas scikit-learn matplotlib
(If you run inside SageMaker-managed containers, many dependencies are already present.)
Open:
- sagemaker.ipynb
Follow the cells in order:
- Data split + upload to S3
- Baseline training job (train.py)
- HPO job (train.py)
- DDP training job (train1.py)
- Deploy endpoint (final_inference.py)
- Smoke test invocation
- Create Lambda integration
- Cleanup (endpoint + endpoint config + model objects)
Artifacts are stored in S3 and referenced in the notebook output logs.
For endpoint invocation, this project uses raw image bytes:
- Content-Type: application/x-image
- Body: image bytes
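A smoke test therefore only needs to set the content type and pass the bytes. A hedged sketch around boto3's invoke_endpoint, with the request built by a small helper so the wire format stays visible (the helper and endpoint name are illustrative):

```python
def build_invoke_args(endpoint_name: str, image_bytes: bytes) -> dict:
    """Arguments for sagemaker-runtime invoke_endpoint with raw image bytes."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/x-image",
        "Body": image_bytes,
    }

# Actual call (requires AWS credentials and a live endpoint):
# import boto3, json
# runtime = boto3.client("sagemaker-runtime")
# resp = runtime.invoke_endpoint(**build_invoke_args("my-endpoint", img_bytes))
# result = json.loads(resp["Body"].read())
# result carries the schema from final_inference.py:
# predicted_label, predicted_index, confidence, probabilities, class_labels
```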
Lambda supports multiple input formats:
- HTTP URL input: { "image_url": "https://..." }
- S3 URI input: { "s3_uri": "s3://bucket/key.jpg" }
- API Gateway-style payload: a standard event wrapper whose JSON body contains one of the above.
Lambda fetches bytes, calls invoke_endpoint, and returns the model response JSON.
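The three payload shapes can be normalized with a small stdlib-only parser before any bytes are fetched. A sketch of the routing logic (the function name is illustrative; the real handler in Lambda.py may differ):

```python
import json

def extract_image_source(event: dict) -> tuple[str, str]:
    """Return (kind, value) where kind is 'image_url' or 's3_uri'.

    Handles both a direct invocation payload and an API Gateway-style
    event whose string 'body' wraps the same JSON keys.
    """
    # Unwrap an API Gateway proxy event: the payload is a JSON string body.
    if "body" in event and isinstance(event["body"], str):
        event = json.loads(event["body"])
    for key in ("image_url", "s3_uri"):
        if key in event:
            return key, event[key]
    raise ValueError("event must contain 'image_url' or 's3_uri'")
```

After routing, the handler fetches bytes over HTTP or from S3 and forwards them to invoke_endpoint unchanged.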
After testing inference, delete resources:
- SageMaker Endpoint
- SageMaker EndpointConfig
- SageMaker Model objects (if created during repack/deploy)
This repository includes defensive cleanup logic in the notebook, and the project logs confirm endpoint and config deletion after validation.
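The defensive pattern is to attempt each deletion independently, so one already-deleted resource does not abort the rest. A sketch with the SageMaker client injected for testability (resource names are placeholders; the notebook's cleanup cell may differ):

```python
def delete_inference_resources(sm_client, endpoint_name: str,
                               config_name: str, model_name: str) -> dict:
    """Best-effort deletion of endpoint, endpoint config, and model.

    Returns a map of resource -> True/False so the caller can log
    exactly what was removed versus what was already gone.
    """
    results = {}
    for label, call in [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint_config", lambda: sm_client.delete_endpoint_config(EndpointConfigName=config_name)),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]:
        try:
            call()
            results[label] = True
        except Exception:
            # Resource may already be deleted; keep going.
            results[label] = False
    return results

# Usage (requires credentials):
# import boto3
# delete_inference_resources(boto3.client("sagemaker"),
#                            "my-endpoint", "my-endpoint-config", "my-model")
```

Deleting the endpoint stops the per-hour hosting charge; the config and model objects cost nothing to keep but removing them keeps the account tidy.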
Minimum recommended evidence (screenshots):
- Images/All_Training_Jobs.png (proof of completed jobs)
- Images/Hyperparameter_training_jobs.png (HPO evidence)
- Images/Best_Model_Metrics.png or Images/Final_model.png (final model artifact / metrics)
- Images/CloudWatch_SS1.png + Images/CloudWatch_SS2.png (endpoint monitoring)
- Images/Lambda_Architecture.png + Images/Lambda_HTTP_URL_Test.png + Images/Lambda_S3_URI_Test.png (Lambda validation)
- Images/CostExplorer_SS1.png + Images/CostExplorer_SS2.png (cost evidence + cleanup discipline)
Amazon SageMaker Distributed Training (Developer Guide) https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html
Amazon SageMaker Training Toolkit (How SageMaker runs user scripts, env vars, MPI launch) https://github.com/aws/sagemaker-training-toolkit
Amazon SageMaker Hyperparameter Tuning (Developer Guide) https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html
SageMaker API: CreateHyperParameterTuningJob https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html
SageMaker API: CreateTrainingJob (EnableManagedSpotTraining / MaxRuntimeInSeconds / MaxWaitTimeInSeconds) https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html
Troubleshoot Real-Time Inference / Deployment (Health checks, common failures) https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-troubleshoot.html
SageMaker Hosting: Endpoints (Concepts) https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html
boto3 SageMaker Runtime: invoke_endpoint https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html
AWS Lambda Python handler basics https://docs.aws.amazon.com/lambda/latest/dg/python-handler.html
Torchvision ResNet model documentation https://pytorch.org/vision/stable/models/resnet.html
Torchvision model weights enums (pretrained → weights migration) https://pytorch.org/vision/stable/models.html
Torchvision ImageFolder dataset https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html
PyTorch tutorial: Saving and loading models (state_dict, checkpoints) https://pytorch.org/tutorials/beginner/saving_loading_models.html
Amazon CloudWatch Logs (concepts) https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html
Brejesh Balakrishnan
LinkedIn: https://www.linkedin.com/in/brejesh-balakrishnan-7855051b9/
If you have questions, suggestions, or want to discuss improvements (better backbones, augmentation, focal loss, class weighting, and stronger evaluation), feel free to connect on LinkedIn.