Sentinel AI is an autonomous MLOps "guardian" built on Google Cloud Vertex AI that moves beyond reactive alerting to create a predictive, self-healing, and continuously optimizing ecosystem for machine learning models. The system prevents incidents before they occur through advanced pattern analysis and multi-agent orchestration.
✅ PRODUCTION READY - Sentinel AI is fully deployed on GCP with predictive capabilities!
- ✅ Complete Multi-Agent System: All 6 specialized agents operational (Conductor, Diagnostic, Verification, Reporting, Remediation, Predictive)
- ✅ GCP Production Deployment: Successfully deployed and running on Google Cloud Platform
- ✅ Predictive Incident Prevention: Proactive analysis prevents incidents before they occur
- ✅ Application Default Credentials: Secure authentication configured for GCP resources
- ✅ End-to-End Demonstrations: Both reactive and predictive scenarios fully functional
- ✅ Agent Communication: Message bus and agent registry for inter-agent coordination
- ✅ Graceful Startup/Shutdown: Proper initialization and cleanup of all components
- ✅ Comprehensive Logging: Structured logging and monitoring throughout the system
- 🔮 Data Drift Prediction: Analyzes feature distribution trends to predict critical drift
- ⚡ Performance Degradation Prediction: Forecasts system failures before they occur
- 💾 Resource Exhaustion Prediction: Prevents capacity issues through usage pattern analysis
- 🤖 Model Staleness Prediction: Predicts optimal retraining schedules based on performance decay
- 🎯 Proactive Alert System: Time-to-incident estimation with prevention recommendations
- 🛠️ Automated Prevention: Actionable steps to prevent predicted incidents
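To make the time-to-incident idea concrete, here is a minimal sketch (not the production code) that extrapolates a drift metric toward a critical threshold using a linear trend; the function name, threshold, and hourly cadence are illustrative assumptions.

```python
# Hypothetical sketch: estimate time-to-incident by extrapolating a drift metric trend.
# Names, the 0.25 threshold, and the linear-trend assumption are illustrative only.
import numpy as np

def estimate_hours_to_critical_drift(drift_scores, critical_threshold=0.25):
    """Fit a linear trend to hourly drift scores and extrapolate to the threshold.

    Returns the estimated number of hours until the threshold is crossed,
    or None if the trend is flat or improving.
    """
    hours = np.arange(len(drift_scores))
    slope, intercept = np.polyfit(hours, drift_scores, deg=1)
    if slope <= 0:
        return None  # drift is stable or decreasing; no incident predicted
    hours_at_threshold = (critical_threshold - intercept) / slope
    return max(hours_at_threshold - hours[-1], 0.0)

# Example: drift measured hourly, trending upward toward the 0.25 threshold
print(estimate_hours_to_critical_drift([0.08, 0.10, 0.13, 0.15, 0.18]))
```

In the full system, an estimate like this would be attached to a proactive alert together with prevention recommendations.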
- ✅ GCP Resources: BigQuery datasets, Cloud Storage buckets, Pub/Sub topics configured
- ✅ Cloud Run Service: Auto-scaling deployment with 4GB RAM, 1 CPU
- ✅ API Integration: Vertex AI, BigQuery, Pub/Sub, Cloud Storage, Monitoring APIs enabled
- ✅ Security: Application Default Credentials for secure authentication
- ✅ Monitoring: Real-time logging and performance tracking
Maximize model performance and business value while minimizing human intervention and operational costs through intelligent automation and multi-agent orchestration.
- Proactive & Predictive: Shift from "model has drifted" to "model will drift, here's the plan"
- Generative Diagnosis & Solutions: Leverage Gemini's reasoning for root cause analysis and novel solutions
- Business-Aware Optimization: Every action evaluated for cost-benefit and business impact
- Governed Autonomy: Configurable human-in-the-loop checkpoints for safety and control
- Explainable MLOps: Complex technical events translated to human-readable summaries
- Adaptive Learning: System learns from past incidents to improve future decisions
The system consists of six specialized agents, with the Conductor Agent acting as the central orchestrator:
- Central "brain" managing incident lifecycle (DETECTED → DIAGNOSING → RESOLVED)
- Intelligent task delegation to specialized agents
- State management and governance rule application
- Status: ✅ Deployed and Operational
- Tools: Vertex AI Agent Builder, LangChain, Vertex AI Vector Search
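A minimal sketch of the lifecycle the Conductor manages is shown below; only DETECTED, DIAGNOSING, and RESOLVED come from the description above, and the intermediate states and transition rules are assumptions for illustration.

```python
# Illustrative sketch of the incident lifecycle managed by the Conductor.
# REMEDIATING and VERIFYING are assumed intermediate states, not documented ones.
from enum import Enum

class IncidentState(Enum):
    DETECTED = "detected"
    DIAGNOSING = "diagnosing"
    REMEDIATING = "remediating"   # assumed intermediate state
    VERIFYING = "verifying"       # assumed intermediate state
    RESOLVED = "resolved"

ALLOWED_TRANSITIONS = {
    IncidentState.DETECTED: {IncidentState.DIAGNOSING},
    IncidentState.DIAGNOSING: {IncidentState.REMEDIATING},
    IncidentState.REMEDIATING: {IncidentState.VERIFYING},
    IncidentState.VERIFYING: {IncidentState.RESOLVED},
}

def advance(current: IncidentState, target: IncidentState) -> IncidentState:
    """Move an incident to the next state, rejecting invalid transitions."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.name} -> {target.name}")
    return target
```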
- Predictive Agent: Proactive incident prevention through pattern analysis and trend forecasting
- Predicts data drift, performance degradation, resource exhaustion, and model staleness
- Generates time-to-incident estimates with prevention recommendations
- Status: ✅ Deployed and Operational
- Tools: Statistical Analysis, Machine Learning Models, Gemini Pro
- Diagnostic Agent: Uses Gemini's reasoning to diagnose root causes of incidents
- Synthesizes monitoring data, model lineage, and historical context
- Status: ✅ Deployed and Operational
- Tools: Gemini Pro, Vertex AI ML Metadata, Vertex AI Workbench
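As a rough illustration of how a root-cause question could be posed to Gemini through the Vertex AI SDK, here is a hedged sketch; the project ID, model name, and prompt are placeholders rather than the Diagnostic Agent's actual implementation.

```python
# Hedged sketch of asking Gemini for root-cause hypotheses via the Vertex AI SDK.
# Project, location, model name, and prompt are placeholders; the Diagnostic Agent's
# real prompting and context assembly live in the project's src/agents/ code.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")  # placeholder model name

prompt = (
    "A production model's AUC dropped from 0.91 to 0.83 over 48 hours while the "
    "feature `days_since_last_login` shifted upward. List the most likely root "
    "causes and the evidence needed to confirm each."
)
response = model.generate_content(prompt)
print(response.text)
```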
- Remediation Agent: Formulates concrete remediation plans for identified incidents
- Suggests retraining, hyperparameter optimization, or feature engineering
- Status: ✅ Deployed and Operational
- Tools: Gemini Pro, Vertex AI Pipelines, Vertex AI Vizier
- Verification Agent: Validates remediation effectiveness and system health
- Ensures fixes resolve issues without introducing new problems
- Status: ✅ Deployed and Operational
- Tools: Vertex AI Model Monitoring, BigQuery ML, Statistical Analysis
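A verification step can be as simple as comparing pre- and post-remediation metrics against tolerances; the sketch below assumes higher-is-better metrics and illustrative thresholds.

```python
# Illustrative post-remediation check: the fix must improve the target metric
# without regressing the others. Assumes all metrics are higher-is-better;
# metric names and tolerances are made up for the example.
def remediation_verified(before: dict, after: dict,
                         target: str = "auc", min_gain: float = 0.01,
                         max_regression: float = 0.005) -> bool:
    if after[target] - before[target] < min_gain:
        return False  # the fix did not move the target metric enough
    for name, baseline in before.items():
        if name != target and baseline - after[name] > max_regression:
            return False  # another metric regressed beyond tolerance
    return True

print(remediation_verified({"auc": 0.83, "precision": 0.71},
                           {"auc": 0.90, "precision": 0.72}))
```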
- Reporting Agent: Generates human-readable reports and summaries
- Translates technical incidents into business-impact language
- Status: ✅ Deployed and Operational
- Tools: Gemini Pro, BigQuery Analytics, Visualization APIs
- Tools: BigQuery, Gemini Pro
- Executes approved remediation plans
- Manages training pipelines, model evaluation, and safe production rollouts
- Tools: Vertex AI Pipelines, Model Registry, Endpoints
- Generates natural language summaries for stakeholders
- Makes system actions explainable and transparent
- Tools: Gemini Pro, Looker, BigQuery
- Python 3.9+
- uv package manager (recommended) or pip
- Google Cloud Project with Vertex AI API enabled (for production)
- Service account with appropriate permissions (for production)
```bash
# Clone the repository
git clone <repository-url>
cd sentinel-ai

# Install dependencies using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt
```

The project includes a fully functional demonstration that showcases the complete Sentinel AI system:

```bash
# Run the conductor agent demonstration
python examples/demo_conductor.py
```

The demo will:
- Initialize all core components (MessageBus, AgentRegistry, ConductorAgent)
- Create and start 4 specialized agents (remediation, verification, diagnostic, reporting)
- Run 3 demonstration scenarios:
  - Data Drift Detection - Simulates model drift incident
  - Performance Degradation - Simulates model performance issues
  - Model Error Handling - Simulates model execution errors
- Show complete incident lifecycle management
- Demonstrate multi-agent orchestration and communication
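The demo coordinates agents through the MessageBus and AgentRegistry; the real classes live under `src/communication/` and `src/agents/`, so the snippet below is only a simplified stand-in for the publish/subscribe pattern, with assumed names and signatures.

```python
# Simplified stand-in for the message-bus pattern used to coordinate agents.
# Class names, topics, and signatures are assumptions; see src/communication/ for the real code.
from collections import defaultdict
from typing import Callable, Dict, List

class SimpleMessageBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(message)

# A conductor-style component delegating a detected incident to a diagnostic handler
bus = SimpleMessageBus()
bus.subscribe("incident.detected", lambda msg: print("Diagnosing:", msg["type"]))
bus.publish("incident.detected", {"type": "data_drift", "model": "churn-model-v3"})
```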
- Python 3.9+
- Google Cloud SDK (`gcloud`)
- `uv` package manager (recommended) or `pip`

```bash
# Clone the repository
git clone <repository-url>
cd sentinel-ai

# Install dependencies with uv (recommended)
uv sync

# Or with pip
pip install -r requirements.txt
```

```bash
# Login and set up Application Default Credentials
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
```

Reactive Demo (Original System):

```bash
uv run python examples/demo_conductor.py
```

Predictive Demo (NEW!):

```bash
uv run python examples/demo_predictive.py
```

```bash
# Automated GCP setup and deployment
./scripts/setup_gcp.sh
uv run python deploy_all_agents.py
```

- ✅ GCP Project: `mdepew-assets`
- ✅ Cloud Run Service: `mdepew-agent` (auto-scaling, 4GB RAM, 1 CPU)
- ✅ Authentication: Application Default Credentials
- ✅ APIs Enabled: Vertex AI, BigQuery, Pub/Sub, Cloud Storage, Monitoring
- ✅ Resources Created: BigQuery datasets, Storage buckets, Pub/Sub topics
- Cloud Run Service: https://mdepew-agent-194822035697.us-central1.run.app
- GCP Console: https://console.cloud.google.com/run?project=mdepew-assets
- BigQuery Data: https://console.cloud.google.com/bigquery?project=mdepew-assets
- Monitoring: https://console.cloud.google.com/monitoring?project=mdepew-assets
- `scripts/setup_gcp.sh` - Automated GCP resource setup
- `deploy_all_agents.py` - Complete agent deployment and testing
- `scripts/deploy_cloud_run.sh` - Cloud Run deployment
- `Dockerfile` - Production container image
```
sentinel-ai/
├── src/
│   ├── agents/          # Agent implementations
│   ├── services/        # Vertex AI service integrations
│   ├── models/          # Data models and schemas
│   ├── communication/   # Agent communication patterns
│   ├── governance/      # Autonomy and governance controls
│   ├── config/          # Configuration management
│   └── utils/           # Utility functions
├── examples/            # Usage examples and demos
├── tests/               # Test suite
├── docs/                # Documentation
├── config.yaml          # Main configuration file
├── requirements.txt     # Python dependencies
└── README.md            # This file
```
The system uses Python-based configuration management located in `src/config/settings.py`. The configuration includes:
- GCP Settings: Project ID, regions, service configurations
- Agent Configuration: Individual agent settings and thresholds
- Governance: Autonomy levels and approval requirements
- Logging: Structured logging with contextual information
- Communication: Message bus and agent registry settings
- `src/config/settings.py` - Main configuration management
- `pyproject.toml` - Project metadata and dependencies
- `requirements.txt` - Python package dependencies

The demo runs with default mock configurations and doesn't require GCP credentials. For production deployment, update the settings in `src/config/settings.py` with your specific GCP project details.
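The exact fields are defined in `src/config/settings.py`; the snippet below shows an assumed shape of how GCP and agent settings might be grouped, purely for orientation.

```python
# Assumed shape of the settings module; the real fields live in src/config/settings.py.
from dataclasses import dataclass, field

@dataclass
class GCPSettings:
    project_id: str = "your-project-id"    # placeholder, not the deployed project
    region: str = "us-central1"
    use_mock_services: bool = True          # demos run against mocks by default

@dataclass
class AgentSettings:
    drift_threshold: float = 0.25           # illustrative threshold
    autonomy_level: str = "supervised"      # report_only | supervised | autonomous

@dataclass
class Settings:
    gcp: GCPSettings = field(default_factory=GCPSettings)
    agents: AgentSettings = field(default_factory=AgentSettings)

settings = Settings()
print(settings.gcp.region, settings.agents.autonomy_level)
```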
- Detect & Alert: Drift Agent detects anomaly, alerts Conductor
- Diagnose: Conductor invokes Diagnostic Agent for root cause analysis
- Plan: Remediation Agent generates concrete action plan
- Analyze: Economist Agent performs cost-benefit analysis
- Govern & Decide: Conductor checks autonomy level and governance rules
- Execute & Verify: Verification Agent executes plan and manages rollout
- Report & Learn: Reporting Agent provides summaries, system learns from incident
- Autonomy Levels: Report Only, Supervised Execution, Fully Autonomous
- Human-in-the-Loop: Configurable checkpoints for critical operations
- Cost Controls: Automatic cost-benefit analysis before expensive operations
- Audit Trail: Complete logging of all decisions and actions
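The governance gate might look roughly like the following sketch; the level names mirror the list above, while the cost ceiling and escalation logic are illustrative assumptions.

```python
# Illustrative governance gate; level names mirror the autonomy levels above,
# the cost ceiling and escalation logic are assumptions.
from enum import Enum

class AutonomyLevel(Enum):
    REPORT_ONLY = 1
    SUPERVISED_EXECUTION = 2
    FULLY_AUTONOMOUS = 3

def requires_human_approval(level: AutonomyLevel, estimated_cost_usd: float,
                            cost_ceiling_usd: float = 500.0) -> bool:
    """Return True when a remediation plan must wait for a human decision."""
    if level is AutonomyLevel.REPORT_ONLY:
        return True
    if level is AutonomyLevel.SUPERVISED_EXECUTION:
        return True  # every execution is checkpointed
    # Fully autonomous, but expensive actions still escalate (assumed cost control)
    return estimated_cost_usd > cost_ceiling_usd

print(requires_human_approval(AutonomyLevel.FULLY_AUTONOMOUS, estimated_cost_usd=120.0))
```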
- Prometheus metrics for system health
- Structured logging with correlation IDs
- Health check endpoints
- Integration with Google Cloud Monitoring
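For example, structured logging with correlation IDs can be sketched with the standard library alone; the deployed system's logger configuration may differ.

```python
# Minimal sketch of structured JSON logging with a correlation ID, stdlib only;
# the deployed system's handlers and fields may differ.
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("sentinel")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

correlation_id = str(uuid.uuid4())
logger.info("drift check completed", extra={"correlation_id": correlation_id})
```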
```bash
# Run the test suite
pytest tests/

# Format code and sort imports
black src/ tests/
isort src/ tests/
```

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and support:
- Create an issue in the repository
- Check the documentation
- Review the examples
Sentinel AI - Autonomous MLOps for the Future 🚀