Autonomous database health monitoring and self-healing system.
AegisDB detects data quality failures, diagnoses root cause using AI, proposes fixes for human review, and applies them to production — safely, with a full audit trail.
Frontend: furyfist/aegisdb-frontend | Demo: Watch on YouTube
- Overview
- Architecture
- Features
- Tech Stack
- Prerequisites
- Quick Start
- Environment Variables
- Running the Services
- End-to-End Pipeline Walkthrough
- Project Structure
- API Reference
- Known Limitations
- Contributing
Production databases silently accumulate data quality failures. A NULL in a critical column, a referential integrity breach, a value outside expected range — OpenMetadata flags these, but someone still has to write the fix, test it, and run it manually. That loop takes time and tribal knowledge most teams don't have.
AegisDB closes the loop automatically:
- OpenMetadata fires a webhook when a test fails
- AegisDB classifies the failure, diagnoses the root cause with Groq LLM and a RAG knowledge base
- It sandbox-tests a fix in an ephemeral Postgres container
- A human approves or rejects the proposed fix via Slack or the web UI
- On approval, AegisDB applies the fix to production inside a transaction with post-apply assertions
- Every outcome — applied, rolled back, escalated, or dry-run — is written to an audit log
- Successfully applied fixes annotate the affected column in OpenMetadata and update the Slack card with incident links
No tickets. No tribal knowledge lost. Every table AegisDB has ever touched becomes a self-documenting asset.
OpenMetadata (test failure)
│
▼
FastAPI Webhook
│ background task
▼
Detector → Diagnosis (Groq LLM + ChromaDB RAG)
│
├── low confidence → Escalation Stream
│
└── high confidence → Sandbox Preview (testcontainers Postgres)
│
▼
Proposal Created → Human Review (Slack / Web UI)
│
├── Rejected → Escalation, ChromaDB learns reason
│
└── Approved → Repair Agent (sandbox re-validation)
│
▼
Apply Agent → Production Postgres
│
├── COMMIT → Audit Log + OpenMetadata annotation + Slack update
└── ROLLBACK → Escalation
All inter-agent communication flows through Redis Streams — no direct agent-to-agent calls. Every message is consumer-group acknowledged (at-least-once delivery).
Safe-first execution Every fix is sandbox-tested in an ephemeral Postgres container before a human ever sees the proposal. Production applies run inside an explicit transaction with post-apply SQL assertions. Automatic rollback on assertion failure.
AI-powered diagnosis Groq LLM diagnoses root cause with full table schema context. ChromaDB provides RAG from past fixes — the system gets smarter with every approval.
Human-in-the-loop No fix reaches production without explicit human approval. Proposals include a data diff (before/after rows), confidence score, rollback SQL, and a plain-English explanation.
Conversational Slack bot Not just notifications — a full agent embedded in Slack. Posts live anomaly cards, self-updates as the pipeline progresses, answers questions in proposal threads with Gemini context, stores rejection reasons in ChromaDB.
Auto-documentation
On every successful production apply: a structured FixReport is written to Postgres, the fixed column's description in OpenMetadata is annotated, and the Slack card is updated with links to the incident timeline and audit entry.
Full audit trail
Every pipeline outcome — applied, dry run, rolled back, skipped, escalated — is written to _aegisdb_audit with the exact SQL that ran, row counts, assertion results, and confidence score.
Dry-run mode
Toggle DRY_RUN at runtime without restart. The full pipeline runs — diagnosis, sandbox, proposal, approval — but nothing touches production. Safe for demos and staging.
Database profiler Direct Postgres profiler that detects NULL violations, uniqueness violations, range outliers (IQR), referential integrity breaks, and email format errors across all tables. No OpenMetadata required for this path.
| Layer | Technology |
|---|---|
| Framework | FastAPI 0.115.0 + Uvicorn |
| Language | Python 3.12 |
| LLM | Groq (groq) |
| Vector store | ChromaDB 0.5.0 + all-MiniLM-L6-v2 embeddings |
| Message broker | Redis Streams (redis-py 5.0.7) |
| Target DB driver | asyncpg 0.30.0 (async) + SQLAlchemy 2.0 |
| Sandbox | testcontainers-python 4.14.2 (ephemeral Postgres) |
| Data catalog | OpenMetadata 1.6.1 |
| Validation | Pydantic 2.7.0 |
| Infrastructure | Docker Compose |
| Layer | Technology |
|---|---|
| Framework | Next.js (App Router) |
| Language | TypeScript |
| Styling | Tailwind CSS |
| Data fetching | React Query (polling) |
- Docker and Docker Compose (for OpenMetadata, Postgres, Redis)
- Python 3.12
- Node.js 18+ (for frontend)
- Groq API key — get one here
- 8 GB RAM minimum (OpenMetadata stack is heavy)
git clone https://github.com/your-org/aegisdb.git
cd aegisdbcp .env.example .envEdit .env — at minimum set:
GROQ_API_KEY=your_key_here
OM_ADMIN_PASSWORD=your_om_password
TARGET_DB_PASSWORD=aegisdb_passFull variable reference → Environment Variables
docker compose -f docker/openmetadata/docker-compose.yml up -dWait ~2 minutes for OpenMetadata to be healthy. Check at http://localhost:8585.
pip install -r requirements.txtuvicorn src.main:app --host 0.0.0.0 --port 8001 --reloadExpected output:
[Boot] ChromaDB ready
[Boot] Audit table ready
[Boot] Event store ready
[Boot] Profiling store ready
[Boot] Connection registry ready
[Boot] Proposal store ready
[Boot] Event bus ready
[Boot] Diagnosis consumer running
[Boot] Repair agent running
[Boot] Apply agent running
AegisDB ready ✓
python -m src.slack.botRequires SLACK_BOT_TOKEN and SLACK_APP_TOKEN in .env. See Slack Setup.
cd frontend
npm install
npm run devOpen http://localhost:3000.
Create a .env file in the project root. A .env.example is provided.
| Variable | Default | Required | Description |
|---|---|---|---|
GROQ_API_KEY |
— | ✅ | Groq API key |
LLM_MODEL |
llama-3.3-70b-versatile |
Groq model name | |
LLM_MAX_TOKENS |
4096 |
Max tokens per LLM call | |
OM_HOST |
http://localhost:8585 |
✅ | OpenMetadata server URL |
OM_ADMIN_EMAIL |
admin@open-metadata.org |
✅ | |
OM_ADMIN_PASSWORD |
— | ✅ | |
REDIS_HOST |
localhost |
✅ | |
REDIS_PORT |
6379 |
✅ | |
REDIS_STREAM_NAME |
aegisdb:events |
||
REDIS_REPAIR_STREAM |
aegisdb:repair |
||
REDIS_ESCALATION_STREAM |
aegisdb:escalation |
||
REDIS_APPLY_STREAM |
aegisdb:apply |
||
REDIS_CONSUMER_GROUP |
aegisdb-agents |
||
APP_HOST |
0.0.0.0 |
||
APP_PORT |
8000 |
Frontend proxies to 8001 | |
TARGET_DB_HOST |
localhost |
✅ | |
TARGET_DB_PORT |
5433 |
✅ | |
TARGET_DB_NAME |
northwind |
✅ | |
TARGET_DB_USER |
aegisdb_user |
✅ | |
TARGET_DB_PASSWORD |
aegisdb_pass |
✅ | |
CHROMA_PERSIST_DIR |
./data/chromadb |
||
CHROMA_COLLECTION |
aegisdb_fixes |
||
CONFIDENCE_THRESHOLD |
0.70 |
Min confidence to create proposal | |
SANDBOX_MAX_RETRIES |
3 |
||
SANDBOX_SAMPLE_ROWS |
500 |
Max rows seeded into sandbox | |
SANDBOX_DIFF_ROWS |
20 |
Max rows shown in data diff | |
SANDBOX_TIMEOUT_SECONDS |
120 |
||
DRY_RUN |
true |
Set to false for live applies |
|
APPLY_STATEMENT_TIMEOUT_MS |
30000 |
||
POST_APPLY_VERIFY |
true |
Run assertions after production apply | |
SLACK_BOT_TOKEN |
— | Slack only | xoxb-... |
SLACK_APP_TOKEN |
— | Slack only | xapp-... (Socket Mode) |
SLACK_CHANNEL_ID |
— | Slack only | Channel for anomaly cards |
Security: Never commit
.env. The.gitignorealready excludes it. Credentials submitted viaPOST /connectare never stored — onlyhost:port/dbis retained.
# Start OpenMetadata stack + target Postgres + Redis
docker compose -f docker/openmetadata/docker-compose.yml up -d
# Stop everything
docker compose -f docker/openmetadata/docker-compose.yml down
# Stop and wipe volumes (full reset)
docker compose -f docker/openmetadata/docker-compose.yml down -vpsql postgresql://aegisdb_user:aegisdb_pass@localhost:5433/northwind \
-f scripts/seed_dirty_data.sql# Check current state
curl http://localhost:8001/api/v1/status
# Toggle (switches between true/false in memory — resets on restart)
curl -X POST http://localhost:8001/api/v1/dry-run/togglecurl -X POST http://localhost:8001/api/v1/webhook/om-test-failure \
-H "Content-Type: application/json" \
-d '{
"eventType": "entityUpdated",
"entityType": "testCase",
"entityFQN": "northwind_svc.northwind.public.employees.region.employees_region_not_null",
"entity": {
"name": "employees_region_not_null",
"fullyQualifiedName": "northwind_svc.northwind.public.employees.region.employees_region_not_null",
"testCaseResult": {
"testCaseStatus": "Failed",
"result": "Found 5 null values in column region"
}
},
"timestamp": 1712345678000
}'# Stop the server first, then:
rm -rf ./data/chromadb/
# Restart — 5 bootstrap entries are re-seeded automaticallyOpenMetadata runs a data quality test and finds a NULL violation in employees.region. It fires a webhook to AegisDB at POST /webhook/om-test-failure. AegisDB returns 200 immediately and processes the event in the background.
Step 2 — Classify and diagnose
The detector classifies the failure as NULL_VIOLATION with severity low. The diagnosis agent queries ChromaDB for similar past fixes (RAG), builds a prompt with full table schema context, and calls Groq LLM. It returns a DiagnosisResult with a confidence score, root cause in plain English, a fix SQL, and rollback SQL.
Step 3 — Sandbox preview and proposal

If confidence ≥ 0.70, AegisDB spins up an ephemeral Postgres container, seeds it with up to 500 rows from the target table, runs the fix, validates assertions, and captures a before/after data diff. A proposal is created in pending_approval status.
Step 4 — Human review The proposal appears in the web UI (and Slack). The operator reviews: the plain-English root cause, the exact SQL that will run, a diff of affected rows, confidence score, and rollback SQL. They click Approve or Reject.
Step 5 — Apply to production On approval, the repair agent re-validates in a fresh sandbox (up to 3 retries). On pass, the apply agent executes the fix inside a Postgres transaction with a 30-second statement timeout. Post-apply assertions run. On pass → COMMIT. On fail → ROLLBACK → escalation.
Step 6 — Audit and documentation

The audit log records every detail: the SQL that ran, rows affected, assertion results, confidence score, and whether it was a dry run. The fixed column's description in OpenMetadata is annotated. The Slack card updates with links to the incident timeline and audit entry.
aegisdb/
├── src/
│ ├── agents/
│ │ ├── detector.py # Rule-based failure classifier
│ │ ├── diagnosis.py # LLM diagnosis via Groq
│ │ ├── profiler.py # Direct Postgres profiling engine
│ │ ├── repair.py # Sandbox orchestrator (post-approval)
│ │ └── apply.py # Production fix executor
│ ├── api/
│ │ ├── webhook.py # OpenMetadata webhook receiver
│ │ ├── dashboard.py # Audit / stream / escalation endpoints
│ │ ├── profiler_routes.py # Profiling API
│ │ ├── onboarding_routes.py # Database connect + re-profile
│ │ ├── proposal_routes.py # Proposal approve / reject / detail
│ │ └── table_routes.py # Live table data
│ ├── core/
│ │ ├── config.py # Settings via pydantic-settings (.env)
│ │ └── models.py # All Pydantic models
│ ├── db/
│ │ ├── audit_log.py # _aegisdb_audit CRUD
│ │ ├── event_store.py # _aegisdb_events CRUD
│ │ ├── vector_store.py # ChromaDB wrapper
│ │ ├── profiling_store.py # _aegisdb_profiling_reports CRUD
│ │ ├── proposal_store.py # _aegisdb_proposals CRUD
│ │ └── connection_registry.py
│ ├── sandbox/
│ │ ├── executor.py # testcontainers Postgres orchestration
│ │ └── validator.py # SQL assertions
│ ├── services/
│ │ ├── om_client.py # OpenMetadata REST client
│ │ ├── event_bus.py # Redis Streams publisher
│ │ ├── stream_consumer.py # Redis Streams consumer
│ │ └── om_ingestion.py # In-process ingestion
│ └── main.py # FastAPI app + boot sequence
├── frontend/ # Next.js frontend
├── docker/openmetadata/ # Docker Compose + env
├── scripts/
│ └── seed_dirty_data.sql # Test dirty data for Northwind
├── data/chromadb/ # ChromaDB persistent store (gitignored)
├── docs/
│ ├── ARCHITECTURE.md # Full backend architecture reference
│ ├── slack-integration.md # Slack bot technical reference
│ └── autodoc.md # Auto-documentation feature reference
├── .env.example
└── requirements.txt
Interactive docs available at http://localhost:8001/docs (Swagger UI) once the backend is running.
| Method | Route | Description |
|---|---|---|
POST |
/api/v1/webhook/om-test-failure |
Receive OpenMetadata test failure event |
GET |
/api/v1/health |
Health check |
GET |
/api/v1/status |
Full pipeline status |
GET |
/api/v1/audit |
Audit log |
GET |
/api/v1/audit/{event_id} |
Single audit entry |
GET |
/api/v1/escalations |
Escalation queue |
POST |
/api/v1/dry-run/toggle |
Toggle live / safe mode |
POST |
/api/v1/connect |
Connect and onboard a database |
GET |
/api/v1/connections |
List connections |
GET |
/api/v1/connections/{id}/health |
Database health score |
POST |
/api/v1/profile |
Direct profiling (no OM required) |
GET |
/api/v1/proposals/pending |
Pending approval queue |
POST |
/api/v1/proposals/{id}/approve |
Approve fix |
POST |
/api/v1/proposals/{id}/reject |
Reject fix |
Full API specification → docs/ARCHITECTURE.md § API Reference
AegisDB isn't just a notification bot — it's a conversational SRE embedded in your workspace.
Receive detailed anomaly cards with data diffs, confidence scores, and one-click approval buttons directly in Slack.
Ask questions in the incident thread. The bot uses Groq LLM and the ChromaDB RAG store to explain why a fix is safe or how it compares to past incidents.
Upon approval, cards update in real-time to provide a receipt of the fix, including exact rows healed and a link to the audit trail.
- Create a Slack app at
https://api.slack.com/apps - Enable Socket Mode — generates an App-Level Token (
xapp-...) - Add Bot Token Scopes:
chat:write,chat:write.public,channels:history,reactions:read - Install the app to your workspace — copy the Bot Token (
xoxb-...) - Set
SLACK_BOT_TOKEN,SLACK_APP_TOKEN,SLACK_CHANNEL_IDin.env - Start the bot:
python -m src.slack.bot
The bot connects via Socket Mode — no public URL or ngrok required.
- No authentication. All API endpoints are publicly accessible. Do not expose to the internet without an auth layer in front.
- No credential storage. Database credentials are validated at connect time but never persisted. Re-profiling always requires re-submitting credentials.
DRY_RUNis not persisted. Toggling via API resets to.envvalue on next restart. Default istrue.- Validator assertion coverage.
format_violationandschema_driftcategories skip assertions and returnpassed=True. Range bounds are hardcoded to[0, 9_999_999]. - Proposals stuck at
executing. Proposals that fail mid-pipeline are not automatically reset. Fire a new webhook to generate a fresh proposal. - Sandbox on Windows. testcontainers may produce Docker API 500 errors on container removal. Set
TESTCONTAINERS_HOST_OVERRIDE=localhostif needed. - OM FQN mismatch. Tables connected via
POST /connectmay not resolve in OpenMetadata table lookups. Pipeline continues withenrichment_success=false— does not block diagnosis or repair.
- Branch from
main - Backend changes → update
docs/ARCHITECTURE.mdif any API contract, env var, or data model changes - Frontend changes → update
lib/types.tsandlib/api.tsbefore touching components - All new endpoints must be reflected in the API route table in
ARCHITECTURE.md - Do not commit
.env,./data/chromadb/, or any credentials
MIT




