Full-stack Release & Incident Management System for tracking projects, releases, incidents, and environment health.
- Backend: FastAPI, Python, SQLAlchemy, PostgreSQL
- Auth: JWT-based authentication, role-based access control (Admin/User)
- Frontend: React, TypeScript, Vite, Tailwind CSS, Recharts
- Infra: Docker, Docker Compose
- Projects with environment-specific releases (Dev / QA / Prod)
- Incidents linked to releases with severity + status
- JWT auth with login/register and protected APIs
- Role-based access (admin vs normal user)
- Inline editing for releases and incidents
- Analytics:
  - Incident severity distribution
  - Incidents by status
  - Releases by environment and status
  - Environment health (Dev / QA / Prod success rate)
  - Incident trend by release
- CRUD with delete for projects, releases, incidents
- AI Triage workflow with cached hypotheses + draft on-call message
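
One plausible reading of the "environment health" metric in the analytics list is the share of successful releases per environment. The sketch below illustrates that reading only; the `environment` and `status` field names and the `"success"` status value are assumptions, not taken from the project's schema.

```python
from collections import defaultdict


def success_rate_by_environment(releases):
    """Return {environment: successful releases / total releases}.

    Assumes each release is a dict with hypothetical "environment" and
    "status" keys, and that "success" marks a successful release.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for release in releases:
        env = release["environment"]
        totals[env] += 1
        if release["status"] == "success":
            successes[env] += 1
    # Only environments with at least one release appear, so no division by zero.
    return {env: successes[env] / totals[env] for env in totals}
```

Environments with no releases are simply absent from the result, which lets a dashboard distinguish "no data" from a 0% success rate.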
The AI Triage Agent generates grounded, ranked root-cause hypotheses when an incident is selected, and drafts an escalation-ready Slack message for the on-call responder. Results are cached per incident to avoid re-running the model on every page view, and responders can submit thumbs-up/down feedback for future tuning. The goal is to speed initial incident triage while keeping the reasoning inspectable.
flowchart TD
A["Incident selected in UI"] --> B["GET /api/incidents/{id}/triage"]
B --> C{"Fresh cached triage < 1h?"}
C -- Yes --> D["Return triage_results row"]
C -- No --> E["Generate triage"]
E --> F["Fetch incident context + last 24h audit logs"]
F --> G["LLM call with structured JSON schema"]
G --> H["Validate citation IDs against input audit logs"]
H --> I["Persist triage_results"]
I --> D
D --> J["Render hypotheses, sources, draft Slack message"]
J --> K["POST /api/triage/{triage_id}/feedback"]
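
The cache branch of the flowchart above could be sketched roughly as follows. This is a minimal in-memory illustration, not the project's actual code: `TriageResult`, `get_triage`, the `generate` callback, and the dict standing in for the `triage_results` table are all hypothetical names. Only the one-hour freshness window comes from the diagram.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

CACHE_TTL = timedelta(hours=1)  # "fresh" window shown in the flowchart


@dataclass
class TriageResult:
    # Hypothetical shape of one cached triage row.
    incident_id: int
    hypotheses: list
    draft_message: str
    created_at: datetime


def is_fresh(result: Optional[TriageResult], now: datetime) -> bool:
    """A cached result is fresh if it exists and is younger than CACHE_TTL."""
    return result is not None and now - result.created_at < CACHE_TTL


def get_triage(
    incident_id: int,
    cache: dict,
    generate: Callable[[int], TriageResult],
    now: Optional[datetime] = None,
) -> TriageResult:
    """Return the cached triage when fresh; otherwise generate and persist."""
    now = now or datetime.now(timezone.utc)
    cached = cache.get(incident_id)
    if is_fresh(cached, now):
        return cached
    result = generate(incident_id)  # context fetch + LLM call + validation
    cache[incident_id] = result     # stands in for persisting triage_results
    return result
```

Passing `now` explicitly keeps the freshness check easy to test without patching the clock.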
- docs/screenshots/triage-tab-loading.png
- docs/screenshots/triage-hypotheses-and-citations.png
- docs/screenshots/triage-insufficient-context.png
- docs/screenshots/triage-feedback-submitted.png
- The service assembles incident context and recent audit log entries (last 24 hours for affected services).
- The prompt instructs the model to cite only audit log IDs present in the provided input.
- The model returns structured JSON (`hypotheses[]`, `draft_message`) via JSON-schema response formatting.
- A post-processing validation step checks every cited ID against the input audit-log ID set.
- Any hypothesis containing invalid/nonexistent citations is dropped before persistence and response.
This keeps the output grounded in observable system history instead of free-form speculation.
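
The citation check described above might look something like this sketch. The `citations` key and the `drop_ungrounded` name are assumptions for illustration; the source only states that hypotheses citing IDs outside the input set are dropped.

```python
def drop_ungrounded(hypotheses, audit_log_ids):
    """Keep only hypotheses whose cited IDs all appear in the input audit logs.

    Assumes each hypothesis is a dict with a hypothetical "citations" list.
    A hypothesis with no citations is kept here; whether the real service
    does the same is not stated in the source.
    """
    valid_ids = set(audit_log_ids)
    return [
        hypothesis
        for hypothesis in hypotheses
        if all(cited in valid_ids for cited in hypothesis.get("citations", []))
    ]
```

For example, given input log IDs `log-1` and `log-2`, a hypothesis citing `log-1` and `log-9` would be dropped, since `log-9` was never supplied to the model.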
- The model can still be wrong even when grounded; hypotheses are suggestions, not root-cause proof.
- Quality depends heavily on audit-log coverage, consistency, and timestamp accuracy.
- Sparse incident metadata (missing affected services/notes) can reduce confidence or yield insufficient-context results.
- LLM latency/timeouts may occur; the API returns error states in these cases.
- Feedback is currently basic and should be combined with offline evaluation before production automation.
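
As an illustration of the timeout limitation above, an LLM call can be bounded and mapped to an explicit error state. This is a sketch under assumptions: `triage_with_timeout`, the `call_model` coroutine, and the response shape are hypothetical, not the project's API contract.

```python
import asyncio


async def triage_with_timeout(call_model, timeout_s=30.0):
    """Await a hypothetical async LLM call, returning an error state on timeout."""
    try:
        payload = await asyncio.wait_for(call_model(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface the timeout as data so the UI can render an error state.
        return {"status": "error", "reason": "llm_timeout"}
    return {"status": "ok", "triage": payload}
```

Returning a structured error rather than raising keeps the triage tab able to show a retry prompt instead of a generic failure.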
Backend:
cd backend
python -m venv venv
venv\Scripts\activate  # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Frontend:
cd frontend
npm install
npm run dev