docs: KEEP-573 four-dashboard observability spec for review #1318

Merged: OleksandrUA merged 2 commits into staging from KEEP-573-four-dashboard-spec on Jun 11, 2026

@OleksandrUA

Draft of the four-dashboard observability spec called out in KEEP-573 Phase 1 E.

Splits the current single "KeeperHub" dashboard by audience: SLO exec view, platform on-call view, per-org customer-support view, and growth/revenue view. Each dashboard section in the doc gives variables, a panel list with PromQL/LogQL queries, linked alerts, a suggested owner, and open questions.

Closes the last unticked Phase 1 acceptance item on the parent ticket. Implementation lands in follow-up PRs in grafana/keeperhub-dashboards/git-sync/ (new path adopted 2026-05-20).

Review asks

  • Sanity-check the panel list per dashboard - did I miss anything the audience actually needs?
  • The five "open questions" at the bottom of the doc need decisions before Phase 2 starts.
  • Owner suggestions at the bottom of each dashboard section - confirm or reassign.

Test plan

  • Spec reviewed by Simon
  • Open questions resolved
  • Owners finalized per dashboard

Draft spec for the four-dashboard rebuild called out in KEEP-573 Phase 1 E.
Splits the current single "KeeperHub" dashboard by audience:

- A. Managed Client SLO (exec + Sky/Ajna account team)
- B. Platform Health (TechOps/DevOps on-call)
- C. Customer Workflows (per-org support debugging)
- D. Growth + Revenue (founders + revenue side)

Each dashboard section: variables, panel list with PromQL/LogQL,
linked alerts, owner suggestion, open questions. Implementation plan
targets the new grafana git-sync path adopted on 2026-05-20, with
files landing under grafana/keeperhub-dashboards/git-sync/.
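
The per-dashboard structure described above could be modelled roughly as follows. This is only a sketch: the `Panel` and `DashboardSpec` names and their fields are hypothetical and mirror the section layout listed in this message, not any real schema in the repo. Variables, panels, alerts, and owners are left empty because the spec's actual values are not shown in this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Panel:
    title: str
    query: str  # PromQL or LogQL expression (hypothetical field)


@dataclass
class DashboardSpec:
    key: str        # "A".."D", as listed in this message
    name: str
    audience: str
    variables: List[str] = field(default_factory=list)
    panels: List[Panel] = field(default_factory=list)
    linked_alerts: List[str] = field(default_factory=list)
    suggested_owner: Optional[str] = None  # to be confirmed in review
    open_questions: List[str] = field(default_factory=list)


# The four dashboards and their audiences, taken from the list above.
DASHBOARDS = [
    DashboardSpec("A", "Managed Client SLO", "exec + Sky/Ajna account team"),
    DashboardSpec("B", "Platform Health", "TechOps/DevOps on-call"),
    DashboardSpec("C", "Customer Workflows", "per-org support debugging"),
    DashboardSpec("D", "Growth + Revenue", "founders + revenue side"),
]

for spec in DASHBOARDS:
    print(f"{spec.key}. {spec.name} ({spec.audience})")
```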

Open questions for the review at the bottom of the doc.

Closes the Phase 1 E acceptance item on the parent ticket. Owners
will be finalized in this PR's review.

A bulk error_type reclassification (backfill re-run, classifier rule
change, or manual SQL fix) moves counts between the error_type
label-value series of the DB-sourced gauge within a single scrape.
PromQL's increase() reads the gain on one series as new errors and
treats the loss on the other as a counter reset, producing a phantom
positive bump that contaminates SLI panels for one window length.
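
The effect can be reproduced with a small simulation. This is a minimal sketch, assuming a simplified model of increase() that sums sample-to-sample deltas and treats any drop as a counter reset; it ignores the extrapolation real PromQL performs. The label values `timeout` and `rpc` and all sample values are hypothetical.

```python
def simple_increase(samples):
    """Approximate PromQL increase() over one window, without extrapolation.

    A drop between consecutive samples is treated as a counter reset,
    so the post-drop value is counted in full as new increase.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical scrapes of the gauge, one series per error_type value.
# Between the 2nd and 3rd scrape, 40 rows move from "timeout" to "rpc".
timeout_series = [100, 100, 60, 60]
rpc_series = [20, 20, 60, 60]

# No rows were added overall, so the true number of new errors is zero.
true_new_errors = (timeout_series[-1] + rpc_series[-1]) - (
    timeout_series[0] + rpc_series[0]
)
phantom = simple_increase(timeout_series) + simple_increase(rpc_series)

print(true_new_errors)  # 0
print(phantom)          # 100.0
```

In this model the losing series contributes its full post-drop value (60) and the gaining series contributes its gain (40), so the summed increase is 100 even though no new errors occurred.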

Documents the symptom + how to recognise it so future engineers don't
chase a phantom incident, and references the KEEP-592 analysis.
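
One possible way to flag the symptom, shown purely as an illustration: compare two consecutive scrapes and check whether counts moved between label values while the total stayed flat. The `looks_like_reclassification` helper and its inputs are hypothetical and are not taken from the KEEP-592 analysis.

```python
def looks_like_reclassification(before, after, tolerance=0):
    """Return True if per-label counts moved but the overall total did not.

    `before` and `after` map error_type label values to gauge readings
    from two consecutive scrapes (hypothetical input shape).
    """
    labels = set(before) | set(after)
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in labels}
    # At least one series must have lost rows and another gained rows.
    moved = any(d < 0 for d in deltas.values()) and any(
        d > 0 for d in deltas.values()
    )
    # A pure relabel leaves the sum across label values unchanged.
    return moved and abs(sum(deltas.values())) <= tolerance


# 40 rows relabelled: per-label counts change, total stays at 120.
print(looks_like_reclassification({"timeout": 100, "rpc": 20},
                                  {"timeout": 60, "rpc": 60}))
# Real growth: one series rises and nothing falls.
print(looks_like_reclassification({"timeout": 100, "rpc": 20},
                                  {"timeout": 100, "rpc": 25}))
```

A nonzero `tolerance` would allow for a few genuine new errors arriving in the same scrape as the relabel.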

OleksandrUA merged commit 8a1f995 into staging on Jun 11, 2026 (41 checks passed).
OleksandrUA deleted the KEEP-573-four-dashboard-spec branch on Jun 11, 2026 at 09:49.

@github-actions

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1318
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully
