Feature Request: Upgrade Observability Infrastructure (Grafana + Prometheus)
Is your feature request related to a problem? Please describe.
The current Open Libra observability infrastructure lacks:
- Publicly accessible metrics/dashboards for community transparency
- Scalable node monitoring as the network grows
- Modernized dashboards based on 2022 templates and Aptos-inspired designs
- Secure, dynamic target management for Prometheus (currently IP-based)
This makes it difficult for node operators and the community to monitor network health effectively.
Describe the solution you'd like
Grafana Implementation
- Public Access:
  - Hosted Grafana instance with view-only permission for the public
  - Anonymous access to selected dashboards (SSL secured via Certbot)
- Dashboards:
  - Use the existing 2022 dashboards as the foundation and modernize them
  - Incorporate design patterns from Aptos dashboards (reference implementation)
  - Modular panels for node health, network performance, and consensus metrics
- Admin Team:
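As a rough illustration of the anonymous, view-only access described under Public Access, the sketch below renders the `[auth.anonymous]` section of a Grafana configuration file. The helper name `anonymous_viewer_config` is hypothetical, and the organization name is an assumption (Grafana's default); a real deployment would merge this into its existing `grafana.ini` rather than generate it standalone.

```python
import configparser
import io


def anonymous_viewer_config(org_name: str = "Main Org.") -> str:
    """Render a grafana.ini fragment enabling anonymous, read-only access.

    org_name is an assumption: it must match an organization that
    actually exists in the target Grafana instance.
    """
    cfg = configparser.ConfigParser()
    cfg["auth.anonymous"] = {
        "enabled": "true",      # allow unauthenticated visitors
        "org_name": org_name,   # organization anonymous users are placed in
        "org_role": "Viewer",   # view-only role for the public
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(anonymous_viewer_config())
```

Restricting the public to selected dashboards would still need to be handled through folder or dashboard permissions inside Grafana; this fragment only covers the anonymous login itself.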
Prometheus Implementation
- Node Monitoring:
  - Start with cooperative validators/VFNs
  - Adopt UUID system (based on node public keys) to replace IP addresses
  - Dynamic target updates via peer list (inspired by David Boreham’s design)
- Scalability:
  - Borrow from 0L seed-peers for peer discovery
  - Scheduled scraping jobs with secure authentication
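One possible shape for the UUID and dynamic-target ideas above, as a minimal sketch rather than the proposed implementation: derive a stable UUID from each node's public key and emit target groups in the JSON format read by Prometheus file-based service discovery. The peer-list fields (`host`, `public_key`, `role`), the namespace string, and the metrics port `9101` are all assumptions for illustration. Prometheus still needs a reachable address to scrape, so here the UUID serves as the node's identity label, not as a replacement for the scrape address itself.

```python
import json
import uuid

# Hypothetical fixed namespace so the same public key always maps to the same UUID.
NODE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "nodes.example.invalid")


def node_uuid(public_key_hex: str) -> str:
    """Derive a deterministic UUID (version 5) from a node's public key."""
    return str(uuid.uuid5(NODE_NAMESPACE, public_key_hex.lower()))


def build_file_sd(peers: list[dict], metrics_port: int = 9101) -> list[dict]:
    """Convert a peer list into Prometheus file_sd target groups.

    Each peer is assumed to look like:
        {"host": "...", "public_key": "<hex>", "role": "validator"}
    """
    groups = []
    for peer in peers:
        groups.append({
            "targets": [f"{peer['host']}:{metrics_port}"],
            "labels": {
                "node_id": node_uuid(peer["public_key"]),
                "role": peer.get("role", "validator"),
            },
        })
    return groups


def render_file_sd(peers: list[dict]) -> str:
    """Serialize target groups as JSON for a file_sd targets file."""
    return json.dumps(build_file_sd(peers), indent=2, sort_keys=True)


if __name__ == "__main__":
    example_peers = [
        {"host": "192.0.2.10", "public_key": "ab12cd34", "role": "validator"},
        {"host": "192.0.2.11", "public_key": "ef56ab78", "role": "vfn"},
    ]
    print(render_file_sd(example_peers))
```

A scheduled job could regenerate this file whenever the peer list changes; Prometheus re-reads file_sd target files on change, so no restart would be needed. Dashboards can then group and display nodes by the `node_id` label.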
Describe alternatives you've considered
- Third-Party SaaS (e.g., Datadog, New Relic):
  - Rejected due to cost and desire for community-controlled infrastructure
- IP-Based Prometheus Targets:
  - Current system is brittle and exposes node IPs; UUIDs are more secure
- Static Dashboards:
  - Considered, but modular designs were chosen to accommodate future metrics
Additional Context
References
- Example Prometheus config: 0l-monitoring
- Peer list management: seed-peers
Screenshots (Mockups)
(Note: Attach dashboard mockups or Aptos reference screenshots here in the GitHub issue.)
Implementation Phases
- Phase 1: Grafana setup + basic dashboards
- Phase 2: UUID-based Prometheus integration
- Phase 3: Alerting + scaling optimizations