
Open Libra Observability upgrade: Community Grafana #403

@sasuke0787



Feature Request: Upgrade Observability Infrastructure (Grafana + Prometheus)

Is your feature request related to a problem? Please describe.

The current Open Libra observability infrastructure lacks:

  • Publicly accessible metrics/dashboards for community transparency
  • Scalable node monitoring as the network grows
  • Modernized dashboards based on 2022 templates and Aptos-inspired designs
  • Secure, dynamic target management for Prometheus (currently IP-based)

This makes it difficult for node operators and the community to monitor network health effectively.


Describe the solution you'd like

Grafana Implementation

  1. Public Access:

    • Hosted Grafana instance with view-only permissions for the public
    • Anonymous access to selected dashboards, served over HTTPS (TLS certificates obtained via Certbot)
  2. Dashboards:

    • Modernize the existing 2022 dashboards as a foundation
    • Incorporate design patterns from Aptos dashboards (reference implementation)
    • Modular panels for node health, network performance, and consensus metrics
  3. Admin Team:
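
As a rough illustration of the public-access item above, the sketch below renders a grafana.ini fragment that turns on anonymous, view-only access. The `[auth.anonymous]` and `[users]` settings are standard Grafana options; the function name and the choice of organization ("Main Org." is Grafana's default) are assumptions, and which dashboards end up visible would still depend on how the hosted instance organizes folders and permissions.

```python
import configparser
import io


def anonymous_viewer_config(org_name: str = "Main Org.") -> str:
    """Render a grafana.ini fragment for read-only anonymous access.

    org_name is an assumption: it must match an organization that
    already exists in the Grafana instance.
    """
    cfg = configparser.ConfigParser()
    # Anonymous visitors are mapped to the Viewer role of one organization.
    cfg["auth.anonymous"] = {
        "enabled": "true",
        "org_name": org_name,
        "org_role": "Viewer",
    }
    # Keep self-registration off so the public stays view-only.
    cfg["users"] = {"allow_sign_up": "false"}

    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(anonymous_viewer_config())
```

The output can be merged into the instance's grafana.ini. TLS termination (for example a reverse proxy using the Certbot-issued certificate) is a separate concern and is not covered by this sketch.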

Prometheus Implementation

  1. Node Monitoring:

    • Start with cooperative validators/VFNs
    • Adopt a UUID system (identifiers derived from node public keys) to replace IP addresses as target identifiers
    • Dynamic target updates via peer list (inspired by David Boreham’s design)
  2. Scalability:

    • Borrow from 0L seed-peers for peer discovery
    • Scheduled scraping jobs with secure authentication
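
One possible shape for the UUID and dynamic-target items above is sketched below: a UUIDv5 derived from each node's public key, written into a Prometheus file-based service discovery (`file_sd_configs`) target file that can be regenerated whenever the peer list changes. The namespace string, the peer-record fields (`host`, `metrics_port`, `public_key`, `role`), and the `node_id` label are hypothetical and are not taken from the existing Open Libra tooling or from the referenced designs.

```python
import json
import os
import uuid

# Hypothetical namespace: any fixed UUID works, provided every operator
# uses the same one so the same key always yields the same node ID.
NODE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "nodes.example.invalid")


def node_uuid(public_key_hex: str) -> str:
    """Derive a deterministic UUIDv5 from a node's public key."""
    normalized = public_key_hex.lower().removeprefix("0x")
    return str(uuid.uuid5(NODE_NAMESPACE, normalized))


def build_file_sd(peers: list[dict]) -> list[dict]:
    """Build one file_sd target group per peer.

    Each peer is assumed to look like:
    {"host": ..., "metrics_port": ..., "public_key": ..., "role": ...}
    """
    return [
        {
            "targets": [f"{p['host']}:{p['metrics_port']}"],
            "labels": {
                "node_id": node_uuid(p["public_key"]),
                "role": p["role"],
            },
        }
        for p in peers
    ]


def write_file_sd(peers: list[dict], path: str) -> None:
    """Write the target file via a temp file and rename, so Prometheus
    never reads a half-written file."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(build_file_sd(peers), fh, indent=2)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    demo_peers = [  # placeholder values for illustration only
        {"host": "192.0.2.10", "metrics_port": 9101,
         "public_key": "0xabcdef", "role": "validator"},
    ]
    print(json.dumps(build_file_sd(demo_peers), indent=2))
```

Prometheus still needs a reachable address in `targets`, so this sketch only keeps IPs out of the identifiers: dashboards would group and display by `node_id` rather than by the `instance` label. A scheduled job could call `write_file_sd` after each peer-list refresh.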

Describe alternatives you've considered

  1. Third-Party SaaS (e.g., Datadog, New Relic):

    • Rejected due to cost and desire for community-controlled infrastructure
  2. IP-Based Prometheus Targets:

    • Rejected: the current setup is brittle and exposes node IPs; UUID-based identifiers are more secure
  3. Static Dashboards:

    • Considered, but modular designs were chosen to accommodate future metrics

Additional Context

References

Screenshots (Mockups)

(Note: Attach dashboard mockups or Aptos reference screenshots here in the GitHub issue)

Implementation Phases

  1. Phase 1: Grafana setup + basic dashboards
  2. Phase 2: UUID-based Prometheus integration
  3. Phase 3: Alerting + scaling optimizations
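
For Phase 3, a minimal sketch of the kind of check an alert could be built on: it reads the response of a Prometheus instant query for the built-in `up` metric and lists the targets reporting 0. This is a stand-in for proper alert rules (Prometheus alerting rules or Grafana alerting), not a proposal to replace them. The `node_id` label is the hypothetical label from the UUID scheme, and the base URL in the usage note is a placeholder.

```python
import json
import urllib.parse
import urllib.request


def down_targets(payload: dict) -> list[str]:
    """Return identifiers of targets whose `up` sample is 0.

    payload is the decoded JSON body of GET /api/v1/query?query=up.
    Prefers the (hypothetical) node_id label, falling back to instance.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) == 0.0:
            labels = sample["metric"]
            down.append(labels.get("node_id") or labels.get("instance", "unknown"))
    return sorted(down)


def query_up(base_url: str, timeout: float = 10.0) -> dict:
    """Run the instant query `up` against a Prometheus server."""
    query = urllib.parse.urlencode({"query": "up"})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Usage, assuming a Prometheus server reachable at a placeholder address: `down_targets(query_up("http://localhost:9090"))` returns the sorted list of node identifiers that failed their last scrape.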
