
feat(monitoring): expose share rejection metrics on Prometheus surface #475

Open
gimballock wants to merge 2 commits into stratum-mining:main from fossatmara:feat/share-rejection-prometheus-metrics

Conversation

@gimballock

Summary

The JSON API already reports per-channel shares_rejected (HashMap<String, u32> keyed by error_code) and shares_submitted, but none of this data reaches the Prometheus /metrics endpoint. Operators using Prometheus for alerting and dashboards have no way to track share rejection rates or reasons over time.

Motivation

  • Time-series alerting — rejection rate (rejected / submitted) is the most direct signal that something is wrong (miner misconfiguration, difficulty drift, network latency). Without Prometheus exposure, rate-based alerting is not possible.
  • Reason breakdown — the existing HashMap<String, u32> already distinguishes reasons (stale, duplicate-share, etc.). Surfacing this per-reason enables operators to distinguish transient stale-share spikes (normal after a new block) from sustained protocol errors.
  • API vs Prometheus gap — the JSON API serves this data as a point-in-time snapshot. Trend analysis, rate() queries, and recording rules require Prometheus exposure.

Current State

Server-side Prometheus (what exists):

  • sv2_server_shares_accepted_total{channel_id, user_identity}
  • No shares_submitted or shares_rejected gauges

Server-side JSON API (what exists but is not in Prometheus):

  • shares_submitted: u32
  • shares_rejected: HashMap<String, u32>

Client-side:

  • sv2_client_shares_accepted_total exists in Prometheus
  • No rejection data in Prometheus or in the client monitoring types, although the upstream stratum-core client-side ShareAccounting already tracks rejected_shares: u32

Changes

Server metrics

  1. sv2_server_shares_submitted_total{channel_id, user_identity} — gauge from shares_submitted, enables rejection-rate denominator
  2. sv2_server_shares_rejected_total{channel_id, user_identity, error_code} — gauge from iterating shares_rejected map entries (see the sketch after this list)
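
For illustration only, a minimal sketch (not the PR's actual code) of how the two server gauges could be registered with the prometheus crate and populated from the per-channel snapshot data; the ChannelSnapshot struct, function names, and field types are assumptions:

```rust
use std::collections::HashMap;

use prometheus::{IntGaugeVec, Opts, Registry};

// Assumed shape of the per-channel data already exposed by the JSON API;
// struct and field names are illustrative, not the crate's actual types.
struct ChannelSnapshot {
    channel_id: u32,
    user_identity: String,
    shares_submitted: u32,
    shares_rejected: HashMap<String, u32>, // keyed by error_code
}

fn register_share_gauges(registry: &Registry) -> (IntGaugeVec, IntGaugeVec) {
    let submitted = IntGaugeVec::new(
        Opts::new("sv2_server_shares_submitted_total", "Shares submitted per channel"),
        &["channel_id", "user_identity"],
    )
    .expect("valid metric definition");
    let rejected = IntGaugeVec::new(
        Opts::new("sv2_server_shares_rejected_total", "Shares rejected per channel and error code"),
        &["channel_id", "user_identity", "error_code"],
    )
    .expect("valid metric definition");
    registry.register(Box::new(submitted.clone())).expect("register submitted");
    registry.register(Box::new(rejected.clone())).expect("register rejected");
    (submitted, rejected)
}

fn update_share_gauges(submitted: &IntGaugeVec, rejected: &IntGaugeVec, snap: &ChannelSnapshot) {
    let channel = snap.channel_id.to_string();
    submitted
        .with_label_values(&[channel.as_str(), snap.user_identity.as_str()])
        .set(i64::from(snap.shares_submitted));
    // One time series per (channel, user, error_code) entry in the existing map.
    for (error_code, count) in &snap.shares_rejected {
        rejected
            .with_label_values(&[channel.as_str(), snap.user_identity.as_str(), error_code.as_str()])
            .set(i64::from(*count));
    }
}
```

Since the values come from a periodically refreshed snapshot, the sketch sets gauges to the absolute totals on each pass rather than incrementing counters, which mirrors how the PR describes these metrics.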

Client metrics

  1. sv2_client_shares_rejected_total{client_id, channel_id, user_identity} — gauge from upstream ShareAccounting::get_rejected_shares() (scalar u32, no per-reason breakdown available at this layer; sketched below)
  2. Added shares_rejected: u32 field to ExtendedChannelInfo / StandardChannelInfo in client monitoring types
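
The client-side update would follow the same set-from-scalar pattern; a minimal sketch, in which the function and parameter names are assumptions:

```rust
use prometheus::IntGaugeVec;

// `rejected` is the sv2_client_shares_rejected_total vec; the surrounding
// function and its parameters are illustrative only.
fn update_client_rejections(
    rejected: &IntGaugeVec,
    client_id: &str,
    channel_id: &str,
    user_identity: &str,
    rejected_shares: u32, // e.g. from ShareAccounting::get_rejected_shares()
) {
    rejected
        .with_label_values(&[client_id, channel_id, user_identity])
        .set(i64::from(rejected_shares));
}
```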

Stale label cleanup

  1. Tracks (channel_id, user_identity, error_code) triples in PreviousLabelSets and removes stale combinations on refresh, the same pattern already used for the existing share/hashrate labels (see the sketch below)
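
A minimal sketch of that cleanup pass, assuming PreviousLabelSets reduces to a set of previously seen label triples; names other than the prometheus-crate types and remove_label_values are assumptions:

```rust
use std::collections::HashSet;

use prometheus::IntGaugeVec;

// (channel_id, user_identity, error_code) as stringified label values.
type RejectionLabels = (String, String, String);

fn sweep_stale_rejection_labels(
    rejected: &IntGaugeVec,
    previous: &mut HashSet<RejectionLabels>,
    current: HashSet<RejectionLabels>,
) {
    // Any triple seen on the last refresh but absent now is a stale series.
    for (channel_id, user_identity, error_code) in previous.difference(&current) {
        // remove_label_values errs if the series is already gone; ignore or log.
        let _ = rejected.remove_label_values(&[
            channel_id.as_str(),
            user_identity.as_str(),
            error_code.as_str(),
        ]);
    }
    *previous = current;
}
```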

Cardinality

The error_code label is nominally unbounded (the SV2 spec allows arbitrary strings), but in practice the server ShareValidationError enum defines ~9 well-known variants. Upstream stratum-mining/stratum#2142 is working toward typed error_code constants which will further bound this.

Client-side limitation

The server-side ShareAccounting in stratum-core does not currently track rejected shares (stratum-mining/stratum#2119). Client monitoring shares_rejected defaults to 0 until that is addressed upstream. The Prometheus gauge and API field are in place for when the data becomes available.

Example PromQL

# Rejection rate
rate(sv2_server_shares_rejected_total[5m]) / rate(sv2_server_shares_submitted_total[5m])

# Breakdown by reason
sum by (error_code) (rate(sv2_server_shares_rejected_total[5m]))

# Stale share spike
rate(sv2_server_shares_rejected_total{error_code="stale"}[1m])

Testing

All existing tests pass (80 in stratum-apps, 6 in pool, miner-apps all green). Existing test helpers updated for new fields.

Related

Eric Price added 2 commits May 2, 2026 20:54
…hotCache::refresh (stratum-mining#337)

Move all Prometheus gauge updates (set + stale-label removal) out of the
/metrics HTTP handler and into SnapshotCache::refresh(), which runs as a
periodic background task. This eliminates the GaugeVec reset gap where
label series momentarily disappeared on every scrape.

Changes:
- SnapshotCache now owns PrometheusMetrics and PreviousLabelSets
- refresh() updates snapshot data AND Prometheus gauges atomically
- /metrics handler reduced to: set uptime gauge, gather, encode
- ServerState simplified (no more PreviousLabelSets or Mutex)
- Tests updated to wire metrics through cache via with_metrics()
- Integration tests: replace fixed-sleep assertions with
  poll_until_metric_gte (100ms poll, 5s deadline) for CI resilience
- Clone impl preserves previous_labels for correct stale-label detection
- debug-level tracing on stale label removal errors
- debug_assert on with_metrics double-attachment

Closes stratum-mining#337
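
Not this commit's actual code, but a minimal sketch of the shape it describes, assuming a tokio runtime: refresh() runs as a periodic background task and owns all gauge mutation, while the /metrics handler only gathers and encodes (the uptime-gauge set is omitted here). Everything except the SnapshotCache and refresh() names is an assumption:

```rust
use std::sync::Arc;
use std::time::Duration;

use prometheus::{Encoder, Registry, TextEncoder};

struct SnapshotCache {
    registry: Registry,
    // ...the PrometheusMetrics gauges and PreviousLabelSets would live here...
}

impl SnapshotCache {
    // Placeholder for the real refresh: pull fresh channel data, set every
    // gauge, and sweep stale label sets. All gauge mutation happens here.
    async fn refresh(&self) {}

    // Periodic background task, instead of doing this work on every scrape.
    async fn run(self: Arc<Self>) {
        let mut tick = tokio::time::interval(Duration::from_secs(5));
        loop {
            tick.tick().await;
            self.refresh().await;
        }
    }
}

// The /metrics handler body shrinks to roughly this: gather and encode.
fn render_metrics(cache: &SnapshotCache) -> Vec<u8> {
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&cache.registry.gather(), &mut buf)
        .expect("text-format encoding");
    buf
}
```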
Add Prometheus gauges for share submission and rejection data that was
previously only available through the JSON monitoring API:

Server metrics:
- sv2_server_shares_submitted_total{channel_id, user_identity}
- sv2_server_shares_rejected_total{channel_id, user_identity, error_code}

Client metrics:
- sv2_client_shares_rejected_total{client_id, channel_id, user_identity}

These enable time-series alerting on rejection rates and per-reason
breakdown (stale, duplicate-share, etc.) via rate() queries and
recording rules.

Implementation:
- Register new GaugeVecs in PrometheusMetrics
- Populate from existing shares_rejected HashMap (server) and
  rejected_shares u32 (client) in SnapshotCache::update_metrics
- Track server rejection label triples in PreviousLabelSets for
  stale series cleanup
- Add shares_rejected field to client ExtendedChannelInfo and
  StandardChannelInfo (defaults to 0 until stratum#2119 adds
  rejection tracking to server-side ShareAccounting)
@gimballock force-pushed the feat/share-rejection-prometheus-metrics branch from d847f42 to aa30316 on May 3, 2026 00:58