
feat(invoker,network): add deployment metrics and connection pool instrumentation#4567

Open
robertream wants to merge 4 commits into restatedev:main from robertream:invoker-deployment-metrics

Conversation


@robertream robertream commented Apr 8, 2026

Summary

Adds deployment-level observability to the invoker subsystem, combining
issues #4553 and #4454 into a single change. This lets operators distinguish
slow user services from internal Restate bottlenecks.

Invoker metrics

  • New metrics:

    • restate.invoker.http_request_duration.seconds (TTFB)
    • restate.invoker.http_total_duration.seconds (total duration)
    • restate.invoker.queue_duration.seconds (internal wait time)
    • restate.invoker.active_invocations (concurrency gauge)
    • restate.invoker.http_status_code.total (non-200 responses)
    • restate.invoker.throttle_balance (token bucket debt)
  • Label enrichment on invocation_tasks, task_duration, eager_state_truncated,
    enqueue, and concurrency metrics with partition_id, service_name, and
    deployment_id where applicable.

  • ServiceMetrics struct with gauge/counter/histogram helpers; all service
    metric recording goes through its methods. A companion TaskMetrics
    struct handles the INVOCATION_TASKS lifecycle counters with
    status/transient labels.

  • ResponseStream::instrument(metric) builder for HTTP timing, enqueued_at
    timestamp on InvokeCommand for queue duration, and event-driven throttle
    balance recording at slot acquire/release.
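To illustrate the pattern described above, here is a minimal, hypothetical sketch of a ServiceMetrics-style struct: one struct owns the label set and all recording goes through its methods, so call sites cannot emit inconsistently labeled series. The field names mirror the labels named in this PR; the in-memory counter backend is a stand-in (the real code records through a metrics backend, not a HashMap).

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for the PR's ServiceMetrics: one instance per
// (partition_id, service_name, deployment_id), so every series it records
// carries the same label set.
struct ServiceMetrics {
    partition_id: u16,
    service_name: String,
    deployment_id: String,
    // Stand-in backend for the sketch; the real code records via a metrics library.
    counters: Mutex<HashMap<&'static str, u64>>,
}

impl ServiceMetrics {
    fn new(partition_id: u16, service_name: &str, deployment_id: &str) -> Self {
        Self {
            partition_id,
            service_name: service_name.to_owned(),
            deployment_id: deployment_id.to_owned(),
            counters: Mutex::new(HashMap::new()),
        }
    }

    // One label string shared by every series this struct records.
    fn labels(&self) -> String {
        format!(
            "partition_id={},service_name={},deployment_id={}",
            self.partition_id, self.service_name, self.deployment_id
        )
    }

    fn increment_counter(&self, name: &'static str, by: u64) {
        *self.counters.lock().unwrap().entry(name).or_insert(0) += by;
    }

    fn counter_value(&self, name: &'static str) -> u64 {
        self.counters.lock().unwrap().get(name).copied().unwrap_or(0)
    }
}
```

The payoff of the struct-method approach is that adding or renaming a label touches one place rather than every call site.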

Connection pool instrumentation (#4559)

Two scoped metric structs for the networking layer:

NetworkMetrics (swimlane-scoped, pre-cached per variant):

  • pool.connections.active gauge
  • connection.pending gauge
  • connection.acquisition.duration histogram (cache=hit/miss, result=success/error)
  • connection.handshake.duration histogram (result=success/error)
  • rpc.duration histogram (result=success/error)

ConnectionMetrics (swimlane + peer_name + direction, stored on Connection):

  • connection.opened/closed counters (+ legacy connection_created/dropped)
  • permit_acquisition.duration histogram
  • connection.duration (lifetime) histogram

Key design decisions:

  • NetworkMetrics pre-built in a static array indexed by Swimlane
  • ConnectionMetrics created via NetworkMetrics::connection(peer, direction)
  • All labels are &'static str via LazyIntern — zero allocation on emit
  • Legacy metrics preserved with enriched labels for dashboard compatibility
  • LazyIntern extracted to restate_core::metric_definitions for reuse
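The two cardinality tricks above can be sketched in a few lines. This is a hypothetical, stdlib-only illustration, not the PR's actual LazyIntern or Swimlane code: dynamic strings (such as peer names) are interned once into `&'static str` via a deliberate, bounded leak, and per-variant data is pre-built in a fixed array indexed by the enum discriminant.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Sketch of the LazyIntern idea: intern a dynamic string into &'static str
// once, so metric emission never allocates for labels. The Box::leak is
// intentional and bounded by the number of distinct keys ever seen.
struct LazyIntern {
    cache: Mutex<HashMap<String, &'static str>>,
}

impl LazyIntern {
    fn new() -> Self {
        Self { cache: Mutex::new(HashMap::new()) }
    }

    fn get(&self, key: &str) -> &'static str {
        let mut cache = self.cache.lock().unwrap();
        if let Some(&s) = cache.get(key) {
            return s; // already interned: no allocation
        }
        let leaked: &'static str = Box::leak(key.to_owned().into_boxed_str());
        cache.insert(key.to_owned(), leaked);
        leaked
    }
}

// Sketch of the "static array indexed by Swimlane" idea: per-variant values
// are built once and looked up by discriminant, with no map or lock on the path.
#[derive(Clone, Copy)]
enum Swimlane { General, Replication }

fn swimlane_label(s: Swimlane) -> &'static str {
    const LABELS: [&str; 2] = ["general", "replication"];
    LABELS[s as usize]
}
```

Interning only pays off when the key space is small and stable (peer names, swimlanes); it would be a memory leak for unbounded inputs.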

Verification

  • cargo check passes
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --all -- --check clean
  • cargo nextest run -p restate-invoker-impl -p restate-core — 65/65 passing

Closes #4553
Closes #4454
Closes #4559

* Adds deployment-level observability to the invoker subsystem, combining
  issues restatedev#4553 and restatedev#4454 into a single change, letting
  operators distinguish slow user services from internal Restate bottlenecks.

  New metrics:
  - restate.invoker.http_request_duration.seconds (TTFB)
  - restate.invoker.http_total_duration.seconds (total duration)
  - restate.invoker.queue_duration.seconds (internal wait time)
  - restate.invoker.active_invocations (concurrency gauge)
  - restate.invoker.http_status_code.total (non-200 responses)
  - restate.invoker.throttle_balance (token bucket debt)

* Label enrichment on invocation_tasks, task_duration, eager_state_truncated,
  enqueue, and concurrency metrics with partition_id, service_name, and
  deployment_id where applicable.

* ServiceMetrics struct with gauge/counter/histogram helpers; all service
  metric recording goes through its methods. A companion TaskMetrics struct
  handles the INVOCATION_TASKS lifecycle counters with status/transient
  labels.

* ResponseStream::instrument(metric) builder for HTTP timing, enqueued_at
  timestamp on InvokeCommand for queue duration, and event-driven throttle
  balance recording at slot acquire/release.

* LazyIntern for generic string interning, including a STATUS_CODE_LOOKUP
  for zero-allocation HTTP status code labels.
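A STATUS_CODE_LOOKUP of the kind mentioned above might look like this hypothetical sketch (the function name and range are assumptions, not the PR's code): every plausible status code is rendered to a string once, so recording a status label on the hot path does no formatting or allocation.

```rust
use std::sync::OnceLock;

// Hypothetical STATUS_CODE_LOOKUP sketch: pre-render "100".."599" once.
// Out-of-range codes return None so callers can fall back to a catch-all label.
fn status_code_str(code: u16) -> Option<&'static str> {
    static LOOKUP: OnceLock<Vec<&'static str>> = OnceLock::new();
    let table = LOOKUP.get_or_init(|| {
        (100u16..=599)
            // A one-time, bounded leak (500 tiny strings) buys &'static str labels.
            .map(|c| &*Box::leak(c.to_string().into_boxed_str()))
            .collect()
    });
    code.checked_sub(100)
        .and_then(|i| table.get(i as usize))
        .copied()
}
```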
@robertream robertream marked this pull request as ready for review April 8, 2026 00:48

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

…v#4559)

Add comprehensive metrics to the networking layer's connection pool via
two scoped structs:

NetworkMetrics (swimlane-scoped, pre-cached per variant):
- pool.connections.active gauge
- connection.pending gauge
- connection.acquisition.duration histogram (cache=hit/miss, result)
- connection.handshake.duration histogram (result=success/error)
- rpc.duration histogram (result=success/error)

ConnectionMetrics (swimlane + peer_name + direction, stored on Connection):
- connection.opened/closed counters (+ legacy connection_created/dropped)
- permit_acquisition.duration histogram
- connection.duration (lifetime) histogram

Key design decisions:
- NetworkMetrics instances pre-built in a static array indexed by Swimlane
- ConnectionMetrics created via NetworkMetrics::connection(peer, direction)
- All labels are &'static str via LazyIntern for peer IDs — zero allocation
- Legacy metrics (connection_created/dropped) preserved with enriched labels
- Pool gauge decremented in deregister() to avoid double-count on drain+drop
- LazyIntern extracted to restate_core::metric_definitions for reuse

Also renames invoker metric constants to follow OTel naming conventions.
@robertream robertream changed the title feat(invoker): add deployment performance metrics and label enrichment feat(invoker,network): add deployment metrics and connection pool instrumentation Apr 16, 2026
robertream and others added 2 commits April 20, 2026 15:14
… import

Clippy flags new_without_default on LazyIntern<K>::new(). Add the
trivial Default impl. Also remove an unused std::sync::LazyLock import
in the network metric_definitions introduced by the connection pool
instrumentation commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument ServiceDiscovery with counters and histograms for deployment
endpoint reachability — currently only visible in logs. Three metrics:

- restate.deployment.discovery.attempts.total{outcome}
  (success | retryable_exhausted | permanent_failure)
- restate.deployment.discovery.retries.total{reason}
  (server_error | rate_limited | connection | body_error | other)
- restate.deployment.discovery.duration.seconds{outcome}

describe_metrics() called from ServiceDiscovery::new(), following the
crate-internal pattern used across the codebase.
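A reason-label mapping like the one behind `retries.total{reason}` has to keep the label set fixed and small for metric cardinality. Here is a hedged sketch of one way to classify a retryable discovery failure; the function and its input shape are illustrative assumptions, not Restate's actual error type (for instance, `body_error` is omitted here).

```rust
// Map an HTTP response status (or the absence of one, i.e. a transport-level
// failure) onto a small, fixed set of &'static str reason labels, so the
// retries.total series stays bounded regardless of what servers return.
fn retry_reason(status: Option<u16>) -> &'static str {
    match status {
        Some(429) => "rate_limited",
        Some(s) if (500..=599).contains(&s) => "server_error",
        Some(_) => "other",
        None => "connection", // no response at all: connect/TLS/reset failures
    }
}
```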

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

  • Add invoker deployment performance metrics
  • More invoker metrics
  • [connection-pool] Add instrumentation to measure various pool metric
