feat(invoker,network): add deployment metrics and connection pool instrumentation#4567
Open
robertream wants to merge 4 commits into restatedev:main
* Add deployment-level observability to the invoker subsystem, combining issues restatedev#4553 and restatedev#4454 into a single change. Lets operators distinguish slow user services from internal Restate bottlenecks. New metrics:
  - restate.invoker.http_request_duration.seconds (TTFB)
  - restate.invoker.http_total_duration.seconds (total duration)
  - restate.invoker.queue_duration.seconds (internal wait time)
  - restate.invoker.active_invocations (concurrency gauge)
  - restate.invoker.http_status_code.total (non-200 responses)
  - restate.invoker.throttle_balance (token bucket debt)
* Label enrichment on invocation_tasks, task_duration, eager_state_truncated, enqueue, and concurrency metrics with partition_id, service_name, and deployment_id where applicable.
* ServiceMetrics struct with gauge/counter/histogram helpers. All service metric recording goes through the struct methods. The TaskMetrics struct handles the INVOCATION_TASKS lifecycle counters with status/transient labels.
* ResponseStream::instrument(metric) builder for HTTP timing, enqueued_at timestamp on InvokeCommand for queue duration, and event-driven throttle balance recording at slot acquire/release.
* LazyIntern for generic string interning, including STATUS_CODE_LOOKUP for zero-allocation HTTP status code interning.
…v#4559)

Add comprehensive metrics to the networking layer's connection pool via two scoped structs:

NetworkMetrics (swimlane-scoped, pre-cached per variant):
- pool.connections.active gauge
- connection.pending gauge
- connection.acquisition.duration histogram (cache=hit/miss, result)
- connection.handshake.duration histogram (result=success/error)
- rpc.duration histogram (result=success/error)

ConnectionMetrics (swimlane + peer_name + direction, stored on Connection):
- connection.opened/closed counters (+ legacy connection_created/dropped)
- permit_acquisition.duration histogram
- connection.duration (lifetime) histogram

Key design decisions:
- NetworkMetrics instances pre-built in a static array indexed by Swimlane
- ConnectionMetrics created via NetworkMetrics::connection(peer, direction)
- All labels are &'static str, with LazyIntern used for peer IDs, so emitting a metric does not allocate
- Legacy metrics (connection_created/dropped) preserved with enriched labels
- Pool gauge decremented in deregister() to avoid double-count on drain+drop
- LazyIntern extracted to restate_core::metric_definitions for reuse

Also renames invoker metric constants to follow OTel naming conventions.
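The interning idea in the key design decisions above can be illustrated with a minimal sketch. This is not the PR's `LazyIntern`: the field layout, the `get_or_intern` method name, and the use of a `Mutex<HashMap>` with `Box::leak` are all assumptions made for illustration. The point is only that each distinct key is rendered and leaked once, after which every lookup returns the same `&'static str`.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;

/// Hypothetical stand-in for the PR's `LazyIntern`; the real type may differ.
pub struct LazyIntern<K> {
    cache: Mutex<HashMap<K, &'static str>>,
}

impl<K: Eq + Hash> LazyIntern<K> {
    pub fn new() -> Self {
        Self { cache: Mutex::new(HashMap::new()) }
    }

    /// Returns the interned label for `key`. On first use the label is
    /// rendered and leaked; later calls reuse the leaked string.
    /// (`get_or_intern` is an assumed name, not taken from the PR.)
    pub fn get_or_intern(
        &self,
        key: K,
        render: impl FnOnce(&K) -> String,
    ) -> &'static str {
        let mut cache = self.cache.lock().unwrap();
        if let Some(&existing) = cache.get(&key) {
            return existing;
        }
        // Leaking is deliberate: the set of labels is expected to be small
        // and bounded (peer IDs, status codes), so the memory is never freed.
        let leaked: &'static str = Box::leak(render(&key).into_boxed_str());
        cache.insert(key, leaked);
        leaked
    }
}

// Clippy's `new_without_default` lint asks for this impl.
impl<K: Eq + Hash> Default for LazyIntern<K> {
    fn default() -> Self {
        Self::new()
    }
}
```

In this sketch only the first lookup of a key allocates. The trade-off is that interned strings live for the rest of the process, which is only acceptable when the key space is bounded.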
… import

Clippy flags new_without_default on LazyIntern<K>::new(). Add the trivial Default impl. Also remove an unused std::sync::LazyLock import in the network metric_definitions introduced by the connection pool instrumentation commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument ServiceDiscovery with counters and histograms for deployment
endpoint reachability, which is currently only visible in logs. Three metrics:
- restate.deployment.discovery.attempts.total{outcome}
  (success | retryable_exhausted | permanent_failure)
- restate.deployment.discovery.retries.total{reason}
  (server_error | rate_limited | connection | body_error | other)
- restate.deployment.discovery.duration.seconds{outcome}

describe_metrics() is called from ServiceDiscovery::new(), following the
crate-internal pattern used across the codebase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
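As a rough illustration of how the `reason` label values listed above might be derived, here is a minimal sketch. The `DiscoveryError` enum and the `retry_reason` function are hypothetical, and the mapping of HTTP 429 to `rate_limited` and 5xx to `server_error` is an assumption; the commit message only names the label values, not how errors are classified.

```rust
/// Hypothetical error shape for one discovery attempt; not the PR's type.
pub enum DiscoveryError {
    /// The endpoint answered with a non-success HTTP status.
    Status(u16),
    /// The connection could not be established or was reset.
    Connection,
    /// The response body could not be read or decoded.
    Body,
    /// Anything else.
    Other,
}

/// Maps an error to one of the `reason` label values from the commit message.
/// The status-code ranges are assumptions made for this sketch.
pub fn retry_reason(err: &DiscoveryError) -> &'static str {
    match err {
        DiscoveryError::Status(429) => "rate_limited",
        DiscoveryError::Status(500..=599) => "server_error",
        DiscoveryError::Connection => "connection",
        DiscoveryError::Body => "body_error",
        DiscoveryError::Status(_) | DiscoveryError::Other => "other",
    }
}
```

Returning `&'static str` keeps the label set closed, so the counter's cardinality stays bounded by the five values listed in the commit message.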
Summary
Adds deployment-level observability to the invoker subsystem, combining
issues #4553 and #4454 into a single change. Lets operators distinguish
slow user services from internal Restate bottlenecks.
Invoker metrics
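New metrics:
- restate.invoker.http_request_duration.seconds (TTFB)
- restate.invoker.http_total_duration.seconds (total duration)
- restate.invoker.queue_duration.seconds (internal wait time)
- restate.invoker.active_invocations (concurrency gauge)
- restate.invoker.http_status_code.total (non-200 responses)
- restate.invoker.throttle_balance (token bucket debt)

Label enrichment on invocation_tasks, task_duration, eager_state_truncated,
enqueue, and concurrency metrics with partition_id, service_name, and
deployment_id where applicable.

ServiceMetrics struct with gauge/counter/histogram helpers. All service
metric recording goes through the struct methods. The TaskMetrics struct
handles the INVOCATION_TASKS lifecycle counters with status/transient
labels.

ResponseStream::instrument(metric) builder for HTTP timing, enqueued_at
timestamp on InvokeCommand for queue duration, and event-driven throttle
balance recording at slot acquire/release.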
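To make the three durations concrete, here is a minimal sketch of how queue time, TTFB, and total HTTP time could be derived from timestamps. The trimmed-down `InvokeCommand`, the `InvocationTimings` struct, and the `invocation_timings` function are hypothetical. The choice to start both HTTP clocks at the moment the request is sent is an assumption, not something the PR description states.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, trimmed-down command carrying only the enqueue timestamp.
pub struct InvokeCommand {
    pub enqueued_at: Instant,
}

/// The three durations the new histograms would observe.
#[derive(Debug, PartialEq)]
pub struct InvocationTimings {
    /// Internal wait time: enqueue until the HTTP request is sent.
    pub queue: Duration,
    /// Time to first byte: request sent until the first response byte.
    pub ttfb: Duration,
    /// Total duration: request sent until the response stream ends.
    pub total: Duration,
}

/// Derives the timings from four timestamps. `saturating_duration_since`
/// yields zero instead of panicking if the timestamps are out of order.
pub fn invocation_timings(
    cmd: &InvokeCommand,
    request_sent: Instant,
    first_byte: Instant,
    stream_end: Instant,
) -> InvocationTimings {
    InvocationTimings {
        queue: request_sent.saturating_duration_since(cmd.enqueued_at),
        ttfb: first_byte.saturating_duration_since(request_sent),
        total: stream_end.saturating_duration_since(request_sent),
    }
}
```

Splitting the timeline this way is what lets an operator tell the two cases apart: a large `queue` value points at Restate itself, while a large `ttfb` or `total` points at the user service.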
Connection pool instrumentation (#4559)
Two scoped metric structs for the networking layer:
NetworkMetrics (swimlane-scoped, pre-cached per variant):
- pool.connections.active gauge
- connection.pending gauge
- connection.acquisition.duration histogram (cache=hit/miss, result=success/error)
- connection.handshake.duration histogram (result=success/error)
- rpc.duration histogram (result=success/error)

ConnectionMetrics (swimlane + peer_name + direction, stored on Connection):
- connection.opened/closed counters (+ legacy connection_created/dropped)
- permit_acquisition.duration histogram
- connection.duration (lifetime) histogram

Key design decisions:
- ConnectionMetrics created via NetworkMetrics::connection(peer, direction)
- All labels are &'static str via LazyIntern: zero allocation on emit
- LazyIntern extracted to restate_core::metric_definitions for reuse

Verification
- cargo check passes
- cargo clippy --all-targets -- -D warnings clean
- cargo fmt --all -- --check clean
- cargo nextest run -p restate-invoker-impl -p restate-core: 65/65 passing

Closes #4553
Closes #4454
Closes #4559