enhancement(api)!: replace custom Health RPC with standard gRPC health service #25139
…h service Replace the custom ObservabilityService/Health RPC on the observability API with the standard grpc.health.v1.Health service. This enables native Kubernetes gRPC health probes, grpc-health-probe, and other standard tooling to work out of the box on port 8686. The vector-api-client now uses the standard health client for its health() method, maintaining the same API surface for vector top/tap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Client::health() now checks the response ServingStatus and returns a NotServing error if it is not SERVING; uses the named service "vector.observability.v1.ObservabilityService" instead of the empty string
- Store the health_reporter handle in the server task and set NOT_SERVING when the shutdown signal arrives, so in-flight probes see the transition
- Fix the stale doc example in lib.rs and the grpcurl example in api.md
- Add a health integration test in tests/vector_api/health.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
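The client-side check described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the actual vector-api-client code: the `HealthError` type and the `check_status` function are hypothetical names, and the `ServingStatus` enum only mirrors the four values defined by grpc.health.v1.

```rust
/// Mirrors the status values of grpc.health.v1.HealthCheckResponse.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServingStatus {
    Unknown,
    Serving,
    NotServing,
    ServiceUnknown,
}

/// Hypothetical client error; the real client's error type may differ.
#[derive(Debug, PartialEq, Eq)]
enum HealthError {
    /// Carries the status the server actually reported.
    NotServing(ServingStatus),
}

/// Treat only SERVING as healthy; every other status becomes an error.
fn check_status(status: ServingStatus) -> Result<(), HealthError> {
    match status {
        ServingStatus::Serving => Ok(()),
        other => Err(HealthError::NotServing(other)),
    }
}

fn main() {
    let statuses = [
        ServingStatus::Serving,
        ServingStatus::NotServing,
        ServingStatus::Unknown,
        ServingStatus::ServiceUnknown,
    ];
    for status in statuses {
        match check_status(status) {
            Ok(()) => println!("{status:?}: healthy"),
            Err(err) => println!("{status:?}: unhealthy ({err:?})"),
        }
    }
}
```

Mapping every non-SERVING status to an error keeps callers such as vector top/tap from treating a successful RPC round trip as proof of health.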
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 447a821c92
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Use only the empty service name ("") for health checks, which is the
standard whole-server health slot. This is what Kubernetes gRPC probes,
grpc-health-probe, and the Helm chart all query by default.
Removes the named service registration since there is only one service
on this port and no reason for service-specific health granularity today.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
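To illustrate why the empty service name is enough here, the following sketch models a health status table in plain Rust. It assumes the behavior specified by the gRPC health checking protocol, where a name that was never registered yields a NOT_FOUND error; `HealthRegistry` and its methods are hypothetical stand-ins, not the tonic_health API.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServingStatus {
    Serving,
    NotServing,
}

/// Hypothetical in-memory stand-in for a health service's status table.
struct HealthRegistry {
    statuses: HashMap<String, ServingStatus>,
}

impl HealthRegistry {
    /// Start with only the whole-server slot ("") registered as SERVING.
    fn new() -> Self {
        let mut statuses = HashMap::new();
        statuses.insert(String::new(), ServingStatus::Serving);
        Self { statuses }
    }

    /// Register or update the status for a service name.
    fn set(&mut self, service: &str, status: ServingStatus) {
        self.statuses.insert(service.to_string(), status);
    }

    /// Look up a service. `None` stands in for the NOT_FOUND gRPC status
    /// that the health protocol specifies for unregistered names.
    fn check(&self, service: &str) -> Option<ServingStatus> {
        self.statuses.get(service).copied()
    }
}

fn main() {
    let mut registry = HealthRegistry::new();
    // Probes that send an empty service name hit the whole-server slot.
    println!("\"\" -> {:?}", registry.check(""));
    // A named service that was never registered is not found.
    println!(
        "named -> {:?}",
        registry.check("vector.observability.v1.ObservabilityService")
    );
    // Flipping the whole-server slot is what a shutdown path would do.
    registry.set("", ServingStatus::NotServing);
    println!("\"\" after shutdown -> {:?}", registry.check(""));
}
```

With a single slot there is only one status to keep in sync, which matches the reasoning in the commit message that service-specific granularity is not needed on this port today.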
…grpc-health-check # Conflicts: # lib/vector-api-client/src/client.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cf0aeeeaf
rx.await.ok();
// Signal NOT_SERVING before the server stops accepting requests,
// so in-flight health probes see the transition during graceful drain.
health_reporter
    .set_service_status("", tonic_health::ServingStatus::NotServing)
Flip gRPC health to NOT_SERVING before draining topology
This marks health as NOT_SERVING only inside the server shutdown future after rx.await, which is triggered when GrpcServer is dropped. On graceful shutdown, TopologyController::stop still awaits self.topology.stop().await before api_server is dropped, so the health endpoint stays SERVING during most of the drain window. With the old running-based health RPC removed in this commit, Kubernetes gRPC readiness probes can continue to pass until near process exit, keeping terminating pods routable longer than intended.
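The ordering concern can be reproduced in a small standard-library model. This is only a sketch under stated assumptions: `Readiness` stands in for the health reporter handle, `drain_topology` stands in for the topology stop step, and neither function reflects Vector's actual shutdown code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical readiness flag; stands in for the health reporter handle.
struct Readiness(AtomicBool);

impl Readiness {
    fn new() -> Self {
        Self(AtomicBool::new(true))
    }

    fn set_not_serving(&self) {
        self.0.store(false, Ordering::SeqCst);
    }

    fn is_serving(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical drain step. Returns whether a readiness probe arriving
/// while the drain is in progress would still pass.
fn drain_topology(readiness: &Readiness) -> bool {
    readiness.is_serving()
}

/// Ordering the review asks for: flip readiness first, then drain.
fn flip_then_drain(readiness: &Readiness) -> bool {
    readiness.set_not_serving();
    drain_topology(readiness)
}

/// Ordering the review describes: drain first, flip readiness afterwards.
fn drain_then_flip(readiness: &Readiness) -> bool {
    let probe_passed_during_drain = drain_topology(readiness);
    readiness.set_not_serving();
    probe_passed_during_drain
}

fn main() {
    println!(
        "flip then drain: probe passes during drain = {}",
        flip_then_drain(&Readiness::new())
    );
    println!(
        "drain then flip: probe passes during drain = {}",
        drain_then_flip(&Readiness::new())
    );
}
```

In the second ordering the probe still passes while the drain runs, which is the window in which a terminating pod would remain routable.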
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8584e75bd7
rx.await.ok();
// Signal NOT_SERVING before the server stops accepting requests,
// so in-flight health probes see the transition during graceful drain.
health_reporter
    .set_service_status("", tonic_health::ServingStatus::NotServing)
Mark gRPC health NOT_SERVING before topology drain
The transition to NOT_SERVING is delayed until rx.await resolves inside serve_with_incoming_shutdown, which only happens when GrpcServer is dropped. On graceful shutdown, TopologyController::stop still awaits self.topology.stop().await before that drop, so readiness remains SERVING during most of the drain and Kubernetes gRPC probes can keep routing to a terminating pod. This commit also removed the previous running-based ObservabilityService::health path, so no earlier readiness flip remains.
Summary
Replace the custom (unreleased) ObservabilityService/Health RPC on the observability API (port 8686) with the standard grpc.health.v1.Health service. This enables native Kubernetes gRPC health probes, grpc-health-probe, and other standard gRPC health-checking tooling to work out of the box.
- Add the tonic_health standard health service to the gRPC server with reflection support
- Remove the running: Arc<AtomicBool> parameter from the service and server
This change is required for vectordotdev/helm-charts#540, which adds default gRPC readiness probes on port 8686. Kubernetes native gRPC probes call the standard grpc.health.v1.Health/Check RPC with an empty service name, which tonic_health handles by default.
Vector configuration
No configuration changes required. The standard health service is automatically available on the existing API address (default 0.0.0.0:8686).
How did you test this PR?
- cargo check --features api,api-client passes
- make check-clippy passes
- make check-fmt passes
- tonic_health::server::health_reporter() responds to empty service name checks with SERVING by default (same pattern already used and tested in the vector source)
Note: it's safe to ignore the failing proto check here since the affected proto file is not released yet.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changelog label to this PR.
References