enhancement(api)!: replace custom Health RPC with standard gRPC health service #25139

Open
pront wants to merge 6 commits into master from pavlos/standardize-grpc-health-check

@pront (Member) commented Apr 7, 2026

Summary

Replace the custom (unreleased) ObservabilityService/Health RPC on the observability API (port 8686) with the standard grpc.health.v1.Health service. This enables native Kubernetes gRPC health probes, grpc-health-probe, and other standard gRPC health-checking tooling to work out of the box.

  • Added tonic_health standard health service to the gRPC server with reflection support
  • Removed the now-unused running: Arc<AtomicBool> parameter from the service and server

This change is required for vectordotdev/helm-charts#540,
which adds default gRPC readiness probes on port 8686. Kubernetes native gRPC probes call the standard
grpc.health.v1.Health/Check RPC with an empty service name, which tonic_health handles by default.
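
As a rough illustration of the lookup semantics described above, here is a minimal std-only sketch. `HealthRegistry` and its methods are hypothetical names invented for this example; they are not tonic_health types and not the code in this PR. The sketch assumes the behaviour of the gRPC health checking protocol: the empty service name stands for the whole server, and a name that was never registered yields NOT_FOUND (modelled here as `None`).

```rust
use std::collections::HashMap;

/// Two of the serving states defined by grpc.health.v1.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ServingStatus {
    Serving,
    NotServing,
}

/// Hypothetical in-memory stand-in for a health service (illustration only).
pub struct HealthRegistry {
    statuses: HashMap<String, ServingStatus>,
}

impl HealthRegistry {
    /// Starts with the empty service name registered as SERVING,
    /// matching the default described in the summary.
    pub fn new() -> Self {
        let mut statuses = HashMap::new();
        statuses.insert(String::new(), ServingStatus::Serving);
        Self { statuses }
    }

    /// Records a new status for a service name.
    pub fn set_status(&mut self, service: &str, status: ServingStatus) {
        self.statuses.insert(service.to_string(), status);
    }

    /// Returns the recorded status; `None` models a NOT_FOUND gRPC status.
    pub fn check(&self, service: &str) -> Option<ServingStatus> {
        self.statuses.get(service).copied()
    }
}

fn main() {
    let mut registry = HealthRegistry::new();
    // A probe that sends an empty service name sees the whole-server slot.
    println!("{:?}", registry.check(""));
    // A name that was never registered is not found.
    println!("{:?}", registry.check("some.other.Service"));
    // Flipping the whole-server slot changes what the probe sees.
    registry.set_status("", ServingStatus::NotServing);
    println!("{:?}", registry.check(""));
}
```

Since a Kubernetes gRPC probe with no service configured sends the empty name, only the whole-server slot matters for the readiness check in this model.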

Vector configuration

No configuration changes required. The standard health service is automatically available on the
existing API address (default 0.0.0.0:8686).

How did you test this PR?

  • cargo check --features api,api-client passes
  • make check-clippy passes
  • make check-fmt passes
  • Verified tonic_health::server::health_reporter() responds to empty service name checks with
    SERVING by default (same pattern already used and tested in the vector source)

Note: it's safe to ignore the failing proto check here since the affected proto file is not released yet.

Change Type

  • New feature
  • Bug fix
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

pront and others added 2 commits April 7, 2026 14:12
…h service

Replace the custom ObservabilityService/Health RPC on the observability
API with the standard grpc.health.v1.Health service. This enables native
Kubernetes gRPC health probes, grpc-health-probe, and other standard
tooling to work out of the box on port 8686.

The vector-api-client now uses the standard health client for its
health() method, maintaining the same API surface for vector top/tap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot added the "domain: topology" label (Anything related to Vector's topology code) Apr 7, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pront added the "no-changelog" label (Changes in this PR do not need user-facing explanations in the release changelog) Apr 7, 2026
- Client::health() now checks response ServingStatus and returns
  NotServing error if not SERVING; uses named service
  "vector.observability.v1.ObservabilityService" instead of empty string
- Store health_reporter handle in the server task and set NOT_SERVING
  when shutdown signal arrives, so in-flight probes see the transition
- Fix stale doc example in lib.rs and grpcurl example in api.md
- Add health integration test in tests/vector_api/health.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
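
The first bullet of the commit message above says the client treats any status other than SERVING as an error. A minimal sketch of that mapping follows, assuming the four states of the grpc.health.v1 `ServingStatus` enum. `ensure_serving` and `HealthError` are hypothetical names for illustration and may not match the actual vector-api-client code.

```rust
/// The serving states defined by grpc.health.v1.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ServingStatus {
    Unknown,
    Serving,
    NotServing,
    ServiceUnknown,
}

/// Hypothetical client-side error (illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum HealthError {
    /// The server answered, but with a status other than SERVING.
    NotServing(ServingStatus),
}

/// Only SERVING counts as healthy; every other status becomes an error.
pub fn ensure_serving(status: ServingStatus) -> Result<(), HealthError> {
    match status {
        ServingStatus::Serving => Ok(()),
        other => Err(HealthError::NotServing(other)),
    }
}

fn main() {
    let all = [
        ServingStatus::Unknown,
        ServingStatus::Serving,
        ServingStatus::NotServing,
        ServingStatus::ServiceUnknown,
    ];
    for status in all {
        println!("{:?} -> {:?}", status, ensure_serving(status));
    }
}
```

Carrying the observed status inside the error lets a caller tell a draining server (NOT_SERVING) apart from an unexpected state without a second request.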
@pront pront marked this pull request as ready for review April 7, 2026 18:57
@pront pront requested review from a team as code owners April 7, 2026 18:57

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 447a821c92

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@domalessi self-assigned this Apr 7, 2026
pront and others added 2 commits April 7, 2026 16:54
Use only the empty service name ("") for health checks, which is the
standard whole-server health slot. This is what Kubernetes gRPC probes,
grpc-health-probe, and the Helm chart all query by default.

Removes the named service registration since there is only one service
on this port and no reason for service-specific health granularity today.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…grpc-health-check

# Conflicts:
#	lib/vector-api-client/src/client.rs

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cf0aeeeaf

Comment on lines 68 to +72
rx.await.ok();
// Signal NOT_SERVING before the server stops accepting requests,
// so in-flight health probes see the transition during graceful drain.
health_reporter
.set_service_status("", tonic_health::ServingStatus::NotServing)

P1: Flip gRPC health to NOT_SERVING before draining topology

This marks health as NOT_SERVING only inside the server shutdown future, after rx.await, which resolves when GrpcServer is dropped. On graceful shutdown, TopologyController::stop still awaits self.topology.stop().await before api_server is dropped, so the health endpoint stays SERVING during most of the drain window. With the old running-based health RPC removed in this commit, Kubernetes gRPC readiness probes can continue to pass until near process exit, keeping terminating pods routable longer than intended.
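
The ordering concern in this comment can be shown with a small std-only model. This is a sketch under stated assumptions, not Vector's shutdown code: `Health`, `shutdown_flip_first`, `shutdown_flip_last`, and `probe_sees_serving_during_drain` are hypothetical names, and the drain step is reduced to a closure that samples the health flag the way a probe arriving mid-drain would.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical shared health flag (illustration only).
#[derive(Clone)]
pub struct Health {
    serving: Arc<AtomicBool>,
}

impl Health {
    pub fn new() -> Self {
        Self { serving: Arc::new(AtomicBool::new(true)) }
    }

    pub fn is_serving(&self) -> bool {
        self.serving.load(Ordering::SeqCst)
    }

    pub fn set_not_serving(&self) {
        self.serving.store(false, Ordering::SeqCst);
    }
}

/// Ordering the review asks for: flip health first, then drain.
pub fn shutdown_flip_first<T>(health: &Health, drain: impl FnOnce() -> T) -> T {
    health.set_not_serving();
    drain()
}

/// Ordering the review describes as the problem: drain, then flip.
pub fn shutdown_flip_last<T>(health: &Health, drain: impl FnOnce() -> T) -> T {
    let out = drain();
    health.set_not_serving();
    out
}

/// Returns what a probe arriving in the middle of the drain would observe.
pub fn probe_sees_serving_during_drain(flip_first: bool) -> bool {
    let health = Health::new();
    let probe = health.clone();
    // The "drain" here only samples the flag, standing in for a probe.
    let drain = move || probe.is_serving();
    if flip_first {
        shutdown_flip_first(&health, drain)
    } else {
        shutdown_flip_last(&health, drain)
    }
}

fn main() {
    println!("flip first: probe sees serving = {}", probe_sees_serving_during_drain(true));
    println!("flip last:  probe sees serving = {}", probe_sees_serving_during_drain(false));
}
```

In this model, only the flip-first ordering makes a mid-drain probe fail, which is the behaviour the review asks for.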

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8584e75bd7

Comment on lines 68 to +72
rx.await.ok();
// Signal NOT_SERVING before the server stops accepting requests,
// so in-flight health probes see the transition during graceful drain.
health_reporter
.set_service_status("", tonic_health::ServingStatus::NotServing)

P1: Mark gRPC health NOT_SERVING before topology drain

The shutdown transition to NOT_SERVING is delayed until rx.await resolves inside serve_with_incoming_shutdown, which only happens when GrpcServer is dropped. On graceful shutdown, TopologyController::stop still awaits self.topology.stop().await before that drop, so readiness remains SERVING during most of the drain and Kubernetes gRPC probes can keep a terminating pod routable. New evidence in this commit: the previous running-based ObservabilityService::health path was removed, leaving no earlier readiness flip.

Labels

  • domain: topology (Anything related to Vector's topology code)
  • no-changelog (Changes in this PR do not need user-facing explanations in the release changelog)

2 participants