COO-1384: fix(monitoringstack): correctly configure OTLP receiver #943
Conversation
Hi @too-common-name. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I think that we can just default to the new way to enable the OTEL receiver because we control the Prometheus + operator versions in COO.
That is true for the default configuration. However, the reason I added this inference logic is that I was able to reproduce a hard crash when manually overriding the image.
We don't support custom images.
Ok, I will remove the version inference. Thank you Simon! |
simonpasquier left a comment
/lgtm
I think that for better coverage we should exercise the OTLP endpoint in the e2e tests, but I can take it in a follow-up PR.
/retitle COO-1384: fix(monitoringstack): correctly configure OTLP receiver
@too-common-name: This pull request references COO-1384 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier, too-common-name

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Context: Prometheus 3.0 changes
Starting from Prometheus 3.0.0-beta1, the mechanism to enable the OTLP receiver has changed: the `--enable-feature=otlp-write-receiver` feature flag was replaced by the `--web.enable-otlp-receiver` flag. The Prometheus Operator is able to handle this logic automatically if `enableOtlpHttpReceiver` and `Version` are set. You can check it here at line 1190.

The Problem
The current implementation was hardcoding the legacy feature flag:
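The original snippet did not survive the copy into this description; a minimal sketch of the hardcoded approach, assuming the Prometheus CR spec is built with the prometheus-operator Go API (package and function names here are illustrative, not the repo's actual ones), might look like:

```go
package stack

import (
	monv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)

// legacyOTLPConfig sketches the previous behaviour: the pre-3.0 feature flag
// is appended unconditionally, regardless of which Prometheus version the
// image actually runs.
func legacyOTLPConfig(spec *monv1.PrometheusSpec) {
	spec.EnableFeatures = append(spec.EnableFeatures, "otlp-write-receiver")
}
```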
This fails to enable OTLP on modern/default Prometheus versions (which require the new mechanism).
The Fix Logic
I switched the implementation to use the proper upstream API field:
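The exact diff is not reproduced here; a sketch of the new approach, using the `EnableOTLPReceiver` field named in this description (helper name hypothetical), could look like:

```go
package stack

import (
	monv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/utils/ptr"
)

// configureOTLPReceiver only records the intent on the Prometheus CR; the
// Prometheus Operator then chooses the right mechanism (feature flag vs.
// dedicated flag) based on the version it believes it is deploying.
func configureOTLPReceiver(spec *monv1.PrometheusSpec, enabled bool) {
	if enabled {
		spec.EnableOTLPReceiver = ptr.To(true)
	}
}
```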
This enables the downstream operator to manage the configuration dynamically. However, this introduced a regression for legacy versions.
Because the Observability Operator was not passing the `Version` field to the `Prometheus` CR, the downstream operator defaulted to its internal version (>= 3.0.0). To address this, the PR implements a version inference strategy, in particular:

- Sets `EnableOTLPReceiver`: allows the downstream operator to handle modern flags correctly.
- Parses the `RELATED_IMAGE_PROMETHEUS` tag and explicitly sets the `Version` field in the `Prometheus` CR (a sketch of this step follows the list).
- With `Version: "2.45.0"`, the downstream operator correctly falls back to the legacy feature flag, and the crash is avoided.
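A rough sketch of the inference step, under the assumption that the image reference ends in a `vX.Y.Z` tag (function name and the semver library choice are illustrative, not necessarily what the PR uses):

```go
package stack

import (
	"strings"

	"github.com/blang/semver/v4"
)

// inferPrometheusVersion extracts a semver version from an image reference
// such as "quay.io/prometheus/prometheus:v2.45.0". It returns "" when the tag
// is not a parsable version (e.g. ":latest" or a digest), in which case the
// Version field is left unset and the downstream default (>= 3.0.0) applies.
func inferPrometheusVersion(image string) string {
	idx := strings.LastIndex(image, ":")
	if idx < 0 {
		return ""
	}
	tag := strings.TrimPrefix(image[idx+1:], "v")
	if _, err := semver.Parse(tag); err != nil {
		return ""
	}
	return tag
}
```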
Architectural Note / Limitations

- This fix relies on parsing the image tag (e.g., extracting `2.45.0` from `quay.io/...:v2.45.0`).
- If the tag is not a parsable version (e.g., `:latest` or `:custom`), parsing will fail, and behavior falls back to the default (assuming 3.0.0).
- Suggestion: to fully support custom images, we should update the `MonitoringStack` CRD to accept an explicit `spec.prometheusConfig.version` field.

Verification
Manual verification on OCP
- `v2.45.0`: applied stack with OTLP enabled. The Prometheus Operator skipped the enablement since OTLP is not supported.
- `v2.47.0`: applied stack with OTLP enabled. Pod started successfully. The downstream operator applied the legacy feature flag.

Tests
- `Assert_OTLP_receiver_flag_is_set_when_enabled_in_CR`: verifies the happy path (default/modern version); a rough standalone sketch of the assertion follows.
- Updated the `managed_fields` expectations to reflect that `enableOTLPReceiver` is now a managed field.
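The repository's e2e helpers are not reproduced here; a standalone approximation of the check (names hypothetical, using controller-runtime directly) might be:

```go
package e2e

import (
	"context"
	"testing"

	monv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assertOTLPReceiverEnabled checks that the generated Prometheus CR carries
// enableOTLPReceiver=true once the MonitoringStack requests OTLP.
func assertOTLPReceiverEnabled(t *testing.T, c client.Client, name, ns string) {
	t.Helper()
	var prom monv1.Prometheus
	key := types.NamespacedName{Name: name, Namespace: ns}
	if err := c.Get(context.Background(), key, &prom); err != nil {
		t.Fatalf("get Prometheus %s/%s: %v", ns, name, err)
	}
	if prom.Spec.EnableOTLPReceiver == nil || !*prom.Spec.EnableOTLPReceiver {
		t.Fatalf("expected enableOTLPReceiver=true on Prometheus %s/%s", ns, name)
	}
}
```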
Checklist

- [x] Fixes OTLP configuration for modern Prometheus (v3.0+)
- [x] Fixes regression/crash for legacy Prometheus (v2.x)
- [x] Verified manually on cluster
- [x] Updated E2E tests
- [x] Ran `make lint`