@too-common-name
Contributor

Context: Prometheus 3.0 changes

Starting from Prometheus 3.0.0-beta1, the mechanism to enable the OTLP Receiver has changed:

  • Legacy (< 3.0.0): Requires --enable-feature=otlp-write-receiver
  • Modern (>= 3.0.0): Requires --web.enable-otlp-receiver

The Prometheus Operator handles this logic automatically when enableOtlpHttpReceiver and Version are set. You can check it here at line 1190.
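As a rough illustration of the version switch described above, the following Go sketch picks the argument from a version string. It is not the Prometheus Operator's code: the function names are hypothetical, and the 2.47 lower bound is an assumption taken from the manual verification notes in this description (v2.45.0 skipped, v2.47.0 given the legacy flag).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from a version
// string such as "v2.47.0" or "3.0.0-beta.1". Hypothetical helper.
func parseMajorMinor(version string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected version %q", version)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// otlpReceiverArg returns the Prometheus argument that enables the OTLP
// receiver for the given version, or "" when the version is assumed to
// predate OTLP support (the 2.47 bound is an assumption, see above).
func otlpReceiverArg(version string) (string, error) {
	major, minor, err := parseMajorMinor(version)
	if err != nil {
		return "", err
	}
	switch {
	case major >= 3:
		return "--web.enable-otlp-receiver", nil
	case major == 2 && minor >= 47:
		return "--enable-feature=otlp-write-receiver", nil
	default:
		return "", nil
	}
}

func main() {
	for _, v := range []string{"v2.45.0", "v2.47.0", "v3.0.0"} {
		arg, err := otlpReceiverArg(v)
		fmt.Printf("%s -> %q (err: %v)\n", v, arg, err)
	}
}
```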

The Problem

The current implementation hardcodes the legacy feature flag:

EnableFeatures: []monv1.EnableFeature{"otlp-write-receiver"} // Old logic

This fails to enable OTLP on modern/default Prometheus versions (which require the new mechanism).

The Fix Logic

I switched the implementation to use the proper upstream API field:

EnableOTLPReceiver: config.EnableOtlpHttpReceiver
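To make the before/after concrete, here is a minimal Go sketch of the two ways of building the spec. The types prometheusSpec and stackConfig are hypothetical stand-ins, not the real monv1 or MonitoringStack types, and the pointer-to-bool field types are an assumption. Whether the old code guarded the feature flag on the user's setting is not shown in this description; the sketch assumes it did.

```go
package main

import "fmt"

// prometheusSpec is a hypothetical stand-in for the relevant subset of the
// Prometheus CR spec. It is not the real monv1 type.
type prometheusSpec struct {
	EnableFeatures     []string
	EnableOTLPReceiver *bool
}

// stackConfig is a hypothetical stand-in for the MonitoringStack
// Prometheus configuration.
type stackConfig struct {
	EnableOtlpHttpReceiver *bool
}

// oldSpec mirrors the previous behavior: the legacy feature flag is
// hardcoded, so the Prometheus Operator cannot adapt it to the version.
func oldSpec(cfg stackConfig) prometheusSpec {
	spec := prometheusSpec{}
	if cfg.EnableOtlpHttpReceiver != nil && *cfg.EnableOtlpHttpReceiver {
		spec.EnableFeatures = []string{"otlp-write-receiver"}
	}
	return spec
}

// newSpec forwards the user's setting through the dedicated field and
// leaves the choice of command-line argument to the Prometheus Operator.
func newSpec(cfg stackConfig) prometheusSpec {
	return prometheusSpec{EnableOTLPReceiver: cfg.EnableOtlpHttpReceiver}
}

func main() {
	enabled := true
	cfg := stackConfig{EnableOtlpHttpReceiver: &enabled}

	before := oldSpec(cfg)
	after := newSpec(cfg)
	fmt.Printf("old: features=%v\n", before.EnableFeatures)
	fmt.Printf("new: features=%v enableOTLPReceiver=%v\n",
		after.EnableFeatures, *after.EnableOTLPReceiver)
}
```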

This enables the downstream operator to manage the configuration dynamically. However, this introduced a regression for legacy versions.
Because the Observability Operator was not passing the Version field to the Prometheus CR, the downstream operator defaulted to its internal version (>= 3.0.0).

  • Scenario: A user runs image v2.45.0 (Legacy).
  • Behavior: Operator assumes v3.0.0, sees EnableOTLPReceiver=true, and injects the modern flag (--web.enable-otlp-receiver).
  • Result: Prometheus v2.45.0 crashes.

To address this, the PR implements a version inference strategy, in particular:

  1. Use EnableOTLPReceiver: Allows the downstream operator to handle modern flags correctly.
  2. Infer Version: The reconciler now parses the version from the RELATED_IMAGE_PROMETHEUS tag and explicitly sets the Version field in the Prometheus CR.
  3. Result: The downstream operator sees Version: "2.45.0", correctly falls back to the legacy feature flag, and the crash is avoided.

Architectural Note / Limitations

This fix relies on parsing the image tag (e.g., extracting 2.45.0 from quay.io/...:v2.45.0).

  • Standard Usage: Works perfectly for pinned version tags.
  • Custom Images: If a user provides a custom image with a non-semantic tag (e.g., :latest or :custom), parsing will fail, and behavior falls back to the default (assuming 3.0.0).
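The tag parsing and its failure mode could look roughly like the sketch below. This is not the code from the PR: versionFromImage and the image references are hypothetical, and the regular expression is only one possible definition of a "semantic" tag.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// semverTag matches tags that start with MAJOR.MINOR.PATCH, with an
// optional leading "v". Any suffix (e.g. "-rc.0") is ignored.
var semverTag = regexp.MustCompile(`^v?(\d+\.\d+\.\d+)`)

// versionFromImage returns the version encoded in the image tag, or
// ("", false) when the reference has no tag or a non-semantic one.
// Hypothetical helper, shown for illustration only.
func versionFromImage(image string) (string, bool) {
	// Drop a digest suffix such as "@sha256:...".
	if i := strings.Index(image, "@"); i >= 0 {
		image = image[:i]
	}
	// A tag is introduced by a ':' that appears after the last '/', so a
	// registry port (e.g. "localhost:5000/prometheus") is not a tag.
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return "", false
	}
	m := semverTag.FindStringSubmatch(image[colon+1:])
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	images := []string{
		"registry.example/prometheus:v2.45.0", // pinned version tag
		"registry.example/prometheus:latest",  // non-semantic tag
		"localhost:5000/prometheus",           // port, no tag
	}
	for _, img := range images {
		version, ok := versionFromImage(img)
		fmt.Printf("%s -> version=%q parsed=%v\n", img, version, ok)
	}
}
```

When the second return value is false, a caller would leave the Version field unset, which is the fallback to the default described in the Custom Images bullet.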

Suggestion: To fully support custom images, we should update the MonitoringStack CRD to accept an explicit spec.prometheusConfig.version field.

Verification

Manual verification on OCP

  1. OTLP unsupported test: Configured operator with Prometheus v2.45.0. Applied stack with OTLP enabled. The Prometheus Operator skipped the enablement since OTLP is not supported.
  2. Legacy test: Configured operator with Prometheus v2.47.0. Applied stack with OTLP enabled. Pod started successfully. Downstream operator applied legacy feature flag.
  3. Modern test: Configured operator with default image. Applied stack with OTLP enabled. Pod started successfully. Downstream operator applied modern argument.

Tests

  • Added E2E test Assert_OTLP_receiver_flag_is_set_when_enabled_in_CR to verify the happy path (Default/Modern Version).
  • Updated managed_fields expectations to reflect that enableOTLPReceiver is now a managed field.

Checklist

[x] Fixes OTLP configuration for modern Prometheus (v3.0+)
[x] Fixes regression/crash for legacy Prometheus (v2.x)
[x] Verified manually on cluster
[x] Updated E2E tests
[x] Ran make lint

@openshift-ci openshift-ci bot requested review from jan--f and slashpai November 27, 2025 13:03
@openshift-ci

openshift-ci bot commented Nov 27, 2025

Hi @too-common-name. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@simonpasquier
Contributor

I think that we can just default to the new way to enable the OTEL receiver because we control the Prometheus + operator versions in COO.

@too-common-name
Contributor Author

That is true for the default configuration. However, the reason I added this inference logic is that I was able to reproduce a hard crash when manually overriding the image.
When I explicitly set the image to v2.45.0 (via --images), the downstream Prometheus Operator didn't know the version had changed. It defaulted to the latest logic and passed the modern OTLP flag, causing the v2.45.0 pod to crash.

@simonpasquier
Contributor

We don't support custom images.

@too-common-name
Contributor Author

Ok, I will remove the version inference. Thank you Simon!

@too-common-name too-common-name force-pushed the fix/prometheus-otlp-web-receiver branch from eef773b to f30338c Compare December 1, 2025 09:51
@too-common-name too-common-name force-pushed the fix/prometheus-otlp-web-receiver branch from f30338c to cef2c6e Compare December 2, 2025 14:04
Contributor

@simonpasquier simonpasquier left a comment

/lgtm

I think that, for better coverage, we should exercise the OTLP endpoint in the e2e tests, but I can take that on in a follow-up PR.

@simonpasquier
Contributor

simonpasquier commented Dec 4, 2025

/retitle COO-1384: fix(monitoringstack): correctly configure OTLP receiver

@openshift-ci openshift-ci bot changed the title from "fix(monitoringstack): correctly configure OTLP receiver for both legacy and modern Prometheus versions" to "COO-1384: fix(monitoringstack): correctly configure OTLP receiver" Dec 4, 2025
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Dec 4, 2025

@too-common-name: This pull request references COO-1384 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.21.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@simonpasquier
Contributor

/approve

@openshift-ci

openshift-ci bot commented Dec 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier, too-common-name

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Dec 4, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 3c84569 into rhobs:main Dec 4, 2025
11 checks passed