Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/re-allow multiple workers #36134

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

bmiguel-teixeira
Copy link
Contributor

@bmiguel-teixeira bmiguel-teixeira commented Nov 1, 2024

Description

This MR does the following:

  • Re adds the ability to allow multiple workers in this exporter due to:
  1. Out of Order is no longer an issue now that it is fully supported in Prometheus. Nonetheless, I am setting the default worker as 1 to avoid OoO in Vanilla Prometheus Settings.

  2. With a single worker, and for a collector with a large load, this becomes "blocking". Example: Imagine a scenario in which a collector is collecting lots of targets, and with a slow prometheus/unstable network, a single worker can easily bottleneck the off-shipping if retries are enabled.

Link to tracking issue

N/A

Testing

Documentation

docs auto-updated. Readme.md is now correct in its explanation of the `num_consumers since its no longer hard-coded at 1. Additional docs added.

@dashpole
Copy link
Contributor

dashpole commented Nov 1, 2024

cc @jmichalek132 @ArthurSens

@dashpole dashpole added the enhancement New feature or request label Nov 1, 2024
@dashpole
Copy link
Contributor

dashpole commented Nov 1, 2024

I'll be OOO for the next few weeks, so I won't be able to review until then.

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @bmiguel-teixeira! The contributions sound solid, but I'm a bit concerned that we're doing two separate things in one single PR here. Generally that's not a good practice because in case we have to revert one thing, we end up reverting more than what's needed, not to mention that it makes the PR biggers and a bit harder to review

Could you choose only one new functionality for this PR and open another one for the other?


Regarding the changes, could you add tests for the telemetry added? Send PRW requests to a mock server and assert that the metric you've added increments as you expected.

For allowing multiple workers, it would be nice if we add extra documentation making it clear that Out-of-order needs to be enabled in Prometheus for it to work :)

@bmiguel-teixeira
Copy link
Contributor Author

Sure. I will open a dedicated PR with the additional telemetry and keep de queue changes in this one which already has context.

@bmiguel-teixeira
Copy link
Contributor Author

hi @ArthurSens

Removed the additional telemetry to be added in a secondary PR. Also added a bit of docs to explain the toggle and use case. Please take a look

Cheers

@bmiguel-teixeira bmiguel-teixeira changed the title feat/add outbound metrics and re-allow multiple workers feat/re-allow multiple workers Nov 13, 2024
Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the code is simple so it does LGTM, but I'm struggling to test this.

Do you have any examples of apps/configs that will generate out of order datapoints? All attempts I've tried provide things in order so I can't be sure if this is working as expected 😅

exporter/prometheusremotewriteexporter/README.md Outdated Show resolved Hide resolved
@ArthurSens
Copy link
Member

(there's a linting failure)

@bmiguel-teixeira
Copy link
Contributor Author

Hi @ArthurSens

Just submitted your recomendation to fix the spelling issues.

In regards to testing and simulating locally the out of order issues.
Here is my setup.

Prometheus Config

global:
  scrape_interval: 1s 
  evaluation_interval: 1s

#storage:
#  tsdb:
#    out_of_order_time_window: 10m

Otel Config

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'node-1'
        scrape_interval: 1s
        static_configs:
          - targets: ['127.0.0.1:8081']

exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: XXX
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 5s
      max_elapsed_time: 30s

service:
  telemetry:
    logs:
      level: "DEBUG"

  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [prometheusremotewrite]

Scenario 1: (Vanilla)

  • Prometheus with NO out of order window.
  • PrometheusRemoteWrite with 1 consumer (or queue disabled)
  1. Setup environment (let it run few couple minutes)
  2. Stop Prometheus
  3. Wait couple seconds for timeout errors
  4. Restart Prometheus (ensure you still have old data)

Outcome: Metrics are reingested, in order with single worker, ALL GOOD.

Scenario 2 (PrometheusRemoteWrite 5 Consumers + Vanilla Prometheus)

  • Prometheus with NO out of order window.
  • PrometheusRemoteWrite with 5 consumers
  1. Setup environment (let it run few couple minutes)
  2. Stop Prometheus
  3. Wait couple seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have old data)

Outcome: OutOfOrder Errors in Prometheus after boot up, since samples will be re-tried with no specific order.

Scenario 3 (PrometheusRemoteWrite 5 Consumers + Prometheus with OutOfOrder Window)

  • Prometheus with WITH out of order window set at 10m
  • PrometheusRemoteWrite with 5 consumers
  1. Setup environment (let it run few couple minutes)
  2. Stop Prometheus
  3. Wait couple seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have old data)

Outcome: No OutOfOrder issues because even though we have multiple workers, Prometheus can accept mixed/old samples and update tsdb until 10 minutes "ago".

Let me know if you need any more info.

Cheers.

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect; thank you for the instructions! I just tested it, and it works perfectly. There's just one CI failure that we need to fix before approving here.

Could you run make generate and commit the changes?

@bmiguel-teixeira
Copy link
Contributor Author

@ArthurSens All done!

@@ -87,6 +90,10 @@ var _ component.Config = (*Config)(nil)

// Validate checks if the exporter configuration is valid
func (cfg *Config) Validate() error {
if cfg.MaxBatchRequestParallelism <= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like 0 should also not be valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. nice catch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if enableMultipleWorkersFeatureGate.IsEnabled() {
concurrency = cfg.MaxBatchRequestParallelism
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to always always use cfg.MaxBatchRequestParallelism, if it is set by the user. That way, existing users can migrate from RemoteWriteQueue.NumConsumers to cfg.MaxBatchRequestParallelism right away. If the feature gate is disabled, and a user has set NumConsumers to a non-default value, it would be nice to emit a warning instructing them to migrate. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. please check if you agree

@ArthurSens
Copy link
Member

@edma2 found a bug when batching time series with multiple workers. Here is a PR trying to fix it: #36524

Should we solve the bug before re-allowing multiple workers?

This exporter has feature gate: `+exporter.prometheusremotewritexporter.EnableMultipleWorkers`.

When this feature gate is enabled, `num_consumers` will be used as the worker counter for handling batches from the queue, and `max_batch_request_parallelism` will be used for parallelism on single batch bigger than `max_batch_size_bytes`.
Enabling this feature gate, with `num_consumers` higher than 1 requires the target destination to supports ingestion of OutOfOrder samples. See [Multiple Consumers and OutOfOrder](#multiple-consumers-and-outoforder) for more info
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this value needs to default to 1 if this feature gate is enabled. When the feature gate is ultimately removed I don't think it appropriate to have a default value that is known to not work with any number of potential receivers and would not work with the default configuration of those receivers that are known to be able to support it in some configurations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aggree that will help with transition, and that the exporter should have "sane" defaults at all times.
done. please check

return fmt.Errorf("max_batch_request_parallelism can't be set to below 1")
}
if enableMultipleWorkersFeatureGate.IsEnabled() && cfg.MaxBatchRequestParallelism == nil {
return fmt.Errorf("enabling featuregate `+exporter.prometheusremotewritexporter.EnableMultipleWorkers` requires setting `max_batch_request_parallelism` in the configuration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to require people to set this.

Comment on lines 123 to 133
concurrency := cfg.RemoteWriteQueue.NumConsumers
if enableMultipleWorkersFeatureGate.IsEnabled() || cfg.MaxBatchRequestParallelism != nil {
concurrency = *cfg.MaxBatchRequestParallelism
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
concurrency := cfg.RemoteWriteQueue.NumConsumers
if enableMultipleWorkersFeatureGate.IsEnabled() || cfg.MaxBatchRequestParallelism != nil {
concurrency = *cfg.MaxBatchRequestParallelism
}
concurrency := 5
if !enableMultipleWorkersFeatureGate.IsEnabled() {
concurrency = cfg.RemoteWriteQueue.NumConsumers
}
if cfg.MaxBatchRequestParallelism != nil {
concurrency = *cfg.MaxBatchRequestParallelism
}

We want:

  • A default of 5
  • To always use MaxBatchRequestParallelism if it is set
  • To only use NumConsumers if the feature gate is disabled.

This suggestion accomplishes that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way it will be a "change in behavior". There may be some users that set the cfg.RemoteWriteQueue.NumConsumers to increase the concurrency. This way, it will always be set to 5 unless they explicitely set *cfg.MaxBatchRequestParallelism.

Hence why I checked for the featuregate above to make sure *cfg.MaxBatchRequestParallelism was explicitly set by the user, and if not, default to cfg.RemoteWriteQueue.NumConsumers.

I may have misunderstood but I thought the plan was to keep full retro compability.
Are we okay with setting a default to 5, and if users want to increased parallelism they need to set *cfg.MaxBatchRequestParallelism?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will only be a change in behavior once the feature gate is enabled. It shouldn't be a breaking change in this PR (unless i'm mistaken). The reason we have a feature gate at all is to make the migration from the breaking change easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i missed the gate difference. ill update this over the weekend

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please check

exporter/prometheusremotewriteexporter/factory.go Outdated Show resolved Hide resolved
@ArthurSens
Copy link
Member

Hey @bmiguel-teixeira, once #36601 is merged, I think we should be safe to proceed with multiple workers again :)

Could you rebase your PR on top of main once that happens?

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reading the changes in README, I'm confused about the difference between num_consumers and max_batch_request_parallelism.

Both mention that they configure the number of workers... so they fight each other to decide the number of workers? Do they complement each other?


By the way, thanks for the patience here, a lot was done to make this possible :)


See for more info:
- https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb
- https://prometheus.io/docs/prometheus/latest/feature_flags/#remote-write-receiver
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- https://prometheus.io/docs/prometheus/latest/feature_flags/#remote-write-receiver

This page doesn't contain any information about receiving out of order samples, does it? At least I couldn't find it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upstream doc may have hanged. will remove this entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@github-actions github-actions bot removed the Stale label Jan 4, 2025
Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README is still a bit confusing to me, but I'm fine merging if the other codeowners are ok with it

@bmiguel-teixeira
Copy link
Contributor Author

Hi @ArthurSens

No worries on the delays.
The plan, as discussed with @dashpole, is to start the process of transitioning out of the hard-coded values and into the ability to configure both workers and single request parallelism (for large payloads of single-requests) individually.

If max_batch_request_parallelism is defined, this is the value used for single-requests parallelism regardless of anything else. Users can start moving to this value starting from this PR to keep same behaviour.

Then we have feature gate for controlling the parallelism.
If MultipleWorkersFeatureGate is enabled, then numConsumers enables multiple workers/consumers from the metrics queue, and single request parallelism can be set with max_batch_request_parallelism.

If the gate is absent, we keep the default behaviour with single worker thread and with numConsumer controlling single request parallelism.

Hopefully this helps

@@ -66,6 +66,8 @@ The following settings can be optionally configured:
- `max_batch_size_bytes` (default = `3000000` -> `~2.861 mb`): Maximum size of a batch of
samples to be sent to the remote write endpoint. If the batch size is larger
than this value, it will be split into multiple batches.
- `max_batch_request_parallelism` (default = `5`): Maximum parallelism allowed for a single request bigger than `max_batch_size_bytes`.
This configuration is used only when feature gate `exporter.prometheusremotewritexporter.EnableMultipleWorkers` is enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation differs from this documentation. It seems like this config option will be used if set regardless of the state of the feature gate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants