Slow task execution due to long retry backoff in ReplicationThrottleHelper #2147

eazama · 2024-05-02T23:48:47Z

I've observed extremely long gaps between executions of task batches. Based on the logs, these gaps appear to be the result of ReplicationThrottleHelper submitting requests to add or remove the replication throttle rate configurations and then waiting for the change to be reflected on the broker. Specifically, there is a 10-second gap between each broker's configuration change. This can add up to minutes of idle time, even on small 6- or 12-node clusters.

ReplicationThrottleHelper waits for the changes by calling CruiseControlMetricsUtils.retry, specifically the overload that uses the default backoff configuration of scale=5 seconds and base=2.

I assume that the first describe request doesn't return the expected configurations for some reason, resulting in the retry loop triggering the first 10 second backoff.
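To make the arithmetic concrete, here is a minimal sketch of the backoff schedule those defaults would produce. It assumes the delay is multiplied by the base before each sleep, so the wait before retry n is scale * base^n; that model is inferred from the 10-second gap in the logs, not copied from the Cruise Control source. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffSchedule {
    // Assumed model (not taken from the Cruise Control source): the delay is
    // multiplied by `base` before every sleep, so the wait before retry n
    // (1-based) is scaleMs * base^n.
    public static long delayBeforeRetryMs(long scaleMs, int base, int retry) {
        long delay = scaleMs;
        for (int i = 0; i < retry; i++) {
            delay *= base;
        }
        return delay;
    }

    // Collects the waits before retries 1..retries.
    public static List<Long> schedule(long scaleMs, int base, int retries) {
        List<Long> delays = new ArrayList<>();
        for (int retry = 1; retry <= retries; retry++) {
            delays.add(delayBeforeRetryMs(scaleMs, base, retry));
        }
        return delays;
    }

    public static void main(String[] args) {
        // Defaults quoted in this issue: scale = 5 seconds, base = 2.
        System.out.println(schedule(5_000L, 2, 4)); // [10000, 20000, 40000, 80000]
    }
}
```

Under that assumption, a single failed describe check costs 10 seconds per broker, which lines up with the gaps in the logs.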

10 seconds seems like an excessive amount of time to wait for the first retry, so it would be nice if the retry loop started with a much smaller scale, somewhere on the order of a few milliseconds. Because the exponential backoff has no cap, starting with a small scale is somewhat necessary to prevent the backoff from becoming unreasonably large after only one or two retries.
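As a rough illustration of the kind of loop I have in mind, the sketch below starts with a small delay and also caps the backoff. The cap is an extra beyond what is described above, and `CappedRetry`, `waitFor`, and `nextDelayMs` are hypothetical names, not existing Cruise Control APIs.

```java
import java.util.function.Supplier;

public class CappedRetry {
    // Grows the delay exponentially but never beyond maxDelayMs.
    public static long nextDelayMs(long currentMs, int base, long maxDelayMs) {
        return Math.min(currentMs * base, maxDelayMs);
    }

    // Hypothetical helper: polls `condition` until it returns true or
    // maxAttempts checks have been made. Returns true if the condition was met.
    public static boolean waitFor(Supplier<Boolean> condition, long initialDelayMs,
                                  int base, long maxDelayMs, int maxAttempts) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (condition.get()) {
                return true;
            }
            if (attempt == maxAttempts) {
                break; // no point sleeping after the final check
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
            delay = nextDelayMs(delay, base, maxDelayMs);
        }
        return false;
    }

    public static void main(String[] args) {
        int[] checks = {0};
        // Stand-in for "broker config matches expected": true on the third check.
        boolean ok = waitFor(() -> ++checks[0] >= 3, 10L, 2, 1_000L, 10);
        System.out.println(ok + " after " + checks[0] + " checks");
    }
}
```

With a 10 ms initial delay and a 1 second cap, a change that shows up on the second or third check costs tens of milliseconds instead of 10 seconds or more.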


mhratson · 2024-09-09T02:53:40Z

@eazama do you mind adding additional evidence like logs?

eazama linked a pull request May 2, 2024 that will close this issue

Update ReplicationThrottleHelper.waitForConfigs to use a more reasonable retry loop #2148

Open
