Speculative retry for some cluster commands. #2618

gdusbabek · 2024-02-13T20:46:23Z

Has the idea of implementing speculative retry for non-mutating commands in cluster mode been considered before? (I couldn't find it in the search.)

This may be a very specialized situation, but in some read-heavy workloads I'm seeing patterns like this at the 99.9th percentile (concurrency in this diagram is 3 for the sake of simplicity):

Time ---->
MGET < 5 slots worth of keys >:
AsyncCommand 1: |---|
AsyncCommand 2: |--|
AsyncCommand 3: |-----------------------------------------------------------|
AsyncCommand 4:     |--|
AsyncCommand 5:      |---|

Basically, the 3rd command is taking a long time for $SOME_REASON, and we have reason to believe that if it were just retried it would complete in a duration on par with other slots. It would be nice to have a setting that would retry it WITHOUT retrying all of the other slots. Of course, the alternative is to manage the process ourselves: set the command timeout to be very low and retry the entire operation.

If the maintainers see any merit in a patch that would accomplish this, I would be happy to contribute it with some guidance.

FWIW, I've spent some time digging around RedisAdvancedClusterAsyncCommandsImpl and AsyncCommand and it appears this would be a non-trivial change.
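
To make the idea concrete, here is a minimal sketch of per-slot speculative retry built only on `CompletableFuture`. It is not Lettuce code: `SpeculativeRetry`, `hedge`, and the `attempt` supplier are hypothetical names, and the supplier is assumed to issue one slot's read-only command and return its future. If the first attempt has not completed after `hedgeDelayMillis`, a second attempt is started and whichever completes first (successfully or not) decides the result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; not part of any Redis client. */
public final class SpeculativeRetry {

    private SpeculativeRetry() {}

    /**
     * Starts one attempt immediately and a backup attempt after hedgeDelayMillis
     * if no result is available yet. The first completion wins, whether it is a
     * value or an error. Only safe for non-mutating (idempotent) commands.
     */
    public static <T> CompletableFuture<T> hedge(
            Supplier<CompletableFuture<T>> attempt,
            long hedgeDelayMillis,
            ScheduledExecutorService scheduler) {

        CompletableFuture<T> result = new CompletableFuture<>();

        // First attempt: forward its outcome to the shared result.
        attempt.get().whenComplete((value, error) -> complete(result, value, error));

        // Backup attempt: only issued if the first one is still pending.
        ScheduledFuture<?> backup = scheduler.schedule(() -> {
            if (!result.isDone()) {
                attempt.get().whenComplete((value, error) -> complete(result, value, error));
            }
        }, hedgeDelayMillis, TimeUnit.MILLISECONDS);

        // Once a result exists, the pending backup is no longer needed.
        result.whenComplete((value, error) -> backup.cancel(false));
        return result;
    }

    // complete()/completeExceptionally() are no-ops once the result is already set.
    private static <T> void complete(CompletableFuture<T> result, T value, Throwable error) {
        if (error != null) {
            result.completeExceptionally(error);
        } else {
            result.complete(value);
        }
    }
}
```

In the diagram above, only command 3 would trigger its backup attempt; the other slots complete before the hedge delay and are never retried. The sketch does not cancel the slow attempt, and the backup would presumably have to go out on a different connection to avoid queuing behind the slow command.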

mp911de (Maintainer) · 2024-02-14T08:36:31Z

We see similar reports: a command suddenly takes longer than it should, and we cannot pinpoint what is causing the increase in latency.

We cannot easily solve this issue, as a command can only complete once the command sent before it on the same connection has completed. Such a retry approach would require a fresh Redis connection (TCP). In a fully-fledged SSL scenario, I'm not sure that the TCP, SSL, and HELLO handshakes would complete quicker than waiting for the command to complete. Obviously, if the hanging command lingers around for a minute, then we might have a different problem, though.

> set the command timeout to be very low and retry the entire operation.

I think this is a reasonable approach to ensure application responsiveness.
Instead of adding more complexity, if you're able to figure out what prevents commands from completing within the expected timeout, then we're happy to fix the cause if it is on our side.
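
For illustration, the timeout-and-retry approach could look roughly like the following on the application side. This is a sketch under assumptions, not client code: `TimeoutRetry`, `withRetry`, and the `operation` supplier are hypothetical names, the supplier is assumed to start the whole multi-slot read and return its future, and the timeout here is applied with `CompletableFuture.orTimeout` (Java 9+) rather than the client's own timeout setting.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only. */
public final class TimeoutRetry {

    private TimeoutRetry() {}

    /**
     * Runs the whole operation with a short timeout and starts it again, up to
     * maxAttempts times in total, if an attempt fails or times out.
     * Only appropriate for non-mutating operations.
     */
    public static <T> CompletableFuture<T> withRetry(
            Supplier<CompletableFuture<T>> operation,
            Duration timeout,
            int maxAttempts) {

        // Fail the attempt with a TimeoutException if it takes too long.
        CompletableFuture<T> attempt =
                operation.get().orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS);

        if (maxAttempts <= 1) {
            return attempt; // last attempt: report whatever happens
        }

        // On any failure (including timeout), retry the entire operation.
        return attempt
                .<CompletableFuture<T>>handle((value, error) -> {
                    if (error == null) {
                        return CompletableFuture.completedFuture(value);
                    }
                    return withRetry(operation, timeout, maxAttempts - 1);
                })
                .thenCompose(next -> next);
    }
}
```

Note that this retries on any error, not only on timeouts, and that a timed-out attempt is abandoned rather than cancelled, so its command may still be in flight on the connection when the retry is issued.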
