Skip to content

Make map operations deterministic in quorum queues #13971

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 4, 2025
Merged

Make map operations deterministic in quorum queues #13971

merged 4 commits into from
Jun 4, 2025

Conversation

ansd
Copy link
Member

@ansd ansd commented May 28, 2025

Prior to this commit map iteration order was undefined in quorum queues and could therefore be different on different RabbitMQ nodes.

Example:

OTP 26.2.5.3

Erlang/OTP 26 [erts-14.2.5.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V14.2.5.3 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
4,25,8,1,23,10,7,9,11,12,28,24,13,3,18,29,26,22,19,2,33,21,32,20,17,30,14,5,6,27,16,31,15,ok

OTP 27.3.3

Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V15.2.6 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
18,4,12,19,29,13,2,7,31,8,10,23,9,15,32,1,25,28,20,6,11,17,24,14,33,3,16,30,21,5,27,26,22,ok

This can lead to non-determinism on different members. For example, different members could potentially return messages in a different order.

This commit introduces a new machine version fixing this bug.

The property test added by this PR shows that non-determinism could have occurred (prior to this PR) even on different nodes running the same OTP version (tested on macOS and OTP 27.3.3). Such a test failure can be reproduced by changing the rabbit_fifo machine version from 6 to 5 in the new test case.

I also tested manually that the new test succeeds if the CT node runs OTP 27.3.3 and another locally started Erlang node runs OTP 26.2.5.3.

ansd added 2 commits June 3, 2025 19:22
Prior to this commit map iteration order was undefined in quorum queues
and could therefore be different on different versions of Erlang/OTP.

Example:

OTP 26.2.5.3
```
Erlang/OTP 26 [erts-14.2.5.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V14.2.5.3 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
4,25,8,1,23,10,7,9,11,12,28,24,13,3,18,29,26,22,19,2,33,21,32,20,17,30,14,5,6,27,16,31,15,ok
```

OTP 27.3.3
```
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Eshell V15.2.6 (press Ctrl+G to abort, type help(). for help)
1> maps:foreach(fun(K, _) -> io:format("~b,", [K]) end, maps:from_keys(lists:seq(1, 33), ok)).
18,4,12,19,29,13,2,7,31,8,10,23,9,15,32,1,25,28,20,6,11,17,24,14,33,3,16,30,21,5,27,26,22,ok
```

This can lead to non-determinism on different members. For example, different
members could potentially return messages in a different order.

This commit introduces a new machine version fixing this bug.
This commit adds a property test that applies the same Ra commands in
the same order on two different Erlang nodes. The state in which both nodes end
up should be exactly the same.

Ideally, the two nodes should run different OTP versions because this
way we could test for any non-determinism across OTP versions.

However, for now, having a test with both nodes having the same OTP
verison is good enough because running this test with rabbit_fifo
machine version 5 fails while machine version 6 succeeds.

This reveales another interesting: The default "undefined" map order can
even be different using different Erlang nodes with the **same** OTP
version.
For test case leader_locator_balanced the actual leaders elected were
nodes 1, 3, 1 because they know about machine version 6 while node 2
only knows about machine version 5.
@ansd ansd marked this pull request as ready for review June 4, 2025 09:09
@ansd ansd merged commit 21b6088 into main Jun 4, 2025
282 checks passed
@ansd ansd deleted the qq-map-order branch June 4, 2025 09:16
@michaelklishin michaelklishin added this to the 4.2.0 milestone Jun 4, 2025
ansd added a commit that referenced this pull request Jun 6, 2025
 ## What?

PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exaclty the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.

This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.

 ## Why?

This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.

 ## How?

CI runs currently tests with Erlang 27.

This commit starts an Erlang 26 node in docker, specifically for the
`rabbit_fifo_prop_SUITE`.

Test case `two_nodes_different_otp_version` running Erlang 27 then transfers
a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.

By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```
ansd added a commit that referenced this pull request Jun 6, 2025
 ## What?

PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exaclty the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.

This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.

 ## Why?

This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.

 ## How?

CI runs currently tests with Erlang 27.

This commit starts an Erlang 26 node in docker, specifically for the
`rabbit_fifo_prop_SUITE`.

Test case `two_nodes_different_otp_version` running Erlang 27 then transfers
a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.

By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```
mergify bot pushed a commit that referenced this pull request Jun 6, 2025
 ## What?

PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exaclty the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.

This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.

 ## Why?

This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.

 ## How?

CI runs currently tests with Erlang 27.

This commit starts an Erlang 26 node in docker, specifically for the
`rabbit_fifo_prop_SUITE`.

Test case `two_nodes_different_otp_version` running Erlang 27 then transfers
a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.

By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```

(cherry picked from commit eccf9fe)
ansd added a commit that referenced this pull request Jun 6, 2025
 ## What?

PR #13971 added a property test that applies the same quorum queue Raft
command on different quorum queue members on different Erlang nodes
ensuring that the state machine ends up in exaclty the same state.
The different Erlang nodes run the **same** Erlang/OTP version however.

This commit adds another property test where the different Erlang nodes
run **different** Erlang/OTP versions.

 ## Why?

This test allows spotting any non-determinism that could occur when
running quorum queue members in a mixed version cluster, where mixed
version means in our context different Erlang/OTP versions.

 ## How?

CI runs currently tests with Erlang 27.

This commit starts an Erlang 26 node in docker, specifically for the
`rabbit_fifo_prop_SUITE`.

Test case `two_nodes_different_otp_version` running Erlang 27 then transfers
a few Erlang modules (e.g. module `rabbit_fifo`) to the Erlang 26 node.
The test case then runs the Ra commands on its own node in Erlang 27 and
on the Erlang 26 node in Docker.

By default, this test case is skipped locally.
However, to run this test case locally, simply start an Erlang node as
follows:
```
erl -sname rabbit_fifo_prop@localhost
```

(cherry picked from commit eccf9fe)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants