Fix doubly-wrapped log entries after sparse write and recovery#601

Merged: michaelklishin merged 1 commit into `main` from `wal-bug` on Mar 26, 2026

@ansd ansd commented Mar 25, 2026

When a follower receives a pre-snapshot chunk, it uses `ra_log:write_sparse/3`
to write the entries to the WAL. However, `write_sparse/3` was passing the
entire `log_entry()` tuple `{Idx, Term, Cmd}` as the payload to the WAL,
instead of just the `Cmd`.

When the node is restarted, the WAL recovery process reads the payload and
wraps it in a new `{Idx, Term, Payload}` tuple. Because the payload was
already a full `log_entry()` tuple, the recovered entry became doubly-wrapped:
`{Idx, Term, {Idx, Term, Cmd}}`.
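
The double wrapping described above can be reproduced in a small, self-contained model. This is only a sketch: the module `wal_wrap_sketch`, its function names, and the map used to stand in for a WAL record are hypothetical and are not part of the Ra API.

```erlang
-module(wal_wrap_sketch).
-export([write_buggy/1, write_fixed/1, recover/1]).

%% Hypothetical model: a "WAL record" is a map holding the index, the term
%% and an opaque payload. None of these functions exist in Ra.

%% Buggy sparse write: the whole log entry tuple is stored as the payload.
write_buggy({Idx, Term, _Cmd} = Entry) ->
    #{index => Idx, term => Term, payload => Entry}.

%% Fixed sparse write: only the command is stored as the payload.
write_fixed({Idx, Term, Cmd}) ->
    #{index => Idx, term => Term, payload => Cmd}.

%% Recovery wraps whatever payload it finds in a fresh log entry tuple.
recover(#{index := Idx, term := Term, payload := Payload}) ->
    {Idx, Term, Payload}.
```

In this model, `recover(write_buggy(Entry))` yields a nested tuple whose third element is the original entry, while `recover(write_fixed(Entry))` returns the entry unchanged.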

This caused a `badmatch` error in `rabbit_jms_machine:exec_read/3` when
`rabbit_jms_queue_client` attempted to read messages for delivery. The
`ra_server:transform_for_partial_read/3` function failed to strip the Raft
metadata because the doubly-wrapped entry did not match the expected
`{'$usr', ...}` pattern, causing raw Raft metadata to be returned instead
of the expected JMS message.
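
To illustrate why the nested entry slips past the metadata-stripping step, here is a minimal sketch of a tag check on the third element of a log entry. The module `partial_read_sketch` and the function `classify/1` are hypothetical stand-ins; the real `ra_server:transform_for_partial_read/3` has a different signature, and only the `'$usr'` tag is taken from the description above.

```erlang
-module(partial_read_sketch).
-export([classify/1]).

%% Hypothetical helper: inspects only the tag of the command in a log entry.
%% The full shape of the '$usr' tuple is deliberately not assumed here.
classify({_Idx, _Term, Cmd}) when is_tuple(Cmd), element(1, Cmd) =:= '$usr' ->
    {user_command, Cmd};
classify({_Idx, _Term, Other}) ->
    %% A doubly-wrapped entry lands here: its third element starts with an
    %% index, not with '$usr', so the Raft metadata is passed through as-is.
    {unmatched, Other}.
```

A caller that matches the result against `{user_command, _}` would fail with `badmatch` for a doubly-wrapped entry, which mirrors the failure mode reported in `rabbit_jms_machine:exec_read/3`.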

This commit fixes the issue by passing only the `Cmd` to the WAL in
`write_sparse/3`, matching the behavior of normal appends. A test case
has been added to verify that sparse writes survive recovery without
being doubly-wrapped.
@ansd ansd added the bug label Mar 25, 2026
@ansd ansd requested review from kjnilsson and mkuratczyk March 25, 2026 19:52
@ansd ansd marked this pull request as ready for review March 25, 2026 19:53
@michaelklishin michaelklishin added this to the 3.1.1 milestone Mar 25, 2026
@michaelklishin michaelklishin merged commit 1491506 into main Mar 26, 2026
7 checks passed
@michaelklishin michaelklishin deleted the wal-bug branch March 26, 2026 00:09