Port redpanda to s390x (big-endian) #30982

Open
binhnguyenduc wants to merge 21 commits into redpanda-data:dev from dnse-tech:s390x-crc32c-big-endian

@binhnguyenduc

Ports redpanda to s390x (big-endian). The core cause of failures on big-endian is that many multi-byte fields are serialized little-endian on disk/wire but were read/written in native order. This PR fixes every such site found across the storage, wire, schema, hash, datalake, and WASM layers, plus the build/infra blockers that prevented the test suite from building and running on s390x.

Motivation

s390x (IBM Z) is a common platform for enterprise workloads in finance, government, and other regulated industries, where large volumes of transactional data already live on the mainframe. Running Redpanda natively on s390x lets those users co-locate their streaming layer with that data — avoiding cross-architecture data movement and its latency, cost, and compliance overhead — instead of shipping data off-platform to an x86/ARM cluster.

We are ourselves users of both s390x (IBM Z) and Redpanda (community). Kafka is well supported in the s390x world (multiple distributions support s390x: official, Zalando, and Confluent, to name a few). I think this port might be of interest to Redpanda Enterprise users as well. Please consider merging this port. I'm willing to help maintain the s390x port, as well as coordinate with the IBM OSS team to secure a native s390x runner for this project so we can keep CI running (IBM now provides ways for OSS projects to get access, so a current maintainer can get access too).

Result

The broker boots, bootstraps its controller, serves Kafka + admin APIs, and produce/consume round-trips correctly on s390x. The full //src/v/... unit suite passes (verified natively on an s390x host), with a single integration test gated for a multi-node timing reason documented below. Little-endian behavior is unchanged — all new byteswaps are constexpr-guarded no-ops on little-endian hosts.

Data-format (byte-order) fixes: crc32c (BYTE_ORDER derived from compiler macros), reflection/adl container sizes, serde/parquet FLOAT/DOUBLE, serde/protobuf fixed32/64/float/double, bytes/ioarray read_fixed32, storage index_state + compaction index, lsm block CRC + filter offsets, hashing/murmur (fixes iceberg bucketing), iceberg decimal bucket-hash bytes, cloud_topics/l1 footer sizes, kafka batch magic byte.

WASM host ABI: WebAssembly linear memory is always little-endian, but the host FFI read/wrote every multi-byte boundary value in native order, so on big-endian they were byteswapped (a record_count of 7 reached the guest as 0x07000000; byteswapped WASI arg pointers aborted the guest during runtime init). Added ffi::write_guest/ffi::read_guest helpers (no-ops on little-endian) and routed the transform out-parameters, WASI scalar out-params, fd_write iovecs, and the WASI args/environ pointer table through them.

Build / test infrastructure: skip hermetic llvm-symbolizer on s390x; fix boost.context s390x fcontext asm and an hwloc libxml2 symbol leak; enlarge seastar test-thread stacks (s390x's larger frames overflow the defaults on deeply-recursive tests, sized to still fit small-memory budgets); support target_compatible_with in redpanda_cc_bench.

Known limitation (gated): cloud_topics cluster_recovery_test is marked incompatible on s390x — on the slower s390x box the peers' raft services stay service_unavailable long enough that the ct_l1_domain partition can't be allocated within the test windows, so the metastore flush returns transport_error. A multi-node cluster-formation timing issue (persists in isolation), not byte-order; the metastore serialization is verified by its passing unit tests.

Backports Required

  • none - not a bug fix

Release Notes

Improvements

  • Redpanda now builds and its unit test suite passes on s390x (big-endian) hosts.

The crc32c BCR overlay hardcoded BYTE_ORDER_BIG_ENDIAN to 0, forcing the
little-endian ReadUint32LE path on big-endian hosts. On s390x this produced
wrong CRC32C values, so every produced Kafka record batch failed CRC
validation with CORRUPT_MESSAGE. Derive the macro from __BYTE_ORDER__ so the
portable CRC is correct on every target arch.
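
The macro derivation described above can be sketched as follows. This is a minimal illustration, not the overlay's actual code: the flag name EXAMPLE_BYTE_ORDER_BIG_ENDIAN and the helper example_read_u32_le are hypothetical, and the sketch assumes a compiler that predefines __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__ (GCC and Clang do).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical flag name. __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__ are
// predefined by GCC and Clang; anything not detected as big-endian is
// treated as little-endian in this sketch.
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) \
  && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define EXAMPLE_BYTE_ORDER_BIG_ENDIAN 1
#else
#define EXAMPLE_BYTE_ORDER_BIG_ENDIAN 0
#endif

// Loads a little-endian uint32 from a byte buffer.
inline uint32_t example_read_u32_le(const unsigned char* p) {
#if EXAMPLE_BYTE_ORDER_BIG_ENDIAN
    // Big-endian host: assemble the value byte by byte.
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16)
           | (uint32_t(p[3]) << 24);
#else
    // Little-endian host: the in-memory layout already matches.
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
#endif
}
```

If the flag were hardcoded to 0, a big-endian host would take the memcpy branch and return the bytes in the wrong order, which is the failure mode the commit message describes.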

rules_boost had no s390x branch in BOOST_CTX_ASM_SOURCES, so Boost.Context's
fcontext asm (jump/make/ontop) was never compiled and every test linking
Boost.Coroutine failed with undefined jump_fcontext. Add the s390x asm via an
archive_override patch.

hwloc's autotools configure auto-detected the host libxml2 and enabled
optional XML topology support, leaking undefined xml* symbols into hwloc
consumers (the dep isn't wired up). Disable it explicitly; we don't use it.

adl serializes string/container length prefixes through adl<int32_t>::to,
which writes cpu_to_le, but the deserializer read them with a raw
consume_type<int32_t>() (no le_to_cpu). On little-endian that round-trips by
accident; on s390x the length is byte-swapped (e.g. 5 read as 0x05000000),
overflowing iobuf consume. Route the size reads through adl<int32_t>::from so
they mirror the writer and keep the wire format little-endian on every arch.
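
To make the mismatch concrete, here is a small sketch that models the writer, a matching reader, and what a raw native read produces on a big-endian host. All function names are hypothetical and the byte handling is spelled out with shifts so the result is the same on any machine; it does not use the adl or iobuf APIs.

```cpp
#include <array>
#include <cstdint>

// Writer: always emits the length least-significant byte first.
constexpr std::array<uint8_t, 4> example_encode_len_le(int32_t n) {
    auto u = static_cast<uint32_t>(n);
    return {uint8_t(u), uint8_t(u >> 8), uint8_t(u >> 16), uint8_t(u >> 24)};
}

// Matching reader: interprets the bytes as little-endian on every host.
constexpr int32_t example_decode_len_le(const std::array<uint8_t, 4>& b) {
    return static_cast<int32_t>(
      uint32_t(b[0]) | (uint32_t(b[1]) << 8) | (uint32_t(b[2]) << 16)
      | (uint32_t(b[3]) << 24));
}

// Models a raw native read on a big-endian host: the first byte in memory
// becomes the most-significant byte of the integer.
constexpr uint32_t
example_decode_len_raw_big_endian(const std::array<uint8_t, 4>& b) {
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16)
           | (uint32_t(b[2]) << 8) | uint32_t(b[3]);
}
```

Encoding 5 and decoding it with the matching reader gives 5 back, while the modelled raw big-endian read gives 0x05000000, the value quoted in the commit message.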

numeric_plain_encoder only byteswapped integral values; FLOAT/DOUBLE fell
through and were appended in native byte order. PLAIN encoding requires
little-endian IEEE bytes, so on s390x every float/double column (and the
golden-file round trips) was wrong. Byteswap the bit pattern via bit_cast +
cpu_to_le, matching the bloom-filter hash path.

The cross-chunk slow path of ioarray read_fixed32 assembles the 4 bytes with explicit little-endian
shifts, which already yields a host-order integer, then called le_to_cpu on
top. On little-endian that's a harmless no-op; on s390x it byteswaps a second
time, so reads spanning a chunk boundary returned the wrong value. Drop the
redundant le_to_cpu (the fast memcpy path keeps its le_to_cpu, which is
correct).
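
The double conversion can be illustrated with a sketch that models le_to_cpu on a big-endian host as an explicit byteswap, so the effect is visible on any machine. The helper names are hypothetical and this is not the ioarray code.

```cpp
#include <cstdint>

constexpr uint32_t example_byteswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u)
           | (v << 24);
}

// Slow path: the four bytes may sit in different chunks, so they are combined
// one at a time. Shifts operate on values, so the result is already a
// host-order integer on any architecture.
constexpr uint32_t
example_assemble_le(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return uint32_t(b0) | (uint32_t(b1) << 8) | (uint32_t(b2) << 16)
           | (uint32_t(b3) << 24);
}

// What le_to_cpu amounts to on a big-endian host (modelled explicitly).
constexpr uint32_t example_le_to_cpu_on_big_endian(uint32_t v) {
    return example_byteswap32(v);
}
```

For the bytes 78 56 34 12, the shift assembly alone yields 0x12345678; applying the modelled big-endian le_to_cpu on top yields 0x78563412, which is the wrong value the commit removes.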

The hermetic LLVM toolchain has no s390x distribution, so every
redpanda_cc_test failed analysis trying to fetch llvm-symbolizer for the
ASAN_SYMBOLIZER_PATH data dep. Sanitizers are not used on s390x; gate the
symbolizer data dep and env var behind a //bazel:s390x config_setting.

The deprecated on-disk index decoder read the entry count with
ss::le_to_cpu(adl<uint32_t>::from(...)), but adl<uint32_t>::from already
converts from little-endian. On big-endian the extra le_to_cpu byteswaps a
second time, yielding a bogus entry count and a failed consume. The encoder
writes the count through adl<uint32_t>::to (single cpu_to_le), so the read
must convert exactly once.

format_verification_max_key decoded the on-disk uint16 entry size with a raw
consume_type<uint16_t>(), which byteswaps on big-endian hosts and broke the
assertion on s390x. The writer stores it little-endian (cpu_to_le) and the
production reader decodes via adl<uint16_t>; match that in the test.

The fail_magic cases decremented a 4-byte int overlapping the 1-byte magic
field to invalidate it. The magic is the int's least-significant byte only on
little-endian; on big-endian it is the most-significant byte and the decrement
borrows through the CRC bytes instead, leaving magic==2 so v2_format stayed
true. Corrupt the magic as an int8 so exactly that byte changes on any arch.
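
A sketch of why the int decrement only hits the magic byte on little-endian hosts. The 4-byte layout (magic followed by three neighbouring bytes) and all names are hypothetical simplifications of the test described above; both host interpretations are modelled with shifts so the sketch behaves identically everywhere.

```cpp
#include <array>
#include <cstdint>

// Four consecutive bytes in memory: [magic, next0, next1, next2].
using example_bytes4 = std::array<uint8_t, 4>;

// Old approach as seen by a little-endian host: magic is the int's low byte.
constexpr example_bytes4 example_decrement_int_le(const example_bytes4& b) {
    uint32_t v = uint32_t(b[0]) | (uint32_t(b[1]) << 8)
                 | (uint32_t(b[2]) << 16) | (uint32_t(b[3]) << 24);
    v -= 1;
    return {uint8_t(v), uint8_t(v >> 8), uint8_t(v >> 16), uint8_t(v >> 24)};
}

// Same approach as seen by a big-endian host: the low byte is b[3], so the
// decrement lands in the neighbouring bytes and usually leaves magic alone.
constexpr example_bytes4 example_decrement_int_be(const example_bytes4& b) {
    uint32_t v = (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16)
                 | (uint32_t(b[2]) << 8) | uint32_t(b[3]);
    v -= 1;
    return {uint8_t(v >> 24), uint8_t(v >> 16), uint8_t(v >> 8), uint8_t(v)};
}

// Fixed approach: touch exactly the magic byte, on any host.
constexpr example_bytes4 example_corrupt_magic(const example_bytes4& b) {
    return {uint8_t(b[0] - 1), b[1], b[2], b[3]};
}
```

Starting from the bytes 02 AA BB CC, the little-endian decrement and the single-byte corruption both turn the magic into 1, while the big-endian decrement leaves it at 2 and changes the last byte to CB.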

The protobuf wire format stores i32/i64 (fixed32, sfixed32, float, fixed64,
sfixed64, double) as little-endian, but the parser read them with a raw
consume_type<T>() in host byte order. On big-endian hosts every fixed-width
field was byteswapped (e.g. fixed32 1024 read as 262144, doubles garbled).
Add consume_fixed_le<T> (le_to_cpu for integrals, bit_cast over the
little-endian bit pattern for float/double) and use it for all fixed reads.
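
A rough, self-contained analogue of such a helper, assuming a plain byte pointer instead of iobuf_parser. The name example_read_fixed_le is hypothetical, std::memcpy stands in for bit_cast so the sketch builds as C++17, and it assumes the host stores floating-point values in the same byte order as integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Decodes a 4- or 8-byte little-endian fixed-width field on any host.
template<typename T>
T example_read_fixed_le(const unsigned char* p) {
    static_assert(std::is_arithmetic_v<T>, "arithmetic types only");
    static_assert(sizeof(T) == 4 || sizeof(T) == 8, "fixed32/fixed64 only");
    using bits_type
      = std::conditional_t<sizeof(T) == sizeof(uint32_t), uint32_t, uint64_t>;
    static_assert(sizeof(bits_type) == sizeof(T), "size mismatch");

    // Assemble the bit pattern least-significant byte first.
    bits_type bits = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        bits |= static_cast<bits_type>(p[i]) << (8 * i);
    }

    if constexpr (std::is_integral_v<T>) {
        return static_cast<T>(bits);
    } else {
        // float/double: reinterpret the host-order bit pattern.
        T out;
        std::memcpy(&out, &bits, sizeof(out));
        return out;
    }
}
```

With this decoding, the wire bytes 00 04 00 00 read as the fixed32 value 1024 on every host, rather than the 262144 a raw big-endian read would produce.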

filter_builder::finish wrote the offsets_start footer pointer in native byte
order while the reader decodes it little-endian, corrupting bloom-filter
lookups (false positives) on big-endian. Write it with cpu_to_le to match the
per-filter offsets just above it.

The block trailer CRC was stored in native byte order via bit_cast, but the
reader decodes it with ioarray::read_fixed32 (little-endian). On big-endian
the stored and recomputed CRCs disagreed and every block read failed with
'unexpected crc'. Store the CRC with cpu_to_le to match the reader.

The per-filter offset array and the offsets_start pointer are read back via
block::contents::read_fixed32, which round-trips host byte order (matching the
native fixed32s the block builder writes). The builder wrote the per-filter
offsets little-endian (and offsets_start native), so on big-endian the decoded
offsets were byteswapped and bloom-filter lookups returned false positives.
Write both in native byte order to match the reader.

MurmurHash3 is defined to read each 32/64-bit block as a little-endian word.
getblock and the iobuf torn-block path read them in native byte order, so on
big-endian hosts the hash diverged from little-endian hosts, breaking
consumers that need a stable value (e.g. iceberg bucket partitioning). Convert
each block from little-endian with std::byteswap so the hash is identical on
every arch.
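
A sketch of the 32-bit x86 variant of MurmurHash3 with an explicitly little-endian block read. The commit converts the loaded word with std::byteswap; this sketch instead assembles each word from its bytes, which yields the same little-endian interpretation without needing C++23. The function names are hypothetical and this is not Redpanda's implementation.

```cpp
#include <cstddef>
#include <cstdint>

inline uint32_t example_rotl32(uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

// Reads 32-bit block i as a little-endian word, independent of host order.
inline uint32_t example_getblock32_le(const unsigned char* p, std::size_t i) {
    p += i * 4;
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16)
           | (uint32_t(p[3]) << 24);
}

inline uint32_t
example_murmur3_x86_32(const unsigned char* data, std::size_t len, uint32_t seed) {
    const uint32_t c1 = 0xcc9e2d51u;
    const uint32_t c2 = 0x1b873593u;
    uint32_t h = seed;
    const std::size_t nblocks = len / 4;

    // Body: mix one little-endian word at a time.
    for (std::size_t i = 0; i < nblocks; ++i) {
        uint32_t k = example_getblock32_le(data, i);
        k *= c1;
        k = example_rotl32(k, 15);
        k *= c2;
        h ^= k;
        h = example_rotl32(h, 13);
        h = h * 5 + 0xe6546b64u;
    }

    // Tail: the remaining 1 to 3 bytes, also least-significant first.
    const unsigned char* tail = data + nblocks * 4;
    uint32_t k = 0;
    switch (len & 3) {
    case 3:
        k ^= uint32_t(tail[2]) << 16;
        [[fallthrough]];
    case 2:
        k ^= uint32_t(tail[1]) << 8;
        [[fallthrough]];
    case 1:
        k ^= tail[0];
        k *= c1;
        k = example_rotl32(k, 15);
        k *= c2;
        h ^= k;
    }

    // Finalization mix.
    h ^= static_cast<uint32_t>(len);
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}
```

Because every word is built from individual bytes, the empty input with seed 1 hashes to 0x514E28B7 and the four bytes of "test" with seed 0 hash to 0xBA6BD213 regardless of the host's byte order.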

The decimal bucket transform byteswapped each 64-bit half to big-endian and
then extracted bytes with value shifts. Arithmetic shifts already yield bytes
least-significant first regardless of host endianness, so combined with the
byteswap this only produced the required big-endian encoding on little-endian
hosts; on s390x the bytes came out little-endian and the bucket (and murmur
hash) were wrong. Extract most-significant-byte-first directly without the
byteswap.
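
The direct extraction can be sketched like this, assuming the decimal is held as two 64-bit halves. The function name and the fixed 16-byte output are hypothetical simplifications; the point is only that value shifts produce the same bytes on every host, so no byteswap is needed.

```cpp
#include <array>
#include <cstdint>

// Emits the 128-bit value (hi, lo) most-significant byte first.
constexpr std::array<uint8_t, 16>
example_decimal_bytes_be(uint64_t hi, uint64_t lo) {
    std::array<uint8_t, 16> out{};
    for (int i = 0; i < 8; ++i) {
        // Shift amounts 56, 48, ..., 0 pick bytes from most to least
        // significant, independent of how the host lays them out in memory.
        out[i] = uint8_t(hi >> (56 - 8 * i));
        out[8 + i] = uint8_t(lo >> (56 - 8 * i));
    }
    return out;
}
```

For hi = 0x0102030405060708 and lo = 0x090A0B0C0D0E0F10 the output bytes run 01, 02, ... up to 10 in order.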

footer::read parsed the trailing footer_size and the inner section size with
raw consume_type<uint32_t>() in native byte order, but they are written
little-endian via as_bytes/cpu_to_le. On big-endian the sizes were byteswapped
so footer parsing failed and L1 object reads returned the error variant. Decode
both with le_to_cpu.

The default seastar::thread stack (128KiB) is too small for tests that build
and tear down deeply-nested structures (JSON/AVRO/protobuf conformance DOMs);
big-endian hosts have larger stack frames and overflow first. Run gtest on a
16MiB stack — ample for the recursion while still fitting tests with small
(32-64MiB) memory budgets, since the stack is drawn from the seastar pool.

TEST_CORO cases run on the reactor posix_thread (default 2MiB) and boost
SEASTAR_THREAD_TEST_CASE bodies run on a seastar::async thread (default 128KiB);
both overflow on big-endian for deeply-recursive tests. Patch seastar to give
the reactor a 256MiB mmap stack and the thread-test a 16MiB pooled stack.

Support target_compatible_with in redpanda_cc_bench so benchmarks (and their
generated smoke tests) can be gated by platform, matching the other test macros.

WebAssembly linear memory is always little-endian, but the host ABI read and
wrote multi-byte values in native order, so on big-endian hosts every value
crossing the host<->guest boundary was byteswapped (e.g. a record_count of 7
reached the guest as 0x07000000, producing out-of-bounds FFI accesses, and
byteswapped WASI arg pointers aborted the guest during runtime init). Add
ffi::write_guest/read_guest helpers (no-ops on little-endian) and route the
transform batch/record out-parameters, the WASI scalar out-params, the
fd_write iovec fields, and the WASI args/environ pointer table through them.
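
A minimal sketch of what such guest-memory helpers could look like. The namespace example_ffi, the signatures, and the restriction to unsigned integers are assumptions made for illustration; the PR's ffi::write_guest/read_guest are described as byteswap no-ops on little-endian hosts, whereas this sketch stores byte by byte, which produces the same little-endian layout on any host.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

namespace example_ffi {

// Stores value into guest linear memory in little-endian order.
template<typename T>
void write_guest(unsigned char* dst, T value) {
    static_assert(std::is_unsigned_v<T>, "sketch handles unsigned integers only");
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        dst[i] = static_cast<unsigned char>(value >> (8 * i));
    }
}

// Loads a little-endian value from guest linear memory.
template<typename T>
T read_guest(const unsigned char* src) {
    static_assert(std::is_unsigned_v<T>, "sketch handles unsigned integers only");
    T value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        value = static_cast<T>(value | (static_cast<T>(src[i]) << (8 * i)));
    }
    return value;
}

} // namespace example_ffi
```

Writing a 32-bit record_count of 7 this way leaves the bytes 07 00 00 00 in guest memory, so the guest reads 7 rather than 0x07000000.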

The 3-broker recovery test needs the cluster to fully form before it flushes
the metastore, but on the slower s390x box the peers' raft services stay
service_unavailable long enough that the ct_l1_domain partition cannot be
allocated (no_eligible_allocation_nodes) within the test windows, so flush
returns transport_error. A multi-node timing issue, not byte-order; the
metastore serialization is verified by its unit tests.

@CLAassistant

CLAassistant commented Jul 1, 2026

CLA assistant check
All committers have signed the CLA.

@binhnguyenduc

Author

Unfortunately, I do not have access to Buildkite and we do not use Buildkite, so I have not been able to update and validate the builds there.

Please kindly advise on how to proceed.

@WillemKauf WillemKauf left a comment

Contributor

Interesting PR. Wondering what other reviewers think. It would be nice to fully test our CI suite against s390x/BE systems

// PLAIN encoding stores FLOAT/DOUBLE as little-endian IEEE bytes.
// cpu_to_le only accepts integrals, so byteswap the bit pattern.
using bits_type
  = std::conditional_t<sizeof(v.val) == 4, uint32_t, uint64_t>;

Contributor

perhaps better as:

  using bits_type = std::conditional_t<sizeof(v.val) == sizeof(uint32_t), uint32_t, uint64_t>;
  static_assert(sizeof(bits_type) == sizeof(v.val));

// from little-endian to keep these fields correct on big-endian hosts.
template<typename T>
T consume_fixed_le(iobuf_parser& parser) {
    static_assert(std::is_arithmetic_v<T>);

Contributor

nit: use a concept here instead?

    if constexpr (std::is_integral_v<T>) {
        return ss::le_to_cpu(parser.consume_type<T>());
    } else {
        using bits_type

Contributor

Same comment as my earlier bits_type recommendation

// seastar::thread stack (128KiB) is too small for tests that build and tear
// down deeply-nested structures (JSON/AVRO/protobuf conformance DOMs);
// big-endian hosts have larger stack frames and overflow first.
seastar::thread_attributes test_thread_attrs;

Contributor

16MiB is quite a bump here and maybe not one we want to take by default (unless other reviewers disagree and think this is fine). I'd also push back on saying "big-endian hosts have larger stack frames" - this might just be the case for s390x, but stack frame size and endianness aren't directly related.

namespace bpo = boost::program_options;
- _thread = std::make_unique<posix_thread>([this, ac, av, init_outcome]() mutable {
+ _thread = std::make_unique<posix_thread>(
+ posix_thread::attr{posix_thread::attr::stack_size{size_t(256) << 20}},

Contributor

> both overflow on big-endian for deeply-recursive tests.

Similar comment

> 256MiB mmap stack

do we really need to 128x this allocation? I suppose it doesn't matter that much (it's a single mmap() at start-up IIUC)

# partition cannot be allocated (no_eligible_allocation_nodes) within the
# test windows, so flush returns transport_error. A multi-node timing issue,
# not byte-order; the metastore serde is verified by its unit tests.
target_compatible_with = select({

Contributor

marking this incompatible instead of adjusting test timeouts and making the test pass seems like sweeping the issue under the rug


3 participants