Port redpanda to s390x (big-endian)#30982
Conversation
The crc32c BCR overlay hardcoded BYTE_ORDER_BIG_ENDIAN to 0, forcing the little-endian ReadUint32LE path on big-endian hosts. On s390x this produced wrong CRC32C values, so every produced Kafka record batch failed CRC validation with CORRUPT_MESSAGE. Derive the macro from __BYTE_ORDER__ so the portable CRC is correct on every target arch.
rules_boost had no s390x branch in BOOST_CTX_ASM_SOURCES, so Boost.Context's fcontext asm (jump/make/ontop) was never compiled and every test linking Boost.Coroutine failed with undefined jump_fcontext. Add the s390x asm via an archive_override patch. hwloc's autotools configure auto-detected the host libxml2 and enabled optional XML topology support, leaking undefined xml* symbols into hwloc consumers (the dep isn't wired up). Disable it explicitly; we don't use it.
adl serializes string/container length prefixes through adl<int32_t>::to, which writes cpu_to_le, but the deserializer read them with a raw consume_type<int32_t>() (no le_to_cpu). On little-endian that round-trips by accident; on s390x the length is byte-swapped (e.g. 5 read as 0x05000000), overflowing iobuf consume. Route the size reads through adl<int32_t>::from so they mirror the writer and keep the wire format little-endian on every arch.
numeric_plain_encoder only byteswapped integral values; FLOAT/DOUBLE fell through and were appended in native byte order. PLAIN encoding requires little-endian IEEE bytes, so on s390x every float/double column (and the golden-file round trips) was wrong. Byteswap the bit pattern via bit_cast + cpu_to_le, matching the bloom-filter hash path.
The cross-chunk slow path assembles the 4 bytes with explicit little-endian shifts, which already yields a host-order integer, then called le_to_cpu on top. On little-endian that's a harmless no-op; on s390x it byteswaps a second time, so reads spanning a chunk boundary returned the wrong value. Drop the redundant le_to_cpu (the fast memcpy path keeps its le_to_cpu, which is correct).
The hermetic LLVM toolchain has no s390x distribution, so every redpanda_cc_test failed analysis trying to fetch llvm-symbolizer for the ASAN_SYMBOLIZER_PATH data dep. Sanitizers are not used on s390x; gate the symbolizer data dep and env var behind a //bazel:s390x config_setting.
The deprecated on-disk index decoder read the entry count with ss::le_to_cpu(adl<uint32_t>::from(...)), but adl<uint32_t>::from already converts from little-endian. On big-endian the extra le_to_cpu byteswaps a second time, yielding a bogus entry count and a failed consume. The encoder writes the count through adl<uint32_t>::to (single cpu_to_le), so the read must convert exactly once.
format_verification_max_key decoded the on-disk uint16 entry size with a raw consume_type<uint16_t>(), which byteswaps on big-endian hosts and broke the assertion on s390x. The writer stores it little-endian (cpu_to_le) and the production reader decodes via adl<uint16_t>; match that in the test.
The fail_magic cases decremented a 4-byte int overlapping the 1-byte magic field to invalidate it. The magic is the int's least-significant byte only on little-endian; on big-endian it is the most-significant byte and the decrement borrows through the CRC bytes instead, leaving magic==2 so v2_format stayed true. Corrupt the magic as an int8 so exactly that byte changes on any arch.
The protobuf wire format stores i32/i64 (fixed32, sfixed32, float, fixed64, sfixed64, double) as little-endian, but the parser read them with a raw consume_type<T>() in host byte order. On big-endian hosts every fixed-width field was byteswapped (e.g. fixed32 1024 read as 262144, doubles garbled). Add consume_fixed_le<T> (le_to_cpu for integrals, bit_cast over the little-endian bit pattern for float/double) and use it for all fixed reads.
filter_builder::finish wrote the offsets_start footer pointer in native byte order while the reader decodes it little-endian, corrupting bloom-filter lookups (false positives) on big-endian. Write it with cpu_to_le to match the per-filter offsets just above it.
The block trailer CRC was stored in native byte order via bit_cast, but the reader decodes it with ioarray::read_fixed32 (little-endian). On big-endian the stored and recomputed CRCs disagreed and every block read failed with 'unexpected crc'. Store the CRC with cpu_to_le to match the reader.
The per-filter offset array and the offsets_start pointer are read back via block::contents::read_fixed32, which round-trips host byte order (matching the native fixed32s the block builder writes). The builder wrote the per-filter offsets little-endian (and offsets_start native), so on big-endian the decoded offsets were byteswapped and bloom-filter lookups returned false positives. Write both in native byte order to match the reader.
MurmurHash3 is defined to read each 32/64-bit block as a little-endian word. getblock and the iobuf torn-block path read them in native byte order, so on big-endian hosts the hash diverged from little-endian hosts, breaking consumers that need a stable value (e.g. iceberg bucket partitioning). Convert each block from little-endian with std::byteswap so the hash is identical on every arch.
The decimal bucket transform byteswapped each 64-bit half to big-endian and then extracted bytes with value shifts. Arithmetic shifts already yield bytes least-significant first regardless of host endianness, so combined with the byteswap this only produced the required big-endian encoding on little-endian hosts; on s390x the bytes came out little-endian and the bucket (and murmur hash) were wrong. Extract most-significant-byte-first directly without the byteswap.
footer::read parsed the trailing footer_size and the inner section size with raw consume_type<uint32_t>() in native byte order, but they are written little-endian via as_bytes/cpu_to_le. On big-endian the sizes were byteswapped so footer parsing failed and L1 object reads returned the error variant. Decode both with le_to_cpu.
The default seastar::thread stack (128KiB) is too small for tests that build and tear down deeply-nested structures (JSON/AVRO/protobuf conformance DOMs); big-endian hosts have larger stack frames and overflow first. Run gtest on a 16MiB stack — ample for the recursion while still fitting tests with small (32-64MiB) memory budgets, since the stack is drawn from the seastar pool.
TEST_CORO cases run on the reactor posix_thread (default 2MiB) and boost SEASTAR_THREAD_TEST_CASE bodies run on a seastar::async thread (default 128KiB); both overflow on big-endian for deeply-recursive tests. Patch seastar to give the reactor a 256MiB mmap stack and the thread-test a 16MiB pooled stack.
So benchmarks (and their generated smoke tests) can be gated by platform, matching the other test macros.
WebAssembly linear memory is always little-endian, but the host ABI read and wrote multi-byte values in native order, so on big-endian hosts every value crossing the host<->guest boundary was byteswapped (e.g. a record_count of 7 reached the guest as 0x07000000, producing out-of-bounds FFI accesses, and byteswapped WASI arg pointers aborted the guest during runtime init). Add ffi::write_guest/read_guest helpers (no-ops on little-endian) and route the transform batch/record out-parameters, the WASI scalar out-params, the fd_write iovec fields, and the WASI args/environ pointer table through them.
The 3-broker recovery test needs the cluster to fully form before it flushes the metastore, but on the slower s390x box the peers' raft services stay service_unavailable long enough that the ct_l1_domain partition cannot be allocated (no_eligible_allocation_nodes) within the test windows, so flush returns transport_error. A multi-node timing issue, not byte-order; the metastore serialization is verified by its unit tests.
|
Unfortunately, I do not have access to Buildkite and we do not use Builtkite so I have not been able to update and validate the builds on Buildkite. Please kindly advise on how to proceed. |
WillemKauf
left a comment
There was a problem hiding this comment.
Interesting PR. Wondering what other reviewers think. It would be nice to fully test our CI suite against s390x/BE systems
| // PLAIN encoding stores FLOAT/DOUBLE as little-endian IEEE bytes. | ||
| // cpu_to_le only accepts integrals, so byteswap the bit pattern. | ||
| using bits_type | ||
| = std::conditional_t<sizeof(v.val) == 4, uint32_t, uint64_t>; |
There was a problem hiding this comment.
perhaps better as:
using bits_type = std::conditional_t<sizeof(v.val) == sizeof(uint32_t), uint32_t, uint64_t>;
static_assert(sizeof(bits_type) == sizeof(v.val));| // from little-endian to keep these fields correct on big-endian hosts. | ||
| template<typename T> | ||
| T consume_fixed_le(iobuf_parser& parser) { | ||
| static_assert(std::is_arithmetic_v<T>); |
There was a problem hiding this comment.
nit: use a concept here instead?
| if constexpr (std::is_integral_v<T>) { | ||
| return ss::le_to_cpu(parser.consume_type<T>()); | ||
| } else { | ||
| using bits_type |
There was a problem hiding this comment.
Same comment as my earlier bits_type recommendation
| // seastar::thread stack (128KiB) is too small for tests that build and tear | ||
| // down deeply-nested structures (JSON/AVRO/protobuf conformance DOMs); | ||
| // big-endian hosts have larger stack frames and overflow first. | ||
| seastar::thread_attributes test_thread_attrs; |
There was a problem hiding this comment.
16MiB is quite a bump here and maybe not one we want to take by default (unless other reviewers disagree and think this is fine). I'd also push back on saying "big-endian hosts have larger stack frames" - this might just be the case for s390x, but stack frame size and endianness aren't directly related.
| namespace bpo = boost::program_options; | ||
| - _thread = std::make_unique<posix_thread>([this, ac, av, init_outcome]() mutable { | ||
| + _thread = std::make_unique<posix_thread>( | ||
| + posix_thread::attr{posix_thread::attr::stack_size{size_t(256) << 20}}, |
There was a problem hiding this comment.
both overflow on big-endian for deeply-recursive tests.
Similar comment
256MiB mmap stack
do we really need to 128x this allocation? I suppose it doesn't matter that much (it's a single mmap() at start-up IIUC)
| # partition cannot be allocated (no_eligible_allocation_nodes) within the | ||
| # test windows, so flush returns transport_error. A multi-node timing issue, | ||
| # not byte-order; the metastore serde is verified by its unit tests. | ||
| target_compatible_with = select({ |
There was a problem hiding this comment.
marking this incompatible instead of adjusting test timeouts and making the test pass seems like sweeping the issue under the rug
Ports redpanda to s390x (big-endian). The core cause of failures on big-endian is that many multi-byte fields are serialized little-endian on disk/wire but were read/written in native order. This PR fixes every such site found across the storage, wire, schema, hash, datalake, and WASM layers, plus the build/infra blockers that prevented the test suite from building and running on s390x.
Motivation
s390x (IBM Z) is a common platform for enterprise workloads in finance, government, and other regulated industries, where large volumes of transactional data already live on the mainframe. Running Redpanda natively on s390x lets those users co-locate their streaming layer with that data — avoiding cross-architecture data movement and its latency, cost, and compliance overhead — instead of shipping data off-platform to an x86/ARM cluster.
We are ourselves an user of both s390x (IBM Z) and Redpanda (community). Kafka is well supported in s390x world (multiple distro supports s390x: official, Zalando, Confluent, to name a few). I think this port might be of interest for Redpanda Enterprise users as well. Please consider merging this port. I'm willing to help maintain the s390x, as well as coordinate with IBM OSS team to secure native s390x runner for this project so we can keep CI running (IBM provides ways to get access for OSS now, so a current maintainer can also get access too).
Result
The broker boots, bootstraps its controller, serves Kafka + admin APIs, and produce/consume round-trips correctly on s390x. The full
//src/v/...unit suite passes (verified natively on an s390x host), with a single integration test gated for a multi-node timing reason documented below. Little-endian behavior is unchanged — all new byteswaps areconstexpr-guarded no-ops on little-endian hosts.Data-format (byte-order) fixes: crc32c (BYTE_ORDER derived from compiler macros), reflection/adl container sizes, serde/parquet FLOAT/DOUBLE, serde/protobuf fixed32/64/float/double, bytes/ioarray read_fixed32, storage index_state + compaction index, lsm block CRC + filter offsets, hashing/murmur (fixes iceberg bucketing), iceberg decimal bucket-hash bytes, cloud_topics/l1 footer sizes, kafka batch magic byte.
WASM host ABI: WebAssembly linear memory is always little-endian, but the host FFI read/wrote every multi-byte boundary value in native order, so on big-endian they were byteswapped (a
record_countof 7 reached the guest as0x07000000; byteswapped WASI arg pointers aborted the guest during runtime init). Addedffi::write_guest/ffi::read_guesthelpers (no-ops on little-endian) and routed the transform out-parameters, WASI scalar out-params,fd_writeiovecs, and the WASI args/environ pointer table through them.Build / test infrastructure: skip hermetic llvm-symbolizer on s390x; fix boost.context s390x fcontext asm and an hwloc libxml2 symbol leak; enlarge seastar test-thread stacks (s390x's larger frames overflow the defaults on deeply-recursive tests, sized to still fit small-memory budgets); support
target_compatible_withinredpanda_cc_bench.Known limitation (gated):
cloud_topics cluster_recovery_testis marked incompatible on s390x — on the slower s390x box the peers' raft services stayservice_unavailablelong enough that thect_l1_domainpartition can't be allocated within the test windows, so the metastore flush returnstransport_error. A multi-node cluster-formation timing issue (persists in isolation), not byte-order; the metastore serialization is verified by its passing unit tests.Backports Required
Release Notes
Improvements