Scylla fails to startup on Fedora aarch64 AMI #22382

syuu1228 · 2025-01-18T03:46:39Z

This issue was originally reported in the thread at scylladb/scylla-pkg#4797.

On the Fedora 41 AMI, on some aarch64 instance types such as m7gd.16xlarge, Scylla fails to start with the following error message:

$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
Aborting.
Backtrace:
  0x34b2593
  /opt/scylladb/libreloc/libseastar.so+0x3934403
  /opt/scylladb/libreloc/libseastar.so+0x393401b
  /opt/scylladb/libreloc/libseastar.so+0x375bc6f
  /opt/scylladb/libreloc/libseastar.so+0x37946a3
  /opt/scylladb/libreloc/libseastar.so+0x387d77f
  /opt/scylladb/libreloc/libseastar.so+0x387db67
  /opt/scylladb/libreloc/libseastar.so+0x387d90f
  linux-vdso.so.1+0x83f
  /opt/scylladb/libreloc/libc.so.6+0x98fff
  /opt/scylladb/libreloc/libc.so.6+0x459ff
  /opt/scylladb/libreloc/libc.so.6+0x30287
  /opt/scylladb/libreloc/libc.so.6+0x3e3df
  /opt/scylladb/libreloc/libc.so.6+0x3e453
  /opt/scylladb/libreloc/libseastar.so+0x3b8026f
  /opt/scylladb/libreloc/libseastar.so+0x37b43a7
  /opt/scylladb/libreloc/libseastar.so+0x3298baf
  /opt/scylladb/libreloc/libseastar.so+0x329619b
  0x354d0ef
  0x354a82b
  /opt/scylladb/libreloc/libc.so.6+0x30a1b
  /opt/scylladb/libreloc/libc.so.6+0x30afb
  0x346ed2f
Aborted (core dumped)

It seems that hwloc fails to initialize and returns incorrect hardware information to Seastar.
The error "hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery." also occurs with hwloc commands such as hwloc-ls.
The error message comes from check_sysfs_cpu_path(), and it occurs when /sys/devices/system/cpu/cpuX/topology/ is not available.
This can also be verified from the shell:

$ ls /sys/devices/system/cpu/cpu0/topology/
ls: cannot access '/sys/devices/system/cpu/cpu0/topology/': No such file or directory
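
The same check can be expressed in C++. This is a minimal sketch, not code taken from hwloc or Seastar: the helper name has_sysfs_cpu_topology and its parameters are hypothetical, and the sysfs root is passed in only so the function can be tried against a fake directory tree.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper (not part of hwloc or Seastar): returns true when the
// per-CPU "topology" directory exists under the given sysfs CPU root.
bool has_sysfs_cpu_topology(
        const std::filesystem::path& cpu_root = "/sys/devices/system/cpu",
        unsigned cpu = 0) {
    std::error_code ec;  // non-throwing overload: a missing path simply yields false
    const std::filesystem::path dir =
        cpu_root / ("cpu" + std::to_string(cpu)) / "topology";
    return std::filesystem::is_directory(dir, ec);
}
```

On an affected instance, has_sysfs_cpu_topology() would be expected to return false, matching the ls output above.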

This is probably a kernel driver problem related to the CPU.

I debugged seastar/src/core/resource.cc to find out why the assertion fired, and found that alloc_from_node() fails because node->total_memory is 0.
This is likely caused by the hwloc initialization failure described above.
It can be reproduced with the hwloc-ls command.
On a normal x86_64 machine, hwloc-ls prints the memory size and CPU cache sizes, but on m7gd.16xlarge neither shows up:

hwloc-ls output on a normal x86_64 machine

Machine (63GB total)
  Package L#0
    NUMANode L#0 (P#0 63GB)
    L3 L#0 (30MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
...
  Misc(MemoryModule)
  Misc(MemoryModule)

hwloc-ls output on m7gd.16xlarge

$ hwloc-ls
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Machine
  NUMANode L#0 (P#0)
  PU L#0 (P#0)
  PU L#1 (P#1)
...
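
To illustrate why a NUMA node that reports no memory leads to the `!remain' assertion, here is a simplified model. It is not the Seastar source: node_info, take_from_node and unsatisfied_after_allocation are hypothetical names, and the real alloc_from_node() works on hwloc objects. The sketch only assumes that each shard's memory request is taken from the nodes and that the caller asserts nothing is left over.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a NUMA node as reported by the topology library.
struct node_info {
    std::size_t total_memory;  // bytes the node reports
    std::size_t used_memory;   // bytes already handed out
};

// Take up to `want` bytes from one node; returns how many bytes were taken.
std::size_t take_from_node(node_info& node, std::size_t want) {
    const std::size_t taken =
        std::min(node.total_memory - node.used_memory, want);
    node.used_memory += taken;
    return taken;
}

// Try to satisfy `want` bytes across all nodes; returns the unsatisfied rest.
// A caller that asserts the result is zero aborts when every node reports 0.
std::size_t unsatisfied_after_allocation(std::vector<node_info>& nodes,
                                         std::size_t want) {
    std::size_t remain = want;
    for (auto& node : nodes) {
        if (remain == 0) {
            break;
        }
        remain -= take_from_node(node, remain);
    }
    return remain;
}
```

With a single node whose total_memory is 0, as in the m7gd.16xlarge output above, the whole request stays unsatisfied, so an assertion on the remainder fails.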

To avoid Scylla startup failure in such environments, we should stop using hwloc in the seastar/src/core/resource.cc code.
Since resource.cc already has code for building Seastar without libhwloc, we can use that code to fix the problem.
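
As a rough idea of what a layout computed without hwloc could look like, the sketch below reads the CPU count and physical memory directly from the OS and splits the memory evenly. It does not reproduce Seastar's non-hwloc code path; os_physical_memory, split_memory_evenly and fallback_layout are hypothetical names, and the even split is an assumption made for illustration. It relies on sysconf(_SC_PHYS_PAGES), which is available on Linux.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>
#include <unistd.h>

// Physical memory in bytes as reported by the OS, without consulting hwloc.
std::size_t os_physical_memory() {
    return static_cast<std::size_t>(::sysconf(_SC_PHYS_PAGES)) *
           static_cast<std::size_t>(::sysconf(_SC_PAGESIZE));
}

// Hypothetical fallback: give each of `shards` an equal share of the memory.
std::vector<std::size_t> split_memory_evenly(std::size_t total_memory,
                                             unsigned shards) {
    if (shards == 0) {
        return {};
    }
    return std::vector<std::size_t>(shards, total_memory / shards);
}

// Build a per-shard memory layout from OS-reported values only.
std::vector<std::size_t> fallback_layout() {
    const unsigned cpus = std::thread::hardware_concurrency();
    // hardware_concurrency() may return 0 when the count is unknown.
    return split_memory_evenly(os_physical_memory(), cpus == 0 ? 1 : cpus);
}
```

Because this path never looks at per-node memory, a topology that reports 0 bytes for a NUMA node cannot leave a request unsatisfied.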

On the Fedora 41 AMI, on some aarch64 instance types such as m7gd.16xlarge, Seastar programs such as Scylla fail to start with the following error message:

```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems that hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance. I debugged src/core/resource.cc to find out why the assertion fired, and found that alloc_from_node() fails because node->total_memory is 0. This is likely caused by the hwloc initialization failure described above.

To avoid the error in such environments, we should stop using hwloc in resource.cc. The hwloc initialization function does not return an error code even when the error message is printed, so we need to check whether the "topology" directory is available under /sys. Since resource.cc already has code for building Seastar without libhwloc, we need to call that code if the "topology" directory is not available.

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797

mykaul · 2025-01-18T07:52:17Z

Sounds like an issue that should be reported to Amazon?

syuu1228 · 2025-01-20T11:08:40Z

I forgot to mention in my previous post that the problem does not occur on the Amazon Linux 2023 and Ubuntu 24.04 AMIs.
It is probably because of a kernel driver difference.
When I built the Amazon Linux 2023 kernel and installed it on the Fedora AMI, the problem was fixed.
However, Scylla / Seastar shouldn't abort even if the hardware information is incomplete.
Also, we currently use the Fedora aarch64 AMI in our CI, and we don't want to switch distributions if possible.

syuu1228 · 2025-01-20T11:11:54Z

> Sounds like an issue that should be reported to Amazon?

As I just described above, it works on the official AMIs (Amazon Linux, Ubuntu), so we probably need to report it to Fedora, not Amazon.

mykaul · 2025-01-20T11:26:03Z

> > Sounds like an issue that should be reported to Amazon?
>
> As I just described above, it works on the official AMIs (Amazon Linux, Ubuntu), so we probably need to report it to Fedora, not Amazon.

That's fine too. As long as we do report it.

syuu1228 · 2025-01-20T11:49:50Z

One more note: it seems that not all m7gd instance sizes are affected by this problem.
I found that m7gd.xlarge is not affected, but m7gd.16xlarge is.

syuu1228 linked a pull request Jan 18, 2025 that will close this issue

Fix seastar::resource::allocate() error on EC2 m7gd.16xlarge instance scylladb/seastar#2624

Open

syuu1228 self-assigned this Jan 20, 2025
