Scylla fails to startup on Fedora aarch64 AMI #22382

Open
syuu1228 opened this issue Jan 18, 2025 · 5 comments · May be fixed by scylladb/seastar#2624

@syuu1228
Contributor

This issue was originally reported in the thread at scylladb/scylla-pkg#4797.


On the Fedora 41 AMI, on some aarch64 instance types such as m7gd.16xlarge, Scylla fails to start up with the following error message:

$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
Aborting.
Backtrace:
  0x34b2593
  /opt/scylladb/libreloc/libseastar.so+0x3934403
  /opt/scylladb/libreloc/libseastar.so+0x393401b
  /opt/scylladb/libreloc/libseastar.so+0x375bc6f
  /opt/scylladb/libreloc/libseastar.so+0x37946a3
  /opt/scylladb/libreloc/libseastar.so+0x387d77f
  /opt/scylladb/libreloc/libseastar.so+0x387db67
  /opt/scylladb/libreloc/libseastar.so+0x387d90f
  linux-vdso.so.1+0x83f
  /opt/scylladb/libreloc/libc.so.6+0x98fff
  /opt/scylladb/libreloc/libc.so.6+0x459ff
  /opt/scylladb/libreloc/libc.so.6+0x30287
  /opt/scylladb/libreloc/libc.so.6+0x3e3df
  /opt/scylladb/libreloc/libc.so.6+0x3e453
  /opt/scylladb/libreloc/libseastar.so+0x3b8026f
  /opt/scylladb/libreloc/libseastar.so+0x37b43a7
  /opt/scylladb/libreloc/libseastar.so+0x3298baf
  /opt/scylladb/libreloc/libseastar.so+0x329619b
  0x354d0ef
  0x354a82b
  /opt/scylladb/libreloc/libc.so.6+0x30a1b
  /opt/scylladb/libreloc/libc.so.6+0x30afb
  0x346ed2f
Aborted (core dumped)

It seems that hwloc fails to initialize and returns incorrect hardware information to Seastar.
The error "hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery." also occurs with hwloc commands such as hwloc-ls.
The error message comes from check_sysfs_cpu_path(), and it occurs when /sys/devices/system/cpu/cpuX/topology/ is not available.
This can also be verified from the shell:

$ ls /sys/devices/system/cpu/cpu0/topology/
ls: cannot access '/sys/devices/system/cpu/cpu0/topology/': No such file or directory

This is probably a kernel driver problem for this CPU.
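
The same check can be sketched in C++. This is only an illustration: `has_sysfs_cpu_topology` is a hypothetical helper, not something that exists in Seastar or hwloc, and it assumes that looking at a single CPU (cpu0 by default) is representative of the whole machine.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper (not part of Seastar or hwloc): returns true when the
// per-CPU "topology" directory exists under the given sysfs root.
// The root is a parameter so the check can be exercised against a fake tree.
bool has_sysfs_cpu_topology(const std::filesystem::path& sysfs_root = "/sys",
                            unsigned cpu = 0) {
    std::error_code ec;  // non-throwing overload: a missing path yields false
    auto dir = sysfs_root / "devices" / "system" / "cpu"
             / ("cpu" + std::to_string(cpu)) / "topology";
    return std::filesystem::is_directory(dir, ec);
}
```

On the affected m7gd.16xlarge instance, `has_sysfs_cpu_topology()` would be expected to return false, matching the `ls` output above.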

I debugged seastar/src/core/resource.cc to find out why the assertion fired, and found that alloc_from_node() fails because node->total_memory is 0.
This is likely caused by the hwloc initialization failure described above.
It can be reproduced with the hwloc-ls command.
On a normal x86_64 machine, hwloc-ls prints the memory size and CPU cache sizes, but on m7gd.16xlarge none of that shows up:

  • hwloc-ls output on a normal x86_64 machine
Machine (63GB total)
  Package L#0
    NUMANode L#0 (P#0 63GB)
    L3 L#0 (30MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
...
  Misc(MemoryModule)
  Misc(MemoryModule)
  • hwloc-ls output on m7gd.16xlarge
$ hwloc-ls
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Machine
  NUMANode L#0 (P#0)
  PU L#0 (P#0)
  PU L#1 (P#1)
...
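
To illustrate why a node that reports no memory ends in the `!remain` assertion, here is a simplified model of the allocation loop. It is not the actual Seastar code: `node_model`, `alloc_from_node_model` and `allocate_model` are hypothetical names, and the model assumes each node is visited once and that the leftover is returned instead of asserted on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for a NUMA node as reported by the topology library.
struct node_model {
    std::size_t total_memory;  // bytes reported for the node; 0 when discovery failed
    std::size_t used = 0;      // bytes already handed out from this node
};

// Take up to `want` bytes from `node`; returns how much was actually taken.
std::size_t alloc_from_node_model(node_model& node, std::size_t want) {
    std::size_t taken = std::min(node.total_memory - node.used, want);
    node.used += taken;
    return taken;
}

// Try to satisfy `want` bytes by visiting each node once.
// Returns the unsatisfied remainder; the real code asserts that it is zero.
std::size_t allocate_model(std::vector<node_model>& nodes, std::size_t want) {
    std::size_t remain = want;
    for (auto& n : nodes) {
        if (remain == 0) {
            break;
        }
        remain -= alloc_from_node_model(n, remain);
    }
    return remain;
}
```

With a single node whose `total_memory` is 0, as in the m7gd.16xlarge output, nothing can be taken and the whole request is left over, which is the condition the assertion rejects.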

To avoid Scylla startup failures in such environments, we should stop using hwloc in seastar/src/core/resource.cc when the topology directory is missing.
Since resource.cc already has code for building Seastar without libhwloc, we can use that code to fix the problem.
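
As a rough sketch of what a fallback can rely on when hwloc data is unusable, total physical memory can be queried from the OS directly. `physical_memory_bytes` is a hypothetical helper for illustration, and I am assuming here that `sysconf()` with `_SC_PHYS_PAGES` (a glibc extension on Linux) is acceptable; whether the existing non-hwloc path in resource.cc does exactly this should be checked against the source.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: total physical memory in bytes as reported by the OS,
// without consulting hwloc. Returns 0 if either sysconf() query fails.
std::size_t physical_memory_bytes() {
    long page_size = ::sysconf(_SC_PAGESIZE);
    long pages = ::sysconf(_SC_PHYS_PAGES);
    if (page_size <= 0 || pages <= 0) {
        return 0;
    }
    return static_cast<std::size_t>(page_size) * static_cast<std::size_t>(pages);
}
```

This value does not depend on /sys/devices/system/cpu/cpuX/topology/, so it should stay usable on instances where hwloc discovery aborts.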

syuu1228 added a commit to syuu1228/seastar-1 that referenced this issue Jan 18, 2025
On the Fedora 41 AMI, on some aarch64 instance types such as m7gd.16xlarge,
Seastar programs such as Scylla fail to start up with the following error message:
```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems that hwloc fails to initialize because
/sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assertion fired,
and found that alloc_from_node() fails because node->total_memory is 0.
This is likely caused by the hwloc initialization failure described above.

To avoid the error in such environments, we should stop using hwloc in
resource.cc when it cannot discover the topology.
The hwloc initialization function does not return an error code even when the
error message is printed, so we need to check whether the "topology" directory
is available under /sys.
Since resource.cc already has code for building Seastar without libhwloc, we
need to call that code if the "topology" directory is not available.

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797
@mykaul
Contributor

mykaul commented Jan 18, 2025

Sounds like an issue that should be reported to Amazon?

@syuu1228
Contributor Author

I forgot to mention in my previous post that the problem does not occur on the Amazon Linux 2023 and Ubuntu 24.04 AMIs.
This is probably due to a kernel driver difference.
When I built the Amazon Linux 2023 kernel and installed it on the Fedora AMI, the problem went away.
However, Scylla / Seastar shouldn't abort even when hardware information is incomplete.
Also, we currently use the Fedora aarch64 AMI in our CI, and we don't want to switch distributions if possible.

@syuu1228
Contributor Author

> Sounds like an issue that should be reported to Amazon?

As I just described above, it works on the official AMIs (Amazon Linux, Ubuntu), so we probably need to report it to Fedora, not Amazon.

@mykaul
Contributor

mykaul commented Jan 20, 2025

> > Sounds like an issue that should be reported to Amazon?
>
> As I just described above it is working on official AMIs (Amazon Linux, Ubuntu), so probably we need to report it to Fedora not Amazon

That's fine too. As long as we do report it.

@syuu1228
Contributor Author

One more note: it seems that not all m7gd instance sizes are affected by this problem.
I found that m7gd.xlarge is not affected, but m7gd.16xlarge is.

@syuu1228 syuu1228 self-assigned this Jan 20, 2025