Fix seastar::resource::allocate() error on EC2 m7gd.16xlarge instance #2624

syuu1228 · 2025-01-18T03:47:34Z

On the Fedora 41 AMI, on some aarch64 instances such as m7gd.16xlarge, Seastar programs such as Scylla fail to start with the following error message:

$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.

It seems that hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assertion fired, and found that alloc_from_node() fails because node->total_memory is 0. This is likely caused by the hwloc initialization failure described above.

To avoid the error in such environments, resource.cc should not use hwloc there.
The hwloc initialization function does not return an error code even when it prints an error message, so we need to check ourselves whether the "topology" directory is available under /sys. Since resource.cc already has code for building Seastar without libhwloc, we need to call that code if the "topology" directory is not available.

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797
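
For illustration, the sysfs check described above could look roughly like the sketch below. It is not the code from this PR: the function name `cpu_topology_dir_exists` and the `sysfs_cpu_root` parameter are hypothetical, and the root is a parameter only so the check can be run against a fake directory tree (on a real system it would be /sys/devices/system/cpu).

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper, not the PR's implementation: returns true if the
// given CPU has a "topology" directory below the sysfs CPU root.
bool cpu_topology_dir_exists(const std::filesystem::path& sysfs_cpu_root,
                             unsigned cpu_id) {
    const std::filesystem::path dir =
        sysfs_cpu_root / ("cpu" + std::to_string(cpu_id)) / "topology";
    std::error_code ec;
    // Non-throwing overload: a missing or unreadable path yields false.
    return std::filesystem::is_directory(dir, ec);
}
```

A caller would pick the hwloc path only when this returns true for an online CPU, and the non-hwloc path otherwise.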

denesb

I was asked to review, so I did. I'm not familiar with this code, but from a high level it looks good to me.
Someone more familiar with this should also review.

tchaikov

LGTM in general, in addition to the inlined comments.

Could you please use a more specific prefix in the title of the commit message, like "resource: ",
and a more specific title, like "fall back to single io group if hwloc fails to work"?

BTW, you could even split the commit into two: one for moving the non-hwloc code up, the other for using it when hwloc fails to tell the CPU topology.

tchaikov · 2025-01-20T11:53:52Z

src/core/resource.cc

+// cannot receive error from the API.
+// Therefore, we have to detect cpu topology availability in our code.
+static bool is_hwloc_available() {
+    const std::string cpux_properties[] = {


nit, std::string_view would suffice.

It would also be better to reference the related hwloc function, so that posterity understands why we use this logic to determine whether hwloc is able to identify the CPU topology.

tchaikov · 2025-01-20T11:59:31Z

src/core/resource.cc

+    }
+
+    // cpu0 might be offline, try to check first online cpu.
+    auto online = read_first_line_as<std::string>("/sys/devices/system/cpu/online");


I noticed we could potentially use read_first_line_as<unsigned>("/sys/devices/system/cpu/online")
and handle the exception. If this approach wouldn't significantly simplify the implementation, feel free to keep the current solution.
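
As a point of comparison, here is a minimal sketch of parsing the first CPU id directly from the file's contents. It assumes the file uses the kernel's "cpulist" format (for example "0-63" or "2,4-7") with ids in ascending order. The helper name `first_cpu_in_list` is hypothetical and is not part of Seastar.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: return the first CPU id in a cpulist string such as
// "0-63" or "2,4-7", or std::nullopt if the string does not start with a number.
std::optional<unsigned> first_cpu_in_list(std::string_view list) {
    unsigned id = 0;
    // from_chars stops at the first non-digit ('-', ',' or '\n'),
    // so only the leading number is consumed.
    const auto res = std::from_chars(list.data(), list.data() + list.size(), id);
    if (res.ec != std::errc{}) {
        return std::nullopt;
    }
    return id;
}
```

This avoids exceptions altogether, at the cost of one extra helper; whether that is simpler than the current code is a judgment call.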

avikivity · 2025-01-20T19:48:25Z

I think this is too extreme. Here's hwloc-ls output on m7gd.16xlarge:

$ hwloc-ls
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Machine
  NUMANode L#0 (P#0)
  PU L#0 (P#0)
  PU L#1 (P#1)
  PU L#2 (P#2)
  PU L#3 (P#3)
  PU L#4 (P#4)
  PU L#5 (P#5)
  PU L#6 (P#6)
  PU L#7 (P#7)
  PU L#8 (P#8)
  PU L#9 (P#9)
  PU L#10 (P#10)
  PU L#11 (P#11)
  PU L#12 (P#12)
  PU L#13 (P#13)
  PU L#14 (P#14)
  PU L#15 (P#15)
  PU L#16 (P#16)
  PU L#17 (P#17)
  PU L#18 (P#18)
  PU L#19 (P#19)
  PU L#20 (P#20)
  PU L#21 (P#21)
  PU L#22 (P#22)
  PU L#23 (P#23)
  PU L#24 (P#24)
  PU L#25 (P#25)
  PU L#26 (P#26)
  PU L#27 (P#27)
  PU L#28 (P#28)
  PU L#29 (P#29)
  PU L#30 (P#30)
  PU L#31 (P#31)
  PU L#32 (P#32)
  PU L#33 (P#33)
  PU L#34 (P#34)
  PU L#35 (P#35)
  PU L#36 (P#36)
  PU L#37 (P#37)
  PU L#38 (P#38)
  PU L#39 (P#39)
  PU L#40 (P#40)
  PU L#41 (P#41)
  PU L#42 (P#42)
  PU L#43 (P#43)
  PU L#44 (P#44)
  PU L#45 (P#45)
  PU L#46 (P#46)
  PU L#47 (P#47)
  PU L#48 (P#48)
  PU L#49 (P#49)
  PU L#50 (P#50)
  PU L#51 (P#51)
  PU L#52 (P#52)
  PU L#53 (P#53)
  PU L#54 (P#54)
  PU L#55 (P#55)
  PU L#56 (P#56)
  PU L#57 (P#57)
  PU L#58 (P#58)
  PU L#59 (P#59)
  PU L#60 (P#60)
  PU L#61 (P#61)
  PU L#62 (P#62)
  PU L#63 (P#63)
  HostBridge
    PCI 00:04.0 (NVMExp)
      Block(Disk) "nvme1n1"
    PCI 00:05.0 (Ethernet)
      Net "eth0"
    PCI 00:1e.0 (NVMExp)
      Block(Disk) "nvme0n1"
    PCI 00:1f.0 (NVMExp)
      Block(Disk) "nvme2n1"

So it aborted linux discovery, but was still able to keep going. Maybe hwloc-ls is telling hwloc to fall back to alternative methods if needed, and we are not.

avikivity · 2025-01-20T20:00:45Z

I think the problem is that hwloc doesn't report the memory as belonging to any NUMA nodes (on a normal machine the NUMA nodes have memory counts).

    // Divide local memory to cpus
    for (auto&& cs : cpu_sets()) {
        auto cpu_id = hwloc_bitmap_first(cs);
        assert(cpu_id != -1);
        auto node = cpu_to_node.at(cpu_id);
        cpu this_cpu;
        this_cpu.cpu_id = cpu_id;
        size_t remain = mem_per_proc - alloc_from_node(this_cpu, node, topo_used_mem, mem_per_proc);

        remains.emplace_back(std::move(this_cpu), remain);
    }

    // Divide the rest of the memory
    auto depth = hwloc_get_type_or_above_depth(topology, HWLOC_OBJ_NUMANODE);
    for (auto&& [this_cpu, remain] : remains) {
        auto node = cpu_to_node.at(this_cpu.cpu_id);
        auto obj = node;

        while (remain) {
            remain -= alloc_from_node(this_cpu, obj, topo_used_mem, remain);
            do {
                obj = hwloc_get_next_obj_by_depth(topology, depth, obj);
            } while (!obj);
            if (obj == node)
                break;
        }
        assert(!remain);
        ret.cpus.push_back(std::move(this_cpu));
    }

We need to add a third loop that allocates non-NUMA memory if the second loop fails.
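
To make the idea concrete, here is a simplified model of the three passes that does not use hwloc at all. Everything in it is hypothetical: `node_model`, `take_from_node`, `allocate_for_cpu` and the `unattributed_pool` parameter are stand-ins for illustration, not Seastar or hwloc names, and the real fix would have to decide where the non-NUMA memory figure comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified stand-in for a NUMA node; not a Seastar or hwloc type.
struct node_model {
    std::size_t total_memory;  // bytes reported for the node
    std::size_t used;          // bytes already handed out
};

// Take up to `want` bytes from `n`; returns how much was actually taken.
std::size_t take_from_node(node_model& n, std::size_t want) {
    const std::size_t taken = std::min(want, n.total_memory - n.used);
    n.used += taken;
    return taken;
}

// Hypothetical three-pass allocation for one CPU. `nodes` must be non-empty
// and `local` must be a valid index. Returns the bytes that could not be
// satisfied (0 on success).
std::size_t allocate_for_cpu(std::vector<node_model>& nodes, std::size_t local,
                             std::size_t& unattributed_pool, std::size_t want) {
    std::size_t remain = want;
    // Pass 1: the CPU's own node.
    remain -= take_from_node(nodes[local], remain);
    // Pass 2: every other node, each visited once.
    for (std::size_t i = 1; i < nodes.size() && remain > 0; ++i) {
        remain -= take_from_node(nodes[(local + i) % nodes.size()], remain);
    }
    // Pass 3 (assumed fallback): memory not attributed to any NUMA node.
    const std::size_t fallback = std::min(remain, unattributed_pool);
    unattributed_pool -= fallback;
    remain -= fallback;
    return remain;
}
```

With a single node reporting 0 bytes, as on this instance, passes 1 and 2 take nothing and the whole request is served by pass 3.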

avikivity · 2025-01-20T20:03:31Z

Bad NUMA detection:

  NUMANode L#0 (P#0)

Good NUMA detection:

    NUMANode L#0 (P#0 126GB)

So we just need to detect the case where the detected NUMA memory (here 0) is less than the memory we want to allocate, and handle that case too.
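
The detection itself can be a small predicate. The sketch below is only an illustration under the assumption that the per-node memory sizes have already been collected into a vector; `numa_memory_insufficient` is a hypothetical name, not an existing Seastar function.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical predicate: true when the memory reported across all NUMA
// nodes cannot cover the amount we want to allocate.
bool numa_memory_insufficient(const std::vector<std::size_t>& node_memory_bytes,
                              std::size_t wanted_bytes) {
    const std::size_t total = std::accumulate(
        node_memory_bytes.begin(), node_memory_bytes.end(), std::size_t{0});
    return total < wanted_bytes;
}
```

For the "bad" output above the vector would hold a single 0, so any non-zero request is flagged as insufficient.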

syuu1228 requested a review from yaronkaikov January 18, 2025 03:47

yaronkaikov requested review from avikivity, denesb and tchaikov January 19, 2025 08:50

denesb reviewed Jan 20, 2025

tchaikov reviewed Jan 20, 2025
