Fix seastar::resource::allocate() error on EC2 m7gd.16xlarge instance #2624

Open
wants to merge 1 commit into
base: master

syuu1228
Contributor

On the Fedora 41 AMI, on some aarch64 instances such as m7gd.16xlarge, Seastar programs such as Scylla fail to start up with the following error message:

$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.

It seems that hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assertion fired, and found that alloc_from_node() fails because node->total_memory is 0. This is likely caused by the hwloc initialization failure described above.

To avoid the error in such environments, we should not use hwloc in resource.cc.
The hwloc initialization function does not return an error code even when it prints an error message, so we need to check ourselves whether the "topology" directory is available under /sys. Since resource.cc already has code for building Seastar without libhwloc, we need to call that code if the "topology" directory is not available.
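
As a rough illustration of the directory check described above (not the code in this PR), here is a minimal C++17 sketch. The function name `cpu_topology_dir_available` and the `sysfs_cpu_root` parameter are hypothetical; the root is passed in so the check can be exercised against a fake tree, and on a real system it would be `/sys/devices/system/cpu`.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not part of Seastar or hwloc: returns true if
// <sysfs_cpu_root>/cpu<cpu_id>/topology exists and is a directory.
bool cpu_topology_dir_available(const fs::path& sysfs_cpu_root, unsigned cpu_id) {
    assert(!sysfs_cpu_root.empty());
    const fs::path dir = sysfs_cpu_root / ("cpu" + std::to_string(cpu_id)) / "topology";
    // The error_code overload does not throw; a missing or unreadable
    // path simply yields false.
    std::error_code ec;
    return fs::is_directory(dir, ec);
}
```

A caller would presumably pick the hwloc-based allocation path when this returns true and the non-hwloc path otherwise; which CPU id to probe is a separate question, since cpu0 may be offline.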

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797

Contributor

@denesb denesb left a comment

I was asked to review so I did. I'm not familiar with this code but from a high level it looks good to me.
Someone more familiar with this should also review.

Contributor

@tchaikov tchaikov left a comment

LGTM in general. In addition to the inlined comments:

  • could you please use a more specific prefix in the title of the commit message? like: "resource: "
  • and use a more specific title. like "fall back to single io group if hwloc fails to work".

BTW, you could even split the commit into two: one for moving the non-hwloc code up, the other for using it when hwloc fails to tell the CPU topology.

// cannot receive error from the API.
// Therefore, we have to detect cpu topology availability in our code.
static bool is_hwloc_available() {
const std::string cpux_properties[] = {

nit, std::string_view would suffice.

and better off referencing the related hwloc function, so that posterity understands why we are using this logic to determine whether hwloc is able to identify the CPU topology.

}

// cpu0 might be offline, try to check first online cpu.
auto online = read_first_line_as<std::string>("/sys/devices/system/cpu/online");

I noticed we could potentially use read_first_line_as<unsigned>("/sys/devices/system/cpu/online")
and handle the exception. If this approach wouldn't significantly simplify the implementation, feel free to keep the current solution.
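
To make the "take the first number" idea concrete without depending on Seastar's `read_first_line_as`, here is a small standalone sketch. It assumes the `online` file uses the usual cpulist form such as `0-63` or `2,4-7`; the helper name `first_online_cpu` is hypothetical, and it returns an empty optional instead of throwing.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: extracts the leading CPU id from a cpulist string
// such as "0-63" or "2,4-7". Parsing stops at the first non-digit, so the
// range and list separators are ignored.
std::optional<unsigned> first_online_cpu(std::string_view cpulist) {
    unsigned cpu = 0;
    const char* begin = cpulist.data();
    const char* end = begin + cpulist.size();
    const auto res = std::from_chars(begin, end, cpu);
    if (res.ec != std::errc{}) {
        // Empty input or no leading digits.
        return std::nullopt;
    }
    return cpu;
}
```

Whether this ends up simpler than parsing the string by hand depends on how the surrounding code wants to report a malformed file.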

@avikivity
Member

I think this is too extreme. Here's hwloc-ls output on m7gd.16xlarge:

$ hwloc-ls
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Machine
  NUMANode L#0 (P#0)
  PU L#0 (P#0)
  PU L#1 (P#1)
  PU L#2 (P#2)
  PU L#3 (P#3)
  PU L#4 (P#4)
  PU L#5 (P#5)
  PU L#6 (P#6)
  PU L#7 (P#7)
  PU L#8 (P#8)
  PU L#9 (P#9)
  PU L#10 (P#10)
  PU L#11 (P#11)
  PU L#12 (P#12)
  PU L#13 (P#13)
  PU L#14 (P#14)
  PU L#15 (P#15)
  PU L#16 (P#16)
  PU L#17 (P#17)
  PU L#18 (P#18)
  PU L#19 (P#19)
  PU L#20 (P#20)
  PU L#21 (P#21)
  PU L#22 (P#22)
  PU L#23 (P#23)
  PU L#24 (P#24)
  PU L#25 (P#25)
  PU L#26 (P#26)
  PU L#27 (P#27)
  PU L#28 (P#28)
  PU L#29 (P#29)
  PU L#30 (P#30)
  PU L#31 (P#31)
  PU L#32 (P#32)
  PU L#33 (P#33)
  PU L#34 (P#34)
  PU L#35 (P#35)
  PU L#36 (P#36)
  PU L#37 (P#37)
  PU L#38 (P#38)
  PU L#39 (P#39)
  PU L#40 (P#40)
  PU L#41 (P#41)
  PU L#42 (P#42)
  PU L#43 (P#43)
  PU L#44 (P#44)
  PU L#45 (P#45)
  PU L#46 (P#46)
  PU L#47 (P#47)
  PU L#48 (P#48)
  PU L#49 (P#49)
  PU L#50 (P#50)
  PU L#51 (P#51)
  PU L#52 (P#52)
  PU L#53 (P#53)
  PU L#54 (P#54)
  PU L#55 (P#55)
  PU L#56 (P#56)
  PU L#57 (P#57)
  PU L#58 (P#58)
  PU L#59 (P#59)
  PU L#60 (P#60)
  PU L#61 (P#61)
  PU L#62 (P#62)
  PU L#63 (P#63)
  HostBridge
    PCI 00:04.0 (NVMExp)
      Block(Disk) "nvme1n1"
    PCI 00:05.0 (Ethernet)
      Net "eth0"
    PCI 00:1e.0 (NVMExp)
      Block(Disk) "nvme0n1"
    PCI 00:1f.0 (NVMExp)
      Block(Disk) "nvme2n1"

So it aborted linux discovery, but was still able to keep going. Maybe hwloc-ls is telling hwloc to fall back to alternative methods if needed, and we are not.

@avikivity
Member

I think the problem is that hwloc doesn't report the memory as belonging to any NUMA nodes (on a normal machine the NUMA nodes have memory counts).

    // Divide local memory to cpus
    for (auto&& cs : cpu_sets()) {
        auto cpu_id = hwloc_bitmap_first(cs);
        assert(cpu_id != -1);
        auto node = cpu_to_node.at(cpu_id);
        cpu this_cpu;
        this_cpu.cpu_id = cpu_id;
        size_t remain = mem_per_proc - alloc_from_node(this_cpu, node, topo_used_mem, mem_per_proc);

        remains.emplace_back(std::move(this_cpu), remain);
    }

    // Divide the rest of the memory
    auto depth = hwloc_get_type_or_above_depth(topology, HWLOC_OBJ_NUMANODE);
    for (auto&& [this_cpu, remain] : remains) {
        auto node = cpu_to_node.at(this_cpu.cpu_id);
        auto obj = node;

        while (remain) {
            remain -= alloc_from_node(this_cpu, obj, topo_used_mem, remain);
            do {
                obj = hwloc_get_next_obj_by_depth(topology, depth, obj);
            } while (!obj);
            if (obj == node)
                break;
        }
        assert(!remain);
        ret.cpus.push_back(std::move(this_cpu));
    }

We need to add a third loop that allocates non-NUMA memory if the second loop fails.

@avikivity
Member

Bad NUMA detection:

  NUMANode L#0 (P#0)

Good NUMA detection:

    NUMANode L#0 (P#0 126GB)

So we just need to detect the case where the detected NUMA memory (here 0) is less than the memory we want to allocate, and treat that case too.
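
A simplified model of that fallback, written without hwloc so it can be run on its own, might look like the sketch below. Everything here is an assumption for illustration: `numa_node`, `allocation`, `allocate_for_cpu` and the machine-wide `non_numa_pool` are hypothetical names, and the real `alloc_from_node()` in resource.cc has a different signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a NUMA node as reported by topology discovery; `total`
// is 0 when discovery did not attach any memory to the node.
struct numa_node {
    std::size_t total = 0;
    std::size_t used = 0;
};

// Takes up to `want` bytes from `node` and returns the amount taken.
std::size_t alloc_from_node(numa_node& node, std::size_t want) {
    const std::size_t taken = std::min(node.total - node.used, want);
    node.used += taken;
    return taken;
}

struct allocation {
    std::size_t numa_bytes;      // satisfied from NUMA nodes
    std::size_t non_numa_bytes;  // satisfied from the fallback pool
};

allocation allocate_for_cpu(std::vector<numa_node>& nodes, std::size_t local,
                            std::size_t want, std::size_t& non_numa_pool) {
    assert(local < nodes.size());
    std::size_t remain = want;
    // Local node first, then the other nodes in wrap-around order.
    for (std::size_t i = 0; i < nodes.size() && remain; ++i) {
        remain -= alloc_from_node(nodes[(local + i) % nodes.size()], remain);
    }
    // Fallback: the NUMA nodes reported less memory than requested, so
    // take what is still missing from memory not tied to any node.
    const std::size_t fallback = std::min(remain, non_numa_pool);
    non_numa_pool -= fallback;
    return {want - remain, fallback};
}
```

With a single node that reports 0 bytes, the whole request is served from the fallback pool instead of tripping an assertion; with correctly detected nodes the fallback is never touched.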

Development

Successfully merging this pull request may close these issues.

Scylla fails to startup on Fedora aarch64 AMI