Skip to content

Conversation

@darrylabbate
Copy link
Member

Marking as a draft since this hasn't been tested, but there are plenty of changes made here that can undergo CI testing and review.

@darrylabbate darrylabbate requested a review from a team November 1, 2025 05:20
@darrylabbate darrylabbate force-pushed the feat/efa/rocr branch 2 times, most recently from 6b18cf8 to 6972146 Compare November 21, 2025 07:33
if (ret == FI_SUCCESS) {
ibv_mr = ibv_reg_dmabuf_mr(ibv_pd, dmabuf_offset,
len, (uint64_t) ptr, dmabuf_fd, ibv_access);
(void) rocr_hmem_put_dmabuf_fd(dmabuf_fd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see dmabuf ops in rocr_hmem.c. Rocr doesn't support it? That means you will always try the side channel, and that likely won't work unless the efa driver implemented it for rocr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only path the efa driver is going to support with ROCm is dmabuf.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see dmabuf ops in rocr_hmem.c

Help me out here: what's missing from the current rocr_hmem_{get,put}_dmabuf_fd() functions? It's currently gated behind FI_HMEM_ROCR_USE_DMABUF, which we can discuss changing the default there. But it should otherwise hit hsa_amd_portable_export_dmabuf() as long as the ROCr headers expose it at compile-time.

Were you talking about the fabtests code? Because yeah I think that's missing and I'll add that functionality.

Copy link
Contributor

@shijin-aws shijin-aws Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion. I can see rocr_hmem_{get,put}_dmabuf_fd() is implemented in src/hmem_rocr.c, (and not in fabtests code yet), but this PR doesn't seem to go this code path for the fi_mr_regattr via a VA (no FI_MR_DMABUF) as it finally calls ibv_reg_mr

The expected workflow would be

  1. efa_mr_reg_ibv_mr should retrieve the dmabuf fd internally and reg via ibv_reg_dmabuf_mr, when dmabuf is available
  2. fabtests implement that dmabuf code as well and it will call fi_mr_regattr with FI_MR_DMABUF flags and fd directly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, I'll add this

This simply utilizes `fi_tostr()` to log the interface identifier
instead of the integer value.

Signed-off-by: Darryl Abbate <[email protected]>
This actually assigns the fi_hmem_iface snum value to the provided
identifier, rather than providing an index into efa_hmem_ifaces

Signed-off-by: Darryl Abbate <[email protected]>
This is a constant value; no need to re-compute every time the function
is called.

Signed-off-by: Darryl Abbate <[email protected]>
This was ostensibly designed to match fi_mr_attr.device, but should
instead be treated more like a subset of ofi_mr_info, where device is
simply a uint64_t. Aliasing fi_mr_attr.device is already silly since
fi_mr_attr.iface should tell you the device type.

Signed-off-by: Darryl Abbate <[email protected]>
* with ibv_reg_dmabuf_mr. If dmabuf is not supported, fall back to
* plain ibv_reg_mr on the VA.
*/
if (efa_mr->peer.iface == FI_HMEM_ROCR) {
Copy link
Contributor

@shijin-aws shijin-aws Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just change l550 to if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocm(efa_mr)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some changes for the string err message needed to make it general

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants