-
Notifications
You must be signed in to change notification settings - Fork 456
prov/efa: Enable ROCr HMEM support #11573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6b18cf8 to
6972146
Compare
| if (ret == FI_SUCCESS) { | ||
| ibv_mr = ibv_reg_dmabuf_mr(ibv_pd, dmabuf_offset, | ||
| len, (uint64_t) ptr, dmabuf_fd, ibv_access); | ||
| (void) rocr_hmem_put_dmabuf_fd(dmabuf_fd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see dmabuf ops in rocr_hmem.c. Rocr doesn't support it? That means you will always try the side channel, and that likely won't work unless the efa driver implemented it for rocr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only path the efa driver is going to support with ROCm is dmabuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see dmabuf ops in rocr_hmem.c
Help me out here: what's missing from the current rocr_hmem_{get,put}_dmabuf_fd() functions? It's currently gated behind FI_HMEM_ROCR_USE_DMABUF, which we can discuss changing the default there. But it should otherwise hit hsa_amd_portable_export_dmabuf() as long as the ROCr headers expose it at compile-time.
Were you talking about the fabtests code? Because yeah I think that's missing and I'll add that functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion. I can see rocr_hmem_{get,put}_dmabuf_fd() is implemented in src/hmem_rocr.c, (and not in fabtests code yet), but this PR doesn't seem to go this code path for the fi_mr_regattr via a VA (no FI_MR_DMABUF) as it finally calls ibv_reg_mr
The expected workflow would be
- efa_mr_reg_ibv_mr should retrieve the dmabuf fd internally and reg via ibv_reg_dmabuf_mr, when dmabuf is available
- fabtests implement that dmabuf code as well and it will call fi_mr_regattr with FI_MR_DMABUF flags and fd directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, I'll add this
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
This simply utilizes `fi_tostr()` to log the interface identifier instead of the integer value. Signed-off-by: Darryl Abbate <[email protected]>
This actually assigns the fi_hmem_iface snum value to the provided identifier, rather than providing an index into efa_hmem_ifaces Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
This is a constant value; no need to re-compute every time the function is called. Signed-off-by: Darryl Abbate <[email protected]>
This was ostensibly designed to match fi_mr_attr.device, but should instead be treated more like a subset of ofi_mr_info, where device is simply a uint64_t. Aliasing fi_mr_attr.device is already silly since fi_mr_attr.iface should tell you the device type. Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
6972146 to
0276ddd
Compare
prov/efa/src/efa_mr.c
Outdated
| * with ibv_reg_dmabuf_mr. If dmabuf is not supported, fall back to | ||
| * plain ibv_reg_mr on the VA. | ||
| */ | ||
| if (efa_mr->peer.iface == FI_HMEM_ROCR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just change l550 to if (efa_mr_is_neuron(efa_mr) || efa_mr_is_rocm(efa_mr)) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is some changes for the string err message needed to make it general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
0276ddd to
dcff237
Compare
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: alexander-sannikov <[email protected]> Signed-off-by: Darryl Abbate <[email protected]> (cherry picked from commit 79bfea2)
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
dcff237 to
c5eebdf
Compare
Marking as a draft since this hasn't been tested, but there are plenty of changes made here that can undergo CI testing and review.