@nmazzilli3
Contributor

feat: Adding default dmabuf attempt and fallback logic for all efa_hmem_ifaces

Problem:

  • Make dmabuf usage the default going forward, with a fallback mechanism when dmabuf is not supported

Solution:

Testing:

  • Ran mpi perf tests on 2 nodes on p5en with dmabuf and fallback option hard set
  • Ran mpi perf tests on 16 nodes on p5en with dmabuf and fallback option hard set

Sim Issue:

  • N/A

@nmazzilli3 requested a review from jiaxiyan on October 3, 2025 21:39
@jiaxiyan requested a review from a team on October 3, 2025 21:52
@nmazzilli3 force-pushed the Subspace-2834 branch 2 times, most recently from 4d7e686 to 8901eae, on October 22, 2025 19:35
@nmazzilli3 requested a review from jiaxiyan on October 23, 2025 19:26
feat: Add dmabuf capability tracking for HMEM interfaces

Problem:
  - Need infrastructure to detect and query dmabuf support per HMEM interface
  - Providers need to know if dmabuf is available before attempting registration

Solution:
  - Added ofi_hmem_is_dmabuf_supported() query function in include/ofi_hmem.h
  - Implemented dmabuf support detection in src/hmem_cuda.c
  - Implemented dmabuf support detection in src/hmem_neuron.c
  - Added stub implementations in src/hmem_rocr.c and src/hmem_synapseai.c
  - Updated src/hmem.c to expose the new interface

Testing:
  - Verified compilation with and without CUDA/Neuron support
  - Tested query function returns correct values for each interface

Sim Issue:
  - N/A

Signed-off-by: Nick Mazzilli <[email protected]>
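
A minimal sketch of how a per-interface dmabuf capability query like the one described in this commit could be structured. This is not the code from the PR: the `sketch_*` names, the enum, and the ops table are hypothetical, and the two "detect" callbacks simply return true where real code would probe the driver or runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interface IDs; the real code uses enum fi_hmem_iface. */
enum sketch_iface {
	SKETCH_SYSTEM,
	SKETCH_CUDA,
	SKETCH_ROCR,
	SKETCH_NEURON,
	SKETCH_SYNAPSEAI,
	SKETCH_IFACE_MAX,
};

/* One entry per interface; the callback may be NULL if never set up. */
struct sketch_hmem_ops {
	bool (*is_dmabuf_supported)(void);
};

/* Simulated detection result. Real detection would query the device. */
static bool sketch_detect_yes(void) { return true; }

/* Stub for interfaces that do not implement detection yet. */
static bool sketch_stub_no(void) { return false; }

static const struct sketch_hmem_ops sketch_ops[SKETCH_IFACE_MAX] = {
	[SKETCH_CUDA]      = { .is_dmabuf_supported = sketch_detect_yes },
	[SKETCH_NEURON]    = { .is_dmabuf_supported = sketch_detect_yes },
	[SKETCH_ROCR]      = { .is_dmabuf_supported = sketch_stub_no },
	[SKETCH_SYNAPSEAI] = { .is_dmabuf_supported = sketch_stub_no },
	/* SKETCH_SYSTEM is left zero-initialized: no callback. */
};

/* Query wrapper: out-of-range or missing callbacks report "unsupported". */
static bool sketch_is_dmabuf_supported(enum sketch_iface iface)
{
	if ((unsigned)iface >= (unsigned)SKETCH_IFACE_MAX)
		return false;
	if (!sketch_ops[iface].is_dmabuf_supported)
		return false;
	return sketch_ops[iface].is_dmabuf_supported();
}
```

Treating a missing callback as "unsupported" keeps the query safe to call for any interface, which matters if providers call it before attempting registration.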
feat: Add default dmabuf attempt with fallback for all efa_hmem_ifaces

Problem:
  - Need to make dmabuf usage default going forward
  - Need fallback mechanism when dmabuf is not supported or fails

Solution:
  - Modified initial PR from @jiaxiyan at ofiwg@6aa6708
  - Added dmabuf_supported_by_device_b flag in efa_hmem_info structure in prov/efa/src/efa_hmem.h
  - Updated dmabuf_supported_by_device_b detection in each fi_hmem_iface p2p_support function in prov/efa/src/efa_hmem.c
  - Modified efa_mr_reg_ibv_mr() in prov/efa/src/efa_mr.c to use efa_mr->peer.iface for dmabuf checks
  - Implemented try-dmabuf-first with fallback to ibv_reg_mr in efa_mr_reg_ibv_mr()
  - Added environment variable control for dmabuf enable/disable per interface

Testing:
  - Ran MPI perf tests on 2 nodes on p5en with dmabuf and fallback
  - Ran MPI perf tests on 16 nodes on p5en with dmabuf and fallback
  - Verified fallback works when dmabuf is unavailable

Sim Issue:
  - N/A

Signed-off-by: Nick Mazzilli <[email protected]>
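
The try-dmabuf-first flow described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the body of efa_mr_reg_ibv_mr(): the `sketch_*` types and functions are hypothetical, and the two registration callbacks stand in for the dmabuf and ibv_reg_mr() paths so the logic can run without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a registration call: returns 0 on success and
 * a negative value on failure. Real code would call into libibverbs. */
typedef int (*sketch_reg_fn)(void *buf, size_t len);

/* Which path ended up registering the buffer. */
enum sketch_reg_path {
	SKETCH_REG_FAILED = 0,
	SKETCH_REG_DMABUF,
	SKETCH_REG_FALLBACK,
};

/* Mirrors only the flag named in the commit message. */
struct sketch_hmem_info {
	bool dmabuf_supported_by_device_b;
};

/* Try dmabuf when the device flag allows it; on failure, or when the flag
 * is off, fall back to the plain registration path. */
static enum sketch_reg_path
sketch_reg_mr(const struct sketch_hmem_info *info, void *buf, size_t len,
	      sketch_reg_fn reg_dmabuf, sketch_reg_fn reg_plain)
{
	assert(info && reg_dmabuf && reg_plain);

	if (info->dmabuf_supported_by_device_b && reg_dmabuf(buf, len) == 0)
		return SKETCH_REG_DMABUF;

	if (reg_plain(buf, len) == 0)
		return SKETCH_REG_FALLBACK;

	return SKETCH_REG_FAILED;
}

/* Simulated registration outcomes for exercising the flow. */
static int sketch_reg_ok(void *buf, size_t len)
{
	(void)buf; (void)len;
	return 0;
}

static int sketch_reg_fail(void *buf, size_t len)
{
	(void)buf; (void)len;
	return -1;
}
```

Passing `sketch_reg_fail` as the dmabuf callback with `sketch_reg_ok` as the plain one exercises the fallback path; returning the path taken makes it easy to log which registration method succeeded.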
@nmazzilli3 requested a review from jiaxiyan on November 12, 2025 18:29
info->max_medium_msg_size = 0;
info->runt_size = 0;
info->min_read_msg_size = 0;
info->min_read_write_size = 0;
Contributor

These default to 0 and are set in efa_domain_hmem_info_init_protocol_thresholds.

info->dmabuf_supported_by_device_b = true;
} else {
EFA_INFO(FI_LOG_CORE, "FI_HMEM_SYNAPSEAI DMABUF disabled by environment variable\n");
info->dmabuf_supported_by_device_b = false;
Contributor

You don't need to set it again here

@jiaxiyan requested a review from j-xiong on November 12, 2025 18:51
@jiaxiyan
Contributor

@j-xiong Can you review 80a80f3?

* @param[in] iface The HMEM interface to check
* @return true if DMABUF is enabled for the interface, false otherwise
*/
bool efa_hmem_is_dmabuf_env_var_enabled(enum fi_hmem_iface iface)
Contributor

Maybe we should add a method to the common hmem interface and define an ofi_hmem_is_dmabuf_requested function in src/hmem.c. That would make the code reusable for other providers. We would also need to add the function for FI_HMEM_ZE if going this route.
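
A rough sketch of what a provider-independent "is dmabuf requested" check might look like. Everything here is an assumption for illustration: the `sketch_*` names and the environment variable names are hypothetical (the real variable names are not shown in this thread), and libfabric code would typically read settings through its own parameter API rather than calling getenv() directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interface IDs; the real code uses enum fi_hmem_iface. */
enum sketch_req_iface {
	SKETCH_REQ_SYSTEM,
	SKETCH_REQ_CUDA,
	SKETCH_REQ_NEURON,
	SKETCH_REQ_SYNAPSEAI,
	SKETCH_REQ_IFACE_MAX,
};

/* Hypothetical per-interface variable names, invented for this sketch. */
static const char *const sketch_env_names[SKETCH_REQ_IFACE_MAX] = {
	[SKETCH_REQ_CUDA]      = "SKETCH_CUDA_USE_DMABUF",
	[SKETCH_REQ_NEURON]    = "SKETCH_NEURON_USE_DMABUF",
	[SKETCH_REQ_SYNAPSEAI] = "SKETCH_SYNAPSEAI_USE_DMABUF",
	/* SKETCH_REQ_SYSTEM has no variable: dmabuf does not apply. */
};

/* "0" disables, "1" enables; unset, empty, or unrecognized values keep
 * the caller's default. */
static bool sketch_parse_enable(const char *value, bool dflt)
{
	if (!value || value[0] == '\0')
		return dflt;
	if (strcmp(value, "0") == 0)
		return false;
	if (strcmp(value, "1") == 0)
		return true;
	return dflt;
}

/* Common entry point any provider could call. dmabuf defaults to enabled
 * for interfaces that have a control variable. */
static bool sketch_is_dmabuf_requested(enum sketch_req_iface iface)
{
	if ((unsigned)iface >= (unsigned)SKETCH_REQ_IFACE_MAX)
		return false;
	if (!sketch_env_names[iface])
		return false;
	return sketch_parse_enable(getenv(sketch_env_names[iface]), true);
}
```

Keeping the parsing separate from the lookup means a new interface such as FI_HMEM_ZE would only need a table entry, not new parsing code.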
