prov/efa: dmabuf try / fallback logic #11465
feat: Add dmabuf capability tracking for HMEM interfaces

Problem:
- Need infrastructure to detect and query dmabuf support per HMEM interface
- Providers need to know if dmabuf is available before attempting registration

Solution:
- Added ofi_hmem_is_dmabuf_supported() query function in include/ofi_hmem.h
- Implemented dmabuf support detection in src/hmem_cuda.c
- Implemented dmabuf support detection in src/hmem_neuron.c
- Added stub implementations in src/hmem_rocr.c and src/hmem_synapseai.c
- Updated src/hmem.c to expose the new interface

Testing:
- Verified compilation with and without CUDA/Neuron support
- Tested query function returns correct values for each interface

Sim Issue:
- N/A

Signed-off-by: Nick Mazzilli <[email protected]>
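A per-interface capability query like the one described in this commit could be organized as a small dispatch table. The following is only a minimal sketch under stated assumptions: every `sketch_*` name is hypothetical and is not the libfabric API, and the CUDA and Neuron callbacks simply pretend that runtime detection succeeded, whereas real detection would probe the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interface IDs for illustration; not the real enum fi_hmem_iface. */
enum sketch_hmem_iface {
	SKETCH_HMEM_SYSTEM,
	SKETCH_HMEM_CUDA,
	SKETCH_HMEM_ROCR,
	SKETCH_HMEM_NEURON,
	SKETCH_HMEM_SYNAPSEAI,
	SKETCH_HMEM_COUNT
};

/* One optional callback per interface. */
struct sketch_hmem_ops {
	bool (*is_dmabuf_supported)(void);
};

/* Assumption: stands in for a runtime probe that found dmabuf support. */
static bool sketch_dmabuf_detected(void) { return true; }

/* Stub for interfaces that have no detection logic yet. */
static bool sketch_dmabuf_stub(void) { return false; }

static const struct sketch_hmem_ops sketch_ops[SKETCH_HMEM_COUNT] = {
	[SKETCH_HMEM_SYSTEM]     = { NULL },
	[SKETCH_HMEM_CUDA]       = { sketch_dmabuf_detected },
	[SKETCH_HMEM_ROCR]       = { sketch_dmabuf_stub },
	[SKETCH_HMEM_NEURON]     = { sketch_dmabuf_detected },
	[SKETCH_HMEM_SYNAPSEAI]  = { sketch_dmabuf_stub },
};

/* Returns false for unknown interfaces or interfaces without a callback. */
static bool sketch_hmem_is_dmabuf_supported(enum sketch_hmem_iface iface)
{
	if ((unsigned)iface >= SKETCH_HMEM_COUNT ||
	    !sketch_ops[iface].is_dmabuf_supported)
		return false;
	return sketch_ops[iface].is_dmabuf_supported();
}
```

A provider would call the query once per interface during initialization and cache the answer, so that the registration path never has to probe the driver.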
feat: Add default dmabuf attempt with fallback for all efa_hmem_ifaces

Problem:
- Need to make dmabuf usage default going forward
- Need fallback mechanism when dmabuf is not supported or fails

Solution:
- Modified initial PR from @jiaxiyan at ofiwg@6aa6708
- Added dmabuf_supported_by_device_b flag in efa_hmem_info structure in prov/efa/src/efa_hmem.h
- Updated dmabuf_supported_by_device_b detection in each fi_hmem_iface p2p_support function in prov/efa/src/efa_hmem.c
- Modified efa_mr_reg_ibv_mr() in prov/efa/src/efa_mr.c to use efa_mr->peer.iface for dmabuf checks
- Implemented try-dmabuf-first with fallback to ibv_reg_mr in efa_mr_reg_ibv_mr()
- Added environment variable control for dmabuf enable/disable per interface

Testing:
- Ran MPI perf tests on 2 nodes on p5en with dmabuf and fallback
- Ran MPI perf tests on 16 nodes on p5en with dmabuf and fallback
- Verified fallback works when dmabuf is unavailable

Sim Issue:
- N/A

Signed-off-by: Nick Mazzilli <[email protected]>
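The try-dmabuf-first flow with fallback described in this commit can be sketched roughly as follows. This is not the code from efa_mr_reg_ibv_mr(): all `sketch_*` types and functions are hypothetical, and the two registration callbacks are stand-ins for the verbs calls ibv_reg_dmabuf_mr() and ibv_reg_mr() so that the control flow can be exercised without RDMA hardware. Only the flag name dmabuf_supported_by_device_b is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memory-region handle that records which path produced it. */
struct sketch_mr {
	bool via_dmabuf;
};

/* Hypothetical registration callback; returns NULL on failure. */
typedef struct sketch_mr *(*sketch_reg_fn)(void *buf, size_t len);

/* Reduced stand-in for the per-interface info struct. */
struct sketch_hmem_info {
	bool dmabuf_supported_by_device_b; /* flag name from the commit message */
};

static struct sketch_mr sketch_dmabuf_mr = { .via_dmabuf = true };
static struct sketch_mr sketch_plain_mr = { .via_dmabuf = false };

/* Stub: dmabuf registration succeeds. */
static struct sketch_mr *sketch_reg_dmabuf_ok(void *buf, size_t len)
{
	(void)buf; (void)len;
	return &sketch_dmabuf_mr;
}

/* Stub: dmabuf registration fails, e.g. the kernel rejects the fd. */
static struct sketch_mr *sketch_reg_dmabuf_fail(void *buf, size_t len)
{
	(void)buf; (void)len;
	return NULL;
}

/* Stub: plain registration, the fallback path. */
static struct sketch_mr *sketch_reg_plain(void *buf, size_t len)
{
	(void)buf; (void)len;
	return &sketch_plain_mr;
}

/* Try dmabuf when the device flag allows it; otherwise, or on failure,
 * fall back to plain registration. */
static struct sketch_mr *
sketch_reg_with_fallback(const struct sketch_hmem_info *info,
			 sketch_reg_fn reg_dmabuf, sketch_reg_fn reg_plain,
			 void *buf, size_t len)
{
	struct sketch_mr *mr = NULL;

	assert(info && reg_plain);
	if (info->dmabuf_supported_by_device_b && reg_dmabuf)
		mr = reg_dmabuf(buf, len);
	if (!mr)
		mr = reg_plain(buf, len);
	return mr;
}
```

Passing sketch_reg_dmabuf_fail as the dmabuf callback exercises the fallback branch; clearing the flag skips the dmabuf attempt entirely.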
info->max_medium_msg_size = 0;
info->runt_size = 0;
info->min_read_msg_size = 0;
info->min_read_write_size = 0;
These default to 0 and are set in efa_domain_hmem_info_init_protocol_thresholds.
info->dmabuf_supported_by_device_b = true;
} else {
EFA_INFO(FI_LOG_CORE, "FI_HMEM_SYNAPSEAI DMABUF disabled by environment variable\n");
info->dmabuf_supported_by_device_b = false;
You don't need to set it again here.
 * @param[in] iface The HMEM interface to check
 * @return true if DMABUF is enabled for the interface, false otherwise
 */
bool efa_hmem_is_dmabuf_env_var_enabled(enum fi_hmem_iface iface)
Maybe we should add a method to the common hmem interface and define an ofi_hmem_is_dmabuf_requested function in src/hmem.c. That makes the code reusable for other providers. We also need to add the function for FI_HMEM_ZE if going this route.
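One way the suggested common helper could look is sketched below. This is only an illustration, not a proposal for the actual signature of ofi_hmem_is_dmabuf_requested: the enum, the environment variable names, and the "unset means requested, 0 means disabled" convention are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interface IDs; the real code would use enum fi_hmem_iface. */
enum sketch_env_iface {
	SKETCH_ENV_CUDA,
	SKETCH_ENV_NEURON,
	SKETCH_ENV_SYNAPSEAI,
	SKETCH_ENV_ZE,
	SKETCH_ENV_COUNT
};

/* Assumption: made-up variable names, one per interface. */
static const char *const sketch_dmabuf_env[SKETCH_ENV_COUNT] = {
	[SKETCH_ENV_CUDA]      = "SKETCH_CUDA_USE_DMABUF",
	[SKETCH_ENV_NEURON]    = "SKETCH_NEURON_USE_DMABUF",
	[SKETCH_ENV_SYNAPSEAI] = "SKETCH_SYNAPSEAI_USE_DMABUF",
	[SKETCH_ENV_ZE]        = "SKETCH_ZE_USE_DMABUF",
};

/* Assumed convention: dmabuf is requested unless the variable is set to "0". */
static bool sketch_hmem_is_dmabuf_requested(enum sketch_env_iface iface)
{
	const char *val;

	if ((unsigned)iface >= SKETCH_ENV_COUNT)
		return false;

	val = getenv(sketch_dmabuf_env[iface]);
	if (!val)
		return true;
	return strcmp(val, "0") != 0;
}
```

Keeping the lookup table in common code means a new interface such as FI_HMEM_ZE only needs one more table entry, and every provider sees the same answer.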