Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/PROTO: Disable protocols that need memory type copy #10490

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

tvegas1
Copy link
Contributor

@tvegas1 tvegas1 commented Feb 13, 2025

What?

Make sure to disable any protocol using locally or remotely, a memory type based copy.

Why?

Need to avoid all cuda calls from under ucp_worker_progress(). Cuda memory queries are fine since they are performed under send/recv operations. cuda_ipc transport can be disabled altogether, and gdr_copy is anyways not of interest when specifying UCX_MEMTYPE_AVOID_COPY=y.

How?

Disable applicable protocols altogether, for any memory type (cuda/host..).

Test

Might need to make inline sizes to 0, see below otherwise.

UCX_IB_TX_INLINE_RESP=0 UCX_MEMTYPE_AVOID_COPY=y ucx_perftest
UCX_IB_TX_INLINE_RESP=0 UCX_MEMTYPE_AVOID_COPY=y ucx_perftest -t ucp_put_bw -m cuda,cuda rock

Used by perftest to exchange psn_t sn:

  | perftest self cfg#2 | remote memory read by ucp_get* into host memory from cuda    |
  +---------------------+-------------------------------------+------------------------+
  |             65..inf | zero-copy                           | rc_mlx5/mlx5_2:1/path0 |

@@ -192,7 +192,8 @@ ucp_tag_rndv_offload_sw_proto_probe(const ucp_proto_init_params_t *init_params)
ucp_proto_rndv_ctrl_priv_t rpriv;

if (!ucp_tag_rndv_check_op_id(init_params) ||
!ucp_ep_config_key_has_tag_lane(init_params->ep_config_key)) {
!ucp_ep_config_key_has_tag_lane(init_params->ep_config_key) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a simple mock test to verify the protocol selection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to discuss because there are many protocols to tests in that case.

@@ -49,7 +49,8 @@ ucp_am_eager_multi_bcopy_proto_probe(const ucp_proto_init_params_t *init_params)
};

if (!ucp_am_check_init_params(init_params, UCP_PROTO_AM_OP_ID_MASK,
UCP_PROTO_SELECT_OP_FLAG_AM_RNDV)) {
UCP_PROTO_SELECT_OP_FLAG_AM_RNDV) ||
context->config.ext.avoid_copy_mem_types) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it feasible to move this check in ucp_proto_init_check_op and then apply it when send_op != ZCOPY?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eager zcopy should also be disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, and egr single too

@@ -49,7 +49,8 @@ ucp_am_eager_multi_bcopy_proto_probe(const ucp_proto_init_params_t *init_params)
};

if (!ucp_am_check_init_params(init_params, UCP_PROTO_AM_OP_ID_MASK,
UCP_PROTO_SELECT_OP_FLAG_AM_RNDV)) {
UCP_PROTO_SELECT_OP_FLAG_AM_RNDV) ||
context->config.ext.avoid_copy_mem_types) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eager zcopy should also be disabled

@@ -118,7 +118,8 @@ ucp_am_eager_short_probe_common(const ucp_proto_init_params_t *init_params,

if (!ucp_am_check_init_params(init_params, UCS_BIT(op_id),
UCP_PROTO_SELECT_OP_FLAG_AM_RNDV) ||
!ucp_proto_is_short_supported(select_param)) {
!ucp_proto_is_short_supported(select_param) ||
init_params->worker->context->config.ext.avoid_copy_mem_types) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you could define a wrapper on top of ucp_am_check_init_params to use for all AM eager protocols and check for avoid_copy_mem_types inside

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -396,7 +396,7 @@ ucp_proto_t ucp_rndv_rkey_ptr_mtype_proto = {
.name = "rndv/rkey_ptr/mtype",
.desc = "copy to mapped remote memory",
.flags = 0,
.probe = ucp_proto_rndv_rkey_ptr_mtype_probe,
.probe = ucp_proto_rndv_rkey_ptr_mtype_probe,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated
also we need to disable rkey_ptr_mtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is disabled as it uses ucp_proto_rndv_mtype_init() which contains the added check. restored space

@@ -120,6 +120,10 @@ static void ucp_rndv_am_bcopy_probe(const ucp_proto_init_params_t *init_params)
.middle.tl_cap_flags = UCT_IFACE_FLAG_AM_BCOPY
};

if (init_params->worker->context->config.ext.avoid_copy_mem_types) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move it to ucp_rndv_am_probe_common, because we need to also disable am_zcopy (it also implies unpacking on the receiver)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -177,7 +177,8 @@ static void ucp_proto_amo_probe(const ucp_proto_init_params_t *init_params,
};

if ((init_params->select_param->dt_class != UCP_DATATYPE_CONTIG) ||
!ucp_proto_init_check_op(init_params, UCS_BIT(op_id))) {
!ucp_proto_init_check_op(init_params, UCS_BIT(op_id)) ||
init_params->worker->context->config.ext.avoid_copy_mem_types) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can disable it only for non-host memtypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed now

Comment on lines +362 to +364
{"MEMTYPE_AVOID_COPY", "n",
"Avoid memory type copies when activated.\n",
ucs_offsetof(ucp_context_config_t, avoid_copy_mem_types), UCS_CONFIG_TYPE_BOOL},
Copy link
Contributor

@rakhmets rakhmets Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{"MEMTYPE_AVOID_COPY", "n",
"Avoid memory type copies when activated.\n",
ucs_offsetof(ucp_context_config_t, avoid_copy_mem_types), UCS_CONFIG_TYPE_BOOL},
{"MEMTYPE_COPY_ENABLE", "y",
"Enable memory type copies",
ucs_offsetof(ucp_context_config_t, mem_type_copy_enable), UCS_CONFIG_TYPE_BOOL},

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to fix from my side, since we had agreed on naming, @yosefe @brminich shall we go ahead with MEMTYPE_COPY_ENABLE?

Copy link
Contributor

@rakhmets rakhmets Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to explain my comment. We have dozens UCX variables ended by "ENABLE", but with "AVOID". And I think it is inconvenient to set a negative value to a variable with a negative in the name to enable a feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants