Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CONTRIB/IBMOCK: Inital add with EFA support #10069

Merged
merged 1 commit into from
Feb 6, 2025
Merged

Conversation

tvegas1
Copy link
Contributor

@tvegas1 tvegas1 commented Aug 19, 2024

What

Add IB mocking (EFA support) capability.

Why ?

Need to run gtest without the need for EFA interface.

How ?

Implement loopback libibverbs/libefa. Keep build/code independent from UCX libraries.

Perftest/Info

vdi$ export LD_LIBRARY_PATH=$(pwd)/../efa_mock/build:$LD_LIBRARY_PATH
vdi$ UCX_TLS=srd ucx_perftest -t tag_bw -l
vdi$ UCX_TLS=ud ucx_perftest -t tag_bw -l
vdi$ ucx_info -d

Gtest:

vdi$ export LD_LIBRARY_PATH=$(pwd)/../hpc-poc/ibmock/build
vdi$  ./test/gtest/gtest --gtest_filter=*ud*
...
[----------] Global test environment tear-down
[==========] 2300 tests from 119 test suites ran. (642250 ms total)
[  PASSED  ] 2300 tests.

vdi$ valgrind --leak-check=full  ./test/gtest/gtest --gtest_filter=\
*ud*-*perf.enveloppe*

contrib/ibmock/config.h Outdated Show resolved Hide resolved
contrib/ibmock/config.h Outdated Show resolved Hide resolved
contrib/ibmock/config.h Show resolved Hide resolved
contrib/ibmock/efa.c Outdated Show resolved Hide resolved
contrib/ibmock/efa.c Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/lib.h Show resolved Hide resolved
contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Show resolved Hide resolved
contrib/ibmock/verbs.c Show resolved Hide resolved
Comment on lines 27 to 29
AZP_AGENT_ID: $(AZP_AGENT_ID)
BUILD_NUMBER: "$(Build.BuildId)-$(Build.BuildNumber)"
JOB_URL: "$(System.TeamFoundationCollectionUri)$(System.TeamProject)/_build/results?buildId=$(Build.BuildId)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need these? looks like they are not used by the efa test script

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 83 to 86
LD_LIBRARY_PATH=$(pwd)/contrib/ibmock/build:$LD_LIBRARY_PATH
CLOUD_TYPE=aws
export LD_LIBRARY_PATH
export CLOUD_TYPE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: can init in place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,21 @@
/**
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed all

@@ -0,0 +1,1064 @@
/**
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

#include <infiniband/efadv.h>
#include <infiniband/verbs.h>

#define NUM_DEVS 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe define in verbs.c? anyways it needs to be queried by ibv_get_device_list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,109 @@
/**
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,22 @@
/**
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,102 @@
/**
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
}

printf("ibmock: failed to destroy QP#%d\n", qp->qp_num);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe print to stderr instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed for all

Comment on lines 974 to 981
if (node_type < IBV_NODE_CA || node_type > IBV_NODE_UNSPECIFIED)
return "unknown";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in some places you use different style - brackets around each statement, and also brackets for every block even if it is one line. Better to use the same style thru the whole lib

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

```
cd ucx
./autogen.sh
./contrib/configure-devel --with-verbs=$(pwd)/../rdma-core/build/include --with-efa
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, remove the include suffix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neeed to apply the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

contrib/ibmock/lib.h Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
contrib/ibmock/verbs.c Outdated Show resolved Hide resolved
iyastreb
iyastreb previously approved these changes Jan 21, 2025
brminich
brminich previously approved these changes Jan 27, 2025
@brminich
Copy link
Contributor

@tvegas1, can you pls squash?

@tvegas1
Copy link
Contributor Author

tvegas1 commented Jan 28, 2025

CI failure fixed by #10457

@yosefe
Copy link
Contributor

yosefe commented Jan 30, 2025

/azp run UCX PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tvegas1 tvegas1 force-pushed the ibmock branch 2 times, most recently from cb7e3da to 610ec40 Compare February 3, 2025 14:37
@@ -357,8 +364,7 @@ static uct_iface_ops_t uct_srd_iface_tl_ops = {
ucs_empty_function_return_unsupported,
.ep_pending_purge = (uct_ep_pending_purge_func_t)
ucs_empty_function_return_unsupported,
.iface_flush = (uct_iface_flush_func_t)
ucs_empty_function_return_unsupported,
.iface_flush = uct_srd_iface_flush,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use uct_base_iface_flush?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -327,6 +327,10 @@ UCS_TEST_SKIP_COND_P(test_ucp_perf, envelope, has_transport("self"))
size_t max_iter = std::numeric_limits<size_t>::max();
test_spec test = tests[get_variant_value(VARIANT_TEST_TYPE)];

if (ucs::is_aws() && (test.wait_mode == UCX_PERF_WAIT_MODE_SLEEP)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if we don'k skip aws here?
i would expect UCP layer or libperf be able to handle the missing support

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is the error below, actually it is seen when only ud_v is available, updated the condition in the meantime. I am understanding that it is related to missing event support, do you see an easy way to detect it before inside libperf?

[ RUN      ] ud/test_ucp_perf.envelope/10 <ud_v/tag_bw_b>
[1738752485.488782] [rock28:3736565:p-0]          select.c:645  UCX  ERROR   no active messages transport to rock28:3736565: ud_verbs/rdmap0:1 - no send completion event, ud_verbs/rdmap1:1 - no send completion event
[1738752485.488791] [rock28:3736565:p-1]          select.c:645  UCX  ERROR   no active messages transport to rock28:3736565: ud_verbs/rdmap0:1 - no send completion event, ud_verbs/rdmap1:1 - no send completion event
contrib/../test/gtest/common/test.cc:366: Failure
Failed
Got 2 errors and 0 warnings during the test
[     INFO ] < contrib/../src/ucp/wireup/select.c:645 no active messages transport to rock28:3736565: ud_verbs/rdmap0:1 - no send completion event, ud_verbs/rdmap1:1 - no send completion event >
[     INFO ] < contrib/../src/ucp/wireup/select.c:645 no active messages transport to rock28:3736565: ud_verbs/rdmap0:1 - no send completion event, ud_verbs/rdmap1:1 - no send completion event >
[  FAILED  ] ud/test_ucp_perf.envelope/10, where GetParam() = ud_v/tag_bw_b (6 ms)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR the issue was solicited even support, maybe we can support wakeup/sleep in ud_v trasnport with arming cq for any event (not just solicited), it will not be optimal but at least will work
we can add here "// TODO support wakeup in UD transport without requiring IBV_SEND_SOLICITED"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment

ss << GetParam().transports;

if (ucs::is_aws() && (test.wait_mode == UCX_PERF_WAIT_MODE_SLEEP) &&
ss.str() == "ud_v") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. add ( )
  2. maybe use has_transport() instead of std::stringstream ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@yosefe yosefe enabled auto-merge February 6, 2025 10:44
@yosefe yosefe merged commit 077fc07 into openucx:master Feb 6, 2025
148 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants