librdmacm: prevent NULL pointer access during device initialization #1547
Let's have a matching 'Fixes' line in the commit log.
9a124b9 to 4182dc8
Done :)
The Fixes line should include 12 chars from the commit ID; you put only 9.
When an RNIC with node_guid 0 is present, rdma_resolve_addr succeeds with ADDR_RESOLVED but subsequent device initialization can fail. This occurs because ucma_query_addr and ucma_query_route skip device initialization when the kernel returns a zero node_guid, leading to NULL pointer access in ucma_process_addr_resolved.

Add explicit NULL checks for id->verbs after ucma_query_addr and ucma_query_route calls. Return ENODEV error if device initialization fails, ensuring proper error propagation instead of crashes.

Note: ucma_query_addr must still return success in this case as it's used for probing AF_IB support, which intentionally skips device initialization.

Fixes: 7162325 ("librdmacm: replace query_route call with separate queries")
Signed-off-by: Luke Yue <[email protected]>
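The check described in the commit message can be sketched in isolation. This is only an illustration, not the code from the patch: struct mock_cm_id is a stand-in for struct rdma_cm_id, status_after_query is a hypothetical helper name, and returning a positive ENODEV is an assumption about the sign convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for struct rdma_cm_id; only the field relevant here.
 * In librdmacm, verbs points to the device context once a device is bound. */
struct mock_cm_id {
	void *verbs;	/* NULL when no device was bound */
};

/* Hypothetical helper: combine the query status with the binding check.
 * A query error is passed through unchanged. A successful query that left
 * the id without a bound device becomes ENODEV (sign convention assumed). */
static int status_after_query(int query_status, const struct mock_cm_id *id)
{
	assert(id != NULL);

	if (query_status)
		return query_status;	/* the query itself failed */
	if (!id->verbs)
		return ENODEV;		/* query succeeded, but no device bound */
	return 0;
}
```

With this shape, the caller stores the result as the event status, so an unbound id surfaces as an error instead of being dereferenced later.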
4182dc8 to 56da3c2
Thanks for catching that! I've now updated the Fixes line to include all 12 characters of the commit ID in the latest push.
} else {
	evt->event.status = ucma_query_route(&evt->id_priv->id);
	evt->event.status = ucma_query_route(id);
	if (!evt->event.status && !id->verbs)
In that flow, 'id->verbs' is not used, unlike in the 'af_ib_support' case above.
Do we really need to fail the call when id->verbs wasn't set? No NULL pointer issue is expected here.
I think we still need to fail the call in the af_ib_support == 0 case, as the man page says: "If successful, the specified rdma_cm_id will be bound to a local device." However, in the current implementation, no device binding occurs in this case.

This becomes problematic in the typical RDMA CM connection setup flow where rdma_resolve_route is called after receiving RDMA_CM_EVENT_ADDR_RESOLVED. The following code in rdma_resolve_route would trigger a NULL pointer dereference:
int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
{
	... snip ...
	// id->verbs is NULL here
	if (id->verbs->device->transport_type == IBV_TRANSPORT_IB) {
		ret = ucma_set_ib_route(id);
		if (!ret)
			goto out;
	}
	... snip ...
}
So we'd better set the event to RDMA_CM_EVENT_ADDR_ERROR and let the user handle the error event.
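For illustration, the application-side decision after an address event could look roughly like the sketch below. It is a simplified model, not librdmacm code: enum mock_event stands in for RDMA_CM_EVENT_ADDR_RESOLVED and RDMA_CM_EVENT_ADDR_ERROR, struct sketch_id stands in for struct rdma_cm_id, and on_addr_event is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two librdmacm address events discussed above. */
enum mock_event { MOCK_ADDR_RESOLVED, MOCK_ADDR_ERROR };

/* What the application does next (hypothetical names). */
enum next_step { STEP_RESOLVE_ROUTE, STEP_ABORT };

/* Stand-in for struct rdma_cm_id; verbs is NULL when no device is bound. */
struct sketch_id {
	void *verbs;
};

/* Decide whether it is safe to continue to route resolution. */
static enum next_step on_addr_event(enum mock_event ev,
				    const struct sketch_id *id)
{
	assert(id != NULL);

	if (ev == MOCK_ADDR_ERROR)
		return STEP_ABORT;	/* library reported the failure */
	if (!id->verbs)
		return STEP_ABORT;	/* defensive: no bound device */
	return STEP_RESOLVE_ROUTE;
}
```

The second check is only a defensive measure on the application side; with the error event delivered by the library, the first branch already covers the zero node_guid case.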
These changes look good to me and make sense. Thanks
When an RNIC with node_guid 0 is present, rdma_resolve_addr succeeds with ADDR_RESOLVED but subsequent device initialization can fail. This occurs because ucma_query_addr and ucma_query_route skip device initialization when the kernel returns a zero node_guid, leading to NULL pointer access in ucma_process_addr_resolved.

Add explicit NULL checks for id->verbs after ucma_query_addr and ucma_query_route calls. Return ENODEV error if device initialization fails, ensuring proper error propagation instead of crashes.

Note: ucma_query_addr must still return success in this case as it's used for probing AF_IB support, which intentionally skips device initialization.

This is easily reproducible with this RNIC configuration and C code:
When using the original librdmacm:
After applying the fix: