Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

librdmacm: prevent NULL pointer access during device initialization #1547

Merged
merged 1 commit into from
Feb 4, 2025

Conversation

dragonJACson
Copy link
Contributor

@dragonJACson dragonJACson commented Jan 18, 2025

When an RNIC with node_guid 0 is present, rdma_resolve_addr succeeds with ADDR_RESOLVED but subsequent device initialization can fail. This occurs because ucma_query_addr and ucma_query_route skip device initialization when the kernel returns a zero node_guid, leading to NULL pointer access in ucma_process_addr_resolved.

Add explicit NULL checks for id->verbs after ucma_query_addr and ucma_query_route calls. Return ENODEV error if device initialization fails, ensuring proper error propagation instead of crashes.

Note: ucma_query_addr must still return success in this case as it's used for probing AF_IB support, which intentionally skips device initialization.

This is easily reproducible with this RNIC configuration and C code:

$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         22.41.1000
        node_guid:                      0000:0000:0000:0000
        sys_image_guid:                 b8ce:f603:00e9:d18e
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000430
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <rdma/rdma_cma.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main() {
    struct rdma_cm_id *cm_id;
    struct rdma_event_channel *channel;
    struct rdma_cm_event *event;
    struct sockaddr_in addr;
    int ret;

    // Create event channel
    channel = rdma_create_event_channel();
    if (!channel) {
        perror("rdma_create_event_channel failed");
        return 1;
    }

    // Create RDMA ID
    ret = rdma_create_id(channel, &cm_id, NULL, RDMA_PS_TCP);
    if (ret) {
        perror("rdma_create_id failed");
        rdma_destroy_event_channel(channel);
        return 1;
    }

    // Setup address
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7471); 
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    // Resolve address
    ret = rdma_resolve_addr(cm_id, NULL, (struct sockaddr *)&addr, 2000);
    if (ret) {
        perror("rdma_resolve_addr failed");
        goto cleanup;
    }

    // Get the address resolved event
    ret = rdma_get_cm_event(channel, &event);
    if (ret) {
        perror("rdma_get_cm_event failed");
        goto cleanup;
    }

    if (event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
        fprintf(stderr, "Unexpected event: %s, status: %d\n", rdma_event_str(event->event), event->status);
        rdma_ack_cm_event(event);
        goto cleanup;
    }

    printf("Address resolved successfully\n");
    rdma_ack_cm_event(event);

cleanup:
    rdma_destroy_id(cm_id);
    rdma_destroy_event_channel(channel);
    return ret;
}

When use the original librdmacm:

$ ./test_cm
[1]    44206 segmentation fault  ./test_cm

After applying the fix:

LD_LIBRARY_PATH=~/workspace/rdma-core/build/lib ./test_cm
Unexpected event: RDMA_CM_EVENT_ADDR_ERROR, status: -1

Copy link
Member

@yishaih yishaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a matching 'Fixes' line in the commit log.

@dragonJACson dragonJACson force-pushed the fix/rdmacm-addr-resolved branch from 9a124b9 to 4182dc8 Compare January 28, 2025 03:02
@dragonJACson
Copy link
Contributor Author

Let's have a matching 'Fixes' line in the commit log.

Done :)

@yishaih
Copy link
Member

yishaih commented Jan 28, 2025

The Fixes line should include 12 chars from the commit ID, you put only 9

When an RNIC with node_guid 0 is present, rdma_resolve_addr succeeds with
ADDR_RESOLVED but subsequent device initialization can fail. This occurs
because ucma_query_addr and ucma_query_route skip device initialization
when the kernel returns a zero node_guid, leading to NULL pointer access
in ucma_process_addr_resolved.

Add explicit NULL checks for id->verbs after ucma_query_addr and
ucma_query_route calls. Return ENODEV error if device initialization
fails, ensuring proper error propagation instead of crashes.

Note: ucma_query_addr must still return success in this case as it's
used for probing AF_IB support, which intentionally skips device
initialization.

Fixes: 7162325 ("librdmacm: replace query_route call with separate queries")
Signed-off-by: Luke Yue <[email protected]>
@dragonJACson dragonJACson force-pushed the fix/rdmacm-addr-resolved branch from 4182dc8 to 56da3c2 Compare January 28, 2025 13:25
@dragonJACson
Copy link
Contributor Author

The Fixes line should include 12 chars from the commit ID, you put only 9

Thanks for catching that! I've now updated the Fixes line to include all 12 characters of the commit ID in the latest push.

} else {
evt->event.status = ucma_query_route(&evt->id_priv->id);
evt->event.status = ucma_query_route(id);
if (!evt->event.status && !id->verbs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that flow, the 'id->verbs' is not used compared to the above 'af_ib_support' case.

Do we really need to fail the call once id->verbs wasn't set ? no NULL pointer issue is expected here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to fail the call in af_ib_support == 0 case, as the man page says

If successful, the specified rdma_cm_id will be bound to a local device.

However, in the current implementation, no device binding occurs in this case.

This becomes problematic in the typical RDMA CM connection setup flow where rdma_resolve_route is called after receiving RDMA_CM_EVENT_ADDR_RESOLVED. The following code in rdma_resolve_route would trigger a NULL pointer dereference:

int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
{
        ... snip ...
        // id->verbs is NULL here
        if (id->verbs->device->transport_type == IBV_TRANSPORT_IB) {
                ret = ucma_set_ib_route(id);
                if (!ret)
                        goto out;
        }
      
        ... snip ...
}

So we'd better set the event to RDMA_CM_EVENT_ADDR_ERROR and let user handle the error event

Copy link
Contributor

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me and make sense. Thanks

@dragonJACson dragonJACson requested a review from yishaih February 4, 2025 02:02
@yishaih yishaih merged commit aac4923 into linux-rdma:master Feb 4, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants