Skip to content

Conversation

@zachdworkin
Copy link
Contributor

When updating the unspec queue address information the peer_context's connection is sometimes NULL. This results in segmentation faults on av_insert so checking for it before accessing it fixes the issue. However, this isn't a complete fix because it leaves potential for an entry to get stuck in the wrong queue. The util_srx implementation is overcomplicated for the non-directed receive case. To simplify this, queuing will be limited to the unspecified unexpected queue when not using directed receive. This is because we do not need to enable the per-peer queues unless directed receive is requested. This makes sure that entries do not get stuck in the wrong queue because they are all living in the same one. It also simplifies the lookup for the non-directed receive case.

Note: We do not want to fix this by forcing the rx_buf->conn to be set for all incoming message. We want to avoid an extra lookup/set because the conn is only needed for FI_SOURCE and FI_DIRECTED_RECV cases.

When updating the unspec queue address information the peer_context's
connection is sometimes NULL. This results in segmentation faults on
av_insert so checking for it before accessing it fixes the issue.
However, this isn't a complete fix because it leaves potential for an
entry to get stuck in the wrong queue. The util_srx implementation is
overcomplicated for the non-directed receive case. To simplify this,
queuing will be limited to the unspecified unexpected queue when not
using directed receive. This is because we do not need to enable the
per-peer queues unless directed receive is requested. This makes sure
that entries do not get stuck in the wrong queue because they are all
living in the same one. It also simplifies the lookup for the
non-directed receive case.

Note: We do not want to fix this by forcing the rx_buf->conn to be set
for all incoming message. We want to avoid an extra lookup/set because
the conn is only needed for FI_SOURCE and FI_DIRECTED_RECV cases.

Signed-off-by: Zach Dworkin <[email protected]>
@zachdworkin
Copy link
Contributor Author

@j-xiong I have replayed this PR with the failing mpichtestsuite spawn tests re-enabled. I need to verify that they are enabled correctly and update our CI at the same time this merges

@j-xiong
Copy link
Contributor

j-xiong commented Nov 21, 2025

Looks good. Let's see how the CI works out.

@zachdworkin
Copy link
Contributor Author

Looks good. Let's see how the CI works out.

Looks like there are failures. I will investigate locally and will disable the failing tests if they are caused by different issues than what this PR addresses.

Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also want to add a line in util_foreach_unspec to only move the entry out of the unspec queue if we're using directed receive.
I'm thinking of the case where we are using FI_SOURCE but not FI_DIRECTED_RECV so the rx_entry->addr will get updated (because the conn is set) and will trigger the move into the per peer queue. It's not a big issue but would ensure all the entries stay in the unspec queue as intended. Otherwise you have some in the per peer queue that could get matched later than potentially intended. It's not a bug perse but would be a lingering case that behaves like the old implementation so it would be good to align all the cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants