
dpm: don't use locality info for multi PMIX namespace environments #13059

Conversation

hppritcha
Member

Some of our collective frameworks are now locality-aware and use this information to decide how to handle application collective operations.

It turns out that in certain multi-namespace situations (jobid in ompi speak), some procs can get locality info about other procs via PMIx mechanisms, but not in a symmetric fashion. This can lead to communicators with different locality information on different procs, which in turn can lead to deadlock when using certain collectives.

This situation can be seen with the ompi-tests/ibm/dynamic/intercomm_merge.c test.

In this test the following happens (a minimal spawn/merge sketch follows the list):

  1. process set A is started with mpirun
  2. process set A spawns a set of processes B
  3. processes in sets A and B create an intra comm using the intercomm from MPI_Comm_spawn and MPI_Comm_get_parent in the spawners and spawnees respectively
  4. process set A spawns a set of processes C
  5. processes in sets A and C create an intra comm using the intercomm from MPI_Comm_spawn and MPI_Comm_get_parent in the spawners and spawnees respectively
  6. processes in A and B create a new intercomm
  7. processes in A and C create a new intercomm
  8. processes in A, B, and C create a new intra comm using the intercomms from steps 6 and 7
  9. processes in A, B, and C try to do an MPI_Barrier using the intra comm from step 8
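
For orientation, here is a minimal sketch of the spawn/merge building block used in steps 2-5 (spawner and spawnee sides). It is not the actual test: it omits the intercomm construction of steps 6-8, and the process count is arbitrary.

```c
/* Minimal sketch of the spawn/merge building block; error checks omitted. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* spawner side (set A): launch children, then merge the
         * intercomm returned by MPI_Comm_spawn into an intracomm */
        MPI_Comm inter;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0 /* low group */, &merged);
    } else {
        /* spawnee side (set B or C): merge with the parent intercomm */
        MPI_Intercomm_merge(parent, 1 /* high group */, &merged);
    }

    MPI_Barrier(merged);   /* the real test does this on the merged A+B+C comm */

    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}
```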

It turns out that in step 8 the locality info supplied by PMIx is asymmetric. Processes in sets B and C aren't able to determine locality info for each other (PMIx returns "not found" when attempts are made to get locality info for the remote processes). This causes issues when step 9 is executed.

Processes in set A are trying to use the tuned collective component for the barrier. Processes in sets B and C are trying to use the HAN collective component for the barrier. In process sets B and C, HAN thinks that the communicator has both local and remote procs, so it tries to use a hierarchical algorithm. Meanwhile, procs in set A can retrieve locality for all procs in sets B and C and think the collective is occurring on a single node - which in fact it is.

This behavior can be observed using prrte master at 8ecee645de and openpmix master at a083d8f9.

This patch avoids using locality info for a proc if it's in a different PMIx namespace. It also removes some comments which are no longer accurate.

Member

@bosilca bosilca left a comment


This sounds like a terrible solution, eliminating all topological information outside the scope of the local MPI_COMM_WORLD.

@hppritcha
Member Author

okay well I'll close this.

@hppritcha hppritcha closed this Jan 27, 2025
@bosilca
Member

bosilca commented Jan 27, 2025

Can we exchange the missing information in the Intercomm_merge instead? If I understand your example correctly, processes in A have the correct info about the new intercomm.

@rhc54
Contributor

rhc54 commented Jan 27, 2025

It's trivial for PMIx to provide locality for other namespaces - nobody ever asked for it before, but we know all the info. I can take a look at it now that the request has surfaced.

@rhc54
Contributor

rhc54 commented Jan 28, 2025

So I spent some time thinking about this and digging into it, and this has nothing to do with PMIx or PRRTE. The problem is that A knows about B and C, but B and C don't know about each other.

If you walk through the provided logic, you can see that B and C never execute the connect/accept code in the dpm. Thus, they never get their local peers list for the other job, and never mark procs from the other job as being local vs. remote.

What this patch does is simply declare that procs from all other jobs are "remote" - i.e., on another node. As George points out, that doesn't seem like a good solution. It ensures you have a consistent picture, but at a cost.

What you'll need to decide is: at what point in all those steps should B and C "discover" each other? You basically just need to ensure that the dpm code that retrieves/parses local peers and sets up locality for them gets executed, even if B and C don't explicitly call connect/accept to each other. I suspect the answer is to do it in step 8, but that's up to you. I'm guessing it isn't in that step right now because it is creating an intra communicator instead of an inter communicator, and thus doesn't flow thru connect/accept?

@hppritcha
Member Author

The problematic code is in ompi_comm_get_rprocs.

The modex call returns not found if the namespace of the procs is not known.

This didn't use to matter, but with things like HAN it does.

@rhc54
Contributor

rhc54 commented Jan 28, 2025

I see a potential problem. On line 2355 of ompi/communicator/comm.c, change OPAL_MODEX_RECV_VALUE_OPTIONAL to OPAL_MODEX_RECV_VALUE_IMMEDIATE and see if that helps.
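
For readers following along, the distinction looks roughly like this at the PMIx level; OMPI's OPAL_MODEX_RECV_VALUE_* macros wrap calls of this shape. The helper below is illustrative only, not the comm.c code.

```c
/* Illustrative only.  "optional" answers from the client's local cache,
 * while "immediate" tells the client to ask its server if the key is not
 * already available locally - which matters for other namespaces. */
#include <stdbool.h>
#include <pmix.h>

static pmix_status_t fetch_key(const char *nspace, pmix_rank_t rank,
                               const char *key, bool ask_server,
                               pmix_value_t **val)
{
    pmix_proc_t proc;
    pmix_info_t info;
    bool flag = true;

    PMIX_LOAD_PROCID(&proc, nspace, rank);
    PMIX_INFO_LOAD(&info, ask_server ? PMIX_IMMEDIATE : PMIX_OPTIONAL,
                   &flag, PMIX_BOOL);
    return PMIx_Get(&proc, key, &info, 1, val);
}
```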

@hppritcha
Member Author

Sure, I'll try that, but I find the comment in commit #6635795911c interesting.

@hppritcha
Member Author

What does "local concept" mean in the comment in that commit?

@rhc54
Contributor

rhc54 commented Jan 28, 2025

"locality" refers to the location of the process within the node it is operating on - i.e., what package, cores, etc. it is using. So it is a "local" concept in that it only has meaning for the node upon which the process operates. You cannot take the returned locality and interpret it in terms of your own topology if you are on a different node.

This is why we only pull locality for procs that are on the "local peers" list - i.e., procs that we know are on the same node as us. So determining locality is a two-step process. First, you get the list of local peers to determine who is local vs remote. Then you get the locality for all local peers so you know where they are relative to you on the node (e.g., do you share a package?).

We switched to "immediate" on the modex because the data for other namespaces is held in the server, and "optional" wouldn't go to the server to retrieve it.

What you seem to be missing in that section of comm.c is the initial call to get the local peers - you are cycling all the procs through the request for locality. You might get errors that way - I'd have to check what happens if you ask for a proc that isn't on your node (I believe we handle it politely and return not found or something). Regardless, it needs to be "immediate" so you'll ask the server for the data on namespaces other than your own.
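
A rough sketch of that two-step check, written against the public PMIx API rather than OMPI's opal wrappers; the helper name and the list parsing are mine, not code from the tree.

```c
/* Step 1 of the two-step check: is this peer on my node at all?
 * Only then does it make sense to fetch its locality.  Helper name and
 * parsing are illustrative, not code from the OMPI tree. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pmix.h>

static bool proc_is_on_my_node(const char *nspace, pmix_rank_t rank)
{
    pmix_proc_t wildcard;
    pmix_value_t *val = NULL;
    bool local = false;

    /* PMIX_LOCAL_PEERS: comma-delimited list of ranks from the given
     * namespace that are running on this node.  (Per the discussion
     * above, an "immediate" directive may also be needed so the server
     * is consulted for namespaces other than our own.) */
    PMIX_LOAD_PROCID(&wildcard, nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS != PMIx_Get(&wildcard, PMIX_LOCAL_PEERS, NULL, 0, &val)
        || NULL == val) {
        return false;   /* no info available: treat the proc as remote */
    }
    if (PMIX_STRING == val->type && NULL != val->data.string) {
        char buf[32], *tok, *save;
        char *copy = strdup(val->data.string);
        snprintf(buf, sizeof(buf), "%u", (unsigned) rank);
        for (tok = strtok_r(copy, ",", &save); NULL != tok;
             tok = strtok_r(NULL, ",", &save)) {
            if (0 == strcmp(tok, buf)) {
                local = true;
                break;
            }
        }
        free(copy);
    }
    PMIX_VALUE_RELEASE(val);

    /* Step 2 (only for local peers): fetch PMIX_LOCALITY_STRING for the
     * rank and compute where it sits relative to us on the node. */
    return local;
}
```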

@hppritcha
Member Author

That suggestion doesn't help. I added some print statements, and for the child processes in B looking up data for C and vice versa (i.e., for a jobid/namespace != my namespace) I still see:

PMIX_LOCALITY for proc returns 0 jobid 3904569345 vpid 1
PMIX_LOCALITY for proc returns 0 jobid 3904569345 vpid 2
PMIX_LOCALITY for proc returns 0 jobid 3904569345 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 0
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 0
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 1
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 2
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569347 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 3
PMIX_LOCALITY for proc returns -46 jobid 3904569346 vpid 3

remember the procs in B and C were not "connected" directly via PMIx_Connect.

I'm running on a single node.

@rhc54
Contributor

rhc54 commented Jan 28, 2025

I suspect the lack of connection is part of the problem, but it should still go up to the server to return the info. Can you provide me with a reproducer so I can explore this a bit? You can post it to me on Slack if necessary.

@rhc54
Contributor

rhc54 commented Jan 28, 2025

Still, remember that there is a difference between "locality" and "colocated on a node". If all you want to know is the latter, then asking for "locality" isn't the correct approach - you want to ask for "local peers" and see which procs are on that list. If a proc isn't, then it is on a different node. If it is local, then you can get the topological location for it.

@hppritcha
Member Author

The test source code is here: https://gist.github.com/hppritcha/3f27705da10f8d0af7823b88fe905dc4

@hppritcha
Member Author

I'm running with mpirun -np 4 blah blah
Note that if you run with fewer starting processes the problem may vanish. If it doesn't vanish, the code will hang in the collective op on the merged intracommunicator.

@rhc54
Contributor

rhc54 commented Jan 29, 2025

Ah...I have found the problem. Combination of errors in OMPI and PMIx - the latter reported by Cray just last week. Pretty simple fixes, so hopefully have something later today.

@hppritcha
Member Author

okay sounds good!

@rhc54
Contributor

rhc54 commented Jan 29, 2025

Guess I'll capture some of the learnings here - will include them later on the commit/PR, but good to ensure they don't get lost along the way.

First, there is a misunderstanding in the code in ompi_comm_get_rprocs. PMIX_LOCALITY is a value that is computed by OMPI when we do connect/accept - it is computed in opal_hwloc_compute_relative_locality and the value is locally stored on each proc. The reason is that PMIX_LOCALITY provides the location of a process relative to you - it isn't an absolute value representing the location of the process on the node.

The absolute location of the proc is provided by the runtime in PMIX_LOCALITY_STRING. This is what you retrieve in dpm.c - and then you use that to compute the relative locality of that proc, which is then stored as PMIX_LOCALITY.
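
Concretely, the relationship reads roughly like the sketch below. The argument lists are paraphrased, and the helper and its my_locality_string parameter are placeholders rather than the actual dpm.c code; opal_hwloc_compute_relative_locality, PMIX_LOCALITY_STRING, and OPAL_MODEX_RECV_VALUE_IMMEDIATE are the pieces named in this thread.

```c
/* Illustrative sketch only; this would live inside OMPI, so it assumes
 * the opal pmix and hwloc internal headers.  Signatures are paraphrased;
 * the helper and my_locality_string are hypothetical. */
static uint16_t sketch_relative_locality(ompi_proc_t *proc,
                                         char *my_locality_string)
{
    char *remote_loc = NULL;
    uint16_t locality;
    int rc;

    /* The runtime only publishes an absolute on-node location
     * (PMIX_LOCALITY_STRING) for procs that are bound.  "Immediate" makes
     * the client ask its server, which holds data for other namespaces. */
    OPAL_MODEX_RECV_VALUE_IMMEDIATE(rc, PMIX_LOCALITY_STRING,
                                    &proc->super.proc_name,
                                    &remote_loc, PMIX_STRING);

    if (OPAL_SUCCESS == rc && NULL != remote_loc) {
        /* PMIX_LOCALITY is relative: compare the peer's string with our
         * own and store the result locally. */
        locality = opal_hwloc_compute_relative_locality(my_locality_string,
                                                        remote_loc);
        free(remote_loc);
    } else {
        /* Unbound peer (or no data): all we can record is that it shares
         * the node, and only if it appeared on the local peers list. */
        locality = OPAL_PROC_ON_NODE;
    }
    return locality;
}
```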

So the reason B and C aren't able to get each other's PMIX_LOCALITY values is simply because (a) they didn't go thru connect/accept, and therefore (b) they never computed and saved those values.

Second, the runtime provides PMIX_LOCALITY_STRING only for those procs that have a defined location - i.e., procs that are BOUND. If a process is not bound, then it has no fixed location on the node, and so the runtime doesn't provide a locality string for it. Thus, getting "not found" for a modex retrieval on PMIX_LOCALITY_STRING is NOT a definitive indicator that the proc is on a different node.

The only way to determine that a proc is on a different node is to get the list (or array) of procs on the node and see if the proc is on it. We do this in the dpm, but that step was missing from the comm code.

So what I'm doing is creating a new function ompi_dpm_set_locality that both connect/accept and get_rprocs can use since the required functionality is identical. This will hopefully avoid similar mistakes in the future. I'm also adding some comments to explain the above.

The bug in PMIx was an attempted optimization that caused the server not to return the key being requested. I removed the optimization - not sure it was all that helpful anyway.

@rhc54
Contributor

rhc54 commented Jan 29, 2025

Oh - forgot to mention. The PMIx bug had nothing to do with this particular problem. We wouldn't be able to retrieve PMIX_LOCALITY because the server doesn't have it - it is locally computed and stored by each proc as the values only pertain to that proc. So we were going up to the server just fine, and it was correctly reporting "not found" for that key.
