dpm: don't use locality info for multi PMIX namespace environments #13059
Conversation
Some of our collective frameworks are now locality aware and make use of this information to make decisions on how to handle app coll ops.

It turns out that in certain multi-namespace situations (jobid in ompi speak), some procs can get locality info about other procs, but not in a symmetric fashion, using PMIx mechanisms. This can lead to communicators with different locality information on different procs, which in turn can lead to deadlock when using certain collectives.

This situation can be seen with ompi-tests/ibm/dynamic/intercomm_merge.c. In this test the following happens:

1. process set A is started with mpirun
2. process set A spawns a set of processes B
3. processes in sets A and B create an intra comm using the intercomm from MPI_Comm_spawn and MPI_Comm_get_parent in the spawners and spawnees respectively
4. process set A spawns a set of processes C
5. processes in sets A and C create an intra comm using the intercomm from MPI_Comm_spawn and MPI_Comm_get_parent in the spawners and spawnees respectively
6. processes in A and B create a new intercomm
7. processes in A and C create a new intercomm
8. processes in A, B, and C create a new intra comm using the intercomms from steps 6 and 7
9. processes in A, B, and C try to do an MPI_Barrier using the intra comm from step 8

It turns out that in step 8 the locality info supplied by PMIx is asymmetric. Processes in sets B and C aren't able to determine locality info for each other (PMIx returns not found when attempts are made to get locality info for the remote processes). This causes issues when step 9 is executed.

Processes in set A are trying to use the tuned collective component for the barrier. Processes in sets B and C are trying to use the HAN collective component for the barrier. In process sets B and C, HAN thinks that the communicator has both local and remote procs, so it tries to use a hierarchical algorithm. Meanwhile, procs in set A can retrieve locality for all procs in sets B and C and think the collective is occurring on a single node - which in fact it is.

This behavior can be observed using prrte master at 8ecee645de and openpmix master at a083d8f9.

This patch restricts using locality info for a proc if it's in a different PMIx namespace. It also removes some comments which are no longer accurate.

Signed-off-by: Howard Pritchard <[email protected]>
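For context, here is a much-simplified sketch of the spawn/merge pattern the test exercises. This is not the actual ompi-tests source: it spawns only one child job, whereas the real test repeats the spawn/merge for a second child job (set C) and then combines everything, which is how B and C end up needing each other's locality.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* We are set A: spawn a child job (set B). */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
    } else {
        /* We are a spawnee: "parent" is the intercomm back to the spawners. */
        inter = parent;
    }

    /* Steps 3/5: merge the intercomm into an intracomm spanning both jobs
     * (spawnees order themselves "high"). */
    MPI_Intercomm_merge(inter, (MPI_COMM_NULL == parent) ? 0 : 1, &merged);

    /* Step 9 analogue: a collective over the merged communicator.  This is
     * where inconsistent locality can make different procs pick different
     * coll components (tuned vs. HAN) and deadlock. */
    MPI_Barrier(merged);

    MPI_Comm_free(&merged);
    if (MPI_COMM_NULL == parent) {
        MPI_Comm_free(&inter);
    }
    MPI_Finalize();
    return 0;
}
```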
This sounds like a terrible solution, eliminating all topological information outside the scope of the local MPI_COMM_WORLD.
Okay, well, I'll close this.
Can we exchange the missing information in the Intercomm_merge instead? If I understand your example correctly, processes in A have the correct info about the new intercomm.
It's trivial for PMIx to provide locality for other namespaces - nobody ever asked for it before, but we know all the info. I can take a look at it now that the request has surfaced.
So I spent some time thinking about this and digging into it, and this has nothing to do with PMIx or PRRTE. The problem is that A knows about B and C, but B and C don't know about each other. If you walk through the provided logic, you can see that B and C never execute the connect/accept code in the dpm. Thus, they never get their local peers list for the other job, and never mark procs from the other job as being local vs remote.

What this patch does is simply declare that procs from all other jobs are "remote" - i.e., on another node. As George points out, that doesn't seem like a good solution. It ensures you have a consistent picture, but at a cost.

What you'll need to decide is: at what point in all those steps should B and C "discover" each other? You basically just need to ensure that the dpm code that retrieves/parses local peers and sets up locality for them gets executed, even if B and C don't explicitly call connect/accept to each other. I suspect the answer is to do it in step 8, but that's up to you. I'm guessing it isn't in that step right now because it is creating an intra communicator instead of an inter communicator, and thus doesn't flow thru connect/accept?
The problematic code is in ompi_comm_get_rprocs. The modex call returns not found if the namespace of the procs is not known. This used not to matter, but with things like HAN it does.
I see a potential problem. On line 2355 of ompi/communicator/comm.c, change
Sure, I'll try that, but I find the comment in commit #6635795911c interesting.
What does "local concept" mean in the comment in that commit?
"locality" refers to the location of the process within the node it is operating on - i.e., what package, cores, etc. it is using. So it is a "local" concept in that it only has meaning for the node upon which the process operates. You cannot take the returned locality and interpret it in terms of your own topology if you are on a different node. This is why we only pull locality for procs that are on the "local peers" list - i.e., procs that we know are on the same node as us. So determining locality is a two-step process. First, you get the list of local peers to determine who is local vs remote. Then you get the locality for all local peers so you know where they are relative to you on the node (e.g., do you share a package?). We switched to "immediate" on the modex because the data for other namespaces is held in the server, and "optional" wouldn't go to the server to retrieve it. What you seem to be missing in that section of comm.c is the initial call to get the local peers - you are cycling all the procs through the request for locality. You might get errors that way - I'd have to check what happens if you ask for a proc that isn't on your node (I believe we handle it politely and return not found or something). Regardless, it needs to be "immediate" so you'll ask the server for the data on namespaces other than your own. |
That suggestion doesn't help. I added some print statements and still see the lookups fail for jobid (namespace != my namespace) when the child processes in B look up data for C and vice versa.

Remember the procs in B and C were not "connected" directly via PMIx_Connect. I'm running on a single node.
I suspect the lack of connection is part of the problem, but it should still go up to the server to return the info. Can you provide me with a reproducer so I can explore this a bit? You can post it to me on Slack if necessary.
Still, remember that there is a difference between "locality" and "colocated on a node". If all you want to know is the latter, then asking for "locality" isn't the correct approach - you want to ask for "local peers" and see which procs are on that list. If a proc isn't, then it is on a different node. If it is local, then you can get the topological location for it.
The test source code is here - https://gist.github.com/hppritcha/3f27705da10f8d0af7823b88fe905dc4
I'm running with mpirun -np 4 blah blah
Ah... I have found the problem. It's a combination of errors in OMPI and PMIx - the latter reported by Cray just last week. Pretty simple fixes, so hopefully I'll have something later today.
Okay, sounds good!
Guess I'll capture some of the learnings here - will include them later on the commit/PR, but good to ensure they don't get lost along the way.

First, there is a misunderstanding in the comm code. The absolute location of a proc is provided by the runtime relative to the node that proc is running on, which is at the root of why B and C aren't able to get each other's locality in a usable form.

Second, the runtime provides locality, not co-location. The only way to determine that a proc is on a different node is to get the list (or array) of procs on the node and see if the proc is on it. We do this in the dpm, but that step was missing from the comm code. So what I'm doing is creating a new function to handle it.

The bug in PMIx was an attempted optimization that caused the server not to return the key being requested. I removed the optimization - not sure it was all that helpful anyway.
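The new function isn't named in the thread; as a purely hypothetical sketch of the co-location check being described ("get the list of procs on the node and see if the proc is on it"), something along these lines would do, given the comma-delimited local-peers string already retrieved from PMIx:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not OMPI's actual new function): return true if
 * "rank" appears in the comma-delimited local-peers string for its
 * namespace, i.e., the proc is co-located on this node. */
static bool proc_is_colocated(const char *local_peers, unsigned long rank)
{
    char *copy, *tok, *save;
    bool found = false;

    if (NULL == local_peers) {
        return false;   /* no procs from that namespace on this node */
    }
    copy = strdup(local_peers);
    for (tok = strtok_r(copy, ",", &save); NULL != tok && !found;
         tok = strtok_r(NULL, ",", &save)) {
        found = (strtoul(tok, NULL, 10) == rank);
    }
    free(copy);
    return found;
}

int main(void)
{
    /* e.g. PMIX_LOCAL_PEERS returned "0,1,2,3" for the other namespace */
    printf("%d\n", proc_is_colocated("0,1,2,3", 2));  /* prints 1 */
    printf("%d\n", proc_is_colocated("0,1,2,3", 7));  /* prints 0 */
    return 0;
}
```

Only after a proc passes a check like this does it make sense to fetch and compare its locality string.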
Oh - forgot to mention: the PMIx bug had nothing to do with this particular problem. We wouldn't be able to retrieve the data in this scenario anyway.