@rhc54 rhc54 commented Nov 11, 2025

Inherit env directives if requested

If someone specifies that child jobs inherit from their
parents, then have them inherit any env directives as
well as job-level directives.

Have children inherit their parent's inheritance directive,
unless directed not to do so.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit eb577d4)

Extend inheritance to app level

If we are inheriting envar directives from our parent job, then
extend that to inheriting envar directives for the application
of the proc that spawned us. Shift processing of inheritance
directives to the mapper, and ensure that the child inherits
the inheritance directive so that the grandchildren will also
inherit.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit a63791f)
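
The inheritance rules in the two commits above can be illustrated with a minimal sketch. This is not PRRTE code: the `Job` class, the `spawn` helper, and the rule that a child's own env directives override inherited ones are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Job:
    """Hypothetical stand-in for a job; not a PRRTE data structure."""
    name: str
    env: Dict[str, str] = field(default_factory=dict)      # job-level env directives
    app_env: Dict[str, str] = field(default_factory=dict)  # env directives of the spawning proc's app
    inherit: bool = False                                   # the inheritance directive


def spawn(parent: Job, name: str,
          env: Optional[Dict[str, str]] = None,
          inherit: Optional[bool] = None) -> Job:
    """Create a child job from `parent`.

    `inherit=None` means the caller gave no explicit directive, so the
    child takes over the parent's inheritance directive.
    """
    effective = parent.inherit if inherit is None else inherit
    merged: Dict[str, str] = {}
    if effective:
        merged.update(parent.env)      # job-level directives
        merged.update(parent.app_env)  # app-level directives of the spawning proc
    # Assumption: the child's own directives win over inherited ones.
    merged.update(env or {})
    return Job(name=name, env=merged, inherit=effective)


parent = Job("parent", env={"A": "1"}, app_env={"B": "2"}, inherit=True)
child = spawn(parent, "child", env={"C": "3"})
grandchild = spawn(child, "grandchild")
print(grandchild.env)
```

Because the child stores the effective directive, the grandchild inherits as well; passing `inherit=False` at any level stops the chain from that point on.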

Extend testbuild launchers support

Check RAS components for compile errors by shimming
the environment-specific functions

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 17399cd)

Fix the colocation algorithm

There were two compensating errors that wound up yielding the
correct map, but the result was flawed should a certain condition
exist. So rework the code to fix the errors and remove the
flaw.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit bdbf4db)

Fix precedence ordering on envar operations

Work from left-to-right across the cmd line, applying env-related
options as we go. When one operation affects the result of another,
this preserves a user's common expectation.

Add a "--set-env" option if the corresponding PMIx CLI is defined.
Seemed a little weird that we had "prepend-env", "append-env", etc.,
but no "set-env". It's the equivalent of "-x foo=val".

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 805e130)
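
A minimal sketch of why left-to-right ordering matters for env operations. The operation names mirror the options mentioned above, but the tuple format, the `unset` operation, and the `:` separator are assumptions for illustration and do not reflect the actual command-line parsing.

```python
from typing import Dict, Iterable, Optional, Tuple

EnvOp = Tuple[str, str, Optional[str]]  # (operation, variable name, value)


def apply_env_ops(base: Dict[str, str], ops: Iterable[EnvOp],
                  sep: str = ":") -> Dict[str, str]:
    """Apply env operations strictly in the order given (left to right)."""
    env = dict(base)  # never modify the caller's environment
    for op, name, value in ops:
        if op == "set":
            env[name] = value
        elif op == "unset":
            env.pop(name, None)
        elif op == "prepend":
            env[name] = value + sep + env[name] if name in env else value
        elif op == "append":
            env[name] = env[name] + sep + value if name in env else value
        else:
            raise ValueError(f"unknown env operation: {op}")
    return env


# The same two operations give different results depending on their order.
print(apply_env_ops({}, [("set", "FOO", "a"), ("append", "FOO", "b")]))
print(apply_env_ops({}, [("append", "FOO", "b"), ("set", "FOO", "a")]))
```

In the first call the append sees the value that was just set; in the second the later set overwrites the appended value, which is what a user reading the command line from left to right would expect.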

Bugfix: inconsistently setting PMIX_JOB_RECOVERABLE

Signed-off-by: Matthew Whitlock [email protected]
(cherry picked from commit 0b1ada9)

Clarify help messages

This error is also displayed in cases where files or directories do not
exist and is not only caused by missing permissions.

Signed-off-by: Christoph Niethammer [email protected]
(cherry picked from commit ac77387)

Do not assign DVM's bookmark to the application job

Allow the target node list to follow the ordering inside a provided hostfile
and dash-host specification by not assigning a bookmark based on the DVM job.

Add support for missing default-hostfile cmd line option. We have the support
for the user to specify it via MCA param, but somehow we lost the integration
to pick it up off of the prte and prterun cmd lines.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 16d8412)

Error out when asymmetric topologies cannot support ppr requests

PPR placement policy requests are uniform - i.e., the specified
number of procs must be placed on every object of the directed
type. When the request includes a cpu/proc directive, then there
must also be enough CPUs to meet the request on every object.

When that isn't the case, then we need to error out and not
just place the proc without binding it.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 665c38e)
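
The uniformity requirement described above can be sketched as a simple feasibility check. This is an illustration only, not the mapper's implementation: the function name and the representation of objects as a list of CPU counts are assumptions.

```python
from typing import Sequence


def procs_for_ppr(cpus_per_object: Sequence[int], ppr: int,
                  cpus_per_proc: int = 1) -> int:
    """Return the total number of procs a ppr request places.

    `cpus_per_object` holds the CPU count of each object of the requested
    type (hypothetical input). Every object must be able to host `ppr`
    procs with `cpus_per_proc` CPUs each; otherwise raise an error rather
    than placing procs that cannot be bound.
    """
    needed = ppr * cpus_per_proc
    for idx, ncpus in enumerate(cpus_per_object):
        if ncpus < needed:
            raise ValueError(
                f"object {idx} has {ncpus} CPUs but the request needs {needed}"
            )
    return ppr * len(cpus_per_object)


# Symmetric case: two objects with 8 CPUs each, 2 procs per object, 4 CPUs per proc.
print(procs_for_ppr([8, 8], ppr=2, cpus_per_proc=4))
```

With an asymmetric list such as `[8, 4]` the same request raises `ValueError`, matching the "error out" behavior instead of silently leaving a proc unbound.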

Let seq and rankfile mappers compute their own num-procs

If we are using the seq or rankfile mapper and have multiple
apps on the cmd line, then allow the mappers to compute
their own num procs if one or more are not given.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit cb17cce)

Fix relative node processing

The empty nodes were not properly being added to the list
of names to be used by the mapper.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 58130c6)

Replace sprintf with snprintf

Per note in the OMPI project, at least one compiler family is removing the "sprintf" function. Replace all uses of that function with the safer "snprintf" version.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 2ff7d6b)

Extend timeout to child jobs

When a timeout is specified and the primary job times out,
then we need to ensure we also report and kill any child jobs
it started. This includes reporting any requested stack
traces.

Also allow inheritance of output directives like tag and timestamp.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit d072f27)
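
A minimal sketch of extending a timeout to child jobs, assuming jobs form a tree through a hypothetical `children` list. The `JobNode` class and `on_timeout` helper are invented for the illustration and are not PRRTE structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobNode:
    """Hypothetical job record holding the jobs it spawned."""
    name: str
    children: List["JobNode"] = field(default_factory=list)
    killed: bool = False


def on_timeout(job: JobNode) -> List[str]:
    """Report and terminate `job` and every job descended from it.

    Returns the names of the jobs that were reported, so a caller could
    attach any requested stack traces to each entry.
    """
    reported: List[str] = []
    pending = [job]
    while pending:
        current = pending.pop()
        reported.append(current.name)      # report the timed-out job
        current.killed = True              # then terminate it
        pending.extend(current.children)   # and queue whatever it started
    return reported


grandchild = JobNode("grandchild")
child = JobNode("child", children=[grandchild])
primary = JobNode("primary", children=[child, JobNode("sibling")])
print(sorted(on_timeout(primary)))
```

The traversal is iterative so that deep spawn chains do not depend on recursion depth; the order in which descendants are visited is not significant here.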

Add launching-apps section to docs

Port the "launching-apps" section from the OMPI docs over
to PRRTE since it specifically deals with prterun usage.
Add some updates about gridengine support courtesy of
open-mpi/ompi#13450.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 424480d)

Improve hetero node detection a bit

Use the hwloc synthetic topology string as the signature
instead of our custom attempt at counting number of types
of objects - the synthetic retains some hierarchical info
and hopefully does a little better job of detecting when
hetero nodes are in use.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 7e5d030)
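
The idea of using a topology string as a node signature can be sketched as follows. The signature strings are illustrative values loosely modeled on hwloc's synthetic description format, and the helper names are assumptions; the sketch does not call hwloc.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_signature(signatures: Dict[str, str]) -> Dict[str, List[str]]:
    """Group node names by their topology signature string."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, sig in signatures.items():
        groups[sig].append(node)
    return dict(groups)


def is_heterogeneous(signatures: Dict[str, str]) -> bool:
    """Nodes are heterogeneous if more than one distinct signature exists."""
    return len(set(signatures.values())) > 1


# Illustrative signatures; real strings would come from each node's topology.
nodes = {
    "n0": "Package:2 Core:8 PU:2",
    "n1": "Package:2 Core:8 PU:2",
    "n2": "Package:1 Core:16 PU:2",
}
print(is_heterogeneous(nodes))
print(group_by_signature(nodes))
```

Because the string keeps the hierarchy (how many cores sit under each package, and so on), two nodes with the same total number of processing units but a different layout still end up in different groups.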

Tweak the forwarding of signals

Update the MCA param help message to clarify what the param
does and what values it supports. Clean up an error where we
would overwrite the resulting list of signals to forward.
Clean up the return value so we don't generate spurious
error log output. Provide verbose output showing the
signals being forwarded.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 2845dcd)

Cleanup and improve autohandling of hetero nodes

Further improve automatic handling of hetero nodes
by making the non-symmetric signature unique, thereby
forcing collection of the full topology from each
such node. Fix an error in the topology retrieval
procedure whereby we double-counted cached nodes,
thereby causing us to quit collecting topologies early.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 4671290)

Fix prun tool

Need to init the ess framework to have the signal forwarding list initialized

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit bff13fb)

rhc54 and others added 18 commits November 11, 2025 11:41
@rhc54 rhc54 merged commit 791d41a into openpmix:v3.0 Nov 11, 2025
18 checks passed
@rhc54 rhc54 deleted the cmr30/up branch November 11, 2025 20:34