Description
Describe the bug
The bug occurs when using Parsl's MpiExecLauncher on ALCF's Polaris. Parsl fails to acquire workers because compute nodes cannot create socket connections, raising an "OSError: AF_UNIX path too long" error.
Parsl uses Python's multiprocessing.connection module to create socket connections, and that module's arbitrary_address() function generates socket file paths under the current TMPDIR. When processes are launched via mpiexec, TMPDIR is extended with additional subdirectories (e.g., /3faf840a-5bc3-49e5-8d4f-59b55d844dc5/tmp/), which pushes the final socket path past the 107-byte limit for AF_UNIX paths.
For example, direct script execution creates sockets with manageable paths like:
/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/pymp-s4odqjpt/listener-6ovu0q00
But mpiexec-launched processes generate paths that are too long:
/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/3faf840a-5bc3-49e5-8d4f-59b55d844dc5/tmp/pymp-3ot0yojm/listener-_10gbt3q
This path length error prevents socket creation and breaks Parsl's worker acquisition process.
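The two example paths above can be checked against the limit with a few lines of Python. This is a minimal sketch: the helper name `socket_path_fits` is made up for illustration, and the 107-byte figure assumes Linux, where `sun_path` is 108 bytes including the trailing NUL.

```python
import os

# Assumption: Linux, where sockaddr_un.sun_path holds 108 bytes including
# the trailing NUL, leaving 107 usable bytes for the path itself.
AF_UNIX_MAX_PATH = 107


def socket_path_fits(path: str, limit: int = AF_UNIX_MAX_PATH) -> bool:
    """Return True if the encoded path is short enough for an AF_UNIX socket."""
    return len(os.fsencode(path)) <= limit


# Paths copied from the examples in this report.
direct_path = (
    "/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
    "/pymp-s4odqjpt/listener-6ovu0q00"
)
mpiexec_path = (
    "/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
    "/3faf840a-5bc3-49e5-8d4f-59b55d844dc5/tmp"
    "/pymp-3ot0yojm/listener-_10gbt3q"
)

for label, path in (("direct", direct_path), ("mpiexec", mpiexec_path)):
    print(f"{label}: {len(os.fsencode(path))} bytes, fits={socket_path_fits(path)}")
```

The direct path is 95 bytes and fits. The mpiexec path is 136 bytes, well over the limit, because the extra UUID and tmp components add 41 bytes.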
A workaround is to set TMPDIR to a shorter path during the node's initialization. For example, exporting TMPDIR=/tmp in the worker_init solves the problem.
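As a rough sketch, the workaround might look like this in a Parsl configuration. The account, queue, label, and walltime values are placeholders rather than values from this report, and the helper name `make_config` is hypothetical. Only the `worker_init` line reflects the workaround described above.

```python
# Shell commands Parsl runs on each node before starting workers.
# TMPDIR=/tmp is the workaround from this report; another short, writable
# path would presumably work as well.
WORKER_INIT = "export TMPDIR=/tmp"


def make_config():
    """Build a Parsl Config that applies the TMPDIR workaround (illustrative)."""
    # Imported here so this module still loads where Parsl is not installed.
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.launchers import MpiExecLauncher
    from parsl.providers import PBSProProvider

    provider = PBSProProvider(
        account="MY_ALLOCATION",  # placeholder, not from the report
        queue="debug",            # placeholder, not from the report
        launcher=MpiExecLauncher(),
        worker_init=WORKER_INIT,
        nodes_per_block=1,
        walltime="00:10:00",      # placeholder
    )
    executor = HighThroughputExecutor(label="polaris_htex", provider=provider)
    return Config(executors=[executor])
```

Passing the result to `parsl.load(make_config())` would then start workers with the shorter TMPDIR, assuming the rest of the site configuration is filled in.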
To Reproduce
The problem can be reproduced using mpirun to execute the simple multiprocessing script provided below:
- Acquire a compute node on Polaris
- Run the script directly and ensure it works:
  python socket_test.py
- Run the script via mpirun and observe the OSError: AF_UNIX path too long error:
  mpirun -n 1 python socket_test.py
from multiprocessing.connection import Listener


def create_temp_socket():
    """Create and bind a temporary socket, then display its information."""
    # address=None lets Listener choose an arbitrary address under TMPDIR.
    address = None
    print("Creating socket...")
    listener = Listener(address)
    actual_address = listener.address
    print(f"Socket bound to: {actual_address}")
    listener.close()
    print("Socket closed successfully!")


if __name__ == "__main__":
    create_temp_socket()

Expected behavior
Polaris compute nodes should be able to create a socket connection back to the script running on the login node.
Environment
- OS: SUSE Linux Enterprise Server 15 SP6
- Python version: 3.11.9
- Parsl version: 2025.3.31
Distributed Environment
- Where are you running the Parsl script from? ALCF Polaris login node.
- Where do you need the workers to run? ALCF Polaris compute nodes