AF_UNIX socket creation fails with "path too long" error at ALCF #3947

@ryanchard

Describe the bug
The bug occurs when using Parsl's MpiExecLauncher on ALCF's Polaris. Parsl fails to acquire workers because processes on the compute nodes cannot create their sockets and raise "OSError: AF_UNIX path too long".

The problem is that Parsl uses Python's multiprocessing.connection module to create socket connections, and that module's arbitrary_address() function generates socket file paths under the current TMPDIR. When a process is launched via mpiexec, TMPDIR is extended with additional subdirectories (e.g., /3faf840a-5bc3-49e5-8d4f-59b55d844dc5/tmp/), which pushes the final socket path past the 107-byte AF_UNIX path limit.

For example, direct script execution creates sockets with manageable paths like:
/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/pymp-s4odqjpt/listener-6ovu0q00

But mpiexec-launched processes generate paths that are too long:
/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/3faf840a-5bc3-49e5-8d4f-59b55d844dc5/tmp/pymp-3ot0yojm/listener-_10gbt3q

This path length error prevents socket creation and breaks Parsl's worker acquisition process.
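
The two example paths above can be checked against the limit directly. This is only an illustrative sketch: the helper name fits_af_unix is hypothetical, and the 107-byte figure is the limit quoted in this issue (on Linux, sun_path is 108 bytes including the terminating NUL).

```python
import os

# Limit quoted in this issue: 107 usable bytes for an AF_UNIX socket path.
AF_UNIX_PATH_LIMIT = 107


def fits_af_unix(path, limit=AF_UNIX_PATH_LIMIT):
    """Return True if the encoded path fits within the AF_UNIX path limit."""
    return len(os.fsencode(path)) <= limit


# Example paths copied verbatim from this issue.
DIRECT_PATH = (
    "/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
    "/pymp-s4odqjpt/listener-6ovu0q00"
)
MPIEXEC_PATH = (
    "/var/tmp/pbs.6020182.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
    "/3faf840a-5bc3-49e5-8d4f-59b55d844dc5/tmp"
    "/pymp-3ot0yojm/listener-_10gbt3q"
)

if __name__ == "__main__":
    for label, path in (("direct", DIRECT_PATH), ("mpiexec", MPIEXEC_PATH)):
        size = len(os.fsencode(path))
        print(f"{label}: {size} bytes, fits={fits_af_unix(path)}")
```

The direct path fits within the limit, while the mpiexec path does not.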

A workaround is to set TMPDIR to a shorter path when the node is initialized. For example, adding export TMPDIR=/tmp to worker_init avoids the problem.

To Reproduce
The problem can be reproduced using mpirun to execute the simple multiprocessing script provided below:

  1. Acquire a compute node on Polaris
  2. Run the script directly and ensure it works: python socket_test.py
  3. Run the script via mpirun: mpirun -n 1 python socket_test.py and see the OSError: AF_UNIX path too long error

socket_test.py:

from multiprocessing.connection import Listener

def create_temp_socket():
    """Create and bind a temporary socket, then display its information."""
    address = None

    print("Creating socket...")
    listener = Listener(address)

    actual_address = listener.address
    print(f"Socket bound to: {actual_address}")

    listener.close()
    print("Socket closed successfully!")

if __name__ == "__main__":
    create_temp_socket()
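
To compare the two launch modes, it can help to print what each process sees for TMPDIR and roughly how long the generated socket path would be. The sketch below is an estimate under stated assumptions: the helper names are hypothetical, and the overhead constant assumes the /pymp-XXXXXXXX/listener-XXXXXXXX suffix shape seen in the example paths in this issue.

```python
import os
import tempfile

AF_UNIX_PATH_LIMIT = 107  # limit quoted in this issue

# Assumed suffix shape, taken from the example paths in this issue.
SUFFIX_OVERHEAD = len("/pymp-XXXXXXXX/listener-XXXXXXXX")


def estimated_socket_path_len(tmpdir):
    """Estimate the byte length of a socket path generated under tmpdir."""
    return len(os.fsencode(tmpdir)) + SUFFIX_OVERHEAD


def tmpdir_report():
    """Collect the temp-directory settings visible to this process."""
    tmpdir = tempfile.gettempdir()
    estimate = estimated_socket_path_len(tmpdir)
    return {
        "TMPDIR": os.environ.get("TMPDIR"),
        "gettempdir": tmpdir,
        "estimated_path_len": estimate,
        "within_limit": estimate <= AF_UNIX_PATH_LIMIT,
    }


if __name__ == "__main__":
    for key, value in tmpdir_report().items():
        print(f"{key}: {value}")
```

Running this once with python and once with mpirun -n 1 python should make the difference in TMPDIR between the two cases visible.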

Expected behavior
Polaris compute nodes should be able to create a socket connection back to the script running on the login node.

Environment

  • OS: SUSE Linux Enterprise Server 15 SP6
  • Python version: 3.11.9
  • Parsl version: 2025.3.31

Distributed Environment

  • Where are you running the Parsl script from? ALCF Polaris login node.
  • Where do you need the workers to run? ALCF Polaris compute nodes
