If an error occurs during Slurm job submission, for example when sbatch rejects the job submission script as invalid, the error is not propagated to the caller and the program hangs.
Exception in thread Thread-323:
Traceback (most recent call last):
  File ".local/share/uv/python/cpython-3.11.10-linux-x86_64-gnu/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 577, in run
    job_id = SlurmExecutor.submit_text(script, self.cfut_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 248, in submit_text
    job_id, stderr = chcall("sbatch --parsable {}".format(filename))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/_utils/call.py", line 47, in chcall
    raise CommandError(command, code, stderr)
cluster_tools._utils.call.CommandError: 'sbatch --parsable <redacted>.sh' exited with status 1: 'sbatch: error: memory limit must be provided for shared jobs\nsbatch: error: Batch job submission failed: Invalid feature specification\n'
^C
This bug was introduced with the use of job submission threads. Since the submission threads are never joined and there is no special error handling/communication, errors are not propagated.
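One possible way to surface such errors is to have the submission thread store any exception on a future that the caller waits on. The following is a minimal sketch of that idea, not the actual cluster_tools implementation: `SubmissionError`, `submit_job`, `SubmitThread`, and `submit_in_thread` are hypothetical names, and `submit_job` only stands in for the real sbatch call.

```python
import threading
from concurrent.futures import Future


class SubmissionError(Exception):
    """Hypothetical stand-in for the CommandError raised when sbatch fails."""


def submit_job(script: str) -> str:
    """Hypothetical stand-in for the real submission call.

    Raises for scripts containing "invalid" to simulate an sbatch failure.
    """
    if "invalid" in script:
        raise SubmissionError("sbatch exited with status 1")
    return "12345"


class SubmitThread(threading.Thread):
    """Submits one job and reports success or failure through a future."""

    def __init__(self, script: str, future: Future) -> None:
        super().__init__(daemon=True)
        self.script = script
        self.future = future

    def run(self) -> None:
        try:
            job_id = submit_job(self.script)
        except Exception as exc:
            # Hand the error to the caller instead of letting it die
            # silently inside the thread.
            self.future.set_exception(exc)
        else:
            self.future.set_result(job_id)


def submit_in_thread(script: str) -> Future:
    """Start a submission thread and return the future it will resolve."""
    future: Future = Future()
    SubmitThread(script, future).start()
    return future
```

With this pattern, `submit_in_thread("invalid script").result()` re-raises the `SubmissionError` in the calling thread instead of blocking forever, while a successful submission returns the job ID.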
Expected Behavior
The caller of the Slurm executor should be notified of the submission error through a raised exception.
Current Behavior
No error is raised on the caller side and no further jobs are submitted, so the program hangs indefinitely.
Steps to Reproduce the bug
Cannot reproduce the bug anymore / needs deeper investigation.
Provoke an sbatch submission error, for example by selecting the slurm strategy and specifying a time or mem resource that is too large or invalid.
The caller does not shut down and hangs indefinitely.
Your Environment for bug
Operating System and version: Linux 5.14.21
Version of webKnossos-libs (Release or Commit): 0.16.2
Context
Affected library: cluster-tools