Errors during slurm job submission are not propagated #1237

Open
1 task
daniel-wer opened this issue Jan 21, 2025 · 0 comments

Context

  • Affected library: cluster-tools

  • If an error occurs during slurm job submission (for example, sbatch rejects the job submission script as invalid), the error is not propagated to the caller and the program hangs.

Exception in thread Thread-323:
Traceback (most recent call last):
  File ".local/share/uv/python/cpython-3.11.10-linux-x86_64-gnu/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 577, in run
    job_id = SlurmExecutor.submit_text(script, self.cfut_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 248, in submit_text
    job_id, stderr = chcall("sbatch --parsable {}".format(filename))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/_utils/call.py", line 47, in chcall
    raise CommandError(command, code, stderr)
cluster_tools._utils.call.CommandError: 'sbatch --parsable <redacted>.sh' exited with status 1: 'sbatch: error: memory limit must be provided for shared jobs\nsbatch: error: Batch job submission failed: Invalid feature specification\n'
^C
  • This bug was introduced with the use of job submission threads. Since the submission threads are never joined and there is no special error handling/communication, errors are not propagated.
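
The failure mode can be illustrated outside of cluster-tools with a minimal sketch. It assumes the caller waits on a future that only the submission thread can resolve; `failing_submission` and `caller_sees_error` are hypothetical names for illustration and are not part of the cluster-tools API.

```python
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeoutError


def failing_submission(job_id_future: Future) -> None:
    # Hypothetical stand-in for the sbatch call failing inside the
    # submission thread. The exception ends the thread; the future
    # is never resolved.
    raise RuntimeError("Batch job submission failed")


def caller_sees_error(timeout: float = 0.5) -> bool:
    """Return True if the caller observes the submission error."""
    job_id_future: Future = Future()
    thread = threading.Thread(
        target=failing_submission, args=(job_id_future,), daemon=True
    )
    thread.start()  # the thread is never joined, as described above
    try:
        # Without a timeout this call would block forever.
        job_id_future.result(timeout=timeout)
    except FutureTimeoutError:
        return False  # the caller only sees a stalled wait
    except RuntimeError:
        return True
    return False
```

Python prints the traceback of the failed thread to stderr (as in the log above), but the caller's wait never sees the exception, so the sketch needs a timeout to terminate at all.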

Expected Behavior

  • The caller of the slurm executor should be notified of the submission error through a raised exception.
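
One possible way to get this behavior, sketched here under assumptions rather than taken from the cluster-tools code, is to catch the error inside the submission thread and forward it to the future the caller waits on. `fake_sbatch`, `SubmissionError`, and `submit_async` are hypothetical names used only for this illustration.

```python
import threading
from concurrent.futures import Future


class SubmissionError(Exception):
    """Hypothetical error type standing in for a failed sbatch call."""


def fake_sbatch(script: str) -> str:
    # Hypothetical stand-in for sbatch: rejects scripts without a memory limit.
    if "--mem" not in script:
        raise SubmissionError("memory limit must be provided for shared jobs")
    return "12345"


def submit_async(script: str) -> Future:
    """Submit in a background thread and report the outcome via a future."""
    job_id_future: Future = Future()

    def run() -> None:
        try:
            job_id = fake_sbatch(script)
        except Exception as exc:
            # Forward the error instead of letting it die with the thread.
            job_id_future.set_exception(exc)
        else:
            job_id_future.set_result(job_id)

    threading.Thread(target=run, daemon=True).start()
    return job_id_future
```

With this pattern, `submit_async(script).result()` re-raises the submission error in the calling thread instead of blocking indefinitely.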

Current Behavior

  • No error is raised on the caller side and no further jobs are submitted, so the program hangs indefinitely.

Steps to Reproduce the bug

  • Cannot reproduce the bug anymore / needs deeper investigation.
  1. Provoke an sbatch submission error, for example by selecting the slurm strategy and specifying a time or mem resource that is too large or invalid.
  2. The caller does not shut down and hangs indefinitely.

Your Environment for bug

  • Operating System and version: Linux 5.14.21
  • Version of webKnossos-libs (Release or Commit): 0.16.2