SLURMCluster jobs not running when using parameters from dask.yaml #394

Closed
Ovec8hkin opened this issue Mar 16, 2020 · 8 comments

@Ovec8hkin

I am using a SLURMCluster object to run some simple Python functions in parallel on an HPC cluster. When I run the script by manually passing each parameter to the SLURMCluster object, the jobs are submitted, connect, run, and return properly. However, when I move those parameters to a dask.yaml file (in ~/.config/dask/dask.yaml), the jobs submit but never connect, finish, or return; instead they hang until I kill the running Python process and cancel the subsequently submitted jobs. Both ways yield the same job script with identical options specified.

What could be causing this?

Below are copies of my dask.yaml file, as well as the SLURMCluster object with the parameters I use when I manually specify everything:

CORES=2

#### This works
cluster = SLURMCluster(name='worker_bee',                                                                                                                                                              
                       queue='normal',                                                                                                                                                                 
                       project='TG-EAR180014',                                                                                                                                                         
                       processes=1,                                                                                                                                                                    
                       cores=CORES,                                                                                                                                                                    
                       memory='2GB',                                                                                                                                                                   
                       interface='ib0',                                                                                                                                                                
                       header_skip=['--mem', '--cpus-per-task='],                                                                                                                                      
                       job_extra=['-N {}'.format(CORES)]
)


#### This doesn't work
cluster = SLURMCluster()   

dask.yaml

jobqueue:
  slurm:
    name: worker-bee
    project: TG-EAR180014
    queue: normal

    cores: 2
    memory: 2GB
    processes: 1

    interface: ib0
    death-timeout: 60           # Number of seconds to wait if a worker can not find a scheduler
    local-directory: null       # Location of fast local storage like /scratch or $TMPDIR

    # SLURM resource manager options
    shebang: "#!/usr/bin/env bash"
    walltime: '00:30'
    extra: []
    env-extra: []
    ncpus: null
    header-skip: ['--mem', '--cpus-per-task=']

    job-extra: ['-N 2']
    log-directory: null
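
For what it's worth, this is the kind of quick sanity check I use to confirm the values really come from the config file and that the two job scripts match (just a sketch, assuming the dask.yaml above is in place):

import dask
from dask_jobqueue import SLURMCluster

# What Dask actually loaded from ~/.config/dask/dask.yaml
print(dask.config.get("jobqueue.slurm"))

# Job script built purely from the config file; apart from the scheduler
# address it should match the one built from explicit keyword arguments.
cluster = SLURMCluster()
print(cluster.job_script())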
@lesteve
Member

lesteve commented Mar 17, 2020

Weird ... I don't see anything obviously wrong.

Maybe you can have a look at https://jobqueue.dask.org/en/latest/debug.html.
In particular, is there anything in your SLURM logs that would indicate what goes wrong? Generally the log file has a format like <some_text><jobid>.out. You can also use log_directory to specify where the log files should be created.
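
For example, something along these lines collects the worker logs in one place (untested sketch; the cores/memory values and the /scratch/dask-logs path are only placeholders):

from dask_jobqueue import SLURMCluster

# Write the SLURM .out log files to a directory that is easy to inspect;
# "/scratch/dask-logs" is just a placeholder for writable fast storage.
cluster = SLURMCluster(cores=2, memory="2GB", log_directory="/scratch/dask-logs")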

@lesteve
Member

lesteve commented Mar 17, 2020

Also not related at all, but I would be interested to know why you need header_skip. In particular, if you could post the error you get without header_skip, it could help others with a similar error figure out how to fix it.

@Ovec8hkin
Author

The header_skip is due to some quirks of the HPC cluster I am on: the scheduler doesn't accept the --mem directive, and I use the -N directive in place of --cpus-per-task because the scheduler requires -N to be present.

@Ovec8hkin
Author

Sorry for the delay, I was forced to relocate due to ongoing concerns with the spread of the virus.

Logs for the working jobs look like this:

distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.194.80:35195'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO -       Start worker at: tcp://192.168.194.80:41101
distributed.worker - INFO -          Listening to: tcp://192.168.194.80:41101
distributed.worker - INFO -          dashboard at:       192.168.194.80:42457
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.54:45828
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    2.00 GB
distributed.worker - INFO -       Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-2iqo8gf1
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://192.168.192.54:45828
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 11.70s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 10.00s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...
distributed.worker - INFO - Stopping worker at tcp://192.168.194.80:41101
distributed.nanny - INFO - Worker closed

Logs for the non-working cases look like this:

distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.194.11:38197'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO -       Start worker at: tcp://192.168.194.11:40507
distributed.worker - INFO -          Listening to: tcp://192.168.194.11:40507
distributed.worker - INFO -          dashboard at:       192.168.194.11:41016
distributed.worker - INFO - Waiting to connect to:  tcp://129.114.63.44:35457
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    2.00 GB
distributed.worker - INFO -       Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-gkesa8dg
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:  tcp://129.114.63.44:35457
distributed.worker - INFO - Waiting to connect to:  tcp://129.114.63.44:35457
distributed.worker - INFO - Waiting to connect to:  tcp://129.114.63.44:35457
distributed.worker - INFO - Waiting to connect to:  tcp://129.114.63.44:35457
distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.194.11:38197'
distributed.worker - INFO - Stopping worker at tcp://192.168.194.11:40507
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/node.py", line 173, in wait_for
    await asyncio.wait_for(future, timeout=timeout)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 440, in <module>
    go()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 436, in go
    main()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 422, in main
    loop.run_sync(run)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 416, in run
    await asyncio.gather(*nannies)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/node.py", line 176, in wait_for
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

I have encountered the Nanny failed to start in 60 seconds error before, but it was related to some networking issues, which I have since resolved; the fixes are reflected in the argument lists for both the working and non-working cases. Since the clusters yield identical job scripts, I am at a loss as to why everything works when I manually specify the cluster parameters but not when I source the parameters from the .yaml file.

@lesteve
Member

lesteve commented Mar 20, 2020

Sorry for the delay, I was forced to relocate due to ongoing concerns with the spread of the virus.

No need to apologize at all, it is kind of expected that things are a bit complicated these days.

I think your problem should be fixed in master. One way to install dask-jobqueue from master is:

pip install git+https://github.com/dask/dask-jobqueue

Let me know if that does not fix it!

This is very likely the same thing as #358: interface is ignored when set from the config file. The fix is in master (#366).

How I figured this out (can be useful for potential debugging sessions): the scheduler is on a separate network in the non-working case.

  • working: Waiting to connect to: tcp://192.168.192.54:45828. Note that Start worker at: tcp://192.168.194.80:41101 means the worker and scheduler are very likely on the same network, so they can communicate via TCP/IP (required by Dask).
  • non-working: Waiting to connect to: tcp://129.114.63.44:35457

Note that in principle this difference should be visible in cluster.job_script() (though it is easy to miss), since the scheduler address appears in cluster.job_script() as the first argument of dask-worker.
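
For example, a rough way to double-check this after upgrading (untested sketch) is to confirm that the scheduler address the workers will dial is on the ib0 network:

from dask_jobqueue import SLURMCluster

# With the fix, a cluster built from the config file alone should bind the
# scheduler to the ib0 address (192.168.x.x here) instead of the external one.
cluster = SLURMCluster()
print(cluster.scheduler_address)   # address the workers will try to reach
print(cluster.job_script())        # the same address follows dask-worker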

@Ovec8hkin
Author

Installing from master worked somewhat, but it exposed a new error. Now the cluster properly sources all of the parameters, and one of the two parallel jobs I'm executing runs and completes as expected, but the other yields the following error message in the logs:

distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.194.81:33401'
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.worker - INFO -       Start worker at: tcp://192.168.194.81:34840
distributed.worker - INFO -          Listening to: tcp://192.168.194.81:34840
distributed.worker - INFO -          dashboard at:       192.168.194.81:40883
distributed.worker - INFO - Waiting to connect to: tcp://192.168.192.51:46854
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    2.00 GB
distributed.worker - INFO -       Local Directory: /home1/06729/tg860286/tests/dask-rsmas-presentation/dask-worker-space/worker-4p2vmepa
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://192.168.192.51:46854
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 11.58s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 10.02s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
slurmstepd: error: *** JOB 5406832 ON c402-061 CANCELLED AT 2020-03-20T13:38:55 DUE TO TIME LIMIT ***
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 440, in <module>
    go()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 436, in go
    main()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 422, in main
    loop.run_sync(run)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/tornado/ioloop.py", line 526, in run_sync
    self.start()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 149, in start
    self.asyncio_loop.run_forever()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
    event_list = self._selector.select(timeout)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/site-packages/distributed/cli/utils.py", line 65, in handle_signal
    signal.signal(sig, old_handlers[sig])
  File "/home1/06729/tg860286/miniconda3/envs/dasktest/lib/python3.8/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
TypeError: signal handler must be signal.SIG_IGN, signal.SIG_DFL, or a callable object

@Ovec8hkin
Author

The above error was a walltime issue. Please ignore.
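
In case it helps anyone else: SLURM may parse the walltime: '00:30' in my config above as 30 seconds rather than 30 minutes, so requesting a longer wall time avoids the TIME LIMIT cancellation. A sketch (the two-hour value is only an example):

from dask_jobqueue import SLURMCluster

# Request a longer wall time in HH:MM:SS form; 02:00:00 is just an example value.
cluster = SLURMCluster(cores=2, memory="2GB", walltime="02:00:00")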

@lesteve
Member

lesteve commented Mar 21, 2020

OK great, I am going to close the issue, thanks for your feedback!

In an ideal world, you would not have such a confusing error at the end of your log. Quickly looking at it, it is not clear whether it is Dask's, Tornado's, asyncio's, or something else's fault. The few things I found:

@lesteve lesteve closed this as completed Mar 21, 2020