SLURMCluster jobs not running when using parameters from dask.yaml #394
Weird ... I don't see anything obviously wrong. Maybe you can have a look at https://jobqueue.dask.org/en/latest/debug.html.
Also not related at all, but I would be interested to know why you need …

The …
Sorry for the delay; I was forced to relocate due to ongoing concerns with the spread of the virus. Logs for the working jobs look like this:

*(logs not preserved in this copy)*

Logs for the non-working cases look like this:

*(logs not preserved in this copy)*
I have encountered the …
No need to apologize at all, it is kind of expected that things are a bit complicated these days. I think your problem should be fixed in master. One way to install …
Let me know if that does not fix it! This is very likely the same thing as #358: interface is ignored when set from the config file. The fix is in master (#366). How I figured this out (can be useful for potential debugging sessions): the scheduler is on a separate network in the non-working case.
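To make the bug class concrete ("interface is ignored when set from the config file"), here is a simplified, self-contained sketch of the intended precedence: an explicit keyword argument wins, then the YAML config value, then a hard-coded default. This is NOT dask-jobqueue's actual code; the dictionary and key names are invented stand-ins for `dask.config`.

```python
# Stand-in for dask.config's merged YAML configuration (values invented).
CONFIG = {"jobqueue.slurm.interface": "ib0"}


def resolve(kwarg, config_key, default=None, config=CONFIG):
    """Return the explicit kwarg if given, else the config value, else the default.

    The bug class behind #358 is a code path that skips the config lookup
    for some parameters, so a value set only in dask.yaml is silently dropped.
    """
    if kwarg is not None:
        return kwarg
    return config.get(config_key, default)


# An explicit keyword argument always wins:
print(resolve("eth0", "jobqueue.slurm.interface"))  # eth0
# With no kwarg, the YAML config value should be used (the behavior fixed in #366):
print(resolve(None, "jobqueue.slurm.interface"))    # ib0
```

This mirrors why passing parameters directly to `SLURMCluster` worked while the same values in `dask.yaml` appeared to be ignored.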
Note that in principle this difference should be visible in …
Installing from master worked somewhat, but it exposed a new error. Now the cluster properly sources all of the parameters, and one of the two parallel jobs I'm executing runs and completes as expected, but the other yields the following error message in the logs:

*(error log not preserved in this copy)*
The above error was a walltime issue. Please ignore.
OK great, I am going to close the issue, thanks for your feedback! In an ideal world, you would not get such a confusing error at the end of your log. Quickly looking at it, it is not clear whether the fault is Dask's, Tornado's, asyncio's, or something else's. The few things I found:

*(links not preserved in this copy)*
I am using a `SLURMCluster` object to run some simple Python functions in parallel on an HPC cluster. When I run the script by manually passing each parameter to the `SLURMCluster` object, the jobs are submitted, connect, run, and return properly. However, when I move those parameters to a `dask.yaml` file (in `~/.config/dask/dask.yaml`), the jobs submit but never connect, finish, and return; instead they hang until I kill the running Python process and cancel the subsequently submitted jobs. Both ways yield the same job script with identical options specified. What could be causing this?

Below are copies of my `dask.yaml` file, as well as the `SLURMCluster` object with the parameters I use when manually specifying everything:

*(`dask.yaml` contents and `SLURMCluster` call not preserved in this copy)*