Initial slurm deployment scripts #1168
base: main
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
@lbliii Could you point me to where I should also update this in the docs?
echo "RAY_ADDRESS: $RAY_ADDRESS" | ||
|
||
|
||
|
Since we are starting the head node differently than RayClient, we should make sure we carry over the same env variables, which could be XENNA_RAY_METRICS_PORT or XENNA_RESPECT_CUDA_VISIBLE_DEVICES. I'm not sure about the other two here:
Curator/nemo_curator/core/utils.py
Lines 117 to 121 in e4f9571
os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT)) | |
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT)) | |
# We set some env vars for Xenna here. This is only used for Xenna clusters. | |
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port) |
To clarify, is the suggestion here to also enable Prometheus/Grafana metrics, etc.? Or to ensure that the env variables are exported across the head node and the client? The current setup doesn't export the Ray metrics port or CUDA_VISIBLE_DEVICES anywhere.
Or to ensure that env variables are exported across the head and the client.

This. Grafana and Prometheus will be slightly tricky to do in one pass, but I think we should still enable XENNA_RESPECT_CUDA_VISIBLE_DEVICES, for instance, which will make sure that even on SLURM nodes users can use just a subset of GPUs.
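For example, a rough sketch of how the head-node start could forward these variables, reusing the HEAD_NODE_NAME / IMAGE / CONTAINER_MOUNTS variables from this script (the value of XENNA_RESPECT_CUDA_VISIBLE_DEVICES and the RAY_METRICS_PORT variable are assumptions, not something the script defines yet):

# Hypothetical: forward the Xenna env vars into the head-node container
srun -w ${HEAD_NODE_NAME} \
    --container-image=$IMAGE \
    --container-mounts=$CONTAINER_MOUNTS \
    bash -c "XENNA_RESPECT_CUDA_VISIBLE_DEVICES=1 \
        XENNA_RAY_METRICS_PORT=${RAY_METRICS_PORT} \
        ray start --head"  # plus the existing ray start flags from the script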
@sarahyurick jfyi, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the suite takes 12 minutes. Maybe there's something about one of the tests that hangs on these nodes.
Yes, I noticed the same thing and was not able to determine the root cause. I did not see it when testing locally either. Maybe we can open an issue if it continues to be a blocker.
########################################################
# Container specific variables
########################################################
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}" | |
: "${IMAGE:=nvcr.io/nvidia/nemo-curator}" |
Should work? We could also add a comment saying that this script is for 25.09 and above.
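If the tag is dropped from the default, a specific release can still be pinned at submit time, e.g. (the script name below is hypothetical):

# Override the default image when submitting the job
IMAGE=nvcr.io/nvidia/nemo-curator:25.09 sbatch start_ray_cluster.sh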
RAY_CLIENT_ADDRESS=$HEAD_NODE_IP:$CLIENT_PORT
export RAY_GCS_ADDRESS
export RAY_CLIENT_ADDRESS
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"
I think we should change this to
export RAY_ADDRESS=$RAY_GCS_ADDRESS
Unfortunately that doesn't work in the current slurm setup, because a different container is started on the head node and it fails due to not finding a file in the /tmp directory that Ray usually creates, similar to this: #1174 (comment).
The higher-level question is that connecting to a Ray cluster via the client server port is also a valid way to connect to a remote cluster. Any ideas why that isn't working here?
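For reference, a sketch of the two connection modes being discussed, using the variables from the snippet above (the ports in the comments are Ray's usual defaults):

# Option 1: point the driver at the GCS address, e.g. <head-ip>:6379.
# The driver then has to run on a node that is part of the cluster, which is
# where the missing /tmp session files cause trouble here.
export RAY_ADDRESS=$RAY_GCS_ADDRESS

# Option 2 (what the script currently does): Ray Client over the client
# server port, e.g. ray://<head-ip>:10001, which is meant to work from
# outside the cluster.
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"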
-w ${HEAD_NODE_NAME} \
--container-image=$IMAGE \
--container-mounts=$CONTAINER_MOUNTS \
bash -c "ray start \
I think we need to change the API limits here; @praateekmahajan can you confirm? Change this line to:
bash -c "RAY_MAX_LIMIT_FROM_API_SERVER=40000 RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 ray start \
Good catch. I'll add these.
Description
Adds initial example slurm scripts for single and multi-node runs.
Usage
N/A
Checklist