Conversation

@ayushdg (Contributor) commented Oct 3, 2025

Description

Adds initial example slurm scripts for single and multi-node runs.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • [N/A] New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <[email protected]>
@ayushdg (Contributor Author) commented Oct 3, 2025

@lbliii Could you point me to where I should also update this in the docs?

echo "RAY_ADDRESS: $RAY_ADDRESS"



Contributor

Since we are starting the head node differently than RayClient does, we should make sure we carry over the same env variables, such as XENNA_RAY_METRICS_PORT or XENNA_RESPECT_CUDA_VISIBLE_DEVICES. I'm not sure about the other two here:

os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT))
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT))
# We set some env vars for Xenna here. This is only used for Xenna clusters.
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)

Contributor Author

To clarify: is the suggestion here to also enable Prometheus/Grafana metrics, etc.? Or to ensure that env variables are exported across the head and the client?

The current setup doesn't export the Ray metrics port or CUDA_VISIBLE_DEVICES anywhere.

Contributor

> Or to ensure that env variables are exported across the head and the client?

This. Grafana and Prometheus will be slightly tricky to do in one pass, but I think we should still enable XENNA_RESPECT_CUDA_VISIBLE_DEVICES, for instance, which will make sure that even on SLURM nodes users can use just a subset of GPUs.
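For instance (illustrative only, I haven't verified the exact plumbing), a user could restrict a node to two GPUs like this:

# Only expose the first two GPUs to Ray on this node
export CUDA_VISIBLE_DEVICES=0,1
export XENNA_RESPECT_CUDA_VISIBLE_DEVICES=1
ray start --address=$RAY_GCS_ADDRESS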

@ayushdg (Contributor Author) commented Oct 6, 2025

@sarahyurick JFYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the whole suite takes 12 minutes. Maybe there's something about one of those tests that hangs on these nodes.

@sarahyurick (Contributor)

> @sarahyurick JFYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the whole suite takes 12 minutes. Maybe there's something about one of those tests that hangs on these nodes.

Yes, I noticed the same thing and was not able to determine the root cause. I did not see it when testing locally either. Maybe we can open an issue if it continues to be a blocker.

########################################################
# Container specific variables
########################################################
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
Contributor

Suggested change:
- : "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
+ : "${IMAGE:=nvcr.io/nvidia/nemo-curator}"

Should work? We could also add a comment saying that this script is for 25.09 and above.
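i.e. something along these lines (just a sketch, exact wording up to you):

# NOTE: this example script targets the 25.09 container and newer.
: "${IMAGE:=nvcr.io/nvidia/nemo-curator}"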

RAY_CLIENT_ADDRESS=$HEAD_NODE_IP:$CLIENT_PORT
export RAY_GCS_ADDRESS
export RAY_CLIENT_ADDRESS
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"
Contributor

I think we should change this to

export RAY_ADDRESS=$RAY_GCS_ADDRESS

Contributor Author

Unfortunately that doesn't work in the current SLURM setup, because a different container is started on the head node, and it fails due to not finding a file in the /tmp directory that Ray usually creates. Similar to this: #1174 (comment).

The higher-level question: connecting to a Ray cluster via the client server port is also a valid way to connect to a remote cluster. Any ideas why that isn't working here?
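For context, the two ways of pointing at the cluster as I understand them (sketch, using the variables already defined in this script):

# Ray Client: goes through the client server port, so the submitting container
# does not need Ray's local /tmp session files.
export RAY_ADDRESS="ray://$HEAD_NODE_IP:$CLIENT_PORT"

# Direct driver connection via GCS: expects to run where a local Ray session
# exists (e.g. inside the head-node container), which is what fails here.
export RAY_ADDRESS="$RAY_GCS_ADDRESS"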

-w ${HEAD_NODE_NAME} \
--container-image=$IMAGE \
--container-mounts=$CONTAINER_MOUNTS \
bash -c "ray start \
Contributor

I think we need to change the API limits here; @praateekmahajan can you confirm? Change this line to:

bash -c "RAY_MAX_LIMIT_FROM_API_SERVER=40000 RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 ray start \

Contributor Author

Good catch. I'll add these.
