Conversation

@ayushdg (Contributor) commented Oct 3, 2025

Description

Adds initial example slurm scripts for single and multi-node runs.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • [N/A] New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <[email protected]>
@ayushdg (Contributor Author) commented Oct 3, 2025

@lbliii Could you point me to where I should also update this in the docs?

echo "RAY_ADDRESS: $RAY_ADDRESS"



Contributor

Since we are starting the head node differently than RayClient does, we should make sure we carry over the same env variables, such as XENNA_RAY_METRICS_PORT or XENNA_RESPECT_CUDA_VISIBLE_DEVICES. I'm not sure about the other two here:

os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT))
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT))
# We set some env vars for Xenna here. This is only used for Xenna clusters.
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)

Contributor Author

To clarify: is the suggestion here to also enable Prometheus/Grafana metrics, etc.? Or to ensure that env variables are exported across the head and the client?

The current setup doesn't export the Ray metrics port or CUDA_VISIBLE_DEVICES anywhere.

Contributor

> Or to ensure that env variables are exported across the head and the client?

This. Grafana and Prometheus will be slightly tricky to do in one pass, but I think we should still enable XENNA_RESPECT_CUDA_VISIBLE_DEVICES, for instance, which will make sure that even on SLURM nodes users can use just a subset of GPUs.
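For instance (illustrative only, I haven't verified the exact plumbing), a user could restrict a node to two GPUs like this:

# Only expose the first two GPUs to Ray on this node
export CUDA_VISIBLE_DEVICES=0,1
export XENNA_RESPECT_CUDA_VISIBLE_DEVICES=1
ray start --address=$RAY_GCS_ADDRESS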

@ayushdg (Contributor Author) commented Oct 6, 2025

@sarahyurick JFYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the whole suite takes 12 minutes. Maybe there's something about one of those tests that hangs on these nodes.

@sarahyurick (Contributor)

> @sarahyurick JFYI, one thing I observed from the timeouts is that it always hangs right after the url_generation tests for wiki before timing out, whereas in successful runs the whole suite takes 12 minutes. Maybe there's something about one of those tests that hangs on these nodes.

Yes, I noticed the same thing and was not able to determine the root cause. I did not see it when testing locally either. Maybe we can open an issue if it continues to be a blocker.

########################################################
# Container specific variables
########################################################
: "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
Contributor

Suggested change:
- : "${IMAGE:=nvcr.io/nvidia/nemo-curator:25.09}"
+ : "${IMAGE:=nvcr.io/nvidia/nemo-curator}"

Should work? We could also add a comment saying that this script is for 25.09 and above.
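i.e. something along these lines (just a sketch, exact wording up to you):

# NOTE: this example script targets the 25.09 container and newer.
: "${IMAGE:=nvcr.io/nvidia/nemo-curator}"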

RAY_CLIENT_ADDRESS=$HEAD_NODE_IP:$CLIENT_PORT
export RAY_GCS_ADDRESS
export RAY_CLIENT_ADDRESS
export RAY_ADDRESS="ray://$RAY_CLIENT_ADDRESS"
Contributor

I think we should change this to

export RAY_ADDRESS=$RAY_GCS_ADDRESS

Contributor Author

Unfortunately that doesn't work in the current SLURM setup, because a different container is started on the head node, and it fails due to not finding a file in the /tmp directory that Ray usually creates. Similar to this: #1174 (comment).

The higher-level question: connecting to a Ray cluster via the client server port is also a valid way to connect to a remote cluster. Any ideas why that isn't working here?
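For context, the two ways of pointing at the cluster as I understand them (sketch, using the variables already defined in this script):

# Ray Client: goes through the client server port, so the submitting container
# does not need Ray's local /tmp session files.
export RAY_ADDRESS="ray://$HEAD_NODE_IP:$CLIENT_PORT"

# Direct driver connection via GCS: expects to run where a local Ray session
# exists (e.g. inside the head-node container), which is what fails here.
export RAY_ADDRESS="$RAY_GCS_ADDRESS"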

-w ${HEAD_NODE_NAME} \
--container-image=$IMAGE \
--container-mounts=$CONTAINER_MOUNTS \
bash -c "ray start \
Contributor

I think we need to change the API limits here; @praateekmahajan can you confirm? Change this line to:

bash -c "RAY_MAX_LIMIT_FROM_API_SERVER=40000 RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 ray start \

Contributor Author

Good catch. I'll add these.
