
Running multinode on Azure #463

@KhalidAlt

Description

Describe the bug

I am trying to run multinode GRPO on Azure. Unfortunately, I have not been successful yet. I wrote a Ray orchestrator for Azure following your ray.sub script. However, it is still not working and I am not sure what the issue is.

Steps/Code to reproduce bug

Please take a look at my branch here:

https://github.com/KhalidAlt/NeMo-RL/tree/feature/azure-support

I created three files:

azure_config.yaml [ responsible for launching a job on Azure ]
run.sh [ responsible for preparing the nodes for the multinode training run ]
run_multinode.sh [ orchestrates Ray for multinode training and runs the run_grpo_math.py code; a rough sketch of this orchestration is shown below ]
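
Roughly, run_multinode.sh does the following (this is a simplified sketch, not the exact script; the real script resolves the node rank, the head node IP, and the config path from the Azure ML environment):

# Simplified sketch of run_multinode.sh (illustrative; NODE_RANK, HEAD_NODE_IP,
# and the config path are placeholders resolved elsewhere in my scripts)
if [ "${NODE_RANK}" -eq 0 ]; then
    # Head node: start the Ray head, wait for workers to join, then launch training
    ray start --head --port=6379 --dashboard-host=0.0.0.0
    sleep 30
    uv run python examples/run_grpo_math.py --config examples/configs/grpo_math_1B.yaml
    ray stop --force
else
    # Worker nodes: join the existing cluster and block until it shuts down
    ray start --address="${HEAD_NODE_IP}:6379" --block
fi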

Expected behavior

The expected behavior is to run multinode jobs without any issues. However, this is not the case; the code fails every time. Here are the logs:

Logs [head node]:

Generating train split:   0%|          | 0/40315 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 40315/40315 [00:00<00:00, 114832.57 examples/s]
Generating train split: 100%|██████████| 40315/40315 [00:00<00:00, 114623.95 examples/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
WARNING:huggingface_hub.file_download:Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

Generating train split:   0%|          | 0/30 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 30/30 [00:00<00:00, 5463.23 examples/s]

Map:   0%|          | 0/40315 [00:00<?, ? examples/s]
Map:   5%|▌         | 2147/40315 [00:00<00:01, 21061.18 examples/s]
Map:  11%|█         | 4460/40315 [00:00<00:01, 22259.64 examples/s]
Map:  17%|█▋        | 6788/40315 [00:00<00:01, 22719.41 examples/s]
Map:  25%|██▍       | 10074/40315 [00:00<00:01, 22313.36 examples/s]
Map:  31%|███       | 12346/40315 [00:00<00:01, 22441.97 examples/s]
Map:  39%|███▊      | 15528/40315 [00:00<00:01, 21937.85 examples/s]
Map:  45%|████▌     | 18150/40315 [00:00<00:01, 20296.18 examples/s]
Map:  52%|█████▏    | 20808/40315 [00:01<00:01, 19405.05 examples/s]
Map:  58%|█████▊    | 23510/40315 [00:01<00:00, 18943.11 examples/s]
Map:  64%|██████▎   | 25617/40315 [00:01<00:00, 19449.75 examples/s]
Map:  69%|██████▉   | 27828/40315 [00:01<00:00, 20124.38 examples/s]
Map:  74%|███████▍  | 30004/40315 [00:01<00:00, 20557.53 examples/s]
Map:  82%|████████▏ | 33127/40315 [00:01<00:00, 20648.19 examples/s]
Map:  89%|████████▉ | 36000/40315 [00:01<00:00, 20081.04 examples/s]
Map:  95%|█████████▍| 38190/40315 [00:01<00:00, 20525.81 examples/s]
Map: 100%|██████████| 40315/40315 [00:01<00:00, 19752.44 examples/s]
Map: 100%|██████████| 40315/40315 [00:01<00:00, 20408.05 examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]
Map: 100%|██████████| 30/30 [00:00<00:00, 6615.27 examples/s]
2025-06-01 07:06:57,547	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.1.0.106:6379...
2025-06-01 07:06:57,559	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at http://10.1.0.106:8265
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: khalidalt to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: creating run
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in logs/exp_001/wandb/wandb/run-20250601_070700-svla8jst
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run allam-alpha-7b-v2-grpo-v19.0-tp1-pp1-mbs1-ep3-bsz1024-lr5e-6
wandb: ⭐️ View project at https://wandb.ai/khalidalt/grpo-dev
wandb: 🚀 View run at https://wandb.ai/khalidalt/grpo-dev/runs/svla8jst
INFO:nemo_rl.utils.venvs:NEMO_RL_VENV_DIR is set to /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs.
Using CPython 3.12.10
Creating virtual environment at: venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker
warning: `VIRTUAL_ENV=/opt/nemo_rl_venv` does not match the project environment path `venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker` and will be ignored; use `--active` to target the active environment instead
Installed 194 packages in 1.96s
Finished creating venv /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker
(raylet, ip=10.1.0.111) bash: line 1: /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python: No such file or directory
Initialized WandbLogger for project grpo-dev, run allam-alpha-7b-v2-grpo-v19.0-tp1-pp1-mbs1-ep3-bsz1024-lr5e-6 at logs/exp_001/wandb
  ✓ Training dataloader loaded with 40315 samples
  ✓ Validation dataloader loaded with 480 samples

▶ Setting up compute cluster...
  ✓ Ray cluster initialized with 2 nodes


▶ Setting up model and training...
(VllmGenerationWorker pid=12883) INFO 06-01 07:07:15 [__init__.py:239] Automatically detected platform cuda.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:33 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=16384.
(VllmGenerationWorker pid=12881) WARNING 06-01 07:07:33 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(VllmGenerationWorker pid=12884) INFO 06-01 07:07:16 [__init__.py:239] Automatically detected platform cuda. [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(VllmGenerationWorker pid=12880) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12878) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12884) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'generate', 'classify', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12879) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12883) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12882) INFO 06-01 07:07:35 [config.py:689] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:35 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.DUMMY, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=4, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(VllmGenerationWorker pid=12877) INFO 06-01 07:07:35 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:36 [worker_base.py:589] Injected <class 'nemo_rl.models.generation.vllm_backend.VllmInternalWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['report_device_id', 'update_weights_from_ipc_handles']
(VllmGenerationWorker pid=12881) WARNING 06-01 07:07:36 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x142e6266b1d0>
(raylet, ip=10.1.0.111) [2025-06-01 07:08:05,269 E 3168 3168] (raylet) worker_pool.cc:581: Some workers of the worker process(5309) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=10.1.0.111) bash: line 1: /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python: No such file or directory [repeated 7x across cluster]
(raylet, ip=10.1.0.111) [2025-06-01 07:09:05,296 E 3168 3168] (raylet) worker_pool.cc:581: Some workers of the worker process(6797) have not registered within the timeout. The process is dead, probably it crashed during start. [repeated 9x across cluster]
(raylet, ip=10.1.0.111) bash: line 1: /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python: No such file or directory [repeated 8x across cluster]

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of install: Docker image built in an Azure registry
  • If method of install is [Docker], provide docker pull & docker run commands used

Dockerfile:

ARG BASE_IMAGE=nvcr.io/nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04
FROM ${BASE_IMAGE} AS base
# It is more convenient for users to run as root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
    jq \
    curl \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/* && \
    apt-get clean
# Install uv and python
ARG UV_VERSION=0.7.2
ARG PYTHON_VERSION=3.12
ENV PATH="/root/.local/bin:$PATH"
RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh && \
    uv python install ${PYTHON_VERSION}
# Disable usage stats by default for users who are sensitive to sharing usage.
# Users are encouraged to enable it if they wish.
ENV RAY_USAGE_STATS_ENABLED=0

FROM base AS hermetic
WORKDIR /opt/nemo-rl

# Clone the nemo-rl repository
ARG NEMO_RL_REPO_URL=https://github.com/NVIDIA/NeMo-RL.git
ARG NEMO_RL_BRANCH=v0.2.1
RUN git clone ${NEMO_RL_REPO_URL} . && \
    git checkout ${NEMO_RL_BRANCH}

ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv
ENV VIRTUAL_ENV=/opt/nemo_rl_venv
ENV UV_LINK_MODE=copy

# Create and activate virtual environment
RUN <<"EOF" 
uv venv /opt/nemo_rl_venv
# uv sync has a more reliable resolver than simple uv pip install which can fail
# Sync each training + inference backend one at a time (since they may conflict)
# to warm the uv cache, then at the end just sync the default dependencies.
# Do everything in one layer to prevent large layers.
uv sync --link-mode symlink --locked --extra vllm --no-install-project
uv sync --link-mode symlink --locked --extra mcore --no-install-project --no-build-isolation
uv sync --link-mode symlink --locked --all-groups --no-install-project
EOF
ENV PATH="/opt/nemo_rl_venv/bin:$PATH"

FROM hermetic AS release
ARG NEMO_RL_COMMIT
ARG NVIDIA_BUILD_ID
ARG NVIDIA_BUILD_REF
ENV NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-<unknown>}
ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>}
ENV NVIDIA_BUILD_REF=${NVIDIA_BUILD_REF:-<unknown>}
LABEL com.nvidia.build.id="${NVIDIA_BUILD_ID}"
LABEL com.nvidia.build.ref="${NVIDIA_BUILD_REF}"

RUN apt-get update && apt-get install -y --no-install-recommends openssh-server wget

# Setup azcopy
RUN wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz && \
    tar -xvf azcopy.tar.gz && \
    cp ./azcopy_linux_amd64_*/azcopy /usr/bin/ && \
    chmod 755 /usr/bin/azcopy && \
    rm -f azcopy.tar.gz && \
    rm -rf ./azcopy_linux_amd64_*/

# Install Python dependencies
RUN uv pip install --no-cache-dir \
    azureml-core \
    azureml-mlflow \
    tqdm \
    notebook \
    jupyterlab

# Install NeMo-RL from the cloned repository
WORKDIR /opt/nemo-rl
RUN uv pip install -e .
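
For context, the image built from this Dockerfile was pushed to an Azure container registry. The commands below are only a generic, placeholder sketch of how that can be done, not the exact commands used for this job:

# Placeholder registry/image names only; substitute real values
az acr login --name <registry_name>
docker build -t <registry_name>.azurecr.io/nemo-rl:v0.2.1 .
docker push <registry_name>.azurecr.io/nemo-rl:v0.2.1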

Additional Notes

  • I tried to assign VIRTUAL_ENV to an output directory that exists on all nodes (export VIRTUAL_ENV=$EXP_PATH). This makes sure that the directory
    venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker exists on both nodes [ I tested with 2 nodes ]. However, the error is still there because, for some reason, the python binary exists on the first node but not on the other node, which only has python3. A rough sketch of this attempt is included after these notes.

  • I am using nemo-rl v0.2.1. However, I built the Docker image from main, since the Dockerfile in v0.2.1 was reporting that ray could not be found.

  • I made a small change in run_grpo_math.py:

from:

init_ray()

to

    if ray.is_initialized():
        print("Ray has already been initialized...")
    else:
        init_ray()

because init_ray() was causing the following error:

  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 1748, in init
    raise ValueError(
ValueError: When connecting to an existing cluster, resources must not be provided.
+ cleanup_ray
+ echo 'Cleaning up Ray processes...'
Cleaning up Ray processes...
+ ray stop --force
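
Sketch of the VIRTUAL_ENV relocation attempt mentioned in the first note above (the path is a placeholder, not the real mount point on my nodes):

# Illustrative only: $EXP_PATH stands in for an output directory visible on every node
export EXP_PATH=/path/that/exists/on/every/node
export VIRTUAL_ENV=$EXP_PATH   # the change I actually made in run.sh
# Even with this, venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python
# only ended up existing on the head node; the worker node had just python3.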
