Describe the bug
I am trying to run multi-node GRPO on Azure, but I have not been successful yet. I wrote a Ray orchestrator for Azure following your ray.sub script, but it is still not working and I am not sure what the issue is.
Steps/Code to reproduce bug
Please take a look at my branch here:
https://github.com/KhalidAlt/NeMo-RL/tree/feature/azure-support
I created three files:
- azure_config.yaml [responsible for launching the job on Azure]
- run.sh [responsible for preparing the nodes for multi-node training of run_grpo_math.py]
- run_multinode.sh [orchestrates Ray for multi-node training and runs run_grpo_math.py; a rough sketch of this flow is shown right after this list]
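For context, the overall flow run_multinode.sh is meant to follow looks roughly like the sketch below. This is illustrative only, not the exact contents of my script; HEAD_NODE_IP, NODE_RANK, and the launch command are placeholders for values my Azure config provides.
#!/usr/bin/env bash
# Rough sketch of the multi-node Ray bootstrap I am aiming for.
# HEAD_NODE_IP and NODE_RANK are placeholders supplied by the Azure job config.
set -euo pipefail
if [ "${NODE_RANK:-0}" -eq 0 ]; then
  # Head node: start the Ray GCS on port 6379 (the port visible in the logs below).
  ray start --head --port=6379 --dashboard-host=0.0.0.0
  # Launch training from the head node only, once the workers have joined.
  uv run python run_grpo_math.py  # placeholder for the actual launch command
else
  # Worker nodes: join the existing cluster and block so the container stays alive.
  ray start --address="${HEAD_NODE_IP}:6379" --block
fi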
Expected behavior
The expected behavior is that multi-node jobs run without any issues. However, this is not the case: the code fails every time. Here are the logs:
Logs [head node]:
Generating train split: 0%| | 0/40315 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 40315/40315 [00:00<00:00, 114832.57 examples/s]
Generating train split: 100%|██████████| 40315/40315 [00:00<00:00, 114623.95 examples/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
WARNING:huggingface_hub.file_download:Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Generating train split: 0%| | 0/30 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 30/30 [00:00<00:00, 5463.23 examples/s]
Map: 0%| | 0/40315 [00:00<?, ? examples/s]
Map: 5%|▌ | 2147/40315 [00:00<00:01, 21061.18 examples/s]
Map: 11%|█ | 4460/40315 [00:00<00:01, 22259.64 examples/s]
Map: 17%|█▋ | 6788/40315 [00:00<00:01, 22719.41 examples/s]
Map: 25%|██▍ | 10074/40315 [00:00<00:01, 22313.36 examples/s]
Map: 31%|███ | 12346/40315 [00:00<00:01, 22441.97 examples/s]
Map: 39%|███▊ | 15528/40315 [00:00<00:01, 21937.85 examples/s]
Map: 45%|████▌ | 18150/40315 [00:00<00:01, 20296.18 examples/s]
Map: 52%|█████▏ | 20808/40315 [00:01<00:01, 19405.05 examples/s]
Map: 58%|█████▊ | 23510/40315 [00:01<00:00, 18943.11 examples/s]
Map: 64%|██████▎ | 25617/40315 [00:01<00:00, 19449.75 examples/s]
Map: 69%|██████▉ | 27828/40315 [00:01<00:00, 20124.38 examples/s]
Map: 74%|███████▍ | 30004/40315 [00:01<00:00, 20557.53 examples/s]
Map: 82%|████████▏ | 33127/40315 [00:01<00:00, 20648.19 examples/s]
Map: 89%|████████▉ | 36000/40315 [00:01<00:00, 20081.04 examples/s]
Map: 95%|█████████▍| 38190/40315 [00:01<00:00, 20525.81 examples/s]
Map: 100%|██████████| 40315/40315 [00:01<00:00, 19752.44 examples/s]
Map: 100%|██████████| 40315/40315 [00:01<00:00, 20408.05 examples/s]
Map: 0%| | 0/30 [00:00<?, ? examples/s]
Map: 100%|██████████| 30/30 [00:00<00:00, 6615.27 examples/s]
2025-06-01 07:06:57,547 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.1.0.106:6379...
2025-06-01 07:06:57,559 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at http://10.1.0.106:8265
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: khalidalt to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: creating run
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in logs/exp_001/wandb/wandb/run-20250601_070700-svla8jst
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run allam-alpha-7b-v2-grpo-v19.0-tp1-pp1-mbs1-ep3-bsz1024-lr5e-6
wandb: ⭐️ View project at https://wandb.ai/khalidalt/grpo-dev
wandb: 🚀 View run at https://wandb.ai/khalidalt/grpo-dev/runs/svla8jst
INFO:nemo_rl.utils.venvs:NEMO_RL_VENV_DIR is set to /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs.
Using CPython 3.12.10
Creating virtual environment at: venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker
warning: `VIRTUAL_ENV=/opt/nemo_rl_venv` does not match the project environment path `venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker` and will be ignored; use `--active` to target the active environment instead
Installed 194 packages in 1.96s
Finished creating venv /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker
(raylet, ip=10.1.0.111) bash: line 1: /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python: No such file or directory
Initialized WandbLogger for project grpo-dev, run allam-alpha-7b-v2-grpo-v19.0-tp1-pp1-mbs1-ep3-bsz1024-lr5e-6 at logs/exp_001/wandb
✓ Training dataloader loaded with 40315 samples
✓ Validation dataloader loaded with 480 samples
▶ Setting up compute cluster...
✓ Ray cluster initialized with 2 nodes
▶ Setting up model and training...
(VllmGenerationWorker pid=12883) INFO 06-01 07:07:15 [__init__.py:239] Automatically detected platform cuda.
(VllmGenerationWorker pid=12883) INFO 06-01 07:07:15 [__init__.py:239] Automatically detected platform cuda.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:33 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=16384.
(VllmGenerationWorker pid=12881) WARNING 06-01 07:07:33 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(VllmGenerationWorker pid=12884) INFO 06-01 07:07:16 [__init__.py:239] Automatically detected platform cuda. [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:33 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=16384.
(VllmGenerationWorker pid=12881) WARNING 06-01 07:07:33 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(VllmGenerationWorker pid=12880) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12880) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12878) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12878) INFO 06-01 07:07:33 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12884) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'generate', 'classify', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12884) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'generate', 'classify', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12879) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12879) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12883) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12883) INFO 06-01 07:07:34 [config.py:689] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12882) INFO 06-01 07:07:35 [config.py:689] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12882) INFO 06-01 07:07:35 [config.py:689] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:35 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.DUMMY, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=4, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:35 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=True, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.DUMMY, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=4, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(VllmGenerationWorker pid=12877) INFO 06-01 07:07:35 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12877) INFO 06-01 07:07:35 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:36 [worker_base.py:589] Injected <class 'nemo_rl.models.generation.vllm_backend.VllmInternalWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['report_device_id', 'update_weights_from_ipc_handles']
(VllmGenerationWorker pid=12881) WARNING 06-01 07:07:36 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x142e6266b1d0>
(VllmGenerationWorker pid=12881) INFO 06-01 07:07:36 [worker_base.py:589] Injected <class 'nemo_rl.models.generation.vllm_backend.VllmInternalWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['report_device_id', 'update_weights_from_ipc_handles']
(raylet, ip=10.1.0.111) [2025-06-01 07:08:05,269 E 3168 3168] (raylet) worker_pool.cc:581: Some workers of the worker process(5309) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=10.1.0.111) bash: line 1: /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python: No such file or directory [repeated 7x across cluster]
(raylet, ip=10.1.0.111) [2025-06-01 07:09:05,296 E 3168 3168] (raylet) worker_pool.cc:581: Some workers of the worker process(6797) have not registered within the timeout. The process is dead, probably it crashed during start. [repeated 9x across cluster]
(raylet, ip=10.1.0.111) bash: line 1: /eph/nvme0/azureml/cr/j/azure_id_job/exe/wd/venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python: No such file or directory [repeated 8x across cluster]
Environment overview (please complete the following information)
- Environment location: Docker
- Method of install: Docker image built in an Azure container registry
- If method of install is [Docker], provide docker pull & docker run commands used
Dockerfile:
ARG BASE_IMAGE=nvcr.io/nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04
FROM ${BASE_IMAGE} AS base
# It is more convenient for users to run as root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
    jq \
    curl \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/* && \
    apt-get clean
# Install uv and python
ARG UV_VERSION=0.7.2
ARG PYTHON_VERSION=3.12
ENV PATH="/root/.local/bin:$PATH"
RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh && \
    uv python install ${PYTHON_VERSION}
# Disable usage stats by default for users who are sensitive to sharing usage.
# Users are encouraged to enable it if they wish.
ENV RAY_USAGE_STATS_ENABLED=0
FROM base AS hermetic
WORKDIR /opt/nemo-rl
# Clone the nemo-rl repository
ARG NEMO_RL_REPO_URL=https://github.com/NVIDIA/NeMo-RL.git
ARG NEMO_RL_BRANCH=v0.2.1
RUN git clone ${NEMO_RL_REPO_URL} . && \
    git checkout ${NEMO_RL_BRANCH}
ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv
ENV VIRTUAL_ENV=/opt/nemo_rl_venv
ENV UV_LINK_MODE=copy
# Create and activate virtual environment
RUN <<"EOF"
uv venv /opt/nemo_rl_venv
# uv sync has a more reliable resolver than simple uv pip install which can fail
# Sync each training + inference backend one at a time (since they may conflict)
# to warm the uv cache, then at the end just sync the default dependencies.
# Do everything in one layer to prevent large layers.
uv sync --link-mode symlink --locked --extra vllm --no-install-project
uv sync --link-mode symlink --locked --extra mcore --no-install-project --no-build-isolation
uv sync --link-mode symlink --locked --all-groups --no-install-project
EOF
ENV PATH="/opt/nemo_rl_venv/bin:$PATH"
FROM hermetic AS release
ARG NEMO_RL_COMMIT
ARG NVIDIA_BUILD_ID
ARG NVIDIA_BUILD_REF
ENV NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-<unknown>}
ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>}
ENV NVIDIA_BUILD_REF=${NVIDIA_BUILD_REF:-<unknown>}
LABEL com.nvidia.build.id="${NVIDIA_BUILD_ID}"
LABEL com.nvidia.build.ref="${NVIDIA_BUILD_REF}"
RUN apt-get update && apt-get install -y --no-install-recommends openssh-server wget
# Setup azcopy
RUN wget https://aka.ms/downloadazcopy-v10-linux -O azcopy.tar.gz && \
    tar -xvf azcopy.tar.gz && \
    cp ./azcopy_linux_amd64_*/azcopy /usr/bin/ && \
    chmod 755 /usr/bin/azcopy && \
    rm -f azcopy.tar.gz && \
    rm -rf ./azcopy_linux_amd64_*/
# Install Python dependencies
RUN uv pip install --no-cache-dir \
    azureml-core \
    azureml-mlflow \
    tqdm \
    notebook \
    jupyterlab
# Install NeMo-RL from the cloned repository
WORKDIR /opt/nemo-rl
RUN uv pip install -e .
Additional Notes
- I tried to point VIRTUAL_ENV at an output directory that exists on all nodes (export VIRTUAL_ENV=$EXP_PATH). This ensures that venvs/nemo_rl.models.generation.vllm.VllmGenerationWorker exists on both nodes [I tested with 2 nodes]. However, the error is still there: for some reason the venv's python binary exists on the first node, while the other node only has python3 (see the sketch at the end of these notes).
- I am using NeMo-RL v0.2.1. However, I built the Docker image from main, since the one in v0.2.1 was showing that Ray is not found.
- I made a small change in run_grpo_math.py:
from:
init_ray()
to:
if ray.is_initialized():
    print("Ray has already been initialized...")
else:
    init_ray()
because init_ray() was causing the following error:
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 1748, in init
raise ValueError(
ValueError: When connecting to an existing cluster, resources must not be provided.
+ cleanup_ray
+ echo 'Cleaning up Ray processes...'
Cleaning up Ray processes...
+ ray stop --force
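Regarding the first note above (VIRTUAL_ENV and the missing venv), the sketch below is roughly the kind of setup and check I was attempting. It is only an illustration: SHARED_DIR and WORKER_IPS are placeholders for my environment, and I am not sure whether VIRTUAL_ENV or NEMO_RL_VENV_DIR (the variable shown in the head-node log) is the right knob for where the per-worker venvs are created.
#!/usr/bin/env bash
# Placeholder sketch: point the per-worker venvs at a directory visible to every node.
export NEMO_RL_VENV_DIR="${SHARED_DIR}/venvs"
# Sanity-check that the vLLM worker venv's interpreter exists on each node
# before launching training; WORKER_IPS is a placeholder array of worker node IPs.
VENV_PY="${NEMO_RL_VENV_DIR}/nemo_rl.models.generation.vllm.VllmGenerationWorker/bin/python"
for ip in "${WORKER_IPS[@]}"; do
  if ssh "$ip" "test -x '$VENV_PY'"; then
    echo "venv python present on $ip"
  else
    echo "venv python MISSING on $ip"
  fi
done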