Single-node multi-GPU DeepSpeed training fails with CUDA OOM on Azure #17179
Unanswered
gabriead asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Hi community,
we are currently trying to run PyTorch Lightning on Azure (specs below) on a single node with four GPUs to train a transformer.
Training starts on rank 0 (see std_log_process_0.txt) and then runs into a CUDA out-of-memory error. At that point none of the other GPUs appears to have started processing anything (see std_log_process_1.txt, std_log_process_2.txt, std_log_process_3.txt). We couldn't find anything helpful to fix this issue, so we are counting on the community to help us out.
Compute specs on Azure: Standard_NC64as_T4_v3 (single node, 4 GPUs)
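For reference, this VM size exposes four NVIDIA T4 GPUs with 16 GB of memory each; a quick check like the following (an illustrative snippet, not part of our training code) shows what a single process can see:

```python
import torch

# On Standard_NC64as_T4_v3 we expect 4 devices, each a Tesla T4 with ~16 GB.
print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024**3, 1), "GB")
```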
Environment (env.yaml; channels and dependencies omitted here):
Docker image: openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04
This is the training code we run (abridged):
```python
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.callbacks.progress import TQDMProgressBar
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from data_module import LightningDataModule
from model_module import LightningModel


class CustomT5Trainer:
    ...  # body omitted here


from sklearn.model_selection import train_test_split
from azureml.core import Datastore, Workspace
from trainer import CustomT5Trainer
from azure_helper import CustomAzureHelper
import torch
import os
import gc

gc.collect()
torch.cuda.empty_cache()

from pytorch_lightning.plugins.environments import ClusterEnvironment
from pytorch_lightning.strategies import DeepSpeedStrategy, DDPStrategy


class OpenMPIClusterEnvironment(ClusterEnvironment):
    def __init__(self, devices: int = 4) -> None:
        super().__init__()
        self.devices = devices

    # remaining ClusterEnvironment methods omitted here


def train(trainer):
    ...  # body omitted here


if __name__ == "__main__":
    ...  # body omitted here
```
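For completeness, the OpenMPIClusterEnvironment above is only shown abridged; it roughly follows the pattern below. This is a minimal sketch rather than our exact code: the OMPI_COMM_WORLD_* variable names come from the Open MPI launcher, and the MASTER_ADDR/MASTER_PORT handling is an assumption about how the rendezvous address is provided on the Azure ML node.

```python
import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class OpenMPIClusterEnvironment(ClusterEnvironment):
    """Reads rank/world-size information from Open MPI environment variables."""

    def __init__(self, devices: int = 4) -> None:
        super().__init__()
        self.devices = devices

    @property
    def creates_processes_externally(self) -> bool:
        # mpirun (via Azure ML's MpiConfiguration) launches the processes,
        # so Lightning must not spawn any itself.
        return True

    @property
    def main_address(self) -> str:
        # Assumption: the launcher exports MASTER_ADDR; adjust to your setup.
        return os.environ["MASTER_ADDR"]

    @property
    def main_port(self) -> int:
        return int(os.environ.get("MASTER_PORT", "6105"))

    def world_size(self) -> int:
        return int(os.environ["OMPI_COMM_WORLD_SIZE"])

    def set_world_size(self, size: int) -> None:
        pass  # fixed by mpirun

    def global_rank(self) -> int:
        return int(os.environ["OMPI_COMM_WORLD_RANK"])

    def set_global_rank(self, rank: int) -> None:
        pass  # fixed by mpirun

    def local_rank(self) -> int:
        return int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    def node_rank(self) -> int:
        # Single-node job, so this should always be 0.
        return int(os.environ.get("OMPI_COMM_WORLD_NODE_RANK", "0"))

    @staticmethod
    def detect() -> bool:
        return "OMPI_COMM_WORLD_SIZE" in os.environ
```

The Trainer is then built along these lines (again a sketch; the DeepSpeed stage and precision are placeholders, not our exact settings):

```python
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    num_nodes=1,
    strategy=DeepSpeedStrategy(stage=2),  # or DDPStrategy()
    plugins=[OpenMPIClusterEnvironment(devices=4)],
    precision=16,
)
```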
This is the notebook we use for starting the training:
```python
from azureml.core import Workspace, Datastore, Environment, Experiment, ScriptRunConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config("config_atd.json")
datastore = Datastore.get(ws, 'atd_datastore')

gpu_cluster_name = "gpu-compute-.....-4x16"
try:
    gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC12s_v3',
                                                           max_nodes=2)
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_cluster.wait_for_completion(show_output=True)

env = Environment.from_conda_specification('test', 'env.yaml')
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04"
)

cluster = ws.compute_targets[gpu_cluster_name]
job_config = MpiConfiguration(node_count=1, process_count_per_node=4)

src = ScriptRunConfig(
    source_directory='source_files/',
    script='training_script.py',
    compute_target=cluster,
    environment=env,
    distributed_job_config=job_config,
)

run = Experiment(ws, 'test').submit(src)
```
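One thing we want to verify is whether each of the four mpirun processes actually binds to its own GPU, or whether all ranks end up allocating on GPU 0 (which would match the OOM on rank 0 while the other process logs stay empty). A quick sanity check like the following at the top of training_script.py should show it (sketch only; the OMPI_COMM_WORLD_* variables are set by the Open MPI launcher):

```python
import os
import torch

# Each MPI process should report a different local rank and, after
# set_device, a different CUDA device. If all four report device 0,
# every rank is allocating on the same GPU.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
print(
    f"global rank {os.environ.get('OMPI_COMM_WORLD_RANK')}, "
    f"local rank {local_rank}, "
    f"current device {torch.cuda.current_device()} of {torch.cuda.device_count()}"
)
```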
The output logs of the four GPU processes and the MPI launcher are attached:
std_log_process_3.txt
mpi_log.txt
std_log_process_0.txt
std_log_process_1.txt
std_log_process_2.txt