Lightning-AI / pytorch-lightning · Discussions

🤖 DDP / multi-GPU / multi-node Discussions
Any questions about DDP or other multi-GPU topics.

-
You must be logged in to vote 🤖 -
You must be logged in to vote 🤖 -
You must be logged in to vote 🤖 -
You must be logged in to vote 🤖 -
You must be logged in to vote 🤖 -
DDP deadlock detected from rank 1 and CUDA error: operation not supported on A10
Labels: distributed, accelerator: cuda
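
For hangs like this one, a minimal first diagnostic (a sketch, assuming the hang occurs inside NCCL collectives) is to enable NCCL's own logging before launching; `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, and disabling peer-to-peer transport is only a hypothesis to test on virtualized A10 instances, not a confirmed fix from the thread.

```python
import os

# Must be set before any process group / Trainer is created so the
# DDP worker processes inherit them.
os.environ["NCCL_DEBUG"] = "INFO"      # per-rank NCCL init and collective logs
os.environ["NCCL_P2P_DISABLE"] = "1"   # hypothesis: rule out unsupported P2P paths

from lightning.pytorch import Trainer  # import after the env vars are set

trainer = Trainer(accelerator="cuda", devices=2, strategy="ddp")
```
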
Exception: process 0 terminated with exit code 1 when DDP
Labels: strategy: ddp
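
A frequent culprit behind "process 0 terminated with exit code 1" under `strategy="ddp"` is unguarded module-level code: Lightning re-launches the script once per device, so setup must live under the `__main__` guard. A minimal sketch, assuming two local GPUs; the toy model and data are placeholders, not taken from the thread:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import LightningModule, Trainer

class ToyModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

def main():
    data = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8)
    trainer = Trainer(accelerator="cuda", devices=2, strategy="ddp", max_epochs=1)
    trainer.fit(ToyModule(), data)

# Without this guard, the re-launched worker processes re-execute
# module-level training code, a common source of this exit-code error.
if __name__ == "__main__":
    main()
```
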
Behaviour of accumulate_gradients and multi-gpu
Labels: distributed, callback: gradient accumulation
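
For reference, accumulation happens independently on each rank, so the effective batch size is per-GPU batch size × number of devices × `accumulate_grad_batches`. A sketch with illustrative numbers:

```python
from lightning.pytorch import Trainer

# With DDP, each of the 4 GPUs sees its own batches; gradients are
# accumulated locally for 8 steps before the optimizer runs, so the
# effective batch size is batch_size * 4 * 8.
trainer = Trainer(
    accelerator="cuda",
    devices=4,
    strategy="ddp",
    accumulate_grad_batches=8,
)
```
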
How to use all the available GPUs
Labels: accelerator: cuda, trainer: argument
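
A minimal sketch using the standard Trainer arguments; `devices=-1` (or `"auto"`) selects every visible GPU:

```python
from lightning.pytorch import Trainer

# -1 means "all visible GPUs"; with more than one device Lightning
# picks a distributed strategy (DDP by default) automatically.
trainer = Trainer(accelerator="cuda", devices=-1)
```
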
Distributed training with multiple optimizers
Labels: distributed, optimization
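
In recent Lightning releases, multiple optimizers are driven through manual optimization. A toy two-optimizer sketch; the GAN-style losses are placeholders:

```python
import torch
from lightning.pytorch import LightningModule

class TwoOptimizerModule(LightningModule):
    def __init__(self):
        super().__init__()
        # Manual optimization is the supported route for multiple optimizers.
        self.automatic_optimization = False
        self.gen = torch.nn.Linear(16, 16)
        self.disc = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()

        # Step the "discriminator" on detached generator output.
        loss_d = self.disc(self.gen(batch).detach()).mean()
        opt_d.zero_grad()
        self.manual_backward(loss_d)
        opt_d.step()

        # Step the "generator".
        loss_g = -self.disc(self.gen(batch)).mean()
        opt_g.zero_grad()
        self.manual_backward(loss_g)
        opt_g.step()

    def configure_optimizers(self):
        return (
            torch.optim.Adam(self.gen.parameters(), lr=1e-3),
            torch.optim.Adam(self.disc.parameters(), lr=1e-3),
        )
```

Run with a plain `Trainer(devices=N, strategy="ddp")`, each rank executes both optimizer steps on its own shard of the data.
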
Restarting parts of cluster
Labels: distributed
Sharding and training multiple models at once for large-scale reinforcement learning
Labels: strategy: deepspeed, pl
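
For sharding very large models, the DeepSpeed strategy aliases are the usual entry point; the stage and precision below are illustrative choices, not taken from the discussion:

```python
from lightning.pytorch import Trainer

# "deepspeed_stage_3" shards optimizer state, gradients, and parameters
# across ranks, trading communication for per-GPU memory.
trainer = Trainer(
    accelerator="cuda",
    devices=8,
    strategy="deepspeed_stage_3",
    precision="16-mixed",
)
```
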
GPU memory consumption fluctuates rapidly with FSDP training
Labels: strategy: fairscale fsdp (removed)
Combine outputs in test epochs when using DDP
Labels: strategy: ddp
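
One standard pattern is to buffer per-rank outputs and merge them with `LightningModule.all_gather` at epoch end; under DDP it returns a tensor with an extra leading world-size dimension. A sketch with a toy model:

```python
import torch
from lightning.pytorch import LightningModule

class GatherOutputsModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.test_outputs = []  # per-rank buffer

    def test_step(self, batch, batch_idx):
        self.test_outputs.append(self.layer(batch))

    def on_test_epoch_end(self):
        local = torch.cat(self.test_outputs)
        # Under DDP, all_gather prepends a world-size dimension;
        # flatten it to get every rank's predictions in one tensor.
        gathered = self.all_gather(local).reshape(-1, local.shape[-1])
        if self.trainer.is_global_zero:
            print("total predictions across ranks:", gathered.shape[0])
        self.test_outputs.clear()
```
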
Weird DDP RNG/seed behavior
Labels: reproducibility, strategy: ddp
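
The usual starting point for seed questions is `seed_everything`. A minimal sketch; note that with `workers=True` the DataLoader worker seeds are derived from the global rank, which is often the behavior being observed in threads like this:

```python
from lightning.pytorch import Trainer, seed_everything

# Seeds Python, NumPy, and torch on every rank; workers=True additionally
# derives distinct, reproducible seeds for DataLoader worker processes
# (the derivation includes the global rank under DDP).
seed_everything(42, workers=True)

trainer = Trainer(accelerator="cuda", devices=2, strategy="ddp")
```
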