AWS Deep Learning AMI not compatible with multi-GPU training of MolMIM #644

Open
xinyu-dev opened this issue Jan 23, 2025 · 0 comments
Labels: bug (Something isn't working)

Issue

When running multi-GPU pretraining of MolMIM on AWS EC2, the following issues were observed:

  1. Pretraining on a 1x A10 GPU instance works as expected.
  2. Pretraining on a 1x L40S GPU instance with devices: 1 also works as expected.
  3. Pretraining on a 4x A10 GPU instance with devices: 1 also works as expected.
  4. Pretraining on a 4x A10 GPU instance with devices: 4 fails with an error similar to the log below (see the launch sketch after the error log).
  5. Pretraining on a 4x L40S GPU instance with devices: 4 fails with a similar error.

Error log:
log.txt
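
For reference, the difference between the working and failing runs is the trainer device count. The sketch below assumes the Hydra-style overrides used by the BioNeMo 1.x example scripts; the pretraining script path and the remaining overrides are placeholders and may differ from the exact command used here.

```bash
# Sketch only: assumed script path and Hydra-style trainer overrides.

# Works: single-GPU pretraining.
python examples/molecule/molmim/pretrain.py trainer.devices=1 trainer.num_nodes=1

# Fails on the Deep Learning Base OSS AMI: multi-GPU (DDP) pretraining.
python examples/molecule/molmim/pretrain.py trainer.devices=4 trainer.num_nodes=1
```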

How to replicate:

Specific AWS EC2 configurations:

  1. Launch a 4x A10 GPU instance or a 4x L40S GPU instance on AWS EC2.
  2. Choose Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20241115 (ami-0f2ad13ff5f6b6f7c) as the AMI.
  3. Run
docker run --rm -it --gpus all -p 8888:8888 nvcr.io/nvidia/clara/bionemo-framework:1.10.1 "/bin/bash"
  4. Try running MolMIM pretraining with devices: 4 (a quick GPU-visibility check is sketched after these steps).
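
Before launching pretraining, it can help to confirm that the container actually sees all four GPUs, which rules out a plain driver/runtime visibility problem. A quick check inside the container (not part of the original report):

```bash
# Inside the container: list the GPUs the driver exposes and confirm
# that PyTorch sees the same number of CUDA devices.
nvidia-smi -L
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```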

Workaround

Switch to ami-075a0f15f2d44a65e (NVIDIA GPU-Optimized AMI). Note that some users might not have the liberty to switch AMIs, so it would be great if we could figure out what is going on here. Thanks very much!
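
For users who cannot switch AMIs, a minimal NCCL smoke test can help narrow down whether the failure sits at the driver/NCCL layer (as the AMI dependence suggests) or inside MolMIM itself. A sketch, run inside the same container, assuming torchrun is available and 4 GPUs are visible:

```bash
# Write a tiny NCCL all-reduce test and launch it on all 4 GPUs with torchrun.
cat > /tmp/nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for each worker; pin each rank to its own GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL is the backend used for multi-GPU DDP training.
dist.init_process_group(backend="nccl")

# If this all-reduce hangs or errors, the problem is below the framework
# (driver / NCCL / fabric), not in MolMIM itself.
x = torch.ones(1, device="cuda") * local_rank
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()
EOF

torchrun --nproc_per_node=4 /tmp/nccl_check.py
```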
