AWS Deep Learning AMI not compatible with multi-GPU training of MolMIM #644

Open
xinyu-dev opened this issue Jan 23, 2025 · 0 comments
Labels: bug (Something isn't working)

Issue

When running multi-GPU pretraining of MolMIM on AWS EC2, the following issues were observed:

  1. Pretraining on a 1x A10 GPU instance works as expected.
  2. Pretraining on a 1x L40S GPU instance with devices: 1 also works as expected.
  3. Pretraining on a 4x A10 GPU instance with devices: 1 also works as expected.
  4. Pretraining on a 4x A10 GPU instance with devices: 4 fails with an error similar to the log below (see the launch sketch after the error log).
  5. Pretraining on a 4x L40S GPU instance with devices: 4 fails with a similar error.

Error log:
log.txt
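
For reference, the difference between the working and failing runs is the trainer device count. The sketch below assumes the Hydra-style overrides used by the BioNeMo 1.x example scripts; the pretraining script path and the remaining overrides are placeholders and may differ from the exact command used here.

```bash
# Sketch only: assumed script path and Hydra-style trainer overrides.

# Works: single-GPU pretraining.
python examples/molecule/molmim/pretrain.py trainer.devices=1 trainer.num_nodes=1

# Fails on the Deep Learning Base OSS AMI: multi-GPU (DDP) pretraining.
python examples/molecule/molmim/pretrain.py trainer.devices=4 trainer.num_nodes=1
```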

How to replicate:

Specific AWS EC2 configurations:

  1. Launch a 4x A10 GPU instance or a 4x L40S GPU instance on AWS EC2.
  2. Choose Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20241115 (ami-0f2ad13ff5f6b6f7c) as the AMI.
  3. Run
docker run --rm -it --gpus all -p 8888:8888 nvcr.io/nvidia/clara/bionemo-framework:1.10.1 "/bin/bash"
  4. Try running MolMIM pretraining with devices: 4 (a quick GPU-visibility check is sketched after these steps).
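
Before launching pretraining, it can help to confirm that the container actually sees all four GPUs, which rules out a plain driver/runtime visibility problem. A quick check inside the container (not part of the original report):

```bash
# Inside the container: list the GPUs the driver exposes and confirm
# that PyTorch sees the same number of CUDA devices.
nvidia-smi -L
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```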

Workaround

Switch to ami-075a0f15f2d44a65e (NVIDIA GPU-Optimized AMI). Note that some users might not have the liberty to switch AMIs, so it would be great if we could figure out what is going on here. Thanks very much!
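
For users who cannot switch AMIs, a minimal NCCL smoke test can help narrow down whether the failure sits at the driver/NCCL layer (as the AMI dependence suggests) or inside MolMIM itself. A sketch, run inside the same container, assuming torchrun is available and 4 GPUs are visible:

```bash
# Write a tiny NCCL all-reduce test and launch it on all 4 GPUs with torchrun.
cat > /tmp/nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for each worker; pin each rank to its own GPU.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL is the backend used for multi-GPU DDP training.
dist.init_process_group(backend="nccl")

# If this all-reduce hangs or errors, the problem is below the framework
# (driver / NCCL / fabric), not in MolMIM itself.
x = torch.ones(1, device="cuda") * local_rank
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()
EOF

torchrun --nproc_per_node=4 /tmp/nccl_check.py
```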
