Hello,
First, thank you for open-sourcing this amazing project. While trying to reproduce the results by following the "Run test script" section of the README, I encountered the error messages below, after which the program crashed.
[gpu2-System-Product-Name:306 :0:306] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:305 :0:305] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
32 0x000000000022aac5 _start() ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
Initially, I suspected that insufficient memory might be the cause, and I tried the following solutions:
- Reduced the batch size.
- Ran Docker with increased memory and shared memory settings.
Unfortunately, these attempts did not resolve the issue.
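One further diagnostic that might help narrow this down is Python's standard `faulthandler` module, which prints the Python-level stack of every thread when a process receives SIGSEGV. The snippet below is only a minimal sketch: `enable_crash_traces` and `worker` are hypothetical names, and `worker` merely stands in for whatever function `test_end_to_end.py` passes to `mp.spawn`. It assumes the crash happens in native code called from that worker.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None):
    """Dump the Python traceback of all threads on a fatal signal such as SIGSEGV.

    `stream` must be a real file object with a file descriptor; it defaults
    to the process's stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


def worker(rank, *args):
    # Hypothetical stand-in for the function handed to mp.spawn.
    # Enabling the handler first means a later segfault in this process
    # also reports which Python line was executing at the time.
    enable_crash_traces()
    # ... the original worker body would run here ...
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching the test script should have the same effect without code changes, since the spawned worker processes inherit the environment.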
Experimental Setup
The experiment is set up on two nodes, each with two GPUs. Specific details are as follows:
- Node 1: Ubuntu 22.04, 2 * NVIDIA GeForce RTX 3090 Ti (24GB memory each)
- Node 2: Ubuntu 20.04, 2 * NVIDIA GeForce RTX 3090 (24GB memory each)
Could you please assist me with this issue?
Thank you very much!