Segmentation Fault During Test Script Execution

Hello,

First, I would like to express my gratitude for open-sourcing this amazing project. I encountered the following error messages while trying to reproduce the results according to the `Run test script` section in the `README`, which ultimately led to a program crash. 

```bash
[gpu2-System-Product-Name:306  :0:306] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:305  :0:305] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
```

```bash
32 0x000000000022aac5 _start()  ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
```

Initially, I suspected that insufficient memory might be the cause, and I tried the following solutions:

1. Reduced the batch size.
2. Run Docker with increased memory and shared memory settings.

Unfortunately, these attempts did not resolve the issue.

---

**Experimental Setup**

The experiment is set up on two nodes, each with two GPUs. Specific details are as follows:

- **Node 1:** Ubuntu 22.04, 2 * NVIDIA GeForce RTX 3090 Ti (24GB memory each)
- **Node 2:** Ubuntu 20.04, 2 * NVIDIA GeForce RTX 3090 (24GB memory each)

---

Could you please assist me with this issue?

Thank you very much!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segmentation Fault During Test Script Execution #1

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Segmentation Fault During Test Script Execution #1

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions