Segmentation Fault During Test Script Execution #1

@kuilz

Description

Hello,

First, thank you for open-sourcing this amazing project. While trying to reproduce the results by following the "Run test script" section of the README, I encountered the error messages below, and the program ultimately crashed.

[gpu2-System-Product-Name:306  :0:306] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gpu2-System-Product-Name:305  :0:305] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
32 0x000000000022aac5 _start()  ???:0
=================================
Traceback (most recent call last):
  File "/home/stellatrain/explore-dp/backend/test/test_end_to_end.py", line 160, in <module>
    mp.spawn(method,
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV

Initially, I suspected that insufficient memory might be the cause, and I tried the following solutions:

  1. Reduced the batch size.
  2. Ran Docker with increased memory and shared-memory settings.

Unfortunately, these attempts did not resolve the issue.
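
To rule out the shared-memory setting not taking effect, a check like the one below could be run inside the container. This is only a minimal sketch: `shared_memory_report` is a hypothetical helper name that is not part of this project, and it assumes shared memory is mounted at `/dev/shm` (the usual location in Docker containers, sized with `docker run --shm-size`).

```python
import os
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return total, used, and free space (in GiB) of the filesystem at `path`.

    `path` defaults to /dev/shm, which is an assumption about where the
    container mounts its shared memory.
    """
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }


if __name__ == "__main__":
    # Only report when the assumed mount point actually exists.
    if os.path.isdir("/dev/shm"):
        print(shared_memory_report())
    else:
        print("/dev/shm not found; shared memory may be mounted elsewhere")
```

If the reported total is far below the value passed to Docker, the setting probably did not reach the container. If it matches, shared-memory exhaustion becomes a less likely cause of the crash.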


Experimental Setup

The experiment is set up on two nodes, each with two GPUs. Specific details are as follows:

  • Node 1: Ubuntu 22.04, 2 × NVIDIA GeForce RTX 3090 Ti (24 GB memory each)
  • Node 2: Ubuntu 20.04, 2 × NVIDIA GeForce RTX 3090 (24 GB memory each)

Could you please assist me with this issue?

Thank you very much!
