CoreDump Occurred in Process Running Within Nvidia PyTorch Runtime Image #353

@lvdunlin

Description

I recently tested using dynolog to dynamically collect profiling data from PyTorch programs. The process dumps core when it runs in the following Nvidia PyTorch runtime images:

nvcr.io/nvidia/pytorch:25.01-py3

nvcr.io/nvidia/pytorch:25.02-py3

GPU: A800

My test program is as follows:

import os
import torch
import torch.nn as nn
import time
from torchvision.models import resnet18

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = resnet18().to(device)
inputs = torch.randn(800, 3, 224, 224).to(device)
targets = torch.randint(0, 1000, (800,)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

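def test(iterations=200):
    # Print the PID so the profiler can be pointed at this process.
    print("pid: {}".format(os.getpid()))
    # Warm-up forward pass, not timed.
    _ = model(inputs)

    times = []
    # range(1, iterations) runs iterations - 1 timed training steps.
    for idx in range(1, iterations):
        start_time = time.perf_counter()

        # Forward pass and loss.
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass and parameter update.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Host-side wall time for this step (no explicit CUDA synchronization).
        times.append(time.perf_counter() - start_time)
        print("Iter {} cost {:.3f}".format(idx, times[-1]))

    return times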


if __name__ == "__main__":
    test()
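
To help narrow down where the crash happens, one option is to enable Python's standard-library `faulthandler` before calling `test()`. This is only a sketch: `enable_crash_tracebacks` is a hypothetical helper name, not part of the original script, and I have not confirmed that it reveals the cause here. If the fault is inside a native library, the dump will only show which Python frame was active when the signal arrived.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Hypothetical helper: dump Python tracebacks of all threads on a
    fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)."""
    if stream is None:
        # Resolve stderr at call time; the stream must expose fileno().
        stream = sys.stderr
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Assumed usage in the reproduction script:
#     enable_crash_tracebacks()
#     test()
```

The same effect is available without editing the script by running it with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER=1` environment variable.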
