
Invalid results of type 1 transform into (64, 64, 64) grid on A100 GPU #575

Open
pavel-shmakov opened this issue Oct 15, 2024 · 6 comments

pavel-shmakov commented Oct 15, 2024

We've encountered an issue where cufinufft.nufft3d1 outputs wildly incorrect results for very specific inputs and only on certain GPUs. This can be reproduced by running the following code on an A100 GPU:

import torch
import cufinufft

# points.pt and values.pt are the tensors from the attached inputs.zip
points = torch.load("points.pt")
values = torch.load("values.pt")
spectrum = cufinufft.nufft3d1(
    *points,
    values,
    (64, 64, 64),
    isign=-1,
    eps=1e-06,
)
print(torch.linalg.norm(spectrum).item())

Here's an archive with points.pt and values.pt: inputs.zip

The printed norm is many orders of magnitude greater than it should be. It also grows quickly with decreasing eps.

Notes:

  • We reproduced this with both cufinufft 2.2.0 and 2.3.0.
  • Reproduced on A100, but not on A10G. We haven't tried other GPUs.
  • The "blow-up" happens for specific grid sizes: 61 through 64, while for 60, 65 and beyond the result goes back to normal. This is for float32 inputs; for float64, we saw a "blow-up" at grid size 32 (see the sketch after this list).
  • We compiled cufinufft from source to investigate further, but surprisingly couldn't reproduce the bug. We've tried compiling from master and v2.3.X as well as various compilation options. If you could point us to the compilation options with which the release version of libcufinufft.so is built, that would be helpful, and we can investigate further!
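
For reference, a minimal sketch of such a grid-size sweep, assuming the points.pt / values.pt from the attached archive and a GPU where the bug reproduces (the 58-67 range is just an illustrative choice):

import torch
import cufinufft

# Hypothetical sweep over grid sizes around the reported blow-up range (61-64 for float32).
points = torch.load("points.pt")
values = torch.load("values.pt")
for n in range(58, 68):
    spectrum = cufinufft.nufft3d1(*points, values, (n, n, n), isign=-1, eps=1e-06)
    print(f"grid {n}: norm = {torch.linalg.norm(spectrum).item():.6g}")
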
pavel-shmakov (Author) commented:

Smaller reproducer with just one point:

import torch
import cufinufft

batch_size = 32
# A single non-uniform point at the origin, with unit strength in each of the batched transforms.
v = torch.tensor([[1] for _ in range(batch_size)], dtype=torch.complex64, device="cuda")
p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=1e-6)

The spectrum should be 1 everywhere, which it is for batch_size < 16; for batch_size >= 16 it starts misbehaving.
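
Not from the thread, just a hedged sketch of that batch-size sweep (same single-point setup; the specific batch sizes are an illustrative choice):

import torch
import cufinufft

# Hypothetical sweep: the output should be 1 everywhere, so report the worst deviation.
p = torch.tensor([[0.0], [0.0], [0.0]], dtype=torch.float32, device="cuda")
for batch_size in (8, 15, 16, 32):
    v = torch.ones((batch_size, 1), dtype=torch.complex64, device="cuda")
    spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=1e-6)
    max_err = (spectrum - 1).abs().max().item()
    print(f"{batch_size=}: max |spectrum - 1| = {max_err:.3g}")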

DiamonDinoia (Collaborator) commented:

What happens if we use GM (global memory) instead of SM (shared memory)? See the GPU options at https://finufft.readthedocs.io/en/latest/c_gpu.html#options-for-gpu-code. The gpu_method option should be supported in Python too.

pavel-shmakov (Author) commented:

With gpu_method=1 we also get an incorrect, but very different, answer on A100:

import torch
import cufinufft

batch_size = 32
n_modes = 64
points = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
values = torch.tensor([[1] for _ in range(batch_size)], dtype=torch.complex64, device="cuda")
for gpu_method in [1, 2]:
    spectrum = cufinufft.nufft3d1(*points, values, (n_modes, n_modes, n_modes), eps=1e-6, gpu_method=gpu_method)
    print(f"{gpu_method=}: {spectrum[0, 0, 0, 0].item()}")

A100:

gpu_method=1: (-4.974409603164531e-05-0.00036744182580150664j)
gpu_method=2: (2097152.5-0.16087542474269867j)

On T4 all good:

gpu_method=1: (1.000000238418579+0j)
gpu_method=2: (1.000000238418579+0j)

DiamonDinoia (Collaborator) commented:

@janden could you provide the command to do a debug build with pip? I have seen this type of error when using debug symbols. In my tests, if I compile with -G, nvcc generates an incorrect binary that overflows the stack while spreading; it does not crash, it just produces output that is wrong at some points.

@pavel-shmakov could you try a bigger eps? 1e-2 or 1e-3?

pavel-shmakov (Author) commented:

@pavel-shmakov could you try a bigger eps? 1e-2 or 1e-3?

eps=0.1: (0.5681691765785217-1.674233078956604j)
eps=0.01: 0j
eps=0.001: (0.1657368689775467-0.00020104330906178802j)
eps=0.0001: (1.0010954141616821+0j)
eps=1e-05: (2097339.75+10.764257431030273j)
eps=1e-06: (2097152.5-0.16087542474269867j)
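
For reference, a sketch of a loop that could produce a sweep like the one above, assuming the single-point reproducer with batch_size=32 and gpu_method=2 (the eps=1e-6 value matches the SM result reported earlier):

import torch
import cufinufft

# Hypothetical eps sweep over the single-point reproducer (SM path assumed).
points = torch.tensor([[0.0], [0.0], [0.0]], dtype=torch.float32, device="cuda")
values = torch.ones((32, 1), dtype=torch.complex64, device="cuda")
for eps in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    spectrum = cufinufft.nufft3d1(*points, values, (64, 64, 64), eps=eps, gpu_method=2)
    print(f"eps={eps}: {spectrum[0, 0, 0, 0].item()}")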

DiamonDinoia (Collaborator) commented Jan 21, 2025

@pavel-shmakov, for the local compilation, which version of CUDA are you using?
We create the release binary using this script: https://github.com/flatironinstitute/finufft/blob/master/tools/cufinufft/distribution_helper.sh

If we move to email, we could share binary wheels built with different flags to narrow down the issue.
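
Not part of the exchange, but a hedged sketch of how the requested details could be collected on the affected machine (assumes nvcc is on PATH and that torch and cufinufft are installed):

import subprocess
import torch
from importlib.metadata import version

# CUDA toolkit that would be used for a local compilation.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
# GPU model and the CUDA version the installed torch wheel was built against.
print(torch.cuda.get_device_name(0), "| torch CUDA:", torch.version.cuda)
# Installed cufinufft release.
print("cufinufft:", version("cufinufft"))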
