
Alphafold runs will not find the GPU #1029

Closed
tuttlelm opened this issue Oct 14, 2024 · 3 comments


tuttlelm commented Oct 14, 2024

Sometime in the past several months, my AlphaFold install stopped being able to find and use the GPU (NVIDIA RTX A4500, driver version 535.183.06, CUDA version 12.2).

I have been attempting a fresh install, and still no luck.

I am able to have docker find the GPU using the following command:

docker run --rm --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               On  | 00000000:01:00.0  On |                  Off |
| 30%   34C    P8              23W / 200W |    818MiB / 20470MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

During the install I had to use the NVIDIA Docker cgroup issue fix referenced in the README (NVIDIA/nvidia-docker#1447 (comment)) and modify the Dockerfile according to another issue (#945).

When I submit a run I get the errors below. It still runs, but only on the CPU, so it takes forever.


I1014 09:03:18.529073 128453379199424 run_docker.py:258] /bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
I1014 09:03:23.667894 128453379199424 run_docker.py:258] I1014 16:03:23.667354 129322417205888 xla_bridge.py:863] Unable to initialize backend 'cuda': jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: Unknown CUDA error 303; cuGetErrorName failed. This probably means that JAX was unable to load the CUDA libraries.
I1014 09:03:23.668071 128453379199424 run_docker.py:258] I1014 16:03:23.667572 129322417205888 xla_bridge.py:863] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA

Any recommendations are welcome
Thanks!

@tiburonpiwi

Hi,
Same error for me with CUDA 12.6, driver 560.35.03, and four NVIDIA L40S GPUs. Both nvidia-smi and docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi produce the expected output.
Any help is welcome.
Thanks

@jung-geun

There might be an issue with the AlphaFold execution script.

First, verify that the container can access the GPU properly:

docker run --rm -it --gpus all --entrypoint /bin/bash alphafold

Inside the container, check whether nvidia-smi works and the JAX library can reach the GPU:

nvidia-smi

python -c "import jax; nmp = jax.numpy.ones((20000, 20000)); print('Device:', nmp.device()); result = jax.numpy.dot(nmp, nmp); print('Done')"
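
A quicker sanity check is to list the devices JAX can see; on a working setup this should include a GPU/CUDA device rather than only the CPU:

python -c "import jax; print(jax.devices())"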

If these work normally, the issue might be with the docker-py library. You can verify this by running the following test:

import unittest
import docker

class TestDocker(unittest.TestCase):
    def test_docker(self):
        # Request NVIDIA GPUs the same way run_docker.py does
        # (driver + capabilities, but no explicit count).
        client = docker.from_env()
        device_requests = [
            docker.types.DeviceRequest(
                driver="nvidia",
                capabilities=[["gpu"]],
            )
        ]

        # Run nvidia-smi in a CUDA runtime image through docker-py;
        # if the GPU request is passed through correctly, the logs
        # contain the usual nvidia-smi table.
        logs = client.containers.run(
            "nvidia/cuda:12.2.2-runtime-ubuntu20.04",
            "nvidia-smi",
            runtime="nvidia",
            device_requests=device_requests,
            remove=True,
        )

        print(logs.decode("utf-8"))

if __name__ == "__main__":
    unittest.main()
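
To run this outside the container, save the script to a file (the name test_docker.py below is just an example), make sure the docker Python package is installed on the host, and execute it directly:

pip install docker
python test_docker.py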

If this test runs successfully and shows nvidia-smi output, look for other potential issues.

If the test fails, the issue is likely with docker-py's GPU device recognition. You can fix this by modifying the AlphaFold script:

# alphafold/docker/run_docker.py

# Original code - line 232
client = docker.from_env()
device_requests = [
    docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']])
] if FLAGS.use_gpu else None

# Modified code
client = docker.from_env()
device_requests = (
    # count=-1 explicitly requests all available GPUs
    [docker.types.DeviceRequest(driver="nvidia", capabilities=[["gpu"]], count=-1)]
    if FLAGS.use_gpu
    else None
)

I encountered this issue when using docker-py==5.0.0 with the latest system Docker version. The exact cause is unclear, but it appears to be related to GPU device recognition between docker-py and the Docker daemon.

The issue can be resolved by adding the count=-1 parameter to the DeviceRequest, which explicitly tells docker-py to use all available GPUs. This seems to be a compatibility issue between specific versions of docker-py and the Docker daemon's GPU handling.
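
As a quick standalone check that count=-1 is what makes the difference on your system, the docker-py test from above can be repeated with the extra parameter (a minimal sketch, reusing the same CUDA image):

import docker

client = docker.from_env()
logs = client.containers.run(
    "nvidia/cuda:12.2.2-runtime-ubuntu20.04",
    "nvidia-smi",
    runtime="nvidia",
    device_requests=[
        # count=-1 requests all available GPUs (docker-py's equivalent of --gpus all)
        docker.types.DeviceRequest(driver="nvidia", count=-1, capabilities=[["gpu"]])
    ],
    remove=True,
)
print(logs.decode("utf-8"))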

If you're experiencing similar issues, try the modification shown in the code above.

I hope this solution works for your case.

If not, please let me know and we can explore other potential solutions. This issue appears to be version-specific between docker-py and Docker daemon, so there might be alternative approaches worth investigating.


tuttlelm commented Oct 30, 2024

Thanks so much for the response. Modifying the run_docker.py script with count=-1 seems to have done the trick. I no longer get the initial Unknown CUDA error 303 and the GPU is being used for the runs.

I ran the recommended tests inside the container and those passed. I was not sure how to create the docker-py library test script within the container, so I could not run it there. Running outside the container just gave a bunch of errors.

Update: I do still have an issue with the minimization portion.
Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

But as others have noted, the GPU isn't really necessary for the relaxation steps, so using --enable_gpu_relax=false has everything running nicely again. Thank goodness!
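
For reference, the flag just goes on the normal run_docker.py command line; something like the following (the paths and date here are only placeholders):

python3 docker/run_docker.py \
  --fasta_paths=/path/to/target.fasta \
  --max_template_date=2022-01-01 \
  --data_dir=/path/to/alphafold_databases \
  --enable_gpu_relax=false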
