csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory #8
So it is failing for you on the last package to build. I have been seeing the same error with the DeepSpeed package on Fedora when building on a virtual machine that does not have access to a real GPU, but I have not had time to investigate further. On my Mageia 9 machines with a GPU connected I do not see the error. It could be that the DeepSpeed build falls back to a CPU-only build if no GPU is detected. A couple of questions.
After that, rocminfo should print out information about the detected environment, including the GPU. For example, on my laptop it prints: Name: gfx1035
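To pick out just the GPU target names from that output, a small filter can help. This is only a sketch: the function name list_gfx_targets is made up for illustration, and it assumes the targets show up in the rocminfo text as gfx followed by hex digits (as in gfx1035 or gfx1100).

```shell
#!/usr/bin/env bash
# Print the unique gfx target names found in rocminfo-style text on stdin.
# list_gfx_targets is a hypothetical helper, not part of ROCm or rocm_sdk_builder.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Only meaningful where rocminfo is installed and on PATH.
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | list_gfx_targets || true
fi
```

If nothing is printed on a machine that has rocminfo installed, the GPU was probably not detected at all.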
Then for example:
Once the notebook opens in the browser, you can use the run command at the top to move from one cell to the next, and it will always print out what it detects. By default the output currently shows one of my runs with an RX 6800. |
Sorry for the late response, I only now got around to using this PC again. The selected GPU should be a 7900 XTX, as glxinfo says:
And thus I selected gfx1100:
Running those commands:
Now I added myself to that group and then get the following errors:
Building the hello world failed a test:
|
I was able to overcome this issue by executing:
I'm not sure though, whether I would recommend this for a production system, as it may have bad security implications. Besides that I have the same problem as @eitch. |
For me /dev/kfd is by default
crw-rw-rw- 1 root render 239, 0 May 31 18:51 /dev/kfd
Now that I think about it, I remember fighting with this many releases ago, as the group name used for opening the driver is/was hardcoded in the ROCm code. Some Linux distros used the same group name and others did not. What does ls -la /dev/kfd show as the group owner for you? |
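The group question can also be checked in one go. The following is a rough sketch, assuming GNU coreutils stat (the -c option) as found on typical Linux distributions; check_device_group is a hypothetical helper name, not something shipped with ROCm or rocm_sdk_builder.

```shell
#!/usr/bin/env bash
# Report the group owner of a device node and whether the current user is in it.
# Assumes GNU coreutils `stat -c`; check_device_group is a hypothetical helper.
check_device_group() {
    local dev="${1:-/dev/kfd}"
    if [ ! -e "$dev" ]; then
        echo "$dev: not found (driver not loaded or no GPU exposed?)"
        return 1
    fi
    local grp
    grp="$(stat -c '%G' "$dev")"
    echo "$dev: group owner is '$grp', mode $(stat -c '%A' "$dev")"
    # id -nG lists the group names of the current session, one check per name
    if id -nG | tr ' ' '\n' | grep -qxF "$grp"; then
        echo "current user is a member of '$grp'"
    else
        echo "current user is NOT a member of '$grp'"
        return 2
    fi
}

check_device_group /dev/kfd || true
```

Note that a newly added group membership only shows up in id -nG after logging out and back in (or starting a new login session), so a "NOT a member" result right after changing groups may just mean the session is stale.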
Btw, I think oneapi/ccl.hpp comes from an Intel CPU package (oneCCL). I have not yet had time to check why DeepSpeed tries to use it on some systems. |
Are the example codes in And for the 7900 XTX it would be nice to know whether this GPU benchmark works for you: https://github.com/lamikr/pytorch-gpu-benchmark.git There may be newer versions of it upstream. At the time I tested it, I needed to make quite a lot of changes to the benchmark calling code to get it working with newer Python and NumPy versions. The original benchmark launch script also only supported nvidia-smi/NVIDIA cards, so I added rocm-smi support. |
I can trigger this when building on a virtual machine that does not have the AMD GPU driver exposed via /dev/kfd.
And on my regular computer, where DeepSpeed builds OK, I have these logs:
So in builds that work, the deepspeed.ops.comm.deepspeed_ccl_comm_op extension does not get triggered. Now that you have the /dev/kfd access rights working, could you try a clean new build by downloading the DeepSpeed source code again and then rebuilding? (Just to check whether the wrong access rights were the problem.)
|
Busy building now, but here are the permissions for kfd:
|
That benchmark doesn't work:
|
I could build now, but at the end there was a problem with deepspeed:
And then when I try to execute, I get this:
|
Hi. You need to put the word "source" before env_rocm.sh when calling it, like this:
This way the changes to environment variables such as PATH will persist after the script finishes. Sorry, I was not clear enough about which GPU benchmark script to run. For some unknown reason, I have another script there to be used with AMD GPUs:
which has the AMD support. Can you try that one? |
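The difference between executing and sourcing an environment script can be seen with a small self-contained demo. The file demo_env.sh and the variable DEMO_ROCM_HOME are made up for illustration and stand in for the real env_rocm.sh, whose contents are not shown here. The dot command used below is the POSIX spelling of bash's source.

```shell
#!/usr/bin/env bash
# Demonstrate why an environment script must be sourced rather than executed.
# demo_env.sh and DEMO_ROCM_HOME are hypothetical stand-ins, not the real env_rocm.sh.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/demo_env.sh" <<'EOF'
export DEMO_ROCM_HOME=/opt/demo_rocm
EOF

unset DEMO_ROCM_HOME
bash "$tmpdir/demo_env.sh"      # runs in a child process; its exports are lost on exit
after_exec="${DEMO_ROCM_HOME:-unset}"

. "$tmpdir/demo_env.sh"         # same as `source`: runs in the current shell, export persists
after_source="${DEMO_ROCM_HOME:-unset}"

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
rm -rf "$tmpdir"
```

Executing the script leaves the variable unset in the calling shell, while sourcing it makes the value visible to every command run afterwards in that shell.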
Hmm, so root owns the /dev/kfd device node in your environment and there is no separate group owner? Btw, do you think the /dev/kfd permission change plus re-downloading the source code solved the DeepSpeed build problem for you? |
Right, I somehow forgot the source command, now it works:
Now that I call the right benchmark script, some tests work, while others don't:
More benchmarks were still running while writing this comment. They do not seem to throw/log any more errors, just some warnings. |
Yes, that benchmark is very extensive and runs for a while. It has been a long time since I ran it from start to end, but I remember that the RX 6800 showed pretty good numbers, I think somewhere between the NVIDIA 2080 and 3080. I have not checked whether the upstream version of the test has been modernized for newer Python and PyTorch versions. In that case probably only the "test.sh" script would need to be changed so that it can distinguish between NVIDIA and AMD GPUs. Something like
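A minimal sketch of such a vendor check follows. It is not the actual test.sh from the benchmark repository; it simply assumes that finding nvidia-smi or rocm-smi on PATH is a good enough signal, and detect_gpu_vendor is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Guess the GPU vendor from which management tool is available on PATH.
# Assumption: the presence of nvidia-smi or rocm-smi is a sufficient signal.
# detect_gpu_vendor is a hypothetical helper, not taken from the benchmark repo.
detect_gpu_vendor() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo nvidia
    elif command -v rocm-smi >/dev/null 2>&1; then
        echo amd
    else
        echo unknown
    fi
}

vendor="$(detect_gpu_vendor)"
case "$vendor" in
    nvidia) echo "would collect GPU info with nvidia-smi" ;;
    amd)    echo "would collect GPU info with rocm-smi" ;;
    *)      echo "no supported GPU tool found on PATH" ;;
esac
```

On a machine with both tools installed this sketch prefers nvidia-smi; the order of the checks would need to be adapted if that is not wanted.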
|
Now the test ran through, but i get the error about |
I also get this issue with https://github.com/lamikr/rocm_sdk_builder/tree/releases/rocm_sdk_builder_611 Should I try the master branch instead? |
It should now be fixed in both the master and releases/rocm_sdk_builder_611 branches.
I have now also added a check at the end of the babs.sh script execution that verifies the permissions of /dev/kfd are OK. You can also test this by running
If that works, the permissions should be OK and DeepSpeed should build. |
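Independent of the check built into babs.sh (whose exact form is not shown here), a plain read/write test on the device node gives a quick answer. This is only a sketch; can_access_device is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return 0 if the current user may read and write the given device node.
# can_access_device is a hypothetical helper, not the check babs.sh performs.
can_access_device() {
    local dev="${1:-/dev/kfd}"
    [ -r "$dev" ] && [ -w "$dev" ]
}

if can_access_device /dev/kfd; then
    echo "/dev/kfd is readable and writable for the current user"
else
    echo "/dev/kfd is missing or not accessible; check its group owner and your group membership"
fi
```

A failure here on a machine with an AMD GPU usually points at the group ownership of /dev/kfd rather than at DeepSpeed itself.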
At the build step I get:
Perhaps it's better to delete everything and rebuild from scratch? Even though building costs many hours .. |
Oh, sorry. You can fetch that new repo with command:
And you can fetch git changes to the old repositories with this command: './babs.sh -f' This will update at least rocm_smi_lib, which now has a fix for git tags so that the library naming gets corrected. I would also force a rebuild of a couple of projects so that only they get rebuilt. So the new list of commands is a little bit longer, unless you want to rebuild everything to verify that all works :-)
|
@eitch I updated the GPU benchmark at https://github.com/lamikr/pytorch-gpu-benchmark Let's close this thread and continue the discussion about benchmarks there. |
While running
./babs.sh -b
I received this error:
I'm running on Ubuntu:
And I'm using an RX 7900 XTX