chore(dev-deps): add pixi setup for cuda_pathfinder
#1090
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
cuda_pathfinder/pyproject.toml
Outdated
```toml
    "nvidia-cufftmp-cu12; sys_platform != 'win32'",
    "nvidia-libmathdx-cu12",
]
cu12-ext = [
```
Is splitting out `-ext` important technically?

In the long run, it's possible that some of the libs will add support for win32. What would we do then? Move those to the non-ext group?

I'd have a preference for keeping `nvidia` in the names, e.g. `nvidia_cu12` or `nvidia-cu12`. In the back of my mind: 1. It's more explicit/readable. 2. Maybe/maybe-not we'll have non-nvidia wheels in the future.
> Is splitting out `-ext` important technically?

Yes, because pixi cannot solve the environment's dependencies without it. Pixi doesn't know anything about markers, for example.

> In the long run, it's possible that some of the libs will add support for win32. What would we do then? Move those to the non-ext group?

Yep.

> I'd have a preference for keeping `nvidia` in the names, e.g. `nvidia_cu12` or `nvidia-cu12`. In the back of my mind: 1. It's more explicit/readable. 2. Maybe/maybe-not we'll have non-nvidia wheels in the future.
I find the `nvidia` unnecessarily verbose. I don't see what help this additional information provides that makes development easier or anything clearer. On the other hand, more typing is concretely more annoying :)
Can we wait for those wheels to show up before solving a problem we don't yet have?
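The marker-driven split discussed above can be sketched concretely. This is a hedged illustration, not the actual `pyproject.toml` from this PR; the exact group contents are assumptions:

```toml
# Hypothetical sketch: packages with environment markers (which pixi's
# solver cannot evaluate) are split out of the base group, so that a
# marker-free tool can still install the base set on every platform.
[dependency-groups]
cu12 = [
    "nvidia-cublas-cu12",
    "nvidia-nvjitlink-cu12",
]
cu12-ext = [
    "nvidia-cufftmp-cu12; sys_platform != 'win32'",
    "nvidia-libmathdx-cu12",
]
```

A tool that understands markers (e.g. pip) can install both groups, while pixi only ever needs to solve the marker-free base group.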
Naming is very important to me. I realize I have a reputation for erring on the more explicit side, but I'm holding on to the belief that it's actually helpful for people not immersed in the context. Imagine someone not on our team looking at `cu12` here; there isn't much they can mentally attach to. Such super-stingy naming forces people to look around more than they would have to otherwise. Similarly, it's much more likely to send AIs down a rabbit hole. It's also much more difficult to pinpoint references to super-short names; `cu12` or similar will have many unrelated matches.

How about (not sure about underscore vs minus, either works for me): `nvidia_wheels_cu12`

Also, `ext` doesn't tell anything about the reason for the splitting. At the moment I understand only: we need this for pixi. What could be terse hints that make sense here? `pixi` is probably too specific? Does `win32` make sense? Other things that come to mind: restricted, partially, platform? But I don't know enough about the technicalities enforcing the splitting to know what makes sense; just "ext" leaves people with "split out for unspecified reasons", which I'd strongly prefer to avoid.

I'm hoping we can find a healthy compromise that gives readers/AIs a reasonable chance to get on the right track at a glance.
I don't have a preference for underscores versus dashes.
> Such super stingy naming forces people to look around more than they have to otherwise.

Perhaps, but let's try to be more concrete. Here's my thinking.

My goal with this PR, and the pixi work in general, is to get devs the ability to run the test suite as soon as humanly possible.

To that end, naming a collection of dependencies `nvidia_wheels_cu12` provides no information that furthers that goal that I can see. There are multiple pieces of information in that name that are irrelevant for running the test suite and getting started contributing to the codebase IMO. Since these environments are designed to be run with `pixi run -e <environment>`, giving them long names makes things more annoying to use. I really, really want this stuff to be easy to use regularly. Long names are directly opposed to that goal.

Here are some thoughts on the proposed naming components, and some questions.

- `nvidia` refers to something, I'm not sure what exactly, perhaps the source of the dependencies? How does that help clarify anything except what company built the wheel? Is that information useful for the purposes of development?
- `wheels` refers to a packaging implementation detail that seems like it actively harms initial understanding by providing an unnecessary level of detail. At the level of "run the test suite", I don't really care whether a thing is a wheel or not. In fact, this is even worse in that you can easily mislead developers if some part of that list is built from source for some reason. Should we then name things `nvidia_maybe_wheels_if_not_built_from_source`? I really hope not.

Regarding `ext`, we can call that `extended`, which is what I am trying to convey there. Definitely open to different naming, but I am trying to keep things short for a reason.

I'm being intentionally vague with that name, to allow for eventual elimination of that set of environments. A comment would suffice here IMO.

In general I find the practice of "name everything very specifically and forget about the length of names" actively harms development by making the amount of stuff I have to sift through that much greater. I still have to ask someone what anything means if I'm new to the codebase.

I also don't care about optimizing for AI digestibility. As of now, it's primarily humans that still have to read all this code, and there's plenty to do there without having to consider AI. Again, this feels like solving a problem we (as developers of this library) do not have.
How should we proceed?
> My goal with this PR, and the pixi work in general, is to get devs the ability to run the test suite as soon as humanly possible.

That's a good goal.

> How should we proceed?

Could you please fix up the `.github/workflows/test-wheel-*.yml` files? I'm actually not sure what the new commands will look like. Seeing that would help me understand better.
> Regarding ext, ...
> A comment would suffice here IMO.

Yes, I agree. `ext` is fine. I see from the changes in the `.yml` files and the `toolshed/` directory that it doesn't appear there.

I think we still need to update these: …
I forgot to add: Please don't worry about testing the …
```diff
 set -euo pipefail
 pushd cuda_pathfinder
-pip install --only-binary=:all: -v . --group "test_nvidia_wheels_cu${TEST_CUDA_MAJOR}" --group test_nvidia_wheels_host
+pip install --only-binary=:all: -v . --group "cu${TEST_CUDA_MAJOR}" --group host
```
Now that I see that `-ext` doesn't appear here, could this work?

```shell
pip install --only-binary=:all: -v . --group "test-cu${TEST_CUDA_MAJOR}"
```

And then in `pyproject.toml` we pull in `cu12-host` just like we pull in `cu12-ext`?

That would be strictly better IMO than the status quo with two `--group` arguments.

But TBH I'm scratching my head a bit ... how does the `--group` feature work?

What I mean:

```toml
cu12 = [...]
cu12-ext = [{ include-group = "cu12" }, ...]
```

What surprises me: you have `cu12` here, not `cu12-ext`.

But I'm getting the sense the `include-group` feature allows us to have just one `--group` here while still having a nice structure in the `pyproject.toml` file for clarity?
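For reference, PEP 735 dependency groups do support exactly this layering: a group may contain `{ include-group = "..." }` entries inside its list, and a tool expanding the requested group pulls in the included group's requirements transitively. A hedged sketch (the group contents are illustrative, not the actual file):

```toml
# Hypothetical dependency-groups table: "cu12-ext" includes "cu12", so
# a single `pip install --group cu12-ext` installs both sets.
[dependency-groups]
cu12 = ["nvidia-cublas-cu12"]
cu12-ext = [
    { include-group = "cu12" },
    "nvidia-cufftmp-cu12; sys_platform != 'win32'",
]
```

With this shape, one `--group` argument suffices on the pip side while the layered structure in `pyproject.toml` remains visible; a tool that cannot evaluate markers can still target the plain `cu12` group.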
Oops, I will use the `*-ext` groups wherever we are `pip install`-ing things.
…u12` or `--group test-cu13` to install all test dependencies for the given CTK major version.
I just added commit 0e132de to see what you think. It works for me. Does it also work for pixi?
It's really easy to remember now, e.g. …
It doesn't work for pixi, because the …

The way I originally had it was the best compromise I could come up with. Ideally pixi would support markers, so it could figure out which platforms to solve for given the platform I'm running on, but that's not the case today.
Hm, I need to learn how pixi works, and how I can test locally myself; also so I don't break pixi in the future when making changes in pathfinder. At the moment I don't understand why …

It'd be great if we could meet and you could show me how it works.
Local testing at commit 9ac1364 (Make …):

Testing: …

NOTE: The linux-aarch64-specific aspect is that …
I'm starting to test under Windows (still at commit 9ac1364). I installed …

Success: …

First trial: …

This is the dreaded …
That usually means there's some kind of specification that is too broad and is including …

Taking a look.

Oh, I see. You're on Windows; never mind what I said then!
Without changing anything, this works: …

See below. Hypothesis(!): …
…o commit 9ac1364 (with this `pixi run -e cu12-win-64 --verbose -- pytest -v tests` succeeds).
Commit fa384e3 (Avoid use of …). I.e. …
/ok to test
…13-linux, to resolve a regression observed under linux-64
Unfortunately, commit fa384e3 causes a regression under …

Fortunately it's an easy fix: commit 2715310 (avoid use of …).
… libs and headers are installed from wheels into site-packages.
/ok to test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud Ready to merge from my viewpoint!
```toml
"nvidia-cublas-cu12",
"nvidia-cuda-cccl-cu12",
"nvidia-cuda-nvcc-cu12",
"nvidia-cuda-nvrtc-cu12",
"nvidia-cuda-runtime-cu12",
"nvidia-cufft-cu12",
"nvidia-curand-cu12",
"nvidia-cusolver-cu12",
"nvidia-cusparse-cu12",
"nvidia-npp-cu12",
"nvidia-nvfatbin-cu12",
"nvidia-nvjitlink-cu12",
"nvidia-nvjpeg-cu12",
"nvidia-cudss-cu12",
"nvidia-cufftmp-cu12; sys_platform != 'win32'",
"nvidia-libmathdx-cu12",
```
This has the potential to now end up with different minor versions of these libraries, whereas using `cuda-toolkit[...]=12.*` would guarantee that the minor version of `cuda-toolkit` that was resolved would control the minor versions of all of the libraries.
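As a hedged sketch of the alternative being referenced (feature and environment names here are illustrative assumptions, not the actual manifest): pinning the conda `cuda-toolkit` meta-package in a pixi feature lets one resolved 12.x version drive every component library.

```toml
# Hypothetical pixi manifest fragment: the cuda-toolkit meta-package
# pins all component libraries to the same resolved 12.x release.
[feature.cu12.dependencies]
cuda-toolkit = "12.*"

[environments]
cu12 = ["cu12"]
```

The trade-off discussed below is that this meta-package route reportedly did not solve reliably with pixi in practice.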
Ooof ... this is a tough one.

pixi just doesn't work with the `cuda-toolkit` meta-package.

Telling exactly what works and what doesn't is tricky, to a significant degree because I have observed many times that the pixi behavior isn't fully deterministic, even after `pixi clean && rm -f pixi.lock`. I'm sure `cu13-linux-64` never passed testing before commit 9ac1364, even though I tried repeatedly from scratch (some subtests passed, but not all, because some libraries were missing). The `cu12-*` environments worked on some platforms with the `cuda-toolkit` meta-package, but not all. I didn't try everything repeatedly from scratch; there is a small possibility that all `cu12-*` environments work with the meta-package when trying repeatedly, but it surely isn't stable.

Even with this PR as-is (i.e. no use of the meta-package at all), occasionally I have to combine `pixi clean` with running commands repeatedly until it works again.
I forgot to add:

> This has the potential to now end up with different minor versions of these libraries

I don't think this will ever matter for the purpose of `test_load_nvidia_dynamic_libs.py`.

I'm not saying it's a good situation, but I think there is very little to no tangible reward for making it perfect.
> I don't think this will ever matter for the purpose of test_load_nvidia_dynamic_libs.py.

If one library has a binary dependency on another library, I don't think compatibility guarantees are made outside of driver version. I.e. if `lib2` depends on `lib1`, the only guarantee we currently have is that it works if both `lib2` and `lib1` are the same patch version. So from the perspective of `test_load_nvidia_dynamic_libs.py` we could end up in an environment with different patch versions, and someone blows up at load time because we load an incompatible dependency (since the dependencies aren't captured in Python package dependencies, to support using a system CTK).

> I'm not saying it's a good situation, but I think there is very little to no tangible reward for making it perfect.

I agree it's not the end of the world, but maybe the answer is, instead of trying to make `pyproject.toml` work for pixi at this time, we should ship a separate `pixi.toml` in order to separate things out? This way we don't need to worry about compatibility issues between pixi and other tools handling `pyproject.toml` standard fields for now.
i.e. we could use the pixi-specific syntax for handling PyPI dependencies (https://pixi.sh/latest/reference/pixi_manifest/#pypi-dependencies), which allows specifying extras in a pixi-specific way.
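A rough sketch of what that could look like; the package names, path, and extras here are assumptions for illustration, not the actual manifest from this PR:

```toml
# Hypothetical pixi.toml fragment: pixi's own pypi-dependencies table
# can express extras directly, and per-target tables take the place of
# environment markers for platform-restricted packages.
[pypi-dependencies]
cuda-pathfinder = { path = ".", extras = ["cu12"] }

[target.linux-64.pypi-dependencies]
nvidia-cufftmp-cu12 = "*"
```

Keeping this in a standalone `pixi.toml` would sidestep any interaction between pixi's solver and the standard `pyproject.toml` fields consumed by pip and friends.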
After seeing Keith's last comment here, I decided it'll be useful to break out this small achievement developed here: PR #1125
…ONDA_PREFIX/bin` does not work under Windows).
…without adding anything related to pixi).
…t adding anything related to pixi). (#1125)
@cpcloud I think this is back to where we were, and I tried for 20 minutes or so, giving up on …
Could you explain how it's failing and share any form of stack traces or other information related to the failures?

In general, I'm not sure about the general approach here of using a mix of conda packages and pip packages in a single environment. We have way too much native code underneath our stuff; this is just asking for incompatibilities.
I wrote:

> …

Looking at it again with a fresh eye: Face-palm, sorry, I was on the wrong colossus workstation (I pulled up the wrong ssh command).

Keith wrote:

> …

The behavior was mostly that no test was executed. Which actually makes sense, because I was on the wrong workstation. Concretely, I ran …

I believe I also saw occasional flaky behavior, which is a little distracting, but just running again usually works. I'm just now trying to reproduce the flaky behavior, but it's going very slowly, because each invocation of …
I didn't succeed in finding a systematic reproducer for getting pixi into a bad state, in particular this error: …

A few attempts to start from a clearly defined state to get that error didn't work out. Let's see how it works for others.
Add a pixi setup to `cuda_pathfinder`.