Investigate Rust-Based Python-Accessed Stores #64
You could access it via Python calls, but you wouldn't be able to access the underlying Rust store directly. pyo3 doesn't support that sort of dynamic linking (yet?) because it's hard to do in a stable way across pyo3, Python, and Rust versions.
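To make that limitation concrete, here is a minimal sketch (not actual zarrs-python or pyo3-object_store code; `RustStore` and `root` are hypothetical names) of what extracting the underlying Rust struct looks like in pyo3, and why it only works when both sides link the exact same compiled crate:

```rust
use pyo3::prelude::*;

// Hypothetical stand-in for a Rust-backed store exposed to Python.
#[pyclass]
struct RustStore {
    root: String,
}

// Attempt to recover the underlying Rust struct from a Python object.
// This succeeds only if `obj` wraps *this* crate's `RustStore`; a class of
// the same name compiled into a different extension module is a different
// type as far as pyo3 is concerned, and the extraction fails with a TypeError.
fn try_get_rust_store(obj: &Bound<'_, PyAny>) -> PyResult<String> {
    let store: PyRef<'_, RustStore> = obj.extract()?;
    Ok(store.root.clone())
}
```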
Are you aware of the performance implications of this? Re: #44 (comment), this then returns a copy? Am I understanding that right, when you go from Python into Rust?
Well, the proposal in #44 (comment) would have you compile your own stores on the Rust side (just reusing the Rust code from pyo3-object_store), so you wouldn't have any overhead. But in principle, it's also possible to call the Python functions exported by obstore.
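As a sketch of that first path, under the assumption that the Python side passes plain configuration (bucket, region, options) rather than a live store object, the Rust side can rebuild its own store with the object_store crate's builders so every subsequent chunk read stays entirely in Rust:

```rust
use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;
use std::sync::Arc;

// Rebuild the store natively from config received over the Python boundary.
fn build_store_from_config(
    bucket: &str,
    region: &str,
) -> object_store::Result<Arc<dyn ObjectStore>> {
    let store = AmazonS3Builder::new()
        .with_bucket_name(bucket)
        .with_region(region)
        .build()?;
    Ok(Arc::new(store))
}
```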
This PR in async-tiff contains an example of reusing pyo3-object_store.
Very cool @kylebarron !!!
I've now published pyo3-object_store.

It looks like you'd want to add […]. Is there a reason why you require […]? And it looks super messy to take a […]. Instead of storing a […]
Also, at least for […]
I think it is related to storing the configs in the […]
The […] I think this is all somewhat because we don't get any info about the store when an array is created; it just starts getting thrown at us with each chunk. It would be fantastic to get an upstream change into zarr-python.
Indeed, that does sound like a really helpful update. It would be possible to have a similar config caching approach for […]
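A minimal sketch of what such config-keyed caching could look like (all names here are illustrative, not zarrs API): reconstruct the Rust store once per unique config and reuse it for every later chunk request.

```rust
use object_store::ObjectStore;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Illustrative config key; a real one would carry scheme, bucket, options.
#[derive(Clone, PartialEq, Eq, Hash)]
struct StoreConfig {
    url: String,
}

#[derive(Default)]
struct StoreCache {
    stores: Mutex<HashMap<StoreConfig, Arc<dyn ObjectStore>>>,
}

impl StoreCache {
    // Return the cached store for `config`, building it on first use.
    fn get_or_create(
        &self,
        config: StoreConfig,
        create: impl FnOnce(&StoreConfig) -> Arc<dyn ObjectStore>,
    ) -> Arc<dyn ObjectStore> {
        let mut stores = self.stores.lock().unwrap();
        stores
            .entry(config.clone())
            .or_insert_with(|| create(&config))
            .clone()
    }
}
```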
@kylebarron's awesome […] Why do I ask? To zoom out for a moment: I'm really excited about being able to train machine learning models for forecasting renewable power generation directly from public Zarr datasets of numerical weather predictions, like those on dynamical.org.

Modern GPUs are shockingly fast, so the bottleneck when training these ML models is often the IO. The dream would be for these datasets to be published as sharded Zarrs with pretty tiny chunks. And then, to train our ML models, we'd spin up a virtual machine with a fast network interface card (100 or 200 gigabits per second) near to the data. Crucially, we'd need a Zarr library that could efficiently load (at least) thousands of chunks per second.

But most ML researchers use Python. So I'm looking into ways to achieve this level of performance for Python users. TBH, I'm nervous that we'll bump into Python's overhead when trying to load thousands of chunks per second. So, I guess I have several questions: […]
I can dedicate some time to helping build things if need be.
I'm pretty sure we can do this with zarr-python after performance optimization of our IO layer. At a minimum, it would be very valuable to know what we would need to change in order to achieve this throughput.
@JackKelly, we always welcome contributions. The main blocker here, IMO, would be the fact that our code is very […]. As for what @d-v-b said, I tend to agree. I don't think […]. Of course contributions are welcome! Happy to provide any assistance!
So if you wanted to contribute, @JackKelly, I think the first step would be to get an […]. But I am excited to learn more!
Yes, for the […]
@LDeakin In this […] my thought was that these limitations could be addressed here, given that the userbase for Python includes more remote users (unlike you or me), although I agree that the current setup should be tried first.
Thanks so much for all your replies! Very interesting stuff.

TL;DR: If I've understood correctly, it sounds like I should start by benchmarking […]. Assuming these benchmarks demonstrate that […]

Where should the benchmarks live?

Does anyone have any strong opinions about where I should implement these benchmarks? The three options I'm aware of are: […]
Python vs Rust

My hunch is still that, at some point, […]
Thanks for the detailed reply. Your assessment looks good to me. Where you add the benchmark is up to you, but if you want the pure-Rust benchmark, […]. Re: […]
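For reference, a pure-Rust throughput measurement of the kind discussed here could be as simple as the following sketch (using object_store's `LocalFileSystem` as a stand-in for a cloud store, and assuming the chunk files already exist; paths and counts are made up):

```rust
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};
use std::sync::Arc;
use std::time::Instant;

#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store: Arc<dyn ObjectStore> = Arc::new(LocalFileSystem::new());
    let n_chunks = 1000;
    let start = Instant::now();

    // Issue the chunk reads concurrently, as an async Zarr reader would.
    let handles: Vec<_> = (0..n_chunks)
        .map(|i| {
            let store = store.clone();
            tokio::spawn(async move {
                let path = Path::from(format!("data/chunk_{i}"));
                store.get(&path).await?.bytes().await
            })
        })
        .collect();
    for handle in handles {
        handle.await.expect("task panicked")?;
    }

    let elapsed = start.elapsed();
    println!(
        "{n_chunks} chunks in {elapsed:?} ({:.0} chunks/s)",
        n_chunks as f64 / elapsed.as_secs_f64()
    );
    Ok(())
}
```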
This is getting a bit derailed (but is very interesting!). See below for some newly added async vs sync benchmarks.
@ilan-gold the problem is within […]
@JackKelly If you are putting time into benchmarks, it would be great to see: […]
I think that is a major enough change that you may be better off writing the benchmarks from scratch.
Sounds good to me! I've started making some notes for my benchmarking plans here: https://github.com/JackKelly/zarr_cloud_benchmarks. Please feel free to comment on that repo (so this thread can return to its original topic 🙂). Sorry for derailing this thread a bit!
I'm quite curious how the obstore-based zarr-python will compare with a native zarrs implementation. In any case, […]
Thanks! I'll benchmark […]
@JackKelly another Zarr benchmark repo spotted: https://github.com/HEFTIEProject/zarr-benchmarks
Thank you! Just a quick note that, before benchmarking Zarr readers, I'm going to experiment with using Parquet to store weather forecasts... So I might not get round to benchmarking Zarr readers for a little while, I'm sorry!
Currently, a store can be written in Rust, wrapped in pyo3, and then passed to our library, but we don't use that store directly. We should investigate whether it's possible to somehow "work around" the Python wrapper and grab the underlying Rust store without compromising performance, instead of re-implementing/re-instantiating it in Rust within our package.
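For illustration, the status quo looks something like this sketch (the `get` method name and call shape are assumptions, not our actual interface): every chunk read crosses the Python boundary and copies the returned bytes, even when the store is Rust underneath.

```rust
use pyo3::prelude::*;
use pyo3::types::PyBytes;

// Read a chunk through the opaque Python store object: a Python-level method
// call under the GIL, followed by a copy of the payload into a Rust Vec.
fn read_chunk_via_python(store: &Bound<'_, PyAny>, key: &str) -> PyResult<Vec<u8>> {
    let result = store.call_method1("get", (key,))?;
    let bytes = result.downcast::<PyBytes>()?;
    Ok(bytes.as_bytes().to_vec())
}
```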