Investigate Rust-Based Python-Accessed Stores #64

Open
ilan-gold opened this issue Dec 2, 2024 · 23 comments
Labels
enhancement New feature or request good first issue Good for newcomers investigation

Comments

@ilan-gold
Collaborator

Currently, a store can be written in Rust, wrapped in pyo3, and then passed to our library. But we don't actually use that store; we re-implement/re-instantiate it in Rust within our package. We should investigate whether it's possible to somehow "work around" this and grab the underlying Rust store directly, without compromising performance.

@ilan-gold ilan-gold added enhancement New feature or request good first issue Good for newcomers investigation labels Dec 2, 2024
@kylebarron

You could access it via Python calls, but you wouldn't be able to access the underlying Rust store directly. pyo3 doesn't support that sort of dynamic linking (yet?) because it's hard to do in a stable way across pyo3, python, and Rust versions.

@ilan-gold
Collaborator Author

You could access it via Python calls

Do you have a sense of the performance of this approach?

Re: #44 (comment), does this then return a copy? Am I understanding right that a copy happens when you go from Python into Rust?

@kylebarron

Well the proposal in #44 (comment) would have you compile your own stores on the Rust side (just reusing the Rust code from pyo3-object_store), so you wouldn't have any overhead.

But in principle, it's also possible to call the Python functions exported by obstore. Then, when obstore.get or obstore.get_range returns, you could extract the result into a PyBuffer<u8> buffer-protocol object. But you'd have to acquire the GIL whenever you call the Python function, so the pure-Rust approach above is better.
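For illustration, a rough sketch of that Python-roundtrip approach (assuming pyo3 0.23's Bound API, and assuming obstore's module-level get returns an object whose bytes() method exposes the buffer protocol; store is an obstore store instance handed in from Python):

```rust
use pyo3::buffer::PyBuffer;
use pyo3::prelude::*;

fn get_via_python(py: Python<'_>, store: &Bound<'_, PyAny>, path: &str) -> PyResult<Vec<u8>> {
    // The GIL must be held (witnessed by `py`) for every call into Python.
    let obstore = py.import("obstore")?;
    let result = obstore.getattr("get")?.call1((store, path))?;
    let bytes = result.call_method0("bytes")?;
    // Borrow the payload through the buffer protocol; `to_vec` then copies it out.
    let buffer = PyBuffer::<u8>::get(&bytes)?;
    buffer.to_vec(py)
}
```

The GIL acquisition (and the copy out of the buffer) is exactly the overhead the pure-Rust approach avoids.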

@kylebarron

kylebarron commented Feb 26, 2025

This PR in async-tiff contains an example of reusing pyo3-object_store. Once object_store 0.12 is released (should be in the next few days), we can release pyo3-object_store v0.1, and then we can prototype integrating that here.

@ilan-gold
Collaborator Author

Very cool @kylebarron !!!

@kylebarron

I've now published pyo3-object_store 0.1, which uses pyo3 0.23, and pyo3-object_store 0.2, which uses pyo3 0.24.

It looks like you'd want to add object_store support into your StoreConfig?
https://github.com/ilan-gold/zarrs-python/blob/03af8d0fdedebf847d13f88768b0cc7171238df6/src/store.rs#L26-L30

Is there a reason why you require Hash, PartialEq, Eq, PartialOrd, Ord on StoreConfig? Those are a lot of bounds.

And it looks super messy to take a LocalStore or an FsspecStore and access their underlying HTTP config settings.

Instead of storing a StoreConfig, why don't you store a Store directly?
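Just to illustrate (with hypothetical type names, not the actual zarrs-python ones), storing the live store might look like:

```rust
use std::sync::Arc;
use object_store::ObjectStore;

// Hold a live client rather than the parameters to build one.
// The Arc'd store is created once, so its HTTP connection pool is
// reused across every chunk request.
enum Store {
    ObjectStore(Arc<dyn ObjectStore>),
    Filesystem(std::path::PathBuf),
    // ... other backends
}
```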

@kylebarron

Also, at least for object_store, it's bad to be recreating a store from a config all the time, because that prevents connection pooling from taking place.

@LDeakin
Member

LDeakin commented Mar 17, 2025

Is there a reason why you require Hash, PartialEq, Eq, PartialOrd, Ord on StoreConfig? Those are a lot of bounds.

I think it is related to storing the configs in the StoreManager.

Also, at least for object_store, it's bad to be recreating a store from a config all the time, because that prevents connection pooling from taking place.

The StoreManager is caching stores with the same config.
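Roughly, the idea is something like the following sketch (illustrative, not the actual StoreManager); the config only needs those trait bounds because it keys the cache:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Mutex};

// Configs key a cache of live stores, so repeated requests with the
// same config reuse one store (and its connection pool).
struct StoreManager<C: Hash + Eq + Clone, S> {
    cache: Mutex<HashMap<C, Arc<S>>>,
}

impl<C: Hash + Eq + Clone, S> StoreManager<C, S> {
    fn get_or_create(&self, config: &C, create: impl FnOnce() -> S) -> Arc<S> {
        let mut cache = self.cache.lock().unwrap();
        Arc::clone(
            cache
                .entry(config.clone())
                .or_insert_with(|| Arc::new(create())),
        )
    }
}
```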

I think this is all somewhat because we don't get any info about the store when an array is created, and it just starts getting thrown at us with each chunk. It would be fantastic to get an upstream change into zarr-python to get this info when the codec pipeline is created.

@kylebarron

I think this is all somewhat because we don't get any info about the store when an array is created, and it just starts getting thrown at us with each chunk. It would be fantastic to get an upstream change into zarr-python to get this info when the codec pipeline is created.

Indeed, that does sound like a really helpful update. It would be possible to have a similar config caching approach for object_store, but it wouldn't be my first approach.

@JackKelly

@kylebarron's awesome obstore-based Store has recently been merged into zarr-python. Does that make it any easier to directly call object_store from zarrs-python without going via Python? (I'm guessing not???)

Why do I ask? To zoom out for a moment: I'm really excited about being able to train machine learning models for forecasting renewable power generation directly from public Zarr datasets of numerical weather predictions, like those on dynamical.org. Modern GPUs are shockingly fast, so the bottleneck when training these ML models is often the IO. The dream would be for these datasets to be published as sharded Zarrs with pretty tiny chunks. And then, to train our ML models, we'd spin up a virtual machine with a fast network interface card (100 or 200 gigabits per second) near to the data.

Crucially, we'd need a Zarr library that could efficiently load (at least) thousands of chunks per second. Zarrs feels like a great candidate for this. I haven't tried yet, but my guess is that zarrs + object_store can already achieve this level of performance today, for Rust developers.

But most ML researchers use Python. So I'm looking into ways to achieve this level of performance for Python users. TBH, I'm nervous that we'll bump into Python's overhead when trying to load thousands of chunks per second.

So, I guess I have several questions:

  1. Assuming that zarrs-python has to go via Python to call obstore, could zarrs-python + obstore load thousands of Zarr chunks per second from object storage?
  2. If not, is there a clean way to allow zarrs-python to directly call object_store from Rust, without going via Python?
  3. If not, would a useful contribution be a minimal Python wrapper around zarrs and obstore, that does not use zarr-python? (This work would probably include building a simple xarray backend.)

I can dedicate some time to helping build things if needs be.

@d-v-b

d-v-b commented May 3, 2025

Crucially, we'd need a Zarr library that could efficiently load (at least) thousands of chunks per second.

I'm pretty sure we can do this with Zarr-python after performance optimization of our IO layer. At a minimum, it would be very valuable to know what things we would need to change in order to achieve this throughput.

@ilan-gold
Collaborator Author

ilan-gold commented May 3, 2025

@JackKelly we always welcome contributions. The main blocker here IMO would be the fact that our code is very sync-focused and not very async-focused. As far as I can tell, we would need a fair amount of duplication of current methods to achieve an async API. Doable, but since none of us three maintainers work with cloud data as far as I can tell, not a priority for us personally. io_uring has been on our collective radar but no one has taken it up, so having the async code in place would be awesome for that as a first step!

As for what @d-v-b said, I tend to agree. I don't think io_uring in Python is happening any time soon for a variety of reasons, but in Rust it seems feasible. So for that, our library is a good fit and something I look forward to. Aside from that, while our performance here is good, there is definitely more juice to be squeezed out of the Python implementation, as benchmarks show (which makes this library very useful for knowing that as well!), but when the Python implementation is "optimized," we can be quite close. For example, our read benchmarks (both chunk-by-chunk and full read) are quite competitive outside sharding, so I don't see why sharding is a fundamental bottleneck, although if it is, I would love to understand why, as @d-v-b pointed out.

Of course contributions are welcome! Happy to provide any assistance!

@ilan-gold
Collaborator Author

ilan-gold commented May 3, 2025

So if you wanted to contribute @JackKelly, I think the first step would be to get an async setup here running first, and then we can look at individual stores. If you're reading public data (i.e., no need for credentials), it is probably easier to use one of the existing rust-based stores zarrs supports rather than do the rust -> python -> rust thing with obstore (for example, I assume opendal has similar performance characteristics: https://docs.rs/zarrs_opendal/latest/zarrs_opendal/).

But I am excited to learn more!

@LDeakin
Member

LDeakin commented May 4, 2025

Crucially, we'd need a Zarr library that could efficiently load (at least) thousands of chunks per second. Zarrs feels like a great candidate to be able to do this.

Yes for the zarrs sync API, but the async array API has a performance limitation. I'd say step 1 before trying any of this Python stuff is to see if zarrs itself hits your performance target with remote data. That performance limitation can be addressed, it just hasn't been my priority as I work exclusively with local filesystems. Also, writing a runtime-agnostic async Rust library is still not great, but the Rust 2025H1 language goals look promising.

@ilan-gold
Collaborator Author

ilan-gold commented May 5, 2025

@LDeakin In this zarrs-python library we could experiment with different async APIs/runtimes since we have finer control than with pure zarrs, no? I.e., we could spawn tasks as needed where async_retrieve_array_subset cannot? Just making sure I understand.

My thought was that these limitations could be addressed here given that the userbase for python includes more remote users (unlike you or me), although I agree that the current setup should be tried first.

@JackKelly

JackKelly commented May 5, 2025

Thanks so much for all your replies! Very interesting stuff.

TL;DR: If I've understood correctly, it sounds like I should start by benchmarking zarrs and zarr-python v3.0.7. Specifically, I should measure bandwidth when reading sharded zarrs from cloud object storage, from a VM with a 100 gigabit per second network interface card (NIC). When benchmarking zarr-python, I'll try zarr-python with and without the new obstore-based Store, and with and without @aldenks's PR #3004: "Coalesce and parallelize partial shard reads".

Assuming these benchmarks demonstrate that zarr-python is not yet saturating the NIC, I'll first work on helping to speed up zarr-python (by writing Python, rather than Rust), as suggested above by @d-v-b.

Where should the benchmarks live?

Does anyone have any strong opinions about where I should implement these benchmarks? The three options I'm aware of are:

  1. https://github.com/LDeakin/zarr_benchmarks
  2. https://github.com/zarr-developers/zarr-benchmark
  3. A new repo

Python vs Rust

My hunch is still that, at some point, zarr-python will hit a performance wall in Python. NICs are getting faster. SSDs are getting faster. CPUs aren't getting much faster. Loading sharded Zarrs requires looping over chunks, and Python loops are slow. So, at some point soon, the bottleneck when trying to read huge numbers of small chunks from sharded Zarrs will be the software stack rather than the hardware's IO bandwidth. So, further down the track, I'm also really interested in zarrs for reading from object storage.

io_uring

A year and a half ago, I started work on a project called light-speed-io to use io_uring to read sharded Zarrs from local SSDs. I got as far as proving to myself that the hype is right: io_uring can be insanely fast and efficient: you can easily max-out a PCIe 5 SSD, with minimal CPU overhead. But then it became clearer to me that in my little world (weather forecasting), most users and most datasets are moving to the cloud. So I paused work on light-speed-io.

I think my main take-away from my work on io_uring is that, yes, io_uring can be very fast and efficient, but you have to do a remarkable amount of work to realise that performance. Specifically, I found it's necessary to have multiple user-space threads, each with their own uring, and you have to efficiently distribute work across those threads with a work-stealing scheduler. And, of course, the performance benefits of io_uring only appear when you keep hundreds of IO operations (or more) in flight at any given moment. And life gets very fiddly when you're using O_DIRECT with io_uring: all your buffers have to be aligned, error handling becomes quite fiddly, etc.
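For anyone who hasn't used it, here's a minimal single-read sketch with the io-uring crate (error handling trimmed; a real reader would keep hundreds of reads in flight per ring, across several threads, as described above):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?; // submission queue with 8 entries
    let file = File::open("chunk.bin")?;
    let mut buf = vec![0u8; 4096];

    // Queue one read; user_data lets us match the completion to the request.
    let read_op = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(42);
    unsafe { ring.submission().push(&read_op).expect("submission queue full") };

    // Submit and block until one completion arrives.
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("completion queue empty");
    assert_eq!(cqe.user_data(), 42);
    println!("read {} bytes", cqe.result());
    Ok(())
}
```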

Looking back, LSIO was far too ambitious. I think it could be turned into a crate with a much narrower scope for enabling async local file IO, using io_uring (whilst looking after all the pain of spinning up multiple worker threads).

BTW, there is some work which suggests that io_uring could have benefits for highly concurrent network IO (as well as local IO). So I do agree that the "perfect" Zarr library would have the option to use io_uring for both local IO and network IO.

@ilan-gold
Collaborator Author

Thanks for the detailed reply. Your assessment looks good to me.

Where you add the benchmarks is up to you, but if you want the pure-Rust benchmark, zarr_benchmarks probably makes sense. If you only use this package, it's easy to turn on and off, so the other repo could work too.

Re: io_uring, I have peeked at the library more than once, and it's one of the reasons I also feel good about io_uring. Thanks for the info there. I think with PyO3 here the limitation is that we would be tied to a runtime, and I'm not sure light-speed-io has that. There is support for custom runtimes in PyO3, though: https://docs.rs/pyo3-asyncio/latest/pyo3_asyncio/#rusts-event-loop
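A hedged sketch of what that looks like (assuming pyo3-asyncio 0.20 with pyo3 0.20 and its tokio-runtime feature; newer pyo3 versions would use the pyo3-async-runtimes fork instead): a Rust future exposed to Python as an awaitable, running on the tokio runtime rather than Python's event loop.

```rust
use pyo3::prelude::*;

#[pyfunction]
fn fetch_chunk<'py>(py: Python<'py>, delay_ms: u64) -> PyResult<&'py PyAny> {
    pyo3_asyncio::tokio::future_into_py(py, async move {
        // Placeholder for real async I/O (e.g. an object_store request).
        tokio::time::sleep(std::time::Duration::from_millis(delay_ms)).await;
        Ok(delay_ms)
    })
}
```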

@LDeakin
Member

LDeakin commented May 6, 2025

This is getting a bit derailed (but is very interesting!). See below for some newly added async vs sync zarrs benchmarks:

  • Inner chunk reading benchmark: "feat: add inner chunk benchmark" (zarr_benchmarks#8)
    • zarr-python is far too slow to include in this benchmark
    • zarrs has an API for caching decoded shard indexes, so it has an unfair advantage here. I'm not sure if the other implementations have anything similar.
  • This discussion highlights zarrs async vs sync performance: "Local async benchmarks" (zarr_benchmarks#9)
  • zarrs (sync) is blazingly fast 🚀 at reading inner chunks, but async needs optimisation.

In this zarrs-python library we could experiment with different async APIs/runtimes since we have finer control than with pure zarrs, no? I.e., we could spawn tasks as needed where async_retrieve_array_subset cannot? Just making sure I understand.

@ilan-gold the problem is within async_retrieve_array_subset. But if chunk parallelism is external, then it is circumvented. This is what we do anyway in zarrs-python (albeit with the sync array API) because we are constrained by the CodecPipeline interface.

zarr-python does all the mapping of indexes to chunks, and this would be a major bottleneck for an array with tiny chunks. I think this could only be addressed by changing the CodecPipeline interface so that implementations could handle the indexing, but I don't know if that is feasible.

Does anyone have any strong opinions about where I should implement these benchmarks

@JackKelly If you are putting time into benchmarks it would be great to see:

  • More array configurations evaluated: chunk sizes, shard sizes, codecs, etc.
  • Local and remote stores
  • More implementations
  • Ensure the best possible API is used in each benchmark for each implementation

I think that is a major enough change that you may be better off writing the benchmarks from scratch. zarrs_tools has convenient binaries for benchmarking zarrs [1]. The benchmarks in LDeakin/zarr_benchmarks take far too long to run because the datasets are geared around my use case: massive arrays, big shards, and tiny $32^3$ inner chunks, and zarr-python is slow with sharded arrays with tiny chunks.

Footnotes

  1. When zarrs supports ZEP0008, those binaries could support arbitrary cloud stores. That would address https://github.com/LDeakin/zarrs_tools/issues/10 and https://github.com/LDeakin/zarr_benchmarks/issues/7

@JackKelly

Sounds good to me! I've started making some notes for my benchmarking plans here: https://github.com/JackKelly/zarr_cloud_benchmarks.

Please feel free to comment on that repo, so this thread can return to its original topic 🙂 Sorry for derailing this thread a bit!

@kylebarron

I'm quite curious how the obstore-based zarr-python store will compare with a native zarrs implementation.

In any case, pyo3-object_store, the pyo3 integration for the object_store crate, should be ready to use if you're interested in exploring a Python API to zarrs.

@JackKelly

Thanks! I'll benchmark zarr-python with and without your new obstore-based Store, and compare to zarrs. Like you, I'm excited to see how they compare!

@LDeakin
Member

LDeakin commented May 10, 2025

@JackKelly another Zarr benchmark repo spotted: https://github.com/HEFTIEProject/zarr-benchmarks

@JackKelly

Thank you!

Just a quick note that, before benchmarking Zarr readers, I'm going to experiment with using Parquet to store weather forecasts... So I might not get round to benchmarking Zarr readers for a little while, I'm sorry!
