Accept Iterable for Performance #31
Hi, thanks so much for this very useful library! I'm using it to randomize keys for billions of objects to create condensed files that contain groups of thousands to millions of objects at a time.
https://github.com/google/neuroglancer/blob/056a3548abffc3c76c93c7a906f1603ce02b5fa3/src/neuroglancer/datasource/precomputed/sharded.md
It's not critical, but there is a bottleneck step at the front of my Python processing pipeline where the hash is applied to all object labels at once to figure out how to assign them for further processing. The hash function dominates this calculation.
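For context, here is a minimal sketch of the kind of per-label hashing loop described above; the uint64 label packing and the shard count of 256 are illustrative assumptions, not details taken from the actual pipeline:

```python
import struct

import mmh3

labels = range(1_000_000)  # stand-in for the real object labels

# One Python-level mmh3 call per label: at this scale the per-call
# overhead, not the hash itself, dominates the runtime.
assignments = {
    label: mmh3.hash64(struct.pack("<Q", label))[0] % 256  # 256 shards is illustrative
    for label in labels
}
```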
It could be possible to thread this processing, but Python has the GIL. Multiprocessing could work, though the pickle/unpickle round trips would also take some time. I was thinking that a neat way to increase throughput would be to process multiple hashes at once in C, that is, to accept both a scalar and an iterable as input to the function. This would let the compiler autovectorize and would also avoid per-call Python/C overhead. I'm currently getting ~66.5k hashes/sec on Apple Silicon (M1, ARM64).
I'm thinking of an interface similar to the sketch below; the second, batch variant should return some buffer that is easy to read into numpy.
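The snippet that originally followed here did not survive the page capture, so below is a hedged reconstruction of the kind of interface being proposed. `mmh3.hash64` is the library's existing scalar function; the batch variant `hash64_batch` and its bytes return value are hypothetical illustrations, not part of the actual API:

```python
import mmh3

# Scalar form (exists today): one Python/C round trip per key.
h1, h2 = mmh3.hash64(b"object-12345")

# Hypothetical batch form: accept an iterable of keys, hash them all in
# a single C loop (giving the compiler a chance to autovectorize), and
# return one contiguous buffer of 64-bit hashes. The result could then
# be read into numpy with np.frombuffer(buf, dtype=np.uint64).
# buf = mmh3.hash64_batch([b"a", b"b", b"c"])  # proposed; does not exist
```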
Thank you for your consideration and for all the effort you've put into this library!
Comments

Sorry for the delay in my response. Currently my highest priority for the next update is to introduce a hashlib-compliant interface (#39), but I think your concern is important too, so I'll check whether your proposal can also be implemented at that time. Thank you for elaborating on your ideas!
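For reference, "hashlib-compliant" refers to the incremental update()/digest() protocol of Python's standard hashlib module. The sketch below shows that protocol using hashlib.sha256; a hashlib-compliant mmh3 hasher would expose the same methods, though its exact class names are not specified in this thread:

```python
import hashlib

# The hashlib protocol: feed data incrementally, then read the digest.
h = hashlib.sha256()
for chunk in (b"first chunk", b"second chunk"):
    h.update(chunk)

print(h.hexdigest())   # hex string of the final digest
print(h.digest_size)   # digest length in bytes
```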
No worries! I ended up implementing this myself; please feel free to use this code or take inspiration from it (though I doubt you'll need it): https://github.com/seung-lab/shard-computer/blob/main/MurmurHash3.cpp#L342-L391
I really appreciate your work! I'm somewhat reluctant to make mmh3 dependent on other modules (pybind11), but I'll try to do something similar. Currently …
At least for my use case, the 64-bit function was mandated by a spec, so it might be good to have two versions if possible. That said, my use case is taken care of, so don't worry about me specifically. Thank you for addressing this!
Really appreciate this library! Are there any upcoming plans to implement the functionality described above? It would be extremely handy for a project I'm working on.
Hi. Thanks so much for your interest in the library! Right now I don't have time to make a major update to this project, but I'll try to do so in the second or third week of August (though I can't guarantee it).