Accept Iterable for Performance #31

william-silversmith · 2021-11-19T21:18:35Z

Hi, thanks so much for this very useful library! I'm using it to randomize keys for billions of objects to create condensed files that contain groups of thousands to millions of objects at a time.

https://github.com/google/neuroglancer/blob/056a3548abffc3c76c93c7a906f1603ce02b5fa3/src/neuroglancer/datasource/precomputed/sharded.md

It's not critical, but there is a bottleneck step in the front of my Python processing pipeline where the hash is applied to all object labels at once to figure out how to assign them for further processing. The hash function is dominating this calculation.

Function: murmur at line 25
Time per Hit in microseconds
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def murmur(x):
    27     69734      74332.0      1.1      1.6    y = uint64(x).tobytes()
    28     69734    4381932.0     62.8     97.2    y = mmh3.hash64(y, x64arch=False)
    29     69733      52731.0      0.8      1.2    return uint64(y[0])

Total time: 5.44635 s
File: REDACTED
Function: compute_shard_location at line 145
Time per Hit in microseconds
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   145                                             @profile
   146                                             def compute_shard_location(self, key):
   147     69734      99805.0      1.4      1.8      chunkid = uint64(key) >> uint64(self.preshift_bits)
   148     69734    4703010.0     67.4     86.4      chunkid = self.hashfn(chunkid)
   149     69733      60072.0      0.9      1.1      minishard_number = uint64(chunkid & self.minishard_mask)
   150     69733      97034.0      1.4      1.8      shard_number = uint64((chunkid & self.shard_mask) >> uint64(self.minishard_bits))
   151     69733     312626.0      4.5      5.7      shard_number = format(shard_number, 'x').zfill(int(np.ceil(self.shard_bits / 4.0)))
   152     69733     102740.0      1.5      1.9      remainder = chunkid >> uint64(self.minishard_bits + self.shard_bits)
   153                                           
   154     69733      71060.0      1.0      1.3      return ShardLocation(shard_number, minishard_number, remainder)

It could be possible to thread this processing, but Python has the GIL. Multiprocessing could work, though the picke/unpickle will also take some time. I was thinking that a neat way to increase thoughput would be to process multiple hashes at once in C, that is, accept both a scalar and an iterator as input to the function. This would allow for the compiler to autovetorize and also avoid Python/C overheads. I'm getting ~66.5k hashes/sec on Apple Silicon M1 ARM64 currently.

I'm thinking of an interface similar to this. The second should be some buffer that is easy to read into numpy.

(lower, upper) = mmh3.hash64(some_bytes, x64arch=False)
[l1,h1,l2,h2] = mmh3.hash64(iterable_containing_bytes, x64arch=False)

Thank you for your consideration and for all the effort you've put into this library!

The text was updated successfully, but these errors were encountered:

hajimes · 2023-03-24T16:08:46Z

Sorry for the delay in my response.

Currently my highest priority for the next update is to introduce a hashlib-compliant interface (#39). But I think your concern is important too, so I'll check if your proposal can be also implemented at that time.

Thank you for elaborating your ideas!

william-silversmith · 2023-03-24T16:52:28Z

No worries! I ended up implementing this myself please feel free to use this code or take inspiration from it (though I doubt you'll need it):

https://github.com/seung-lab/shard-computer/blob/main/MurmurHash3.cpp#L342-L391
https://github.com/seung-lab/shard-computer/blob/main/shardcomputer.cpp#L52-L78

hajimes · 2023-03-25T05:36:58Z

I really appreciate your work! I'm relatively relunctant to make mmh3 dependent on other modules (pybind11), but I'll try to do similar things.

Currently mmh3.hash_from_buffer accepts a bytearray (including numpy.ndarray) but justs returns one value. Perhaps I could write a function that interprets an ndarray as a list of (fixed-size) objects, then returns a bytearray of hash results (by using the 128 bit version of mmh3).

william-silversmith · 2023-03-25T05:55:23Z

At least for my use case, the 64 bit function was mandated by a spec, so it might be good to have two versions if possible. Though, my use case is taken care of, so don't worry about me specifically. Thank you for addressing this!

https://github.com/seung-lab/cloud-volume/blob/master/cloudvolume/datasource/precomputed/sharding.py#LL64C2-L65C2

ruyimarone · 2023-07-25T22:21:51Z

Currently mmh3.hash_from_buffer accepts a bytearray (including numpy.ndarray) but justs returns one value. Perhaps I could write a function that interprets an ndarray as a list of (fixed-size) objects, then returns a bytearray of hash results (by using the 128 bit version of mmh3).

Really appreciate this library! Any upcoming plans to implement the functionality described above? It would be extremely handy for a project I'm working on.

hajimes · 2023-07-26T16:05:25Z

Hi. Thanks so much for being interested in the library!

Right now I don't have time to make a major update to this project, but I'll try to do so in the second or third week in August (though I cannot guarantee it).

hajimes self-assigned this Mar 24, 2023

hajimes added the enhancement label Mar 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accept Iterable for Performance #31

Accept Iterable for Performance #31

william-silversmith commented Nov 19, 2021 •

edited

Loading

hajimes commented Mar 24, 2023

william-silversmith commented Mar 24, 2023 •

edited

Loading

hajimes commented Mar 25, 2023

william-silversmith commented Mar 25, 2023

ruyimarone commented Jul 25, 2023

hajimes commented Jul 26, 2023

Accept Iterable for Performance #31

Accept Iterable for Performance #31

Comments

william-silversmith commented Nov 19, 2021 • edited Loading

hajimes commented Mar 24, 2023

william-silversmith commented Mar 24, 2023 • edited Loading

hajimes commented Mar 25, 2023

william-silversmith commented Mar 25, 2023

ruyimarone commented Jul 25, 2023

hajimes commented Jul 26, 2023

william-silversmith commented Nov 19, 2021 •

edited

Loading

william-silversmith commented Mar 24, 2023 •

edited

Loading