Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lookup semantics #21

Closed
meling opened this issue May 17, 2023 · 5 comments
Closed

Lookup semantics #21

meling opened this issue May 17, 2023 · 5 comments

Comments

@meling
Copy link

meling commented May 17, 2023

After playing around with RecSplit a bit, I came to realize that the lookup semantics (the operator()) is different from that of BBHash. That is, in RecSplit a key that is not in the map may return any value. In BBHash, a key that is not in the map will return a zero value, but may also return any value (a false positive). In my experience with BBHash, false positives are rare.

I'm wondering if RecSplit could be adjusted to a similar semantics; that is to return

  • 0 if it is known that the key is not in the map.
  • index values: 1, 2, ..., n corresponding to the n keys.
  • false positives may be allowed.

To be clear, I'm not asking that you change your implementation in this repository, only whether or not it is an inherent limitation with RecSplit's underlying data structures. I've looked at the code and the paper, but on first glance it was not trivial to understand.

@vigna
Copy link
Owner

vigna commented May 17, 2023 via email

@meling
Copy link
Author

meling commented May 17, 2023

Sorry, I'm of course wrong and I knew it... a big lapse of mind to claim it was rare. Below are some statistics from my reimplementation of BBHash. Looks like the number of false positives grows with the size of the map (0.82, 0.91, 0.96). However, the way we use BBHash, there is still some value in returning 0 (key is definitely not in the set).

I'm guessing that with RecSplit and using even fewer bits per key (e.g. 1.6), what you are saying is that these probabilities approaches 1??

=== RUN   TestManyKeys/name=Sequential_/gamma=2.0/keys=1000
    bbhash_test.go:151: BBHash(gamma=2.0, entries=1000, levels=6, bits=3392, size=424 B, bits per key=3.4)
    bbhash_test.go:152: found 82373084 false positive keys: 0.823731
=== RUN   TestManyKeys/name=Sequential_/gamma=2.0/keys=10000
    bbhash_test.go:151: BBHash(gamma=2.0, entries=10000, levels=9, bits=33408, size=4.1 KB, bits per key=3.3)
    bbhash_test.go:152: found 91337297 false positive keys: 0.913373
=== RUN   TestManyKeys/name=Sequential_/gamma=2.0/keys=100000
    bbhash_test.go:151: BBHash(gamma=2.0, entries=100000, levels=11, bits=330496, size=40.3 KB, bits per key=3.3)
    bbhash_test.go:152: found 96737320 false positive keys: 0.967373

@vigna
Copy link
Owner

vigna commented May 17, 2023

Yes, it's a matter of space. The more space you use, the more bits you have to test for false positives. 1.44 bits per key are just necessary to return a correct output on the key set. Anything more than that counts as bits you can consider as signatures. So if you have 1 bit more (2.44 overall) you might catch 50% false positives (but that doesn't really happen, it's an idealized model).

If you wanna obtain that result with RecSplit, just add the bits. Use a bit vector to store, say, 1 bit per key, which is the parity of the hash of the key. When you pass a key through RecSplit, check that the parity of its hash is equal to the stored parity. If not, return -1 or something. Easy! It will still use much less space than BBHash at that compression level, and catch 50% false positives.

BTW, Peter Sanders has a parallel implementation that builds quickly maps with 1.6 bits/key.

@meling
Copy link
Author

meling commented May 17, 2023

Thanks for the idea!

Yeah, I think I've seen the parallel implementation, but I'm testing on an Apple M2 Max, and I had some problems compiling it on this machine, but maybe that was the SIMD/GPU versions. Will check it again to see if there is a non-x86 variant.

Let me know if it is a different repo than this one: https://github.com/ByteHamster/GpuRecSplit

@meling
Copy link
Author

meling commented May 17, 2023

Closing as resolved; nothing to fix. Thanks for the help.

@meling meling closed this as completed May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants