Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are they any bindings for Golang? #10

Closed
AlexeyAkhunov opened this issue Sep 7, 2021 · 6 comments
Closed

Are they any bindings for Golang? #10

AlexeyAkhunov opened this issue Sep 7, 2021 · 6 comments

Comments

@AlexeyAkhunov
Copy link

I would like to use this library (specifically RecSplit for creating MPHFs) for some experiments in our Erigon project (if you are interested I can describe the possible use case): https://github.com/ledgerwatch/erigon
But I do need Golang binding to do so. I assume there are no bindings yet, and I was thinking about creating a bounty for someone to develop it.

@AlexeyAkhunov
Copy link
Author

Actually, after code review and experiments, I realised we would need to modify the implementation quite a bit:

Firstly, we cannot assume that apply non-cryptographic hash function to the keys and effectively taking low 64-bit (this is what I discovered the implementation really does) as a fingerprint, would work. Because it would be easy to break such algorithm by creating collision in such fingeprints (as we are operating in adversarial environment). Secondly, we need to handle the case where all keys do not fit in RAM. But in principle the approach should work.

@vigna
Copy link
Owner

vigna commented Sep 10, 2021

So, I'm reading this just now. It should be easy to replace the hash function with a crypto-level one. For keys not fitting in RAM do you mean the hashes or the keys themselves? There's no need for the keys to be in RAM. One can even sort the hashes offline, but that's more work (stxxl?).

@AlexeyAkhunov
Copy link
Author

@vigna thanks for replying. I have started re-implementing it in golang. We have what we call ETL (Extract Transform Load) framework in our library to be able to sort keys by bucket numbers "offline" (potentially using temporary files). My main issue was to make sure we do not generate duplicates when applying initial hash (first_hash in your code) to the keys. So instead I am planning to bring full-sized keys to the procedure for each bucket, instead of starting with uint64 fingerprints: ledgerwatch/erigon-lib#67

@vigna
Copy link
Owner

vigna commented Sep 13, 2021

But wait, first_hash is made of 128 bits. The probability of collision is entirely negligible. I don't understand your point about 64-bit fingerprints.

@vigna vigna closed this as completed Sep 13, 2021
@AlexeyAkhunov
Copy link
Author

But wait, first_hash is made of 128 bits. The probability of collision is entirely negligible. I don't understand your point about 64-bit fingerprints.

I meant that when the bucket of keys arrive into the recSplit function, it is vector<uint64>, so it is more chance of having collisions. And we cannot rely on low probability of collisions in our use case, because someone could try to find collision and introduce it just to stop our system from working. That is why I am trying to bring full sized keys into the equivalent of recSplit function, because it then deals with the collisions by definition (by finding the functions that does not give collisions)

@vigna
Copy link
Owner

vigna commented Sep 13, 2021

But the bucket is maximum ≈2000 keys. The probability of collision of 2000 64-bit keys is still entirely negligible. If the first function is crypto, the keys are randomly chosen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants