Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecSplit MPHF mapping #2

Closed
weaversa opened this issue Nov 6, 2019 · 4 comments
Closed

RecSplit MPHF mapping #2

weaversa opened this issue Nov 6, 2019 · 4 comments

Comments

@weaversa
Copy link

weaversa commented Nov 6, 2019

I am trying to get the mapping from keys to unique indices out of a RecSplit MPHF. I created a file with 4 strings and passed it to recsplit_dump_8, creating an MPHF. I modified recsplit_load.cpp (shown below) to display the mapping. However, the mapping is not a bijection. I also tried with a million keys and could not get an MPHF.

I've only been working with the tool for a few hours today, but I can't see how I'm using the interface incorrectly. Any help would be appreciated.

Here is the output of my run:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   benchmark/function/recsplit_load.cpp

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	test.keys
	test.mphf

no changes added to commit (use "git add" and/or "git commit -a")
$ make recsplit
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump.cpp -o bin/recsplit_dump_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump128.cpp -o bin/recsplit_dump128_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load.cpp -o bin/recsplit_load_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load128.cpp -o bin/recsplit_load128_8
$ cat test.keys
fish
cat
dog
horse
$ bin/recsplit_dump_8 test.keys 2 test.mphf
Building...
Elias-Fano cumul sizes:  3.000000 bits/bucket
Elias-Fano cumul bits:   5.000000 bits/bucket
Elias-Fano cumul sizes:  1.500000 bits/key
Elias-Fano cumul bits:   2.500000 bits/key
Rice-Golomb descriptors: 1.500000 bits/key
Total bits:              5.500000 bits/key
Construction time: 0.000 s, 15366 ns/key
$ bin/recsplit_load_8 test.keys test.mphf
fish -> 1
cat -> 1
1 Duplicate!!!
dog -> 1
1 Duplicate!!!
horse -> 2
0 Missing!!!
3 Missing!!!

Here is my modification torecsplit_load.cpp to print the mapping.

$ cat benchmark/function/recsplit_load.cpp 
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <random>
#include <sux/function/RecSplit.hpp>

#define SAMPLES (11)

using namespace std;
using namespace sux::function;

int main(int argc, char **argv) {
	if (argc < 3) {
		fprintf(stderr, "Usage: %s <keys> <mphf>\n", argv[0]);
		return 1;
	}

	ifstream fin(argv[1]);
	string str;
	vector<string> keys;
	while (getline(fin, str)) keys.push_back(str);
	fin.close();

	fstream fs;
	RecSplit<LEAF, ALLOC_TYPE> rs;

	fs.exceptions(fstream::failbit | fstream::badbit);
	fs.open(argv[2], std::fstream::in | std::fstream::binary);
	fs >> rs;
	fs.close();

	uint8_t *seen = (uint8_t *)calloc(keys.size(), sizeof(uint8_t));
	
	for (int k = 0; k < keys.size(); k++) {
	  uint64_t h = rs(keys[k]);
	  printf("%s -> %lu\n", keys[k].c_str(), h);
	  if(seen[h] == 1) {
	    printf("%lu Duplicate!!!\n", h);
	  } else {
	    seen[h] = 1;
	  }
	}

	for (int k = 0; k < keys.size(); k++) {
	  if(seen[k] == 0) {
	    printf("%u Missing!!!\n", k);
	  }
	}
	
	return 0;
}
@vigna
Copy link
Owner

vigna commented Nov 6, 2019

You're right. We have working tests, so my guess is that serializing such a small map triggers some bug in the serialization process: if you build the map in the code, it works, and if you try, with, like, 200 strings it works, too. The problem is with serializing a very small number of keys. I suspect some off-by-one.

Thanks for the bug report. The workaround for the time being is just using more keys :).

@vigna
Copy link
Owner

vigna commented Nov 6, 2019

Sorry—it's bullshit. Much easier. The RecSplit constructor which takes a file pointer (and which the dump tool uses) was implemented erroneously using the C getline(), which leaves the delimiter (e.g., \n) in the string. You are correctly reading the strings without the ending newline—hence the problem.

I'll fix this ASAP. In all our experiments we use dump128 and load128, so nobody every noticed this.

@vigna vigna closed this as completed Nov 6, 2019
@vigna
Copy link
Owner

vigna commented Nov 6, 2019

Fixed in 0dc7222.

@weaversa
Copy link
Author

weaversa commented Nov 7, 2019

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants