This repo hosts source code to train locality-sensitive bucketing (LSB) functions.
A bucketing function
Here we develop a machine-learning framework to automatically learn LSB functions.
- Environment: Python version >= 3.6
- Data simulation. The code in `/simulation` generates random pairs of length-$n$ strings $(s,t)$ with the edit distances needed. Given $d_1, d_2$, the training samples are tuples $(s,t,y)$, where $y = -1$ if $\mathrm{edit}(s,t) \le d_1$ and $y = 1$ if $\mathrm{edit}(s,t) \ge d_2$.
- Model training. The code for $n = 20$ and $n = 100$ is kept in separate folders. `siacnn_models_gpu.py` is a function library (losses, evaluations, model structures, and hash code generation) meant to be imported. `siaincp_runner.py` is the trainer for the Siamese neural network. Parameters can be modified in these files by following the annotations. To train a model, run `python siaincp_runner.py`.
- Testing and hash code generation. `tester.py` is a quick example that tests the data `seq-n20-ED15-2.txt` against the pre-trained models stored in `trained models` and generates the hash codes. Run it with `python tester.py`; the hash codes are stored in a file named `hashcode_20k_40m_(d1,d2)s.hdf5`.
- Pre-trained models. More pre-trained models are available at Zenodo.
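
The labeling rule from the data simulation step can be illustrated with a small standalone sketch. This is not the code in `/simulation`: the alphabet `ACGT`, the substitution-only mutation, and the helper names `edit_distance`, `mutate`, `label`, and `simulate` are all assumptions made for illustration. It only shows how a pair $(s,t)$ maps to $y = -1$, $y = 1$, or is discarded when its edit distance falls strictly between $d_1$ and $d_2$.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet; the repo's simulator may use another


def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def mutate(s, num_subs, rng):
    """Substitute num_subs distinct positions, keeping the length unchanged."""
    chars = list(s)
    for pos in rng.sample(range(len(chars)), num_subs):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


def label(s, t, d1, d2):
    """Return -1 if edit(s,t) <= d1, 1 if edit(s,t) >= d2, else None."""
    d = edit_distance(s, t)
    if d <= d1:
        return -1
    if d >= d2:
        return 1
    return None  # distance strictly between d1 and d2: not a training sample


def simulate(n, d1, d2, num_pairs, seed=0):
    """Collect num_pairs labeled tuples (s, t, y) of length-n strings."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < num_pairs:
        s = "".join(rng.choice(ALPHABET) for _ in range(n))
        t = mutate(s, rng.randint(0, n), rng)
        y = label(s, t, d1, d2)
        if y is not None:
            samples.append((s, t, y))
    return samples
```

For example, `simulate(20, 1, 3, 1000)` would return 1000 tuples for $n = 20$, $(d_1, d_2) = (1, 3)$. The sketch assumes $d_1 < d_2$ and does not balance the two classes, which a real training set would likely need.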