This repository contains datasets and source code for the paper "Boosting exact pattern matching with eXtreme Gradient Boosting (and more)".
@article{SG2025,
author = {Susik, Robert and Grabowski, Szymon},
year = {2025},
month = {March},
day = {29},
title = {Boosting exact pattern matching with extreme gradient boosting (and more)},
journal = {The Journal of Supercomputing},
volume = {81},
number = {5},
pages = {673},
issn = {1573-0484},
doi = {10.1007/s11227-025-07165-2},
url = {https://doi.org/10.1007/s11227-025-07165-2}
}
For those who want to perform the research on our dataset, we provide it in convenient json format in dataset_ml_json folder.
To reproduce our results, please follow the instructions below.
The experiment consists of two parts.
-
The first part consists of measuring the times of exact pattern matching algorithms to create a dataset for machine learning.
- 📁
smart
- contains the source code for exact pattern matching, which is mostly taken from SMART tool (Nov 7 2018, 3b112e362b6ff9e3336b406cc8618164da91ef23). - 📁
data
- contains datasets and patterns excluding datasets that must be downloaded from Pizza&Chilli corpus (the tests were performed on 50MB texts). - 📄
run_tests.py
- executes exact pattern matching algorithms and saves the results todataset-raw.pickle
- 📄
utils.py
- contains helper functions - 📄
dataset-raw.pickle
- contains raw datasets with pattern searching times for all algorithms
- 📁
-
The second part is to prepare the dataset for machine learning and train models to predict the fastest algorithm for a given pattern.
- 📄
start_pred.py
- runs the tests (trainings and predictions) - 📄
datautils.py
- contains helper functions - 📄
dataset-full.pickle
- processed dataset, prepared for machine learning - 📔
dataset-full-notebook.ipynb
- jupyter notebook with results of exact pattern matching algorithms - 📄
train_datasets.pickle
- train dataset - 📄
test_datasets.pickle
- test dataset - 📄
trained_models.pickle
- contains trained models (make sure to have the same version of scikit-learn to load them, otherwise it will not be possible) - 📔
results-notebook.ipynb
- jupyter notebook with results of machine learning models
- 📄
Note: *.pickle
files are not included in the repository. They can be downloaded from:
Note 2: The text corpus is also not included in the repository. It can be downloaded from Pizza&Chilli (the tests were performed on 50MB texts) and then placed in the data
directory.
- Python 3.10.10 (other versions may work, but they were not tested)
- All required packages are listed in
requirements.txt
file.
-
Clone the repository
git clone https://github.com/rsusik/boosting-exact-pattern-matching.git
-
Install required packages by running
pip install -r requirements.txt
-
Download the
*.pickle
files from the links above and place them in the root directory of the repository. -
Run
start_pred.py
script.
The results may differ in prediction times slightly due to the hardware differences, but the impact on the overall results should be negligible.
Perform step 1 and 2 from the previous section and then:
-
Download the text corpus from Pizza&Chilli (the tests were performed on 50MB texts) and place it in the
data
directory. -
Execute searching algorithms to create the dataset:
cd smart python compile.py cd .. python run_tests.py
Running the searching algorithms may take a long time and lead to slightly different results due to the hardware differences.
# | Algorithm | Source |
---|---|---|
1 | ac | source |
2 | ag | source |
3 | akc | source |
4 | askip | source |
5 | aut | source |
6 | bf | source |
7 | bfs | source |
8 | blim | source |
9 | bm | source |
10 | bmh-sbndm | source |
11 | bndm | source |
12 | bndml | source |
13 | bom | source |
14 | br | source |
15 | bsdm | source |
16 | bww | source |
17 | bxs | source |
18 | dbww | source |
19 | ebom | source |
20 | epsm | source |
21 | fbom | source |
22 | fdm | source |
23 | ffs | source |
24 | fjs | source |
25 | fndm | source |
26 | fs | source |
27 | fsbndm | source |
28 | graspm | source |
29 | hor | source |
30 | iom | source |
31 | jom | source |
32 | kbndm | source |
33 | kmp | source |
34 | kmpskip | source |
35 | kr | source |
36 | ksa | source |
37 | lbndm | source |
38 | ldm | source |
39 | mp | source |
40 | ms | source |
41 | nsn | source |
42 | om | source |
43 | pbmh | source |
44 | qlqs | source |
45 | qs | source |
46 | raita | source |
47 | rcolussi | source |
48 | rf | source |
49 | sa | source |
50 | sabp | source |
51 | sbndm-bmh | source |
52 | sbndm | source |
53 | sebom | source |
54 | sfbom | source |
55 | simon | source |
56 | skip | source |
57 | smith | source |
58 | smoa | source |
59 | so | source |
60 | ssabs | source |
61 | ssef | source |
62 | ssm | source |
63 | tbm | source |
64 | tndm | source |
65 | tndma | source |
66 | trf | source |
67 | ts | source |
68 | tsa | source |
69 | tsw | source |
70 | tunedbm | source |
71 | tvsbs | source |
72 | tw | source |
73 | twfr | source |
74 | wc | source |
75 | wfr | source |
76 | wom | source |
77 | ww | source |
78 | zt | source |