Skip to content

rsusik/boosting-exact-pattern-matching

Repository files navigation

Boosting exact pattern matching with eXtreme Gradient Boosting (and more)

About

This repository contains datasets and source code for the paper "Boosting exact pattern matching with eXtreme Gradient Boosting (and more)".

Citation

@article{SG2025,
  author = {Susik, Robert and Grabowski, Szymon},
  year = {2025},
  month = {March},
  day = {29},
  title = {Boosting exact pattern matching with extreme gradient boosting (and more)},
  journal = {The Journal of Supercomputing},
  volume = {81},
  number = {5},
  pages = {673},
  issn = {1573-0484},
  doi = {10.1007/s11227-025-07165-2},
  url = {https://doi.org/10.1007/s11227-025-07165-2}
}

Dataset for ML

For those who want to perform the research on our dataset, we provide it in convenient json format in dataset_ml_json folder.

To reproduce our results, please follow the instructions below.

Contents

The experiment consists of two parts.

  1. The first part consists of measuring the times of exact pattern matching algorithms to create a dataset for machine learning.

    • 📁 smart - contains the source code for exact pattern matching, which is mostly taken from SMART tool (Nov 7 2018, 3b112e362b6ff9e3336b406cc8618164da91ef23).
    • 📁 data - contains datasets and patterns excluding datasets that must be downloaded from Pizza&Chilli corpus (the tests were performed on 50MB texts).
    • 📄 run_tests.py - executes exact pattern matching algorithms and saves the results to dataset-raw.pickle
    • 📄 utils.py - contains helper functions
    • 📄 dataset-raw.pickle - contains raw datasets with pattern searching times for all algorithms
  2. The second part is to prepare the dataset for machine learning and train models to predict the fastest algorithm for a given pattern.

    • 📄 start_pred.py - runs the tests (trainings and predictions)
    • 📄 datautils.py - contains helper functions
    • 📄 dataset-full.pickle - processed dataset, prepared for machine learning
    • 📔 dataset-full-notebook.ipynb - jupyter notebook with results of exact pattern matching algorithms
    • 📄 train_datasets.pickle - train dataset
    • 📄 test_datasets.pickle - test dataset
    • 📄 trained_models.pickle - contains trained models (make sure to have the same version of scikit-learn to load them, otherwise it will not be possible)
    • 📔 results-notebook.ipynb - jupyter notebook with results of machine learning models

Note: *.pickle files are not included in the repository. They can be downloaded from:

Note 2: The text corpus is also not included in the repository. It can be downloaded from Pizza&Chilli (the tests were performed on 50MB texts) and then placed in the data directory.

Requirements

  • Python 3.10.10 (other versions may work, but they were not tested)
  • All required packages are listed in requirements.txt file.

To reproduce the results

  1. Clone the repository

    git clone https://github.com/rsusik/boosting-exact-pattern-matching.git
  2. Install required packages by running

    pip install -r requirements.txt
  3. Download the *.pickle files from the links above and place them in the root directory of the repository.

  4. Run start_pred.py script.

The results may differ in prediction times slightly due to the hardware differences, but the impact on the overall results should be negligible.

To create the dataset from scratch

Perform step 1 and 2 from the previous section and then:

  1. Download the text corpus from Pizza&Chilli (the tests were performed on 50MB texts) and place it in the data directory.

  2. Execute searching algorithms to create the dataset:

    cd smart
    python compile.py
    cd ..
    python run_tests.py

Running the searching algorithms may take a long time and lead to slightly different results due to the hardware differences.

List of algorithms used in the experiment

# Algorithm Source
1 ac source
2 ag source
3 akc source
4 askip source
5 aut source
6 bf source
7 bfs source
8 blim source
9 bm source
10 bmh-sbndm source
11 bndm source
12 bndml source
13 bom source
14 br source
15 bsdm source
16 bww source
17 bxs source
18 dbww source
19 ebom source
20 epsm source
21 fbom source
22 fdm source
23 ffs source
24 fjs source
25 fndm source
26 fs source
27 fsbndm source
28 graspm source
29 hor source
30 iom source
31 jom source
32 kbndm source
33 kmp source
34 kmpskip source
35 kr source
36 ksa source
37 lbndm source
38 ldm source
39 mp source
40 ms source
41 nsn source
42 om source
43 pbmh source
44 qlqs source
45 qs source
46 raita source
47 rcolussi source
48 rf source
49 sa source
50 sabp source
51 sbndm-bmh source
52 sbndm source
53 sebom source
54 sfbom source
55 simon source
56 skip source
57 smith source
58 smoa source
59 so source
60 ssabs source
61 ssef source
62 ssm source
63 tbm source
64 tndm source
65 tndma source
66 trf source
67 ts source
68 tsa source
69 tsw source
70 tunedbm source
71 tvsbs source
72 tw source
73 twfr source
74 wc source
75 wfr source
76 wom source
77 ww source
78 zt source

About

Boosting exact pattern matching with eXtreme Gradient Boosting (and more)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published