Boosting exact pattern matching with eXtreme Gradient Boosting (and more)

About

This repository contains datasets and source code for the paper "Boosting exact pattern matching with eXtreme Gradient Boosting (and more)".

Citation

@article{SG2025,
  author = {Susik, Robert and Grabowski, Szymon},
  year = {2025},
  month = {March},
  day = {29},
  title = {Boosting exact pattern matching with extreme gradient boosting (and more)},
  journal = {The Journal of Supercomputing},
  volume = {81},
  number = {5},
  pages = {673},
  issn = {1573-0484},
  doi = {10.1007/s11227-025-07165-2},
  url = {https://doi.org/10.1007/s11227-025-07165-2}
}

Dataset for ML

For those who want to perform the research on our dataset, we provide it in convenient json format in dataset_ml_json folder.

To reproduce our results, please follow the instructions below.

The first part consists of measuring the times of exact pattern matching algorithms to create a dataset for machine learning.
- 📁 smart - contains the source code for exact pattern matching, which is mostly taken from SMART tool (Nov 7 2018, 3b112e362b6ff9e3336b406cc8618164da91ef23).
- 📁 data - contains datasets and patterns excluding datasets that must be downloaded from Pizza&Chilli corpus (the tests were performed on 50MB texts).
- 📄 run_tests.py - executes exact pattern matching algorithms and saves the results to dataset-raw.pickle
- 📄 utils.py - contains helper functions
- 📄 dataset-raw.pickle - contains raw datasets with pattern searching times for all algorithms
The second part is to prepare the dataset for machine learning and train models to predict the fastest algorithm for a given pattern.
- 📄 start_pred.py - runs the tests (trainings and predictions)
- 📄 datautils.py - contains helper functions
- 📄 dataset-full.pickle - processed dataset, prepared for machine learning
- 📔 dataset-full-notebook.ipynb - jupyter notebook with results of exact pattern matching algorithms
- 📄 train_datasets.pickle - train dataset
- 📄 test_datasets.pickle - test dataset
- 📄 trained_models.pickle - contains trained models (make sure to have the same version of scikit-learn to load them, otherwise it will not be possible)
- 📔 results-notebook.ipynb - jupyter notebook with results of machine learning models

Note: *.pickle files are not included in the repository. They can be downloaded from:

Note 2: The text corpus is also not included in the repository. It can be downloaded from Pizza&Chilli (the tests were performed on 50MB texts) and then placed in the data directory.

Requirements

Python 3.10.10 (other versions may work, but they were not tested)
All required packages are listed in requirements.txt file.

To reproduce the results

Clone the repository

git clone https://github.com/rsusik/boosting-exact-pattern-matching.git

Install required packages by running
```
pip install -r requirements.txt
```
Download the *.pickle files from the links above and place them in the root directory of the repository.
Run start_pred.py script.

The results may differ in prediction times slightly due to the hardware differences, but the impact on the overall results should be negligible.

To create the dataset from scratch

Perform step 1 and 2 from the previous section and then:

Download the text corpus from Pizza&Chilli (the tests were performed on 50MB texts) and place it in the data directory.

Execute searching algorithms to create the dataset:

cd smart
python compile.py
cd ..
python run_tests.py

Running the searching algorithms may take a long time and lead to slightly different results due to the hardware differences.

List of algorithms used in the experiment

#	Algorithm	Source
1	ac	source
2	ag	source
3	akc	source
4	askip	source
5	aut	source
6	bf	source
7	bfs	source
8	blim	source
9	bm	source
10	bmh-sbndm	source
11	bndm	source
12	bndml	source
13	bom	source
14	br	source
15	bsdm	source
16	bww	source
17	bxs	source
18	dbww	source
19	ebom	source
20	epsm	source
21	fbom	source
22	fdm	source
23	ffs	source
24	fjs	source
25	fndm	source
26	fs	source
27	fsbndm	source
28	graspm	source
29	hor	source
30	iom	source
31	jom	source
32	kbndm	source
33	kmp	source
34	kmpskip	source
35	kr	source
36	ksa	source
37	lbndm	source
38	ldm	source
39	mp	source
40	ms	source
41	nsn	source
42	om	source
43	pbmh	source
44	qlqs	source
45	qs	source
46	raita	source
47	rcolussi	source
48	rf	source
49	sa	source
50	sabp	source
51	sbndm-bmh	source
52	sbndm	source
53	sebom	source
54	sfbom	source
55	simon	source
56	skip	source
57	smith	source
58	smoa	source
59	so	source
60	ssabs	source
61	ssef	source
62	ssm	source
63	tbm	source
64	tndm	source
65	tndma	source
66	trf	source
67	ts	source
68	tsa	source
69	tsw	source
70	tunedbm	source
71	tvsbs	source
72	tw	source
73	twfr	source
74	wc	source
75	wfr	source
76	wom	source
77	ww	source
78	zt	source

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Boosting exact pattern matching with eXtreme Gradient Boosting (and more)

About

Citation

Dataset for ML

Contents

Requirements

To reproduce the results

To create the dataset from scratch

List of algorithms used in the experiment

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
dataset_ml_json		dataset_ml_json
smart		smart
.gitignore		.gitignore
README.md		README.md
dataset-full-notebook.ipynb		dataset-full-notebook.ipynb
datautils.py		datautils.py
prepare_dataset.py		prepare_dataset.py
requirements.txt		requirements.txt
results-notebook.ipynb		results-notebook.ipynb
run_tests.py		run_tests.py
start_pred.py		start_pred.py
utils.py		utils.py

rsusik/boosting-exact-pattern-matching

Folders and files

Latest commit

History

Repository files navigation

Boosting exact pattern matching with eXtreme Gradient Boosting (and more)

About

Citation

Dataset for ML

Contents

Requirements

To reproduce the results

To create the dataset from scratch

List of algorithms used in the experiment

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages