Commit a433661
committed
update to the paper for JOSS
1 parent efd7847 commit a433661

File tree

5 files changed: +89 -65 lines changed

.github/workflows/compile-paper.yml

Lines changed: 0 additions & 53 deletions
This file was deleted.

notes.txt

Lines changed: 6 additions & 0 deletions
@@ -13,3 +13,9 @@ generate_corpus_CFG(complexity=20, num_sentences=100, enable_polysemy=false, bas
 No Changes to `~/Documents/repos/BenchmarkDataNLP.jl/Project.toml`
 No Changes to `~/Documents/repos/BenchmarkDataNLP.jl/Manifest.toml`
 .../repos/BenchmarkDataNLP.jl$ julia --project=. docs/make.jl
+
+
+# for the paper
+pandoc paper.md --pdf-engine=xelatex -o paper.pdf
+sudo apt-get install pandoc-citeproc
+pandoc paper.md --filter pandoc-citeproc --pdf-engine=xelatex -o paper.pdf

paper/paper.bib

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+@inproceedings{copestake1996applying,
+  title={Applying natural language processing techniques to speech prostheses},
+  author={Copestake, Ann},
+  booktitle={Working Notes of the 1996 AAAI Fall Symposium on Developing Assistive Technology for People with Disabilities},
+  year={1996}
+}
+
+@article{maletti2017survey,
+  title={Survey: Finite-state technology in natural language processing},
+  author={Maletti, Andreas},
+  journal={Theoretical Computer Science},
+  volume={679},
+  pages={2--17},
+  year={2017},
+  publisher={Elsevier}
+}
+
+@article{faye2012survey,
+  title={A survey of RDF storage approaches},
+  author={Faye, David C and Cure, Olivier and Blin, Guillaume},
+  journal={Revue Africaine de Recherche en Informatique et Math{\'e}matiques Appliqu{\'e}es},
+  volume={15},
+  year={2012},
+  publisher={Episciences.org}
+}
+
+@article{van1996generalized,
+  title={Generalized context-free grammars},
+  author={van Vugt, Nik{\`e}},
+  journal={Internal Report},
+  year={1996},
+  publisher={Citeseer}
+}
+
+@inproceedings{samsi2023words,
+  title={From words to watts: Benchmarking the energy costs of large language model inference},
+  author={Samsi, Siddharth and Zhao, Dan and McDonald, Joseph and Li, Baolin and Michaleas, Adam and Jones, Michael and Bergeron, William and Kepner, Jeremy and Tiwari, Devesh and Gadepally, Vijay},
+  booktitle={2023 IEEE High Performance Extreme Computing Conference (HPEC)},
+  pages={1--9},
+  year={2023},
+  organization={IEEE}
+}

paper/paper.md

Lines changed: 35 additions & 12 deletions
@@ -1,28 +1,51 @@
 ---
-title: 'Your Software Title'
+title: "BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking"
 tags:
-  - tag1
-  - tag2
-  - tag3
+  - Julia
+  - NLP
+  - benchmarking
+  - data generation
+  - language models
 authors:
-  - name: Your Name
-    orcid: Your ORCID
+  - name: "Alexander V. Mantzaris"
+    orcid: 0000-0002-0026-5725
     affiliation: 1
 affiliations:
-  - name: Your Institution
+  - name: "Department of Statistics and Data Science, University of Central Florida (UCF), USA"
     index: 1
-date: YYYY-MM-DD
+date: 25 January 2025
 bibliography: paper.bib
 ---

 # Summary

-A brief summary of what your software does and its high-level functionality.
+**BenchmarkDataNLP.jl** is a Julia package for generating synthetic text corpora that can be used to systematically benchmark and evaluate Natural Language Processing (NLP) models such as RNNs, LSTMs, and Large Language Models (LLMs). By letting users control core linguistic parameters such as the alphabet size, the vocabulary size, the grammatical expansion complexity, and the semantic structures, the library helps users test, evaluate, and debug NLP models. Rather than exposing many of the key internal parameter choices, only a minimal number are exposed so that users do not have to work out how to configure the generation process correctly. The key value for the user is the *complexity* integer, which controls the size of the alphabet, the vocabulary, and the grammar expansions. Its range starts at 1 so that very simple text corpora can be generated; for example, at complexity=1 there are 5 letters in the alphabet, 10 words in the vocabulary, and 2 grammar roles when the Context Free Grammar generator is selected. Users can choose how many independent productions are desired, each of which is supplied as an entry in a .jsonl file. The defaults provided in the documentation should suffice for most use cases.
 
-# Statement of Need
+The generator methodology approaches are:
 
-Explain the research purpose of the software and its context in related work.
+- Context Free Grammar [@van1996generalized]
+- Resource Description Framework (RDF, triple store) [@faye2012survey]
+- Finite State Machine [@maletti2017survey]
+- Template Strings [@copestake1996applying]
+
+Each method offers different options to the user. With the Finite State Machine approach, the function *generate_fsm_corpus* lets users produce a deterministic set of productions so that, upon successful training, accuracies of 100% can be achieved. One application is testing how much computation different models require at a particular value of complexity. Together, the four approaches cover a wide range of text production methods that do not require parameter estimation as part of their usage. The letters of the alphabet begin at the *HANGUL* start (44032, 0xAC00), and labels are not repeated with integer identifiers appended at the end; such identifiers are convenient for implementations but may not be ideal for some tokenization approaches.
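A corresponding sketch for the deterministic Finite State Machine generator. Only the function name *generate_fsm_corpus* is given in the text; the keyword arguments below are assumptions mirroring the CFG generator, not the confirmed API.

```julia
using BenchmarkDataNLP  # assumes the package is installed in the active environment

# Hypothetical invocation: a deterministic FSM corpus at a fixed complexity.
# Because the productions are deterministic, a sufficiently expressive model
# can reach 100% training accuracy, isolating compute cost from data noise.
generate_fsm_corpus(
    complexity = 5,        # assumed keyword, mirroring generate_corpus_CFG
    num_sentences = 100,   # assumed keyword
)
```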
+
+This package also aims to reduce the cost of exploring different models and architectures, which would otherwise require large amounts of energy, time, and money to train on large natural language datasets [@samsi2023words]. Risks of this kind can often be expected when very novel approaches are explored.
+
+# Statement of need
+
+Performance evaluation of NLP systems often hinges on realistic datasets that capture the target domain's linguistic nuances. However, training on large-scale text corpora can be expensive or impractical, especially when testing very particular aspects of language complexity or model robustness. Synthetic data generation helps bridge these gaps through:
+
+1. **Reproducibility**: Controlled parameters (e.g., sentence length, grammar depth, or concept re-use) allow reproducible experiments.
+2. **Customization**: Researchers can stress-test models by systematically varying language properties, such as the number of roles in a grammar or the frequency of filler tokens.
+3. **Scalability**: Large-scale data can be generated for benchmarking advanced architectures without the need for extensive, real-world data collection.
+4. **Targeted Evaluation**: By manipulating semantic structures (for example, adding context continuity with RDF triples or specialized placeholders), researchers can investigate whether models capture specific linguistic or contextual features.
+
+Although several libraries and benchmarks (e.g., GLUE, SuperGLUE, and others) provide curated datasets, **BenchmarkDataNLP.jl** offers a unique approach by allowing *fine-grained control* of the underlying complexity of the synthetic corpus generation process. This capability is especially valuable when exploring model failure modes or when rapidly prototyping new model architectures that require specialized text patterns. It is hoped that this will lower the cost of initial prototyping of new model architectures and allow greater exploration; it can also help when comparing different modeling approaches.
+
+# Acknowledgements
+
+We thank the Julia community for their continued support of open-source scientific computing.
 
 # References
 
-Add your references here.

paper/paper.pdf

37.3 KB
Binary file not shown.
