---
title: "BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking"
tags:
  - Julia
  - NLP
  - benchmarking
  - data generation
  - language models
authors:
  - name: "Alexander V. Mantzaris"
    orcid: 0000-0002-0026-5725
    affiliation: 1
affiliations:
  - name: "Department of Statistics and Data Science, University of Central Florida (UCF), USA"
    index: 1
date: 25 January 2025
bibliography: paper.bib
---

# Summary

**BenchmarkDataNLP.jl** is a Julia package for generating synthetic text corpora that can be used to systematically benchmark and evaluate Natural Language Processing (NLP) models such as RNNs, LSTMs, and Large Language Models (LLMs). By letting users control core linguistic parameters, such as alphabet size, vocabulary size, grammatical expansion complexity, and semantic structure, the library helps users test, evaluate, and debug NLP models. Rather than exposing every internal parameter choice, only a minimal number are exposed, so users do not have to work out how to configure the generation process correctly. The key value is the *complexity* integer, which controls the size of the alphabet, the vocabulary, and the grammar expansions. Its range starts at 1, so very simple text corpora can be generated; for example, at complexity=1 there are 5 letters in the alphabet, 10 words in the vocabulary, and 2 grammar roles when the Context Free Grammar generator is selected. Users can choose how many independent productions are desired, each supplied as an entry in a .jsonl file. The defaults provided in the documentation should suffice for most use cases.
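The complexity-scaling idea can be illustrated with a minimal sketch. This is an illustrative analogue only, not the package's API: the names `make_grammar` and `expand`, and the exact scaling factors, are assumptions chosen to mirror the complexity=1 example above (5 letters, 10 words, 2 roles).

```python
import random

def make_grammar(complexity, rng):
    # Scale alphabet, vocabulary, and grammar size with one integer,
    # mirroring the complexity=1 example: 5 letters, 10 words, 2 roles.
    alphabet = [chr(0xAC00 + i) for i in range(5 * complexity)]  # Hangul block
    vocab = ["".join(rng.choices(alphabet, k=3)) for _ in range(10 * complexity)]
    roles = 2 * complexity
    # One nonterminal per grammar role; each expands to a few vocabulary words.
    rules = {f"R{i}": rng.sample(vocab, k=3) for i in range(roles)}
    # The start symbol chains two roles together.
    rules["S"] = [f"R{i} R{(i + 1) % roles}" for i in range(roles)]
    return rules

def expand(symbol, rules, rng):
    # Recursively expand nonterminals until only terminal words remain.
    if symbol not in rules:
        return symbol
    choice = rng.choice(rules[symbol])
    return " ".join(expand(tok, rules, rng) for tok in choice.split())

rng = random.Random(42)
grammar = make_grammar(1, rng)        # complexity = 1
sentence = expand("S", grammar, rng)  # a short synthetic "sentence"
print(sentence)
```

Because the seed is fixed, repeated runs produce the same grammar and the same sentence, which is the property that makes corpora of this kind useful for controlled benchmarking.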

The available generator methodologies are:

- Context Free Grammar [@van1996generalized]
- Resource Description Framework (RDF, triple store) [@faye2012survey]
- Finite State Machine [@maletti2017survey]
- Template Strings [@copestake1996applying]

Each method offers different options to the user. With the Finite State Machine approach, the function *generate_fsm_corpus* lets users produce a deterministic set of productions, so that upon successful training accuracies of 100% can be achieved. One application is comparing how much computation different models require at a particular value of complexity. Together, the four approaches cover a wide range of text production methods without requiring parameter estimation as part of their usage. The letters of the alphabet begin at the start of the *Hangul* block (44032, 0xAC00), and words are not distinguished by appending integer identifiers, which are convenient in implementations but may not be ideal for some tokenization approaches.
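The deterministic behaviour of the finite-state approach can be sketched as follows. This is a hypothetical Python analogue, not the implementation of *generate_fsm_corpus*; the state-cycling scheme and parameter names are assumptions for illustration.

```python
import json

def fsm_corpus(num_states, num_productions):
    # Deterministic FSM: state i always emits word i, then moves to
    # state (i + 1) mod num_states. Words are single glyphs from the
    # Hangul block starting at U+AC00, with no integer identifiers appended.
    words = [chr(0xAC00 + i) for i in range(num_states)]
    lines = []
    state = 0
    for _ in range(num_productions):
        emitted = []
        for _ in range(num_states):          # one full cycle per production
            emitted.append(words[state])
            state = (state + 1) % num_states
        lines.append(json.dumps({"text": " ".join(emitted)}, ensure_ascii=False))
    return lines

corpus = fsm_corpus(num_states=4, num_productions=3)
for line in corpus:
    print(line)
```

Since each production traverses one full cycle of the machine, every production is identical; a model that learns the cycle can therefore reach 100% accuracy on such a corpus, isolating computational cost from data noise.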

This package also aims to reduce the cost of exploring different models and architectures, which would otherwise require large amounts of energy, time, and money to train on large natural language datasets [@samsi2023words]. Such risks can often be expected when very novel approaches are explored.

# Statement of need

Performance evaluation of NLP systems often hinges on realistic datasets that capture the target domain's linguistic nuances. However, training on large-scale text corpora can be expensive or impractical, especially when testing very particular aspects of language complexity or model robustness. Synthetic data generation helps bridge these gaps through:

1. **Reproducibility**: Controlled parameters (e.g., sentence length, grammar depth, or concept re-use) allow reproducible experiments.
2. **Customization**: Researchers can stress-test models by systematically varying language properties, such as the number of roles in a grammar or the frequency of filler tokens.
3. **Scalability**: Large-scale data can be generated for benchmarking advanced architectures without the need for extensive, real-world data collection.
4. **Targeted Evaluation**: By manipulating semantic structures (for example, adding context continuity with RDF triples or specialized placeholders), researchers can investigate whether models capture specific linguistic or contextual features.
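The reproducibility point can be made concrete with a minimal seeded-generation sketch. The function name and parameters below are illustrative assumptions, not part of the package:

```python
import random

def synth_sentences(seed, vocab_size, n):
    # A fixed seed plus fixed parameters yields an identical corpus on
    # every run, which is what makes benchmark comparisons reproducible.
    rng = random.Random(seed)
    vocab = [chr(0xAC00 + i) for i in range(vocab_size)]
    return [" ".join(rng.choices(vocab, k=5)) for _ in range(n)]

run_a = synth_sentences(seed=7, vocab_size=20, n=3)
run_b = synth_sentences(seed=7, vocab_size=20, n=3)
assert run_a == run_b  # identical corpora across runs
```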

Although several libraries and benchmarks (e.g., GLUE, SuperGLUE, and others) provide curated datasets, **BenchmarkDataNLP.jl** offers a unique approach by allowing *fine-grained control* over the complexity of the synthetic corpus generation process. This capability is especially valuable when exploring model failure modes or for rapid prototyping of new model architectures that require specialized text patterns. It is hoped that this will lower the cost of initial prototyping of new architectures, allow broader exploration, and help in comparing different modeling approaches.

# Acknowledgements

We thank the Julia community for their continued support of open-source scientific computing.

# References