Benchmark various LLM structured output frameworks (Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc.) on tasks such as multi-label classification, named entity recognition, and synthetic data generation.
- Multi-label classification

  | Framework | Model | Reliability | Latency p95 (s) |
  |---|---|---|---|
  | Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
  | Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
  | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
  | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.804* |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
  | Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |

- Named Entity Recognition

  | Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
  |---|---|---|---|---|---|---|
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
  | Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
  | Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |

- Synthetic Data Generation

  | Framework | Model | Reliability | Latency p95 (s) | Variety |
  |---|---|---|---|---|
  | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
  | Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
  | Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
  | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.690 | 2.354* | 0.942 |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |

\* NVIDIA GeForce RTX 4080 Super GPU
- Install the requirements using `pip install -r requirements.txt`
- Set the OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Run the benchmark using `python -m main run-benchmark`
- Raw results are stored in the `results` directory.
- Generate the results using:
  - Multilabel classification: `python -m main generate-results`
  - NER: `python -m main generate-results --task ner`
  - Synthetic data generation: `python -m main generate-results --task synthetic_data_generation`
- To get help on the command line arguments, add `--help` after the command. E.g., `python -m main run-benchmark --help`
- Multi-label classification:
  - Task: Given a text, predict the labels associated with it.
  - Data:
    - Base data: Alexa intent detection dataset
    - The benchmark is run on synthetic data generated by running `python -m data_sources.generate_dataset generate-multilabel-data`.
    - The synthetic data is generated by sampling and combining rows from the base data so that each row carries multiple classes, following a distribution over the number of classes per row. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
  - Prompt: `"Classify the following text: {text}"`
  - Evaluation Metrics:
    - Reliability: The percentage of times the framework returns valid labels without errors, computed as the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percentage of successful runs for each row.
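The reliability and latency metrics described above can be sketched as follows. This is an illustration only, not the repository's code: the function names `reliability` and `latency_p95` and the sample numbers are hypothetical, and the linear-interpolation percentile is an assumption, since the text does not state which percentile method is used.

```python
import math


def reliability(percent_successful: list[float]) -> float:
    """Average the per-row success fractions (each in [0, 1])."""
    return sum(percent_successful) / len(percent_successful)


def latency_p95(latencies: list[float]) -> float:
    """95th percentile using linear interpolation between the closest ranks."""
    ordered = sorted(latencies)
    rank = 0.95 * (len(ordered) - 1)  # fractional index into the sorted list
    lower, upper = math.floor(rank), math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical data: success fraction per row, and latency per run in seconds.
row_success = [1.0, 0.9, 1.0]
row_latencies = [0.8, 1.1, 0.9, 1.3, 1.0]

print(f"Reliability: {reliability(row_success):.3f}")
print(f"Latency p95 (s): {latency_p95(row_latencies):.3f}")
```

Reporting the 95th percentile rather than the mean keeps a handful of very slow calls visible instead of averaging them away.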
- Named Entity Recognition
  - Task: Given a text, extract the entities present in it.
  - Data:
    - Base data: Synthetic PII Finance dataset
    - The benchmark is run on sampled data generated by running `python -m data_sources.generate_dataset generate-ner-data`.
    - The data is sampled from the base data so that the number of entities per row follows a given distribution. See `python -m data_sources.generate_dataset generate-ner-data --help` for more details.
  - Prompt: `Extract and resolve a list of entities from the following text: {text}`
  - Evaluation Metrics:
    - Reliability: The percentage of times the framework returns a valid response without errors, computed as the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Precision: The micro average of the precision of the framework on the data.
    - Recall: The micro average of the recall of the framework on the data.
    - F1 Score: The micro average of the F1 score of the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percentage of successful runs for each row.
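To make the micro-averaged NER metrics concrete, here is a minimal sketch. It assumes each entity is represented as an `(entity_type, value)` pair and that a prediction counts only on an exact match; both the representation and the `micro_prf` helper are hypothetical and may differ from the repository's scoring code.

```python
def micro_prf(predicted: list[set], expected: list[set]) -> tuple[float, float, float]:
    """Micro-averaged precision, recall and F1: pool counts over all rows first."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted, expected):
        tp += len(pred & gold)  # predicted entities that are correct
        fp += len(pred - gold)  # predicted entities not in the ground truth
        fn += len(gold - pred)  # ground-truth entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical rows: each entity is an (entity_type, value) pair.
expected = [
    {("name", "Jane Doe"), ("company", "Acme Bank")},
    {("account_number", "12345678")},
]
predicted = [
    {("name", "Jane Doe"), ("company", "Acme")},
    {("account_number", "12345678")},
]

precision, recall, f1 = micro_prf(predicted, expected)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Micro-averaging weights every entity equally, so rows with many entities influence the score more than rows with few. As a check against the results table, a precision of 0.834 and a recall of 0.748 give 2 × 0.834 × 0.748 / (0.834 + 0.748) ≈ 0.789, which matches the F1 reported for OpenAI Structured Output.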
- Synthetic Data Generation
  - Task: Generate synthetic data that conforms to a Pydantic data model schema.
  - Data:
    - Two-level nested User details Pydantic schema.
  - Prompt: `Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose.`
  - Evaluation Metrics:
    - Reliability: The percentage of times the framework returns a valid response without errors, computed as the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Variety: The percentage of generated names that are unique, relative to all names generated.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percentage of successful runs.
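One plausible reading of the Variety metric is the number of distinct names divided by the total number of names generated. The sketch below follows that reading; the `variety` helper, the exact-string comparison (no case or whitespace normalization), and the sample names are all assumptions made for illustration.

```python
def variety(names: list[str]) -> float:
    """Fraction of generated names that are distinct (exact string match)."""
    if not names:
        return 0.0
    return len(set(names)) / len(names)


# Hypothetical names returned across four runs of the same prompt.
generated_names = ["Zephyr Okonkwo", "John Smith", "John Smith", "Ilse Varga"]

print(f"Variety: {variety(generated_names):.3f}")
```

Under this reading, a framework that returns the same name on nearly every run scores close to 0, which is the pattern seen for the lowest Variety values in the results table.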
- Create a new pandas dataframe pickle file with the following columns:
  - `text`: The text to be sent to the framework
  - `labels`: List of labels associated with the text
  - See `data/multilabel_classification.pkl` for an example.
- Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test.
- Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
- Generate the results using `python -m main generate-results`
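As a minimal sketch of the first step, the snippet below builds a dataframe with the two required columns and pickles it. The example texts, the label names, and the file name `my_multilabel_data.pkl` are hypothetical; only the `text` and `labels` column names come from the description above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows; replace with your own texts and label lists.
df = pd.DataFrame(
    {
        "text": [
            "play some jazz and set an alarm for 7 am",
            "what is the weather today",
        ],
        "labels": [["play_music", "alarm_set"], ["weather_query"]],
    }
)

# Written to a temporary directory here; in practice, save the file wherever
# source_data_pickle_path in ./config.yaml will point.
output_path = Path(tempfile.mkdtemp()) / "my_multilabel_data.pkl"
df.to_pickle(output_path)

# Sanity check: reload the file and confirm the expected columns are present.
loaded = pd.read_pickle(output_path)
print(loaded.columns.tolist())
```

Comparing the reloaded frame against `data/multilabel_classification.pkl` before running the benchmark is a quick way to confirm that the column names and the list-valued `labels` column match what the frameworks expect.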
The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows:

1. Create a `.py` file in the `frameworks` directory with the name of the framework. E.g., `instructor_framework.py` for the instructor framework.
2. In this `.py` file, create a class that inherits `BaseFramework` from `frameworks.base`.
3. The class should define an `__init__` method that initializes the base class. Here are the arguments the base class expects:
   - `task` (str): The task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`.
   - `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `retries` (int): Number of retries for the framework. Default is 0. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
   - `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
4. The class should define a `run` method that takes four arguments:
   - `task`: The task that the framework is being tested on. Obtained from the `task` in the `./config.yaml` file. E.g., `"multilabel_classification"`.
   - `n_runs`: Number of times to repeat each text.
   - `expected_response`: Output expected from the framework. Use a default value of `None`.
   - `inputs`: A dictionary of `{"text": str}`, where `str` is the text to be sent to the framework. Use a default value of an empty dictionary `{}`.
5. This `run` method should define an inner `run_experiment` function that takes `inputs` as an argument, runs that input through the framework, and returns the output.
6. The `run_experiment` function should be decorated with the `@experiment` decorator from `frameworks.base`, with `n_runs`, `expected_response`, and `task` as arguments.
7. The `run` method should call the `run_experiment` function and return the four outputs `predictions`, `percent_successful`, `metrics`, and `latencies`.
8. Import this new class in `frameworks/__init__.py`.
9. Add a new entry in the `./config.yaml` file with the name of the class as the key. The YAML entry can have the following fields:
   - `task`: The task that the framework is being tested on. Allowed values are `"multilabel_classification"` and `"ner"`.
   - `n_runs`: Number of times to repeat each text.
   - `init_kwargs`: All the arguments that need to be passed to the `__init__` method of the class, including those mentioned in step 3 above.
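The overall shape of such a class can be sketched as below. To keep the example runnable on its own, `BaseFramework` and `experiment` are replaced with simplified, hypothetical stand-ins whose behaviour is assumed from the description above; in the repository they would be imported from `frameworks.base`, and their real signatures and internals may differ. `EchoFramework` is a made-up toy class that returns a fixed label instead of calling an LLM.

```python
import time
from typing import Any, Callable, Optional


# Hypothetical stand-ins so the sketch runs on its own. In the repository,
# use `from frameworks.base import BaseFramework, experiment` instead.
class BaseFramework:
    def __init__(self, task: str, prompt: str, llm_model: str, **kwargs: Any) -> None:
        self.task = task
        self.prompt = prompt
        self.llm_model = llm_model


def experiment(n_runs: int = 10, expected_response: Any = None, task: str = "multilabel_classification"):
    """Assumed behaviour: call the wrapped function n_runs times and collect results."""

    def decorator(func: Callable[..., Any]) -> Callable[..., tuple]:
        def wrapper(*args: Any, **kwargs: Any) -> tuple:
            predictions, latencies = [], []
            for _ in range(n_runs):
                start = time.perf_counter()
                try:
                    predictions.append(func(*args, **kwargs))
                    latencies.append(time.perf_counter() - start)
                except Exception:
                    continue  # a failed run only lowers percent_successful
            percent_successful = len(predictions) / n_runs
            metrics: dict = {}  # task-specific scoring is left out of this sketch
            return predictions, percent_successful, metrics, latencies

        return wrapper

    return decorator


class EchoFramework(BaseFramework):
    """Toy framework that returns a fixed label instead of calling an LLM."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)

    def run(self, task: str, n_runs: int, expected_response: Any = None, inputs: Optional[dict] = None) -> tuple:
        inputs = inputs or {}  # None stands in for the empty-dict default

        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs: dict) -> list:
            prompt = self.prompt.format(**inputs)
            # A real framework would send `prompt` to the LLM here and parse the reply.
            return ["placeholder_label"] if prompt else []

        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies


framework = EchoFramework(
    task="multilabel_classification",
    prompt="Classify the following text: {text}",
    llm_model="gpt-4o-mini-2024-07-18",
)
predictions, percent_successful, metrics, latencies = framework.run(
    task="multilabel_classification", n_runs=3, inputs={"text": "play some jazz"}
)
print(predictions, percent_successful)
```

The sketch uses `None` rather than `{}` as the default for `inputs` to avoid sharing one mutable dictionary across calls; this is a stylistic choice of the example, not something the steps above require. For a real implementation, `./frameworks/instructor_framework.py` remains the reference.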
- Framework related tasks:

  | Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
  |---|---|---|---|
  | OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
  | Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
  | Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
  | LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
  | Jsonformer | ❌ No Enum Support | Planning | Planning |
  | Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
  | Guidance | Planning | Planning | Planning |
  | DsPy | Planning | Planning | Planning |
  | Langchain | Planning | Planning | Planning |

- Others
  - Latency metrics
  - CICD pipeline for benchmark run automation
  - Async run
Contributions are welcome! Here are the steps to contribute:
- Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
- Once the issue is assigned to you, please submit a PR with the new framework!
To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:
@software{marie_stephen_leo_2024_12327267,
author = {Marie Stephen Leo},
title = {{stephenleo/llm-structured-output-benchmarks:
Release for Zenodo}},
month = jun,
year = 2024,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.12327267},
url = {https://doi.org/10.5281/zenodo.12327267}
}
If this work helped you in any way, please consider ⭐ starring this repository to give me feedback so I can spend more time on this project.