- Dataset – FailureSensorIQ: A curated dataset of 8,296 multiple-choice questions on sensor-failure reasoning, hosted on HuggingFace.
- Leaderboard: Comparative results of 27 frontier and open-source LLMs evaluated on FailureSensorIQ, hosted on HuggingFace.
- Extended Research I – Knowledge Distillation & Fine-Tuning: Distilling reasoning capabilities from LLMs into smaller SLMs and assessing performance on FailureSensorIQ. [Code] [Paper] [Poster]
- Extended Research II – Embedding Models: Developing generalized embeddings for Industry 4.0 applications and benchmarking on FailureSensorIQ. [Code] [Paper] [Poster]
- [9/25/2025] Embedding paper has been accepted in EMNLP 2025 Industry Track!
- [9/18/2025] FailureSensorIQ has been accepted in NeurIPS 2025 Datasets & Benchmarks Track!
- [8/20/2025] Fine-Tuning paper has been accepted to EMNLP 2025 Findings!
We introduce FailureSensorIQ, a Multi-Choice QA (MCQA) dataset that explores the relationships between sensors and failure modes for 10 industrial assets. Leveraging only the information found in ISO documents, we developed a data generation pipeline that creates questions of two types:
- Failure Mode to Sensors (FM2Sensor): which sensors should be monitored to detect a given failure mode early?
- Sensor to Failure Mode (Sensor2FM): which failure modes are indicated when abnormal sensor behavior is detected?
The FailureSensorIQ dataset consists of 8,296 questions across 10 assets, with 2,667 single-correct-answer questions and 5,629 multi-correct-answer questions.
We evaluate 27 frontier and open-source LLMs through our Perturbation-Uncertainty-Complexity evaluation pipeline, which systematically measures model robustness to question reformulation, model confidence, and the ability to handle an increasing number of distractor options, respectively.
Perturbation. We adopted the PertEval toolkit to create perturbed copies of the dataset. We developed two versions: (i) SimplePert, which modifies only the formatting of the questions by reordering the options, adding a right parenthesis after each option label, and changing the option labels from A, B, C, etc., to P, Q, R, and so on; and (ii) ComplexPert, which applies all of the above permutations and additionally uses an LLM (llama-3-70b in this case) to rephrase the question text.
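To make the SimplePert-style formatting change concrete, here is a minimal illustrative sketch of the option reordering and relabeling (the released perturbed splits were generated with PertEval, not with this snippet):

import random

def simple_pert(question, options, seed=0):
    # options: mapping from original labels to option text, e.g. {"A": "...", "B": "..."}
    rng = random.Random(seed)
    texts = list(options.values())
    rng.shuffle(texts)  # reorder the options
    new_labels = [chr(ord("P") + i) for i in range(len(texts))]  # A, B, C, ... -> P, Q, R, ...
    relabeled = dict(zip(new_labels, texts))
    # add a right parenthesis after each option label when rendering the prompt
    prompt = question + "\n" + "\n".join(f"{lab}) {txt}" for lab, txt in relabeled.items())
    return prompt, relabeled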
Uncertainty. We adopted the LLM Uncertainty Bench framework to assess model uncertainty in Multi-Choice Question Answering. Each LLM is prompted with the framework's Base Prompting method to output prediction probabilities for all answer options. To calibrate uncertainty estimates, we partition the dataset by asset type into a calibration set and a test set. Using the calibration set, we compute conformal scores that define a confidence threshold q̂. For the test set, any answer option with a probability exceeding q̂ is selected as a prediction.
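The calibration step can be pictured with a generic split-conformal sketch; this assumes the common nonconformity score 1 - p(correct option) and is not the exact code of the LLM Uncertainty Bench framework:

import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # cal_probs: (n, k) option probabilities for calibration questions
    # cal_labels: (n,) index of the correct option for each calibration question
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity of the true answer
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, q_hat):
    # keep every option whose nonconformity (1 - probability) stays within the threshold
    return [np.where(1.0 - p <= q_hat)[0].tolist() for p in test_probs]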
Complexity. We created OptionsPert, which extends each single-correct-answer MCQA question to have 10 options. The new choices are all distractors, and the option order is randomized. The purpose of this dataset is to reduce the likelihood of random guessing and to enable systematic evaluation of model robustness under increased ambiguity.
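As an illustration only, padding a question to 10 options while tracking the correct answer might look like the following (distractor_pool is a placeholder for asset-specific incorrect options; the released OptionsPert split was built with our pipeline, not this snippet):

import random

def options_pert(question, options, answer, distractor_pool, n_options=10, seed=0):
    # options: original option texts; answer: the correct option text
    # distractor_pool: additional incorrect options used to pad the question
    rng = random.Random(seed)
    extra = rng.sample([d for d in distractor_pool if d not in options], n_options - len(options))
    padded = options + extra
    rng.shuffle(padded)  # randomize the option order
    labels = [chr(ord("A") + i) for i in range(n_options)]
    return question, dict(zip(labels, padded)), labels[padded.index(answer)]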
The kaggle folder contains the 3 dataset experiments we performed to evaluate the model's feature suggestions. It also contains LLMFeatureSelector, an sklearn pipeline for feature selection that supports Hugging Face models.
To load the single_true_multi_choice_qa and multi_true_multi_choice_qa subsets from HuggingFace:
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds_sc = load_dataset("ibm-research/FailureSensorIQ", "single_true_multi_choice_qa")
ds_mc = load_dataset("ibm-research/FailureSensorIQ", "multi_true_multi_choice_qa")
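As a quick sanity check after loading, you can print the splits and inspect one record (split names are whatever the returned DatasetDict exposes):

print(ds_sc)  # shows available splits and row counts
first_split = next(iter(ds_sc))
print(ds_sc[first_split][0])  # one question record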
To load the SimplePert, ComplexPert, and OptionsPert variants of the dataset, check out load_dataset.ipynb.
Tested with python 3.10.4
Clone repo and submodules, create a conda env
git clone --recurse-submodules https://github.com/IBM/FailureSensorIQ.git
cd FailureSensorIQ
# Optional
conda create -n failuresensoriq python=3.10.4
conda activate failuresensoriq
Install requirements
pip install vllm==0.8.5.post1
pip install -r requirements.txt
Run evaluation pipeline
python run_eval.py <hf-model-id> full
full refers to evaluating on the full dataset. You can pass sample instead to run on a few examples first and make sure everything works as intended before evaluating on the full dataset.
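For example, to do a quick check with an open model (the model id below is only illustrative; any Hugging Face model id supported by vLLM should work):

python run_eval.py Qwen/Qwen2.5-7B-Instruct sample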
If no argument is given, the code will fetch all pending models for evaluation from Hugging Face and run them on the full dataset.
If everything ran successfully, you should see the performance metrics under results/demo-leaderboard/gpt2-demo/results_<model-name>.json
We tested the evaluation pipeline on an A100 80GB GPU. The hardware requirements depend on the model you choose to evaluate.
all CUDA-capable devices are busy or unavailable
If the execution crashes or is interrupted before it finishes, the vLLM child process may still be running and occupying GPU memory. We use the following to kill any jobs occupying GPU memory:
nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -n1 kill -9
If you use our dataset in your paper, please cite it as follows:
@inproceedings{
constantinides2025failuresensoriq,
title={FailureSensor{IQ}: A Multi-Choice {QA} Dataset for Understanding Sensor Relationships and Failure Modes},
author={Christodoulos Constantinides and Dhaval C Patel and Shuxin Lin and Claudio Guerrero and SUNIL DAGAJIRAO PATIL and Jayant Kalagnanam},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025},
url={https://openreview.net/forum?id=9KfkMAy2ut}
}