RTI-Bench is the first structured dataset of Central Information Commission (CIC) decisions under India's Right to Information Act, 2005. It supports research in legal NLP, civic AI, and AI-assisted access to justice.
1,516 cases · 82.8% of primary decisions labelled · 5 commissioners · 3 document format generations · No LLM annotation
The dataset is hosted on HuggingFace: https://huggingface.co/datasets/joyboseroy/rti-bench
It contains two files:
| File | Records | Description |
|---|---|---|
| `hf_annotated.csv` | 1,218 | Annotated instruction-response corpus with outcome labels, exemptions, public authority |
| `cic_combined.csv` / `.jsonl` | 298 | Structured CIC PDF decisions with IRAC components, timelines, commissioner info |
Outcome labels: INFORMATION_DIRECTED · APPEAL_DISMISSED · PENALTY_IMPOSED · PARTIAL_RELIEF · COMPLAINT_S18 · REMANDED · WITHDRAWN · UNKNOWN
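As a quick sanity check after downloading, the sketch below tallies outcome labels in a local copy of the JSONL file and flags any value outside the eight labels listed above. The key name `outcome` and the helper `tally_outcomes` are assumptions made for illustration; check the actual column names in the downloaded files.

```python
import json
from collections import Counter

# The eight outcome labels listed above.
OUTCOME_LABELS = {
    "INFORMATION_DIRECTED", "APPEAL_DISMISSED", "PENALTY_IMPOSED",
    "PARTIAL_RELIEF", "COMPLAINT_S18", "REMANDED", "WITHDRAWN", "UNKNOWN",
}


def tally_outcomes(jsonl_path, field="outcome"):
    """Count labels in a JSONL file; `field` is an assumed key name."""
    counts, unexpected = Counter(), Counter()
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            label = json.loads(line).get(field)
            if label in OUTCOME_LABELS:
                counts[label] += 1
            else:
                unexpected[label] += 1  # missing key is counted under None
    return counts, unexpected
```

A non-empty `unexpected` counter usually means the key name differs from the one assumed here, or that the file uses a label set other than the one above.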
Benchmark tasks:
- Outcome Prediction: predict Commission outcome from background narrative (macro-F1)
- Exemption Classification: identify RTI Act sections invoked (multi-label, micro-F1)
- Compliance Outcome Prediction: predict adjunct compliance ruling from original directive
- Plain-Language Summarisation: generate citizen-accessible decision summary (ROUGE-L, BERTScore)
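The two classification metrics named above differ in how they average. The following is a minimal pure-Python sketch, not the benchmark's official scorer: `macro_f1` averages per-class F1 over the labels that occur in either the gold or predicted lists (whether the benchmark averages over observed labels or all eight is an assumption), and `micro_f1` pools true positives, false positives and false negatives across all cases for the multi-label task.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Single-label macro-F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # gold class gets a false negative
    scores = []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        scores.append(2 * tp[lab] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def micro_f1(true_sets, pred_sets):
    """Multi-label micro-F1: pool TP/FP/FN over all cases and labels."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        t, p = set(t), set(p)
        tp += len(t & p)
        fp += len(p - t)
        fn += len(t - p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Macro averaging weights rare outcomes such as PENALTY_IMPOSED equally with frequent ones, which is why it suits an imbalanced outcome label set; micro averaging is dominated by the most frequently cited exemption sections.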
rti-bench/
├── pipeline/
│ ├── rti_pipeline.py # PDF text extraction + rule-based doc type classification
│ ├── extract_hf_regex.py # Rule-based field extraction from HF instruction-response corpus
│ ├── extract_cic_pdfs_v4.py # Format-aware IRAC extractor for CIC PDFs (all 3 formats)
│ ├── fix_outcomes_v2.py # Outcome label refinement pass
│ └── check_hf_dataset.py # HuggingFace dataset inspection utility
├── data/
│ └── (see HuggingFace — files too large for GitHub)
├── paper/
│ └── RTI_Bench_arxiv.pdf # ArXiv paper
└── README.md
pip install pymupdf pandas requests tqdm python-dotenv datasets

Downloads jatinmehra/RTI-CASE-DATASET and extracts structured fields using regex. Runs in ~30 seconds for all 1,218 rows.
python pipeline/extract_hf_regex.py
# Output: hf_regex_annotated.jsonl, hf_regex_annotated.csv

Collect PDFs from https://dsscic.nic.in (manual download with CAPTCHA), place them in a folder, then:
python pipeline/rti_pipeline.py \
--pdf_dir ~/your_pdfs/ \
--output_dir ~/rti_output/ \
--no_llm

This extracts text using PyMuPDF and performs rule-based document type classification. Runs in ~15 seconds for 300 PDFs.
Handles all three document format generations automatically (2023a, 2023b, 2026):
python pipeline/extract_cic_pdfs_v4.py \
--text_dir ~/rti_output/raw_text \
--inv ~/rti_output/corpus_inventory.csv \
--output_dir ~/rti_output/ \
--tag batch1

Output includes: case number, commissioner, public authority, IRAC components (issue, application, rules cited, conclusion), procedural timeline, exemptions, outcome label.
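To make the shape of one output record concrete, here is a hedged sketch of how the fields listed above could be modelled and written as a JSONL line. Every field name, the `CaseRecord` class and the `to_jsonl_line` helper are hypothetical; the extractor's actual keys may differ.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CaseRecord:
    # All field names are illustrative, not the extractor's actual keys.
    case_number: str
    commissioner: str = ""
    public_authority: str = ""
    issue: str = ""                                   # IRAC: issue
    rules_cited: list = field(default_factory=list)   # IRAC: rules
    application: str = ""                             # IRAC: application
    conclusion: str = ""                              # IRAC: conclusion
    timeline: dict = field(default_factory=dict)      # procedural dates
    exemptions: list = field(default_factory=list)
    outcome: str = "UNKNOWN"


def to_jsonl_line(record):
    """Serialise one record as a single JSON line, keeping non-ASCII text."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Defaulting `outcome` to `UNKNOWN` mirrors the dataset's choice to keep unresolved cases rather than guess a label.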
A key finding of this work is that CIC decisions have evolved across three document templates:
| Format | n | Identifier | Key characteristics |
|---|---|---|---|
| 2023a | 111 | `O R D E R` heading | Separate party lines, bilingual Hindi-English headers, `Decision:` conclusion |
| 2023b | 21 | `Observations:` section | Date of Decision in header, separate Observations + Decision sections |
| 2026 | 166 | `INFORMATION COMMISSIONER :` label | Inline appellant name, `DECISION` all-caps block, slash-separated dates |
The pipeline auto-detects format from structural signals and applies the appropriate extractor.
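Format detection of this kind can be sketched as an ordered list of regex signals. The identifiers come from the table above, but the exact regexes, the order in which they are tried, and the function name `detect_format` are assumptions for illustration rather than the logic in `extract_cic_pdfs_v4.py`.

```python
import re

# Structural signals from the format table. The regexes and their
# priority order are assumptions made for this sketch.
FORMAT_SIGNALS = [
    ("2026", re.compile(r"INFORMATION\s+COMMISSIONER\s*:")),
    ("2023b", re.compile(r"^\s*Observations\s*:", re.MULTILINE)),
    ("2023a", re.compile(r"^\s*O\s+R\s+D\s+E\s+R\s*$", re.MULTILINE)),
]


def detect_format(text):
    """Return the first format whose signal matches, or None."""
    for name, pattern in FORMAT_SIGNALS:
        if pattern.search(text):
            return name
    return None
```

Returning `None` for unrecognised layouts lets a caller route such documents to manual review instead of applying the wrong extractor.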
No LLM annotation. The entire pipeline uses deterministic regex and pattern matching. This was a deliberate choice:
- CIC documents follow consistent administrative templates — regular expressions are sufficient
- Full reproducibility: same input always produces same output
- Runs in seconds, not hours — no API costs, no rate limits
- No hallucination risk in a high-stakes legal domain
Two-stage outcome extraction. Rule-based classification of conclusion text uses a priority-ordered pattern set. UNKNOWN labels (17.2% of primary decisions) are retained as-is rather than forced into incorrect categories.
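A priority-ordered pattern set with an UNKNOWN fallback can be sketched as follows. The regexes and their ordering are illustrative guesses, not the patterns shipped in the pipeline, and `classify_outcome` is a hypothetical name; only a subset of the labels is covered.

```python
import re

# Highest-priority patterns first. These regexes are illustrative guesses,
# not the pipeline's actual pattern set.
OUTCOME_PATTERNS = [
    ("PENALTY_IMPOSED", re.compile(r"\bpenalty\b.*\bimposed\b", re.I | re.S)),
    ("WITHDRAWN", re.compile(r"\bwithdrawn\b", re.I)),
    ("REMANDED", re.compile(r"\bremanded\b", re.I)),
    ("INFORMATION_DIRECTED",
     re.compile(r"\bdirected\s+to\s+(?:provide|furnish)\b", re.I)),
    ("APPEAL_DISMISSED", re.compile(r"\bappeal\b.*\bdismissed\b", re.I | re.S)),
]


def classify_outcome(conclusion):
    """Return the first matching label; fall back to UNKNOWN."""
    text = conclusion or ""
    for label, pattern in OUTCOME_PATTERNS:
        if pattern.search(text):
            return label
    return "UNKNOWN"
```

Because the first match wins, a conclusion that both imposes a penalty and directs disclosure is labelled by whichever pattern is listed earlier, so the ordering itself encodes a labelling policy.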
| Metric | Value |
|---|---|
| Total cases | 1,516 |
| Primary decisions | 1,457 |
| Labelled outcomes | 1,206 (82.8%) |
| Exemption citations | 467 |
| Unique exemption sections | 10 |
| Commissioners represented | 5 |
| Document format generations | 3 |
| Years covered | 2017–2026 |
Top exemptions: 8(1)(j) personal information (158) · 8(1)(d) commercial confidence (77) · 8(1)(e) fiduciary (76) · 8(1)(h) investigation (71)
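The percentages in the table use primary decisions, not total cases, as the denominator. A short check of that arithmetic, using only figures quoted in this README (`label_coverage` is a helper named here for illustration):

```python
def label_coverage(labelled, primary):
    """Share of primary decisions that carry a non-UNKNOWN outcome label."""
    return labelled / primary


coverage = label_coverage(1206, 1457)
print(f"labelled: {coverage:.1%}, UNKNOWN: {1 - coverage:.1%}")
# labelled: 82.8%, UNKNOWN: 17.2%

# The two source files add up to the total case count.
assert 1218 + 298 == 1516
# The three format generations add up to the CIC PDF subset.
assert 111 + 21 + 166 == 298
```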
- 17.2% of primary cases carry `UNKNOWN` outcome labels; higher (49%) in the CIC PDF subset
- Appellant name field has low coverage (<25%) due to format variation; not needed for benchmark tasks
- Decisions are concentrated among 5 commissioners from the CIC portal batch
- Does not include State Information Commission decisions
- Exemption extraction requires explicit section citation in text
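The last limitation follows directly from citation-based extraction. The sketch below matches only explicit Section 8(1) clause citations, so a decision that discusses personal information without citing the clause yields nothing. The regex and the name `extract_exemptions` are assumptions for illustration, not the pipeline's actual pattern.

```python
import re

# Matches explicit clause citations such as "8(1)(j)" or "8 (1) (e)".
# Section 8(1) of the RTI Act has clauses (a) through (j).
SECTION_RE = re.compile(r"\b8\s*\(\s*1\s*\)\s*\(\s*([a-j])\s*\)", re.IGNORECASE)


def extract_exemptions(text):
    """Return cited Section 8(1) clauses in order of first appearance."""
    found = []
    for match in SECTION_RE.finditer(text):
        section = f"8(1)({match.group(1).lower()})"
        if section not in found:  # de-duplicate, keep first-seen order
            found.append(section)
    return found
```

For example, `extract_exemptions("personal information was withheld")` returns an empty list even though the reasoning is plainly the personal-information exemption.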
If you use RTI-Bench in your research, please cite:
@misc{bose2025rtibench,
title = {RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis and Outcome Prediction},
author = {Bose, Joy},
year = {2025},
eprint = {XXXX.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://huggingface.co/datasets/joyboseroy/rti-bench}
}

Code: MIT License
Dataset: CC BY 4.0. Source data from the CIC portal is public domain under Indian government open data policy. The HuggingFace instruction-response subset is derived from jatinmehra/RTI-CASE-DATASET (original license applies to that subset).
Source A is derived from jatinmehra/RTI-CASE-DATASET on HuggingFace. Source B was collected from the Central Information Commission portal (dsscic.nic.in), which publishes decisions as public records under the RTI Act, 2005.
This work is motivated by the belief that AI should help democratise access to justice — not replace it.