RTI-Bench: A Structured Dataset for Indian RTI Decision Analysis

RTI-Bench is the first structured dataset of Central Information Commission (CIC) decisions under India's Right to Information Act, 2005. It supports research in legal NLP, civic AI, and AI-assisted access to justice.

1,516 cases · 82.8% of primary decisions labelled · 5 commissioners · 3 document format generations · No LLM annotation

Dataset

The dataset is hosted on HuggingFace: https://huggingface.co/datasets/joyboseroy/rti-bench

It contains two files:

File	Records	Description
`hf_annotated.csv`	1,218	Annotated instruction-response corpus with outcome labels, exemptions, public authority
`cic_combined.csv` / `.jsonl`	298	Structured CIC PDF decisions with IRAC components, timelines, commissioner info

Outcome Labels

INFORMATION_DIRECTED · APPEAL_DISMISSED · PENALTY_IMPOSED · PARTIAL_RELIEF · COMPLAINT_S18 · REMANDED · WITHDRAWN · UNKNOWN
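A minimal sketch of how the label set can be validated and the labelled share computed. The label names are copied from the list above; the function name `label_coverage` is hypothetical and not part of the repository.

```python
from collections import Counter

# Label names copied from the README; "UNKNOWN" marks cases without a resolved outcome.
OUTCOME_LABELS = (
    "INFORMATION_DIRECTED", "APPEAL_DISMISSED", "PENALTY_IMPOSED", "PARTIAL_RELIEF",
    "COMPLAINT_S18", "REMANDED", "WITHDRAWN", "UNKNOWN",
)


def label_coverage(labels):
    """Return (share of labels that are not UNKNOWN, per-label counts)."""
    labels = list(labels)
    unexpected = set(labels) - set(OUTCOME_LABELS)
    if unexpected:
        raise ValueError(f"unexpected outcome labels: {sorted(unexpected)}")
    counts = Counter(labels)
    total = len(labels)
    labelled = total - counts["UNKNOWN"]
    return (labelled / total if total else 0.0), counts
```

Passing the outcome column of either data file to `label_coverage` gives the labelled share for that subset.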

Benchmark Tasks

Outcome Prediction — predict Commission outcome from background narrative (macro-F1)
Exemption Classification — identify RTI Act sections invoked (multi-label, micro-F1)
Compliance Outcome Prediction — predict adjunct compliance ruling from original directive
Plain-Language Summarisation — generate citizen-accessible decision summary (ROUGE-L, BERTScore)
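The two classification metrics named above can be sketched in plain Python. This is an illustrative implementation, not the repository's evaluation script: it assumes macro-F1 averages over every class seen in the gold or predicted labels, and that micro-F1 pools counts across all cases and sections.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (single-label outcome prediction)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def micro_f1(true_sets, pred_sets):
    """F1 over pooled counts (multi-label exemption classification)."""
    tp = fp = fn = 0
    for gold, pred in zip(true_sets, pred_sets):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # sections correctly predicted
        fp += len(pred - gold)   # sections predicted but not cited
        fn += len(gold - pred)   # sections cited but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Macro-F1 weights rare outcomes such as PENALTY_IMPOSED equally with common ones, which matters when the label distribution is skewed.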

Repository Structure

rti-bench/
├── pipeline/
│   ├── rti_pipeline.py          # PDF text extraction + rule-based doc type classification
│   ├── extract_hf_regex.py      # Rule-based field extraction from HF instruction-response corpus
│   ├── extract_cic_pdfs_v4.py   # Format-aware IRAC extractor for CIC PDFs (all 3 formats)
│   ├── fix_outcomes_v2.py       # Outcome label refinement pass
│   └── check_hf_dataset.py      # HuggingFace dataset inspection utility
├── data/
│   └── (see HuggingFace — files too large for GitHub)
├── paper/
│   └── RTI_Bench_arxiv.pdf      # ArXiv paper
└── README.md

Quickstart

Requirements

pip install pymupdf pandas requests tqdm python-dotenv datasets

1. Annotate the HuggingFace instruction-response corpus

Downloads jatinmehra/RTI-CASE-DATASET and extracts structured fields using regex. Runs in ~30 seconds for all 1,218 rows.

python pipeline/extract_hf_regex.py
# Output: hf_regex_annotated.jsonl, hf_regex_annotated.csv

2. Extract text from CIC PDFs

Collect PDFs from https://dsscic.nic.in (manual download with CAPTCHA), place them in a folder, then run:

python pipeline/rti_pipeline.py \
  --pdf_dir ~/your_pdfs/ \
  --output_dir ~/rti_output/ \
  --no_llm

This extracts text using PyMuPDF and does rule-based document type classification. Runs in ~15 seconds for 300 PDFs.

3. Extract structured fields from CIC PDFs

Handles all three document format generations automatically (2023a, 2023b, 2026):

python pipeline/extract_cic_pdfs_v4.py \
  --text_dir ~/rti_output/raw_text \
  --inv ~/rti_output/corpus_inventory.csv \
  --output_dir ~/rti_output/ \
  --tag batch1

Output includes: case number, commissioner, public authority, IRAC components (issue, rules cited, application, conclusion), procedural timeline, exemptions, and outcome label.
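As a rough illustration of the output shape, the record below mirrors the fields listed above. The attribute names are hypothetical (check the actual column headers in `cic_combined.jsonl`), and the procedural timeline is omitted because its structure is not described here.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CicDecision:
    # Hypothetical field names; the real file may use different keys.
    case_number: str
    commissioner: Optional[str] = None
    public_authority: Optional[str] = None
    issue: Optional[str] = None
    rules_cited: list = field(default_factory=list)
    application: Optional[str] = None
    conclusion: Optional[str] = None
    exemptions: list = field(default_factory=list)
    outcome: str = "UNKNOWN"


def parse_jsonl(text):
    """Parse JSON Lines text into CicDecision records, ignoring unknown keys."""
    known = set(CicDecision.__dataclass_fields__)
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        records.append(CicDecision(**{k: v for k, v in row.items() if k in known}))
    return records
```

Defaulting `outcome` to UNKNOWN keeps unresolved cases explicit instead of dropping them.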

Document Format Generations

A key finding of this work is that CIC decisions have evolved across three document templates:

Format	n	Identifier	Key characteristics
2023a	111	`O R D E R` heading	Separate party lines, bilingual Hindi-English headers, `Decision:` conclusion
2023b	21	`Observations:` section	`Date of Decision` in header, separate Observations + Decision sections
2026	166	`INFORMATION COMMISSIONER :` label	Inline appellant name, `DECISION` all-caps block, slash-separated dates

The pipeline auto-detects format from structural signals and applies the appropriate extractor.
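A simplified sketch of format detection based only on the identifier strings in the table above. The check order, the exact regular expressions, and the `"unknown"` fallback are assumptions of this example; the extractor in `extract_cic_pdfs_v4.py` may use additional signals.

```python
import re

# Identifier strings come from the table; the ordering is an assumption.
FORMAT_SIGNALS = [
    ("2026", re.compile(r"INFORMATION COMMISSIONER\s*:")),
    ("2023b", re.compile(r"^\s*Observations:", re.MULTILINE)),
    ("2023a", re.compile(r"^\s*O R D E R\s*$", re.MULTILINE)),
]


def detect_format(text):
    """Return the first format whose identifier appears in the text."""
    for name, pattern in FORMAT_SIGNALS:
        if pattern.search(text):
            return name
    return "unknown"
```

For example, `detect_format("O R D E R\nDecision: ...")` returns `"2023a"`, while text with none of the three identifiers returns `"unknown"`.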

Pipeline Design Decisions

No LLM annotation. The entire pipeline uses deterministic regex and pattern matching. This was a deliberate choice:

CIC documents follow consistent administrative templates — regular expressions are sufficient
Full reproducibility: same input always produces same output
Runs in seconds, not hours — no API costs, no rate limits
No hallucination risk in a high-stakes legal domain

Two-stage outcome extraction. Rule-based classification of the conclusion text uses a priority-ordered pattern set, and a refinement pass (fix_outcomes_v2.py) then revisits the initial labels. UNKNOWN labels (17.2% of primary decisions) are retained rather than forced into incorrect categories.
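The idea of a priority-ordered pattern set can be sketched as follows. The patterns and their order are hypothetical and chosen only for illustration; they are not the repository's actual rules and do not handle negation (for example "no penalty is imposed").

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical rules: earlier entries take priority over later ones.
OUTCOME_RULES = [
    ("PENALTY_IMPOSED", re.compile(r"\bpenalty\b.*\bimposed\b", FLAGS)),
    ("WITHDRAWN", re.compile(r"\bwithdr(?:awn|ew)\b", FLAGS)),
    ("REMANDED", re.compile(r"\bremand(?:ed)?\b", FLAGS)),
    ("INFORMATION_DIRECTED", re.compile(r"\bdirect(?:ed|s)?\b.*\b(?:provide|furnish)\b", FLAGS)),
    ("APPEAL_DISMISSED", re.compile(r"\bdismissed\b", FLAGS)),
]


def classify_outcome(conclusion_text):
    """Return the label of the first matching rule, or UNKNOWN if none match."""
    for label, pattern in OUTCOME_RULES:
        if pattern.search(conclusion_text):
            return label
    return "UNKNOWN"
```

Because the rules are checked in order, a conclusion that both imposes a penalty and directs disclosure is labelled PENALTY_IMPOSED in this sketch, and text matching no rule stays UNKNOWN.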

Dataset Statistics

Metric	Value
Total cases	1,516
Primary decisions	1,457
Labelled outcomes	1,206 (82.8% of primary decisions)
Exemption citations	467
Unique exemption sections	10
Commissioners represented	5
Document format generations	3
Years covered	2017–2026

Top exemptions: 8(1)(j) personal information (158) · 8(1)(d) commercial confidence (77) · 8(1)(e) fiduciary (76) · 8(1)(h) investigation (71)
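The percentages above follow from the counts reported in this README. The short check below reproduces them; note that the labelled share is computed against primary decisions (1,457), not total cases.

```python
# All counts are taken from the tables in this README.
file_records = {"hf_annotated.csv": 1218, "cic_combined.csv": 298}
format_counts = {"2023a": 111, "2023b": 21, "2026": 166}
primary_decisions = 1457
labelled = 1206
exemption_citations = 467
top_four = {"8(1)(j)": 158, "8(1)(d)": 77, "8(1)(e)": 76, "8(1)(h)": 71}

labelled_share = labelled / primary_decisions
unknown_share = 1 - labelled_share
top_four_share = sum(top_four.values()) / exemption_citations

print(sum(file_records.values()) == 1516)                            # True
print(sum(format_counts.values()) == file_records["cic_combined.csv"])  # True
print(f"labelled: {labelled_share:.1%}")   # labelled: 82.8%
print(f"unknown: {unknown_share:.1%}")     # unknown: 17.2%
print(f"top four: {top_four_share:.1%}")   # top four: 81.8%
```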

Limitations

17.2% of primary decisions carry UNKNOWN outcome labels; the share is higher (49%) in the CIC PDF subset
Appellant name field has low coverage (<25%) due to format variation; it is not needed for the benchmark tasks
CIC PDF decisions are concentrated among 5 commissioners from the CIC portal batch
Does not include State Information Commission decisions
Exemption extraction requires an explicit section citation in the text, so exemptions invoked without a section number are missed
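The last limitation can be illustrated with a small citation matcher. This sketch covers only clauses of Section 8(1) and is an assumed pattern, not the one used in the pipeline.

```python
import re

# Matches "8(1)(j)", "8 (1) (d)" and similar; clauses (a) to (j) of Section 8(1) only.
EXEMPTION_RE = re.compile(r"\b8\s*\(\s*1\s*\)\s*\(\s*([a-j])\s*\)", re.IGNORECASE)


def extract_exemptions(text):
    """Return cited Section 8(1) clauses in order of first appearance, without duplicates."""
    found = []
    for match in EXEMPTION_RE.finditer(text):
        section = f"8(1)({match.group(1).lower()})"
        if section not in found:
            found.append(section)
    return found
```

A decision that says only "the information is personal in nature" yields an empty list, even though it is invoking the 8(1)(j) personal information exemption in substance.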

Citation

If you use RTI-Bench in your research, please cite:

@misc{bose2025rtibench,
  title     = {RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis and Outcome Prediction},
  author    = {Bose, Joy},
  year      = {2025},
  eprint    = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url       = {https://huggingface.co/datasets/joyboseroy/rti-bench}
}

License

Code: MIT License

Dataset: CC BY 4.0. Source data from the CIC portal is public domain under Indian government open data policy. The HuggingFace instruction-response subset is derived from jatinmehra/RTI-CASE-DATASET (original license applies to that subset).

Acknowledgements

Source A (the instruction-response corpus) is derived from jatinmehra/RTI-CASE-DATASET on HuggingFace. Source B (the CIC PDF decisions) was collected from the Central Information Commission portal (dsscic.nic.in), which publishes decisions as public records under the RTI Act 2005.

This work is motivated by the belief that AI should help democratise access to justice — not replace it.
