
RTI-Bench: A Structured Dataset for Indian RTI Decision Analysis

Dataset on HuggingFace · License: CC BY 4.0

RTI-Bench is the first structured dataset of Central Information Commission (CIC) decisions under India's Right to Information Act, 2005. It supports research in legal NLP, civic AI, and AI-assisted access to justice.

1,516 cases · 82.8% labelled · 5 commissioners · 3 document format generations · No LLM annotation


Dataset

The dataset is hosted on HuggingFace: https://huggingface.co/datasets/joyboseroy/rti-bench

It contains two files:

| File | Records | Description |
|---|---|---|
| hf_annotated.csv | 1,218 | Annotated instruction-response corpus with outcome labels, exemptions, public authority |
| cic_combined.csv / .jsonl | 298 | Structured CIC PDF decisions with IRAC components, timelines, commissioner info |

Outcome Labels

INFORMATION_DIRECTED · APPEAL_DISMISSED · PENALTY_IMPOSED · PARTIAL_RELIEF · COMPLAINT_S18 · REMANDED · WITHDRAWN · UNKNOWN

Benchmark Tasks

  1. Outcome Prediction — predict Commission outcome from background narrative (macro-F1)
  2. Exemption Classification — identify RTI Act sections invoked (multi-label, micro-F1)
  3. Compliance Outcome Prediction — predict adjunct compliance ruling from original directive
  4. Plain-Language Summarisation — generate citizen-accessible decision summary (ROUGE-L, BERTScore)
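
Tasks 1 and 2 name two different F1 variants. The following is a minimal pure-Python sketch of how they differ, assuming single-label outcome predictions and set-valued exemption predictions. The function names and toy labels are illustrative assumptions, not the repository's evaluation code.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Single-label macro-F1: per-class F1, averaged with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)

def micro_f1(true_sets, pred_sets):
    """Multi-label micro-F1: pool TP/FP/FN over all cases and labels."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        tp += len(t & p)
        fp += len(p - t)
        fn += len(t - p)
    return f1(tp, fp, fn)

# Toy data (hypothetical, for illustration only).
y_true = ["INFORMATION_DIRECTED", "APPEAL_DISMISSED", "APPEAL_DISMISSED", "REMANDED"]
y_pred = ["INFORMATION_DIRECTED", "APPEAL_DISMISSED", "INFORMATION_DIRECTED", "REMANDED"]
outcome_score = macro_f1(y_true, y_pred)

true_sets = [{"8(1)(j)"}, {"8(1)(d)", "8(1)(e)"}, set()]
pred_sets = [{"8(1)(j)"}, {"8(1)(d)"}, {"8(1)(h)"}]
exemption_score = micro_f1(true_sets, pred_sets)
```

Macro averaging gives rare outcome classes the same weight as common ones, whereas micro averaging pools every individual exemption citation into one count.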

Repository Structure

rti-bench/
├── pipeline/
│   ├── rti_pipeline.py          # PDF text extraction + rule-based doc type classification
│   ├── extract_hf_regex.py      # Rule-based field extraction from HF instruction-response corpus
│   ├── extract_cic_pdfs_v4.py   # Format-aware IRAC extractor for CIC PDFs (all 3 formats)
│   ├── fix_outcomes_v2.py       # Outcome label refinement pass
│   └── check_hf_dataset.py      # HuggingFace dataset inspection utility
├── data/
│   └── (see HuggingFace — files too large for GitHub)
├── paper/
│   └── RTI_Bench_arxiv.pdf      # ArXiv paper
└── README.md

Quickstart

Requirements

pip install pymupdf pandas requests tqdm python-dotenv datasets

1. Annotate the HuggingFace instruction-response corpus

Downloads jatinmehra/RTI-CASE-DATASET and extracts structured fields using regex. Runs in ~30 seconds for all 1,218 rows.

python pipeline/extract_hf_regex.py
# Output: hf_regex_annotated.jsonl, hf_regex_annotated.csv
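
As a quick sanity check on the JSONL output, the label distribution can be tallied with the standard library alone. This is a sketch: the field name `outcome` is an assumption (inspect the actual keys in your output file first), and the in-memory sample stands in for the real file.

```python
import io
import json
from collections import Counter

def outcome_counts(lines, field="outcome"):
    """Count outcome labels in an iterable of JSONL lines.

    `field` is a hypothetical key name; records lacking it count as UNKNOWN.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get(field, "UNKNOWN")] += 1
    return counts

# Hypothetical sample records, standing in for hf_regex_annotated.jsonl.
sample = io.StringIO(
    '{"outcome": "APPEAL_DISMISSED"}\n'
    '{"outcome": "INFORMATION_DIRECTED"}\n'
    '{"outcome": "APPEAL_DISMISSED"}\n'
    '{}\n'
)
counts = outcome_counts(sample)
```

To run it on real output, pass `open("hf_regex_annotated.jsonl", encoding="utf-8")` in place of `sample`.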

2. Extract text from CIC PDFs

Collect PDFs from https://dsscic.nic.in (manual download with CAPTCHA), place them in a folder, then run:

python pipeline/rti_pipeline.py \
  --pdf_dir ~/your_pdfs/ \
  --output_dir ~/rti_output/ \
  --no_llm

This extracts text using PyMuPDF and performs rule-based document type classification. It runs in about 15 seconds for 300 PDFs.

3. Extract structured fields from CIC PDFs

Handles all three document format generations automatically (2023a, 2023b, 2026):

python pipeline/extract_cic_pdfs_v4.py \
  --text_dir ~/rti_output/raw_text \
  --inv ~/rti_output/corpus_inventory.csv \
  --output_dir ~/rti_output/ \
  --tag batch1

Output includes: case number, commissioner, public authority, IRAC components (issue, application, rules cited, conclusion), procedural timeline, exemptions, outcome label.


Document Format Generations

A key finding of this work is that CIC decisions have evolved across three document templates:

| Format | n | Identifier | Key characteristics |
|---|---|---|---|
| 2023a | 111 | O R D E R heading | Separate party lines, bilingual Hindi-English headers, Decision: conclusion |
| 2023b | 21 | Observations: section | Date of Decision in header, separate Observations + Decision sections |
| 2026 | 166 | INFORMATION COMMISSIONER : label | Inline appellant name, DECISION all-caps block, slash-separated dates |

The pipeline auto-detects format from structural signals and applies the appropriate extractor.
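
To make the idea of detection from structural signals concrete, here is a minimal sketch keyed on the identifiers listed in the table above. The regexes, the checking order, and the `detect_format` name are illustrative assumptions; the actual logic in extract_cic_pdfs_v4.py may differ.

```python
import re

# Ordered (format, pattern) pairs built from the Identifier column above.
# Both the patterns and the order are guesses made for illustration.
FORMAT_SIGNALS = [
    ("2026", re.compile(r"INFORMATION\s+COMMISSIONER\s*:")),
    ("2023b", re.compile(r"^\s*Observations\s*:", re.MULTILINE)),
    ("2023a", re.compile(r"^\s*O\s+R\s+D\s+E\s+R\s*$", re.MULTILINE)),
]

def detect_format(text):
    """Return the first format whose signal appears in the text, else None."""
    for name, pattern in FORMAT_SIGNALS:
        if pattern.search(text):
            return name
    return None

# Hypothetical text snippets, one per format.
sample_2023a = "CENTRAL INFORMATION COMMISSION\nO R D E R\nDecision: appeal dismissed"
sample_2023b = "Date of Decision: 01.02.2023\nObservations:\nThe CPIO replied.\nDecision:\n"
sample_2026 = "INFORMATION COMMISSIONER : (name)\nDECISION\n"
```

Returning `None` for unrecognised text lets a caller skip or flag the document instead of applying the wrong extractor.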


Pipeline Design Decisions

No LLM annotation. The entire pipeline uses deterministic regex and pattern matching. This was a deliberate choice:

  • CIC documents follow consistent administrative templates — regular expressions are sufficient
  • Full reproducibility: same input always produces same output
  • Runs in seconds, not hours — no API costs, no rate limits
  • No hallucination risk in a high-stakes legal domain

Two-stage outcome extraction. Rule-based classification of the conclusion text uses a priority-ordered pattern set, so when several patterns match, the highest-priority one determines the label. UNKNOWN labels (17.2% of primary decisions) are retained rather than forced into incorrect categories.
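
A priority-ordered pattern set with an UNKNOWN fallback can be sketched as follows. This illustrates only the matching idea, not the repository's rules: the regexes, their ordering, and the `classify_outcome` name are all hypothetical.

```python
import re

# Priority-ordered rules: earlier entries win when several patterns match.
# The ordering and the regexes are hypothetical examples.
OUTCOME_RULES = [
    ("PENALTY_IMPOSED", re.compile(r"\bpenalty\b.*\b(?:imposed|levied)\b", re.I | re.S)),
    ("WITHDRAWN", re.compile(r"\bwithdr(?:awn?|ew)\b", re.I)),
    ("REMANDED", re.compile(r"\bremand(?:ed|s)?\b", re.I)),
    ("INFORMATION_DIRECTED", re.compile(r"\bdirect(?:ed|s)?\b.*\b(?:provide|furnish)\b", re.I | re.S)),
    ("APPEAL_DISMISSED", re.compile(r"\bdismissed\b", re.I)),
]

def classify_outcome(conclusion):
    """Return the label of the first matching rule, or UNKNOWN if none match."""
    for label, pattern in OUTCOME_RULES:
        if pattern.search(conclusion):
            return label
    return "UNKNOWN"

# Hypothetical conclusion texts.
directed = "The CPIO is directed to furnish the information within 15 days."
penalty = "A penalty of Rs. 25,000 is imposed on the CPIO, who is also directed to provide the records."
dismissed = "The appeal is dismissed."
unclear = "The matter was heard and orders reserved."
```

In the `penalty` example both the penalty and the direction patterns match; the rule order resolves the tie. Text matching no rule falls through to UNKNOWN.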


Dataset Statistics

| Metric | Value |
|---|---|
| Total cases | 1,516 |
| Primary decisions | 1,457 |
| Labelled outcomes | 1,206 (82.8%) |
| Exemption citations | 467 |
| Unique exemption sections | 10 |
| Commissioners represented | 5 |
| Document format generations | 3 |
| Years covered | 2017–2026 |

Top exemptions: 8(1)(j) personal information (158) · 8(1)(d) commercial confidence (77) · 8(1)(e) fiduciary (76) · 8(1)(h) investigation (71)


Limitations

  • 17.2% of primary decisions carry UNKNOWN outcome labels; the rate is higher (49%) in the CIC PDF subset
  • The appellant name field has low coverage (<25%) due to format variation; it is not needed for the benchmark tasks
  • Cases from the CIC portal batch are concentrated among 5 commissioners
  • Does not include State Information Commission decisions
  • Exemption extraction requires explicit section citation in text
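
The last limitation above follows directly from citation-based matching. The sketch below uses an illustrative regex (an assumption, not the repository's pattern) to show that an exemption is detected only when the text names the section explicitly.

```python
import re

# Matches citations such as "Section 8(1)(j)", "Sec. 8 (1) (d)", or "u/s 8(1)(h)".
# This pattern is an illustrative assumption, not the repository's regex.
SECTION_RE = re.compile(
    r"(?:\bsection|\bsec\.?|\bu/s)\s*(\d+)\s*\(\s*(\d+)\s*\)\s*\(\s*([a-j])\s*\)",
    re.IGNORECASE,
)

def extract_exemptions(text):
    """Return unique cited sections, normalised to '8(1)(j)' form, in order of appearance."""
    seen = []
    for sec, sub, clause in SECTION_RE.findall(text):
        label = f"{sec}({sub})({clause.lower()})"
        if label not in seen:
            seen.append(label)
    return seen

# Hypothetical decision snippets.
explicit = (
    "The CPIO denied the request under Section 8(1)(j) and Sec. 8 (1) (d) "
    "of the RTI Act; section 8(1)(J) was reiterated."
)
implicit = "The information is personal and was rightly denied."

cited = extract_exemptions(explicit)
missed = extract_exemptions(implicit)
```

The second snippet describes a personal-information denial without citing a section, so nothing is extracted.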

Citation

If you use RTI-Bench in your research, please cite:

@misc{bose2025rtibench,
  title     = {RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis and Outcome Prediction},
  author    = {Bose, Joy},
  year      = {2025},
  eprint    = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url       = {https://huggingface.co/datasets/joyboseroy/rti-bench}
}

License

Code: MIT License

Dataset: CC BY 4.0. Source data from the CIC portal is public domain under Indian government open data policy. The HuggingFace instruction-response subset is derived from jatinmehra/RTI-CASE-DATASET (original license applies to that subset).


Acknowledgements

Source A (the instruction-response corpus) is derived from jatinmehra/RTI-CASE-DATASET on HuggingFace. Source B (the CIC PDF decisions) was collected from the Central Information Commission portal (dsscic.nic.in), which publishes decisions as public records under the RTI Act 2005.

This work is motivated by the belief that AI should help democratise access to justice — not replace it.
