This repository contains the research codebase accompanying the Master’s thesis “Beyond Redundancy: Embedding-Aware Novelty Reranking in Retrieval-Augmented Generation” at the University of Pisa, Department of Information Engineering. The project investigates novelty-aware reranking as a way to enrich Retrieval-Augmented Generation (RAG) pipelines with diverse and non-redundant evidence. The work focuses on how embedding-based supervision can improve nugget coverage compared to traditional lexical heuristics.
RAG pipelines built on Large Language Models (LLMs) often suffer from redundant retrieval, where multiple retrieved passages repeat the same fact in different forms. This redundancy wastes the limited context window and reduces the factual diversity of generated answers.
This research addresses the problem through three research questions (RQ1–RQ3):
- RQ1: Semantic Novelty Detection. Does clustering over dense embeddings produce more coherent novelty groups than lexical (Jaccard) baselines?
- RQ2: Embedding-Aware Supervision. Does fine-tuning the Set-Encoder reranker with semantic novelty labels improve diversity-sensitive metrics compared to lexical supervision?
- RQ3: Downstream RAG Evaluation. Does integrating novelty-aware reranking into RAG pipelines improve factual coverage and answer quality on benchmarks such as MMLU, GPQA, and the TREC RAG Track 2024?
The experimental pipeline is structured in three stages:
- Semantic Clustering (RQ1)
  - Encode top-k passages into dense embeddings
  - Apply agglomerative clustering with cosine similarity
  - Derive novelty labels that consolidate paraphrases into coherent groups
- Fine-Tuning the Set-Encoder (RQ2)
  - Stage 1: Duplicate-Aware InfoNCE (contrastive pretraining)
  - Stage 2: Novelty-Aware RankNet (listwise reranking with novelty penalties)
  - Compare lexical vs. semantic novelty supervision
- Integration into RAG Pipelines (RQ3)
  - First-stage retrieval (BM25 / official TREC pools)
  - Reranking with Set-Encoder (lexical vs. semantic supervision)
  - Generation with LLaMA-3.1-70B-Instruct
  - Evaluation with MMLU, GPQA, and nugget-based metrics
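
The semantic clustering stage above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the average-linkage rule, the distance threshold of `0.1`, the toy 2-d vectors standing in for real encoder output, and the rule "the highest-ranked passage in each cluster is novel" are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative_clusters(embeddings, threshold=0.1):
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per passage and repeatedly merges the closest
    pair until the closest pair is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        (x, y), dist = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x stays valid
    return clusters


def novelty_labels(clusters, n_passages):
    """Assumed labeling rule: the best-ranked member of each cluster is
    novel (1); every other member is treated as redundant (0)."""
    labels = [0] * n_passages
    for members in clusters:
        labels[min(members)] = 1
    return labels


# Toy "embeddings" in retrieval-rank order: two near-paraphrase pairs
# plus one passage that sits between them.
embeddings = [
    [1.0, 0.0],
    [0.98, 0.05],
    [0.0, 1.0],
    [0.05, 0.99],
    [0.7, 0.7],
]
clusters = agglomerative_clusters(embeddings, threshold=0.1)
labels = novelty_labels(clusters, len(embeddings))
print(clusters, labels)
```

Swapping `cosine_distance` for a token-set Jaccard distance over the passage text would give a lexical variant of the same procedure, which is one way to picture the lexical vs. semantic comparison.
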
The experiments use the following datasets and benchmarks:
- MS MARCO v2 → training
- MS MARCO v2.1 segmented → robustness testing
- TREC Deep Learning 2019/2020 → high-quality graded relevance
- GPQA → multi-hop reasoning
- MMLU → multi-domain multiple-choice
- TREC RAG Track 2024 → nugget-based evaluation (AutoNuggetizer)
The main findings for each research question are:
- RQ1 → Semantic clustering (e.g., MiniLM, E5, BGE-M3) yields fewer, more coherent novelty groups than lexical Jaccard clustering, consolidating paraphrases effectively.
- RQ2 → Embedding-aware fine-tuning performs comparably to lexical supervision on α-nDCG; lexical labels remain strong baselines due to their stability and granularity.
- RQ3 → In RAG pipelines, novelty-aware reranking produces directionally higher nugget coverage (especially on importance-weighted metrics) but no statistically significant gains on accuracy-based tasks (MMLU, GPQA).
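
For readers unfamiliar with α-nDCG, the diversity-sensitive metric referenced under RQ2, the following is a small sketch of the standard definition: each nugget a passage covers contributes a gain of (1 - α) raised to the number of times that nugget has already appeared higher in the ranking, gains are discounted by log2(rank + 1), and the total is normalized by a greedily built ideal ranking. The function names, the toy judgments, and α = 0.5 are illustrative assumptions; the repository's evaluation may rely on a different implementation.

```python
import math


def alpha_dcg(ranking, nuggets, alpha=0.5, k=None):
    """alpha-DCG of `ranking`; `nuggets` maps doc id -> set of nugget ids."""
    seen = {}  # nugget id -> times already covered higher in the ranking
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        gain = 0.0
        for nugget in nuggets.get(doc, set()):
            gain += (1 - alpha) ** seen.get(nugget, 0)
            seen[nugget] = seen.get(nugget, 0) + 1
        score += gain / math.log2(rank + 1)
    return score


def greedy_ideal_ranking(nuggets, alpha=0.5, k=None):
    """Greedy approximation of the ideal ordering (the exact optimum is
    NP-hard, so greedy selection is the usual practice)."""
    remaining = list(nuggets)
    seen = {}
    order = []
    while remaining and (k is None or len(order) < k):
        best = max(
            remaining,
            key=lambda d: sum((1 - alpha) ** seen.get(n, 0) for n in nuggets[d]),
        )
        order.append(best)
        remaining.remove(best)
        for nugget in nuggets[best]:
            seen[nugget] = seen.get(nugget, 0) + 1
    return order


def alpha_ndcg(ranking, nuggets, alpha=0.5, k=None):
    ideal = alpha_dcg(greedy_ideal_ranking(nuggets, alpha, k), nuggets, alpha, k)
    return alpha_dcg(ranking, nuggets, alpha, k) / ideal if ideal > 0 else 0.0


# Toy judgments (hypothetical): d1 and d2 repeat nugget "a", d3 adds "b".
judgments = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}, "d4": set()}
redundant_first = ["d1", "d2", "d3"]
diverse_first = ["d1", "d3", "d2"]

score_redundant = alpha_ndcg(redundant_first, judgments, alpha=0.5, k=3)
score_diverse = alpha_ndcg(diverse_first, judgments, alpha=0.5, k=3)
print(score_redundant, score_diverse)
```

In this toy case the ranking that surfaces the new nugget before the repeated one scores higher, which is the behavior a novelty-aware reranker is meant to be rewarded for.
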
If you use this code, please cite:

    @mastersthesis{namaki2025novelty,
      title={Beyond Redundancy: Embedding-Aware Novelty Reranking in Retrieval-Augmented Generation},
      author={Namaki Ghaneh, Daniel},
      school={University of Pisa},
      year={2025}
    }

Released under the MIT License.