diff --git a/README.md b/README.md
index 7d9b518..55efcc4 100644
--- a/README.md
+++ b/README.md
@@ -63,19 +63,12 @@ To evaluate retrieval results use the following command:
python retriever/eval.py --eval_file {eval_filename} --not_par_level
```
-Use the --not_par_level flag for asqa, where the gold metadata is not separated into document-level and paragraph-level ids.
+Use the `--not_par_level` flag for ASQA, where the gold metadata is not separated into document-level and paragraph-level ids.
To generate 10 noisy documents in each percentile of neighbors for each query, add the `--noise_experiment` flag. Note that this is only implemented for dense retrieval and has only been tested on ASQA.
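+For example, a retriever evaluation run on ASQA that also exports the percentile noise files can combine the flags above (a sketch):
+```bash
+python retriever/eval.py --eval_file {eval_filename} --not_par_level --noise_experiment
+```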
-### Noise experiments
-TODO: add overall statement about noise experiments and how to run them. specifics below
-TODO: add info about retrieving ALL neighbors
-TODO: add info about noise percentile bins in `ret_utils.py` and `run.py`
-TODO: add info about evaluating closer neighbors with `preprocessing/sample_retrieved_neighbors.py` and `run.py`
-
-
-## Reader
+### Reader
There are existing config files in `reader/configs`, or you can create your own. You may also override arguments in the config file with command line arguments, or skip config files entirely and specify everything
via command line arguments.
@@ -84,211 +77,49 @@ via command line arguments.
python reader/run.py --config {config_name}
```
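+For example, a single setting can be overridden without editing the config. This is only a sketch; it assumes `reader/run.py` exposes a `--seed` argument like the original ALCE `run.py`:
+```bash
+python reader/run.py --config {config_name} --seed 43
+```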
-### Reader evaluation
-
-ACLE evaluation is implemented in `run/eval.py`. The metrics are tailored to each dataset as outlined in the script.
-
-For ASQA and QAMPARI, use the following command
-```bash
-python reader/eval.py --f {result_file_name} --citations
-```
-
-For nq and bioasq, use the following command. Note that an option to run this eval without bert is offered because it can be somewhat time consuming.
-```bash
-python reader/eval.py --f {result_file_name} --citations --no_bert
-```
-
-The evaluation result will be saved in `result/`, with the same name as the input and a suffix `.score`.
-
-To generate per-k reader result plots with retrieval results on the same axis, run:
-
-```bash
-python reader/plot_per_k.py --eval_file {dataset}-{model_name}-None-shot{}-ndoc*-42-{cite-}{retriever}.json.score --ret_file {dataset}_retrieval-{retriever}.json --ret_metric {top-k accuracy/precision@k/recall@k}
-```
-
-
-
-
-
-
-
-
-
-# Enabling Large Language Models to Generate Text with Citations
-
-

*: ALCE is pronounced /elk/, as ALCE is the Latin word for elk (Europe) or moose (North America).
-
-
-
-
-This repository contains the code and data for paper [Enabling Large Language Models to Generate Text with Citations](https://arxiv.org/abs/2305.14627).
-In this paper, we propose ALCE, a benchmark for **A**utomatic **L**LMs' **C**itation Evaluation.
-ALCE contains three datasets: ASQA, QAMPARI, and ELI5.
-We provide automatic evaluation code of LLM generations around three dimensions: fluency, correctness, and citation quality.
-This repository also includes code to reproduce the baselines in our paper.
-
-
-
-
-
-
-
-
-## Quick Links
-
- - [Requirements](#requirements)
- - [Data](#data)
- - [Code Structure](#code-structure)
- - [Reproducing Baselines](#reproducing-baselines)
- - [Evaluation](#evaluation)
- - [Human Evaluation](#human-evaluation)
- - [Bug or Questions](#bug-or-questions)
- - [Citation](#citation)
-
-
-## Requirements
-
-Please install the latest versions of PyTorch (`torch`), HuggingFace Transformers (`transformers`), HuggingFace Accelerate (`accelerate`), and the OpenAI API package (`openai`). This codebase is tested on
-`torch==2.1.0.dev20230514+cu118`, `transformers==4.28.1`, `accelerate==0.17.1`, and `openai==0.27.4` with Python 3.9.7.
-
-## Data
-
-You can download datasets (along with retrieval results) by running the following command:
-
-```bash
-bash download_data.sh
-```
-
-All the data will be stored in `data/`. Our data included top-100 DPR/GTR retrieved results for ASQA and QAMPARI, and top-100 BM25 retrieved results for QAMPARI. We also provide reranked oracle retrieval results, where top-5 passages can achieve the same recall as the original top-100 recall.
-
-### Retrieval
-
-You can reproduce the passage retrieval step with the following command:
-```bash
-python retrieval.py --data {path/to/data} --retriever {bm25/gtr} --output_file {path/to/output}
-```
-
-There are additional packages required for the retrieval steps.
-Specifically, you need to install `pyserini==0.21.0`(their github [repo](https://github.com/castorini/pyserini/tree/master) is helpful) and `sentence-transformers==2.2.2`.
+### Noise experiments
+This section outlines how to run experiments that add noisy documents to the gold and retrieved documents, with the aim of replicating the performance gains observed in [The Power of Noise](https://arxiv.org/abs/2401.14887).
-For the BM25 retrieval over Common Crawl using Sphere, you must first download the index from the Sphere [repo](https://github.com/facebookresearch/Sphere), and set the environmental variable `BM25_SPHERE_PATH` to the path of the downloaded index.
-Specifically, you can use the following command:
-```bash
-wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz
-tar -xzvf faiss_index/sphere_sparse_index.tar.gz -C faiss_index
-export BM25_SPHERE_PATH=$PWD/faiss_index
-```
-It's important to note that given the large size of the corpus, this step is extremely expensive and time-consuming. We found that larger CPU memory tends to help with the speed.
+#### Noise percentile experiments
+Using the `--noise_experiment` flag in the retrieval step described in [Retriever evaluation](#retriever-evaluation) produces 10 noisy documents in each percentile of neighbors for each query. All documents are retrieved for the query, yielding a list ordered from most to least similar to the query; this list is divided into ten equal bins, and random documents from each bin are exported to a noise file. This is implemented in `retriever/ret_utils.py`. For each resulting noise file, run:
-For GTR, we first build an index using the DPR wikipedia snapshot, which you can obtain using the download script from the DPR [repo](https://github.com/facebookresearch/DPR), and then setting the environmental variable `DPR_WIKI_TSV` to the path of the tsv file.
-Specifically, you can use the following command:
-```bash
-wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
-gzip -xzvf psgs_w100.tsv.gz
-export DPR_WIKI_TSV=$PWD/psgs_w100.tsv
-```
-Then, you want to set `GTR_EMB` to the path of the GTR embeddings of the Wikipedia corpus, and running the retrieval script for the first time will automatically build and save the index.
-Building the dense index can be expensive for GPU memory (we use 80GB GPUs for this) and time-consuming; the entire index will take about 31GB.
-If you find this step to be too expensive, you can also download it using:
```bash
-wget https://huggingface.co/datasets/princeton-nlp/gtr-t5-xxl-wikipedia-psgs_w100-index/resolve/main/gtr_wikipedia_index.pkl
-export GTR_EMB=$PWD/gtr_wikipedia_index.pkl
+python3 reader/run.py --config {config_name} --noise_file {noise_name}
```
+By default, the noisy documents will be added to the prompt after the retrieved or gold documents. To switch this order, use the `--noise_first` flag. You can switch between adding noise to the gold and retrieved documents by changing the `config_name`.
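+The percentile noise files are named after the retrieval output file with a `-random-{10,20,...,100}.json` suffix (see `save_noise` in `retriever/ret_utils.py`), so one way to cover all ten bins is a small shell loop. This is only a sketch, with `{retrieval_output}` standing in for that base file name and `{config_name}` for an existing config:
+```bash
+for p in 10 20 30 40 50 60 70 80 90 100; do
+    python3 reader/run.py --config {config_name} --noise_file {retrieval_output}-random-${p}.json
+done
+```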
-To reproduce the DPR retrieval, we refer the DPR [repo](https://github.com/facebookresearch/DPR), which we used the original DPR checkpoint trained on NQ.
-## Code Structure
-
-* `run.py`: run file to reproduce our baseline generations.
-* `eval.py`: eval file to evaluate generations.
-* `prompts`: folder that contains all prompt files.
-* `configs/`: folder that contains all config files to reproduce baselines.
-* `tools/`: misc code (generate summaries/snippets, reranking, etc.)
-
-
-## Reproducing Baselines
-
-
-You can reproduce baselines from our paper by
+#### First 100 neighbors experiments
+To perform experiments that add nearer neighbors to the gold and retrieved results, first run default retrieval to obtain an `eval_file`, then create new noise files from retrieved results 5-10 and 95-100 (for each query) by running:
```bash
-python run.py --config configs/{config_name}
-```
-
-You can also overwrite any arguments in the config file or add new arguments simply through command line:
+python3 preprocessing/sample_retrieved_neighbors.py --f {eval_file} --d {dataset_name}
```
-python run.py --config configs/{config_name} --seed 43 --model vicuna-13b
-```
-
-The naming of config files follow the rule of `{LLM}_{#demos and #passages}_{retriever}_{method}.yaml`. Method names include:
-* `default` corresponds to the **Vanilla** model in our paper.
-* `summary` corresponds to the **Summary** model.
-* `extraction` corresponds to the **Snippet** model.
-* `interact_doc_id` corresponds to the **Interact** model.
-* `interact_search` corresponds to the **InlineSearch** model.
-* `closedbook` corresponds to the **ClosedBook** model.
-
-Our code support both OpenAI API and offline HuggingFace models:
-
-* For OpenAI models (for example, ChatGPT), you need to set the environment variable `OPENAI_API_KEY` and `OPENAI_ORG_ID`. If you are using the Azure OpenAI API, you need to set the environment variable of `OPENAI_API_KEY` and `OPENAI_API_BASE`. You also need to add the flag `--azure`.
- * Note that in Azure OpenAI API, ChatGPT's name is different and you should set it by `--model gpt-35-turbo`.
-* For the open-source models, you should set the model name equal to the input of HuggingFace models' `.from_pretrained` method. This could either be a local directory (e.g. for the older LLaMA models) or a path to the HuggingFace hub.
-
-For detailed argument usage, please refer to `run.py`.
-
-Model output along with gold answers and run configs will be stored in a json file in `result/`.
-
-### Post-hoc citation
+Note that gold documents are omitted from this set. You can then run experiments with the resulting noise files (as outlined above):
-For closed-book models, one can use `post_hoc_cite.py` to add citations in a post-hoc manner (using GTR-large). To run post-hoc citation, execute
```bash
-python post_hoc_cite.py --f result/{RESULT JSON FILE NAME} --external_docs data/{CORRESPONDING DATA}
+python3 reader/run.py --config {config_name} --noise_file {noise_name}
```
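+To place the noisy documents before the gold or retrieved documents in these runs as well, add the `--noise_first` flag described above, e.g.:
+```bash
+python3 reader/run.py --config {config_name} --noise_file {noise_name} --noise_first
+```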
-The output file with post-hoc citations will be stored in `result/`, with a suffix `post_hoc_cite.gtr-t5-large-external`.
-
-## Evaluation
-
-ACLE evaluation is implemented in `eval.py`.
+### Reader evaluation
-For ASQA, use the following command
-```bash
-python eval.py --f {path/to/result/file} --citations --qa --mauve
-```
+QA accuracy evaluation is implemented in `reader/eval.py`. The evaluation code contains copies of functions from two RAG papers that previously used these datasets ([ALCE](https://github.com/princeton-nlp/ALCE) and [RAGGED](https://github.com/neulab/ragged)).
-For QAMPARI, use the following command
+For ASQA and QAMPARI, use the following command
```bash
-python eval.py --f {path/to/result/file} --citations
+python reader/eval.py --f {result_file_name} --citations
```
-For ELI5, use the following command
+For NQ and BioASQ, use the following command. Note that the `--no_bert` option to run this eval without BERT is offered because it can be somewhat time-consuming.
```bash
-python eval.py --f {path/to/result/file} --citations --claims_nli --mauve
+python reader/eval.py --f {result_file_name} --citations --no_bert
```
The evaluation result will be saved in `result/`, with the same name as the input and a suffix `.score`.
-## Human Evaluation
-
-The results from our human evaluation (Section 6) are located under the directory [`human_eval`](human_eval).
-Both the data and the analysis are available, please refer to the directory for details.
-
-## Bug or Questions?
-
-If you have any questions related to the code or the paper, feel free to email Tianyu (`tianyug@cs.princeton.edu`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!
-
-
-
-## Citation
-
-Please cite our paper if you use ALCE in your work:
+To generate per-k reader result plots with retrieval results on the same axis, run:
-```bibtex
-@inproceedings{gao2023enabling,
- title={Enabling Large Language Models to Generate Text with Citations},
- author={Gao, Tianyu and Yen, Howard and Yu, Jiatong and Chen, Danqi},
- year={2023},
- booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
-}
+```bash
+python reader/plot_per_k.py --eval_file {dataset}-{model_name}-None-shot{}-ndoc*-42-{cite-}{retriever}.json.score --ret_file {dataset}_retrieval-{retriever}.json --ret_metric {top-k accuracy/precision@k/recall@k}
```
diff --git a/retriever/ret_utils.py b/retriever/ret_utils.py
index 9f59312..9305ec5 100644
--- a/retriever/ret_utils.py
+++ b/retriever/ret_utils.py
@@ -80,54 +80,53 @@ def save_noise(query_data, queries, k, k_neighbors, corpus, dist_neighbors, doc_
"""
logger.info('Saving text and titles for each neighbor')
-
+
import random
random.seed(42)
- # for i in range(10): # for each percentile
- i = 9
- logger.info(f"Generating {(i+1)*10}th percentile random noise")
- start_index = int(i * 0.01 * k)
- end_index = int((i + 1) * 0.01 * k)
- # for qi, q in enumerate(tqdm(queries)):
- for qi, q in enumerate(queries):
- # get gold ids for query
- par_gold = False
- try:
- gold_ids = query_data[qi]['output']['id_set']
- except:
- par_gold = True
- gold_ids = query_data[qi]['output']['page_par_id_set']
- # get neighbor info for this percentile only
- neighbor_inds = k_neighbors[qi, :]
- neighbor_inds = neighbor_inds[start_index:end_index]
- neighbor_data = corpus[neighbor_inds]
- # Get associated text
- n_text = neighbor_data[text_key]
- # Get document & passage ID. Also get associated document title.
- n_id = neighbor_data["id"]
+    for i in range(10): # for each percentile
+        logger.info(f"Generating {(i+1)*10}th percentile random noise")
+        # each bin spans one tenth of the k ranked neighbors
+        start_index = int(i * 0.1 * k)
+        end_index = int((i + 1) * 0.1 * k)
+ # for qi, q in enumerate(tqdm(queries)):
+ for qi, q in enumerate(queries):
+ # get gold ids for query
+ par_gold = False
+            try:
+                gold_ids = query_data[qi]['output']['id_set']
+            except KeyError:
+                # fall back to paragraph-level gold ids
+                par_gold = True
+                gold_ids = query_data[qi]['output']['page_par_id_set']
+ # get neighbor info for this percentile only
+ neighbor_inds = k_neighbors[qi, :]
+ neighbor_inds = neighbor_inds[start_index:end_index]
+ neighbor_data = corpus[neighbor_inds]
+ # Get associated text
+ n_text = neighbor_data[text_key]
+ # Get document & passage ID. Also get associated document title.
+ n_id = neighbor_data["id"]
- ret = [] # list of doc dicts
- choices = [] # track so no duplicates and no golds
- while len(choices) < 100:
- c = random.randrange(len(n_id))
- og_index = c + start_index
- doc_id = str(n_id[c])
- if doc_id not in choices and doc_id not in gold_ids:
- # good choice!
- choices.append(doc_id)
- score = str(dist_neighbors[qi, og_index])
- doc_text = n_text[c]
- res_dict = get_doc(doc_dataset, doc_id, doc_text, score, title_dict, logger)
- res_dict['neighbor_id'] = str(og_index)
- ret.append(res_dict)
- # else continue generating
- query_data[qi]['docs'] = ret
+ ret = [] # list of doc dicts
+ choices = [] # track so no duplicates and no golds
+ while len(choices) < 100:
+ c = random.randrange(len(n_id))
+ og_index = c + start_index
+ doc_id = str(n_id[c])
+ if doc_id not in choices and doc_id not in gold_ids:
+ # good choice!
+ choices.append(doc_id)
+ score = str(dist_neighbors[qi, og_index])
+ doc_text = n_text[c]
+ res_dict = get_doc(doc_dataset, doc_id, doc_text, score, title_dict, logger)
+ res_dict['neighbor_id'] = str(og_index)
+ ret.append(res_dict)
+ # else continue generating
+ query_data[qi]['docs'] = ret
- # output percentile json so memory isn't exceeded
- percentile_file_name = output_file_name.split(".json")[0] + "-random-" + str((i+1)*10) + ".json"
- # load existing file if it exists
- if os.path.exists(os.path.join(DATA_PATH, percentile_file_name)):
- prev_batch_file = load_json(os.path.join(DATA_PATH, percentile_file_name))
- query_data = prev_batch_file + query_data
- save_file(percentile_file_name, query_data, logger)
+        # output percentile json so memory isn't exceeded
+        percentile_file_name = output_file_name.split(".json")[0] + "-random-" + str((i+1)*10) + ".json"
+        # prepend a previously saved batch if it exists, without mutating query_data for later percentiles
+        out_data = query_data
+        if os.path.exists(os.path.join(DATA_PATH, percentile_file_name)):
+            prev_batch_file = load_json(os.path.join(DATA_PATH, percentile_file_name))
+            out_data = prev_batch_file + query_data
+        save_file(percentile_file_name, out_data, logger)