Adding documentation for vector embedding, recall experiments, evaluation with CIs (#21)

* provide info on creating database of vector embeddings
* updated information about evaluation results with bootstrap CIs
* adding runners for manipulating gold/search document recall, info to README
vyaivo authored Nov 5, 2024
1 parent bda0260 commit bf8e949
Showing 7 changed files with 171 additions and 379 deletions.
86 changes: 50 additions & 36 deletions README.md
@@ -34,6 +34,24 @@ To calculate evaluation scores for LLM outputs, you will also need `rouge-score`

## Setup

### Set Paths
Before getting started, you must fill in the path variables in `setup/set_paths.sh` for your environment:

```bash
export DATA_PATH= # directory containing all preprocessed eval files
export INDEX_PATH=$DATA_PATH/indices # directory to save indices for search/retrieval with SVS
export VEC_PATH=$DATA_PATH/vectors # path to document vectors for search/retrieval with SVS
export DATASET_PATH= # directory containing subdirectories (labelled with dataset name) containing raw downloaded data
export RESULTS_PATH= # location to save output from retriever and reader eval
export COLBERT_MODEL_PATH= # location where colbert model has been downloaded
```
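
For example, a filled-in `setup/set_paths.sh` might look like this (the locations below are purely illustrative):

```bash
# Illustrative values only -- substitute the locations used on your system.
export DATA_PATH=/data/rag_eval                  # directory containing all preprocessed eval files
export INDEX_PATH=$DATA_PATH/indices             # indices for search/retrieval with SVS
export VEC_PATH=$DATA_PATH/vectors               # document vectors for search/retrieval with SVS
export DATASET_PATH=/data/raw_datasets           # raw downloaded data, one subdirectory per dataset
export RESULTS_PATH=/data/results                # output from retriever and reader eval
export COLBERT_MODEL_PATH=/models/colbertv2.0    # location of the downloaded ColBERT model
```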

then load the variables into your environment with

```bash
source setup/set_paths.sh
```

### Download Data

To download the ASQA and QAMPARI datasets, as well as the DPR Wikipedia snapshot used for retrieved documents, please refer to the original [ALCE repository](https://github.com/princeton-nlp/ALCE). After downloading this data:

1. Create `asqa`, `qampari`, and `dpr_wiki` subdirectories in the location specified by the `DATASET_PATH` environment variable.
2. Place one corresponding .json eval file (it doesn't matter which) in the `asqa` and `qampari` directories, respectively, and rename each file `raw.json`.
3. Rename the downloaded DPR Wikipedia dump `raw.tsv` and place it in the `dpr_wiki` subdirectory.
4. Rename the oracle files included in the ALCE data `asqa_gold.json` and `qampari_gold.json`, and move them to the location specified by the `DATA_PATH` environment variable.

Finally, the renamed ALCE .json files and the DPR Wikipedia .tsv file can be converted to the formats needed for running retrieval with SVS (Scalable Vector Search) by running:
@@ -54,23 +72,8 @@ For the NQ dataset and the KILT Wikipedia corpus that supports it, you may follow

To preprocess the files for use with our dense retrieval code using SVS, run `preprocessing/convert_nq_dense.py` with the appropriate input arguments.

### Embed corpus text
After you download the DPR Wikipedia and/or KILT Wikipedia dumps, you will need to embed the corpus samples into vectors. These form the vector database for the retriever. Code to embed JSON or HuggingFace datasets is available in the [VectorSearchDatasets repository](https://github.com/IntelLabs/VectorSearchDatasets/blob/main/text/wikipedia_dataset.py). Their [README](https://github.com/IntelLabs/VectorSearchDatasets/tree/main/text) has more detailed information on using device parallelism -- specifically, you should refer to the `wikipedia_110M` section to see an example with the KILT Wikipedia corpus.
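
As a rough sketch of what this embedding step produces -- here using `sentence-transformers` with a BGE model directly, rather than the VectorSearchDatasets script itself:

```python
# Minimal sketch: embed a few passages with a BGE model via sentence-transformers.
# The full corpus should be embedded with the VectorSearchDatasets script linked above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
passages = [
    "Aristotle was a Greek philosopher and polymath.",
    "The Eiffel Tower is located in Paris, France.",
]
# Normalized float32 vectors, one row per passage (768 dimensions for bge-base).
vectors = model.encode(passages, normalize_embeddings=True).astype(np.float32)
np.save("example_vectors.npy", vectors)  # SVS builds its search index over vectors like these
```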

## Retriever

@@ -88,13 +91,21 @@ Retrieval with dense text embeddings (e.g. the [BGE-1.5 embeddings](https://hugg

Alternatively, you can make more system-specific install configurations by following the [documentation here](https://intel.github.io/ScalableVectorSearch/).

We have implemented similarity-based retrieval with either exact search or approximate nearest neighbor (ANN) search. Retriever configuration files for exact search are titled `dense_{DATASET}.yaml`. Any arguments passed to the `retrieval/run.py` script will override the parameters set in the configuration YAML file.
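
For example, exact search on ASQA could be run as follows. The `--num_neighbors` override in the second command is illustrative only; check `retrieval/run.py` for the argument names it actually accepts:

```bash
python retrieval/run.py --config dense_asqa.yaml
# Hypothetical override of a config parameter from the command line:
python retrieval/run.py --config dense_asqa.yaml --num_neighbors 100
```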

#### Search recall experiments
For the approximate search experiments, we tuned the ANN search to achieve a specific accuracy relative to exact search. This requires a file containing the exact-search results; the `preprocessing/create_ground_truth_calibration.py` script saves these results for a subset of the data.

Approximate search parameters can be set in configuration files titled `dense_ann_{DATASET}.yaml`. Several of the parameters for building the ANN search index can be modified to alter the search performance, but we have provided configurations that work well for those datasets. If you leave the `calib_kwargs` parameter as-is, then SVS will run a calibration routine to estimate the search window size that reaches the target recall. Here are example commands to run the calibration at all the target search recalls given in the paper for the ASQA dataset:

```bash
python retrieval/run.py --config dense_ann_asqa.yaml --calib_kwargs '{"calib_prefix": asqa_bge-base-dense, "num_neighbors": 10, "target_recall": 0.7}'
python retrieval/run.py --config dense_ann_asqa.yaml --calib_kwargs '{"calib_prefix": asqa_bge-base-dense, "num_neighbors": 10, "target_recall": 0.9}'
python retrieval/run.py --config dense_ann_asqa.yaml --calib_kwargs '{"calib_prefix": asqa_bge-base-dense, "num_neighbors": 10, "target_recall": 0.95}'
```

#### Setting gold document recall
For some experiments, we manipulated the set of context documents to achieve an exact value for average retrieval recall across the whole dataset of queries. This can be done with `preprocessing/set_gold_recall.py`. The output is a JSON file that you can pass to the reader LLM (see below).
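
As a very rough sketch of the idea (an assumption about the general approach, not the actual logic of `preprocessing/set_gold_recall.py`): downsample each query's gold documents, then add documents back until the dataset-level average recall reaches the target.

```python
# Sketch only: choose per-query subsets of gold documents so that the average
# gold-document recall over the dataset hits a target value.
import math
import random

def subsample_gold_docs(gold_docs_per_query, target_recall, seed=0):
    """gold_docs_per_query: list of lists of gold doc ids, one list per query.
    Assumes every query has at least one gold document."""
    rng = random.Random(seed)
    kept = []
    for gold in gold_docs_per_query:
        k = math.floor(target_recall * len(gold))
        kept.append(rng.sample(gold, k))

    def avg_recall():
        return sum(len(kp) / len(g) for kp, g in zip(kept, gold_docs_per_query)) / len(kept)

    # Greedily add back one gold document at a time until the average recall
    # reaches the target.
    order = list(range(len(kept)))
    rng.shuffle(order)
    for i in order:
        if avg_recall() >= target_recall:
            break
        missing = [d for d in gold_docs_per_query[i] if d not in kept[i]]
        if missing:
            kept[i].append(rng.choice(missing))
    return kept
```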

### ColBERT-2.0 retriever -- very slight modifications from v.0.2.20
The faiss package is required to run the ColBERT retriever.
@@ -106,11 +117,12 @@ Details on these changes: We ran into OOM errors when using the original code. T
### Retriever evaluation
To evaluate retrieval results, use the following command:
```bash
python retriever/eval_per_query.py --eval_file {eval_filename} --not_par_level
```

Use the ```--not_par_level``` flag for asqa, where the gold metadata is not separated into document-level and paragraph-level ids.

To get 95% bootstrap confidence intervals on the retrieval results, use the ```--ci``` flag.
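
For example, an ASQA run could be evaluated with bootstrap CIs as follows (the eval filename is illustrative):

```bash
python retriever/eval_per_query.py --eval_file asqa_retrieval-bge-base.json --not_par_level --ci
```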

To generate 10 noisy docs in each percentile of neighbors for each query, add the ```--noise_experiment``` tag. Note that this is only implemented for dense retrieval and has only been tested for asqa.


@@ -124,10 +136,16 @@ python reader/run.py --config {config_name}

Bash files for looping over various numbers of documents included in the prompt and evaluating the results can be found in `runners/ndoc_asqa_mistral_reader.sh` and `runners/ndoc_asqa_mistral_eval_looper.sh`.

### Search recall experiments
Bash files for looping over the calibrated search recall results and the datasets shown in the paper can be found in `runners/search-recall_reader.sh` and `runners/search-recall_eval.sh`.

### Gold document recall experiments
Bash files for looping over the gold document recall retriever files and the datasets shown in the paper can be found in `runners/gold-recall_reader.sh` and `runners/gold-recall_eval.sh`.

### Noise experiments
This section outlines the process for experiments that add noisy documents to the gold and retrieved documents, aiming to replicate the performance gains observed in [The Power of Noise](https://arxiv.org/abs/2401.14887).

### Noise percentile experiments
Using the ```--noise_experiment``` tag in the retrieval step described in [Retriever evaluation](#retriever-evaluation) produces 10 noisy docs in each percentile of neighbors for each query: all documents are retrieved for the query and ordered from most to least similar, the ordered list is divided into ten equal bins, and random documents from each bin are exported to a noise file. This is implemented in `retriever/ret_utils.py`. For each resulting noise file, run:

@@ -137,8 +155,7 @@ By default, the noisy documents will be added to the prompt after the retrieved

Bash files for running and evaluating this experiment can be found at `runners/noise_percentile_asqa_mistral_gold_reader.sh` and `runners/noise_percentile_asqa_mistral_gold_eval.sh`.
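
For reference, a minimal sketch of the percentile-binning procedure described above (an illustration only, not the `retriever/ret_utils.py` implementation):

```python
# Sketch: split a query's full ranked neighbor list into 10 percentile bins
# and sample 10 random documents from each bin.
import random

def sample_noise_by_percentile(ranked_doc_ids, docs_per_bin=10, num_bins=10, seed=0):
    """ranked_doc_ids: all retrieved doc ids for one query, most to least similar."""
    rng = random.Random(seed)
    bin_size = len(ranked_doc_ids) // num_bins
    noise = {}
    for b in range(num_bins):
        start = b * bin_size
        end = (b + 1) * bin_size if b < num_bins - 1 else len(ranked_doc_ids)
        noise[b] = rng.sample(ranked_doc_ids[start:end], docs_per_bin)
    return noise  # e.g. noise[0] is drawn from the most similar 10% of documents
```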


### First 100 neighbors experiments
To perform experiments with adding nearer neighbors to the gold and retrieved results, run default retrieval to obtain an `eval_file`, then create new noise files for retrieved results 5-10 and 95-100 (for each query) by running:

@@ -155,24 +172,21 @@ Bash files for running and evaluating this experiment can be found at `runners/f

### Reader evaluation

We have two ways to run evaluation on the reader results: the `eval.py` script provides the overall mean, whereas the `eval_per_query.py` script provides the information needed to compute confidence intervals (CIs) across dataset queries. The evaluation code contains copies of functions from two RAG papers that previously used these datasets ([ALCE](https://github.com/princeton-nlp/ALCE) and [RAGGED](https://github.com/neulab/ragged)).

The following command runs the per-query evaluation:
```bash
python reader/eval_per_query.py --f {result_file_name} --citations
```

The evaluation result will be saved in the RESULTS_PATH (default is `result/`), with the same name as the input and a suffix `.score`.

To compute 95% bootstrap confidence intervals on the reader evaluation results, you can use the `reader/compute_ci.py` script. The various input arguments are used to find the correct ```*perquery.score``` files in the results folder.
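
For example, based on the arguments shown in the `reader/compute_ci.py` diff below, CIs for the ASQA and NQ Mistral runs in the gold and bge-base conditions could be computed with:

```bash
python reader/compute_ci.py -d asqa nq -m Mistral -c gold bge-base
```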

To generate per-k reader result plots with retrieval results overlaid on top, run:

```bash
python reader/plot_per_k.py --eval_file {dataset}-{model_name}-None-shot{}-ndoc*-cite-{retriever}.json.score --ret_file {dataset}_retrieval-{retriever}.json --ret_metric {top-k accuracy/precision@k/recall@k}
```

## Disclaimer
12 changes: 8 additions & 4 deletions reader/compute_ci.py
@@ -19,10 +19,14 @@ def bootstrap_ci(data, key_prefix, func=np.mean, n_boot=1000, percentiles=[2.5,

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-d", "--datasets", nargs='+', default=['asqa', 'nq'])
parser.add_argument("-f", "--subfolder", type=str, default=None)
parser.add_argument("-m", "--models", nargs='+', default=['Llama', 'Mistral'])
parser.add_argument("-c", "--conditions", nargs='+', default=['gold', 'closedbook', 'bge-base', 'colbert'])
parser.add_argument("-d", "--datasets", nargs='+', default=['asqa', 'nq'],
help="Dataset(s) in the score filename")
parser.add_argument("-f", "--subfolder", type=str, default=None,
help="Searches in a subfolder in the RESULTS_PATH")
parser.add_argument("-m", "--models", nargs='+', default=['Llama', 'Mistral'],
help="Model(s) in the score filename")
parser.add_argument("-c", "--conditions", nargs='+', default=['gold', 'closedbook', 'bge-base', 'colbert'],
help="Condition(s) in the score filename")
parser.add_argument("--overwrite", action="store_true")
args = parser.parse_args()
