Document preproc (#18)
* documenting data download and renaming
* cleanup + consolidation of data preprocessing scripts

---------

Co-authored-by: Vo, Vy <[email protected]>
aleto1999 and vyaivo authored Nov 1, 2024
1 parent de4ff64 commit a24522a
Showing 3 changed files with 22 additions and 21 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -32,12 +32,12 @@ To calculate evaluation scores for LLM outputs, you will also need `rouge-score`
* `eval.py`: eval file to evaluate generations.
* `tools/`: misc code (generate summaries/snippets, reranking, etc.)


## Setup

### Download Data

To download ASQA and QAMPARI datasets, as well as the DPR wikipedia snapshot used for retrieved documents, please refer to the original [ALCE repository](https://github.com/princeton-nlp/ALCE). After downloading this data, the ALCE .json files and DPR wikipedia .tsv files can be converted to the formats needed for running retrieval with SVS (Scalable Vector Search) by running:
To download the ASQA and QAMPARI datasets, as well as the DPR Wikipedia snapshot used for retrieved documents, please refer to the original [ALCE repository](https://github.com/princeton-nlp/ALCE). After downloading this data, create `asqa`, `qampari`, and `dpr_wiki` subdirectories in the location specified by the `DATASET_PATH` environment variable. Place one corresponding .json eval file (it does not matter which) in the `asqa` and `qampari` directories, respectively, and rename each to `raw.json`. Rename the downloaded DPR Wikipedia dump to `raw.tsv` and place it in the `dpr_wiki` subdirectory. Rename the oracle files included in the ALCE data to `asqa_gold.json` and `qampari_gold.json`, and move them to the location specified by the `DATA_PATH` environment variable. A sketch of the expected layout is shown below.
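As a rough sketch, the setup might look like the following, where the pre-rename file names are placeholders for whichever ALCE eval, oracle, and DPR Wikipedia files you downloaded:

```bash
# Placeholder input file names -- substitute the files you actually downloaded.
mkdir -p $DATASET_PATH/asqa $DATASET_PATH/qampari $DATASET_PATH/dpr_wiki
mv asqa_eval.json $DATASET_PATH/asqa/raw.json
mv qampari_eval.json $DATASET_PATH/qampari/raw.json
mv dpr_wiki_dump.tsv $DATASET_PATH/dpr_wiki/raw.tsv
mv asqa_oracle.json $DATA_PATH/asqa_gold.json
mv qampari_oracle.json $DATA_PATH/qampari_gold.json
```

Finally, the renamed ALCE .json files and DPR Wikipedia .tsv file can be converted to the formats needed for running retrieval with SVS (Scalable Vector Search) by running: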


```bash
python preprocessing/alce/convert_alce_dense.py --dataset {asqa/qampari}
@@ -52,8 +52,7 @@ python preprocessing/alce/convert_alce_colbert.py --dataset {asqa/qampari}

For the NQ dataset and the KILT Wikipedia corpus that supports it, you may follow the dataset download instructions provided by the original [RAGGED repository](https://github.com/neulab/ragged). This includes downloading the preprocessed corpus on [HuggingFace](https://huggingface.co/datasets/jenhsia/ragged). The original repository also provides tools to convert the data for use with ColBERT.
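As one possible way to fetch the preprocessed corpus (assuming the `huggingface-cli` tool is available; the target directory below is illustrative):

```bash
# Illustrative download of the preprocessed RAGGED corpus from HuggingFace.
pip install -U "huggingface_hub[cli]"
huggingface-cli download jenhsia/ragged --repo-type dataset --local-dir $DATASET_PATH/ragged
```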

To preprocess the files for use with our dense retrieval code using SVS, run `preprocessing/convert_nq_dense.py` with the appropriate input arguments.

To preprocess the files for use with our dense retrieval code using SVS, run `preprocessing/convert_nq_dense.py` with the appropriate input arguments.
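For example, a hypothetical invocation might look like the following; the dataset name and paths are illustrative, and the script reads the query file from `$DATASET_PATH/<dataset>.jsonl` and writes its outputs under `$DATA_PATH`:

```bash
# Hypothetical example -- adjust the dataset name and paths for your setup.
# Assumes the NQ query file was saved as $DATASET_PATH/nq.jsonl.
python preprocessing/convert_nq_dense.py \
    --dataset nq \
    --corpus_path $DATASET_PATH/kilt_wikipedia \
    --title_json $DATASET_PATH/kilt_wikipedia/id2title.jsonl
```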

### Set Paths
Before getting started, you must fill in the path variables in `setup/set_paths.sh` for your environment.
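A minimal sketch, assuming `set_paths.sh` exports the `DATASET_PATH` and `DATA_PATH` variables referenced above (the paths are placeholders):

```bash
# Placeholder locations -- point these at your own storage.
export DATASET_PATH=/path/to/raw/datasets
export DATA_PATH=/path/to/processed/data
```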
9 changes: 5 additions & 4 deletions preprocessing/alce/convert_alce_dense.py
@@ -60,13 +60,13 @@ def main(args):
logger.info(f"Reading input file {input_file}")

# convert dpr wiki split (used by alce as docs) to format used to generate vectors for svs
output_file = convert_alce_utils.gen_dpr_wiki_jsonl(dpr_input_file, logger)
output_data = convert_alce_utils.gen_dpr_wiki_jsonl(dpr_input_file, logger)
output_file = os.path.join(
DATASET_PATH,
"dpr_wiki",
"docs.jsonl"
)
save_jsonl(output_file, output_file, logger)
save_jsonl(output_data, output_file, logger)

# generate dpr id2title json for svs retrieval with qampari and asqa
dpr_id2title = convert_alce_utils.gen_dpr_id2title(dpr_input_file)
@@ -81,6 +81,7 @@ def main(args):

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", type=str, help="ALCE dataset to compile eval files for: [asqa, qampari]")
parser.add_argument("-d", "--dataset", type=str, required=True, help="ALCE dataset to compile eval files for: [asqa, qampari]")
args = parser.parse_args()
main(args)
main(args)

27 changes: 14 additions & 13 deletions preprocessing/convert_nq_dense.py
@@ -17,7 +17,11 @@


def main(input_file, output_prefix, corpus_path, title_json, text_key):
# Converts the NQ dataset, as obtained from the RAGGED repository (https://github.com/neulab/ragged), into the format needed for dense vector retrieval and evaluation of the retrieval results.
"""
Converts the NQ dataset, as obtained from the RAGGED repository (https://github.com/neulab/ragged), into the format needed for dense vector retrieval and evaluation of the retrieval results.
The 'text_key' argument is the dataset dictionary key that contains the passage text. If you wish to use one of the other datasets from the RAGGED repository, e.g. BioASQ, you can alter it here.
"""

query_data = load_jsonl(input_file, sort_by_id=False)

@@ -128,21 +132,18 @@ def extract_provenance(p, title, page, par):
# --title_json /export/data/vyvo/rag/datasets/kilt_wikipedia/kilt_wikipedia_jsonl/id2title.jsonl

parser = argparse.ArgumentParser()
parser.add_argument('--input_file', type=str)
parser.add_argument('--output_prefix', type=str)
parser.add_argument('--corpus_path', type=str)
parser.add_argument('--title_json', type=str)

args = vars(parser.parse_args())
DATA_PATH = os.environ.get("DATA_PATH")
args['output_prefix'] = f'{DATA_PATH}/{args["output_prefix"]}'
parser.add_argument('-d', '--dataset', type=str, required=True, help="Name of query dataset to process. Expected to follow the RAGGED data format (see README)")
parser.add_argument('--corpus_path', type=str, required=True, help="Where the corpus dataset is stored. Can be the HuggingFace dataset folder, or can be the JSON file dump of the corpus.")
parser.add_argument('--title_json', type=str, required=True, help="The path to the corpus ID to title mapping file, id2title.jsonl")

args = parser.parse_args()
print(args)

DATASET_PATH = os.environ.get("DATASET_PATH")
input_file = f'{DATASET_PATH}/{args.dataset}.jsonl'
output_prefix = f'{DATA_PATH}/{args.dataset}'

# Allow large datasets to be entirely held in memory
datasets.config.IN_MEMORY_MAX_SIZE = 600 * 1e9

if 'kilt' in args['corpus_path']:
text_key = 'paras'

main(**args, text_key=text_key)
main(input_file, output_prefix, args.corpus_path, args.title_json, text_key='contents')
