Document preproc (#18)
* documenting data download and renaming
* cleanup + consolidation of data preprocessing scripts

---------

Co-authored-by: Vo, Vy <[email protected]>
aleto1999 and vyaivo authored Nov 1, 2024
1 parent de4ff64 commit a24522a
Showing 3 changed files with 22 additions and 21 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -32,12 +32,12 @@ To calculate evaluation scores for LLM outputs, you will also need `rouge-score`
* `eval.py`: eval file to evaluate generations.
* `tools/`: misc code (generate summaries/snippets, reranking, etc.)


## Setup

### Download Data

To download ASQA and QAMPARI datasets, as well as the DPR wikipedia snapshot used for retrieved documents, please refer to the original [ALCE repository](https://github.com/princeton-nlp/ALCE). After downloading this data, the ALCE .json files and DPR wikipedia .tsv files can be converted to the formats needed for running retrieval with SVS (Scalable Vector Search) by running:
To download the ASQA and QAMPARI datasets, as well as the DPR Wikipedia snapshot used for retrieved documents, please refer to the original [ALCE repository](https://github.com/princeton-nlp/ALCE). After downloading this data, create `asqa`, `qampari`, and `dpr_wiki` subdirectories in the location specified by the `DATASET_PATH` environment variable. Place one corresponding .json eval file (it does not matter which) in the `asqa` and `qampari` directories, respectively, and rename each to `raw.json`. Rename the downloaded DPR Wikipedia dump to `raw.tsv` and place it in the `dpr_wiki` subdirectory. Rename the oracle files included in the ALCE data to `asqa_gold.json` and `qampari_gold.json`, and move them to the location specified by the `DATA_PATH` environment variable. A sketch of the expected layout is shown below.
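As a rough sketch, the setup might look like the following, where the pre-rename file names are placeholders for whichever ALCE eval, oracle, and DPR Wikipedia files you downloaded:

```bash
# Placeholder input file names -- substitute the files you actually downloaded.
mkdir -p $DATASET_PATH/asqa $DATASET_PATH/qampari $DATASET_PATH/dpr_wiki
mv asqa_eval.json $DATASET_PATH/asqa/raw.json
mv qampari_eval.json $DATASET_PATH/qampari/raw.json
mv dpr_wiki_dump.tsv $DATASET_PATH/dpr_wiki/raw.tsv
mv asqa_oracle.json $DATA_PATH/asqa_gold.json
mv qampari_oracle.json $DATA_PATH/qampari_gold.json
```

Finally, the renamed ALCE .json files and DPR Wikipedia .tsv file can be converted to the formats needed for running retrieval with SVS (Scalable Vector Search) by running: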


```bash
python preprocessing/alce/convert_alce_dense.py --dataset {asqa/qampari}
@@ -52,8 +52,7 @@ python preprocessing/alce/convert_alce_colbert.py --dataset {asqa/qampari}

For the NQ dataset and the KILT Wikipedia corpus that supports it, you may follow the dataset download instructions provided by the original [RAGGED repository](https://github.com/neulab/ragged). This includes downloading the preprocessed corpus on [HuggingFace](https://huggingface.co/datasets/jenhsia/ragged). The original repository also provides tools to convert the data for use with ColBERT.
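As one possible way to fetch the preprocessed corpus (assuming the `huggingface-cli` tool is available; the target directory below is illustrative):

```bash
# Illustrative download of the preprocessed RAGGED corpus from HuggingFace.
pip install -U "huggingface_hub[cli]"
huggingface-cli download jenhsia/ragged --repo-type dataset --local-dir $DATASET_PATH/ragged
```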

To preprocess the files for use with our dense retrieval code using SVS, run `preprocessing/convert_nq_dense.py` with the appropriate input arguments.

To preprocess the files for use with our dense retrieval code using SVS, run `preprocessing/convert_nq_dense.py` with the appropriate input arguments.
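For example, a hypothetical invocation might look like the following; the dataset name and paths are illustrative, and the script reads the query file from `$DATASET_PATH/<dataset>.jsonl` and writes its outputs under `$DATA_PATH`:

```bash
# Hypothetical example -- adjust the dataset name and paths for your setup.
# Assumes the NQ query file was saved as $DATASET_PATH/nq.jsonl.
python preprocessing/convert_nq_dense.py \
    --dataset nq \
    --corpus_path $DATASET_PATH/kilt_wikipedia \
    --title_json $DATASET_PATH/kilt_wikipedia/id2title.jsonl
```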

### Set Paths
Before getting started, you must fill in the path variables in `setup/set_paths.sh` for your environment.
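A minimal sketch, assuming `set_paths.sh` exports the `DATASET_PATH` and `DATA_PATH` variables referenced above (the paths are placeholders):

```bash
# Placeholder locations -- point these at your own storage.
export DATASET_PATH=/path/to/raw/datasets
export DATA_PATH=/path/to/processed/data
```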
9 changes: 5 additions & 4 deletions preprocessing/alce/convert_alce_dense.py
@@ -60,13 +60,13 @@ def main(args):
logger.info(f"Reading input file {input_file}")

# convert dpr wiki split (used by alce as docs) to format used to generate vectors for svs
output_file = convert_alce_utils.gen_dpr_wiki_jsonl(dpr_input_file, logger)
output_data = convert_alce_utils.gen_dpr_wiki_jsonl(dpr_input_file, logger)
output_file = os.path.join(
DATASET_PATH,
"dpr_wiki",
"docs.jsonl"
)
save_jsonl(output_file, output_file, logger)
save_jsonl(output_data, output_file, logger)

# generate dpr id2title json for svs retrieval with qampari and asqa
dpr_id2title = convert_alce_utils.gen_dpr_id2title(dpr_input_file)
@@ -81,6 +81,7 @@ def main(args):

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", type=str, help="ALCE dataset to compile eval files for: [asqa, qampari]")
parser.add_argument("-d", "--dataset", type=str, required=True, help="ALCE dataset to compile eval files for: [asqa, qampari]")
args = parser.parse_args()
main(args)
main(args)

27 changes: 14 additions & 13 deletions preprocessing/convert_nq_dense.py
@@ -17,7 +17,11 @@


def main(input_file, output_prefix, corpus_path, title_json, text_key):
# Converts the NQ dataset, as obtained from the RAGGED repository (https://github.com/neulab/ragged), into the format needed for dense vector retrieval and evaluation of the retrieval results.
"""
Converts the NQ dataset, as obtained from the RAGGED repository (https://github.com/neulab/ragged), into the format needed for dense vector retrieval and evaluation of the retrieval results.
The 'text_key' argument is the dataset dictionary key that contains the passage text. If you wish to use one of the other datasets from the RAGGED repository, e.g. BioASQ, you can alter it here.
"""

query_data = load_jsonl(input_file, sort_by_id=False)

@@ -128,21 +132,18 @@ def extract_provenance(p, title, page, par):
# --title_json /export/data/vyvo/rag/datasets/kilt_wikipedia/kilt_wikipedia_jsonl/id2title.jsonl

parser = argparse.ArgumentParser()
parser.add_argument('--input_file', type=str)
parser.add_argument('--output_prefix', type=str)
parser.add_argument('--corpus_path', type=str)
parser.add_argument('--title_json', type=str)

args = vars(parser.parse_args())
DATA_PATH = os.environ.get("DATA_PATH")
args['output_prefix'] = f'{DATA_PATH}/{args["output_prefix"]}'
parser.add_argument('-d', '--dataset', type=str, required=True, help="Name of query dataset to process. Expected to follow the RAGGED data format (see README)")
parser.add_argument('--corpus_path', type=str, required=True, help="Where the corpus dataset is stored. Can be the HuggingFace dataset folder, or can be the JSON file dump of the corpus.")
parser.add_argument('--title_json', type=str, required=True, help="The path to the corpus ID to title mapping file, id2title.jsonl")

args = parser.parse_args()
print(args)

DATASET_PATH = os.environ.get("DATASET_PATH")
input_file = f'{DATASET_PATH}/{args.dataset}.jsonl'
output_prefix = f'{DATA_PATH}/{args.dataset}'

# Allow large datasets to be entirely held in memory
datasets.config.IN_MEMORY_MAX_SIZE = 600 * 1e9

if 'kilt' in args['corpus_path']:
text_key = 'paras'

main(**args, text_key=text_key)
main(input_file, output_prefix, args.corpus_path, args.title_json, text_key='contents')
