feat: Add notebook for wikipedia faiss
1 parent d3c0011 · commit 1c71815
Showing 2 changed files with 6 additions and 0 deletions.
@@ -0,0 +1,5 @@
# Build FAISS embeddings from Wikipedia

The original code is from [this Kaggle notebook](https://www.kaggle.com/code/samson8/how-to-create-wikipedia-embeddings/notebook).
Credits to @samson8.
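For reference, here is a minimal usage sketch (not part of this commit) of how the resulting index might be queried, assuming it was built with `sentence-transformers/all-MiniLM-L12-v2` as in the notebook below and saved as `wikipedia_embeddings.index`; the query string is just an example:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Embed the query with the same model that built the index, then search it.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
index = faiss.read_index("wikipedia_embeddings.index")

query_embedding = model.encode(["Who discovered penicillin?"], normalize_embeddings=True)
distances, ids = index.search(query_embedding, 5)  # L2 distances and ids of the 5 nearest abstracts
print(ids[0], distances[0])
```

The returned ids follow the order in which texts were added to the index, so they can be mapped back to the Wikipedia rows that produced them.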
Experiments/informaton-retrieval/wikipedia_faiss/how-to-create-wikipedia-embeddings.ipynb (1 change: 1 addition & 0 deletions)
@@ -0,0 +1 @@
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# **How to create wikipedia embeddings using sentence transormers**","metadata":{}},{"cell_type":"markdown","source":"In this notebook I will show how to create wikipedia embeddings using any sentence-transformer model you want! In addition, the faiss index file will be created for searching between our prompt embeddings and wikipedia embeddings to find similar texts and improve our retrieval!\n\nWe will use [the dataset](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701) shared with us by JJ (@jjinho).","metadata":{}},{"cell_type":"code","source":"!pip install faiss-gpu\n!pip install sentence_transformers\nimport faiss\nimport pickle\nimport pandas as pd\nimport os\nimport numpy as np\nfrom sentence_transformers import SentenceTransformer\nimport subprocess\n\nfrom IPython.display import FileLink, display","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Of course you can directly download it from kaggle, but sometimes I struggle with big files (>1GB), so I use that function.","metadata":{}},{"cell_type":"code","source":"def download_file(path, file_name):\n os.chdir('/kaggle/working/')\n zip = f\"/kaggle/working/{file_name}.zip\"\n command = f\"zip {zip} {path} -r\"\n result = subprocess.run(command, shell=True, capture_output=True, text=True)\n if result.returncode != 0:\n print(\"Unable to run zip command!\")\n print(result.stderr)\n return\n display(FileLink(f'{file_name}.zip'))","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# What are the options?","metadata":{}},{"cell_type":"markdown","source":"Here is the list of pretrained sentence transformer models you can use from [sbert website](https://www.sbert.net/docs/pretrained_models.html):\n\n* sentence-transformers/all-mpnet-base-v2\n* sentence-transformers/multi-qa-MiniLM-L6-cos-v1\n* sentence-transformers/all-distilroberta-v1\n* sentence-transformers/all-MiniLM-L12-v2\n* sentence-transformers/multi-qa-distilbert-cos-v1\n* sentence-transformers/all-MiniLM-L6-v2\n* sentence-transformers/multi-qa-MiniLM-L6-cos-v1\n* sentence-transformers/paraphrase-multilingual-mpnet-base-v2\n* sentence-transformers/paraphrase-albert-small-v2\n* sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\n* sentence-transformers/paraphrase-MiniLM-L3-v2\n* sentence-transformers/distiluse-base-multilingual-cased-v1\n* sentence-transformers/distiluse-base-multilingual-cased-v2\n\nI do not think most of them are useful for our task. 
```python
model_name = "sentence-transformers/all-MiniLM-L12-v2"
sentence_transformer = SentenceTransformer(model_name)
parquet_folder = "/kaggle/input/wikipedia-20230701"
faiss_index_path = "/kaggle/working/wikipedia_embeddings.index"
```

# In case you have enough RAM

```python
document_embeddings = []
for idx, filename in enumerate(os.listdir(parquet_folder)):
    # the number, other and wiki_2023_index files are not what we need
    if filename.endswith(".parquet") and not (filename.endswith("number.parquet") or filename.endswith("other.parquet") or filename.endswith("wiki_2023_index.parquet")):
        print(f"Processing file_id: {idx} - file_name: {filename} ......")
        parquet_path = os.path.join(parquet_folder, filename)
        df = pd.read_parquet(parquet_path)
        df.text = df.text.apply(lambda x: x.split("==")[0])  # trim each article down to its abstract
        sentences = df.text.tolist()
        embeddings = sentence_transformer.encode(sentences, normalize_embeddings=True)
        del df, sentences  # free some memory
        document_embeddings.extend(embeddings)

document_embeddings = np.array(document_embeddings)
index = faiss.IndexFlatL2(document_embeddings.shape[1])
index.add(document_embeddings)
faiss.write_index(index, faiss_index_path)
print(f"Faiss Index Successfully Saved to '{faiss_index_path}'")
```
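As a quick sanity check (not in the original notebook, and assuming the cell above finished and `faiss_index_path` is unchanged), the saved index can be reloaded and inspected:

```python
# Reload the saved index and confirm its size and dimensionality.
index = faiss.read_index(faiss_index_path)
print(index.ntotal, "vectors of dimension", index.d)
```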
# In case you do not have enough RAM

In this case we can do the following:

1. Create the embeddings file by file;
2. Dump them to .pickle files;

```python
your_file_name = 'some_file_name'

document_embeddings = []
for idx, filename in enumerate(os.listdir(parquet_folder)):
    if filename.endswith(f"{your_file_name}.parquet"):
        print(f"Processing file_id: {idx} - file_name: {filename} ......")
        parquet_path = os.path.join(parquet_folder, filename)
        df = pd.read_parquet(parquet_path)
        df.text = df.text.apply(lambda x: x.split("==")[0])  # trim each article down to its abstract
        sentences = df.text.tolist()
        embeddings = sentence_transformer.encode(sentences, normalize_embeddings=True)
        del df, sentences  # free some memory
        document_embeddings.extend(embeddings)

# pickle the list of embeddings for this file
with open(f"embs_{your_file_name}", "wb") as fp:
    pickle.dump(document_embeddings, fp)

download_file(f"/kaggle/working/embs_{your_file_name}", f"embs_{your_file_name}")
```

3. When all the embedding lists are obtained, just unpickle them, add them to one final list, and create the FAISS index file. A sketch of this last step is shown below.

That is all!
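Here is a minimal sketch of that final step (not part of the original notebook). It assumes the per-file pickles keep the `embs_*` naming used above and have been gathered back into `/kaggle/working`:

```python
import glob

# Gather every pickled embedding list produced above, concatenate them,
# and build the final faiss index (reuses pickle, np, faiss and
# faiss_index_path from the earlier cells).
all_embeddings = []
for pickle_path in sorted(glob.glob("/kaggle/working/embs_*")):
    if pickle_path.endswith(".zip"):  # skip the zip archives created by download_file
        continue
    with open(pickle_path, "rb") as fp:
        all_embeddings.extend(pickle.load(fp))

all_embeddings = np.array(all_embeddings)
index = faiss.IndexFlatL2(all_embeddings.shape[1])
index.add(all_embeddings)
faiss.write_index(index, faiss_index_path)
print(f"Final index contains {index.ntotal} vectors")
```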