feat: Add notebook for wikipedia faiss
1 parent d3c0011 · commit 1c71815
Showing 2 changed files with 6 additions and 0 deletions.
@@ -0,0 +1,5 @@
# Build FAISS embeddings from Wikipedia

The original code is from [this Kaggle notebook](https://www.kaggle.com/code/samson8/how-to-create-wikipedia-embeddings/notebook).
Credits to @samson8.
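For reference, here is a minimal usage sketch (not part of this commit) of how the resulting index might be queried, assuming it was built with `sentence-transformers/all-MiniLM-L12-v2` as in the notebook below and saved as `wikipedia_embeddings.index`; the query string is just an example:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Embed the query with the same model that built the index, then search it.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
index = faiss.read_index("wikipedia_embeddings.index")

query_embedding = model.encode(["Who discovered penicillin?"], normalize_embeddings=True)
distances, ids = index.search(query_embedding, 5)  # L2 distances and ids of the 5 nearest abstracts
print(ids[0], distances[0])
```

The returned ids follow the order in which texts were added to the index, so they can be mapped back to the Wikipedia rows that produced them.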
Experiments/informaton-retrieval/wikipedia_faiss/how-to-create-wikipedia-embeddings.ipynb (1 change: 1 addition & 0 deletions)
@@ -0,0 +1 @@
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# **How to create wikipedia embeddings using sentence transormers**","metadata":{}},{"cell_type":"markdown","source":"In this notebook I will show how to create wikipedia embeddings using any sentence-transformer model you want! In addition, the faiss index file will be created for searching between our prompt embeddings and wikipedia embeddings to find similar texts and improve our retrieval!\n\nWe will use [the dataset](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701) shared with us by JJ (@jjinho).","metadata":{}},{"cell_type":"code","source":"!pip install faiss-gpu\n!pip install sentence_transformers\nimport faiss\nimport pickle\nimport pandas as pd\nimport os\nimport numpy as np\nfrom sentence_transformers import SentenceTransformer\nimport subprocess\n\nfrom IPython.display import FileLink, display","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Of course you can directly download it from kaggle, but sometimes I struggle with big files (>1GB), so I use that function.","metadata":{}},{"cell_type":"code","source":"def download_file(path, file_name):\n os.chdir('/kaggle/working/')\n zip = f\"/kaggle/working/{file_name}.zip\"\n command = f\"zip {zip} {path} -r\"\n result = subprocess.run(command, shell=True, capture_output=True, text=True)\n if result.returncode != 0:\n print(\"Unable to run zip command!\")\n print(result.stderr)\n return\n display(FileLink(f'{file_name}.zip'))","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# What are the options?","metadata":{}},{"cell_type":"markdown","source":"Here is the list of pretrained sentence transformer models you can use from [sbert website](https://www.sbert.net/docs/pretrained_models.html):\n\n* sentence-transformers/all-mpnet-base-v2\n* sentence-transformers/multi-qa-MiniLM-L6-cos-v1\n* sentence-transformers/all-distilroberta-v1\n* sentence-transformers/all-MiniLM-L12-v2\n* sentence-transformers/multi-qa-distilbert-cos-v1\n* sentence-transformers/all-MiniLM-L6-v2\n* sentence-transformers/multi-qa-MiniLM-L6-cos-v1\n* sentence-transformers/paraphrase-multilingual-mpnet-base-v2\n* sentence-transformers/paraphrase-albert-small-v2\n* sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\n* sentence-transformers/paraphrase-MiniLM-L3-v2\n* sentence-transformers/distiluse-base-multilingual-cased-v1\n* sentence-transformers/distiluse-base-multilingual-cased-v2\n\nI do not think most of them are useful for our task. 
```python
model_name = "sentence-transformers/all-MiniLM-L12-v2"
sentence_transformer = SentenceTransformer(model_name)
parquet_folder = "/kaggle/input/wikipedia-20230701"
faiss_index_path = "/kaggle/working/wikipedia_embeddings.index"
```

# In case you have enough RAM

```python
document_embeddings = []
for idx, filename in enumerate(os.listdir(parquet_folder)):
    # the number, other and wiki_2023_index files are not what we need
    if filename.endswith(".parquet") and not (filename.endswith("number.parquet") or filename.endswith("other.parquet") or filename.endswith("wiki_2023_index.parquet")):
        print(f"Processing file_id: {idx} - file_name: {filename} ......")
        parquet_path = os.path.join(parquet_folder, filename)
        df = pd.read_parquet(parquet_path)
        df.text = df.text.apply(lambda x: x.split("==")[0])  # trim each article down to its abstract
        sentences = df.text.tolist()
        embeddings = sentence_transformer.encode(sentences, normalize_embeddings=True)
        del df, sentences  # free some memory
        document_embeddings.extend(embeddings)

document_embeddings = np.array(document_embeddings)
index = faiss.IndexFlatL2(document_embeddings.shape[1])
index.add(document_embeddings)
faiss.write_index(index, faiss_index_path)
print(f"Faiss Index Successfully Saved to '{faiss_index_path}'")
```
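As a quick sanity check (not in the original notebook, and assuming the cell above finished and `faiss_index_path` is unchanged), the saved index can be reloaded and inspected:

```python
# Reload the saved index and confirm its size and dimensionality.
index = faiss.read_index(faiss_index_path)
print(index.ntotal, "vectors of dimension", index.d)
```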
# In case you do not have enough RAM

In this case we can do the following:

1. Create the embeddings file by file;
2. Dump them to .pickle files;

```python
your_file_name = 'some_file_name'

document_embeddings = []
for idx, filename in enumerate(os.listdir(parquet_folder)):
    if filename.endswith(f"{your_file_name}.parquet"):
        print(f"Processing file_id: {idx} - file_name: {filename} ......")
        parquet_path = os.path.join(parquet_folder, filename)
        df = pd.read_parquet(parquet_path)
        df.text = df.text.apply(lambda x: x.split("==")[0])  # trim each article down to its abstract
        sentences = df.text.tolist()
        embeddings = sentence_transformer.encode(sentences, normalize_embeddings=True)
        del df, sentences  # free some memory
        document_embeddings.extend(embeddings)

# pickle the list of embeddings for this file
with open(f"embs_{your_file_name}", "wb") as fp:
    pickle.dump(document_embeddings, fp)

download_file(f"/kaggle/working/embs_{your_file_name}", f"embs_{your_file_name}")
```

3. When all the embedding lists are obtained, just unpickle them, add them to one final list, and create the FAISS index file. A sketch of this last step is shown below.

That is all!
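Here is a minimal sketch of that final step (not part of the original notebook). It assumes the per-file pickles keep the `embs_*` naming used above and have been gathered back into `/kaggle/working`:

```python
import glob

# Gather every pickled embedding list produced above, concatenate them,
# and build the final faiss index (reuses pickle, np, faiss and
# faiss_index_path from the earlier cells).
all_embeddings = []
for pickle_path in sorted(glob.glob("/kaggle/working/embs_*")):
    if pickle_path.endswith(".zip"):  # skip the zip archives created by download_file
        continue
    with open(pickle_path, "rb") as fp:
        all_embeddings.extend(pickle.load(fp))

all_embeddings = np.array(all_embeddings)
index = faiss.IndexFlatL2(all_embeddings.shape[1])
index.add(all_embeddings)
faiss.write_index(index, faiss_index_path)
print(f"Final index contains {index.ntotal} vectors")
```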