Question about how BM25 embed function works #44549
Hi all! I just created a tutorial for full-text (BM25) search using Milvus. There's a concept I'm unsure of and would like clarity on. Let's say I'm following along with setting up full-text search as in https://milvus.io/docs/full-text-search.md. I understand how the BM25 score works, but I want to make sure I understand how the BM25 embed function turns text into sparse vectors.

From following the full-text search and sparse vector (https://milvus.io/docs/sparse_vector.md) tutorials, the following is my understanding. Is it correct? Are there nuances I'm missing?

Milvus does a lot of preprocessing on our documents behind the scenes, including tokenization and stop-word removal. From this preprocessing, it obtains the set of all tokens across all documents. The sparse vector for each document then has a number of dimensions equal to the number of tokens present in that document. Each dimension is an index-value pair: the index of the token in the set of all tokens, and the value for that token. The value depends on the BM25 score for that token and document (is it exactly the score, or some other dependence?).

This seems to make sense, and I can see why these vectors will be "sparse": each document probably contains only a handful of the total set of tokens across all documents, while tokens that are ubiquitous in every document will give scores close to zero.

If this is true, what happens when another data entry is inserted into the collection? Are the sparse vectors for each document updated?
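To make my mental model concrete, here's a toy sketch of the index-value representation I have in mind (whitespace tokenization and raw term counts as stand-ins; this is just my assumption, not Milvus's actual pipeline):

```python
# Toy sketch of my mental model, NOT Milvus internals: whitespace
# tokenization and raw term counts stand in for the real analyzer
# and BM25-derived values.
docs = [
    "information retrieval with sparse vectors",
    "dense vectors for semantic search",
]

# Vocabulary over all tokens in all documents (token -> dimension index).
vocab = {}
for doc in docs:
    for token in doc.split():
        vocab.setdefault(token, len(vocab))

def to_sparse(doc):
    """Map a document to {dimension_index: value} pairs."""
    vec = {}
    for token in doc.split():
        idx = vocab[token]
        vec[idx] = vec.get(idx, 0) + 1
    return vec

print(to_sparse(docs[1]))
# {5: 1, 4: 1, 6: 1, 7: 1, 8: 1}: only tokens that actually occur in
# the document get an entry, hence "sparse".
```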
Replies: 3 comments 16 replies
Sparse vectors are mainly composed of two parts: TF (term frequency) and IDF (inverse document frequency). When a document is ingested, it is first tokenized and preprocessed. From this, we extract the TF values, which form the sparse vector representation. The IDF values depend on global corpus statistics and are updated in real time.

During search:

- The query is tokenized, and its TF values are computed.
- These are combined with the current IDF values.
- The resulting query vector is then compared against the stored document term frequencies to calculate similarity (e.g., using cosine similarity or dot product).

This way, TF captures the importance of terms within a single document, while IDF adjusts for how common or rare those terms are across the entire corpus, yielding a balanced and effective sparse representation for retrieval.
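A rough way to see this split in code (a minimal sketch under assumed defaults, not the actual Milvus implementation: whitespace tokenization, standard k1 = 1.2 and b = 0.75):

```python
import math

# Rough sketch of the TF/IDF split described above (illustrative
# assumptions: whitespace tokenization, standard k1/b defaults; this
# is not the actual Milvus implementation).
k1, b = 1.2, 0.75

corpus = [
    "milvus supports full text search",
    "sparse vectors power bm25 retrieval",
    "full text search uses sparse vectors",
]
tokenized = [doc.split() for doc in corpus]
avgdl = sum(len(d) for d in tokenized) / len(tokenized)

def doc_vector(tokens):
    """Document-side weights: saturated TF only, no IDF baked in."""
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    norm = k1 * (1 - b + b * len(tokens) / avgdl)
    return {t: f * (k1 + 1) / (f + norm) for t, f in tf.items()}

def idf(term):
    """Query-side weight from the current global corpus statistics."""
    n = sum(term in d for d in tokenized)
    return math.log((len(tokenized) - n + 0.5) / (n + 0.5) + 1)

def bm25_score(query, tokens):
    """BM25 as a dot product: query IDF x stored document TF weights."""
    dvec = doc_vector(tokens)
    return sum(idf(t) * dvec.get(t, 0.0) for t in query.split())

print(bm25_score("sparse vectors", tokenized[1]))
```

The stored document vectors carry only the TF part; the IDF part is computed from live corpus statistics and applied to the query at search time.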
https://milvusio.medium.com/full-text-search-in-milvus-whats-under-the-hood-9058016ea84e here is a more detailed explanation of why we don't need to update the sparse vectors in the DB, even after we have inserted tons of new documents (possibly with a completely different term distribution) @anima-kit
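To make the "no update needed" point concrete, here's a toy illustration (assumptions mine, not Milvus internals): growing the corpus changes only the query-time IDF, while a vector stored at insert time is never rewritten.

```python
import math

# Toy illustration (my assumptions, not Milvus internals): the stored
# document vector holds only TF-derived weights, written once at insert
# time, while IDF is looked up fresh at query time.
corpus = [["full", "text", "search"], ["sparse", "vector", "search"]]

def idf(term, docs):
    """Query-side IDF from the current global document frequencies."""
    n = sum(term in d for d in docs)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

stored_vector = {"search": 1, "full": 1, "text": 1}  # written at insert time

print(round(idf("search", corpus), 3))  # ~0.182 with 2 documents

# Insert many new documents with a completely different term
# distribution: only the query-time IDF shifts; stored_vector is
# never touched.
corpus += [["unrelated", "terms"]] * 100
print(round(idf("search", corpus), 3))  # ~3.718, same stored vector
```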
@zhuwenxing please keep an eye on the performance issue