Question about how BM25 embed function works #44549
Hi all! I just created a tutorial for full-text (BM25) search using Milvus. There's a concept I'm unsure of and would like clarity on. Let's say I'm following along with setting up full-text search as in https://milvus.io/docs/full-text-search.md. I understand how the BM25 score works, but I want to make sure I understand how the BM25 embed function turns text into sparse vectors.

From following the full-text search and sparse vector (https://milvus.io/docs/sparse_vector.md) tutorials, the following is my understanding. Is it correct? Are there nuances I'm missing?

Milvus does a lot of preprocessing on our documents behind the scenes, including tokenization and stop-word removal. From this preprocessing, it obtains the set of all tokens across all documents. The sparse vector for each document then has a number of dimensions equal to the number of tokens present in that document. Each dimension is an index-value pair: the index of the token in the set of all tokens, and the value for that token. The value depends on the BM25 score for that token and document (is it exactly the score, or some other dependence?).

This seems to make sense, and I can see why these vectors will be "sparse": each document probably contains only a handful of the total set of tokens across all documents, while tokens that are ubiquitous in every document will give scores close to zero.

If this is true, what happens when another data entry is inserted into the collection? Are the sparse vectors for each document updated?
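To make my mental model concrete, here's a toy sketch of the index-value representation I have in mind (whitespace tokenization and raw term counts as stand-ins; this is just my assumption, not Milvus's actual pipeline):

```python
# Toy sketch of my mental model, NOT Milvus internals: whitespace
# tokenization and raw term counts stand in for the real analyzer
# and BM25-derived values.
docs = [
    "information retrieval with sparse vectors",
    "dense vectors for semantic search",
]

# Vocabulary over all tokens in all documents (token -> dimension index).
vocab = {}
for doc in docs:
    for token in doc.split():
        vocab.setdefault(token, len(vocab))

def to_sparse(doc):
    """Map a document to {dimension_index: value} pairs."""
    vec = {}
    for token in doc.split():
        idx = vocab[token]
        vec[idx] = vec.get(idx, 0) + 1
    return vec

print(to_sparse(docs[1]))
# {5: 1, 4: 1, 6: 1, 7: 1, 8: 1}: only tokens that actually occur in
# the document get an entry, hence "sparse".
```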
Replies: 3 comments 16 replies
Sparse vectors are mainly composed of two parts: TF (term frequency) and IDF (inverse document frequency). When a document is ingested, it is first tokenized and preprocessed. From this, we extract the TF values, which form the sparse vector representation. The IDF values depend on global corpus statistics and are updated in real time.

During search:

- The query is tokenized, and its TF values are computed.
- These are combined with the current IDF values.
- The resulting query vector is then compared against the stored document term frequencies to calculate similarity (e.g., using cosine similarity or dot product).

This way, TF captures the importance of terms within a single document, while IDF adjusts for how common or rare those terms are across the entire corpus, yielding a balanced and effective sparse representation for retrieval.
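A rough way to see this split in code (a minimal sketch under assumed defaults, not the actual Milvus implementation: whitespace tokenization, standard k1 = 1.2 and b = 0.75):

```python
import math

# Rough sketch of the TF/IDF split described above (illustrative
# assumptions: whitespace tokenization, standard k1/b defaults; this
# is not the actual Milvus implementation).
k1, b = 1.2, 0.75

corpus = [
    "milvus supports full text search",
    "sparse vectors power bm25 retrieval",
    "full text search uses sparse vectors",
]
tokenized = [doc.split() for doc in corpus]
avgdl = sum(len(d) for d in tokenized) / len(tokenized)

def doc_vector(tokens):
    """Document-side weights: saturated TF only, no IDF baked in."""
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    norm = k1 * (1 - b + b * len(tokens) / avgdl)
    return {t: f * (k1 + 1) / (f + norm) for t, f in tf.items()}

def idf(term):
    """Query-side weight from the current global corpus statistics."""
    n = sum(term in d for d in tokenized)
    return math.log((len(tokenized) - n + 0.5) / (n + 0.5) + 1)

def bm25_score(query, tokens):
    """BM25 as a dot product: query IDF x stored document TF weights."""
    dvec = doc_vector(tokens)
    return sum(idf(t) * dvec.get(t, 0.0) for t in query.split())

print(bm25_score("sparse vectors", tokenized[1]))
```

The stored document vectors carry only the TF part; the IDF part is computed from live corpus statistics and applied to the query at search time.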
https://milvusio.medium.com/full-text-search-in-milvus-whats-under-the-hood-9058016ea84e here is a more detailed explanation of why we don't need to update the sparse vectors in the DB, even after we have inserted tons of new documents (possibly with a completely different term distribution) @anima-kit
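To make the "no update needed" point concrete, here's a toy illustration (assumptions mine, not Milvus internals): growing the corpus changes only the query-time IDF, while a vector stored at insert time is never rewritten.

```python
import math

# Toy illustration (my assumptions, not Milvus internals): the stored
# document vector holds only TF-derived weights, written once at insert
# time, while IDF is looked up fresh at query time.
corpus = [["full", "text", "search"], ["sparse", "vector", "search"]]

def idf(term, docs):
    """Query-side IDF from the current global document frequencies."""
    n = sum(term in d for d in docs)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

stored_vector = {"search": 1, "full": 1, "text": 1}  # written at insert time

print(round(idf("search", corpus), 3))  # ~0.182 with 2 documents

# Insert many new documents with a completely different term
# distribution: only the query-time IDF shifts; stored_vector is
# never touched.
corpus += [["unrelated", "terms"]] * 100
print(round(idf("search", corpus), 3))  # ~3.718, same stored vector
```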
@zhuwenxing please keep an eye on the performance issue