Sparse vectors are mainly composed of two parts: TF (term frequency) and IDF (inverse document frequency).

When a document is ingested, it is first tokenized and preprocessed. From this, we extract the TF values, which form the sparse vector representation.

The IDF values depend on global corpus statistics and are updated in real time.
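To make the ingest side concrete, here is a minimal sketch, not the actual implementation. The `tokenize` helper, the `CorpusStats` class, and the smoothed IDF formula are all assumptions chosen for illustration; a real engine will use its own analyzer and weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text):
    """Return the sparse TF vector of a text as {term: count}."""
    return dict(Counter(tokenize(text)))


class CorpusStats:
    """Hypothetical running corpus statistics, updated on every ingest."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add(self, tf):
        """Fold one document's TF vector into the global statistics."""
        self.doc_count += 1
        self.doc_freq.update(tf.keys())  # each distinct term counts once per doc

    def idf(self, term):
        # Assumed smoothed variant: log((1 + N) / (1 + df)) + 1
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[term])) + 1


stats = CorpusStats()
doc_vectors = []
for doc in ["the cat sat", "the dog sat down", "a bird sang"]:
    tf = term_frequencies(doc)
    doc_vectors.append(tf)  # stored per document; never rewritten later
    stats.add(tf)           # global stats change with every insert
```

Only the TF vector is stored with the document, so a new insert changes `stats` without touching any vector already stored.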

During search:

1. The query is tokenized, and its TF values are computed.
2. These are combined with the current IDF values.
3. The resulting query vector is then compared against the stored document term frequencies to calculate similarity (e.g., using cosine similarity or dot product).
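The search steps can be sketched roughly as follows. This is an illustration under assumptions, not the engine's real code: `tokenize` and `search` are hypothetical helpers, the IDF formula is one common smoothed variant, and the similarity is a plain dot product. Production systems often use BM25-style weighting with term saturation and length normalization, which is left out here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, doc_tfs, top_k=3):
    """Score stored TF vectors against a TF*IDF-weighted query (hypothetical helper)."""
    n_docs = len(doc_tfs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tf in doc_tfs for term in tf)

    # Query TF values combined with the current IDF values.
    query_tf = Counter(tokenize(query))
    query_vec = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in query_tf.items()
    }

    # Dot product between the query vector and each stored TF vector.
    scores = [
        (doc_id, sum(w * tf.get(term, 0) for term, w in query_vec.items()))
        for doc_id, tf in enumerate(doc_tfs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


docs = ["the cat sat", "the dog sat down", "a bird sang"]
doc_tfs = [Counter(tokenize(d)) for d in docs]  # what would be stored at ingest
results = search("cat sat", doc_tfs)
```

Because the IDF is applied on the query side at search time, the stored document vectors stay valid even as the corpus statistics shift.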

This way, TF captures the importance of terms within a single document, while I…

Answer selected by anima-kit