Add way to tokenize and segment words on outside of database #159
Description
Problem Statement
Can I use an HF model's tokenizer, or another splitter, to segment words manually (outside or inside the database) and insert the result as the sparse vectors required for BM25 search? This would improve support for CJK languages and multilingual documents, and allow a more modern vocabulary than the built-in dictionaries.
Proposed Solution
When building an index, there should be an option to supply an additional, externally produced sparse vector column. For example: segment the text externally and insert the result into a tsvector column, or into a bm25vector column like the one in the vchord_bm25 extension. This mirrors how dense vectors are handled today, where embeddings are generated externally by a model and inserted into a vector column, without relying on an internal index dictionary.
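As a minimal sketch of the proposed workflow, the helper below takes a token list produced by any external tokenizer (e.g. a Hugging Face tokenizer) and formats it as a PostgreSQL tsvector literal with 1-based positions, so it can be inserted directly without invoking the database's own text search dictionary. The tokenizer call itself is omitted and the surrounding table/column names in the usage note are hypothetical.

```python
from collections import defaultdict

def tokens_to_tsvector(tokens):
    """Build a PostgreSQL tsvector literal (e.g. "'cat':2 'fat':1,3")
    from an externally produced token list, preserving positions.
    Segmentation happens outside the database, so any tokenizer works."""
    positions = defaultdict(list)
    for pos, tok in enumerate(tokens, start=1):
        positions[tok].append(pos)
    parts = []
    for tok in sorted(positions):
        escaped = tok.replace("'", "''")  # escape quotes for the literal
        parts.append(f"'{escaped}':" + ",".join(map(str, positions[tok])))
    return " ".join(parts)

# Example: tokens from a CJK-aware tokenizer, with a repeated term
print(tokens_to_tsvector(["机器", "学习", "机器"]))
# → '学习':2 '机器':1,3
```

The resulting string could then be inserted with something like `INSERT INTO docs (tv) VALUES ($1::tsvector)` (table and column names are illustrative), keeping segmentation entirely outside the database.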
Alternatives Considered
Using zhparser requires installing an additional extension and creating a text search configuration inside the database, and its dictionary is not modern enough.