Add a way to tokenize and segment words outside of the database #159

@zhcn000000

Description

Problem Statement

Can I use a Hugging Face model's tokenizer, or another splitter, to segment text manually (outside or inside the database) and insert the result as the sparse vectors required for BM25 search? This would improve support for CJK and multilingual documents and allow a more modern vocabulary than the built-in dictionaries.

Proposed Solution

When building an index, provide an option to supply an additional, precomputed sparse-vector column: for example, segment the text externally, convert it to a sparse vector, and insert it into a tsvector column, or into a bm25vector column as in the vchord_bm25 extension. This mirrors how dense vectors are generated externally with an embedding model and then inserted into a vector column, without relying on an internal index dictionary.

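A minimal sketch of the external-segmentation step, assuming token ids come from some outside tokenizer (e.g. a Hugging Face one); the ids below are made up for illustration, and the exact sparse-vector column format would depend on the extension:

```python
from collections import Counter

def to_sparse_vector(token_ids):
    """Collapse a token-id sequence into a sparse {token_id: count} mapping,
    i.e. the term-frequency vector a BM25 index would consume."""
    return dict(Counter(token_ids))

# Token ids as an external tokenizer might emit them (illustrative values).
ids = [101, 2023, 2003, 2023, 102]
vec = to_sparse_vector(ids)
print(vec)  # term frequencies keyed by token id
```

The resulting mapping could then be serialized into whatever literal format the sparse-vector column expects and inserted with an ordinary parameterized `INSERT`.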

Alternatives Considered

Using zhparser requires installing an additional extension and creating a text search configuration inside the database, and its dictionary is not modern enough.

Metadata

Labels: enhancement (New feature or request)