It is necessary to upgrade the weaviate client. #8034

jiandanfeng · 2024-09-06T01:32:44Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

The Weaviate client needs to be upgraded. The Weaviate version currently in use does not support Chinese word segmentation, so Chinese full-text search gives poor results. Upgrading to version 1.24 would make the gse and trigram tokenization methods available; both support Chinese word segmentation and would improve Weaviate's Chinese full-text search.
weaviate tokenization link: https://weaviate.io/developers/weaviate/config-refs/schema#tokenization
weaviate client update link: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration#installation
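
To illustrate why the tokenization method matters here, the following is a minimal pure-Python sketch. It is not Weaviate's implementation: `word_tokens`, `trigram_tokens`, and `matches` are hypothetical helpers that only approximate word-style and trigram-style tokenization. It shows that a Chinese sentence without spaces collapses into a single token under word-style splitting, so a shorter query cannot match it, while rolling trigrams do overlap.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Approximate word-style tokenization: split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def trigram_tokens(text: str) -> list[str]:
    """Approximate trigram tokenization: rolling 3-character windows over the word characters."""
    chars = "".join(re.findall(r"\w+", text.lower()))
    if len(chars) < 3:
        # Too short for a full trigram; keep what is there (assumption for this sketch).
        return [chars] if chars else []
    return [chars[i:i + 3] for i in range(len(chars) - 2)]


def matches(query_tokens: list[str], doc_tokens: list[str]) -> bool:
    """Return True if any query token appears among the document tokens."""
    doc_set = set(doc_tokens)
    return any(token in doc_set for token in query_tokens)


document = "我喜欢全文搜索"  # "I like full-text search"
query = "全文搜索"          # "full-text search"

# Word-style splitting sees the whole sentence as one token, so the query misses.
print(matches(word_tokens(query), word_tokens(document)))        # False
# Rolling trigrams share "全文搜" and "文搜索", so the query hits.
print(matches(trigram_tokens(query), trigram_tokens(document)))  # True
```

A dictionary-based segmenter such as gse would instead split the sentence into actual words, which usually gives more precise matches than trigrams at the cost of needing a language-specific dictionary.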

2. Additional context or comments

No response

3. Can you help us with this feature?

I am interested in contributing to this feature.

dosubot · 2024-09-06T01:34:34Z

I found a similar discussion regarding the Weaviate database version and its support in Dify. You might find it relevant: Why doesn't Weaviate database use the latest version, still using version 1.19, and does Dify currently support higher versions of Weaviate? ^[1].

Currently, the Dify codebase uses Weaviate client version 1.19.0 ^[2]. The Weaviate client is integrated in the WeaviateVector class and utilized in integration tests ^[3]^[4].

For Chinese full-text search, the current implementation includes configurations for a Chinese lexer and uses Jieba for text segmentation ^[5]. The jieba library is used for word segmentation ^[6].

Upgrading to Weaviate version 1.24 to support Chinese word segmentation with gse or trigram methods would indeed improve the full-text search capabilities. If you are interested in contributing to this feature, you might want to start by updating the Weaviate client version in the docker-compose files and adjusting the relevant code to utilize the new segmentation methods.

To continue talking to Dosu, mention @dosu.

Kevin9703 · 2024-09-09T15:18:15Z

I agree that we should upgrade the Weaviate version. Version 1.2x supports different compression methods and gRPC calls, which can significantly improve performance.

fujita-h · 2024-09-29T16:14:21Z

I agree with this suggestion.
With the current (default) tokenizer settings, full-text search for Japanese barely works.
I tested the following steps and confirmed that the performance of full-text search in Japanese improved significantly.

1. Set the Weaviate version to 1.24.25.
2. Add USE_GSE: true to the Weaviate container's environment variables.
3. Edit dify/api/core/rag/datasource/vdb/weaviate/weaviate_vector.py as follows:

    def _default_schema(self, index_name: str) -> dict:
        return {
            "class": index_name,
            "properties": [
                {
                    "name": "text",
                    "dataType": ["text"],
                    "tokenization": "gse", # <- added this
                }
            ],
        }

When implementing, we also need to create a UI that switches the tokenizer depending on whether the document to be registered is in a language suitable for GSE.
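
As a rough sketch of the switching logic such a UI could sit on top of, the snippet below picks a tokenizer from a sample of the document text and feeds it into a schema shaped like the one above. `choose_tokenization` and `build_schema` are hypothetical names, not part of Dify; the character ranges are a simplification, and the fallback value "word" assumes Weaviate's usual default tokenization for text properties.

```python
import re

# Hiragana, Katakana, and CJK Unified Ideographs. This is a simplification:
# it ignores rarer blocks such as CJK extensions.
_CJK_PATTERN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def choose_tokenization(sample_text: str, default: str = "word") -> str:
    """Return "gse" when the sample contains Chinese or Japanese characters."""
    return "gse" if _CJK_PATTERN.search(sample_text) else default


def build_schema(index_name: str, tokenization: str) -> dict:
    """Build a class schema like _default_schema, with a configurable tokenizer."""
    return {
        "class": index_name,
        "properties": [
            {
                "name": "text",
                "dataType": ["text"],
                "tokenization": tokenization,
            }
        ],
    }


schema = build_schema("Example_index", choose_tokenization("全文検索のテスト"))
print(schema["properties"][0]["tokenization"])
```

Automatic detection like this could serve as the default selection, with the UI still letting the user override it for mixed-language documents.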

Scallions · 2024-10-25T07:48:04Z

Named vectors are also needed: https://weaviate.io/developers/weaviate/search/similarity#named-vectors

dosubot · 2024-11-25T16:03:21Z

Hi, @jiandanfeng. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

The issue was about upgrading the Weaviate client to version 1.24 for better Chinese full-text search.
You and other contributors, including @Kevin9703 and @fujita-h, agreed on the benefits of the upgrade.
@fujita-h shared successful test results and suggested UI adjustments for language-specific tokenization.
The issue seems resolved with the community agreeing on the improvements.

Next Steps

Please confirm if this issue is still relevant to the latest version of the Dify repository. If so, you can keep the discussion open by commenting here.
If there are no further updates, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

fujita-h · 2024-12-31T16:34:50Z

I'm focused on #12223.

dosubot bot added the "👻 feat:rag" label (Embedding related issue, like qdrant, weaviate, milvus, vector database) · Sep 6, 2024

crazywoola assigned JohnJyong · Sep 6, 2024

dosubot bot added the "stale" label (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) · Nov 25, 2024

dosubot bot closed this as not planned · Dec 10, 2024

dosubot bot removed the "stale" label · Dec 10, 2024
