It is necessary to upgrade the weaviate client. #8034

Closed
4 of 5 tasks
jiandanfeng opened this issue Sep 6, 2024 · 6 comments
Assignees
Labels
👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database.

Comments

@jiandanfeng
Contributor

jiandanfeng commented Sep 6, 2024

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

The Weaviate client needs to be upgraded. The Weaviate version currently in use does not support Chinese word segmentation, so Chinese full-text search performs poorly. It needs to be upgraded to version 1.24, where the gse or trigram tokenization method can be used to support Chinese word segmentation and improve Weaviate's Chinese full-text search.
weaviate tokenization link: https://weaviate.io/developers/weaviate/config-refs/schema#tokenization
weaviate client update link: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration#installation
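
To illustrate why a trigram-style tokenizer helps with Chinese, here is a minimal sketch of the general idea. It is not Weaviate's implementation: the function names `whitespace_tokens` and `rolling_trigrams` are hypothetical, and Weaviate's own handling of case, whitespace, and punctuation may differ. Chinese text has no spaces between words, so whitespace splitting yields one long token that a shorter query cannot match, while overlapping three-character windows give the query something to overlap with.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only (a stand-in for word-style tokenization)."""
    return text.lower().split()


def rolling_trigrams(text: str) -> list[str]:
    """Return overlapping three-character windows, ignoring whitespace.

    Illustrative only; not the algorithm Weaviate uses internally.
    """
    chars = [c for c in text.lower() if not c.isspace()]
    if len(chars) < 3:
        # Too short for a full window: keep the text as a single token.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 3]) for i in range(len(chars) - 2)]


document = "向量数据库支持全文搜索"
query = "全文搜索"

# Whitespace splitting sees the whole sentence as one token, so the query
# token does not appear among the document tokens.
word_hit = set(whitespace_tokens(query)) & set(whitespace_tokens(document))

# Every trigram of the query also occurs in the document.
trigram_hit = set(rolling_trigrams(query)) & set(rolling_trigrams(document))

print("whitespace overlap:", word_hit)
print("trigram overlap:", trigram_hit)
```

A dictionary-based segmenter such as gse takes a different route (splitting into real words rather than fixed windows), but the matching problem it addresses is the same.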

2. Additional context or comments

No response

3. Can you help us with this feature?

  • I am interested in contributing to this feature.

dosubot bot commented Sep 6, 2024

I found a similar discussion regarding the Weaviate database version and its support in Dify. You might find it relevant: Why doesn't Weaviate database use the latest version, still using version 1.19, and does Dify currently support higher versions of Weaviate? [1].

Currently, the Dify codebase uses Weaviate client version 1.19.0 [2]. The Weaviate client is integrated in the WeaviateVector class and utilized in integration tests [3][4].

For Chinese full-text search, the current implementation includes configurations for a Chinese lexer and uses the jieba library for word segmentation [5][6].

Upgrading to Weaviate version 1.24 to support Chinese word segmentation with gse or trigram methods would indeed improve the full-text search capabilities. If you are interested in contributing to this feature, you might want to start by updating the Weaviate client version in the docker-compose files and adjusting the relevant code to utilize the new segmentation methods.

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added the 👻 feat:rag label Sep 6, 2024
@Kevin9703
Contributor

I agree that we should upgrade the Weaviate version. Version 1.2x supports different compression methods and gRPC calls, which can significantly improve performance.

@fujita-h
Contributor

fujita-h commented Sep 29, 2024

I agree with this suggestion.
With the current (default) tokenizer settings, full-text search for Japanese barely works.
I tested the following steps and confirmed that the performance of full-text search in Japanese improved significantly.

  1. Set the weaviate version to 1.24.25
  2. Add USE_GSE: true to the weaviate container's environment variables
  3. Edit dify/api/core/rag/datasource/vdb/weaviate/weaviate_vector.py as follows
    def _default_schema(self, index_name: str) -> dict:
        return {
            "class": index_name,
            "properties": [
                {
                    "name": "text",
                    "dataType": ["text"],
                    "tokenization": "gse", # <- added this
                }
            ],
        }

When implementing this, we will also need a UI that switches the tokenizer depending on whether the document being registered is in a language suited to GSE.
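
As a rough sketch of how the tokenizer could be made switchable, the schema builder might take the tokenization as a parameter, with a simple heuristic suggesting a default for the UI. Everything here is an assumption for illustration: `build_schema`, `choose_tokenization`, `looks_cjk`, and the 0.3 threshold are hypothetical names and values, not part of Dify or Weaviate. The only values taken from the thread and the Weaviate docs are the `"gse"` tokenization and `"word"`, which is Weaviate's default for `text` properties.

```python
def _is_cjk_char(ch: str) -> bool:
    """True for CJK ideographs, hiragana, and katakana (a coarse range check)."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x309F   # Hiragana
        or 0x30A0 <= code <= 0x30FF   # Katakana
    )


def looks_cjk(text: str, threshold: float = 0.3) -> bool:
    """Guess whether text is mostly Chinese or Japanese.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cjk_count = sum(1 for c in letters if _is_cjk_char(c))
    return cjk_count / len(letters) >= threshold


def choose_tokenization(sample_text: str) -> str:
    """Suggest "gse" for Chinese/Japanese samples, otherwise "word"."""
    return "gse" if looks_cjk(sample_text) else "word"


def build_schema(index_name: str, tokenization: str = "word") -> dict:
    """Same shape as the schema above, with the tokenization made configurable."""
    return {
        "class": index_name,
        "properties": [
            {
                "name": "text",
                "dataType": ["text"],
                "tokenization": tokenization,
            }
        ],
    }


schema = build_schema("Example_index", choose_tokenization("全文検索のテストです"))
print(schema["properties"][0]["tokenization"])
```

In practice the user's explicit choice in the UI should probably override the heuristic, since mixed-language documents are common and a character-ratio check can guess wrong.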

@Scallions

dosubot bot commented Nov 25, 2024

Hi, @jiandanfeng. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • The issue was about upgrading the Weaviate client to version 1.24 for better Chinese full-text search.
  • You and other contributors, including @Kevin9703 and @fujita-h, agreed on the benefits of the upgrade.
  • @fujita-h shared successful test results and suggested UI adjustments for language-specific tokenization.
  • The issue seems resolved with the community agreeing on the improvements.

Next Steps

  • Please confirm if this issue is still relevant to the latest version of the Dify repository. If so, you can keep the discussion open by commenting here.
  • If there are no further updates, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale label Nov 25, 2024
@dosubot dosubot bot closed this as not planned Dec 10, 2024
@dosubot dosubot bot removed the stale label Dec 10, 2024
@fujita-h
Contributor

I'm focused on #12223.

Development

No branches or pull requests

5 participants