feat: Weaviate supports Chinese bm25 #12223 #12258

Status: Open (2 commits to be merged into base: main)


@dajianguo (Contributor) commented Dec 31, 2024

Summary

The Weaviate version used so far does not support BM25 for Chinese text.
I checked Weaviate's official website and found that Weaviate added support for jieba word segmentation after 1.24. Earlier versions segment words on spaces, which does not work for Chinese. My proposed changes are listed below; if there are no problems, I will submit them. The tokenizer is configurable in docker-compose.yaml.
1. Upgrade the Weaviate version.
2. Change the default tokenizer.
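
To illustrate the idea behind the second change, here is a minimal sketch rather than the actual diff in this PR. It assumes the collection schema is built as a plain dict in the shape of Weaviate's REST schema, where each property can carry a `tokenization` field. The function name `build_text_schema`, the class name `ExampleCollection`, and the environment variable `WEAVIATE_TOKENIZATION` are hypothetical names chosen for this example.

```python
import os
from typing import Optional

# A subset of the tokenization values Weaviate documents for text properties.
VALID_TOKENIZATIONS = {"word", "lowercase", "whitespace", "field", "trigram", "gse"}


def build_text_schema(class_name: str, tokenization: Optional[str] = None) -> dict:
    """Build a class definition whose text property uses the chosen tokenizer."""
    # WEAVIATE_TOKENIZATION is a hypothetical variable name used only in this sketch.
    chosen = tokenization or os.environ.get("WEAVIATE_TOKENIZATION", "word")
    if chosen not in VALID_TOKENIZATIONS:
        raise ValueError(f"unsupported tokenization: {chosen}")
    return {
        "class": class_name,
        "properties": [
            {"name": "text", "dataType": ["text"], "tokenization": chosen},
        ],
    }


# Splitting on whitespace leaves an unspaced Chinese sentence as one token,
# so keyword search cannot match the individual words inside it.
sentence = "向量数据库支持中文检索"
whitespace_tokens = sentence.split()

schema = build_text_schema("ExampleCollection", "gse")
```

Weaviate's documentation also notes that the gse tokenizer has to be enabled on the server before a schema can use it, which is presumably where the docker-compose.yaml setting comes in.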

Resolves #12223

Reference Links:
https://weaviate.io/developers/weaviate/config-refs/schema#gse-and-trigram-tokenization-methods
https://pkg.go.dev/github.com/go-ego/gse#section-readme

Screenshots

Before | After
... | (image)

Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

@dosubot (bot) added the labels size:M (This PR changes 30-99 lines, ignoring generated files.) and 👻 feat:rag (Embedding related issue, like qdrant, weaviate, milvus, vector database.) on Dec 31, 2024
@crazywoola requested a review from JohnJyong on January 1, 2025, 05:31