text analyzer stemmer filter language support #41998

faileon · 2025-05-21T12:06:38Z

Hello,
are there any plans to support more languages in the stemming filter step of the text analyzer? It would greatly improve keyword/hybrid search in my local language (Czech). Will it eventually be possible to add custom stemmers?

The roadmap has the following section for CY25:
Analyzer Enhancement
Enhance Analyzer with expanded tokenizer support and improved observability

Perhaps it will be included with this feature?

yhmo · 2025-05-22T02:46:01Z

Collaborator

Waiting for @aoiasd to comment.

yhmo · 2025-05-22T03:32:16Z

Collaborator

I just discussed this with @aoiasd. So far, we don't have a stemming filter for Czech. The Milvus analyzer is mainly powered by Tantivy, and Tantivy's stemmers come from the Snowball project. Czech is not in Snowball's list of stemmers: https://github.com/snowballstem/snowball/blob/master/libstemmer/modules.txt
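For context, here is a minimal sketch of what a stemmer filter configuration looks like and why Czech falls through today. It assumes the `analyzer_params` shape described in the Milvus analyzer docs (a `tokenizer`, a `filter` list, and a filter of type `stemmer` with a `language` field). The language set is an assumption based on what Tantivy's stemmer exposes at the time of writing, and `build_analyzer_params` is a hypothetical helper, not part of pymilvus.

```python
# Languages assumed to be accepted by the stemmer filter. This mirrors the
# set exposed by Tantivy's stemmer as far as known; verify it against the
# Milvus version you are running.
ASSUMED_STEMMER_LANGUAGES = frozenset({
    "arabic", "danish", "dutch", "english", "finnish", "french",
    "german", "greek", "hungarian", "italian", "norwegian", "portuguese",
    "romanian", "russian", "spanish", "swedish", "tamil", "turkish",
})


def build_analyzer_params(language: str) -> dict:
    """Hypothetical helper: build analyzer params, adding a stemmer
    filter only when the language is in the assumed supported set."""
    lang = language.strip().lower()
    filters = ["lowercase"]
    if lang in ASSUMED_STEMMER_LANGUAGES:
        filters.append({"type": "stemmer", "language": lang})
    # For an unsupported language such as Czech, the result is
    # tokenization plus lowercasing only, with no stemming.
    return {"tokenizer": "standard", "filter": filters}
```

With this helper, `build_analyzer_params("english")` includes a stemmer filter, while `build_analyzer_params("czech")` only tokenizes and lowercases, which is the gap this discussion is about.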

faileon May 22, 2025
Author

While it is not on the GitHub list, the Snowball project describes a Czech stemmer on its site: https://snowballstem.org/algorithms/czech/stemmer.html

There is also one written directly for Tantivy: https://github.com/testuj-to/tantivy-czech-stemmer

But perhaps the most important is this PR in the Snowball repo: snowballstem/snowball#151

yhmo May 22, 2025
Collaborator

It looks like https://github.com/testuj-to/tantivy-czech-stemmer is a custom stemmer based on Snowball, not an official Snowball library.

The page https://snowballstem.org/algorithms/czech/stemmer.html says: "In March 2012 Jim O’Regan sent us an implementation of Ljiljana Dolamic's Czech stemmer." I would expect a file "czech.sbl" in https://github.com/snowballstem/snowball/tree/master/algorithms, but there is no such file.

It seems the PR snowballstem/snowball#151 was submitted by another developer, since the implementation is different. This PR has been pending since 2021 and was recently assigned to milestone 3.1.0. It is not available in the current version of Tantivy. Once it is released, we can plan to upgrade Tantivy to the new version to support Czech.

ojwb Oct 23, 2025

To correct a minor detail here, snowballstem/snowball#151 is actually derived from the version on the website (though it has evolved to the point where that may not be obvious).

If you compare the outputs from the Snowball Czech stemmer on the website with Dolamic's original Java stemmer (linked in comments at the end), there are significant differences. I started pulling at that thread and also found some clear bugs in the Java versions (e.g. incorrect lengths in checks mean some suffixes can never match, such as buffer.substring( len- 2 ,len).equals("\u0161ti")), then while resolving that I noticed more threads to pull at, and so on. A lot of unravelling later...
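To make the length bug concrete, here is a small Python analogue of the quoted Java check. It is only a sketch: the function names are hypothetical, the sample word is just an illustrative input, and the "fixed" variant is one obvious way to repair the check, not necessarily what the PR does.

```python
SUFFIX = "\u0161ti"  # "šti", three characters long


def buggy_has_suffix(word: str) -> bool:
    # Analogue of buffer.substring(len - 2, len).equals("\u0161ti"):
    # a slice of at most two characters is compared against a
    # three-character suffix, so the comparison can never succeed.
    tail = word[max(len(word) - 2, 0):]
    return tail == SUFFIX


def fixed_has_suffix(word: str) -> bool:
    # Compare against a tail of the same length as the suffix.
    return word.endswith(SUFFIX)
```

For an input ending in "šti", such as "pou\u0161ti", the buggy check returns False and the fixed one returns True, so any stemming rule guarded by the buggy check is dead code.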

The version in the PR works better than either the website version (which is functionally the same as the tantivy version) or the Java original, but there are a few points still to resolve and I'd certainly appreciate input from Czech speakers, especially those working on searching Czech text. I've just summarised one of the two open points over in the Snowball PR to try to make that easier:

snowballstem/snowball#151 (comment)
