Add the GitHub Packages repository and the dependency to your `pom.xml`:

```xml
<repository>
    <id>github</id>
    <name>GitHub OWNER Apache Maven Packages</name>
    <url>https://maven.pkg.github.com/maneau/fastwordssearch</url>
</repository>
<dependency>
    <groupId>org.maneau</groupId>
    <artifactId>fastwordssearch</artifactId>
    <version>0.3</version>
</dependency>
```

An algorithm for searching multi-word keywords in text, fast and with low memory usage even with a large dictionary (1,000,000 entries).
Inspired by Aho-Corasick, it works much like it.

## Why create another one?

It is designed to be less memory-hungry while remaining as fast as Aho-Corasick.
## How does it work?

This algorithm stores keywords word by word in a trie, keeping nothing more than Strings in Maps. It is fast thanks to its tokenizer, which reads sentences in a single pass and ignores special characters and HTML tags.
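The word-level trie described above can be sketched roughly like this (a simplified illustration of the idea, not the library's actual internals; the class and method names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a word-level trie: each node maps a single word to the next
// node, and a flag marks the end of a complete keyword. Only Strings are
// stored as map keys, which keeps the memory footprint small.
class SimpleWordTrie {
    private final Map<String, SimpleWordTrie> children = new HashMap<>();
    private boolean endOfKeyword;

    // Insert a multi-word keyword, one word per trie level.
    void addKeyword(String keyword) {
        SimpleWordTrie node = this;
        for (String word : keyword.toLowerCase().split("\\s+")) {
            node = node.children.computeIfAbsent(word, w -> new SimpleWordTrie());
        }
        node.endOfKeyword = true;
    }

    // Check whether the given sequence of words is a stored keyword.
    boolean contains(String phrase) {
        SimpleWordTrie node = this;
        for (String word : phrase.toLowerCase().split("\\s+")) {
            node = node.children.get(word);
            if (node == null) {
                return false;
            }
        }
        return node.endOfKeyword;
    }
}
```

Because each level of the trie consumes a whole word rather than a single character, a dictionary of multi-word keywords needs far fewer nodes than a character-level automaton.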
## Can you prove it?

Just launch the BenchTest and you will get something like this for a Wikipedia HTML page:
| Dictionary Size | Aho-Corasick memory | FastWordsSearch memory | Aho-Corasick speed | FastWordsSearch speed |
|---|---|---|---|---|
| 100 | 274.6 KB | 69.5 KB | 191ms /100 docs | 194ms /100 docs |
| 1000 | 2.3 MB | 676.7 KB | 127ms /100 docs | 115ms /100 docs |
| 10000 | 18.3 MB | 6 MB | 156ms /100 docs | 89ms /100 docs |
| 100000 | 135.5 MB | 48 MB | 164ms /100 docs | 52ms /100 docs |
| 1000000 | 1.1 GB | 463.6 MB | 201ms /100 docs | 46ms /100 docs |
For a plain-text page, the results are very similar:
| Dictionary Size | Aho-Corasick memory | FastWordsSearch memory | Aho-Corasick speed | FastWordsSearch speed |
|---|---|---|---|---|
| 100 | 280.4 KB | 69.8 KB | 147ms /100 docs | 118ms /100 docs |
| 1000 | 2.3 MB | 678.9 KB | 80ms /100 docs | 71ms /100 docs |
| 10000 | 18.3 MB | 6 MB | 133ms /100 docs | 158ms /100 docs |
| 100000 | 135.4 MB | 48 MB | 49ms /100 docs | 40ms /100 docs |
| 1000000 | 1.1 GB | 463.6 MB | 110ms /100 docs | 30ms /100 docs |
## Basic usage

```java
WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("text where searching some words SOME WORDS dont't");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(21, 31, "some words"), tokens.get(0));
```

It will return "some words" and its position.
You may need the search to be case-insensitive:
```java
WordTrie trie = WordTrie.builder().ignoreCase()
        .addKeyword("Some wordS")
        .build();

List<MatchToken> tokens = trie.parseText("text where searching some Words");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(21, 31, "Some wordS"), tokens.get(0));
```

You may need the search to be accent-insensitive:
```java
WordTrie trie = WordTrie.builder().ignoreAccent()
        .addKeyword("Déjà vu")
        .build();

List<MatchToken> tokens = trie.parseText("it's like a Déja vü");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(12, 19, "Déjà vu"), tokens.get(0));
```

Once built, you can still add keywords:
```java
WordTrie trie = WordTrie.builder().ignoreCase()
        .addKeyword("some words")
        .build();
trie.addKeyword("text where");

List<MatchToken> tokens = trie.parseText("text where searching some words");
assertEquals(2, tokens.size());
assertTokenEquals(new MatchToken(0, 10, "text where"), tokens.get(0));
assertTokenEquals(new MatchToken(21, 31, "some words"), tokens.get(1));
```

The library contains a simple but fast String Tokenizer which ignores special characters:
```java
WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("this is some-words, and some...words as well");
assertEquals(2, tokens.size());
assertTokenEquals(new MatchToken(8, 18, "some words"), tokens.get(0));
assertTokenEquals(new MatchToken(24, 36, "some words"), tokens.get(1));
```

The library contains a simple but fast String Tokenizer which ignores HTML tags:
```java
WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("<p class=\"some words\">some <b>words</b></p>");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(22, 35, "some words"), tokens.get(0));
```

Note: the Tokenizer only considers ASCII words, treating numbers and special characters (-_/) as separators.
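The single-pass tokenization described above could be sketched like this (an illustrative approximation, not the library's actual Tokenizer; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of a single-pass tokenizer: it skips everything inside
// HTML tags and treats any non-ASCII-letter character (digits, '-', '_',
// '/', punctuation, accented letters, ...) as a word separator.
class SimpleTokenizer {
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;
        for (char c : text.toCharArray()) {
            if (c == '<') {                // entering an HTML tag
                inTag = true;
                flush(current, words);
            } else if (c == '>') {         // leaving an HTML tag
                inTag = false;
            } else if (!inTag) {
                boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
                if (asciiLetter) {
                    current.append(c);     // extend the current word
                } else {
                    flush(current, words); // separator ends the word
                }
            }
        }
        flush(current, words);             // emit a trailing word, if any
        return words;
    }

    private static void flush(StringBuilder sb, List<String> out) {
        if (sb.length() > 0) {
            out.add(sb.toString());
            sb.setLength(0);
        }
    }
}
```

Reading the input character by character in one pass, with no regular expressions or intermediate copies of the text, is what keeps the tokenization cheap.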