FastWordsSearch

License: Apache 2

Dependency

<repository>
    <id>github</id>
    <name>GitHub OWNER Apache Maven Packages</name>
    <url>https://maven.pkg.github.com/maneau/fastwordssearch</url>
</repository>
<dependency>
    <groupId>org.maneau</groupId>
    <artifactId>fastwordssearch</artifactId>
    <version>0.3</version>
</dependency>

Introduction

An algorithm for searching multi-word keywords in text quickly and with low memory usage, even with a large dictionary (1,000,000 keywords).

Inspired by Aho-Corasick, it works in much the same way.

Why create another one?

It is designed to be less memory-hungry while remaining as fast as Aho-Corasick.

How does it work?

The algorithm stores keywords word by word in a trie, keeping nothing more than Strings in Maps. It is fast thanks to its tokenizer, which reads sentences in a single pass and ignores special characters and HTML tags.
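The word-by-word trie idea can be sketched as a nested map of words. The class and method names below (`MiniWordTrie`, `add`, `contains`) are hypothetical and only illustrate the data structure; they are not the library's actual `WordTrie` API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: each trie node is just a Map from the next word to a
// child node, so the whole structure holds nothing but Strings and Maps.
// A reserved key marks the end of a complete keyword.
class MiniWordTrie {
    private static final String END = "\u0000end"; // end-of-keyword marker
    private final Map<String, Object> root = new HashMap<>();

    @SuppressWarnings("unchecked")
    void add(String keyword) {
        Map<String, Object> node = root;
        for (String word : keyword.toLowerCase().split("\\s+")) {
            // Descend one word at a time, creating child maps as needed.
            node = (Map<String, Object>) node.computeIfAbsent(word, k -> new HashMap<String, Object>());
        }
        node.put(END, Boolean.TRUE);
    }

    @SuppressWarnings("unchecked")
    boolean contains(String keyword) {
        Map<String, Object> node = root;
        for (String word : keyword.toLowerCase().split("\\s+")) {
            Object child = node.get(word);
            if (!(child instanceof Map)) return false;
            node = (Map<String, Object>) child;
        }
        return node.containsKey(END);
    }
}
```

Because each level of the trie only keys on whole words, a dictionary with many keywords sharing prefixes ("New York", "New York City", ...) shares nodes instead of duplicating characters.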

Can you prove it?

Just launch the BenchTest and you will get something like this for a Wikipedia HTML page:

| Dictionary size | Aho-Corasick memory | FastWordsSearch memory | Aho-Corasick speed | FastWordsSearch speed |
|-----------------|---------------------|------------------------|--------------------|-----------------------|
| 100             | 274.6 KB            | 69.5 KB                | 191 ms / 100 docs  | 194 ms / 100 docs     |
| 1,000           | 2.3 MB              | 676.7 KB               | 127 ms / 100 docs  | 115 ms / 100 docs     |
| 10,000          | 18.3 MB             | 6 MB                   | 156 ms / 100 docs  | 89 ms / 100 docs      |
| 100,000         | 135.5 MB            | 48 MB                  | 164 ms / 100 docs  | 52 ms / 100 docs      |
| 1,000,000       | 1.1 GB              | 463.6 MB               | 201 ms / 100 docs  | 46 ms / 100 docs      |

For a plain-text page, the results are very similar:

| Dictionary size | Aho-Corasick memory | FastWordsSearch memory | Aho-Corasick speed | FastWordsSearch speed |
|-----------------|---------------------|------------------------|--------------------|-----------------------|
| 100             | 280.4 KB            | 69.8 KB                | 147 ms / 100 docs  | 118 ms / 100 docs     |
| 1,000           | 2.3 MB              | 678.9 KB               | 80 ms / 100 docs   | 71 ms / 100 docs      |
| 10,000          | 18.3 MB             | 6 MB                   | 133 ms / 100 docs  | 158 ms / 100 docs     |
| 100,000         | 135.4 MB            | 48 MB                  | 49 ms / 100 docs   | 40 ms / 100 docs      |
| 1,000,000       | 1.1 GB              | 463.6 MB               | 110 ms / 100 docs  | 30 ms / 100 docs      |

Usage

Basic usage

WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("text where searching some words SOME WORDS dont't");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(21,31,"some words"), tokens.get(0));

It will return "some words" and its position.

Case Insensitive

You may need case-insensitive matching:

WordTrie trie = WordTrie.builder().ignoreCase()
        .addKeyword("Some wordS")
        .build();

List<MatchToken> tokens = trie.parseText("text where searching some Words");

assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(21,31,"Some wordS"), tokens.get(0));

Accent Insensitive

You may need accent-insensitive matching:

WordTrie trie = WordTrie.builder().ignoreAccent()
        .addKeyword("Déjà vu")
        .build();

List<MatchToken> tokens = trie.parseText("it's like a Déja vü");

assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(12,19,"Déjà vu"), tokens.get(0));

Adding keywords after build

Once built, you can still add keywords.

WordTrie trie = WordTrie.builder().ignoreCase()
        .addKeyword("some words")
        .build();

trie.addKeyword("text where");

List<MatchToken> tokens = trie.parseText("text where searching some words");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(8,21,"some words"), tokens.get(0));

Ignoring special characters and punctuation

The library includes a simple but fast string tokenizer which ignores special characters.

WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("this is some-words, and some...words as well");
assertEquals(2, tokens.size());
assertTokenEquals(new MatchToken(8,18,"some words"), tokens.get(0));
assertTokenEquals(new MatchToken(24,36,"some words"), tokens.get(1));

Ignoring html tags

The library includes a simple but fast string tokenizer which ignores HTML tags.

WordTrie trie = WordTrie.builder()
    .addKeyword("some words")
    .build();

List<MatchToken> tokens = trie.parseText("<p class=\"some words\">some <b>words</b></p>");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(22,35,"some words"), tokens.get(0));

Note: the tokenizer only considers ASCII words; it ignores numbers and special characters (-_/).
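A single-pass tokenizer in this spirit might look like the sketch below. This is an illustrative simplification, not the library's actual Tokenizer: it skips everything between `<` and `>` and splits on any non-letter character, so `some-words`, `some...words`, and `some <b>words</b>` all yield the tokens `some` and `words`:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative single-pass tokenizer: walks the text exactly once,
// skipping anything inside <...> (HTML tags, including attributes)
// and splitting on non-letter characters.
class SimpleTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;
        for (char c : text.toCharArray()) {
            if (c == '<') {
                inTag = true;                    // entering an HTML tag
            } else if (c == '>') {
                inTag = false;                   // leaving the tag
            } else if (!inTag && Character.isLetter(c)) {
                current.append(c);               // grow the current token
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

Note how `class="some words"` inside a tag produces no tokens, which matches the HTML example above where the attribute text is not reported as a match.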
