FastWordsSearch

License: Apache 2

Dependency

<repository>
    <id>github</id>
    <name>GitHub OWNER Apache Maven Packages</name>
    <url>https://maven.pkg.github.com/maneau/fastwordssearch</url>
</repository>
<dependency>
    <groupId>org.maneau</groupId>
    <artifactId>fastwordssearch</artifactId>
    <version>0.3</version>
</dependency>

Introduction

An algorithm for searching multi-word keywords in text quickly and with low memory usage, even with a large dictionary (1,000,000 keywords).

Inspired by Aho-Corasick, it works in much the same way.

Why create another one?

It is designed to be less memory-hungry while remaining as fast as Aho-Corasick.

How does it work?

The algorithm stores keywords word by word in a trie, keeping nothing more than Strings in Maps. It is fast thanks to its tokenizer, which reads sentences in a single pass and ignores special characters and HTML tags.
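The word-by-word trie idea can be sketched as a nested map of words. The class and method names below (`MiniWordTrie`, `add`, `contains`) are hypothetical and only illustrate the data structure; they are not the library's actual `WordTrie` API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: each trie node is just a Map from the next word to a
// child node, so the whole structure holds nothing but Strings and Maps.
// A reserved key marks the end of a complete keyword.
class MiniWordTrie {
    private static final String END = "\u0000end"; // end-of-keyword marker
    private final Map<String, Object> root = new HashMap<>();

    @SuppressWarnings("unchecked")
    void add(String keyword) {
        Map<String, Object> node = root;
        for (String word : keyword.toLowerCase().split("\\s+")) {
            // Descend one word at a time, creating child maps as needed.
            node = (Map<String, Object>) node.computeIfAbsent(word, k -> new HashMap<String, Object>());
        }
        node.put(END, Boolean.TRUE);
    }

    @SuppressWarnings("unchecked")
    boolean contains(String keyword) {
        Map<String, Object> node = root;
        for (String word : keyword.toLowerCase().split("\\s+")) {
            Object child = node.get(word);
            if (!(child instanceof Map)) return false;
            node = (Map<String, Object>) child;
        }
        return node.containsKey(END);
    }
}
```

Because each level of the trie only keys on whole words, a dictionary with many keywords sharing prefixes ("New York", "New York City", ...) shares nodes instead of duplicating characters.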

Can you prove it?

Just launch the BenchTest and you will get something like this for a Wikipedia HTML page:

| Dictionary size | Aho-Corasick memory | FastWordsSearch memory | Aho-Corasick speed | FastWordsSearch speed |
|-----------------|---------------------|------------------------|--------------------|-----------------------|
| 100             | 274.6 KB            | 69.5 KB                | 191 ms / 100 docs  | 194 ms / 100 docs     |
| 1,000           | 2.3 MB              | 676.7 KB               | 127 ms / 100 docs  | 115 ms / 100 docs     |
| 10,000          | 18.3 MB             | 6 MB                   | 156 ms / 100 docs  | 89 ms / 100 docs      |
| 100,000         | 135.5 MB            | 48 MB                  | 164 ms / 100 docs  | 52 ms / 100 docs      |
| 1,000,000       | 1.1 GB              | 463.6 MB               | 201 ms / 100 docs  | 46 ms / 100 docs      |

For a plain-text page, the results are very similar:

| Dictionary size | Aho-Corasick memory | FastWordsSearch memory | Aho-Corasick speed | FastWordsSearch speed |
|-----------------|---------------------|------------------------|--------------------|-----------------------|
| 100             | 280.4 KB            | 69.8 KB                | 147 ms / 100 docs  | 118 ms / 100 docs     |
| 1,000           | 2.3 MB              | 678.9 KB               | 80 ms / 100 docs   | 71 ms / 100 docs      |
| 10,000          | 18.3 MB             | 6 MB                   | 133 ms / 100 docs  | 158 ms / 100 docs     |
| 100,000         | 135.4 MB            | 48 MB                  | 49 ms / 100 docs   | 40 ms / 100 docs      |
| 1,000,000       | 1.1 GB              | 463.6 MB               | 110 ms / 100 docs  | 30 ms / 100 docs      |

Usage

Basic usage

WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("text where searching some words SOME WORDS dont't");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(21,31,"some words"), tokens.get(0));

It will return "some words" and its position.

Case Insensitive

You may need case-insensitive matching:

WordTrie trie = WordTrie.builder().ignoreCase()
        .addKeyword("Some wordS")
        .build();

List<MatchToken> tokens = trie.parseText("text where searching some Words");

assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(21,31,"Some wordS"), tokens.get(0));

Accent Insensitive

You may need accent-insensitive matching:

WordTrie trie = WordTrie.builder().ignoreAccent()
        .addKeyword("Déjà vu")
        .build();

List<MatchToken> tokens = trie.parseText("it's like a Déja vü");

assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(12,19,"Déjà vu"), tokens.get(0));

Adding keywords after build

Once built, you can still add keywords.

WordTrie trie = WordTrie.builder().ignoreCase()
        .addKeyword("some words")
        .build();

trie.addKeyword("text where");

List<MatchToken> tokens = trie.parseText("text where searching some words");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(8,21,"some words"), tokens.get(0));

Ignoring special characters and punctuation

The library includes a simple but fast string tokenizer which ignores special characters.

WordTrie trie = WordTrie.builder()
        .addKeyword("some words")
        .build();

List<MatchToken> tokens = trie.parseText("this is some-words, and some...words as well");
assertEquals(2, tokens.size());
assertTokenEquals(new MatchToken(8,18,"some words"), tokens.get(0));
assertTokenEquals(new MatchToken(24,36,"some words"), tokens.get(1));

Ignoring html tags

The library includes a simple but fast string tokenizer which ignores HTML tags.

WordTrie trie = WordTrie.builder()
    .addKeyword("some words")
    .build();

List<MatchToken> tokens = trie.parseText("<p class=\"some words\">some <b>words</b></p>");
assertEquals(1, tokens.size());
assertTokenEquals(new MatchToken(22,35,"some words"), tokens.get(0));

Note: the tokenizer only considers ASCII words; it ignores numbers and special characters (-_/).
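A single-pass tokenizer in this spirit might look like the sketch below. This is an illustrative simplification, not the library's actual Tokenizer: it skips everything between `<` and `>` and splits on any non-letter character, so `some-words`, `some...words`, and `some <b>words</b>` all yield the tokens `some` and `words`:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative single-pass tokenizer: walks the text exactly once,
// skipping anything inside <...> (HTML tags, including attributes)
// and splitting on non-letter characters.
class SimpleTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;
        for (char c : text.toCharArray()) {
            if (c == '<') {
                inTag = true;                    // entering an HTML tag
            } else if (c == '>') {
                inTag = false;                   // leaving the tag
            } else if (!inTag && Character.isLetter(c)) {
                current.append(c);               // grow the current token
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

Note how `class="some words"` inside a tag produces no tokens, which matches the HTML example above where the attribute text is not reported as a match.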
