- pagination
- facets
- sift3/cosine distance
- tf-idf ranking
- stopword removal
- stemming
- diactrics replacement
- number support
npm install [--save/--save-dev] inverted-index
var inverted = require('inverted-index')
var level = require('level')('/path/to/my/db')
var sublevel = require('sublevel')
var index = inverted(sublevel(db, 'index'), {
idf: true,
stem: true,
rank: true,
rank_algorithm: 'cosine',
facets: true
}, function(id, options, fn){
level.get(id, options, fn)
})
Any level API-compatible instance is accepted.
The exemplified options is the default configuration.
When idf
is flagged as true, for each token indexed an idf
(term frequency–inverse document frequency) is calculated. When querying the index, the terms with lowest idf
are fetched first. Example:
"Julie loves me more than Linda loves me"
[
{
"word": "julie",
"idf": 1.791759469228055
},
{
"word": "linda",
"idf": 1.791759469228055
},
{
"word": "loves",
"idf": 1.0986122886681098
}
]
Notice that "me", "more" and "than" are not indexed, because those are considered stopwords
.
Whether the text should be stemmed or not. When true, the text is stemmed with the Porter stemming algorithm using NaturalNode/natural. Example:
"Fishing is a way of catching cats, he argued in his arguments"
is tokenized into:
["fishing", "is", "a", "way", "of", "catching", "cats", "he", "argued", "in", "his", "arguments"]
and stemmed into:
["fish", "is", "a", "wai", "of", "catch", "cat", "he", "argued", "in", "his", "argum"]
With ranking enabled, when querying it ranks the results based on a defined algorithm. The rank is done AFTER the fetch, so it only ranks using the result set (that can be parcial depending on the size of matching results) comparing the query with the original indexed text, to the tokens.
So, idf
is used to fetch tokens ordered by idf
and then ranking is done with the original text of each token's correspondent document comparing with the query text. The "problem" with ranking is that if you have 100000 tokens that match the query tokens, only 100
(can be set on the query options) are fetched for each page and THEN the rank is done. Example:
{
"1": "Fishing is a way of catching cats, Julie argued in her arguments",
"2": "Julie loves me more than Linda loves me"
}
querying Julie loves
would fetch:
[
{
"word": "loves",
"idf": 1.0986122886681098,
"id": "2"
},
{
"word": "julie",
"idf": 1.791759469228055,
"id": "2"
},
{
"word": "julie",
"idf": 2.4849066497880004,
"id": "1"
}
]
and then rank them:
["2", "1"]
Only takes effect when rank
is set to true. Valid options are cosine
or sift3
using ramitos/cosine and ramitos/sift3.
Haven't made any benchmarks on that, but sift3
should be faster. Will get data on that soon.
Enabling facets
is useful to query based on types of models. Example:
{
"1": {
"text": "Hank Green",
"facets": ["user"]
},
"2": {
"text": "John Green",
"facets": ["user"]
},
"3": {
"text": "Johnnie Walker",
"facets": ["user"]
},
"b": {
"text": "Johnnie Walker",
"facets": ["brand"]
}
}
You can then query "Johnnie" with facets
["brand"]
and only get:
["b"]
Notice how the result don't include the user 3
because it doesn't have the brand facet
.
You can also combine facets
with id
's to provide property based queries:
{
"3": {
"text": "Johnnie Walker [email protected]",
"facets": ["user"]
},
"3-name": {
"text": "Johnnie Walker",
"facets": ["user-name"]
},
"3-email": {
"text": "[email protected]",
"facets": ["user-email"]
}
}
And then query the facets
["user-name"]
with the text "johnnie" and get:
["3-name"]
And with that you can just split the results to get the id
's.
For ranking results, we need to store the original text. When indexing large amounts of data this can have an impact on disk usage. To prevent that, a function can be passed that receives id
, options
, and callback
as the arguments to fetch the original indexed text for that id
.
index.index('john green', 1, ['user'], function(err){
assert(!err)
})
index.put('Fishing is a way of catching cats, he argued in his arguments', 'b', function(err){
assert(!err)
})
index.link('Julie loves me more than Linda loves me', '1436ebc684b-c1039c76bdb2b054670f3a1256c98650', ['message'], function(err){
assert(!err)
})
index.remove(1, function(err){
assert(!err)
})
index.del('b', function(err){
assert(!err)
})
index.unlink('1436ebc684b-c1039c76bdb2b054670f3a1256c98650', function(err){
assert(!err)
})
index.search('Fishing', function(err, result){
assert(!err)
assert(result.last)
assert(result.results)
})
index.query('Green', ['user'], function(err, result){
assert(!err)
assert(result.last)
assert(result.results)
})
index.search('Green', 'user', function(err, result){
assert(!err)
assert(result.last)
assert(result.results)
})
index.search('Green', {
limit: 100,
ttl: 1000 * 60 * 60
}, function(err, result){
assert(!err)
assert(result.last)
assert(result.results)
})
index.search({
last: '1436ec2e069-bf55e1ed64540b925e13d6bfd21a543c'
}, function(err, result){
assert(!err)
assert(result.last)
assert(result.results)
})
Every query returns a last parameter. That can be passed to the query
/search
function to get the next results. When you pass last
, you don't need to pass the search query again, because it is saved in the db.
Note that pagination expires in 1h
, so if you do a query now, and 2 hours later you want to retrieve the next page, you'll get an error.
The ttl
can, however, be tuned in the query
options.
MIT