Skip to content
This repository has been archived by the owner on Aug 26, 2020. It is now read-only.
/ inverted Public archive

inverted-index for level with pagination, sift3/cosine distance, tf-idf ranking, and more

License

Notifications You must be signed in to change notification settings

sergioramos/inverted

Repository files navigation

inverted-index

NPM version Build Status Dependency Status Coverage Status

features

  • pagination
  • facets
  • sift3/cosine distance
  • tf-idf ranking
  • stopword removal
  • stemming
  • diactrics replacement
  • number support

install

npm install [--save/--save-dev] inverted-index

api

var inverted = require('inverted-index')

inverted(db[, options[, getter]])

var level = require('level')('/path/to/my/db')
var sublevel = require('sublevel')

var index = inverted(sublevel(db, 'index'), {
  idf: true,
  stem: true,
  rank: true,
  rank_algorithm: 'cosine',
  facets: true
}, function(id, options, fn){
  level.get(id, options, fn)
})

db

Any level API-compatible instance is accepted.

options

The exemplified options is the default configuration.

idf

When idf is flagged as true, for each token indexed an idf (term frequency–inverse document frequency) is calculated. When querying the index, the terms with lowest idf are fetched first. Example:

"Julie loves me more than Linda loves me"
[
  {
    "word": "julie",
    "idf": 1.791759469228055
  },
  {
    "word": "linda",
    "idf": 1.791759469228055
  },
  {
    "word": "loves",
    "idf": 1.0986122886681098
  }
]

Notice that "me", "more" and "than" are not indexed, because those are considered stopwords.

stem

Whether the text should be stemmed or not. When true, the text is stemmed with the Porter stemming algorithm using NaturalNode/natural. Example:

"Fishing is a way of catching cats, he argued in his arguments"

is tokenized into:

["fishing", "is", "a", "way", "of", "catching", "cats", "he", "argued", "in", "his", "arguments"]

and stemmed into:

["fish", "is", "a", "wai", "of", "catch", "cat", "he", "argued", "in", "his", "argum"]
rank

With ranking enabled, when querying it ranks the results based on a defined algorithm. The rank is done AFTER the fetch, so it only ranks using the result set (that can be parcial depending on the size of matching results) comparing the query with the original indexed text, to the tokens.

So, idf is used to fetch tokens ordered by idf and then ranking is done with the original text of each token's correspondent document comparing with the query text. The "problem" with ranking is that if you have 100000 tokens that match the query tokens, only 100 (can be set on the query options) are fetched for each page and THEN the rank is done. Example:

{
  "1": "Fishing is a way of catching cats, Julie argued in her arguments",
  "2": "Julie loves me more than Linda loves me"
}

querying Julie loves would fetch:

[
  {
    "word": "loves",
    "idf": 1.0986122886681098,
    "id": "2"
  },
  {
    "word": "julie",
    "idf": 1.791759469228055,
    "id": "2"
  },
  {
    "word": "julie",
    "idf": 2.4849066497880004,
    "id": "1"
  }
]

and then rank them:

["2", "1"]
rank_algorithm

Only takes effect when rank is set to true. Valid options are cosine or sift3 using ramitos/cosine and ramitos/sift3.

Haven't made any benchmarks on that, but sift3 should be faster. Will get data on that soon.

facets

Enabling facets is useful to query based on types of models. Example:

{
  "1": {
    "text": "Hank Green",
    "facets": ["user"]
  },
  "2": {
    "text": "John Green",
    "facets": ["user"]
  },
  "3": {
    "text": "Johnnie Walker",
    "facets": ["user"]
  },
  "b": {
    "text": "Johnnie Walker",
    "facets": ["brand"]
  }
}

You can then query "Johnnie" with facets ["brand"] and only get:

["b"]

Notice how the result don't include the user 3 because it doesn't have the brand facet.

You can also combine facets with id's to provide property based queries:

{
  "3": {
    "text": "Johnnie Walker [email protected]",
    "facets": ["user"]
  },
  "3-name": {
    "text": "Johnnie Walker",
    "facets": ["user-name"]
  },
  "3-email": {
    "text": "[email protected]",
    "facets": ["user-email"]
  }
}

And then query the facets ["user-name"] with the text "johnnie" and get:

["3-name"]

And with that you can just split the results to get the id's.

getter

For ranking results, we need to store the original text. When indexing large amounts of data this can have an impact on disk usage. To prevent that, a function can be passed that receives id, options, and callback as the arguments to fetch the original indexed text for that id.

index(text, id[, facets], callback)

put(text, id[, facets], callback)

link(text, id[, facets], callback)

index.index('john green', 1, ['user'], function(err){
  assert(!err)
})
index.put('Fishing is a way of catching cats, he argued in his arguments', 'b', function(err){
  assert(!err)
})
index.link('Julie loves me more than Linda loves me', '1436ebc684b-c1039c76bdb2b054670f3a1256c98650', ['message'], function(err){
  assert(!err)
})

remove(id, callback)

del(id, callback)

unlink(id, callback)

index.remove(1, function(err){
  assert(!err)
})
index.del('b', function(err){
  assert(!err)
})
index.unlink('1436ebc684b-c1039c76bdb2b054670f3a1256c98650', function(err){
  assert(!err)
})

index.search(query[, facets[, options]], callback)

index.query(query[, facets[, options]], callback)

index.search('Fishing', function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.query('Green', ['user'], function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.search('Green', 'user', function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.search('Green', {
  limit: 100, 
  ttl: 1000 * 60 * 60
}, function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.search({
  last: '1436ec2e069-bf55e1ed64540b925e13d6bfd21a543c'
}, function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})

pagination

Every query returns a last parameter. That can be passed to the query/search function to get the next results. When you pass last, you don't need to pass the search query again, because it is saved in the db.

Note that pagination expires in 1h, so if you do a query now, and 2 hours later you want to retrieve the next page, you'll get an error.

The ttl can, however, be tuned in the query options.

license

MIT

About

inverted-index for level with pagination, sift3/cosine distance, tf-idf ranking, and more

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published