-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathNOTES
103 lines (72 loc) · 2.55 KB
/
NOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
API
Simple C
Pull and Push based reads
Single-threaded string chunk stream writes
Pluggable SAXish document parser
Nestedness
docs have names.. __field:root, __field:root.inner, etc
does that mean INSIDE is an operator?
does that mean that there should be push/pop stack commands?
what about fragmented/interleaved docs?
stop/resume doc?
Ruby
Tests are in Ruby
A network api is provided in eventmachine (want to port to mbuf/c)
JSON Interface
Queries:
[{"key": "\"hello world\" goodbye"}, ...]
Query
ORs of (AND | NEAR(N))s of (NOT? key:value)
[title:foo NEAR(N) title:bar] means title:foo AND title:bar AND pos(bar) > pos(foo) AND pos(bar) <= pos(foo) + N
No further nesting allowed--use distributivity and De Morgan's law
Results
Pull results are provided individually, with a pointer to the state necessary to generate the next result
Push results are pushed at a callback handler
Protocol
Simple, line-oriented, tcp/stream protocol, pipelined.
command "args"\n
{} // json
all responses are json
commands are stateful/modal (e.g. you can add command 1x, then stream documents)
Parser
Documents have finite size, so the parser allocates O(1) space ahead of time.
Rough C call stack:
- engine_consume
- engine_append
- parser->consume
- engine_field_found
- engine_term_found
- engine_document_found
- engine_percolate
- engine_index
Ruby version (with mixed C)
- engine = Engine.new(parser) # establishes bidi reference
# (reusing parsers is Exception)
- engine.consume(str)
- engine_consume
- engine_append
- parser#consume(str) # no super, just full custom code
- parser#field_found()
- engine#field_found()
- engine_field_found
- parser#term_found()
- engine#term_found()
- engine_term_found
- parser#document_found()
- engine#document_found()
- engine_document_found
- engine_percolate
- engine_index
Storage
Three regions
Document stream persisted
Term index
Posting lists
Stream, plists are ring buffer of subregions
Term index is a LRWritten Hash table.
Posting lists
Percolation tricks
Document terms are added to a hash table
Queries are structured as a tree of hash tables. This is consistent because clauses are sortable--[a AND b] and [b AND a] will both evaluate to the same path through the tree.
DFS matches out of the tree.
A AND B OR C AND D OR E