
filter #154

Open
shaleenji wants to merge 33 commits into master from filter_pass

Conversation

@shaleenji
Collaborator

No description provided.

@shaleenji shaleenji changed the title filter GetIndexInfo made light on storage and memory Mar 30, 2026
@shaleenji shaleenji changed the title GetIndexInfo made light on storage and memory filter Mar 30, 2026
@github-actions

github-actions Bot commented Apr 6, 2026

VectorDB Benchmark - Ready To Run

CI Passed ([lint + unit tests](https://github.com/endee-io/endee/actions/runs/24508497817)) - benchmark options unlocked.

Post one of the commands below. Only members with write access can trigger runs.


Available Modes

| Mode | Command | What runs |
|------|---------|-----------|
| Dense | `/correctness_benchmarking dense` | HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS |
| Hybrid | `/correctness_benchmarking hybrid` | Dense + sparse BM25 fusion · same suite + fusion latency overhead |

Infrastructure

| Server | Role | Instance |
|--------|------|----------|
| Endee Server | Endee VectorDB — code from this branch | t2.large |
| Benchmark Server | Benchmark runner | t3a.large |

Both servers start on demand and are always terminated after the run — pass or fail.


How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server Create  →  this branch's code deployed  →  Endee starts in chosen mode
3. Benchmark Server Create  →  benchmark suite transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

@shaleenji
Collaborator Author

Original performance numbers for int filter and label filter

Screenshot 2026-04-13 at 16 11 16 (1) Screenshot 2026-04-14 at 17 15 01

@github-actions

github-actions Bot commented Apr 15, 2026

VectorDB Benchmark — Failed

Triggered by @shaleenji · Commit ``

| Step | Status |
|------|--------|
| Provision Servers | Up |
| Deploy Endee Server | Done |
| Run Benchmark | Failed |
| Results | See reason below |
| Teardown | Done ✓ |

@shaleenji
Collaborator Author

shaleenji commented Apr 15, 2026

New performance numbers for int filter and label filter

Screenshot 2026-04-15 at 12 08 07 Screenshot 2026-04-15 at 12 25 41

@shaleenji
Collaborator Author

shaleenji commented Apr 16, 2026

1M indexing time on an 8-CPU, 30 GB OVH B3-32 machine:

int filter: 981 seconds
label filter: 1025 seconds

@shaleenji
Collaborator Author

shaleenji commented Apr 16, 2026

The above code has two purposes:

  1. Parses all filters up front, before anything is inserted into the DB
  2. Improves performance by batching the writes into a single MDBX transaction
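The parse-everything-first, commit-once pattern described above can be sketched as follows. This is a minimal illustration, not the actual Endee code: `ParsedFilter` and `Txn` are hypothetical stand-ins (a real build would use libmdbx transaction handles), but the control flow — reject the whole batch on the first malformed filter, then stage all writes inside one transaction with a single commit — matches the two purposes listed.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical parsed-filter record; the real types live in src/filter/.
struct ParsedFilter { std::string field; std::string value; };

// Stand-in for a batched write transaction (MDBX would provide the real one).
struct Txn {
    std::vector<ParsedFilter> staged;
    bool committed = false;
    void put(const ParsedFilter& f) { staged.push_back(f); }
    void commit() { committed = true; }
};

// Parse every filter string up front; reject the whole batch on the first
// malformed entry, so nothing is half-written to the DB.
std::optional<std::vector<ParsedFilter>>
parse_all(const std::vector<std::string>& raw) {
    std::vector<ParsedFilter> out;
    for (const auto& s : raw) {
        auto pos = s.find('=');
        if (pos == std::string::npos || pos == 0) return std::nullopt;
        out.push_back({s.substr(0, pos), s.substr(pos + 1)});
    }
    return out;
}

// All writes happen inside one transaction, committed once.
bool insert_batch(Txn& txn, const std::vector<std::string>& raw) {
    auto parsed = parse_all(raw);
    if (!parsed) return false;   // nothing was written on parse failure
    for (const auto& f : *parsed) txn.put(f);
    txn.commit();                // single commit for the whole batch
    return true;
}
```

Besides the transaction-overhead savings, this ordering is what makes the parse step safe: a malformed filter in the middle of a batch can no longer leave earlier entries already persisted.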

1M indexing time on an 8-CPU, 30 GB OVH B3-32 machine:

int filter: 873 seconds (~11% reduction in indexing time)
label filter: 999 seconds

@github-actions

github-actions Bot commented Apr 23, 2026

VectorDB Benchmark - Ready To Run

CI Passed ([lint + unit tests](https://github.com/endee-io/endee/actions/runs/25796643394)) - benchmark options unlocked.

Post one of the commands below. Only members with write access can trigger runs.


Available Modes

| Mode | Command | What runs |
|------|---------|-----------|
| Dense | `/correctness_benchmarking dense` | HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS |
| Hybrid | `/correctness_benchmarking hybrid` | Dense + sparse BM25 fusion · same suite + fusion latency overhead |

Infrastructure

| Server | Role | Instance |
|--------|------|----------|
| Endee Server | Endee VectorDB — code from this branch | t2.large |
| Benchmark Server | Benchmark runner | t3a.large |

Both servers start on demand and are always terminated after the run — pass or fail.


How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server Create  →  this branch's code deployed  →  Endee starts in chosen mode
3. Benchmark Server Create  →  benchmark suite transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

@shaleenji
Collaborator Author

Things to do:
Note: key encoding is string-concatenated: field:value, field:id, and field:. Fields or values containing : can collide or break prefix scans. Length-prefixed or tuple-encoded MDBX keys would be safer.
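A minimal sketch of the length-prefixed encoding suggested above (illustrative only — the function name and 4-byte big-endian length prefix are assumptions, not the actual Endee key layout). Each component carries its own length, so a `:` inside a field or value can no longer be confused with a separator:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Length-prefixed tuple key: each component is written as a 4-byte
// big-endian length followed by its raw bytes. Unlike "field:value"
// concatenation, ("a:b", "c") and ("a", "b:c") produce distinct keys.
std::string encode_key(const std::string& field, const std::string& value) {
    auto put = [](std::string& out, const std::string& part) {
        uint32_t n = static_cast<uint32_t>(part.size());
        for (int shift = 24; shift >= 0; shift -= 8)
            out.push_back(static_cast<char>((n >> shift) & 0xFF));
        out += part;
    };
    std::string key;
    put(key, field);
    put(key, value);
    return key;
}
```

The length prefix also keeps prefix scans well-defined: all keys for a field share the exact byte prefix `len(field) + field`, regardless of what characters the field contains.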

@shaleenji
Collaborator Author

Filter writes are not atomic with vector/meta/HNSW/WAL writes. A failure can leave vector metadata, filter indexes, sparse storage, and HNSW out of sync.

@shaleenji shaleenji mentioned this pull request May 8, 2026
@shaleenji
Collaborator Author

  1. Numeric bucket: fix the duplicate-heavy cliff bug.

When >65,536 ids shared the same numeric filter value, Bucket::serialize
truncated the on-disk count to a uint16_t and corrupted the bucket --
recall collapsed to zero past that boundary (reproducible with
tests/repo_filter.py).

The fix has five parts in src/filter/numeric_index.hpp:

  • Bucket::add now caps deltas/ids at MAX_SIZE for delta_32==0
    duplicates and routes the excess id into summary_bitmap only.
    Cardinality is preserved, on-disk arrays no longer grow without
    bound.

  • The on-disk count field is removed from Bucket::serialize entirely.
    Bucket::deserialize derives nr_array_entries from the residual
    bytes after the bitmap, so there is no count to overflow.

  • Bucket::deserialize stays backward compatible with the old on-disk
    format via a modulus check on the residual: existing DBs written
    by the old code keep working with no migration.

  • range() gains two correctness branches: a legacy-salvage branch
    for cliff-corrupted buckets (ids.empty() but bitmap non-empty),
    and a bitmap-only-inclusion branch in the partial-overlap path
    for buckets with cardinality > ids.size().

  • range() also gains a coarse full-coverage fast path: when a
    bucket's [base, base+MAX_DELTA] extent is wholly inside the query
    range, skip the deltas/ids deserialize and union just the bitmap
    via Bucket::read_summary_bitmap.
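The truncation cliff and the count-free replacement are easy to demonstrate in isolation. A minimal sketch, assuming fixed-width on-disk entries (names like `derive_entry_count` are illustrative, not the actual `Bucket` API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Old scheme: the entry count was stored as uint16_t, so 70'000 entries
// serialize as 70'000 % 65'536 = 4'464 — a silently corrupted bucket.
uint16_t old_count_field(size_t nr_entries) {
    return static_cast<uint16_t>(nr_entries);   // silent truncation
}

// New scheme: drop the stored count and derive nr_array_entries from the
// residual bytes left after the (known-size) bitmap. With no count field
// on disk, there is nothing to overflow.
size_t derive_entry_count(size_t payload_bytes, size_t bitmap_bytes,
                          size_t entry_width) {
    size_t residual = payload_bytes - bitmap_bytes;
    // Backward-compat check in the spirit of the modulus test above:
    // a residual that is not a multiple of the entry width signals a
    // payload written by the legacy layout, not the new one.
    assert(residual % entry_width == 0);
    return residual / entry_width;
}
```

The derive-from-residual approach trades a small amount of self-description for overflow safety: the count is always consistent with the bytes actually present, which also makes truncated payloads detectable.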


  2. Build: pin -falign-functions=64 in the release flags (CMakeLists.txt).

Editing any header transitively included by ndd.hpp (filter.hpp,
numeric_index.hpp, vector_storage.hpp) was producing 10-30% QPS
swings on the int-filter HTTP bench with no algorithmic change,
because the HNSW search loop is sensitive to function placement
relative to cache lines. Microbenches of range() and bitmap.contains()
were byte-identical between affected builds; the cost lived in
i-cache effects on the surrounding HNSW inner loop. Forcing 64-byte
function alignment removes the variance so future header touches
don't masquerade as perf regressions.
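A sketch of how the pin might look in CMakeLists.txt, assuming flags are added centrally for GCC/Clang release builds (the real file may structure its options differently):

```cmake
# Force 64-byte function alignment in release builds so HNSW hot-loop
# placement is stable across unrelated header edits (GCC/Clang only;
# MSVC does not accept this flag).
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
  add_compile_options($<$<CONFIG:Release>:-falign-functions=64>)
endif()
```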

@shaleenji
Collaborator Author

Adjacent issue not addressed here: the slide-split LEFT-bucket rebuild
in add_to_buckets() rebuilds summary_bitmap from ids only, which
silently drops bitmap-only entries. Saturation handling is therefore
not durable across splits; tracked separately.

@shaleenji
Collaborator Author

This is a breaking change: existing indexes need to be rebuilt.

@shaleenji
Collaborator Author

  1. Metadata updates are now performed when filters are updated
  2. Stale entries are removed when filters are updated

@shaleenji
Collaborator Author

shaleenji commented May 12, 2026

  • add explicit Roaring bitmap payload validation for category filter reads
  • replace unsafe filter bitmap reads with bounded readSafe + exact byte-count checks
  • validate deserialized Roaring internals before using stored bitmaps
  • apply the same hardening to numeric bucket bitmap payloads, including the range fast path
  • return/propagate corruption as OperationResult code 200 instead of trusting malformed payloads
  • add regression coverage for valid, truncated, trailing-byte, and garbage bitmap payloads
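The bounded-read hardening above can be sketched as follows. `Reader` and `read_safe` here are illustrative stand-ins, not the actual Endee helpers: the point is that the reader refuses to run past the end of the stored payload instead of trusting an embedded length, and that an exact byte-count check rejects trailing garbage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Bounded reader over a stored payload. read_safe returns nullopt on a
// truncated payload rather than reading out of bounds.
struct Reader {
    const uint8_t* data;
    size_t size;
    size_t pos = 0;

    std::optional<std::vector<uint8_t>> read_safe(size_t n) {
        if (n > size - pos) return std::nullopt;   // truncated payload
        std::vector<uint8_t> out(data + pos, data + pos + n);
        pos += n;
        return out;
    }

    // Exact byte-count check: a valid bitmap payload must consume the
    // stored bytes exactly — trailing bytes indicate corruption.
    bool fully_consumed() const { return pos == size; }
};
```

In the PR's terms, a caller would deserialize the bitmap via `read_safe`, then treat either a `nullopt` (truncation) or `!fully_consumed()` (trailing bytes) as corruption to propagate, rather than handing a malformed payload to Roaring.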

@shaleenji
Collaborator Author

solves #237 , #238 , #239 , #240 , #241 , #242

partially #244 and #25

@shaleenji
Collaborator Author

Requires reindexing

…y checks ids.empty() in numeric_index.cpp (line 306), removal deletes the bucket on that basis in numeric_index.cpp (line 412), and range skips ids.empty() buckets in numeric_index.cpp (line 1004). Also, split rebuilds the left bitmap only from ids in numeric_index.cpp (line 626), dropping bitmap-only duplicate IDs.
@shaleenji
Collaborator Author

Server:
8 CPUs
32 GB memory
100 GB NVMe SSD

Label Filter

Screenshot 2026-05-15 at 08 24 19

Int Filter

Screenshot 2026-05-15 at 08 25 06
