
filter #154

Open
shaleenji wants to merge 33 commits into master from filter_pass

Conversation

@shaleenji
Collaborator

No description provided.

@shaleenji shaleenji changed the title filter GetIndexInfo made light on storage and memory Mar 30, 2026
@shaleenji shaleenji changed the title GetIndexInfo made light on storage and memory filter Mar 30, 2026
@github-actions

github-actions Bot commented Apr 6, 2026

VectorDB Benchmark - Ready To Run

CI Passed ([lint + unit tests](https://github.com/endee-io/endee/actions/runs/24508497817)) - benchmark options unlocked.

Post one of the commands below. Only members with write access can trigger runs.


Available Modes

| Mode | Command | What runs |
|------|---------|-----------|
| Dense | `/correctness_benchmarking dense` | HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS |
| Hybrid | `/correctness_benchmarking hybrid` | Dense + sparse BM25 fusion · same suite + fusion latency overhead |

Infrastructure

| Server | Role | Instance |
|--------|------|----------|
| Endee Server | Endee VectorDB — code from this branch | t2.large |
| Benchmark Server | Benchmark runner | t3a.large |

Both servers start on demand and are always terminated after the run — pass or fail.


How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server Create  →  this branch's code deployed  →  Endee starts in chosen mode
3. Benchmark Server Create  →  benchmark suite transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

@shaleenji
Collaborator Author

Original performance numbers for int filter and label filter

Screenshot 2026-04-13 at 16 11 16 (1) Screenshot 2026-04-14 at 17 15 01

@github-actions

github-actions Bot commented Apr 15, 2026

VectorDB Benchmark — Failed

Triggered by @shaleenji · Commit ``

| Step | Status |
|------|--------|
| Provision Servers | Up |
| Deploy Endee Server | Done |
| Run Benchmark | Failed |
| Results | See reason below |
| Teardown | Done ✓ |

@shaleenji
Collaborator Author

shaleenji commented Apr 15, 2026

New performance numbers for int filter and label filter

Screenshot 2026-04-15 at 12 08 07 Screenshot 2026-04-15 at 12 25 41

@shaleenji
Collaborator Author

shaleenji commented Apr 16, 2026

1M indexing time on an 8-CPU, 30 GB OVH B3-32 machine:

int filter: 981 seconds
label filter: 1025 seconds

@shaleenji
Collaborator Author

shaleenji commented Apr 16, 2026

The above code has two purposes:

  1. Parses all filters up front, before anything is inserted into the DB
  2. Improves performance by batching the writes into a single MDBX transaction
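The parse-everything-first, commit-once pattern described above can be sketched as follows. This is a minimal illustration, not the actual Endee code: `ParsedFilter` and `Txn` are hypothetical stand-ins (a real build would use libmdbx transaction handles), but the control flow — reject the whole batch on the first malformed filter, then stage all writes inside one transaction with a single commit — matches the two purposes listed.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical parsed-filter record; the real types live in src/filter/.
struct ParsedFilter { std::string field; std::string value; };

// Stand-in for a batched write transaction (MDBX would provide the real one).
struct Txn {
    std::vector<ParsedFilter> staged;
    bool committed = false;
    void put(const ParsedFilter& f) { staged.push_back(f); }
    void commit() { committed = true; }
};

// Parse every filter string up front; reject the whole batch on the first
// malformed entry, so nothing is half-written to the DB.
std::optional<std::vector<ParsedFilter>>
parse_all(const std::vector<std::string>& raw) {
    std::vector<ParsedFilter> out;
    for (const auto& s : raw) {
        auto pos = s.find('=');
        if (pos == std::string::npos || pos == 0) return std::nullopt;
        out.push_back({s.substr(0, pos), s.substr(pos + 1)});
    }
    return out;
}

// All writes happen inside one transaction, committed once.
bool insert_batch(Txn& txn, const std::vector<std::string>& raw) {
    auto parsed = parse_all(raw);
    if (!parsed) return false;   // nothing was written on parse failure
    for (const auto& f : *parsed) txn.put(f);
    txn.commit();                // single commit for the whole batch
    return true;
}
```

Besides the transaction-overhead savings, this ordering is what makes the parse step safe: a malformed filter in the middle of a batch can no longer leave earlier entries already persisted.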

1M indexing time on an 8-CPU, 30 GB OVH B3-32 machine:

int filter: 873 seconds (~11% reduction in indexing time)
label filter: 999 seconds

@github-actions

github-actions Bot commented Apr 23, 2026

VectorDB Benchmark - Ready To Run

CI Passed ([lint + unit tests](https://github.com/endee-io/endee/actions/runs/25796643394)) - benchmark options unlocked.

Post one of the commands below. Only members with write access can trigger runs.


Available Modes

| Mode | Command | What runs |
|------|---------|-----------|
| Dense | `/correctness_benchmarking dense` | HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS |
| Hybrid | `/correctness_benchmarking hybrid` | Dense + sparse BM25 fusion · same suite + fusion latency overhead |

Infrastructure

| Server | Role | Instance |
|--------|------|----------|
| Endee Server | Endee VectorDB — code from this branch | t2.large |
| Benchmark Server | Benchmark runner | t3a.large |

Both servers start on demand and are always terminated after the run — pass or fail.


How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server Create  →  this branch's code deployed  →  Endee starts in chosen mode
3. Benchmark Server Create  →  benchmark suite transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

@shaleenji
Collaborator Author

Things to do:
Note: key encoding is string-concatenated: field:value, field:id, and field:. Fields or values containing : can collide or break prefix scans. Length-prefixed or tuple-encoded MDBX keys would be safer.
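A minimal sketch of the length-prefixed encoding suggested above (illustrative only — the function name and 4-byte big-endian length prefix are assumptions, not the actual Endee key layout). Each component carries its own length, so a `:` inside a field or value can no longer be confused with a separator:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Length-prefixed tuple key: each component is written as a 4-byte
// big-endian length followed by its raw bytes. Unlike "field:value"
// concatenation, ("a:b", "c") and ("a", "b:c") produce distinct keys.
std::string encode_key(const std::string& field, const std::string& value) {
    auto put = [](std::string& out, const std::string& part) {
        uint32_t n = static_cast<uint32_t>(part.size());
        for (int shift = 24; shift >= 0; shift -= 8)
            out.push_back(static_cast<char>((n >> shift) & 0xFF));
        out += part;
    };
    std::string key;
    put(key, field);
    put(key, value);
    return key;
}
```

The length prefix also keeps prefix scans well-defined: all keys for a field share the exact byte prefix `len(field) + field`, regardless of what characters the field contains.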

@shaleenji
Collaborator Author

Filter writes are not atomic with vector/meta/HNSW/WAL writes. A failure can leave vector metadata, filter indexes, sparse storage, and HNSW out of sync.

@shaleenji shaleenji mentioned this pull request May 8, 2026
@shaleenji
Collaborator Author

  1. Numeric bucket: fix the duplicate-heavy cliff bug.

When >65,536 ids shared the same numeric filter value, Bucket::serialize
truncated the on-disk count to a uint16_t and corrupted the bucket --
recall collapsed to zero past that boundary (reproducible with
tests/repo_filter.py).

The fix has five parts in src/filter/numeric_index.hpp:

  • Bucket::add now caps deltas/ids at MAX_SIZE for delta_32==0
    duplicates and routes the excess id into summary_bitmap only.
    Cardinality is preserved, on-disk arrays no longer grow without
    bound.

  • The on-disk count field is removed from Bucket::serialize entirely.
    Bucket::deserialize derives nr_array_entries from the residual
    bytes after the bitmap, so there is no count to overflow.

  • Bucket::deserialize stays backward compatible with the old on-disk
    format via a modulus check on the residual: existing DBs written
    by the old code keep working with no migration.

  • range() gains two correctness branches: a legacy-salvage branch
    for cliff-corrupted buckets (ids.empty() but bitmap non-empty),
    and a bitmap-only-inclusion branch in the partial-overlap path
    for buckets with cardinality > ids.size().

  • range() also gains a coarse full-coverage fast path: when a
    bucket's [base, base+MAX_DELTA] extent is wholly inside the query
    range, skip the deltas/ids deserialize and union just the bitmap
    via Bucket::read_summary_bitmap.
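The truncation cliff and the count-free replacement are easy to demonstrate in isolation. A minimal sketch, assuming fixed-width on-disk entries (names like `derive_entry_count` are illustrative, not the actual `Bucket` API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Old scheme: the entry count was stored as uint16_t, so 70'000 entries
// serialize as 70'000 % 65'536 = 4'464 — a silently corrupted bucket.
uint16_t old_count_field(size_t nr_entries) {
    return static_cast<uint16_t>(nr_entries);   // silent truncation
}

// New scheme: drop the stored count and derive nr_array_entries from the
// residual bytes left after the (known-size) bitmap. With no count field
// on disk, there is nothing to overflow.
size_t derive_entry_count(size_t payload_bytes, size_t bitmap_bytes,
                          size_t entry_width) {
    size_t residual = payload_bytes - bitmap_bytes;
    // Backward-compat check in the spirit of the modulus test above:
    // a residual that is not a multiple of the entry width signals a
    // payload written by the legacy layout, not the new one.
    assert(residual % entry_width == 0);
    return residual / entry_width;
}
```

The derive-from-residual approach trades a small amount of self-description for overflow safety: the count is always consistent with the bytes actually present, which also makes truncated payloads detectable.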


  2. Build: pin -falign-functions=64 in the release flags (CMakeLists.txt).

Editing any header transitively included by ndd.hpp (filter.hpp,
numeric_index.hpp, vector_storage.hpp) was producing 10-30% QPS
swings on the int-filter HTTP bench with no algorithmic change,
because the HNSW search loop is sensitive to function placement
relative to cache lines. Microbenches of range() and bitmap.contains()
were byte-identical between affected builds; the cost lived in
i-cache effects on the surrounding HNSW inner loop. Forcing 64-byte
function alignment removes the variance so future header touches
don't masquerade as perf regressions.
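A sketch of how the pin might look in CMakeLists.txt, assuming flags are added centrally for GCC/Clang release builds (the real file may structure its options differently):

```cmake
# Force 64-byte function alignment in release builds so HNSW hot-loop
# placement is stable across unrelated header edits (GCC/Clang only;
# MSVC does not accept this flag).
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
  add_compile_options($<$<CONFIG:Release>:-falign-functions=64>)
endif()
```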

@shaleenji
Collaborator Author

Adjacent issue not addressed here: the slide-split LEFT-bucket rebuild
in add_to_buckets() rebuilds summary_bitmap from ids only, which
silently drops bitmap-only entries. Saturation handling is therefore
not durable across splits; tracked separately.

@shaleenji
Collaborator Author

This is a breaking change: existing indexes need to be rebuilt.

@shaleenji
Collaborator Author

  1. Metadata updates are now performed when filters are updated
  2. Stale entries are removed when filters are updated

@shaleenji
Collaborator Author

shaleenji commented May 12, 2026

  • add explicit Roaring bitmap payload validation for category filter reads
  • replace unsafe filter bitmap reads with bounded readSafe + exact byte-count checks
  • validate deserialized Roaring internals before using stored bitmaps
  • apply the same hardening to numeric bucket bitmap payloads, including the range fast path
  • return/propagate corruption as OperationResult code 200 instead of trusting malformed payloads
  • add regression coverage for valid, truncated, trailing-byte, and garbage bitmap payloads
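The bounded-read hardening above can be sketched as follows. `Reader` and `read_safe` here are illustrative stand-ins, not the actual Endee helpers: the point is that the reader refuses to run past the end of the stored payload instead of trusting an embedded length, and that an exact byte-count check rejects trailing garbage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Bounded reader over a stored payload. read_safe returns nullopt on a
// truncated payload rather than reading out of bounds.
struct Reader {
    const uint8_t* data;
    size_t size;
    size_t pos = 0;

    std::optional<std::vector<uint8_t>> read_safe(size_t n) {
        if (n > size - pos) return std::nullopt;   // truncated payload
        std::vector<uint8_t> out(data + pos, data + pos + n);
        pos += n;
        return out;
    }

    // Exact byte-count check: a valid bitmap payload must consume the
    // stored bytes exactly — trailing bytes indicate corruption.
    bool fully_consumed() const { return pos == size; }
};
```

In the PR's terms, a caller would deserialize the bitmap via `read_safe`, then treat either a `nullopt` (truncation) or `!fully_consumed()` (trailing bytes) as corruption to propagate, rather than handing a malformed payload to Roaring.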

@shaleenji
Collaborator Author

solves #237 , #238 , #239 , #240 , #241 , #242

partially #244 and #25

@shaleenji
Collaborator Author

Requires reindexing

…y checks ids.empty() in numeric_index.cpp (line 306), removal deletes the bucket on that basis in numeric_index.cpp (line 412), and range skips ids.empty() buckets in numeric_index.cpp (line 1004). Also, split rebuilds the left bitmap only from ids in numeric_index.cpp (line 626), dropping bitmap-only duplicate IDs.
@shaleenji
Collaborator Author

Server:
8 CPUs
32 GB memory
100 GB NVMe SSD

Label Filter

Screenshot 2026-05-15 at 08 24 19

Int Filter

Screenshot 2026-05-15 at 08 25 06
