From 6768d187ff69cc89cca00dbd8efba38d05dde363 Mon Sep 17 00:00:00 2001
From: Sandeep
Date: Tue, 12 May 2026 17:22:02 -0500
Subject: [PATCH 1/3] Publish corrected 10GB native benchmark result

---
 docs/SERVER_BENCHMARK_STATUS.md               | 64 ++++++++++++++-----
 docs/STAGED_BENCHMARK_PLAN.md                 | 21 ++++--
 ...60512T_10gb_optimized_native_metadata.json | 29 +++++++++
 ...260512T_10gb_optimized_native_summary.json | 44 +++++++++++++
 4 files changed, 138 insertions(+), 20 deletions(-)
 create mode 100644 reports/runs/20260512T_10gb_optimized_native_metadata.json
 create mode 100644 reports/runs/20260512T_10gb_optimized_native_summary.json

diff --git a/docs/SERVER_BENCHMARK_STATUS.md b/docs/SERVER_BENCHMARK_STATUS.md
index f75c94a..0968ad6 100644
--- a/docs/SERVER_BENCHMARK_STATUS.md
+++ b/docs/SERVER_BENCHMARK_STATUS.md
@@ -60,6 +60,8 @@ Local snapshots now checked into this repo:
 - [`reports/runs/20260429T_next_gte_small_st_embedding_metadata.json`](../reports/runs/20260429T_next_gte_small_st_embedding_metadata.json)
 - [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
 - [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
+- [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
 ## Established quality finding
 
@@ -194,9 +196,9 @@ For large-scale system work:
 3. run `250GB` only after confirming disk headroom
 4. defer `1TB` until storage is expanded
 
-## Current staged run
+## Current staged `10GB` status
 
-The first staged large-scale run is now complete:
+The first staged large-scale run is now understood in two parts:
 
 - run id: `20260503T_stage10gb_d768`
 - dataset: `synthetic_10gib_768d`
@@ -208,14 +210,24 @@ The first staged large-scale run is now complete:
 - local summary: [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
 - local metadata: [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
 
+Corrected native rerun:
+
+- run id: `20260512T_10gb_optimized_native`
+- workload: `sochdb_native_10gb`
+- server script: `run_10gb_bench.py`
+- local summary: [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- local metadata: [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
+
 Important implementation note:
 
-- the staged lane now uses the compiled `sochdb-bulk` binary on the server for
-  build and query operations
-- this replaced the earlier Python-only workload path, which failed on the
-  hosted machine because the stale `VectorIndex` import path was not available
+- the first successful end-to-end staged run used the compiled `sochdb-bulk`
+  path because the old Python server environment was blocked
+- that got the lane running, but it was not a trustworthy steady-state repeated
+  query benchmark
+- the corrected May 12 rerun used `sochdb.HnswIndex.load(...)` once and then
+  in-process `index.search(...)` / `index.search_batch(...)`
 
-### `10GB` staged result
+### What the first published `10GB` run proved
 
 What succeeded:
 
@@ -224,7 +236,7 @@ What succeeded:
 - observed build throughput was about `891.6 vec/s`
 - output index size was about `10,069.1 MB`
 
-What failed the performance bar:
+What looked bad at first:
 
 - `250` queries took about `27,455.91 s`
 - search throughput was only about `0.0091 QPS`
@@ -232,13 +244,32 @@ What failed the performance bar:
 - `p95` latency was about `110,108 ms`
 - `mean` latency was about `109,822 ms`
 
+Why that result was misleading:
+
+- that runner used `bulk_query_from_file(...)`, which shells out once per query
+- the CLI query path loads the large index before searching
+- the benchmark therefore measured repeated subprocess startup and repeated index
+  reload more than it measured real steady-state ANN search
+
+### Corrected `10GB` native rerun
+
+The verified May 12 rerun on the server showed:
+
+- one-time index load: about `106.85 s`
+- sequential search: about `506.63 QPS`
+- sequential mean latency: about `1.97 ms`
+- sequential `p50`: about `1.87 ms`
+- sequential `p95`: about `2.40 ms`
+- batch search: about `356 QPS`
+
 Interpretation:
 
-- the staged runner itself is now working end to end
-- the bottleneck has moved from benchmark plumbing to SochDB query-path behavior
-- we should not scale this lane to `100GB` yet
-- the next benchmark task is diagnosing why the current search path is roughly
-  `~110 s` per query at `10GB`
+- the catastrophic `~110 s/query` result was a benchmark harness artifact
+- the corrected in-process search result is the meaningful steady-state number
+- the large-scale story is now materially stronger than the old published docs
+  suggested
+- we should use the corrected native path as the baseline for future staged
+  `100GB` and `250GB` work
 
 ## Scripts that define the server lane
 
@@ -253,8 +284,11 @@ Related planning docs:
 
 ## What is still pending
 
-- investigate the `10GB` search-latency failure before running `100GB`
-- complete the staged `10GB` -> `100GB` -> `250GB` scale path after that
+- replace the old misleading `10GB` interpretation everywhere it still appears
+- decide whether to publish the corrected native rerun as the canonical `10GB`
+  search result in a dedicated comparison doc/table
+- continue the staged `10GB` -> `100GB` -> `250GB` scale path using the native
+  steady-state methodology
 - defer `1TB` claims until storage is expanded
 
 This file should be the first place to update whenever new server benchmark work
diff --git a/docs/STAGED_BENCHMARK_PLAN.md b/docs/STAGED_BENCHMARK_PLAN.md
index 2e31c1c..66580d6 100644
--- a/docs/STAGED_BENCHMARK_PLAN.md
+++ b/docs/STAGED_BENCHMARK_PLAN.md
@@ -41,16 +41,27 @@ Current server state:
 
 - run `20260503T_stage10gb_d768` completed on the benchmark server
 - dataset: `synthetic_10gib_768d`
-- runner path now uses the compiled `sochdb-bulk` binary for index build/query
-- this avoids the stale in-process `VectorIndex` path that failed on the hosted box
+- the first published run used the compiled `sochdb-bulk` binary for build/query
+- a later corrected rerun used the in-process native `HnswIndex.load(...)` +
+  `index.search(...)` path from `run_10gb_bench.py`
 
 Current outcome:
 
 - build completed successfully for `3,495,253` vectors at about `892 vec/s`
 - output index size was about `10,069 MB`
-- search throughput was only about `0.0091 QPS`
-- `p50` query latency was about `109,814 ms`
-- the next priority is query-path investigation before moving on to `100GB`
+- the original published `0.0091 QPS` / `109,814 ms p50` search result is now
+  understood to be a harness artifact, not the true steady-state search speed
+- the corrected May 12 native rerun measured about `506.6 QPS` with about
+  `1.87 ms p50` and `1.97 ms` mean latency after a one-time `106.85 s` index
+  load
+- the next priority is publishing the corrected native lane cleanly and then
+  continuing the staged scale path with the right measurement method
+
+Methodology warning:
+
+- the original slow search number came from a subprocess-per-query bulk CLI
+  path that reloaded the large index repeatedly
+- do not treat that number as the engine's steady-state search performance
 
 What this lane measures:
 
diff --git a/reports/runs/20260512T_10gb_optimized_native_metadata.json b/reports/runs/20260512T_10gb_optimized_native_metadata.json
new file mode 100644
index 0000000..70dea87
--- /dev/null
+++ b/reports/runs/20260512T_10gb_optimized_native_metadata.json
@@ -0,0 +1,29 @@
+{
+  "run_id": "20260512T_10gb_optimized_native",
+  "timestamp_utc": "2026-05-12T08:23:20.389332",
+  "dataset_name": "synthetic_10gib_768d",
+  "dataset_dir": "/datasets/synthetic_10gib_768d",
+  "result_json": "/results/10gb_optimized/results_m16.json",
+  "workload": "sochdb_native_10gb",
+  "methodology": {
+    "script": "/work/run_10gb_bench.py",
+    "search_path": "in-process native extension",
+    "index_load": "load once before repeated queries",
+    "warmup_queries": 10,
+    "notes": [
+      "This rerun avoids the per-query subprocess path used by the earlier bulk CLI harness.",
+      "It should be treated as the corrected steady-state search measurement for the loaded index.",
+      "This artifact was verified on the server and then copied into the repo."
+    ]
+  },
+  "config": {
+    "num_vectors": 3495253,
+    "num_queries": 1000,
+    "dimension": 768,
+    "M": 32,
+    "ef_construction": 200,
+    "ef_search": 64,
+    "k": 10,
+    "batch_size": 5000
+  }
+}
diff --git a/reports/runs/20260512T_10gb_optimized_native_summary.json b/reports/runs/20260512T_10gb_optimized_native_summary.json
new file mode 100644
index 0000000..ed3df33
--- /dev/null
+++ b/reports/runs/20260512T_10gb_optimized_native_summary.json
@@ -0,0 +1,44 @@
+{
+  "load_s": 106.85488888109103,
+  "search_sequential": {
+    "total_s": 1.973815259989351,
+    "qps": 506.6330270470171,
+    "mean_ms": 1.973815259989351,
+    "p50_ms": 1.8717250786721706,
+    "p95_ms": 2.4010292254388332,
+    "p99_ms": 6.252808030694723
+  },
+  "search_batch": {
+    "total_s": 2.808144075796008,
+    "qps": 356.1070846112253,
+    "per_query_ms": 2.808144075796008
+  },
+  "search_batch_ef64": {
+    "qps": 358.62969130334096,
+    "per_query_ms": 2.788391547743231
+  },
+  "search_batch_ef128": {
+    "qps": 358.11082750298215,
+    "per_query_ms": 2.792431625071913
+  },
+  "search_batch_ef256": {
+    "qps": 359.017884829031,
+    "per_query_ms": 2.785376557148993
+  },
+  "search_batch_ef512": {
+    "qps": 356.51812853303915,
+    "per_query_ms": 2.8049064548686147
+  },
+  "config": {
+    "num_vectors": 3495253,
+    "num_queries": 1000,
+    "dimension": 768,
+    "M": 32,
+    "ef_construction": 200,
+    "ef_search": 64,
+    "k": 10,
+    "batch_size": 5000
+  },
+  "timestamp": "2026-05-12T08:23:20.389332",
+  "workload": "sochdb_native_10gb"
+}

From 5558a4b8930e10c0422673d0a9b846f844510dbb Mon Sep 17 00:00:00 2001
From: Sandeep
Date: Tue, 12 May 2026 17:27:24 -0500
Subject: [PATCH 2/3] Simplify benchmark status docs

---
 docs/SERVER_BENCHMARK_STATUS.md | 294 +++++---------------------------
 docs/STAGED_BENCHMARK_PLAN.md   | 163 ++++--------------
 2 files changed, 79 insertions(+), 378 deletions(-)

diff --git a/docs/SERVER_BENCHMARK_STATUS.md b/docs/SERVER_BENCHMARK_STATUS.md
index 0968ad6..6b9a3e4 100644
--- a/docs/SERVER_BENCHMARK_STATUS.md
+++ b/docs/SERVER_BENCHMARK_STATUS.md
@@ -1,259 +1,67 @@
 # Server Benchmark Status
 
-This document captures the current state of the heavy benchmark lane that runs on
-the hosted SochDB server instead of on a laptop.
+This document captures the current benchmark story for the hosted SochDB server.
 
-## Why the server lane exists
+## Current setup
 
-Heavy benchmark work should happen on the benchmark server, not on a developer
-laptop. That is especially true for:
+- heavy benchmark work runs on the benchmark server, not on laptops
+- hosted gRPC demo endpoint: `studio.agentslab.host:50053`
+- current server class: about `12` CPU, about `62 GiB` RAM
+- current storage is not appropriate for a final `1TB` claim yet
 
-- retrieval-quality sweeps
-- embedding bakeoffs
-- staged large-dataset runs
-- repeatable gRPC benchmark runs against the hosted demo endpoint
-
-Current server target:
-
-- host: private benchmark server
-- SSH: stored out-of-band for operators only
-- hosted gRPC endpoint: `studio.agentslab.host:50053`
-
-## Current server constraints
-
-The server is good enough for repeated CPU-oriented benchmark work, but it is not
-the right machine for a final `1TB` claim yet.
-
-- about `12` CPU
-- about `62 GiB` RAM
-- limited free root-disk capacity for honest `1TB` benchmarking
-- weak GPU (`GeForce GT 710`), so embedding work should remain CPU-friendly
-
-Because of that, the large-scale benchmark story should stay staged:
+Because of that, large-scale benchmarking stays staged:
 
 1. `10GB`
 2. `100GB`
 3. `250GB`
-4. `1TB` only after moving to a larger disk or attached storage
-
-## Current benchmark workspace on the server
-
-- `/datasets`
-- `/embeddings`
-- `/results`
-- `/logs`
-- `/work`
-
-These locations should be treated as the canonical landing zone for heavy benchmark
-artifacts before we selectively publish summaries back into this repo.
-
-Local snapshots now checked into this repo:
-
-- [`reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json`](../reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json)
-- [`reports/runs/20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
-- [`reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json`](../reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json)
-- [`reports/runs/20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
-- [`reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json`](../reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json)
-- [`reports/runs/20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
-- [`reports/runs/20260429T_next_bge_base_embedding_metadata.json`](../reports/runs/20260429T_next_bge_base_embedding_metadata.json)
-- [`reports/runs/20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
-- [`reports/runs/20260429T_next_gte_small_st_embedding_metadata.json`](../reports/runs/20260429T_next_gte_small_st_embedding_metadata.json)
-- [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
-- [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
-- [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
-- [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
-
-## Established quality finding
+4. `1TB` only after storage expansion
 
-The most important retrieval-quality result so far is that embedding choice moved
-quality more than HNSW tuning on SciFact.
+## Published quality result
 
-Latest verified server runs:
+Quality is measured separately from scale, using SciFact retrieval benchmarks.
 
-- baseline sweep run: `20260427T225122Z`
-- baseline pilot metadata run: `20260427T224143Z`
-- `BAAI/bge-small-en-v1.5` sweep run: `20260427T230412Z`
-- `thenlper/gte-small` sweep run: `20260429T_next_gte_small_st`
-- `BAAI/bge-base-en-v1.5` sweep run: `20260429T_next_bge_base`
+Current best published result:
 
-Baseline embedding metadata:
-
-- backend: `sentence-transformers`
-- model: `sentence-transformers/all-MiniLM-L6-v2`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-BGE comparison embedding metadata:
-
-- backend: `fastembed`
-- model: `BAAI/bge-small-en-v1.5`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-GTE-small embedding metadata:
-
-- backend: `sentence-transformers`
-- model: `thenlper/gte-small`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-BGE-base embedding metadata:
-
-- backend: `fastembed`
 - model: `BAAI/bge-base-en-v1.5`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `768`
-
-Summary of the current conclusion:
+- recall@5: `0.8121`
+- MRR: `0.7017`
+- nDCG@5: `0.7258`
 
-- baseline SciFact `recall@5` was about `0.7109`
-- `thenlper/gte-small` reached about `0.7786` `recall@5`
-- `BAAI/bge-base-en-v1.5` reached about `0.8121` `recall@5`
-- `MRR` and `nDCG` improved as well
-- `gte-small` stayed near baseline latency
-- `bge-base-en-v1.5` improved quality further, but with noticeably higher latency
-- HNSW parameter sweeps did not meaningfully change quality compared with the
-  embedding-model change
+Main takeaway:
 
-### Exact SciFact comparison
+- embedding choice moved quality more than HNSW tuning in the current setup
 
-| Embeddings | Run | recall@5 | MRR | nDCG@5 | p50 (ms) | p95 (ms) | mean (ms) |
-| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
-| `all-MiniLM-L6-v2` + `fast` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.857` | `0.996` | `0.800` |
-| `all-MiniLM-L6-v2` + `balanced` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.953` | `1.019` | `0.880` |
-| `all-MiniLM-L6-v2` + `quality` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.900` | `1.001` | `0.813` |
-| `BAAI/bge-small-en-v1.5` + `fast` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.920` | `1.041` | `0.840` |
-| `BAAI/bge-small-en-v1.5` + `balanced` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.929` | `0.985` | `0.833` |
-| `BAAI/bge-small-en-v1.5` + `quality` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.720` | `0.992` | `0.775` |
-| `thenlper/gte-small` + `fast` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.955` | `1.028` | `0.878` |
-| `thenlper/gte-small` + `balanced` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.976` | `1.049` | `0.888` |
-| `thenlper/gte-small` + `quality` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.968` | `1.056` | `0.901` |
-| `BAAI/bge-base-en-v1.5` + `fast` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.787` | `3.287` | `1.951` |
-| `BAAI/bge-base-en-v1.5` + `balanced` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `2.683` | `4.243` | `2.989` |
-| `BAAI/bge-base-en-v1.5` + `quality` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.823` | `2.946` | `2.189` |
+Useful reference points:
 
-### Best-to-best summary
-
-Using the best observed latency profile from each embedding set:
-
-| Comparison | recall@5 | MRR | nDCG@5 | mean latency |
+| Embeddings | recall@5 | MRR | nDCG@5 | mean latency |
 | :--- | ---: | ---: | ---: | ---: |
-| baseline `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
+| `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
 | `BAAI/bge-small-en-v1.5` | `0.7624` | `0.6603` | `0.6812` | `0.775 ms` |
 | `thenlper/gte-small` | `0.7786` | `0.6711` | `0.6944` | `0.878 ms` |
 | `BAAI/bge-base-en-v1.5` | `0.8121` | `0.7017` | `0.7258` | `1.951 ms` |
 
-Observed gains from the embedding change:
-
-- `recall@5`: `+0.0516` absolute, about `+7.3%` relative
-- `MRR`: `+0.0719` absolute, about `+12.2%` relative
-- `nDCG@5`: `+0.0677` absolute, about `+11.0%` relative
+Published artifacts:
 
-Observed gains for `thenlper/gte-small` over baseline:
+- [`20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
+- [`20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
+- [`20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
+- [`20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
 
-- `recall@5`: `+0.0677` absolute, about `+9.5%` relative
-- `MRR`: `+0.0827` absolute, about `+14.1%` relative
-- `nDCG@5`: `+0.0809` absolute, about `+13.2%` relative
+## Published `10GB` scale result
 
-Observed gains for `BAAI/bge-base-en-v1.5` over baseline:
+Dataset:
 
-- `recall@5`: `+0.1012` absolute, about `+14.2%` relative
-- `MRR`: `+0.1134` absolute, about `+19.3%` relative
-- `nDCG@5`: `+0.1122` absolute, about `+18.3%` relative
-
-Interpretation:
-
-- the next strong retrieval lever is embedding selection
-- `BAAI/bge-base-en-v1.5` is the current quality leader on SciFact
-- `thenlper/gte-small` is a useful middle point when we want a lighter latency hit
-- HNSW sweeps are still useful for latency/recall tradeoff mapping
-- we should not oversell ANN tuning as the main quality breakthrough
-- dimensionality matters in this comparison set, so `384`-dim and `768`-dim wins
-  should not be treated as identical cost classes
-
-## Recommended benchmark order from here
-
-For retrieval work, keep the methodology disciplined:
-
-1. fix the dataset
-2. fix the embedding model
-3. sweep HNSW settings
-4. compare `recall@k`, `MRR`, `nDCG`, and latency
-5. compare embedding models on the same benchmark path
-
-For large-scale system work:
-
-1. complete clean `10GB` results
-2. publish `100GB` results
-3. run `250GB` only after confirming disk headroom
-4. defer `1TB` until storage is expanded
-
-## Current staged `10GB` status
-
-The first staged large-scale run is now understood in two parts:
-
-- run id: `20260503T_stage10gb_d768`
-- dataset: `synthetic_10gib_768d`
-- target size: `10 GiB`
+- run family: `synthetic_10gib_768d`
+- vectors: `3,495,253`
 - dimension: `768`
-- query count: `250`
-- runner: `scripts/run_sochdb_stage_vector.sh`
-- workload: `benchmarks/run_bulk_vector_workload.py`
-- local summary: [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
-- local metadata: [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
-
-Corrected native rerun:
-
-- run id: `20260512T_10gb_optimized_native`
-- workload: `sochdb_native_10gb`
-- server script: `run_10gb_bench.py`
-- local summary: [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
-- local metadata: [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
-Important implementation note:
+Build result:
 
-- the first successful end-to-end staged run used the compiled `sochdb-bulk`
-  path because the old Python server environment was blocked
-- that got the lane running, but it was not a trustworthy steady-state repeated
-  query benchmark
-- the corrected May 12 rerun used `sochdb.HnswIndex.load(...)` once and then
-  in-process `index.search(...)` / `index.search_batch(...)`
+- build throughput: about `891.6 vec/s`
+- build time: about `3920.14 s`
+- output index size: about `10,069.1 MB`
 
-### What the first published `10GB` run proved
-
-What succeeded:
-
-- index build completed successfully for `3,495,253` vectors
-- build time was about `3920.14 s` (`~65.3 min`)
-- observed build throughput was about `891.6 vec/s`
-- output index size was about `10,069.1 MB`
-
-What looked bad at first:
-
-- `250` queries took about `27,455.91 s`
-- search throughput was only about `0.0091 QPS`
-- `p50` latency was about `109,814 ms`
-- `p95` latency was about `110,108 ms`
-- `mean` latency was about `109,822 ms`
-
-Why that result was misleading:
-
-- that runner used `bulk_query_from_file(...)`, which shells out once per query
-- the CLI query path loads the large index before searching
-- the benchmark therefore measured repeated subprocess startup and repeated index
-  reload more than it measured real steady-state ANN search
-
-### Corrected `10GB` native rerun
-
-The verified May 12 rerun on the server showed:
+Corrected steady-state search result:
 
 - one-time index load: about `106.85 s`
 - sequential search: about `506.63 QPS`
@@ -262,34 +70,20 @@ The verified May 12 rerun on the server showed:
 - sequential `p95`: about `2.40 ms`
 - batch search: about `356 QPS`
 
-Interpretation:
-
-- the catastrophic `~110 s/query` result was a benchmark harness artifact
-- the corrected in-process search result is the meaningful steady-state number
-- the large-scale story is now materially stronger than the old published docs
-  suggested
-- we should use the corrected native path as the baseline for future staged
-  `100GB` and `250GB` work
+Important note:
 
-## Scripts that define the server lane
+- the earlier `0.0091 QPS` / `~110s per query` result came from a bad benchmark
+  harness path and should not be treated as the real steady-state engine result
 
-- `scripts/run_sochdb_grpc_quality_sweep.sh`
-- `scripts/run_sochdb_embedding_bakeoff.sh`
-- `scripts/run_sochdb_stage_vector.sh`
+Published artifacts:
 
-Related planning docs:
+- [`20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
+- [`20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
+- [`20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- [`20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
-- [`STAGED_BENCHMARK_PLAN.md`](./STAGED_BENCHMARK_PLAN.md)
-- [`RETRIEVAL_AND_VECTOR_PLAN.md`](./RETRIEVAL_AND_VECTOR_PLAN.md)
+## What is pending
 
-## What is still pending
-
-- replace the old misleading `10GB` interpretation everywhere it still appears
-- decide whether to publish the corrected native rerun as the canonical `10GB`
-  search result in a dedicated comparison doc/table
-- continue the staged `10GB` -> `100GB` -> `250GB` scale path using the native
-  steady-state methodology
+- publish `100GB` results using the corrected native steady-state methodology
+- publish `250GB` results after confirming disk headroom
 - defer `1TB` claims until storage is expanded
-
-This file should be the first place to update whenever new server benchmark work
-changes the current benchmark story.
diff --git a/docs/STAGED_BENCHMARK_PLAN.md b/docs/STAGED_BENCHMARK_PLAN.md
index 66580d6..0f57012 100644
--- a/docs/STAGED_BENCHMARK_PLAN.md
+++ b/docs/STAGED_BENCHMARK_PLAN.md
@@ -1,157 +1,64 @@
 # Staged Benchmark Plan
 
-For the current hosted benchmark state and the already-established retrieval
-quality takeaway, start with
-[`SERVER_BENCHMARK_STATUS.md`](./SERVER_BENCHMARK_STATUS.md).
+Start with [`SERVER_BENCHMARK_STATUS.md`](./SERVER_BENCHMARK_STATUS.md) for the
+current published results.
 
 ## Why staged
 
-The current benchmark server does not have enough free root-disk capacity for a
-clean, honest `1TB` run today.
+The current benchmark server does not yet have the storage profile for a clean
+final `1TB` benchmark claim.
 
-So we should run staged sizes:
+So the scale path stays staged:
 
 1. `10GB`
 2. `100GB`
 3. `250GB`
-4. `1TB` only after storage is expanded
+4. `1TB` after storage expansion
 
-## Current benchmark workspace
+## Current `10GB` status
 
-Server:
+The first staged dataset is:
 
-- `/datasets`
-- `/results`
-- `/logs`
-- `/work`
+- dataset: `synthetic_10gib_768d`
+- vectors: `3,495,253`
+- dimension: `768`
+
+Published result summary:
+
+- build worked and produced about a `10,069 MB` index
+- corrected steady-state search result is about `506.63 QPS`
+- corrected sequential mean latency is about `1.97 ms`
+- one-time index load is about `106.85 s`
 
-## First staged `10GB` lane
+Important note:
 
-For the first large-scale system pass, use a synthetic normalized-vector dataset
-for throughput and latency characterization. Keep this separate from the SciFact
-quality lane.
+- the original slow search number from the bulk CLI harness was a methodology
+  artifact, not the final engine result
 
-Reusable scripts:
+## Reusable scripts
 
 - `scripts/generate_staged_vector_dataset.py`
 - `scripts/run_sochdb_stage_vector.sh`
 - `benchmarks/run_bulk_vector_workload.py`
 
-Current server state:
-
-- run `20260503T_stage10gb_d768` completed on the benchmark server
-- dataset: `synthetic_10gib_768d`
-- the first published run used the compiled `sochdb-bulk` binary for build/query
-- a later corrected rerun used the in-process native `HnswIndex.load(...)` +
-  `index.search(...)` path from `run_10gb_bench.py`
-
-Current outcome:
-
-- build completed successfully for `3,495,253` vectors at about `892 vec/s`
-- output index size was about `10,069 MB`
-- the original published `0.0091 QPS` / `109,814 ms p50` search result is now
-  understood to be a harness artifact, not the true steady-state search speed
-- the corrected May 12 native rerun measured about `506.6 QPS` with about
-  `1.87 ms p50` and `1.97 ms` mean latency after a one-time `106.85 s` index
-  load
-- the next priority is publishing the corrected native lane cleanly and then
-  continuing the staged scale path with the right measurement method
-
-Methodology warning:
-
-- the original slow search number came from a subprocess-per-query bulk CLI
-  path that reloaded the large index repeatedly
-- do not treat that number as the engine's steady-state search performance
+## What this lane is for
 
-What this lane measures:
+This lane measures:
 
-- insert throughput
+- build throughput
 - search QPS
-- search latency percentiles
-- dataset payload size and run metadata
-
-What it does not claim:
-
-- retrieval quality on BEIR/SciFact
-- production corpus realism
-
-Example:
-
-```bash
-TARGET_GIB=10 DIM=768 \
-$HOME/sochdb-benchmark-runs/work/run_sochdb_stage_vector.sh
-```
-
-## Pilot runner
-
-The first reusable runner is:
-
-- `scripts/run_sochdb_grpc_pilot.sh`
-
-It wraps the gRPC retrieval benchmark and writes:
-
-- result JSON
-- metadata JSON
-- full log
-
-## Example
-
-```bash
-DATASET_DIR=$HOME/sochdb-benchmark-runs/datasets/scifact \
-EMBEDDING_DIR=$HOME/sochdb-benchmark-runs/datasets/scifact-embeddings \
-$HOME/sochdb-benchmark-runs/work/run_sochdb_grpc_pilot.sh
-```
-
-## Quality-first benchmark lane
-
-For quality work, we should avoid mixing too many variables at once.
-
-The evaluation order should be:
-
-1. fix the dataset
-2. fix the embedding model
-3. sweep HNSW settings
-4. compare recall / nDCG / latency
-5. only then compare different embedding models
-
-The first quality sweep runner is:
-
-- `scripts/run_sochdb_grpc_quality_sweep.sh`
-
-It compares three useful HNSW profiles:
-
-- `fast`
-- `balanced`
-- `quality`
-
-and writes:
-
-- per-run result JSON
-- summary JSON
-- summary table text
-
-## Server-only embedding bakeoff
-
-If we want to improve retrieval quality, the next likely lever after HNSW sweeps is embeddings.
-
-The server-only runner for that is:
-
-- `scripts/run_sochdb_embedding_bakeoff.sh`
+- search latency
+- index size at scale
 
-It does this on the server:
+This lane does not measure:
 
-1. generates embeddings for each configured model
-2. runs the same HNSW quality sweep for each embedding set
-3. writes one result directory per model
+- retrieval relevance quality
+- semantic usefulness on real-world corpora
 
-That keeps the heavy work off the laptop and keeps the comparison methodology clean.
+## Next step
 
-When models need different embedding backends, use `MODEL_BACKENDS` to override
-the default backend per model. Example:
+Use the corrected native steady-state methodology for the next stages:
 
-```bash
-DATASET_DIR=$HOME/sochdb-benchmark-runs/datasets/scifact \
-MODEL_BACKENDS=thenlper/gte-small=sentence-transformers \
-MODELS=BAAI/bge-small-en-v1.5,thenlper/gte-small \
-$HOME/sochdb-benchmark-runs/work/run_sochdb_embedding_bakeoff.sh
-```
+1. publish `100GB`
+2. publish `250GB`
+3. defer `1TB` until storage is expanded

From ea0583c16f1a4c01e3faf68755085fff289242cf Mon Sep 17 00:00:00 2001
From: Sandeep
Date: Tue, 12 May 2026 17:33:14 -0500
Subject: [PATCH 3/3] Highlight latest benchmark results in README

---
 README.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/README.md b/README.md
index f941d81..f1836f1 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,22 @@ This repository contains reproducible benchmarks comparing **SochDB** against ot
 **📊 [See Published Results](PUBLISHED_RESULTS.md)** - Comprehensive benchmark findings with real LLM integration
 **🖥️ [See Server Benchmark Status](docs/SERVER_BENCHMARK_STATUS.md)** - Current hosted benchmark lane, SciFact quality takeaway, and staged scale plan
 
+## Current Highlights
+
+Latest published benchmark takeaways:
+
+- **Quality**: Best current SciFact result uses `BAAI/bge-base-en-v1.5`
+  with `recall@5 = 0.8121`, `MRR = 0.7017`, and `nDCG@5 = 0.7258`
+- **Scale**: The corrected `10GB` staged run reached about `506.63 QPS`
+  with about `1.97 ms` mean latency after a one-time `106.85 s` index load
+- **Methodology**: The earlier `~110s/query` `10GB` result came from a bad
+  harness path and is not the real steady-state engine search number
+
+For the latest published benchmark state, see:
+
+- [docs/SERVER_BENCHMARK_STATUS.md](docs/SERVER_BENCHMARK_STATUS.md)
+- [docs/STAGED_BENCHMARK_PLAN.md](docs/STAGED_BENCHMARK_PLAN.md)
+
 ## Overview
 
 We provide benchmarks across different dimensions:
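
Note on the series: the methodology change in these patches (load the index once, optionally warm up, then time repeated in-process queries) can be illustrated with a minimal Python sketch. This is not the code in `run_10gb_bench.py`, which is not part of these patches. The names `summarize_latencies`, `time_queries`, and the stand-in `fake_search` are hypothetical, and the nearest-rank percentile is only an assumption about how fields such as `p50_ms` and `p95_ms` might be derived. The output keys mirror the `search_sequential` block of the summary JSON.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile_ms(sorted_latencies_s: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of per-query latencies, returned in milliseconds."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_latencies_s)))
    return sorted_latencies_s[rank - 1] * 1000.0


def summarize_latencies(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Turn per-query wall-clock latencies (seconds) into summary-style fields."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    total_s = sum(ordered)
    return {
        "total_s": total_s,
        # Sequential QPS is simply queries divided by total search time.
        "qps": len(ordered) / total_s if total_s > 0 else float("inf"),
        "mean_ms": total_s / len(ordered) * 1000.0,
        "p50_ms": percentile_ms(ordered, 50),
        "p95_ms": percentile_ms(ordered, 95),
    }


def time_queries(
    search_fn: Callable[[object], object],
    queries: Sequence[object],
    warmup: int = 0,
) -> List[float]:
    """Time each query against an already-loaded index.

    `search_fn` is a hypothetical callable wrapping an in-process search;
    index loading happens before this function is called, so load time is
    never counted as query latency.
    """
    for query in queries[:warmup]:
        search_fn(query)  # warm-up calls are executed but not timed
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Stand-in for a real search call; assumption for illustration only.
    def fake_search(query):
        return sorted(query)[:10]

    demo_queries = [[float(i % 7), 1.0, 2.0] for i in range(100)]
    print(summarize_latencies(time_queries(fake_search, demo_queries, warmup=10)))
```

The same arithmetic is consistent with the published summary: `1000` queries over a `total_s` of about `1.97 s` gives about `506.6 QPS`. A subprocess-per-query harness would instead fold process startup and index load into every sample, which is the artifact the patches describe.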