From 6768d187ff69cc89cca00dbd8efba38d05dde363 Mon Sep 17 00:00:00 2001
From: Sandeep <saisandeep.kantareddy@gmail.com>
Date: Tue, 12 May 2026 17:22:02 -0500
Subject: [PATCH 1/3] Publish corrected 10GB native benchmark result

---
 docs/SERVER_BENCHMARK_STATUS.md               | 64 ++++++++++++++-----
 docs/STAGED_BENCHMARK_PLAN.md                 | 21 ++++--
 ...60512T_10gb_optimized_native_metadata.json | 29 +++++++++
 ...260512T_10gb_optimized_native_summary.json | 44 +++++++++++++
 4 files changed, 138 insertions(+), 20 deletions(-)
 create mode 100644 reports/runs/20260512T_10gb_optimized_native_metadata.json
 create mode 100644 reports/runs/20260512T_10gb_optimized_native_summary.json
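
Note on methodology: the docs in this patch contrast a subprocess-per-query
harness with a load-once, in-process measurement. `run_10gb_bench.py` itself
is not part of this series, so the following is only a minimal sketch of the
load-once pattern, not the actual script. `measure_steady_state`, `fake_load`,
and `fake_search` are hypothetical names; the real run would pass something
like `sochdb.HnswIndex.load(...)` and `index.search(...)`, whose signatures
are not shown here.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_steady_state(
    load_index: Callable[[], object],
    search: Callable[[object, Sequence[float]], object],
    queries: Sequence[Sequence[float]],
    warmup: int = 10,
) -> dict:
    """Load the index once, warm up, then time every query in-process."""
    t0 = time.perf_counter()
    index = load_index()  # paid once, reported separately from search latency
    load_s = time.perf_counter() - t0

    for q in queries[:warmup]:  # warmup queries are not timed
        search(index, q)

    latencies_ms = []
    for q in queries:
        t = time.perf_counter()
        search(index, q)
        latencies_ms.append((time.perf_counter() - t) * 1000.0)

    total_s = sum(latencies_ms) / 1000.0
    # 99 cut points: cuts[49] is the median, cuts[94] the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "load_s": load_s,
        "qps": len(latencies_ms) / total_s,
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }


def fake_load():
    # Hypothetical stand-in for an expensive index load.
    return [[float(i + j) for j in range(8)] for i in range(1000)]


def fake_search(index, q):
    # Brute-force nearest row; stands in for an ANN search call.
    return min(
        range(len(index)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(index[i], q)),
    )


queries = [[float(i)] * 8 for i in range(50)]
report = measure_steady_state(fake_load, fake_search, queries)
```

Because load time is reported on its own, `qps` and the percentiles describe
repeated queries against an already-loaded index, which is the quantity the
corrected rerun is meant to capture.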

diff --git a/docs/SERVER_BENCHMARK_STATUS.md b/docs/SERVER_BENCHMARK_STATUS.md
index f75c94a..0968ad6 100644
--- a/docs/SERVER_BENCHMARK_STATUS.md
+++ b/docs/SERVER_BENCHMARK_STATUS.md
@@ -60,6 +60,8 @@ Local snapshots now checked into this repo:
 - [`reports/runs/20260429T_next_gte_small_st_embedding_metadata.json`](../reports/runs/20260429T_next_gte_small_st_embedding_metadata.json)
 - [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
 - [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
+- [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
 ## Established quality finding
 
@@ -194,9 +196,9 @@ For large-scale system work:
 3. run `250GB` only after confirming disk headroom
 4. defer `1TB` until storage is expanded
 
-## Current staged run
+## Current staged `10GB` status
 
-The first staged large-scale run is now complete:
+The first staged large-scale run is now understood in two parts:
 
 - run id: `20260503T_stage10gb_d768`
 - dataset: `synthetic_10gib_768d`
@@ -208,14 +210,24 @@ The first staged large-scale run is now complete:
 - local summary: [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
 - local metadata: [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
 
+Corrected native rerun:
+
+- run id: `20260512T_10gb_optimized_native`
+- workload: `sochdb_native_10gb`
+- server script: `run_10gb_bench.py`
+- local summary: [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- local metadata: [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
+
 Important implementation note:
 
-- the staged lane now uses the compiled `sochdb-bulk` binary on the server for
-  build and query operations
-- this replaced the earlier Python-only workload path, which failed on the
-  hosted machine because the stale `VectorIndex` import path was not available
+- the first successful end-to-end staged run used the compiled `sochdb-bulk`
+  path because the old Python server environment was blocked
+- that got the lane running, but it was not a trustworthy steady-state repeated
+  query benchmark
+- the corrected May 12 rerun used `sochdb.HnswIndex.load(...)` once and then
+  in-process `index.search(...)` / `index.search_batch(...)`
 
-### `10GB` staged result
+### What the first published `10GB` run proved
 
 What succeeded:
 
@@ -224,7 +236,7 @@ What succeeded:
 - observed build throughput was about `891.6 vec/s`
 - output index size was about `10,069.1 MB`
 
-What failed the performance bar:
+What looked bad at first:
 
 - `250` queries took about `27,455.91 s`
 - search throughput was only about `0.0091 QPS`
@@ -232,13 +244,32 @@ What failed the performance bar:
 - `p95` latency was about `110,108 ms`
 - `mean` latency was about `109,822 ms`
 
+Why that result was misleading:
+
+- that runner used `bulk_query_from_file(...)`, which shells out once per query
+- the CLI query path loads the large index before searching
+- the benchmark therefore measured repeated subprocess startup and repeated index
+  reload more than it measured real steady-state ANN search
+
+### Corrected `10GB` native rerun
+
+The verified May 12 rerun on the server showed:
+
+- one-time index load: about `106.85 s`
+- sequential search: about `506.63 QPS`
+- sequential mean latency: about `1.97 ms`
+- sequential `p50`: about `1.87 ms`
+- sequential `p95`: about `2.40 ms`
+- batch search: about `356 QPS`
+
 Interpretation:
 
-- the staged runner itself is now working end to end
-- the bottleneck has moved from benchmark plumbing to SochDB query-path behavior
-- we should not scale this lane to `100GB` yet
-- the next benchmark task is diagnosing why the current search path is roughly
-  `~110 s` per query at `10GB`
+- the catastrophic `~110 s/query` result was a benchmark harness artifact
+- the corrected in-process search result is the meaningful steady-state number
+- the large-scale story is now materially stronger than the old published docs
+  suggested
+- we should use the corrected native path as the baseline for future staged
+  `100GB` and `250GB` work
 
 ## Scripts that define the server lane
 
@@ -253,8 +284,11 @@ Related planning docs:
 
 ## What is still pending
 
-- investigate the `10GB` search-latency failure before running `100GB`
-- complete the staged `10GB` -> `100GB` -> `250GB` scale path after that
+- replace the old misleading `10GB` interpretation everywhere it still appears
+- decide whether to publish the corrected native rerun as the canonical `10GB`
+  search result in a dedicated comparison doc/table
+- continue the staged `10GB` -> `100GB` -> `250GB` scale path using the native
+  steady-state methodology
 - defer `1TB` claims until storage is expanded
 
 This file should be the first place to update whenever new server benchmark work
diff --git a/docs/STAGED_BENCHMARK_PLAN.md b/docs/STAGED_BENCHMARK_PLAN.md
index 2e31c1c..66580d6 100644
--- a/docs/STAGED_BENCHMARK_PLAN.md
+++ b/docs/STAGED_BENCHMARK_PLAN.md
@@ -41,16 +41,27 @@ Current server state:
 
 - run `20260503T_stage10gb_d768` completed on the benchmark server
 - dataset: `synthetic_10gib_768d`
-- runner path now uses the compiled `sochdb-bulk` binary for index build/query
-- this avoids the stale in-process `VectorIndex` path that failed on the hosted box
+- the first published run used the compiled `sochdb-bulk` binary for build/query
+- a later corrected rerun used the in-process native `HnswIndex.load(...)` +
+  `index.search(...)` path from `run_10gb_bench.py`
 
 Current outcome:
 
 - build completed successfully for `3,495,253` vectors at about `892 vec/s`
 - output index size was about `10,069 MB`
-- search throughput was only about `0.0091 QPS`
-- `p50` query latency was about `109,814 ms`
-- the next priority is query-path investigation before moving on to `100GB`
+- the original published `0.0091 QPS` / `109,814 ms p50` search result is now
+  understood to be a harness artifact, not the true steady-state search speed
+- the corrected May 12 native rerun measured about `506.6 QPS` with about
+  `1.87 ms p50` and `1.97 ms` mean latency after a one-time `106.85 s` index
+  load
+- the next priority is publishing the corrected native lane cleanly and then
+  continuing the staged scale path with the right measurement method
+
+Methodology warning:
+
+- the original slow search number came from a subprocess-per-query bulk CLI
+  path that reloaded the large index repeatedly
+- do not treat that number as the engine's steady-state search performance
 
 What this lane measures:
 
diff --git a/reports/runs/20260512T_10gb_optimized_native_metadata.json b/reports/runs/20260512T_10gb_optimized_native_metadata.json
new file mode 100644
index 0000000..70dea87
--- /dev/null
+++ b/reports/runs/20260512T_10gb_optimized_native_metadata.json
@@ -0,0 +1,29 @@
+{
+  "run_id": "20260512T_10gb_optimized_native",
+  "timestamp_utc": "2026-05-12T08:23:20.389332",
+  "dataset_name": "synthetic_10gib_768d",
+  "dataset_dir": "<benchmark-workspace>/datasets/synthetic_10gib_768d",
+  "result_json": "<benchmark-workspace>/results/10gb_optimized/results_m16.json",
+  "workload": "sochdb_native_10gb",
+  "methodology": {
+    "script": "<benchmark-workspace>/work/run_10gb_bench.py",
+    "search_path": "in-process native extension",
+    "index_load": "load once before repeated queries",
+    "warmup_queries": 10,
+    "notes": [
+      "This rerun avoids the per-query subprocess path used by the earlier bulk CLI harness.",
+      "It should be treated as the corrected steady-state search measurement for the loaded index.",
+      "This artifact was verified on the server and then copied into the repo."
+    ]
+  },
+  "config": {
+    "num_vectors": 3495253,
+    "num_queries": 1000,
+    "dimension": 768,
+    "M": 32,
+    "ef_construction": 200,
+    "ef_search": 64,
+    "k": 10,
+    "batch_size": 5000
+  }
+}
diff --git a/reports/runs/20260512T_10gb_optimized_native_summary.json b/reports/runs/20260512T_10gb_optimized_native_summary.json
new file mode 100644
index 0000000..ed3df33
--- /dev/null
+++ b/reports/runs/20260512T_10gb_optimized_native_summary.json
@@ -0,0 +1,44 @@
+{
+  "load_s": 106.85488888109103,
+  "search_sequential": {
+    "total_s": 1.973815259989351,
+    "qps": 506.6330270470171,
+    "mean_ms": 1.973815259989351,
+    "p50_ms": 1.8717250786721706,
+    "p95_ms": 2.4010292254388332,
+    "p99_ms": 6.252808030694723
+  },
+  "search_batch": {
+    "total_s": 2.808144075796008,
+    "qps": 356.1070846112253,
+    "per_query_ms": 2.808144075796008
+  },
+  "search_batch_ef64": {
+    "qps": 358.62969130334096,
+    "per_query_ms": 2.788391547743231
+  },
+  "search_batch_ef128": {
+    "qps": 358.11082750298215,
+    "per_query_ms": 2.792431625071913
+  },
+  "search_batch_ef256": {
+    "qps": 359.017884829031,
+    "per_query_ms": 2.785376557148993
+  },
+  "search_batch_ef512": {
+    "qps": 356.51812853303915,
+    "per_query_ms": 2.8049064548686147
+  },
+  "config": {
+    "num_vectors": 3495253,
+    "num_queries": 1000,
+    "dimension": 768,
+    "M": 32,
+    "ef_construction": 200,
+    "ef_search": 64,
+    "k": 10,
+    "batch_size": 5000
+  },
+  "timestamp": "2026-05-12T08:23:20.389332",
+  "workload": "sochdb_native_10gb"
+}

From 5558a4b8930e10c0422673d0a9b846f844510dbb Mon Sep 17 00:00:00 2001
From: Sandeep <saisandeep.kantareddy@gmail.com>
Date: Tue, 12 May 2026 17:27:24 -0500
Subject: [PATCH 2/3] Simplify benchmark status docs

---
 docs/SERVER_BENCHMARK_STATUS.md | 294 +++++---------------------------
 docs/STAGED_BENCHMARK_PLAN.md   | 163 ++++--------------
 2 files changed, 79 insertions(+), 378 deletions(-)
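
Note on the figures: this patch drops most of the derivation text, so here is
a small arithmetic cross-check of the numbers that remain. All inputs are
copied from the docs and JSON artifacts in this series. The only assumption is
4 bytes per component (float32), which happens to reproduce the published
vector count. Treat it as a sanity check, not a re-measurement.

```python
GIB = 2**30

# Dataset size: 10 GiB of 768-dimensional vectors.
dimension = 768
bytes_per_component = 4  # assumption: float32 storage
vectors = (10 * GIB) // (dimension * bytes_per_component)  # 3,495,253

# First published run (subprocess-per-query harness).
old_queries, old_total_s = 250, 27_455.91
old_qps = old_queries / old_total_s          # about 0.0091
old_per_query_s = old_total_s / old_queries  # about 109.8 s

# Corrected native rerun (load once, search in-process).
load_s = 106.85
seq_queries, seq_total_s = 1000, 1.973815259989351
seq_qps = seq_queries / seq_total_s          # about 506.63

# Build throughput from build time.
build_vec_per_s = vectors / 3920.14          # about 891.6

# Share of the old per-query time that one index load would explain.
load_share = load_s / old_per_query_s        # about 0.97

print(vectors, round(old_qps, 4), round(seq_qps, 2), round(load_share, 2))
```

One index load (`106.85 s`) accounts for roughly 97% of the old per-query
time (`~109.8 s`). That is consistent with the explanation that the first
harness mostly measured repeated index reloads, although the two runs used
different load paths, so the comparison is indicative rather than exact.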

diff --git a/docs/SERVER_BENCHMARK_STATUS.md b/docs/SERVER_BENCHMARK_STATUS.md
index 0968ad6..6b9a3e4 100644
--- a/docs/SERVER_BENCHMARK_STATUS.md
+++ b/docs/SERVER_BENCHMARK_STATUS.md
@@ -1,259 +1,67 @@
 # Server Benchmark Status
 
-This document captures the current state of the heavy benchmark lane that runs on
-the hosted SochDB server instead of on a laptop.
+This document captures the current benchmark story for the hosted SochDB server.
 
-## Why the server lane exists
+## Current setup
 
-Heavy benchmark work should happen on the benchmark server, not on a developer
-laptop. That is especially true for:
+- heavy benchmark work runs on the benchmark server, not on laptops
+- hosted gRPC demo endpoint: `studio.agentslab.host:50053`
+- current server class: about `12` CPU, about `62 GiB` RAM
+- free disk capacity is still too limited for a final `1TB` claim
 
-- retrieval-quality sweeps
-- embedding bakeoffs
-- staged large-dataset runs
-- repeatable gRPC benchmark runs against the hosted demo endpoint
-
-Current server target:
-
-- host: private benchmark server
-- SSH: stored out-of-band for operators only
-- hosted gRPC endpoint: `studio.agentslab.host:50053`
-
-## Current server constraints
-
-The server is good enough for repeated CPU-oriented benchmark work, but it is not
-the right machine for a final `1TB` claim yet.
-
-- about `12` CPU
-- about `62 GiB` RAM
-- limited free root-disk capacity for honest `1TB` benchmarking
-- weak GPU (`GeForce GT 710`), so embedding work should remain CPU-friendly
-
-Because of that, the large-scale benchmark story should stay staged:
+Because of that, large-scale benchmarking stays staged:
 
 1. `10GB`
 2. `100GB`
 3. `250GB`
-4. `1TB` only after moving to a larger disk or attached storage
-
-## Current benchmark workspace on the server
-
-- `<benchmark-workspace>/datasets`
-- `<benchmark-workspace>/embeddings`
-- `<benchmark-workspace>/results`
-- `<benchmark-workspace>/logs`
-- `<benchmark-workspace>/work`
-
-These locations should be treated as the canonical landing zone for heavy benchmark
-artifacts before we selectively publish summaries back into this repo.
-
-Local snapshots now checked into this repo:
-
-- [`reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json`](../reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json)
-- [`reports/runs/20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
-- [`reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json`](../reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json)
-- [`reports/runs/20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
-- [`reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json`](../reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json)
-- [`reports/runs/20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
-- [`reports/runs/20260429T_next_bge_base_embedding_metadata.json`](../reports/runs/20260429T_next_bge_base_embedding_metadata.json)
-- [`reports/runs/20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
-- [`reports/runs/20260429T_next_gte_small_st_embedding_metadata.json`](../reports/runs/20260429T_next_gte_small_st_embedding_metadata.json)
-- [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
-- [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
-- [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
-- [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
-
-## Established quality finding
+4. `1TB` only after storage expansion
 
-The most important retrieval-quality result so far is that embedding choice moved
-quality more than HNSW tuning on SciFact.
+## Published quality result
 
-Latest verified server runs:
+Quality is measured separately from scale, using SciFact retrieval benchmarks.
 
-- baseline sweep run: `20260427T225122Z`
-- baseline pilot metadata run: `20260427T224143Z`
-- `BAAI/bge-small-en-v1.5` sweep run: `20260427T230412Z`
-- `thenlper/gte-small` sweep run: `20260429T_next_gte_small_st`
-- `BAAI/bge-base-en-v1.5` sweep run: `20260429T_next_bge_base`
+Current best published result:
 
-Baseline embedding metadata:
-
-- backend: `sentence-transformers`
-- model: `sentence-transformers/all-MiniLM-L6-v2`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-BGE comparison embedding metadata:
-
-- backend: `fastembed`
-- model: `BAAI/bge-small-en-v1.5`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-GTE-small embedding metadata:
-
-- backend: `sentence-transformers`
-- model: `thenlper/gte-small`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-BGE-base embedding metadata:
-
-- backend: `fastembed`
 - model: `BAAI/bge-base-en-v1.5`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `768`
-
-Summary of the current conclusion:
+- recall@5: `0.8121`
+- MRR: `0.7017`
+- nDCG@5: `0.7258`
 
-- baseline SciFact `recall@5` was about `0.7109`
-- `thenlper/gte-small` reached about `0.7786` `recall@5`
-- `BAAI/bge-base-en-v1.5` reached about `0.8121` `recall@5`
-- `MRR` and `nDCG` improved as well
-- `gte-small` stayed near baseline latency
-- `bge-base-en-v1.5` improved quality further, but with noticeably higher latency
-- HNSW parameter sweeps did not meaningfully change quality compared with the
-  embedding-model change
+Main takeaway:
 
-### Exact SciFact comparison
+- embedding choice moved quality more than HNSW tuning: quality metrics were identical across HNSW profiles for each model
 
-| Embeddings | Run | recall@5 | MRR | nDCG@5 | p50 (ms) | p95 (ms) | mean (ms) |
-| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
-| `all-MiniLM-L6-v2` + `fast` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.857` | `0.996` | `0.800` |
-| `all-MiniLM-L6-v2` + `balanced` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.953` | `1.019` | `0.880` |
-| `all-MiniLM-L6-v2` + `quality` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.900` | `1.001` | `0.813` |
-| `BAAI/bge-small-en-v1.5` + `fast` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.920` | `1.041` | `0.840` |
-| `BAAI/bge-small-en-v1.5` + `balanced` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.929` | `0.985` | `0.833` |
-| `BAAI/bge-small-en-v1.5` + `quality` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.720` | `0.992` | `0.775` |
-| `thenlper/gte-small` + `fast` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.955` | `1.028` | `0.878` |
-| `thenlper/gte-small` + `balanced` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.976` | `1.049` | `0.888` |
-| `thenlper/gte-small` + `quality` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.968` | `1.056` | `0.901` |
-| `BAAI/bge-base-en-v1.5` + `fast` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.787` | `3.287` | `1.951` |
-| `BAAI/bge-base-en-v1.5` + `balanced` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `2.683` | `4.243` | `2.989` |
-| `BAAI/bge-base-en-v1.5` + `quality` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.823` | `2.946` | `2.189` |
+Useful reference points:
 
-### Best-to-best summary
-
-Using the best observed latency profile from each embedding set:
-
-| Comparison | recall@5 | MRR | nDCG@5 | mean latency |
+| Embeddings | recall@5 | MRR | nDCG@5 | best mean latency |
 | :--- | ---: | ---: | ---: | ---: |
-| baseline `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
+| `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
 | `BAAI/bge-small-en-v1.5` | `0.7624` | `0.6603` | `0.6812` | `0.775 ms` |
 | `thenlper/gte-small` | `0.7786` | `0.6711` | `0.6944` | `0.878 ms` |
 | `BAAI/bge-base-en-v1.5` | `0.8121` | `0.7017` | `0.7258` | `1.951 ms` |
 
-Observed gains from the embedding change:
-
-- `recall@5`: `+0.0516` absolute, about `+7.3%` relative
-- `MRR`: `+0.0719` absolute, about `+12.2%` relative
-- `nDCG@5`: `+0.0677` absolute, about `+11.0%` relative
+Published artifacts:
 
-Observed gains for `thenlper/gte-small` over baseline:
+- [`20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
+- [`20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
+- [`20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
+- [`20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
 
-- `recall@5`: `+0.0677` absolute, about `+9.5%` relative
-- `MRR`: `+0.0827` absolute, about `+14.1%` relative
-- `nDCG@5`: `+0.0809` absolute, about `+13.2%` relative
+## Published `10GB` scale result
 
-Observed gains for `BAAI/bge-base-en-v1.5` over baseline:
+Dataset:
 
-- `recall@5`: `+0.1012` absolute, about `+14.2%` relative
-- `MRR`: `+0.1134` absolute, about `+19.3%` relative
-- `nDCG@5`: `+0.1122` absolute, about `+18.3%` relative
-
-Interpretation:
-
-- the next strong retrieval lever is embedding selection
-- `BAAI/bge-base-en-v1.5` is the current quality leader on SciFact
-- `thenlper/gte-small` is a useful middle point when we want a lighter latency hit
-- HNSW sweeps are still useful for latency/recall tradeoff mapping
-- we should not oversell ANN tuning as the main quality breakthrough
-- dimensionality matters in this comparison set, so `384`-dim and `768`-dim wins
-  should not be treated as identical cost classes
-
-## Recommended benchmark order from here
-
-For retrieval work, keep the methodology disciplined:
-
-1. fix the dataset
-2. fix the embedding model
-3. sweep HNSW settings
-4. compare `recall@k`, `MRR`, `nDCG`, and latency
-5. compare embedding models on the same benchmark path
-
-For large-scale system work:
-
-1. complete clean `10GB` results
-2. publish `100GB` results
-3. run `250GB` only after confirming disk headroom
-4. defer `1TB` until storage is expanded
-
-## Current staged `10GB` status
-
-The first staged large-scale run is now understood in two parts:
-
-- run id: `20260503T_stage10gb_d768`
-- dataset: `synthetic_10gib_768d`
-- target size: `10 GiB`
+- dataset: `synthetic_10gib_768d`
+- vectors: `3,495,253`
 - dimension: `768`
-- query count: `250`
-- runner: `scripts/run_sochdb_stage_vector.sh`
-- workload: `benchmarks/run_bulk_vector_workload.py`
-- local summary: [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
-- local metadata: [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
-
-Corrected native rerun:
-
-- run id: `20260512T_10gb_optimized_native`
-- workload: `sochdb_native_10gb`
-- server script: `run_10gb_bench.py`
-- local summary: [`reports/runs/20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
-- local metadata: [`reports/runs/20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
-Important implementation note:
+Build result:
 
-- the first successful end-to-end staged run used the compiled `sochdb-bulk`
-  path because the old Python server environment was blocked
-- that got the lane running, but it was not a trustworthy steady-state repeated
-  query benchmark
-- the corrected May 12 rerun used `sochdb.HnswIndex.load(...)` once and then
-  in-process `index.search(...)` / `index.search_batch(...)`
+- build throughput: about `891.6 vec/s`
+- build time: about `3920.14 s`
+- output index size: about `10,069.1 MB`
 
-### What the first published `10GB` run proved
-
-What succeeded:
-
-- index build completed successfully for `3,495,253` vectors
-- build time was about `3920.14 s` (`~65.3 min`)
-- observed build throughput was about `891.6 vec/s`
-- output index size was about `10,069.1 MB`
-
-What looked bad at first:
-
-- `250` queries took about `27,455.91 s`
-- search throughput was only about `0.0091 QPS`
-- `p50` latency was about `109,814 ms`
-- `p95` latency was about `110,108 ms`
-- `mean` latency was about `109,822 ms`
-
-Why that result was misleading:
-
-- that runner used `bulk_query_from_file(...)`, which shells out once per query
-- the CLI query path loads the large index before searching
-- the benchmark therefore measured repeated subprocess startup and repeated index
-  reload more than it measured real steady-state ANN search
-
-### Corrected `10GB` native rerun
-
-The verified May 12 rerun on the server showed:
+Corrected steady-state search result:
 
 - one-time index load: about `106.85 s`
 - sequential search: about `506.63 QPS`
@@ -262,34 +70,20 @@ The verified May 12 rerun on the server showed:
 - sequential `p95`: about `2.40 ms`
 - batch search: about `356 QPS`
 
-Interpretation:
-
-- the catastrophic `~110 s/query` result was a benchmark harness artifact
-- the corrected in-process search result is the meaningful steady-state number
-- the large-scale story is now materially stronger than the old published docs
-  suggested
-- we should use the corrected native path as the baseline for future staged
-  `100GB` and `250GB` work
+Important note:
 
-## Scripts that define the server lane
+- the earlier `0.0091 QPS` / `~110 s per query` result came from a harness that
+  reloaded the index in a new subprocess per query; it is not a steady-state result
 
-- `scripts/run_sochdb_grpc_quality_sweep.sh`
-- `scripts/run_sochdb_embedding_bakeoff.sh`
-- `scripts/run_sochdb_stage_vector.sh`
+Published artifacts:
 
-Related planning docs:
+- [`20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
+- [`20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
+- [`20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- [`20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
-- [`STAGED_BENCHMARK_PLAN.md`](./STAGED_BENCHMARK_PLAN.md)
-- [`RETRIEVAL_AND_VECTOR_PLAN.md`](./RETRIEVAL_AND_VECTOR_PLAN.md)
+## What is pending
 
-## What is still pending
-
-- replace the old misleading `10GB` interpretation everywhere it still appears
-- decide whether to publish the corrected native rerun as the canonical `10GB`
-  search result in a dedicated comparison doc/table
-- continue the staged `10GB` -> `100GB` -> `250GB` scale path using the native
-  steady-state methodology
+- publish `100GB` results using the corrected native steady-state methodology
+- publish `250GB` results after confirming disk headroom
 - defer `1TB` claims until storage is expanded
-
-This file should be the first place to update whenever new server benchmark work
-changes the current benchmark story.
diff --git a/docs/STAGED_BENCHMARK_PLAN.md b/docs/STAGED_BENCHMARK_PLAN.md
index 66580d6..0f57012 100644
--- a/docs/STAGED_BENCHMARK_PLAN.md
+++ b/docs/STAGED_BENCHMARK_PLAN.md
@@ -1,157 +1,64 @@
 # Staged Benchmark Plan
 
-For the current hosted benchmark state and the already-established retrieval
-quality takeaway, start with
-[`SERVER_BENCHMARK_STATUS.md`](./SERVER_BENCHMARK_STATUS.md).
+Start with [`SERVER_BENCHMARK_STATUS.md`](./SERVER_BENCHMARK_STATUS.md) for the
+current published results.
 
 ## Why staged
 
-The current benchmark server does not have enough free root-disk capacity for a
-clean, honest `1TB` run today.
+The current benchmark server does not yet have enough free disk capacity for a
+clean final `1TB` benchmark claim.
 
-So we should run staged sizes:
+So the scale path stays staged:
 
 1. `10GB`
 2. `100GB`
 3. `250GB`
-4. `1TB` only after storage is expanded
+4. `1TB` after storage expansion
 
-## Current benchmark workspace
+## Current `10GB` status
 
-Server:
+The first staged dataset is:
 
-- `<benchmark-workspace>/datasets`
-- `<benchmark-workspace>/results`
-- `<benchmark-workspace>/logs`
-- `<benchmark-workspace>/work`
+- dataset: `synthetic_10gib_768d`
+- vectors: `3,495,253`
+- dimension: `768`
+
+Published result summary:
+
+- build completed and produced an index of about `10,069 MB`
+- corrected steady-state search result is about `506.63 QPS`
+- corrected sequential mean latency is about `1.97 ms`
+- one-time index load is about `106.85 s`
 
-## First staged `10GB` lane
+Important note:
 
-For the first large-scale system pass, use a synthetic normalized-vector dataset
-for throughput and latency characterization. Keep this separate from the SciFact
-quality lane.
+- the original `0.0091 QPS` search number came from the subprocess-per-query
+  bulk CLI harness, which reloaded the index each time; it is a methodology artifact
 
-Reusable scripts:
+## Reusable scripts
 
 - `scripts/generate_staged_vector_dataset.py`
 - `scripts/run_sochdb_stage_vector.sh`
 - `benchmarks/run_bulk_vector_workload.py`
 
-Current server state:
-
-- run `20260503T_stage10gb_d768` completed on the benchmark server
-- dataset: `synthetic_10gib_768d`
-- the first published run used the compiled `sochdb-bulk` binary for build/query
-- a later corrected rerun used the in-process native `HnswIndex.load(...)` +
-  `index.search(...)` path from `run_10gb_bench.py`
-
-Current outcome:
-
-- build completed successfully for `3,495,253` vectors at about `892 vec/s`
-- output index size was about `10,069 MB`
-- the original published `0.0091 QPS` / `109,814 ms p50` search result is now
-  understood to be a harness artifact, not the true steady-state search speed
-- the corrected May 12 native rerun measured about `506.6 QPS` with about
-  `1.87 ms p50` and `1.97 ms` mean latency after a one-time `106.85 s` index
-  load
-- the next priority is publishing the corrected native lane cleanly and then
-  continuing the staged scale path with the right measurement method
-
-Methodology warning:
-
-- the original slow search number came from a subprocess-per-query bulk CLI
-  path that reloaded the large index repeatedly
-- do not treat that number as the engine's steady-state search performance
+## What this lane is for
 
-What this lane measures:
+This lane measures:
 
-- insert throughput
+- build throughput
 - search QPS
-- search latency percentiles
-- dataset payload size and run metadata
-
-What it does not claim:
-
-- retrieval quality on BEIR/SciFact
-- production corpus realism
-
-Example:
-
-```bash
-TARGET_GIB=10 DIM=768 \
-$HOME/sochdb-benchmark-runs/work/run_sochdb_stage_vector.sh
-```
-
-## Pilot runner
-
-The first reusable runner is:
-
-- `scripts/run_sochdb_grpc_pilot.sh`
-
-It wraps the gRPC retrieval benchmark and writes:
-
-- result JSON
-- metadata JSON
-- full log
-
-## Example
-
-```bash
-DATASET_DIR=$HOME/sochdb-benchmark-runs/datasets/scifact \
-EMBEDDING_DIR=$HOME/sochdb-benchmark-runs/datasets/scifact-embeddings \
-$HOME/sochdb-benchmark-runs/work/run_sochdb_grpc_pilot.sh
-```
-
-## Quality-first benchmark lane
-
-For quality work, we should avoid mixing too many variables at once.
-
-The evaluation order should be:
-
-1. fix the dataset
-2. fix the embedding model
-3. sweep HNSW settings
-4. compare recall / nDCG / latency
-5. only then compare different embedding models
-
-The first quality sweep runner is:
-
-- `scripts/run_sochdb_grpc_quality_sweep.sh`
-
-It compares three useful HNSW profiles:
-
-- `fast`
-- `balanced`
-- `quality`
-
-and writes:
-
-- per-run result JSON
-- summary JSON
-- summary table text
-
-## Server-only embedding bakeoff
-
-If we want to improve retrieval quality, the next likely lever after HNSW sweeps is embeddings.
-
-The server-only runner for that is:
-
-- `scripts/run_sochdb_embedding_bakeoff.sh`
+- search latency
+- index size at scale
 
-It does this on the server:
+This lane does not measure:
 
-1. generates embeddings for each configured model
-2. runs the same HNSW quality sweep for each embedding set
-3. writes one result directory per model
+- retrieval relevance quality
+- semantic usefulness on real-world corpora
 
-That keeps the heavy work off the laptop and keeps the comparison methodology clean.
+## Next step
 
-When models need different embedding backends, use `MODEL_BACKENDS` to override
-the default backend per model. Example:
+Use the corrected native steady-state methodology for the next stages:
 
-```bash
-DATASET_DIR=$HOME/sochdb-benchmark-runs/datasets/scifact \
-MODEL_BACKENDS=thenlper/gte-small=sentence-transformers \
-MODELS=BAAI/bge-small-en-v1.5,thenlper/gte-small \
-$HOME/sochdb-benchmark-runs/work/run_sochdb_embedding_bakeoff.sh
-```
+1. publish `100GB`
+2. publish `250GB`
+3. defer `1TB` until storage is expanded

From ea0583c16f1a4c01e3faf68755085fff289242cf Mon Sep 17 00:00:00 2001
From: Sandeep <saisandeep.kantareddy@gmail.com>
Date: Tue, 12 May 2026 17:33:14 -0500
Subject: [PATCH 3/3] Highlight latest benchmark results in README

---
 README.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/README.md b/README.md
index f941d81..f1836f1 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,22 @@ This repository contains reproducible benchmarks comparing **SochDB** against ot
 **📊 [See Published Results](PUBLISHED_RESULTS.md)** - Comprehensive benchmark findings with real LLM integration
 **🖥️ [See Server Benchmark Status](docs/SERVER_BENCHMARK_STATUS.md)** - Current hosted benchmark lane, SciFact quality takeaway, and staged scale plan
 
+## Current Highlights
+
+Latest published benchmark takeaways:
+
+- **Quality**: Best current SciFact result uses `BAAI/bge-base-en-v1.5`
+  with `recall@5 = 0.8121`, `MRR = 0.7017`, and `nDCG@5 = 0.7258`
+- **Scale**: The corrected `10GB` staged run reached about `506.63 QPS`
+  with about `1.97 ms` mean latency after a one-time `106.85 s` index load
+- **Methodology**: The earlier `~110 s/query` `10GB` result came from a harness
+  that reloaded the index for every query; it is not a steady-state search number
+
+For the latest published benchmark state, see:
+
+- [docs/SERVER_BENCHMARK_STATUS.md](docs/SERVER_BENCHMARK_STATUS.md)
+- [docs/STAGED_BENCHMARK_PLAN.md](docs/STAGED_BENCHMARK_PLAN.md)
+
 ## Overview
 
 We provide benchmarks across different dimensions: