sochdb · SaiSandeepKantareddy · May 12, 2026
diff --git a/README.md b/README.md
@@ -5,6 +5,22 @@ This repository contains reproducible benchmarks comparing **SochDB** against ot
 **📊 [See Published Results](PUBLISHED_RESULTS.md)** - Comprehensive benchmark findings with real LLM integration
 **🖥️ [See Server Benchmark Status](docs/SERVER_BENCHMARK_STATUS.md)** - Current hosted benchmark lane, SciFact quality takeaway, and staged scale plan
 
+## Current Highlights
+
+Latest published benchmark takeaways:
+
+- **Quality**: Best current SciFact result uses `BAAI/bge-base-en-v1.5`
+  with `recall@5 = 0.8121`, `MRR = 0.7017`, and `nDCG@5 = 0.7258`
+- **Scale**: The corrected `10GB` staged run reached about `506.63 QPS` in
+  sequential search, with about `1.97 ms` mean latency after a one-time `106.85 s` index load
+- **Methodology**: The earlier `~110s/query` `10GB` result was an artifact of a
+  faulty benchmark harness path, not the engine's steady-state search latency
+
+For the latest published benchmark state, see:
+
+- [docs/SERVER_BENCHMARK_STATUS.md](docs/SERVER_BENCHMARK_STATUS.md)
+- [docs/STAGED_BENCHMARK_PLAN.md](docs/STAGED_BENCHMARK_PLAN.md)
+
 ## Overview
 
 We provide benchmarks across different dimensions:

diff --git a/docs/SERVER_BENCHMARK_STATUS.md b/docs/SERVER_BENCHMARK_STATUS.md
@@ -1,261 +1,89 @@
 # Server Benchmark Status
 
-This document captures the current state of the heavy benchmark lane that runs on
-the hosted SochDB server instead of on a laptop.
+This document captures the current benchmark story for the hosted SochDB server.
 
-## Why the server lane exists
+## Current setup
 
-Heavy benchmark work should happen on the benchmark server, not on a developer
-laptop. That is especially true for:
+- heavy benchmark work runs on the benchmark server, not on laptops
+- hosted gRPC demo endpoint: `studio.agentslab.host:50053`
+- current server class: about `12` CPU, about `62 GiB` RAM
+- current free disk capacity is too limited to support a final `1TB` claim
 
-- retrieval-quality sweeps
-- embedding bakeoffs
-- staged large-dataset runs
-- repeatable gRPC benchmark runs against the hosted demo endpoint
-
-Current server target:
-
-- host: private benchmark server
-- SSH: stored out-of-band for operators only
-- hosted gRPC endpoint: `studio.agentslab.host:50053`
-
-## Current server constraints
-
-The server is good enough for repeated CPU-oriented benchmark work, but it is not
-the right machine for a final `1TB` claim yet.
-
-- about `12` CPU
-- about `62 GiB` RAM
-- limited free root-disk capacity for honest `1TB` benchmarking
-- weak GPU (`GeForce GT 710`), so embedding work should remain CPU-friendly
-
-Because of that, the large-scale benchmark story should stay staged:
+Because of that, large-scale benchmarking stays staged:
 
 1. `10GB`
 2. `100GB`
 3. `250GB`
-4. `1TB` only after moving to a larger disk or attached storage
-
-## Current benchmark workspace on the server
-
-- `<benchmark-workspace>/datasets`
-- `<benchmark-workspace>/embeddings`
-- `<benchmark-workspace>/results`
-- `<benchmark-workspace>/logs`
-- `<benchmark-workspace>/work`
-
-These locations should be treated as the canonical landing zone for heavy benchmark
-artifacts before we selectively publish summaries back into this repo.
-
-Local snapshots now checked into this repo:
+4. `1TB` only after storage expansion
 
-- [`reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json`](../reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json)
-- [`reports/runs/20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
-- [`reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json`](../reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json)
-- [`reports/runs/20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
-- [`reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json`](../reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json)
-- [`reports/runs/20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
-- [`reports/runs/20260429T_next_bge_base_embedding_metadata.json`](../reports/runs/20260429T_next_bge_base_embedding_metadata.json)
-- [`reports/runs/20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
-- [`reports/runs/20260429T_next_gte_small_st_embedding_metadata.json`](../reports/runs/20260429T_next_gte_small_st_embedding_metadata.json)
-- [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
-- [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
+## Published quality result
 
-## Established quality finding
+Quality is measured separately from scale, using SciFact retrieval benchmarks.
 
-The most important retrieval-quality result so far is that embedding choice moved
-quality more than HNSW tuning on SciFact.
+Current best published result:
 
-Latest verified server runs:
-
-- baseline sweep run: `20260427T225122Z`
-- baseline pilot metadata run: `20260427T224143Z`
-- `BAAI/bge-small-en-v1.5` sweep run: `20260427T230412Z`
-- `thenlper/gte-small` sweep run: `20260429T_next_gte_small_st`
-- `BAAI/bge-base-en-v1.5` sweep run: `20260429T_next_bge_base`
-
-Baseline embedding metadata:
-
-- backend: `sentence-transformers`
-- model: `sentence-transformers/all-MiniLM-L6-v2`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-BGE comparison embedding metadata:
-
-- backend: `fastembed`
-- model: `BAAI/bge-small-en-v1.5`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-GTE-small embedding metadata:
-
-- backend: `sentence-transformers`
-- model: `thenlper/gte-small`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `384`
-
-BGE-base embedding metadata:
-
-- backend: `fastembed`
 - model: `BAAI/bge-base-en-v1.5`
-- dataset: SciFact
-- documents: `5183`
-- queries: `300`
-- dimension: `768`
+- recall@5: `0.8121`
+- MRR: `0.7017`
+- nDCG@5: `0.7258`
 
-Summary of the current conclusion:
+Main takeaway:
 
-- baseline SciFact `recall@5` was about `0.7109`
-- `thenlper/gte-small` reached about `0.7786` `recall@5`
-- `BAAI/bge-base-en-v1.5` reached about `0.8121` `recall@5`
-- `MRR` and `nDCG` improved as well
-- `gte-small` stayed near baseline latency
-- `bge-base-en-v1.5` improved quality further, but with noticeably higher latency
-- HNSW parameter sweeps did not meaningfully change quality compared with the
-  embedding-model change
+- embedding choice moved quality more than HNSW tuning: within each embedding model, the `fast`, `balanced`, and `quality` HNSW profiles produced identical quality metrics
 
-### Exact SciFact comparison
+Useful reference points:
 
-| Embeddings | Run | recall@5 | MRR | nDCG@5 | p50 (ms) | p95 (ms) | mean (ms) |
-| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
-| `all-MiniLM-L6-v2` + `fast` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.857` | `0.996` | `0.800` |
-| `all-MiniLM-L6-v2` + `balanced` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.953` | `1.019` | `0.880` |
-| `all-MiniLM-L6-v2` + `quality` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.900` | `1.001` | `0.813` |
-| `BAAI/bge-small-en-v1.5` + `fast` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.920` | `1.041` | `0.840` |
-| `BAAI/bge-small-en-v1.5` + `balanced` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.929` | `0.985` | `0.833` |
-| `BAAI/bge-small-en-v1.5` + `quality` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.720` | `0.992` | `0.775` |
-| `thenlper/gte-small` + `fast` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.955` | `1.028` | `0.878` |
-| `thenlper/gte-small` + `balanced` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.976` | `1.049` | `0.888` |
-| `thenlper/gte-small` + `quality` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.968` | `1.056` | `0.901` |
-| `BAAI/bge-base-en-v1.5` + `fast` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.787` | `3.287` | `1.951` |
-| `BAAI/bge-base-en-v1.5` + `balanced` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `2.683` | `4.243` | `2.989` |
-| `BAAI/bge-base-en-v1.5` + `quality` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.823` | `2.946` | `2.189` |
-
-### Best-to-best summary
-
-Using the best observed latency profile from each embedding set:
-
-| Comparison | recall@5 | MRR | nDCG@5 | mean latency |
+| Embeddings | recall@5 | MRR | nDCG@5 | mean latency |
 | :--- | ---: | ---: | ---: | ---: |
-| baseline `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
+| `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
 | `BAAI/bge-small-en-v1.5` | `0.7624` | `0.6603` | `0.6812` | `0.775 ms` |
 | `thenlper/gte-small` | `0.7786` | `0.6711` | `0.6944` | `0.878 ms` |
 | `BAAI/bge-base-en-v1.5` | `0.8121` | `0.7017` | `0.7258` | `1.951 ms` |
 
-Observed gains from the embedding change:
-
-- `recall@5`: `+0.0516` absolute, about `+7.3%` relative
-- `MRR`: `+0.0719` absolute, about `+12.2%` relative
-- `nDCG@5`: `+0.0677` absolute, about `+11.0%` relative
-
-Observed gains for `thenlper/gte-small` over baseline:
-
-- `recall@5`: `+0.0677` absolute, about `+9.5%` relative
-- `MRR`: `+0.0827` absolute, about `+14.1%` relative
-- `nDCG@5`: `+0.0809` absolute, about `+13.2%` relative
-
-Observed gains for `BAAI/bge-base-en-v1.5` over baseline:
-
-- `recall@5`: `+0.1012` absolute, about `+14.2%` relative
-- `MRR`: `+0.1134` absolute, about `+19.3%` relative
-- `nDCG@5`: `+0.1122` absolute, about `+18.3%` relative
-
-Interpretation:
-
-- the next strong retrieval lever is embedding selection
-- `BAAI/bge-base-en-v1.5` is the current quality leader on SciFact
-- `thenlper/gte-small` is a useful middle point when we want a lighter latency hit
-- HNSW sweeps are still useful for latency/recall tradeoff mapping
-- we should not oversell ANN tuning as the main quality breakthrough
-- dimensionality matters in this comparison set, so `384`-dim and `768`-dim wins
-  should not be treated as identical cost classes
+Published artifacts:
 
-## Recommended benchmark order from here
+- [`20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
+- [`20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
+- [`20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
+- [`20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
 
-For retrieval work, keep the methodology disciplined:
+## Published `10GB` scale result
 
-1. fix the dataset
-2. fix the embedding model
-3. sweep HNSW settings
-4. compare `recall@k`, `MRR`, `nDCG`, and latency
-5. compare embedding models on the same benchmark path
+Dataset:
 
-For large-scale system work:
-
-1. complete clean `10GB` results
-2. publish `100GB` results
-3. run `250GB` only after confirming disk headroom
-4. defer `1TB` until storage is expanded
-
-## Current staged run
-
-The first staged large-scale run is now complete:
-
-- run id: `20260503T_stage10gb_d768`
-- dataset: `synthetic_10gib_768d`
-- target size: `10 GiB`
+- run family: `synthetic_10gib_768d`
+- vectors: `3,495,253`
 - dimension: `768`
-- query count: `250`
-- runner: `scripts/run_sochdb_stage_vector.sh`
-- workload: `benchmarks/run_bulk_vector_workload.py`
-- local summary: [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
-- local metadata: [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
-
-Important implementation note:
 
-- the staged lane now uses the compiled `sochdb-bulk` binary on the server for
-  build and query operations
-- this replaced the earlier Python-only workload path, which failed on the
-  hosted machine because the stale `VectorIndex` import path was not available
+Build result:
 
-### `10GB` staged result
+- build throughput: about `891.6 vec/s`
+- build time: about `3920.14 s`
+- output index size: about `10,069.1 MB`
 
-What succeeded:
+Corrected steady-state search result:
 
-- index build completed successfully for `3,495,253` vectors
-- build time was about `3920.14 s` (`~65.3 min`)
-- observed build throughput was about `891.6 vec/s`
-- output index size was about `10,069.1 MB`
+- one-time index load: about `106.85 s`
+- sequential search: about `506.63 QPS`
+- sequential mean latency: about `1.97 ms`
+- sequential `p50`: about `1.87 ms`
+- sequential `p95`: about `2.40 ms`
+- batch search: about `356 QPS`
 
-What failed the performance bar:
+Important note:
 
-- `250` queries took about `27,455.91 s`
-- search throughput was only about `0.0091 QPS`
-- `p50` latency was about `109,814 ms`
-- `p95` latency was about `110,108 ms`
-- `mean` latency was about `109,822 ms`
+- the earlier `0.0091 QPS` / `~110s per query` result was an artifact of a faulty
+  benchmark harness path and does not reflect steady-state engine performance
 
-Interpretation:
+Published artifacts:
 
-- the staged runner itself is now working end to end
-- the bottleneck has moved from benchmark plumbing to SochDB query-path behavior
-- we should not scale this lane to `100GB` yet
-- the next benchmark task is diagnosing why the current search path is roughly
-  `~110 s` per query at `10GB`
+- [`20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
+- [`20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
+- [`20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
+- [`20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)
 
-## Scripts that define the server lane
+## What is pending
 
-- `scripts/run_sochdb_grpc_quality_sweep.sh`
-- `scripts/run_sochdb_embedding_bakeoff.sh`
-- `scripts/run_sochdb_stage_vector.sh`
-
-Related planning docs:
-
-- [`STAGED_BENCHMARK_PLAN.md`](./STAGED_BENCHMARK_PLAN.md)
-- [`RETRIEVAL_AND_VECTOR_PLAN.md`](./RETRIEVAL_AND_VECTOR_PLAN.md)
-
-## What is still pending
-
-- investigate the `10GB` search-latency failure before running `100GB`
-- complete the staged `10GB` -> `100GB` -> `250GB` scale path after that
+- publish `100GB` results using the corrected native steady-state methodology
+- publish `250GB` results after confirming disk headroom
 - defer `1TB` claims until storage is expanded
-
-This file should be the first place to update whenever new server benchmark work
-changes the current benchmark story.
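
The derived figures in this diff can be cross-checked against the raw numbers it reports. The Python sketch below is one way to do that and is not part of the repository: the function names are hypothetical, the vector-count check assumes 4-byte (`float32`) components, which the diff does not state, and the QPS check assumes sequential throughput is roughly the inverse of mean latency.

```python
"""Cross-check derived benchmark figures against the raw values in the diff."""


def throughput(count: int, seconds: float) -> float:
    """Items processed per second (vectors built or queries answered)."""
    return count / seconds


def qps_from_mean_latency_ms(mean_ms: float) -> float:
    """QPS implied by a mean latency when queries run one at a time."""
    return 1000.0 / mean_ms


def vectors_for_target(target_bytes: int, dim: int, bytes_per_value: int = 4) -> int:
    """Whole vectors that fit in target_bytes.

    bytes_per_value=4 assumes float32 storage; this is an assumption,
    not something the benchmark docs confirm.
    """
    return target_bytes // (dim * bytes_per_value)


if __name__ == "__main__":
    # Build: 3,495,253 vectors in about 3920.14 s (reported: about 891.6 vec/s).
    print(f"build vec/s: {throughput(3_495_253, 3920.14):.1f}")

    # Superseded harness run: 250 queries in about 27,455.91 s (reported: about 0.0091 QPS).
    print(f"old QPS: {throughput(250, 27455.91):.4f}")

    # Corrected run: about 1.97 ms mean latency (reported: about 506.63 QPS).
    print(f"implied sequential QPS: {qps_from_mean_latency_ms(1.97):.1f}")

    # Dataset sizing: 10 GiB of 768-dimensional vectors under the float32 assumption.
    print(f"vectors in 10 GiB: {vectors_for_target(10 * 2**30, 768):,}")
```

The implied sequential QPS only needs to agree with the reported value to within rounding, because the mean latency is itself reported to two decimal places.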