16 changes: 16 additions & 0 deletions README.md
@@ -5,6 +5,22 @@ This repository contains reproducible benchmarks comparing **SochDB** against ot
**📊 [See Published Results](PUBLISHED_RESULTS.md)** - Comprehensive benchmark findings with real LLM integration
**🖥️ [See Server Benchmark Status](docs/SERVER_BENCHMARK_STATUS.md)** - Current hosted benchmark lane, SciFact quality takeaway, and staged scale plan

## Current Highlights

Latest published benchmark takeaways:

- **Quality**: Best current SciFact result uses `BAAI/bge-base-en-v1.5`
with `recall@5 = 0.8121`, `MRR = 0.7017`, and `nDCG@5 = 0.7258`
- **Scale**: The corrected `10GB` staged run reached about `506.63 QPS`
with about `1.97 ms` mean latency after a one-time `106.85 s` index load
- **Methodology**: The earlier `~110s/query` `10GB` result came from a faulty
benchmark harness path and does not reflect steady-state engine search performance

For the latest published benchmark state, see:

- [docs/SERVER_BENCHMARK_STATUS.md](docs/SERVER_BENCHMARK_STATUS.md)
- [docs/STAGED_BENCHMARK_PLAN.md](docs/STAGED_BENCHMARK_PLAN.md)

## Overview

We provide benchmarks across different dimensions:
272 changes: 50 additions & 222 deletions docs/SERVER_BENCHMARK_STATUS.md
@@ -1,261 +1,89 @@
# Server Benchmark Status

This document captures the current state of the heavy benchmark lane that runs on
the hosted SochDB server instead of on a laptop.
This document captures the current benchmark story for the hosted SochDB server.

## Why the server lane exists
## Current setup

Heavy benchmark work should happen on the benchmark server, not on a developer
laptop. That is especially true for:
- heavy benchmark work runs on the benchmark server, not on laptops
- hosted gRPC demo endpoint: `studio.agentslab.host:50053`
- current server class: about `12` CPU, about `62 GiB` RAM
- current storage is not appropriate for a final `1TB` claim yet

- retrieval-quality sweeps
- embedding bakeoffs
- staged large-dataset runs
- repeatable gRPC benchmark runs against the hosted demo endpoint

Current server target:

- host: private benchmark server
- SSH: stored out-of-band for operators only
- hosted gRPC endpoint: `studio.agentslab.host:50053`

## Current server constraints

The server is good enough for repeated CPU-oriented benchmark work, but it is not
the right machine for a final `1TB` claim yet.

- about `12` CPU
- about `62 GiB` RAM
- limited free root-disk capacity for honest `1TB` benchmarking
- weak GPU (`GeForce GT 710`), so embedding work should remain CPU-friendly

Because of that, the large-scale benchmark story should stay staged:
Because of that, large-scale benchmarking stays staged:

1. `10GB`
2. `100GB`
3. `250GB`
4. `1TB` only after moving to a larger disk or attached storage

## Current benchmark workspace on the server

- `<benchmark-workspace>/datasets`
- `<benchmark-workspace>/embeddings`
- `<benchmark-workspace>/results`
- `<benchmark-workspace>/logs`
- `<benchmark-workspace>/work`

These locations should be treated as the canonical landing zone for heavy benchmark
artifacts before we selectively publish summaries back into this repo.

Local snapshots now checked into this repo:
4. `1TB` only after storage expansion

- [`reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json`](../reports/runs/20260427T224143Z_scifact_baseline_pilot_metadata.json)
- [`reports/runs/20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
- [`reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json`](../reports/runs/20260427T225122Z_scifact_baseline_embedding_metadata.json)
- [`reports/runs/20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
- [`reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json`](../reports/runs/20260427T230412Z_scifact_bge_small_embedding_metadata.json)
- [`reports/runs/20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)
- [`reports/runs/20260429T_next_bge_base_embedding_metadata.json`](../reports/runs/20260429T_next_bge_base_embedding_metadata.json)
- [`reports/runs/20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
- [`reports/runs/20260429T_next_gte_small_st_embedding_metadata.json`](../reports/runs/20260429T_next_gte_small_st_embedding_metadata.json)
- [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
- [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
## Published quality result

## Established quality finding
Quality is measured separately from scale, using SciFact retrieval benchmarks.

The most important retrieval-quality result so far is that embedding choice moved
quality more than HNSW tuning on SciFact.
Current best published result:

Latest verified server runs:

- baseline sweep run: `20260427T225122Z`
- baseline pilot metadata run: `20260427T224143Z`
- `BAAI/bge-small-en-v1.5` sweep run: `20260427T230412Z`
- `thenlper/gte-small` sweep run: `20260429T_next_gte_small_st`
- `BAAI/bge-base-en-v1.5` sweep run: `20260429T_next_bge_base`

Baseline embedding metadata:

- backend: `sentence-transformers`
- model: `sentence-transformers/all-MiniLM-L6-v2`
- dataset: SciFact
- documents: `5183`
- queries: `300`
- dimension: `384`

BGE comparison embedding metadata:

- backend: `fastembed`
- model: `BAAI/bge-small-en-v1.5`
- dataset: SciFact
- documents: `5183`
- queries: `300`
- dimension: `384`

GTE-small embedding metadata:

- backend: `sentence-transformers`
- model: `thenlper/gte-small`
- dataset: SciFact
- documents: `5183`
- queries: `300`
- dimension: `384`

BGE-base embedding metadata:

- backend: `fastembed`
- model: `BAAI/bge-base-en-v1.5`
- dataset: SciFact
- documents: `5183`
- queries: `300`
- dimension: `768`
- recall@5: `0.8121`
- MRR: `0.7017`
- nDCG@5: `0.7258`
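
For readers unfamiliar with the three metrics, the following is a minimal sketch of how `recall@k`, `MRR`, and `nDCG@k` are commonly computed with binary relevance labels. It is not taken from the benchmark harness: the function names, the `run`/`qrels` dictionary shapes, and the choice of binary relevance are assumptions made for illustration, and the actual harness may differ in details such as tie handling or graded relevance.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary gains (an assumption; graded labels would change this)."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(run, qrels, k=5):
    """Average the per-query metrics.

    run:   {query_id: [doc_id, ...]} ranked best-first (hypothetical shape)
    qrels: {query_id: iterable of relevant doc_ids}   (hypothetical shape)
    """
    n = len(run)
    if n == 0:
        return {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0}
    totals = {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0}
    for query_id, ranked_ids in run.items():
        relevant_ids = qrels.get(query_id, [])
        totals["recall"] += recall_at_k(ranked_ids, relevant_ids, k)
        totals["mrr"] += reciprocal_rank(ranked_ids, relevant_ids)
        totals["ndcg"] += ndcg_at_k(ranked_ids, relevant_ids, k)
    return {name: value / n for name, value in totals.items()}
```

Because all three metrics depend only on the ranked document ids, they can be recomputed from stored search results without rerunning the engine, which keeps embedding comparisons on the same evaluation path.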

Summary of the current conclusion:
Main takeaway:

- baseline SciFact `recall@5` was about `0.7109`
- `thenlper/gte-small` reached about `0.7786` `recall@5`
- `BAAI/bge-base-en-v1.5` reached about `0.8121` `recall@5`
- `MRR` and `nDCG` improved as well
- `gte-small` stayed near baseline latency
- `bge-base-en-v1.5` improved quality further, but with noticeably higher latency
- HNSW parameter sweeps did not meaningfully change quality compared with the
embedding-model change
- embedding choice moved quality more than HNSW tuning in the current setup

### Exact SciFact comparison
Useful reference points:

| Embeddings | Run | recall@5 | MRR | nDCG@5 | p50 (ms) | p95 (ms) | mean (ms) |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| `all-MiniLM-L6-v2` + `fast` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.857` | `0.996` | `0.800` |
| `all-MiniLM-L6-v2` + `balanced` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.953` | `1.019` | `0.880` |
| `all-MiniLM-L6-v2` + `quality` HNSW | `20260427T225122Z` | `0.7109` | `0.5883` | `0.6135` | `0.900` | `1.001` | `0.813` |
| `BAAI/bge-small-en-v1.5` + `fast` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.920` | `1.041` | `0.840` |
| `BAAI/bge-small-en-v1.5` + `balanced` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.929` | `0.985` | `0.833` |
| `BAAI/bge-small-en-v1.5` + `quality` HNSW | `20260427T230412Z` | `0.7624` | `0.6603` | `0.6812` | `0.720` | `0.992` | `0.775` |
| `thenlper/gte-small` + `fast` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.955` | `1.028` | `0.878` |
| `thenlper/gte-small` + `balanced` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.976` | `1.049` | `0.888` |
| `thenlper/gte-small` + `quality` HNSW | `20260429T_next_gte_small_st` | `0.7786` | `0.6711` | `0.6944` | `0.968` | `1.056` | `0.901` |
| `BAAI/bge-base-en-v1.5` + `fast` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.787` | `3.287` | `1.951` |
| `BAAI/bge-base-en-v1.5` + `balanced` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `2.683` | `4.243` | `2.989` |
| `BAAI/bge-base-en-v1.5` + `quality` HNSW | `20260429T_next_bge_base` | `0.8121` | `0.7017` | `0.7258` | `1.823` | `2.946` | `2.189` |

### Best-to-best summary

Using the best observed latency profile from each embedding set:

| Comparison | recall@5 | MRR | nDCG@5 | mean latency |
| Embeddings | recall@5 | MRR | nDCG@5 | mean latency |
| :--- | ---: | ---: | ---: | ---: |
| baseline `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
| `all-MiniLM-L6-v2` | `0.7109` | `0.5883` | `0.6135` | `0.800 ms` |
| `BAAI/bge-small-en-v1.5` | `0.7624` | `0.6603` | `0.6812` | `0.775 ms` |
| `thenlper/gte-small` | `0.7786` | `0.6711` | `0.6944` | `0.878 ms` |
| `BAAI/bge-base-en-v1.5` | `0.8121` | `0.7017` | `0.7258` | `1.951 ms` |

Observed gains for `BAAI/bge-small-en-v1.5` over baseline:

- `recall@5`: `+0.0516` absolute, about `+7.3%` relative
- `MRR`: `+0.0719` absolute, about `+12.2%` relative
- `nDCG@5`: `+0.0677` absolute, about `+11.0%` relative

Observed gains for `thenlper/gte-small` over baseline:

- `recall@5`: `+0.0677` absolute, about `+9.5%` relative
- `MRR`: `+0.0827` absolute, about `+14.1%` relative
- `nDCG@5`: `+0.0809` absolute, about `+13.2%` relative

Observed gains for `BAAI/bge-base-en-v1.5` over baseline:

- `recall@5`: `+0.1012` absolute, about `+14.2%` relative
- `MRR`: `+0.1134` absolute, about `+19.3%` relative
- `nDCG@5`: `+0.1122` absolute, about `+18.3%` relative
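
The absolute and relative gains listed here can be reproduced from the best-to-best summary values. The sketch below does that arithmetic; the dictionary layout and the `gains` helper are illustrative names, not part of the repository. Because the table values are already rounded to four decimals, the last digit of a recomputed gain can differ slightly from the figures listed.

```python
# Best-to-best summary values copied from the comparison table.
BASELINE = {"recall@5": 0.7109, "MRR": 0.5883, "nDCG@5": 0.6135}
CANDIDATES = {
    "BAAI/bge-small-en-v1.5": {"recall@5": 0.7624, "MRR": 0.6603, "nDCG@5": 0.6812},
    "thenlper/gte-small": {"recall@5": 0.7786, "MRR": 0.6711, "nDCG@5": 0.6944},
    "BAAI/bge-base-en-v1.5": {"recall@5": 0.8121, "MRR": 0.7017, "nDCG@5": 0.7258},
}


def gains(candidate, baseline):
    """Return {metric: (absolute_gain, relative_gain_percent)}."""
    result = {}
    for metric, base_value in baseline.items():
        absolute = candidate[metric] - base_value
        relative_pct = 100.0 * absolute / base_value
        result[metric] = (absolute, relative_pct)
    return result


if __name__ == "__main__":
    for model, metrics in CANDIDATES.items():
        print(model)
        for metric, (absolute, relative_pct) in gains(metrics, BASELINE).items():
            print(f"  {metric}: {absolute:+.4f} absolute, {relative_pct:+.1f}% relative")
```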

Interpretation:

- the next strong retrieval lever is embedding selection
- `BAAI/bge-base-en-v1.5` is the current quality leader on SciFact
- `thenlper/gte-small` is a useful middle point when we want a lighter latency hit
- HNSW sweeps are still useful for latency/recall tradeoff mapping
- we should not oversell ANN tuning as the main quality breakthrough
- dimensionality matters in this comparison set, so `384`-dim and `768`-dim wins
should not be treated as identical cost classes
Published artifacts:

## Recommended benchmark order from here
- [`20260427T225122Z_scifact_baseline_summary.json`](../reports/runs/20260427T225122Z_scifact_baseline_summary.json)
- [`20260427T230412Z_scifact_bge_small_summary.json`](../reports/runs/20260427T230412Z_scifact_bge_small_summary.json)
- [`20260429T_next_gte_small_st_summary.json`](../reports/runs/20260429T_next_gte_small_st_summary.json)
- [`20260429T_next_bge_base_summary.json`](../reports/runs/20260429T_next_bge_base_summary.json)

For retrieval work, keep the methodology disciplined:
## Published `10GB` scale result

1. fix the dataset
2. fix the embedding model
3. sweep HNSW settings
4. compare `recall@k`, `MRR`, `nDCG`, and latency
5. compare embedding models on the same benchmark path
Dataset:

For large-scale system work:

1. complete clean `10GB` results
2. publish `100GB` results
3. run `250GB` only after confirming disk headroom
4. defer `1TB` until storage is expanded

## Current staged run

The first staged large-scale run is now complete:

- run id: `20260503T_stage10gb_d768`
- dataset: `synthetic_10gib_768d`
- target size: `10 GiB`
- run family: `synthetic_10gib_768d`
- vectors: `3,495,253`
- dimension: `768`
- query count: `250`
- runner: `scripts/run_sochdb_stage_vector.sh`
- workload: `benchmarks/run_bulk_vector_workload.py`
- local summary: [`reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
- local metadata: [`reports/runs/20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
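
The vector count is consistent with filling a `10 GiB` budget with raw `768`-dimensional vectors, as the short check below shows. It assumes 4-byte `float32` elements; the source does not state the element type, so treat that as an assumption rather than a confirmed detail of the dataset generator. The helper name is illustrative.

```python
GIB = 1024 ** 3  # bytes in one GiB


def vectors_for_target(target_bytes, dim, bytes_per_value=4):
    """Number of whole vectors that fit in target_bytes.

    bytes_per_value=4 assumes float32 storage (an assumption, not from the source).
    """
    return target_bytes // (dim * bytes_per_value)


count = vectors_for_target(10 * GIB, dim=768)
print(count)  # 3495253, matching the reported vector count
```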

Important implementation note:

- the staged lane now uses the compiled `sochdb-bulk` binary on the server for
build and query operations
- this replaced the earlier Python-only workload path, which failed on the
hosted machine because the stale `VectorIndex` import path was not available
Build result:

### `10GB` staged result
- build throughput: about `891.6 vec/s`
- build time: about `3920.14 s`
- output index size: about `10,069.1 MB`

What succeeded:
Corrected steady-state search result:

- index build completed successfully for `3,495,253` vectors
- build time was about `3920.14 s` (`~65.3 min`)
- observed build throughput was about `891.6 vec/s`
- output index size was about `10,069.1 MB`
- one-time index load: about `106.85 s`
- sequential search: about `506.63 QPS`
- sequential mean latency: about `1.97 ms`
- sequential `p50`: about `1.87 ms`
- sequential `p95`: about `2.40 ms`
- batch search: about `356 QPS`
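
The build and sequential-search figures can be cross-checked against each other with simple arithmetic. The sketch below uses only numbers reported in this document; the helper names are illustrative. The last helper shows how a one-time load cost amortizes over a query batch, which is why load time and steady-state latency are reported separately.

```python
VECTORS = 3_495_253
BUILD_SECONDS = 3920.14
SEQUENTIAL_QPS = 506.63
SEQUENTIAL_MEAN_MS = 1.97
INDEX_LOAD_SECONDS = 106.85


def throughput(count, seconds):
    """Items processed per second."""
    return count / seconds


def mean_latency_ms_from_qps(qps):
    """For strictly sequential queries, mean latency is the inverse of QPS."""
    return 1000.0 / qps


def amortized_latency_ms(load_seconds, mean_ms, n_queries):
    """Per-query cost when a one-time load is spread over n_queries."""
    return load_seconds * 1000.0 / n_queries + mean_ms


if __name__ == "__main__":
    print(f"build throughput: {throughput(VECTORS, BUILD_SECONDS):.1f} vec/s")
    print(f"build time: {BUILD_SECONDS / 60:.1f} min")
    print(f"mean latency implied by QPS: {mean_latency_ms_from_qps(SEQUENTIAL_QPS):.2f} ms")
    for n in (250, 10_000, 1_000_000):
        cost = amortized_latency_ms(INDEX_LOAD_SECONDS, SEQUENTIAL_MEAN_MS, n)
        print(f"amortized cost over {n} queries: {cost:.2f} ms/query")
```

A measurement that folds the load time into a short query batch reports a much higher per-query figure than the steady-state number, so the two should stay separate in published summaries.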

What failed the performance bar:
Important note:

- `250` queries took about `27,455.91 s`
- search throughput was only about `0.0091 QPS`
- `p50` latency was about `109,814 ms`
- `p95` latency was about `110,108 ms`
- `mean` latency was about `109,822 ms`
- the earlier `0.0091 QPS` / `~110s per query` result came from a faulty benchmark
harness path and does not reflect the real steady-state engine result

Interpretation:
Published artifacts:

- the staged runner itself is now working end to end
- the bottleneck has moved from benchmark plumbing to SochDB query-path behavior
- we should not scale this lane to `100GB` yet
- the next benchmark task is diagnosing why the current search path is roughly
`~110 s` per query at `10GB`
- [`20260503T_stage10gb_d768_sochdb_vector_summary.json`](../reports/runs/20260503T_stage10gb_d768_sochdb_vector_summary.json)
- [`20260503T_stage10gb_d768_metadata.json`](../reports/runs/20260503T_stage10gb_d768_metadata.json)
- [`20260512T_10gb_optimized_native_summary.json`](../reports/runs/20260512T_10gb_optimized_native_summary.json)
- [`20260512T_10gb_optimized_native_metadata.json`](../reports/runs/20260512T_10gb_optimized_native_metadata.json)

## Scripts that define the server lane
## What is pending

- `scripts/run_sochdb_grpc_quality_sweep.sh`
- `scripts/run_sochdb_embedding_bakeoff.sh`
- `scripts/run_sochdb_stage_vector.sh`

Related planning docs:

- [`STAGED_BENCHMARK_PLAN.md`](./STAGED_BENCHMARK_PLAN.md)
- [`RETRIEVAL_AND_VECTOR_PLAN.md`](./RETRIEVAL_AND_VECTOR_PLAN.md)

## What is still pending

- investigate the `10GB` search-latency failure before running `100GB`
- complete the staged `10GB` -> `100GB` -> `250GB` scale path after that
- publish `100GB` results using the corrected native steady-state methodology
- publish `250GB` results after confirming disk headroom
- defer `1TB` claims until storage is expanded
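
One way to make the "confirm disk headroom" step mechanical is a small pre-flight check before launching a staged run. This is a sketch only: the `has_headroom` helper, the `1.5` safety factor, and the example path are hypothetical and not part of the existing scripts. It relies on `shutil.disk_usage` from the Python standard library.

```python
import shutil

GB = 1000 ** 3  # decimal gigabyte, used here only for the example call


def has_headroom(path, required_bytes, safety_factor=1.5):
    """Return True if the filesystem holding `path` has enough free space.

    safety_factor is a hypothetical margin for temporary files and the index
    output; tune it to the observed on-disk footprint of earlier stages.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes * safety_factor


if __name__ == "__main__":
    # Example: check the current directory before attempting a 250GB stage.
    if has_headroom(".", 250 * GB):
        print("enough free space for the 250GB stage")
    else:
        print("not enough free space; expand storage first")
```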

This file should be the first place to update whenever new server benchmark work
changes the current benchmark story.