AetherV

A version-aware, incrementally maintained vector database that treats knowledge as a continuously evolving graph rather than a static collection of embeddings.

Current Implementation (`aetherv/`)

The aetherv Python package is the first working slice of the engine: a segmented, on-disk vector database with GPU-accelerated similarity search. It is installable from the repo root via pyproject.toml (Hatchling build, Python ≥ 3.11).

Dependencies

Library	Role
fastembed	Default text embedder (`TextEmbedding`); vectors are L2-normalized after encoding
jax	JIT-compiled dot-product search on GPU (or CPU fallback) per segment
numpy	Array interchange between embedder, storage, and search
polars	Columnar metadata store (`id`, `text`, `segment`, `row`) backed by Parquet
pyarrow	Arrow IPC read/write for fixed-size embedding vectors on disk

Dev extras: pytest.

Package layout

aetherv/
├── __init__.py          # Public API: Config, VectorDB, SearchResult, SegmentRecord
├── config.py            # Runtime paths and segment sizing
├── db.py                # VectorDB — insert, query, segment lifecycle
├── embedder.py          # Embedder protocol + FastEmbedder
├── segments.py          # Segment file path resolution
├── types.py             # SearchResult, SegmentRecord dataclasses
├── search/
│   └── gpu.py           # SegmentSearcher — JAX @jit score + top-k
└── storage/
    ├── arrow.py         # ArrowSegment — IPC write/read for embedding matrices
    ├── manifest.py      # JSON manifest of segment records
    └── metadata.py      # Polars Parquet store with O(1) (segment, row) lookup

On-disk layout

Opening a VectorDB at root (default vectordb/) produces:

vectordb/
├── metadata.parquet     # id, text, segment, row — Polars
├── manifest.json        # segment index (id, name, vector_count, created_at)
└── segments/
    ├── segment_000000.arrow
    ├── segment_000001.arrow
    └── ...

Each .arrow file is an Arrow IPC stream of FixedSizeList<float32> embeddings for one segment (default up to 10,000 vectors per segment, configurable via Config.segment_size).

How it works

Insert — VectorDB.insert(ids, texts) embeds all texts through the configured embedder (default FastEmbedder), batches vectors into segments, writes each batch as an Arrow IPC file, appends rows to the Polars metadata table, and registers a SegmentSearcher for the new segment.
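The batching step can be sketched roughly as below. This is a minimal illustration, not the package's code: `plan_segments` is a hypothetical helper, and it assumes each insert starts a fresh segment and that files follow the zero-padded `segment_NNNNNN.arrow` naming shown in the on-disk layout.

```python
from typing import Iterator


def plan_segments(
    n_vectors: int, segment_size: int, next_segment_id: int = 0
) -> Iterator[tuple[str, range]]:
    """Yield (file name, row range) for each segment a batch insert would create."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    for offset in range(0, n_vectors, segment_size):
        segment_id = next_segment_id + offset // segment_size
        rows = range(offset, min(offset + segment_size, n_vectors))
        # Assumed naming scheme: six-digit, zero-padded segment ID.
        yield f"segment_{segment_id:06d}.arrow", rows


# 25,000 vectors with the default segment size give two full segments and one partial.
plan = list(plan_segments(n_vectors=25_000, segment_size=10_000))
```

Whether the real implementation tops up a partially filled last segment on the next insert is not stated here; the sketch simply does not.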

Query — VectorDB.query(text, k) embeds the query, runs parallel top-k search across every loaded segment (ThreadPoolExecutor), merges candidates by score, and resolves (segment_id, row) pairs back to (id, text) via the metadata lookup.
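The fan-out and merge described above might look like the following sketch. The score table and function names are hypothetical stand-ins; the real searcher computes scores on the JAX device rather than reading them from a dict.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-segment scores: segment_id -> similarity score per row.
SEGMENT_SCORES = {
    0: [0.10, 0.90, 0.30],
    1: [0.80, 0.20],
    2: [0.95, 0.05, 0.40],
}


def top_k_in_segment(segment_id: int, k: int) -> list[tuple[float, int, int]]:
    """Return the k best (score, segment_id, row) triples for one segment."""
    scores = SEGMENT_SCORES[segment_id]
    candidates = ((score, segment_id, row) for row, score in enumerate(scores))
    return heapq.nlargest(k, candidates)


def query_all_segments(k: int) -> list[tuple[float, int, int]]:
    """Search every segment in parallel, then keep the global top k."""
    with ThreadPoolExecutor() as pool:
        per_segment = pool.map(lambda sid: top_k_in_segment(sid, k), SEGMENT_SCORES)
        merged = [hit for hits in per_segment for hit in hits]
    return heapq.nlargest(k, merged)


hits = query_all_segments(k=3)
```

Each `(segment_id, row)` pair in `hits` is what the metadata lookup would then resolve to `(id, text)`. Taking k candidates from every segment is sufficient because the global top k can never contain more than k rows from a single segment.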

Search kernel — Because embeddings are normalized, a dot product equals cosine similarity. SegmentSearcher uploads a segment matrix to the JAX device once, then uses a @jax.jit matvec for scoring and numpy.argpartition for top-k within each segment.
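The equivalence relied on here follows from the definition of cosine similarity. The symbols are introduced for illustration only: $q$ is the query embedding, $x$ a stored embedding, and $X$ a segment matrix with $n$ unit-norm rows of dimension $d$.

```latex
\cos(q, x) = \frac{q \cdot x}{\lVert q \rVert_2 \, \lVert x \rVert_2} = q \cdot x
\quad \text{when } \lVert q \rVert_2 = \lVert x \rVert_2 = 1

s = X q, \qquad X \in \mathbb{R}^{n \times d}, \quad s_i = \cos(q, x_i)
```

A single matrix-vector product therefore yields all $n$ cosine similarities for a segment, which is the operation the jitted kernel performs.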

Pluggability — Embedder is a Protocol; tests inject a deterministic hash-based embedder to avoid model downloads. Config controls root path, segment size, and filenames.
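A deterministic test embedder of the kind mentioned above could be sketched like this. The method name `embed` and its signature are assumptions, since the actual `Embedder` protocol is not shown here, and `HashEmbedder` is a hypothetical name.

```python
import hashlib
import math
from typing import Protocol


class Embedder(Protocol):
    """Assumed protocol shape: a batch of texts in, one vector per text out."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Deterministic stand-in: no model download, same text always gives the same vector."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Centre the first `dim` bytes on zero; the .5 offset rules out a zero vector.
            raw = [digest[i] - 127.5 for i in range(self.dim)]
            norm = math.sqrt(sum(x * x for x in raw))
            vectors.append([x / norm for x in raw])  # L2-normalize, like the default embedder
        return vectors


embedder: Embedder = HashEmbedder(dim=8)
```

Because `Embedder` is a structural protocol, `HashEmbedder` satisfies it without inheriting from it. The vectors carry no semantic meaning, which is fine for testing storage and lookup paths.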

Install and usage

pip install -e ".[dev]"   # from repo root
pytest

from aetherv import Config, VectorDB

db = VectorDB("vectordb", config=Config(segment_size=10_000))
db.insert(
    ids=[1, 2, 3],
    texts=["JAX accelerates search", "Polars stores metadata", "Arrow holds vectors"],
)

for hit in db.query("dataframe library", k=2):
    print(hit.score, hit.id, hit.text)

Tests

tests/test_vectordb.py covers insert/query, O(1) metadata lookup, legacy manifest loading, and empty inserts using a DeterministicEmbedder (no network or model download).

Project AetherV Evolution Engine (AEE)

Vision

Build the first retrieval-aware, version-aware, incrementally maintained vector database capable of operating on continuously changing knowledge without full reindexing.

Current vector databases optimize:

Similarity search
ANN indexing
Storage efficiency

AetherV Evolution Engine optimizes:

Knowledge freshness
Incremental updates
Retrieval correctness under change
Autonomous index maintenance

Core Thesis

The future bottleneck of RAG is not retrieval speed.

The bottleneck is:

"How can a retrieval system stay correct while its knowledge continuously changes?"

Current systems:

Document changes → Rechunk → Re-embed → Reindex

AetherV:

Document changes → Semantic diff → Impact prediction → Localized updates → Retrieval remains correct

Product Definition

Category:

Version-Aware Dynamic Retrieval Engine

Tagline:

Git for Knowledge + Vector Database

Primary Users:

Enterprise RAG
Documentation systems
Agentic systems
Knowledge management platforms
Real-time data platforms

Key Differentiators

1. Semantic Change Engine

Determine what actually changed.

Input:

Document v1, Document v2

Output:

Added concepts
Modified concepts
Deleted concepts
Dependency impact

Goal:

Avoid unnecessary embedding generation.

2. Retrieval Impact Predictor

Novel research component.

Question:

Will this change affect retrieval?

Example:

"128GB RAM" → "129GB RAM"

Embedding changes.

Retrieval behavior likely does not.

Decision:

Skip expensive update.

Expected savings:

70-95% embedding reduction.

3. Version-Aware Retrieval

Every chunk becomes temporal.

Chunk schema:

{ chunk_id, version_id, valid_from, valid_to, parent_version }

Supports:

Historical retrieval
Change tracking
Temporal QA
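Historical retrieval over this schema amounts to an "as of" filter on the validity interval. The sketch below assumes integer timestamps, a half-open interval `[valid_from, valid_to)`, and `None` for a still-current version; none of those choices are fixed by the schema above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkVersion:
    chunk_id: str
    version_id: int
    valid_from: int                 # when this version became current
    valid_to: Optional[int]         # None while the version is still current
    parent_version: Optional[int]   # version this one was derived from, if any


def as_of(versions: list[ChunkVersion], timestamp: int) -> list[ChunkVersion]:
    """Return the versions that were valid at `timestamp`."""
    return [
        v
        for v in versions
        if v.valid_from <= timestamp and (v.valid_to is None or timestamp < v.valid_to)
    ]


history = [
    ChunkVersion("c1", 1, valid_from=100, valid_to=200, parent_version=None),
    ChunkVersion("c1", 2, valid_from=200, valid_to=None, parent_version=1),
    ChunkVersion("c2", 1, valid_from=150, valid_to=None, parent_version=None),
]

snapshot = [(v.chunk_id, v.version_id) for v in as_of(history, 160)]
```

Restricting a similarity search to the chunk versions returned by `as_of` is one way to answer temporal questions without keeping a separate index per version.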

4. Dependency Graph

Knowledge becomes a graph.

Chunk → Summary → RAPTOR node → KG entity → Agent memory

If node changes:

Automatically identify stale descendants.
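Identifying stale descendants is a graph reachability problem. A minimal sketch, assuming the dependency graph is available as an adjacency map from each node to the nodes derived from it (the node names are invented for illustration):

```python
from collections import deque

# Hypothetical dependency edges: node -> nodes derived from it.
DERIVED_FROM = {
    "chunk:42": ["summary:7"],
    "summary:7": ["raptor:3", "kg:server-specs"],
    "raptor:3": ["agent-memory:1"],
    "chunk:43": ["summary:8"],
}


def stale_descendants(graph: dict[str, list[str]], changed: str) -> set[str]:
    """Breadth-first walk collecting every node reachable from `changed`."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # guards against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return seen


stale = stale_descendants(DERIVED_FROM, "chunk:42")
```

Only the subtree under the changed chunk is flagged; `summary:8` is untouched because nothing it depends on changed.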

5. LSM Vector Index

Inspired by LSM-VEC.

Structure:

L0 = recent updates

L1 = warm data

L2 = stable data

L3 = archive

Advantages:

Fast inserts
Fast deletes
No global rebuilds
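The leveled layout can be illustrated with a toy key-value version. This is a sketch of the general LSM idea only: `LeveledStore`, its capacities, and the use of `None` as a tombstone are assumptions, and a real vector index would hold a searchable structure per level rather than a dict.

```python
TOMBSTONE = None  # marker meaning "deleted"; shadows older copies in deeper levels


class LeveledStore:
    """Toy LSM layout: writes land in L0 and are merged downward when a level fills."""

    def __init__(self, capacities: tuple[int, ...] = (2, 4, 8)) -> None:
        self.capacities = capacities
        # One dict per bounded level, plus an unbounded final (archive) level.
        self.levels: list[dict] = [{} for _ in range(len(capacities) + 1)]

    def put(self, key, vector) -> None:
        self.levels[0][key] = vector
        self._compact()

    def delete(self, key) -> None:
        self.levels[0][key] = TOMBSTONE
        self._compact()

    def get(self, key):
        for level in self.levels:  # newest level first
            if key in level:
                return level[key]  # TOMBSTONE if the key was deleted
        return None

    def _compact(self) -> None:
        for i, capacity in enumerate(self.capacities):
            if len(self.levels[i]) > capacity:
                # Newer entries overwrite older ones in the next level down.
                self.levels[i + 1].update(self.levels[i])
                self.levels[i].clear()


store = LeveledStore()
for key, vec in [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.5, 0.5])]:
    store.put(key, vec)  # the third put overflows L0 and triggers a merge into L1
store.delete("a")        # the tombstone lands in L0 and shadows the copy in L1
```

Inserts and deletes touch only L0, and each compaction rewrites a single level, which is where the "no global rebuilds" property comes from.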

Research Goals

Goal 1

Reduce embedding regeneration by 90%.

Goal 2

Reduce index rebuild operations to near zero.

Goal 3

Maintain retrieval accuracy at >99% of the full-reindex baseline.

Goal 4

Support continuous ingestion at enterprise scale.

Goal 5

Achieve sub-second update propagation.

System Architecture

Layer 0 Source Connectors

Layer 1 CDC Engine

Layer 2 Semantic Diff Engine

Layer 3 Retrieval Impact Predictor

Layer 4 Embedding Manager

Layer 5 Version Store

Layer 6 Dependency Graph

Layer 7 LSM Vector Index

Layer 8 Retrieval API

Work Packages

WP-1 Foundation

Duration: 2 weeks

Deliverables:

Monorepo
CI/CD
Benchmark framework
Dataset registry

Success Criteria:

Repeatable experiments.

WP-2 Change Detection Engine

Duration: 3 weeks

Tasks:

File CDC
Database CDC
Event ingestion
Hash-based diffing

Output:

Changed chunk list

Success Criteria:

Detect changes with >99.9% precision.
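Hash-based diffing, the simplest of the tasks above, can be sketched as follows. The function names and the choice of SHA-256 are illustrative assumptions; a production version would persist the old hashes rather than recompute them.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash of one chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_chunks(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return IDs of chunks that were added, removed, or whose content hash differs."""
    old_hashes = {cid: fingerprint(text) for cid, text in old.items()}
    new_hashes = {cid: fingerprint(text) for cid, text in new.items()}
    return sorted(
        cid
        for cid in old_hashes.keys() | new_hashes.keys()
        if old_hashes.get(cid) != new_hashes.get(cid)  # a missing side compares as None
    )


old_doc = {"a": "Intro text", "b": "128GB RAM", "c": "Legacy section"}
new_doc = {"a": "Intro text", "b": "129GB RAM", "d": "New section"}
changed = changed_chunks(old_doc, new_doc)
```

Exact hashing flags any byte-level difference, including whitespace-only edits, so it is best seen as a cheap first filter ahead of the semantic diff.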

WP-3 Semantic Diff Engine

Duration: 4 weeks

Tasks:

AST extraction
Chunk fingerprinting
Concept extraction
Semantic similarity graph

Output:

Semantic delta object

Success Criteria:

Correctly classify additions, deletions, modifications.
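One possible shape for the semantic delta object is sketched below. It is an assumption-heavy illustration: `SemanticDelta` and the 0.6 threshold are hypothetical, and `difflib.SequenceMatcher` is a purely lexical stand-in for whatever semantic similarity measure the engine ends up using.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class SemanticDelta:
    added: list[str] = field(default_factory=list)
    modified: list[tuple[str, str]] = field(default_factory=list)  # (old, new) pairs
    deleted: list[str] = field(default_factory=list)


def similarity(a: str, b: str) -> float:
    # Lexical stand-in for a semantic similarity score in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def semantic_delta(old: list[str], new: list[str], threshold: float = 0.6) -> SemanticDelta:
    delta = SemanticDelta()
    unmatched_new = [chunk for chunk in new if chunk not in old]
    for chunk in old:
        if chunk in new:
            continue  # unchanged chunk, nothing to record
        best = max(unmatched_new, key=lambda c: similarity(chunk, c), default=None)
        if best is not None and similarity(chunk, best) >= threshold:
            delta.modified.append((chunk, best))
            unmatched_new.remove(best)
        else:
            delta.deleted.append(chunk)
    delta.added = unmatched_new  # new chunks with no close old counterpart
    return delta


v1 = ["The server has 128GB RAM.", "Backups run nightly.", "Legacy API is supported."]
v2 = ["The server has 129GB RAM.", "Backups run nightly.", "GPU quota: 4 per job."]
delta = semantic_delta(v1, v2)
```

The greedy matching is quadratic in the number of changed chunks, which is acceptable for a per-document diff but would need an index at larger scale.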

WP-4 Retrieval Impact Predictor

Duration: 6 weeks

Research Track

Tasks:

Build retrieval benchmark
Learn retrieval sensitivity
Predict update necessity

Output:

Impact score

0.0 → no update

1.0 → must update

Success Criteria:

Skip >70% updates while preserving retrieval quality.
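One way to define the quantity such a predictor would learn is to measure how much top-k results shift across a set of probe queries. The sketch below computes that ground-truth score by brute force; it is not the learned predictor itself, and the probe-query approach, the Jaccard measure, and all names are assumptions.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], index: dict[str, list[float]], k: int) -> set[str]:
    """IDs of the k highest-scoring chunks for one query."""
    ranked = sorted(index, key=lambda cid: dot(query, index[cid]), reverse=True)
    return set(ranked[:k])


def impact_score(index, chunk_id, new_vector, probes, k=1) -> float:
    """1 minus the mean Jaccard overlap of top-k results before vs. after the update."""
    updated = {**index, chunk_id: new_vector}
    overlaps = []
    for query in probes:
        before, after = top_k(query, index, k), top_k(query, updated, k)
        overlaps.append(len(before & after) / len(before | after))
    return 1.0 - sum(overlaps) / len(overlaps)


index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
probes = [[1.0, 0.0], [0.0, 1.0]]

small_edit = impact_score(index, "a", [0.99, 0.1], probes)  # rankings unchanged
large_edit = impact_score(index, "a", [0.1, 0.99], probes)  # "a" drops out for one probe
```

A score of 0.0 means no probe query saw a different result set, so the update could be skipped; a learned model would aim to estimate this value without re-running the searches.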

WP-5 Versioned Storage

Duration: 3 weeks

Tasks:

Chunk versioning
Temporal metadata
Lineage tracking

Success Criteria:

Historical reconstruction support.

WP-6 Dependency Graph

Duration: 4 weeks

Tasks:

Graph schema
Edge inference
Incremental propagation

Success Criteria:

Detect all downstream stale nodes.

WP-7 Incremental Embedding Engine

Duration: 6 weeks

Tasks:

Selective re-embedding
Delta embedding experiments
Embedding cache

Success Criteria:

90% reduction in embedding workload.
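The embedding cache task can be sketched as a content-addressed lookup in front of the embedder. `EmbeddingCache` and the `embed_fn` callable are hypothetical; the only assumption about the embedder is that it maps a list of texts to a list of vectors.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Embed each distinct text once, keyed by its content hash."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of texts actually sent to the embedder

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        # Collect cache misses, deduplicated, in first-seen order.
        missing = {k: t for k, t in zip(keys, texts) if k not in self._cache}
        if missing:
            vectors = self._embed_fn(list(missing.values()))
            self._cache.update(zip(missing.keys(), vectors))
            self.misses += len(missing)
        return [self._cache[k] for k in keys]


# Fake embedder for illustration: a one-dimensional "vector" holding the text length.
cache = EmbeddingCache(lambda texts: [[float(len(t))] for t in texts])
first = cache.embed(["a", "bb"])
second = cache.embed(["bb", "ccc"])  # only "ccc" reaches the embedder
```

The cache alone only avoids re-embedding byte-identical text; skipping edits that do change the text is the job of the impact predictor.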

WP-8 LSM Vector Index

Duration: 8 weeks

Tasks:

L0-L3 architecture
Incremental HNSW
Compaction engine

Success Criteria:

No full index rebuilds.

WP-9 Retrieval Layer

Duration: 4 weeks

Tasks:

Hybrid search
Temporal search
Version-aware ranking

Success Criteria:

Beat baseline RAG retrieval.

WP-10 Research Publication

Duration: Ongoing

Targets:

arXiv
VLDB
SIGIR
NeurIPS Datasets & Benchmarks

Potential Paper Titles:

Retrieval-Aware Incremental Embedding

Version-Aware Dynamic Vector Retrieval

AetherV: A Knowledge Evolution Engine for Continually Updated RAG Systems

Success Metrics

Embedding Cost Reduction: Target >90%

Update Latency: Target <1 second

Index Rebuild Frequency: Target zero

Retrieval Accuracy Loss: Target <1%

Storage Overhead: Target <20%

MVP Scope

MVP includes:

✓ CDC

✓ Semantic diffing

✓ Version tracking

✓ Selective re-embedding

✓ Incremental HNSW

✓ Evaluation suite

MVP excludes:

✗ Multi-node clustering

✗ GPU acceleration

✗ Agent orchestration

✗ RAPTOR integration

✗ Knowledge graph generation

These become Phase 2.

Phase 2

Multi-node distributed engine
GPU kernels
RAPTOR hierarchy
GraphRAG integration
Agent memory support
Real-time streaming ingestion
Learned ANN routing

Project Charter: Project AetherV (Heterogeneous Vector Engine)

This document establishes the scope, architectural design, and implementation roadmap for building a next-generation, heterogeneous vector database from scratch. Project AetherV shifts away from monolithic database paradigms by separating dynamic control logic from parallelized mathematical acceleration.

1. Executive Summary: Why, How, and What It Solves

The "Why" (The Rationale)

Current vector databases face a sharp trade-off: they are either optimized for fast, static similarity searches on hardware accelerators (like GPU/TPU indices) or built for dynamic text processing, metadata filtering, and graph routing on standard CPUs. When a production multi-agent system runs complex pipelines—such as RAPTOR tree traversals, hybrid lexical/dense search, and real-time self-correcting routing loops—monolithic databases create severe latency overhead. They waste valuable GPU compute on sequential pointer-chasing logic or bottleneck the CPU with massive array operations.

What It Solves (The Core Bottlenecks)

The Inter-Node Latency Tax: Eliminates the serialization and network overhead of bouncing data between standalone graph stores, BM25 engines, and vector indices.
The Abstraction Tax: Prevents framework state bloat (inherent in heavy orchestrators) by managing the agentic state machine directly inside the database control layer.
The Accelerator Compiling Problem: Works around JAX's requirement for static array shapes in jit-compiled code by implementing a zero-copy memory bridge over fixed-capacity, pre-allocated CPU layouts.

The "How" (The Architectural Split)

AetherV splits the database into two highly specialized planes operating over shared memory:

[ Ingestion / Query Client ]
             │
             ▼
┌────────────────────────────────────────────────────────────┐
│                 CPU CONTROL PLANE (Polars)                 │
│  - HNSW Graph Traversal     - BM25 Token Dictionaries      │
│  - RAPTOR Tree Clustering   - Deterministic Statechart     │
└──────────────────────────────┬─────────────────────────────┘
                               │  Zero-Copy Apache Arrow Bridge
                               ▼
┌────────────────────────────────────────────────────────────┐
│                 GPU EXECUTION PLANE (JAX)                  │
│  - Fused Cosine Similarity  - Batched Cross-Encoder Rerank │
│  - Conformal Masking        - Vectorized Index Scans       │
└────────────────────────────────────────────────────────────┘

2. Technical Stack Selection

To achieve maximum throughput and mechanical sympathy with the underlying hardware, the core stack is strictly limited to a small set of high-performance libraries:

Control Plane & Memory Layout: Polars / Apache Arrow. Polars provides fast columnar structures on the CPU, and Apache Arrow ensures that data can be exposed to accelerators via raw memory pointers without slow serialization passes.
Execution Plane: JAX (XLA). JAX compiles distance functions, cross-encoder scoring, and conformal mask thresholds into fused, branch-free GPU kernels.
Concurrency Model: Python asyncio. Manages concurrent query streams and asynchronous actor tasks without blocking threads.

3. Structural Database Schemas

CPU Memory Map (Polars Schema)

The main collection matrix is tracked in a centralized, contiguous memory table. Every document chunk maps directly to a fixed implicit integer row index ($i$).

Column Name	Type	Purpose / Description
`id`	`pl.UInt32`	Global unique identifier / contiguous matrix offset
`parent_id`	`pl.Int32`	RAPTOR tree pointer (Points to parent summary ID; `-1` if root)
`layer_level`	`pl.UInt8`	RAPTOR level ($0$ = base chunk, $1$ = summary child, etc.)
`hnsw_edges`	`pl.List(pl.Int32)`	Padded fixed-length array of neighbor node indices
`sparse_tokens`	`pl.List(pl.UInt32)`	Hashed vocabulary bin tokens for BM25 matching
`sparse_weights`	`pl.List(pl.Float32)`	Corresponding pre-computed term frequencies
`doc_string`	`pl.String`	Raw text block (Retained strictly for synthesis output)

GPU Tensor Allocation (JAX Fixed-Capacity Arrays)

The execution plane registers a static block of memory on device initialization to prevent costly re-allocation delays during ingestion.

DENSE_MATRIX: jax.Array of shape [MAX_CAPACITY, FEATURE_DIM] (float32)
SPARSE_MATRIX: jax.Array of shape [MAX_CAPACITY, MAX_TOKEN_PADDINGS] (float32)
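The fixed-capacity idea can be shown without an accelerator. The sketch below is a pure-Python stand-in, not JAX code: the buffer is allocated once at full size, appends write in place, and unused slots are masked at scoring time so the shape seen by a compiled kernel would never change. The class name and the `-inf` masking convention are assumptions.

```python
class FixedCapacityMatrix:
    """Pre-allocated [max_capacity, feature_dim] buffer plus a count of valid rows."""

    def __init__(self, max_capacity: int, feature_dim: int) -> None:
        self.shape = (max_capacity, feature_dim)
        self.rows = [[0.0] * feature_dim for _ in range(max_capacity)]
        self.count = 0  # rows [0, count) hold real vectors

    def append(self, vector: list[float]) -> int:
        if self.count == self.shape[0]:
            raise OverflowError("matrix is at MAX_CAPACITY")
        if len(vector) != self.shape[1]:
            raise ValueError("vector has the wrong dimension")
        self.rows[self.count][:] = vector  # write in place; the shape never changes
        self.count += 1
        return self.count - 1  # row index of the new vector

    def scores(self, query: list[float]) -> list[float]:
        """Dot-product score for every slot; unused slots are masked to -inf."""
        return [
            sum(q * x for q, x in zip(query, row)) if i < self.count else float("-inf")
            for i, row in enumerate(self.rows)
        ]


dense = FixedCapacityMatrix(max_capacity=4, feature_dim=2)
dense.append([1.0, 0.0])
dense.append([0.0, 1.0])
all_scores = dense.scores([1.0, 0.0])
```

The output always has `max_capacity` entries, which is the property that lets a jitted kernel be reused across inserts; the cost is that every search scans the unused slots as well.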

4. The Phased Implementation Roadmap

Phase 1: The Zero-Copy Memory Bridge (Weeks 1–2)

Goal: Establish zero-overhead data handoffs between CPU records and compiled GPU math.
Deliverables:
Configure a fixed-capacity Apache Arrow allocation layer holding raw vector representations.
Implement the core JAX execution kernel for batched Cosine Similarity using jax.jit and jax.vmap.
Build the interface layer that reads memory pointers from a Polars chunk partition and maps them into JAX device arrays without copying data strings.

Phase 2: Dual-Engine Hybrid Search (Weeks 3–4)

Goal: Combine exact dense vector lookups with sparse BM25 mechanics on the accelerator.
Deliverables:
Create a tokenization helper that hashes text blocks into a uniform, padded sparse array layout.
Write the fused JAX mathematical kernel that simultaneously computes dense cosine distance and sparse dot-product scores.
Implement a Reciprocal Rank Fusion (RRF) step directly inside the XLA compilation block to return unified top-K index positions.
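Reciprocal Rank Fusion itself is small enough to show in full. This sketch runs on the CPU in plain Python rather than inside an XLA block; `k=60` is the constant commonly used for RRF, and the function name and tie-breaking rule are choices made for this example.

```python
def reciprocal_rank_fusion(
    rankings: list[list[int]], k: int = 60, top_k: int = 3
) -> list[int]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


dense_ranking = [3, 1, 2]   # row indices ordered by dense cosine score
sparse_ranking = [1, 4, 3]  # row indices ordered by sparse BM25-style score
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF uses only ranks, not raw scores, so the dense and sparse scores need no normalization against each other before fusion.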

Phase 3: Hierarchical Structures & Context Filtering (Weeks 5–6)

Goal: Add support for structural document grouping (RAPTOR) and statistical compression.
Deliverables:
Build the offline ingestion pipeline that clusters base text embeddings via a lightweight Gaussian Mixture Model (GMM) and inserts parent summaries back into the Polars table layout.
Write the CPU traversal router to fetch parent summary references when deep context is requested.
Implement a JAX-backed Conformal Filtering layer to dynamically threshold and mask out noise based on the scoring variance across retrieved arrays.

Phase 4: Statechart Orchestration & Agentic RAG (Weeks 7–8)

Goal: Embed the multi-agent routing loop directly into the database control plane.
Deliverables:
Implement an asynchronous StatechartSupervisor that transitions search workflows based on strict routing matrices.
Build independent actor workers (VectorSearchActor, CriticActor, SynthesizerActor) operating over non-blocking message loops.
Conduct end-to-end performance tracing to measure query execution times under parallel load constraints.
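A table-driven supervisor is one straightforward way to enforce strict routing. In the sketch below the states, events, and transition table are invented for illustration, and the actors are reduced to event names; only the `StatechartSupervisor` name comes from the deliverables above.

```python
import asyncio

# Hypothetical routing matrix: (current state, event) -> next state.
TRANSITIONS = {
    ("search", "hits_found"): "critique",
    ("critique", "accepted"): "synthesize",
    ("critique", "rejected"): "search",
    ("synthesize", "done"): "finished",
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


class StatechartSupervisor:
    def __init__(self, initial: str = "search") -> None:
        self.state = initial

    async def dispatch(self, event: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise InvalidTransition(f"{event!r} is not allowed in state {self.state!r}")
        await asyncio.sleep(0)  # yield to other tasks, as an actor hand-off would
        self.state = TRANSITIONS[key]
        return self.state


async def run_workflow(events: list[str]) -> list[str]:
    supervisor = StatechartSupervisor()
    return [await supervisor.dispatch(event) for event in events]


# One self-correcting loop: the critic rejects the first result set, then accepts.
trace = asyncio.run(
    run_workflow(["hits_found", "rejected", "hits_found", "accepted", "done"])
)
```

Because every transition goes through one lookup, an event that is not in the table cannot move the workflow into an undefined state; it raises instead.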

5. Verification & Success Criteria

To transition the project from alpha implementation to a production-ready target, the engine must satisfy three objective performance baselines:

Zero Dynamic Re-compilation: After the initial database setup pass, adding new records or executing search pipelines must result in zero XLA re-compilation events.
Deterministic Graph Routing: The statechart control plane must reject 100% of invalid state transition attempts during loop cycles.
Throughput Scaling: The combined engine must achieve lower end-to-end latency than a traditional monolithic database paired with a separate client-side agent framework when running recursive or self-correcting RAG workflows.

Ultimate Goal

Transform vector databases from static embedding stores into continuously evolving knowledge systems.

The core artifact is not the vector index.

The core artifact is a continuously maintained knowledge evolution graph.
