# Dorer disk poc delete vector #843
base: `disk-poc`
## Conversation
Pull request overview
This PR implements vector deletion functionality with batch processing and graph repair for the HNSW disk-based index. The changes introduce a complete deletion pipeline that marks vectors as deleted, batches deletion operations for efficiency, and repairs the graph structure to maintain connectivity and search quality after deletions.
Key Changes
- Implemented batch deletion processing with a configurable threshold (`deleteBatchThreshold`) to amortize disk I/O costs
- Added graph repair logic that uses heuristic-based neighbor selection to maintain graph quality when nodes are deleted
- Introduced separate staging areas for insert operations, delete operations, and opportunistic repair updates to avoid conflicts
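To make the batching mechanism concrete, here is a minimal sketch of a threshold-triggered delete batch. Only the names `deleteBatchThreshold`, `pendingDeleteIds`, and `processDeleteBatch` mirror the PR; `DiskIndexSketch` and everything else is a simplified, hypothetical stand-in for the real disk index.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Minimal sketch of threshold-based delete batching. The real index also
// repairs graph edges and writes one RocksDB batch per flush; here we only
// record that a batch was processed.
struct DiskIndexSketch {
    size_t deleteBatchThreshold = 10;        // pending deletes before a flush
    std::unordered_set<uint32_t> pendingDeleteIds;
    size_t batchesProcessed = 0;

    // Mark a vector as deleted; flush when the batch threshold is reached.
    void markDelete(uint32_t id) {
        pendingDeleteIds.insert(id);
        if (pendingDeleteIds.size() >= deleteBatchThreshold)
            processDeleteBatch();
    }

    void processDeleteBatch() {
        pendingDeleteIds.clear();
        ++batchesProcessed;
    }
};
```

Amortizing I/O this way means 25 `markDelete` calls with a threshold of 10 issue only two disk batches, leaving 5 deletions pending for the next flush.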
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tests/unit/test_hnsw_disk.cpp | Added comprehensive test suites for batch deletion, interleaved insert/delete operations, staged repair, bidirectional edge updates, unidirectional edge cleanup, and graph repair with heuristics |
| tests/benchmark/run_files/bm_hnsw_disk_single_fp32.cpp | Updated comment to reflect dataset size (though comment-code mismatch exists) |
| tests/benchmark/bm_initialization/bm_hnsw_disk_initialize_fp32.h | Added TopK_DeleteLabel benchmark registration for measuring search performance after deletions with graph repair |
| tests/benchmark/bm_common.h | Implemented TopK_HNSW_DISK_DeleteLabel benchmark that tests search quality and performance after vector deletions with full graph repair |
| src/VecSim/algorithms/hnsw/hnsw_disk.h | Core implementation of deletion logic including pendingDeleteIds, processDeleteBatch, separate staging areas, opportunistic repair in getNeighbors, and bidirectional edge repair during deletion |
| src/VecSim/algorithms/hnsw/hnsw_disk_serializer.h | Updated serialization validation and restoration to handle new staging area data structures (stagedInsertUpdates, stagedDeleteUpdates, stagedRepairUpdates, stagedInsertNeighborUpdates) |
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 18 comments.
```cpp
inline uint64_t makeRepairKey(idType node_id, size_t level) const {
    return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
}
```
**Copilot AI** (Nov 30, 2025)
The bit-shift operation `node_id << 32` can overflow for a 32-bit `idType`. If `idType` is `uint32_t`, shifting left by 32 bits is undefined behavior. The cast to `uint64_t` must happen before the shift; casting after the shift is too late.
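For illustration, a standalone version of the safe packing pattern might look like the sketch below. The name `makeRepairKey` comes from the PR, but this free function is a simplified, hypothetical variant that also masks the level (addressing a separate collision concern raised later in this review).

```cpp
#include <cstddef>
#include <cstdint>

// Safe key packing: widen node_id to 64 bits *before* the shift (shifting a
// 32-bit value left by 32 is undefined behavior), and mask level so an
// oversized size_t value cannot spill into the node_id half of the key.
inline uint64_t makeRepairKey(uint32_t node_id, size_t level) {
    return (static_cast<uint64_t>(node_id) << 32) |
           (static_cast<uint64_t>(level) & 0xFFFFFFFFu);
}
```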
```cpp
auto raw_it = rawVectorsInRAM.find(internalId);
if (raw_it != rawVectorsInRAM.end()) {
    rawVectorsInRAM.erase(raw_it);
}
```
**Copilot AI** (Nov 30, 2025)
Duplicate raw vector cleanup. The raw vector is erased from `rawVectorsInRAM` here in `markDelete()` (lines 2462-2465), and then erased again in `processDeleteBatch()` at lines 1898-1901. While the second erase is harmless (the lookup simply won't find the entry), this duplication is confusing and could lead to maintenance issues. Choose one location for this cleanup operation.
Suggested change:

```diff
-auto raw_it = rawVectorsInRAM.find(internalId);
-if (raw_it != rawVectorsInRAM.end()) {
-    rawVectorsInRAM.erase(raw_it);
-}
+// Raw vector cleanup is handled in processDeleteBatch(), not here.
```
```cpp
const void* raw_vector_data = getRawVector(update.node_id);
if (raw_vector_data == nullptr) {
    this->log(VecSimCommonStrings::LOG_WARNING_STRING,
              "WARNING: Skipping graph update for node %u at level %zu - no raw vector data available",
              update.node_id, update.level);
    continue;
}
```
**Copilot AI** (Nov 30, 2025)
Inconsistent null handling between graph updates and neighbor updates. The graph-update section checks for null raw vector data (lines 897-903), but the neighbor-update section at lines 959-963 does not perform the same check before calling `serializeGraphValue()`. This could lead to undefined behavior if `getRawVector()` returns `nullptr` for neighbor updates. Apply the same null-check pattern to both sections.
```cpp
std::cout << "Not enough non-deleted ground truth results to compare against (only "
          << filtered_res->results.size() << " out of " << k << " requested)" << std::endl;
```
**Copilot AI** (Nov 30, 2025)
The error message uses `std::cout` instead of the benchmark's error-reporting mechanism. It should be reported through `st.SkipWithError()` or a similar facility, as it indicates a test-configuration problem that should fail the benchmark rather than merely being printed to the console.
Suggested change:

```diff
-std::cout << "Not enough non-deleted ground truth results to compare against (only "
-          << filtered_res->results.size() << " out of " << k << " requested)" << std::endl;
+st.SkipWithError(
+    ("Not enough non-deleted ground truth results to compare against (only " +
+     std::to_string(filtered_res->results.size()) + " out of " +
+     std::to_string(k) + " requested)").c_str());
```
```cpp
};

vecsim_stl::vector<NeighborUpdate> stagedNeighborUpdates;
// Can maybe merge with stagedInsertUpdates
```
**Copilot AI** (Nov 30, 2025)
The comment "Can maybe merge with stagedInsertUpdates" suggests uncertainty about the design, indicating potential code duplication or unclear separation of concerns. Either merge these structures if they serve the same purpose, or clarify in the comment why they must remain separate. Leaving "maybe" comments in production code creates technical debt.
Suggested change:

```diff
-// Can maybe merge with stagedInsertUpdates
+// stagedInsertNeighborUpdates is kept separate from stagedInsertUpdates because it specifically
+// tracks neighbor list updates required for bidirectional connections during insert operations,
+// whereas stagedInsertUpdates handles other insert-related state.
```
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
```cpp
const void* raw_vector_data = getRawVector(node_id);

// Serialize with new format and add to batch
std::string graph_value = serializeGraphValue(raw_vector_data, updated_neighbors);
neighborBatch.Put(cf, neighborKey.asSlice(), graph_value);
```
**Copilot AI** (Nov 30, 2025)
Missing null check for `raw_vector_data` before use. If `getRawVector(node_id)` returns `nullptr` (which can happen, as the check at line 903 shows), the code will pass a null pointer to `serializeGraphValue()`, potentially causing a crash. This is inconsistent with the null check done for graph updates at lines 902-908. Add a check and either skip the update or handle it appropriately.
```cpp
// Helper to create a unique key for (node_id, level) pair for hash map
inline uint64_t makeRepairKey(idType node_id, size_t level) const {
    return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
```
**Copilot AI** (Nov 30, 2025)
The `makeRepairKey` function can produce collisions if `level` exceeds 32 bits. While HNSW levels are typically small (< 20), the cast `static_cast<uint64_t>(level)` from `size_t` does not mask the value. On 64-bit systems where `size_t` is 64 bits, a large level value (> 2^32) would overflow into the `node_id` portion of the key, causing incorrect behavior. Consider adding a mask, `(static_cast<uint64_t>(node_id) << 32) | (static_cast<uint64_t>(level) & 0xFFFFFFFF)`, or asserting that `level < (1ULL << 32)`.
Suggested change:

```diff
-return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
+return (static_cast<uint64_t>(node_id) << 32) | (static_cast<uint64_t>(level) & 0xFFFFFFFF);
```
```cpp
vecsim_stl::vector<idType> new_neighbor_neighbors(this->allocator);
getNeighbors(new_neighbor_id, level, new_neighbor_neighbors);
```
**Copilot AI** (Nov 30, 2025)
Calling `getNeighbors(new_neighbor_id, level, new_neighbor_neighbors)` inside the repair loop could return stale or incorrect data. Since `stageDeleteUpdate` has already been called for `neighbor_id` (line 1860), if `new_neighbor_id` was also a neighbor of the deleted node, its staged update may already have been written to `stagedDeleteMap`. `getNeighbors` will then consult `stagedDeleteMap` and return the staged (possibly incomplete) neighbor list before all repairs are finished, which could lead to inconsistent bidirectional edge handling. Consider collecting all updates first and applying bidirectional repairs in a second pass, or tracking which nodes have already been updated in the current batch.
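One way to realize the suggested two-pass approach is to compute every repaired neighbor list against an unmodified snapshot of the graph first, and only then stage the results. The sketch below is hypothetical: the PR's actual staging maps and repair heuristics are more involved, and it uses plain `std` containers instead of `vecsim_stl`.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using NeighborLists = std::unordered_map<uint32_t, std::vector<uint32_t>>;

// Pass 1: collect repaired neighbor lists from a consistent, unmodified view
// of the graph. Pass 2 (consuming the return value) is where a real
// implementation would write the staged updates, after all lists are computed.
NeighborLists collectRepairs(const NeighborLists &graph, uint32_t deleted_id) {
    NeighborLists staged;
    for (const auto &[node, neighbors] : graph) {
        if (node == deleted_id)
            continue;
        std::vector<uint32_t> repaired;
        for (uint32_t n : neighbors)
            if (n != deleted_id)
                repaired.push_back(n);  // drop edges into the deleted node
        if (repaired.size() != neighbors.size())
            staged[node] = repaired;    // only stage nodes that actually changed
    }
    return staged;
}
```

Because no staged state is read back during pass 1, the repair decisions cannot observe each other's partial results.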
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
```cpp
// Mutable to allow staging from const search methods.
// IMPORTANT: This class is NOT thread-safe. All operations (including const methods like
// getNeighbors and search) must be called from a single thread. The mutable fields below
// are modified during read operations for opportunistic graph cleanup.
```
**Copilot AI** (Nov 30, 2025)
The comment states "This class is NOT thread-safe" and mentions that "All operations (including const methods like getNeighbors and search) must be called from a single thread." However, using mutable fields that are modified in const methods creates a misleading API, since the `const` qualifier typically implies safety for concurrent reads. Consider either:
- Removing the `const` qualifier from methods that modify these fields (like `getNeighbors`)
- Implementing proper synchronization for these mutable fields
- Documenting the single-threaded constraint more prominently in the class documentation
Suggested change:

```diff
-// Mutable to allow staging from const search methods.
-// IMPORTANT: This class is NOT thread-safe. All operations (including const methods like
-// getNeighbors and search) must be called from a single thread. The mutable fields below
-// are modified during read operations for opportunistic graph cleanup.
+// NOTE: Methods that modify stagedRepairUpdates and stagedRepairMap are NOT const.
+// IMPORTANT: This class is NOT thread-safe. All operations (including getNeighbors and search)
+// must be called from a single thread. The mutable fields below are modified during read
+// operations for opportunistic graph cleanup.
```
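The pattern under discussion is legal C++ but easy to misuse. A tiny hypothetical example (none of these names are from the PR) shows why a `mutable` field mutated inside a const method undermines the usual "const means safe for concurrent reads" assumption:

```cpp
// A `mutable` counter mutated inside a const method. This compiles and runs
// fine single-threaded, but two threads calling neighbors() concurrently
// would race on stagedRepairCount despite the method being const.
class GraphSketch {
    mutable int stagedRepairCount = 0; // stand-in for the staged-repair maps
public:
    int neighbors(int node) const {
        ++stagedRepairCount; // opportunistic bookkeeping during a "read"
        return node;         // real code would return the neighbor list
    }
    int stagedCount() const { return stagedRepairCount; }
};
```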
```cpp
inline bool filterDeletedNodes(vecsim_stl::vector<idType>& neighbors) const {
    size_t original_size = neighbors.size();
    auto new_end = std::remove_if(neighbors.begin(), neighbors.end(),
        [this](idType id) { return id >= curElementCount || isMarkedDeleted(id); });
```
**Copilot AI** (Nov 30, 2025)
The `filterDeletedNodes` method uses `std::remove_if`, which dereferences iterators to call the lambda. Since this is a hot path (called during neighbor retrieval), check whether `isMarkedDeleted` already performs proper bounds checking: if `id >= curElementCount` is verified before calling `isMarkedDeleted(id)`, the second check inside `isMarkedDeleted` may be redundant. Verify the performance impact of this double-check pattern.
Suggested change:

```diff
-[this](idType id) { return id >= curElementCount || isMarkedDeleted(id); });
+[this](idType id) { return isMarkedDeleted(id); });
```
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
```cpp
bool is_bidirectional = false;
for (idType nn : neighbor_neighbors) {
    if (nn == deleted_id) {
        is_bidirectional = true;
        break;
    }
}
```
**Copilot AI** (Nov 30, 2025)
The bidirectional edge check performs a linear search through `neighbor_neighbors` for each neighbor. For large graphs with many neighbors, this results in O(neighbors * neighbors.size()) complexity. Consider using a hash set for O(1) lookup:

```cpp
std::unordered_set<idType> neighbor_set(neighbor_neighbors.begin(), neighbor_neighbors.end());
bool is_bidirectional = neighbor_set.find(deleted_id) != neighbor_set.end();
```

This would be especially beneficial when M is large.
Suggested change:

```diff
-bool is_bidirectional = false;
-for (idType nn : neighbor_neighbors) {
-    if (nn == deleted_id) {
-        is_bidirectional = true;
-        break;
-    }
-}
+std::unordered_set<idType> neighbor_set(neighbor_neighbors.begin(), neighbor_neighbors.end());
+bool is_bidirectional = neighbor_set.find(deleted_id) != neighbor_set.end();
```
```cpp
 * Threshold for batching delete operations.
 * When the number of pending deletions reaches this value, the deletions are processed in a batch.
 */
size_t deleteBatchThreshold = 10;
```
**Copilot AI** (Nov 30, 2025)
[nitpick] The `deleteBatchThreshold` is hardcoded to 10, which seems quite low compared to `batchThreshold` (which also defaults to 10 for construction but can be configured). Consider:
- Making this configurable through `HNSWParams`
- Documenting why 10 was chosen as the default
- Using a higher value (e.g., 100) for better batching efficiency, similar to what is tested in `BatchedDeletionTest`
```cpp
if (!stagedInsertUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedInsertUpdates not empty after flush");
}
if (!stagedNeighborUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedNeighborUpdates not empty after flush");
}
if (!stagedDeleteUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedDeleteUpdates not empty after flush");
}
if (!stagedInsertNeighborUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedInsertNeighborUpdates not empty after flush");
}
```
**Copilot AI** (Nov 30, 2025)
Missing validation checks for `stagedRepairUpdates` and `pendingDeleteIds` after flush. These should also be empty before serialization to ensure a consistent index state. Add:

```cpp
if (!stagedRepairUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedRepairUpdates not empty after flush");
}
if (!pendingDeleteIds.empty()) {
    throw std::runtime_error("Serialization error: pendingDeleteIds not empty after flush");
}
```

Consider calling `flushDeleteBatch()` in addition to `flushBatch()` at line 263.
```cpp
    delete results;
}

TEST_F(HNSWDiskIndexTest, StagedRepairTest) {
```
**Copilot AI** (Nov 30, 2025)
[nitpick] The test name `StagedRepairTest` is somewhat ambiguous. Consider renaming it to `OpportunisticGraphRepairTest` or `LazyEdgeCleanupTest` to better describe what the test verifies: that deleted edges are opportunistically cleaned up during search operations.
Suggested change:

```diff
-TEST_F(HNSWDiskIndexTest, StagedRepairTest) {
+TEST_F(HNSWDiskIndexTest, OpportunisticGraphRepairTest) {
```