# Dorer disk poc delete vector #843
base: `disk-poc`
## Conversation
Pull request overview
This PR implements vector deletion functionality with batch processing and graph repair for the HNSW disk-based index. The changes introduce a complete deletion pipeline that marks vectors as deleted, batches deletion operations for efficiency, and repairs the graph structure to maintain connectivity and search quality after deletions.
Key Changes
- Implemented batch deletion processing with a configurable threshold (`deleteBatchThreshold`) to amortize disk I/O costs
- Added graph repair logic that uses heuristic-based neighbor selection to maintain graph quality when nodes are deleted
- Introduced separate staging areas for insert operations, delete operations, and opportunistic repair updates to avoid conflicts
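To make the batching mechanism concrete, here is a minimal sketch of a threshold-triggered delete batch. Only the names `deleteBatchThreshold`, `pendingDeleteIds`, and `processDeleteBatch` mirror the PR; `DiskIndexSketch` and everything else is a simplified, hypothetical stand-in for the real disk index.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Minimal sketch of threshold-based delete batching. The real index also
// repairs graph edges and writes one RocksDB batch per flush; here we only
// record that a batch was processed.
struct DiskIndexSketch {
    size_t deleteBatchThreshold = 10;        // pending deletes before a flush
    std::unordered_set<uint32_t> pendingDeleteIds;
    size_t batchesProcessed = 0;

    // Mark a vector as deleted; flush when the batch threshold is reached.
    void markDelete(uint32_t id) {
        pendingDeleteIds.insert(id);
        if (pendingDeleteIds.size() >= deleteBatchThreshold)
            processDeleteBatch();
    }

    void processDeleteBatch() {
        pendingDeleteIds.clear();
        ++batchesProcessed;
    }
};
```

Amortizing I/O this way means 25 `markDelete` calls with a threshold of 10 issue only two disk batches, leaving 5 deletions pending for the next flush.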
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tests/unit/test_hnsw_disk.cpp | Added comprehensive test suites for batch deletion, interleaved insert/delete operations, staged repair, bidirectional edge updates, unidirectional edge cleanup, and graph repair with heuristics |
| tests/benchmark/run_files/bm_hnsw_disk_single_fp32.cpp | Updated comment to reflect dataset size (though comment-code mismatch exists) |
| tests/benchmark/bm_initialization/bm_hnsw_disk_initialize_fp32.h | Added TopK_DeleteLabel benchmark registration for measuring search performance after deletions with graph repair |
| tests/benchmark/bm_common.h | Implemented TopK_HNSW_DISK_DeleteLabel benchmark that tests search quality and performance after vector deletions with full graph repair |
| src/VecSim/algorithms/hnsw/hnsw_disk.h | Core implementation of deletion logic including pendingDeleteIds, processDeleteBatch, separate staging areas, opportunistic repair in getNeighbors, and bidirectional edge repair during deletion |
| src/VecSim/algorithms/hnsw/hnsw_disk_serializer.h | Updated serialization validation and restoration to handle new staging area data structures (stagedInsertUpdates, stagedDeleteUpdates, stagedRepairUpdates, stagedInsertNeighborUpdates) |
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 18 comments.
```cpp
inline uint64_t makeRepairKey(idType node_id, size_t level) const {
    return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
}
```
**Copilot AI** (Nov 30, 2025)
The bit-shift operation `node_id << 32` can overflow for a 32-bit `idType`. If `idType` is `uint32_t`, shifting left by 32 bits is undefined behavior. The cast to `uint64_t` must happen before the shift; casting after the shift is too late.
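For illustration, a standalone version of the safe packing pattern might look like the sketch below. The name `makeRepairKey` comes from the PR, but this free function is a simplified, hypothetical variant that also masks the level (addressing a separate collision concern raised later in this review).

```cpp
#include <cstddef>
#include <cstdint>

// Safe key packing: widen node_id to 64 bits *before* the shift (shifting a
// 32-bit value left by 32 is undefined behavior), and mask level so an
// oversized size_t value cannot spill into the node_id half of the key.
inline uint64_t makeRepairKey(uint32_t node_id, size_t level) {
    return (static_cast<uint64_t>(node_id) << 32) |
           (static_cast<uint64_t>(level) & 0xFFFFFFFFu);
}
```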
```cpp
auto raw_it = rawVectorsInRAM.find(internalId);
if (raw_it != rawVectorsInRAM.end()) {
    rawVectorsInRAM.erase(raw_it);
}
```
**Copilot AI** (Nov 30, 2025)
Duplicate raw vector cleanup. The raw vector is erased from `rawVectorsInRAM` here in `markDelete()` (lines 2462-2465), and then erased again in `processDeleteBatch()` at lines 1898-1901. While the second erase is harmless (the lookup simply won't find the entry), this duplication is confusing and could lead to maintenance issues. Choose one location for this cleanup operation.
Suggested change:

```diff
-auto raw_it = rawVectorsInRAM.find(internalId);
-if (raw_it != rawVectorsInRAM.end()) {
-    rawVectorsInRAM.erase(raw_it);
-}
+// Raw vector cleanup is handled in processDeleteBatch(), not here.
```
```cpp
const void* raw_vector_data = getRawVector(update.node_id);
if (raw_vector_data == nullptr) {
    this->log(VecSimCommonStrings::LOG_WARNING_STRING,
              "WARNING: Skipping graph update for node %u at level %zu - no raw vector data available",
              update.node_id, update.level);
    continue;
}
```
**Copilot AI** (Nov 30, 2025)
Inconsistent null handling between graph updates and neighbor updates. The graph-update section checks for null raw vector data (lines 897-903), but the neighbor-update section at lines 959-963 does not perform the same check before calling `serializeGraphValue()`. This could lead to undefined behavior if `getRawVector()` returns `nullptr` for neighbor updates. Apply the same null-check pattern to both sections.
```cpp
std::cout << "Not enough non-deleted ground truth results to compare against (only "
          << filtered_res->results.size() << " out of " << k << " requested)" << std::endl;
```
**Copilot AI** (Nov 30, 2025)
The error message uses `std::cout` instead of the benchmark's error-reporting mechanism. It should be reported through `st.SkipWithError()` or a similar facility, as it indicates a test-configuration problem that should fail the benchmark rather than merely being printed to the console.
Suggested change:

```diff
-std::cout << "Not enough non-deleted ground truth results to compare against (only "
-          << filtered_res->results.size() << " out of " << k << " requested)" << std::endl;
+st.SkipWithError(
+    ("Not enough non-deleted ground truth results to compare against (only " +
+     std::to_string(filtered_res->results.size()) + " out of " +
+     std::to_string(k) + " requested)").c_str());
```
```cpp
};

vecsim_stl::vector<NeighborUpdate> stagedNeighborUpdates;
// Can maybe merge with stagedInsertUpdates
```
**Copilot AI** (Nov 30, 2025)
The comment "Can maybe merge with stagedInsertUpdates" suggests uncertainty about the design, indicating potential code duplication or unclear separation of concerns. Either merge these structures if they serve the same purpose, or clarify in the comment why they must remain separate. Leaving "maybe" comments in production code creates technical debt.
Suggested change:

```diff
-// Can maybe merge with stagedInsertUpdates
+// stagedInsertNeighborUpdates is kept separate from stagedInsertUpdates because it specifically
+// tracks neighbor list updates required for bidirectional connections during insert operations,
+// whereas stagedInsertUpdates handles other insert-related state.
```
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
```cpp
const void* raw_vector_data = getRawVector(node_id);

// Serialize with new format and add to batch
std::string graph_value = serializeGraphValue(raw_vector_data, updated_neighbors);
neighborBatch.Put(cf, neighborKey.asSlice(), graph_value);
```
**Copilot AI** (Nov 30, 2025)
Missing null check for `raw_vector_data` before use. If `getRawVector(node_id)` returns `nullptr` (which can happen, as the check at line 903 shows), the code will pass a null pointer to `serializeGraphValue()`, potentially causing a crash. This is inconsistent with the null check done for graph updates at lines 902-908. Add a check and either skip the update or handle it appropriately.
```cpp
// Helper to create a unique key for (node_id, level) pair for hash map
inline uint64_t makeRepairKey(idType node_id, size_t level) const {
    return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
```
**Copilot AI** (Nov 30, 2025)
The `makeRepairKey` function can produce collisions if `level` exceeds 32 bits. While HNSW levels are typically small (< 20), the cast `static_cast<uint64_t>(level)` from `size_t` does not mask the value. On 64-bit systems where `size_t` is 64 bits, a large level value (> 2^32) would overflow into the `node_id` portion of the key, causing incorrect behavior. Consider adding a mask, `(static_cast<uint64_t>(node_id) << 32) | (static_cast<uint64_t>(level) & 0xFFFFFFFF)`, or asserting that `level < (1ULL << 32)`.
Suggested change:

```diff
-return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
+return (static_cast<uint64_t>(node_id) << 32) | (static_cast<uint64_t>(level) & 0xFFFFFFFF);
```
```cpp
vecsim_stl::vector<idType> new_neighbor_neighbors(this->allocator);
getNeighbors(new_neighbor_id, level, new_neighbor_neighbors);
```
**Copilot AI** (Nov 30, 2025)
Calling `getNeighbors(new_neighbor_id, level, new_neighbor_neighbors)` inside the repair loop could return stale or incorrect data. Since `stageDeleteUpdate` has already been called for `neighbor_id` (line 1860), if `new_neighbor_id` was also a neighbor of the deleted node, its staged update may already have been written to `stagedDeleteMap`. `getNeighbors` will then consult `stagedDeleteMap` and return the staged (possibly incomplete) neighbor list before all repairs are finished, which could lead to inconsistent bidirectional edge handling. Consider collecting all updates first and applying bidirectional repairs in a second pass, or tracking which nodes have already been updated in the current batch.
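One way to realize the suggested two-pass approach is to compute every repaired neighbor list against an unmodified snapshot of the graph first, and only then stage the results. The sketch below is hypothetical: the PR's actual staging maps and repair heuristics are more involved, and it uses plain `std` containers instead of `vecsim_stl`.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using NeighborLists = std::unordered_map<uint32_t, std::vector<uint32_t>>;

// Pass 1: collect repaired neighbor lists from a consistent, unmodified view
// of the graph. Pass 2 (consuming the return value) is where a real
// implementation would write the staged updates, after all lists are computed.
NeighborLists collectRepairs(const NeighborLists &graph, uint32_t deleted_id) {
    NeighborLists staged;
    for (const auto &[node, neighbors] : graph) {
        if (node == deleted_id)
            continue;
        std::vector<uint32_t> repaired;
        for (uint32_t n : neighbors)
            if (n != deleted_id)
                repaired.push_back(n);  // drop edges into the deleted node
        if (repaired.size() != neighbors.size())
            staged[node] = repaired;    // only stage nodes that actually changed
    }
    return staged;
}
```

Because no staged state is read back during pass 1, the repair decisions cannot observe each other's partial results.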
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
```cpp
// Mutable to allow staging from const search methods.
// IMPORTANT: This class is NOT thread-safe. All operations (including const methods like
// getNeighbors and search) must be called from a single thread. The mutable fields below
// are modified during read operations for opportunistic graph cleanup.
```
**Copilot AI** (Nov 30, 2025)
The comment states "This class is NOT thread-safe" and mentions that "All operations (including const methods like getNeighbors and search) must be called from a single thread." However, using mutable fields that are modified in const methods creates a misleading API, since the `const` qualifier typically implies safety for concurrent reads. Consider either:
- Removing the `const` qualifier from methods that modify these fields (like `getNeighbors`)
- Implementing proper synchronization for these mutable fields
- Documenting the single-threaded constraint more prominently in the class documentation
Suggested change:

```diff
-// Mutable to allow staging from const search methods.
-// IMPORTANT: This class is NOT thread-safe. All operations (including const methods like
-// getNeighbors and search) must be called from a single thread. The mutable fields below
-// are modified during read operations for opportunistic graph cleanup.
+// NOTE: Methods that modify stagedRepairUpdates and stagedRepairMap are NOT const.
+// IMPORTANT: This class is NOT thread-safe. All operations (including getNeighbors and search)
+// must be called from a single thread. The mutable fields below are modified during read
+// operations for opportunistic graph cleanup.
```
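The pattern under discussion is legal C++ but easy to misuse. A tiny hypothetical example (none of these names are from the PR) shows why a `mutable` field mutated inside a const method undermines the usual "const means safe for concurrent reads" assumption:

```cpp
// A `mutable` counter mutated inside a const method. This compiles and runs
// fine single-threaded, but two threads calling neighbors() concurrently
// would race on stagedRepairCount despite the method being const.
class GraphSketch {
    mutable int stagedRepairCount = 0; // stand-in for the staged-repair maps
public:
    int neighbors(int node) const {
        ++stagedRepairCount; // opportunistic bookkeeping during a "read"
        return node;         // real code would return the neighbor list
    }
    int stagedCount() const { return stagedRepairCount; }
};
```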
```cpp
inline bool filterDeletedNodes(vecsim_stl::vector<idType>& neighbors) const {
    size_t original_size = neighbors.size();
    auto new_end = std::remove_if(neighbors.begin(), neighbors.end(),
        [this](idType id) { return id >= curElementCount || isMarkedDeleted(id); });
```
**Copilot AI** (Nov 30, 2025)
The `filterDeletedNodes` method uses `std::remove_if`, which dereferences iterators to call the lambda. Since this is a hot path (called during neighbor retrieval), check whether `isMarkedDeleted` already performs proper bounds checking: if `id >= curElementCount` is verified before calling `isMarkedDeleted(id)`, the second check inside `isMarkedDeleted` may be redundant. Verify the performance impact of this double-check pattern.
Suggested change:

```diff
-[this](idType id) { return id >= curElementCount || isMarkedDeleted(id); });
+[this](idType id) { return isMarkedDeleted(id); });
```
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
```cpp
bool is_bidirectional = false;
for (idType nn : neighbor_neighbors) {
    if (nn == deleted_id) {
        is_bidirectional = true;
        break;
    }
}
```
**Copilot AI** (Nov 30, 2025)
The bidirectional edge check performs a linear search through `neighbor_neighbors` for each neighbor. For large graphs with many neighbors, this results in O(neighbors * neighbors.size()) complexity. Consider using a hash set for O(1) lookup:

```cpp
std::unordered_set<idType> neighbor_set(neighbor_neighbors.begin(), neighbor_neighbors.end());
bool is_bidirectional = neighbor_set.find(deleted_id) != neighbor_set.end();
```

This would be especially beneficial when M is large.
Suggested change:

```diff
-bool is_bidirectional = false;
-for (idType nn : neighbor_neighbors) {
-    if (nn == deleted_id) {
-        is_bidirectional = true;
-        break;
-    }
-}
+std::unordered_set<idType> neighbor_set(neighbor_neighbors.begin(), neighbor_neighbors.end());
+bool is_bidirectional = neighbor_set.find(deleted_id) != neighbor_set.end();
```
```cpp
 * Threshold for batching delete operations.
 * When the number of pending deletions reaches this value, the deletions are processed in a batch.
 */
size_t deleteBatchThreshold = 10;
```
**Copilot AI** (Nov 30, 2025)
[nitpick] The `deleteBatchThreshold` is hardcoded to 10, which seems quite low compared to `batchThreshold` (which also defaults to 10 for construction but can be configured). Consider:
- Making this configurable through `HNSWParams`
- Documenting why 10 was chosen as the default
- Using a higher value (e.g., 100) for better batching efficiency, similar to what is tested in `BatchedDeletionTest`
```cpp
if (!stagedInsertUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedInsertUpdates not empty after flush");
}
if (!stagedNeighborUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedNeighborUpdates not empty after flush");
}
if (!stagedDeleteUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedDeleteUpdates not empty after flush");
}
if (!stagedInsertNeighborUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedInsertNeighborUpdates not empty after flush");
}
```
**Copilot AI** (Nov 30, 2025)
Missing validation checks for `stagedRepairUpdates` and `pendingDeleteIds` after flush. These should also be empty before serialization to ensure a consistent index state. Add:

```cpp
if (!stagedRepairUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedRepairUpdates not empty after flush");
}
if (!pendingDeleteIds.empty()) {
    throw std::runtime_error("Serialization error: pendingDeleteIds not empty after flush");
}
```

Consider calling `flushDeleteBatch()` in addition to `flushBatch()` at line 263.
```cpp
    delete results;
}

TEST_F(HNSWDiskIndexTest, StagedRepairTest) {
```
**Copilot AI** (Nov 30, 2025)
[nitpick] The test name `StagedRepairTest` is somewhat ambiguous. Consider renaming it to `OpportunisticGraphRepairTest` or `LazyEdgeCleanupTest` to better describe what the test verifies: that deleted edges are opportunistically cleaned up during search operations.
Suggested change:

```diff
-TEST_F(HNSWDiskIndexTest, StagedRepairTest) {
+TEST_F(HNSWDiskIndexTest, OpportunisticGraphRepairTest) {
```