
Conversation

@dor-forer
Collaborator

Describe the changes in the pull request

A clear and concise description of what the PR is solving.

Which issues this PR fixes

  1. #...
  2. MOD...

Main objects this PR modified

  1. ...
  2. ...

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Copilot finished reviewing on behalf of dor-forer November 27, 2025 15:09
Contributor

Copilot AI left a comment


Pull request overview

This PR implements vector deletion functionality with batch processing and graph repair for the HNSW disk-based index. The changes introduce a complete deletion pipeline that marks vectors as deleted, batches deletion operations for efficiency, and repairs the graph structure to maintain connectivity and search quality after deletions.

Key Changes

  • Implemented batch deletion processing with configurable threshold (deleteBatchThreshold) to amortize disk I/O costs
  • Added graph repair logic that uses heuristic-based neighbor selection to maintain graph quality when nodes are deleted
  • Introduced separate staging areas for insert operations, delete operations, and opportunistic repair updates to avoid conflicts
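As a rough sketch, the mark-then-batch flow described above can be modeled like this. The names pendingDeleteIds, deleteBatchThreshold, and processDeleteBatch come from the PR, but the bodies here are simplified stand-ins with no disk I/O or graph repair:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Illustrative model only: deletions accumulate in memory and are flushed as
// a batch once the threshold is reached, amortizing disk I/O.
struct DeferredDeleteSketch {
    std::unordered_set<uint32_t> pendingDeleteIds;
    std::size_t deleteBatchThreshold = 10;
    std::size_t batchesProcessed = 0;

    void markDelete(uint32_t id) {
        pendingDeleteIds.insert(id); // mark-only: no disk write yet
        if (pendingDeleteIds.size() >= deleteBatchThreshold)
            processDeleteBatch();
    }

    void processDeleteBatch() {
        // The real implementation removes vectors from disk and repairs the
        // neighbor lists of surviving nodes here.
        ++batchesProcessed;
        pendingDeleteIds.clear();
    }
};
```

With the default threshold of 10, marking 25 distinct ids triggers two batch flushes and leaves 5 deletions pending.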

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.

Summary per file:

  • tests/unit/test_hnsw_disk.cpp: Added comprehensive test suites for batch deletion, interleaved insert/delete operations, staged repair, bidirectional edge updates, unidirectional edge cleanup, and graph repair with heuristics
  • tests/benchmark/run_files/bm_hnsw_disk_single_fp32.cpp: Updated a comment to reflect the dataset size (a comment-code mismatch remains)
  • tests/benchmark/bm_initialization/bm_hnsw_disk_initialize_fp32.h: Added TopK_DeleteLabel benchmark registration for measuring search performance after deletions with graph repair
  • tests/benchmark/bm_common.h: Implemented the TopK_HNSW_DISK_DeleteLabel benchmark, which tests search quality and performance after vector deletions with full graph repair
  • src/VecSim/algorithms/hnsw/hnsw_disk.h: Core implementation of the deletion logic, including pendingDeleteIds, processDeleteBatch, separate staging areas, opportunistic repair in getNeighbors, and bidirectional edge repair during deletion
  • src/VecSim/algorithms/hnsw/hnsw_disk_serializer.h: Updated serialization validation and restoration to handle the new staging-area data structures (stagedInsertUpdates, stagedDeleteUpdates, stagedRepairUpdates, stagedInsertNeighborUpdates)


@dor-forer dor-forer requested a review from Copilot November 30, 2025 08:00
Copilot finished reviewing on behalf of dor-forer November 30, 2025 08:05

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 18 comments.



Comment on lines +298 to +300
inline uint64_t makeRepairKey(idType node_id, size_t level) const {
return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
}

Copilot AI Nov 30, 2025


If idType is uint32_t, the shift node_id << 32 is undefined behavior: shifting a 32-bit value left by its full width is UB, and casting the result to uint64_t afterwards is too late. The operand must be cast to uint64_t before the shift.
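A hedged sketch of a safe packing helper: the operand is widened before shifting, and the level is masked as a later review comment suggests. The inverse helpers are added here purely for illustration and are not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Widen BEFORE shifting: a bare uint32_t shifted left by 32 would be UB.
static uint64_t makeRepairKey(uint32_t node_id, uint64_t level) {
    return (static_cast<uint64_t>(node_id) << 32) |
           (level & 0xFFFFFFFFu); // mask guards against oversized levels
}

// Round-trip helpers for sanity checking (hypothetical, not in the PR).
static uint32_t keyNode(uint64_t key) { return static_cast<uint32_t>(key >> 32); }
static uint32_t keyLevel(uint64_t key) { return static_cast<uint32_t>(key); }
```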

Comment on lines 2462 to 2465
auto raw_it = rawVectorsInRAM.find(internalId);
if (raw_it != rawVectorsInRAM.end()) {
rawVectorsInRAM.erase(raw_it);
}

Copilot AI Nov 30, 2025


Duplicate raw vector cleanup. The raw vector is erased from rawVectorsInRAM here in markDelete() (lines 2462-2465), and then erased again in processDeleteBatch() at lines 1898-1901. While the second erase is harmless (the iterator won't find the entry), this duplication is confusing and could lead to maintenance issues. Choose one location for this cleanup operation.

Suggested change
auto raw_it = rawVectorsInRAM.find(internalId);
if (raw_it != rawVectorsInRAM.end()) {
rawVectorsInRAM.erase(raw_it);
}
// Raw vector cleanup is handled in processDeleteBatch(), not here.

Comment on lines 828 to +903
const void* raw_vector_data = getRawVector(update.node_id);
if (raw_vector_data == nullptr) {
this->log(VecSimCommonStrings::LOG_WARNING_STRING,
"WARNING: Skipping graph update for node %u at level %zu - no raw vector data available",
update.node_id, update.level);
continue;
}

Copilot AI Nov 30, 2025


Inconsistent null-handling between graph updates and neighbor updates. The graph update section checks for null raw vector data (lines 897-903), but the neighbor update section at lines 959-963 does not perform the same check before calling serializeGraphValue(). This could lead to undefined behavior if getRawVector() returns nullptr for neighbor updates. Apply the same null check pattern to both sections.
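One way to keep the two paths consistent is to funnel both through a single guarded helper. The sketch below is an assumption about shape only; getRawVector, serializeGraphValue, and the map backing them are stand-ins, not the real VecSim signatures:

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in storage: ids with no entry model getRawVector() returning nullptr.
static std::map<unsigned, std::string> rawVectors = {{1, "v1"}, {2, "v2"}};

static const std::string *getRawVector(unsigned node_id) {
    auto it = rawVectors.find(node_id);
    return it == rawVectors.end() ? nullptr : &it->second;
}

// Shared guard: both the graph-update and neighbor-update paths would call
// this, so a missing raw vector is skipped identically in both places.
static bool stageUpdate(unsigned node_id, std::string &out) {
    const std::string *raw = getRawVector(node_id);
    if (raw == nullptr)
        return false; // skip; never hand nullptr to serializeGraphValue()
    out = "serialized:" + *raw; // placeholder for serializeGraphValue()
    return true;
}
```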

Comment on lines +294 to +295
std::cout << "Not enough non-deleted ground truth results to compare against (only "
<< filtered_res->results.size() << " out of " << k << " requested)" << std::endl;

Copilot AI Nov 30, 2025


The error message uses std::cout instead of the benchmark error reporting mechanism. This message should be reported through st.SkipWithError() or similar benchmark reporting facility, as it indicates a test configuration problem that should fail the benchmark rather than being printed to console.

Suggested change
std::cout << "Not enough non-deleted ground truth results to compare against (only "
<< filtered_res->results.size() << " out of " << k << " requested)" << std::endl;
st.SkipWithError(
("Not enough non-deleted ground truth results to compare against (only " +
std::to_string(filtered_res->results.size()) + " out of " +
std::to_string(k) + " requested)").c_str());

};

vecsim_stl::vector<NeighborUpdate> stagedNeighborUpdates;
// Can maybe merge with stagedInsertUpdates

Copilot AI Nov 30, 2025


The comment "Can maybe merge with stagedInsertUpdates" suggests uncertainty about the design. This indicates potential code duplication or unclear separation of concerns. Either merge these structures if they serve the same purpose, or clarify in the comment why they need to remain separate. Leaving "maybe" comments in production code creates technical debt.

Suggested change
// Can maybe merge with stagedInsertUpdates
// stagedInsertNeighborUpdates is kept separate from stagedInsertUpdates because it specifically tracks neighbor list updates
// required for bidirectional connections during insert operations, whereas stagedInsertUpdates handles other insert-related state.

@dor-forer dor-forer requested a review from Copilot November 30, 2025 12:38
Copilot finished reviewing on behalf of dor-forer November 30, 2025 12:42

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.



Comment on lines +964 to +968
const void* raw_vector_data = getRawVector(node_id);

// Serialize with new format and add to batch
std::string graph_value = serializeGraphValue(raw_vector_data, updated_neighbors);
neighborBatch.Put(cf, neighborKey.asSlice(), graph_value);

Copilot AI Nov 30, 2025


Missing null check for raw_vector_data before use. If getRawVector(node_id) returns nullptr (which can happen, as the check at line 903 shows), the code passes a nullptr to serializeGraphValue(), potentially causing a crash. This is inconsistent with the null check done for graphUpdates at lines 902-908. A check should be added to either skip the update or handle the missing data explicitly.


// Helper to create a unique key for (node_id, level) pair for hash map
inline uint64_t makeRepairKey(idType node_id, size_t level) const {
return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);

Copilot AI Nov 30, 2025


The makeRepairKey function could produce collisions if level exceeds 32 bits. While HNSW levels are typically small (< 20), the cast static_cast<uint64_t>(level) from size_t doesn't mask the value. On 64-bit systems where size_t is 64 bits, a large level value (> 2^32) would overflow into the node_id portion of the key, causing incorrect behavior. Consider adding a mask: (static_cast<uint64_t>(node_id) << 32) | (static_cast<uint64_t>(level) & 0xFFFFFFFF) or asserting that level < (1ULL << 32).

Suggested change
return (static_cast<uint64_t>(node_id) << 32) | static_cast<uint64_t>(level);
return (static_cast<uint64_t>(node_id) << 32) | (static_cast<uint64_t>(level) & 0xFFFFFFFF);

Comment on lines 1869 to 1870
vecsim_stl::vector<idType> new_neighbor_neighbors(this->allocator);
getNeighbors(new_neighbor_id, level, new_neighbor_neighbors);

Copilot AI Nov 30, 2025


Calling getNeighbors(new_neighbor_id, level, new_neighbor_neighbors) inside the repair loop could return stale or incorrect data. Since stageDeleteUpdate has already been called for neighbor_id (line 1860), if new_neighbor_id was also a neighbor of the deleted node, its staged update may have already been written to stagedDeleteMap. However, getNeighbors will check stagedDeleteMap and return the staged (possibly incomplete) neighbor list before all repairs are finished. This could lead to inconsistent bidirectional edge handling. Consider collecting all updates first, then applying bidirectional repairs in a second pass, or tracking which nodes have been updated in this batch.
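The collect-first, apply-second idea can be sketched with a toy adjacency map standing in for the on-disk graph (illustrative only): pass one computes every repaired neighbor list against an unmodified snapshot, pass two applies them, so no repair ever observes another node's half-applied staged list.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Graph = std::map<uint32_t, std::vector<uint32_t>>;

// Remove deleted_id from every surviving neighbor list, in two passes.
static void repairAfterDelete(Graph &g, uint32_t deleted_id) {
    // Pass 1: compute repaired lists against the original, unmodified graph.
    Graph staged;
    for (const auto &[node, nbrs] : g) {
        if (node == deleted_id)
            continue;
        std::vector<uint32_t> repaired;
        for (uint32_t n : nbrs)
            if (n != deleted_id)
                repaired.push_back(n);
        staged[node] = std::move(repaired);
    }
    // Pass 2: apply all staged lists at once; no pass-1 read saw pass-2 output.
    for (auto &[node, nbrs] : staged)
        g[node] = std::move(nbrs);
    g.erase(deleted_id);
}
```

The real index would additionally run the neighbor-selection heuristic in pass 1 to pick replacement edges; the two-phase structure is the point here.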

@dor-forer dor-forer requested a review from Copilot November 30, 2025 14:51
Copilot finished reviewing on behalf of dor-forer November 30, 2025 14:55

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.



Comment on lines +205 to +208
// Mutable to allow staging from const search methods.
// IMPORTANT: This class is NOT thread-safe. All operations (including const methods like
// getNeighbors and search) must be called from a single thread. The mutable fields below
// are modified during read operations for opportunistic graph cleanup.

Copilot AI Nov 30, 2025


The comment states "This class is NOT thread-safe" and mentions that "All operations (including const methods like getNeighbors and search) must be called from a single thread." However, using mutable fields that are modified in const methods creates a misleading API. The const qualifier typically implies thread-safety for concurrent reads. Consider either:

  1. Removing the const qualifier from methods that modify these fields (like getNeighbors)
  2. Implementing proper synchronization for these mutable fields
  3. Documenting the single-threaded constraint more prominently in the class documentation
Suggested change
// Mutable to allow staging from const search methods.
// IMPORTANT: This class is NOT thread-safe. All operations (including const methods like
// getNeighbors and search) must be called from a single thread. The mutable fields below
// are modified during read operations for opportunistic graph cleanup.
// NOTE: Methods that modify stagedRepairUpdates and stagedRepairMap are NOT const.
// IMPORTANT: This class is NOT thread-safe. All operations (including getNeighbors and search)
// must be called from a single thread. The mutable fields below are modified during read
// operations for opportunistic graph cleanup.

inline bool filterDeletedNodes(vecsim_stl::vector<idType>& neighbors) const {
size_t original_size = neighbors.size();
auto new_end = std::remove_if(neighbors.begin(), neighbors.end(),
[this](idType id) { return id >= curElementCount || isMarkedDeleted(id); });

Copilot AI Nov 30, 2025


The filterDeletedNodes method uses std::remove_if which requires dereferencing iterators to call the lambda. Since this is a hot path (called during neighbor retrieval), consider checking if isMarkedDeleted has proper bounds checking. If id >= curElementCount is checked before calling isMarkedDeleted(id), the second check in isMarkedDeleted may be redundant. Verify the performance impact of this double-check pattern.

Suggested change
[this](idType id) { return id >= curElementCount || isMarkedDeleted(id); });
[this](idType id) { return isMarkedDeleted(id); });

@dor-forer dor-forer requested a review from Copilot November 30, 2025 15:42
Copilot finished reviewing on behalf of dor-forer November 30, 2025 15:46

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.



Comment on lines +1855 to +1861
bool is_bidirectional = false;
for (idType nn : neighbor_neighbors) {
if (nn == deleted_id) {
is_bidirectional = true;
break;
}
}

Copilot AI Nov 30, 2025


The bidirectional edge check performs a linear search through neighbor_neighbors for each neighbor. For large graphs with many neighbors, this results in O(neighbors * neighbors.size()) complexity. Consider using a hash set for O(1) lookup:

std::unordered_set<idType> neighbor_set(neighbor_neighbors.begin(), neighbor_neighbors.end());
bool is_bidirectional = neighbor_set.find(deleted_id) != neighbor_set.end();

This would be especially beneficial when M is large.

Suggested change
bool is_bidirectional = false;
for (idType nn : neighbor_neighbors) {
if (nn == deleted_id) {
is_bidirectional = true;
break;
}
}
std::unordered_set<idType> neighbor_set(neighbor_neighbors.begin(), neighbor_neighbors.end());
bool is_bidirectional = neighbor_set.find(deleted_id) != neighbor_set.end();

* Threshold for batching delete operations.
* When the number of pending deletions reaches this value, the deletions are processed in a batch.
*/
size_t deleteBatchThreshold = 10;

Copilot AI Nov 30, 2025


[nitpick] The deleteBatchThreshold is hardcoded to 10, which is low; unlike batchThreshold (which also defaults to 10 for construction but can be configured), it cannot be tuned. Consider:

  1. Making this configurable through HNSWParams
  2. Documenting why 10 was chosen as the default
  3. Using a higher value (e.g., 100) for better batching efficiency, similar to what's tested in the BatchedDeletionTest
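Making the threshold configurable might look like the following; the params struct, field, and resolver here are hypothetical, not the existing HNSWParams layout:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical construction-time parameter with a documented default.
struct HNSWDiskDeleteParams {
    // 0 means "use the library default". The default itself should be chosen
    // and documented from benchmarks, e.g. the 100 used in BatchedDeletionTest.
    std::size_t deleteBatchThreshold = 0;
};

static std::size_t resolveDeleteBatchThreshold(const HNSWDiskDeleteParams &p) {
    constexpr std::size_t kDefault = 100;
    return p.deleteBatchThreshold ? p.deleteBatchThreshold : kDefault;
}
```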

Comment on lines +269 to 277
if (!stagedInsertUpdates.empty()) {
throw std::runtime_error("Serialization error: stagedInsertUpdates not empty after flush");
}
if (!stagedNeighborUpdates.empty()) {
throw std::runtime_error("Serialization error: stagedNeighborUpdates not empty after flush");
if (!stagedDeleteUpdates.empty()) {
throw std::runtime_error("Serialization error: stagedDeleteUpdates not empty after flush");
}
if (!stagedInsertNeighborUpdates.empty()) {
throw std::runtime_error("Serialization error: stagedInsertNeighborUpdates not empty after flush");
}

Copilot AI Nov 30, 2025


Missing validation checks for stagedRepairUpdates and pendingDeleteIds after flush. These should also be empty before serialization to ensure a consistent index state. Add:

if (!stagedRepairUpdates.empty()) {
    throw std::runtime_error("Serialization error: stagedRepairUpdates not empty after flush");
}
if (!pendingDeleteIds.empty()) {
    throw std::runtime_error("Serialization error: pendingDeleteIds not empty after flush");
}

Consider calling flushDeleteBatch() in addition to flushBatch() at line 263.

delete results;
}

TEST_F(HNSWDiskIndexTest, StagedRepairTest) {

Copilot AI Nov 30, 2025


[nitpick] The test name StagedRepairTest is somewhat ambiguous. Consider renaming to OpportunisticGraphRepairTest or LazyEdgeCleanupTest to better describe what the test is verifying: that deleted edges are opportunistically cleaned up during search operations.

Suggested change
TEST_F(HNSWDiskIndexTest, StagedRepairTest) {
TEST_F(HNSWDiskIndexTest, OpportunisticGraphRepairTest) {
