feat: astra db chunks deletion based on metadata field #5537

Open
wants to merge 7 commits into
base: main

smatiolids

Purpose
This PR addresses the need to reload specific documents without affecting others. To achieve this, a new option, "deletion_field", has been introduced.

Functionality

When "deletion_field" is set (e.g., "file_path"), the system will delete all documents in the target collection where metadata["file_path"] matches the corresponding value in the incoming documents.
This ensures that chunks from the specific file are removed before reloading it, preventing duplicates or conflicts.
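The delete-then-reload behavior described above can be sketched with an in-memory list standing in for the collection. This is only an illustration under stated assumptions: `reload_documents`, the dict-shaped documents, and the sample file names are hypothetical and are not the component's code or the Astra DB API.

```python
from typing import Any


def reload_documents(
    collection: list[dict[str, Any]],
    incoming: list[dict[str, Any]],
    deletion_field: str,
) -> list[dict[str, Any]]:
    """Drop stored chunks whose metadata[deletion_field] matches any incoming
    document, then append the incoming documents (hypothetical helper)."""
    # Values of the deletion field found in the incoming batch.
    delete_values = {
        doc["metadata"][deletion_field]
        for doc in incoming
        if deletion_field in doc.get("metadata", {})
    }
    # Keep only the stored chunks that do not match any of those values.
    kept = [
        doc
        for doc in collection
        if doc.get("metadata", {}).get(deletion_field) not in delete_values
    ]
    return kept + incoming


# Hypothetical sample data: two stale chunks of a.pdf and one chunk of b.pdf.
stored = [
    {"content": "a-old-1", "metadata": {"file_path": "a.pdf"}},
    {"content": "a-old-2", "metadata": {"file_path": "a.pdf"}},
    {"content": "b-1", "metadata": {"file_path": "b.pdf"}},
]
new_chunks = [{"content": "a-new-1", "metadata": {"file_path": "a.pdf"}}]

result = reload_documents(stored, new_chunks, "file_path")
print([doc["content"] for doc in result])  # ['b-1', 'a-new-1']
```

Only the chunks of `a.pdf` are replaced; the chunk from `b.pdf` is left untouched, which is the point of scoping the deletion to a metadata field.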

… document management

- Introduced a new 'deletion_field' input to specify a metadata field for deleting documents before loading new data.
- Enhanced the _add_documents_to_vector_store method to handle document deletion based on the specified field, improving data management capabilities.
@dosubot bot added the size:S (this PR changes 10-29 lines, ignoring generated files) and enhancement (new feature or request) labels Jan 3, 2025

codspeed-hq bot commented Jan 3, 2025

CodSpeed Performance Report

Merging #5537 will degrade performances by 62.5%

Comparing smatiolids:feat/astra_deletion_based_on_metadata (cbd1635) with main (16ff8eb)

Summary

⚡ 2 improvements
❌ 1 regressions
✅ 12 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | smatiolids:feat/astra_deletion_based_on_metadata | Change |
|---|---|---|---|
| test_get_and_cache_all_types_dict | 1 ms | 2.8 ms | -62.5% |
| test_successful_run_with_input_type_text | 271.3 ms | 189.1 ms | +43.5% |
| test_successful_run_with_output_type_debug | 271.9 ms | 214.7 ms | +26.69% |

smatiolids and others added 2 commits January 3, 2025 18:07
…ove readability.

- Optimized the deletion logic by using a set comprehension to eliminate duplicates when gathering delete values from documents.
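As a rough illustration of why a set comprehension helps here, the sketch below gathers the delete values from several chunks of the same file. The sample documents and variable names are assumptions made for the example; the filter shape follows the `delete_many` call shown in this PR's diff. The set is turned into a list because Python's standard `json` module cannot serialize a set.

```python
# Hypothetical incoming chunks: two from a.pdf, one from b.pdf, one without the field.
docs = [
    {"metadata": {"file_path": "a.pdf"}},
    {"metadata": {"file_path": "a.pdf"}},
    {"metadata": {"file_path": "b.pdf"}},
    {"metadata": {}},
]
field = "file_path"

# A list comprehension repeats the value once per chunk.
as_list = [d["metadata"][field] for d in docs if field in d["metadata"]]

# A set comprehension keeps each value once.
as_set = {d["metadata"][field] for d in docs if field in d["metadata"]}

# Sorted list for a deterministic, JSON-friendly "$in" operand.
delete_values = sorted(as_set)
query = {f"metadata.{field}": {"$in": delete_values}}
print(query)  # {'metadata.file_path': {'$in': ['a.pdf', 'b.pdf']}}
```

With many chunks per file, the deduplicated list keeps the `$in` operand proportional to the number of files rather than the number of chunks.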
@smatiolids changed the title from "Feat/astra deletion based on metadata" to "feat/astra deletion based on metadata" Jan 3, 2025
@smatiolids changed the title from "feat/astra deletion based on metadata" to "feat: astra db deletion chunks based on metadata field" Jan 3, 2025
@github-actions bot added the enhancement label and removed the enhancement label Jan 3, 2025
@smatiolids changed the title from "feat: astra db deletion chunks based on metadata field" to "feat: astra db chunks deletion based on metadata field" Jan 3, 2025
@github-actions bot added the enhancement label and removed the enhancement label Jan 3, 2025
@ogabrielluiz requested a review from erichare January 6, 2025 12:05
@@ -607,6 +616,18 @@ def _add_documents_to_vector_store(self, vector_store) -> None:
                msg = "Vector Store Inputs must be Data objects."
                raise TypeError(msg)

        if documents and self.deletion_field:
            self.log(f"Deleting documents where {self.deletion_field}")

nit: should we remove this log line?

                self.log(f"Deleting documents where {self.deletion_field} matches {delete_values}.")
                collection.delete_many({f"metadata.{self.deletion_field}": {"$in": delete_values}})
            except Exception as e:
                msg = f"Error deleting documents from AstraDBVectorStore: {e}"

Suggested change
- msg = f"Error deleting documents from AstraDBVectorStore: {e}"
+ msg = f"Error deleting documents from AstraDBVectorStore based on '{self.deletion_field}': {e}"
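To see what the suggested message would look like when a deletion fails, here is a minimal sketch. `FailingCollection` and `delete_by_field` are hypothetical stand-ins, and re-raising as `ValueError` is an assumption; the diff shown here does not include the `raise` line.

```python
class FailingCollection:
    """Hypothetical stand-in for a collection whose delete_many call fails."""

    def delete_many(self, query):
        raise RuntimeError("connection reset")


def delete_by_field(collection, deletion_field, delete_values):
    """Delete by metadata field and wrap any failure in a descriptive error."""
    try:
        collection.delete_many({f"metadata.{deletion_field}": {"$in": delete_values}})
    except Exception as e:
        # Message format taken from the suggested change; the exception type is assumed.
        msg = f"Error deleting documents from AstraDBVectorStore based on '{deletion_field}': {e}"
        raise ValueError(msg) from e


try:
    delete_by_field(FailingCollection(), "file_path", ["a.pdf"])
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Naming the field in the message makes it easier to tell a failed metadata-based deletion apart from other vector store errors.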
