feat: add exact content deduplication in document creation #861
Merged
+426
−4
Changes from 1 commit
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -428,6 +428,19 @@ async def query_documents( | |
| return docs | ||
|
|
||
|
|
||
| def _normalize_content(content: str) -> str: | ||
| """Normalize document content for exact-match deduplication. | ||
|
|
||
| Content is compared after trimming surrounding whitespace and lowercasing. | ||
|
|
||
| The SQL filter in ``create_documents`` must stay in sync with this: | ||
| ``lower(regexp_replace(content, '^\\s+|\\s+$', '', 'g'))``. Postgres' | ||
| ``trim()`` only strips spaces, so a regex is used to match Python's | ||
| ``str.strip()`` across all whitespace. | ||
| """ | ||
| return content.strip().lower() | ||
|
|
||
|
|
||
| async def create_documents( | ||
| db: AsyncSession, | ||
| documents: list[schemas.DocumentCreate], | ||
|
|
@@ -440,12 +453,17 @@ async def create_documents( | |
| """ | ||
| Create multiple documents with optional duplicate detection. | ||
|
|
||
| The ``deduplicate`` flag additionally enables semantic (cosine-similarity) | ||
| dedup via ``is_rejected_duplicate`` for documents that survive the exact | ||
| deduplication check. | ||
|
|
||
| Args: | ||
| db: Database session | ||
| documents: List of document creation schemas | ||
| workspace_name: Name of the workspace | ||
| observer: Name of the observing peer | ||
| observed: Name of the observed peer | ||
| deduplicate: Enable semantic duplicate detection | ||
|
|
||
| Returns: | ||
| List of DocumentCreate schemas that were actually inserted (excludes | ||
|
|
@@ -456,8 +474,54 @@ async def create_documents( | |
| # Store (document_model, embedding) pairs - IDs aren't available until after commit | ||
| docs_with_embeddings: list[tuple[models.Document, list[float]]] = [] | ||
|
|
||
| # exact-content dedup (independent of `deduplicate`): pre-fetch | ||
| # existing live documents whose normalized content matches anything in this | ||
| # batch, scoped to (workspace, observer, observed). The SQL normalization must | ||
| # mirror _normalize_content. | ||
| batch_normalized: set[str] = {_normalize_content(d.content) for d in documents} | ||
| existing_by_normalized: dict[str, models.Document] = {} | ||
| if batch_normalized: | ||
| normalized_content_sql = func.lower( | ||
| func.regexp_replace(models.Document.content, r"^\s+|\s+$", "", "g") | ||
| ) | ||
| existing_result = await db.execute( | ||
| select(models.Document).where( | ||
| models.Document.workspace_name == workspace_name, | ||
| models.Document.observer == observer, | ||
| models.Document.observed == observed, | ||
| models.Document.deleted_at.is_(None), | ||
| normalized_content_sql.in_(batch_normalized), | ||
| ) | ||
| ) | ||
| for existing_doc in existing_result.scalars(): | ||
| # If multiple historical rows share normalized content, reinforcing | ||
| # one is sufficient; keep the first. | ||
| existing_by_normalized.setdefault( | ||
| _normalize_content(existing_doc.content), existing_doc | ||
| ) | ||
|
|
||
| # Tracks normalized content already accepted from this batch so exact | ||
| # duplicates within a single inference call collapse to one document. | ||
| seen_in_batch: set[str] = set() | ||
|
|
||
| for doc in documents: | ||
| try: | ||
| normalized_content = _normalize_content(doc.content) | ||
|
|
||
| # Exact-match dedup, always on: | ||
| # 1) collapse exact duplicates within this batch (drop silently). | ||
| if normalized_content in seen_in_batch: | ||
|
Contributor
Do we want to check for dups earlier, pre-embedding? I know it won't impact the embedding quality, so it would just be an optimization change. |
||
| continue | ||
| seen_in_batch.add(normalized_content) | ||
|
|
||
| # 2) drop exact duplicates of an existing live document, recording | ||
| # the re-derivation as reinforcement on the existing row. | ||
| existing_match = existing_by_normalized.get(normalized_content) | ||
| if existing_match is not None: | ||
| existing_match.times_derived = models.Document.times_derived + 1 | ||
| await db.flush() | ||
| continue | ||
|
|
||
| # for each document, if deduplicate is True, perform a process | ||
| # that checks against existing documents and either rejects this document | ||
| # as a duplicate OR deletes an existing document that is a duplicate. | ||
|
|
||
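
The two exact-match stages in the diff above (collapse duplicates within the batch, then reinforce an existing live document instead of inserting) can be sketched without a database. This is a minimal illustration, not the PR's implementation: `StoredDoc`, `exact_dedup`, and the sample strings are hypothetical, and an in-memory list stands in for the SQL prefetch.

```python
from dataclasses import dataclass


def normalize_content(content: str) -> str:
    """Mirror of the diff's _normalize_content: trim whitespace, then lowercase."""
    return content.strip().lower()


@dataclass
class StoredDoc:
    """Hypothetical stand-in for a persisted, live document row."""

    content: str
    times_derived: int = 1


def exact_dedup(incoming: list[str], existing: list[StoredDoc]) -> list[str]:
    """Return the incoming contents that would proceed to insertion."""
    # If several stored rows share normalized content, keep the first.
    existing_by_normalized: dict[str, StoredDoc] = {}
    for row in existing:
        existing_by_normalized.setdefault(normalize_content(row.content), row)

    seen_in_batch: set[str] = set()
    accepted: list[str] = []
    for content in incoming:
        key = normalize_content(content)
        # 1) Collapse exact duplicates within this batch.
        if key in seen_in_batch:
            continue
        seen_in_batch.add(key)
        # 2) Reinforce an existing row instead of inserting a duplicate.
        match = existing_by_normalized.get(key)
        if match is not None:
            match.times_derived += 1
            continue
        accepted.append(content)
    return accepted


stored = [StoredDoc("Likes tea")]
accepted = exact_dedup(["  likes TEA ", "Owns a cat", "owns a cat\n"], stored)
print(accepted)
print(stored[0].times_derived)
```

Because the key is added to `seen_in_batch` before the existing-row lookup, a batch that repeats content already stored reinforces the stored row only once, which matches the ordering in the diff.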
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
PostgreSQL regexp_replace \s whitespace class match non-breaking space unicode

💡 Result:

In PostgreSQL, the `\s` shorthand in regular expressions represents the `[[:space:]]` POSIX character class [1][2]. Whether this class includes the non-breaking space (U+00A0) depends on your database's locale (`LC_CTYPE`) and encoding [3]. In many standard environments, `[[:space:]]` does not include the non-breaking space, which means `\s` will fail to match it [4][3].

To reliably match or replace non-breaking spaces along with standard whitespace, you must explicitly include the non-breaking space character in your regular expression [4][5]. You can use the following approach to replace both standard whitespace and non-breaking spaces globally:

`SELECT regexp_replace(your_column, '[\s\u00A0]+', ' ', 'g');`

In this expression:
- `\s` matches standard whitespace characters defined by your system's `[[:space:]]` class [1][2].
- `\u00A0` explicitly adds the non-breaking space to the character class [4].
- The `g` flag ensures all occurrences are replaced, rather than just the first one [4].

If you need to target a wider range of Unicode whitespace or separator characters, you can expand the character class further [6]:

`SELECT regexp_replace(your_column, '[\s\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g');`

Because the behavior of `\s` is locale-dependent, testing against your specific data and locale settings is recommended [3]. If you are only dealing with simple replacements, the non-regex `replace()` function is often more performant and avoids locale-related ambiguities [7].
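
The whitespace gap described in this result can be reproduced on the Python side alone. The sketch below is an emulation under a stated assumption: `re.ASCII` stands in for a database locale whose `[[:space:]]` class excludes U+00A0. It does not run PostgreSQL, and the sample string is made up.

```python
import re

NBSP = "\u00a0"  # non-breaking space
raw = f"{NBSP}Likes tea{NBSP}"

# str.strip() uses Unicode whitespace rules, so it removes the NBSPs.
python_trimmed = raw.strip()

# With re.ASCII, \s covers only space, \t, \n, \r, \f and \v, which
# approximates a locale where \s does not match U+00A0.
ascii_trimmed = re.sub(r"^\s+|\s+$", "", raw, flags=re.ASCII)

print(python_trimmed == "Likes tea")  # True
print(ascii_trimmed == raw)           # True: the NBSPs survive
```

Under that assumption, the two normalized forms differ, so an `IN (...)` filter built from the Python-normalized batch would not find the stored row.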
🏁 Script executed:
Repository: plastic-labs/honcho
Length of output: 10062
🏁 Script executed:
Repository: plastic-labs/honcho
Length of output: 255
🏁 Script executed:
Repository: plastic-labs/honcho
Length of output: 197
🌐 Web query:
PostgreSQL lower() locale-dependent Unicode casing Python str.lower difference

💡 Result:

The primary difference between PostgreSQL's `lower` function and Python's `str.lower` method lies in their reliance on locale settings and their adherence to Unicode standards [1][2][3][4].

PostgreSQL `lower`

PostgreSQL's `lower` function is locale-dependent [1][2]. Its behavior is governed by the database's collation (specifically the `LC_CTYPE` setting of the collation) [5][6][7].
- When using libc-based collations, `lower` relies on the operating system's locale definitions, which can vary significantly across environments [6][7].
- When using ICU-based collations, `lower` leverages ICU for Unicode-compliant case mapping, which provides more consistent and robust support for various languages, including those with special casing rules like Turkish ('I' to 'ı') [5][6][8][7].
- Because it is tied to the collation, the same query might produce different results if the database is initialized with a different locale or if the expression uses an explicit `COLLATE` clause [5][6].

Python `str.lower`

Python's `str.lower` method implements a fixed, Unicode-standard-based conversion that is not locale-aware [3][4].
- It applies the case mapping defined by the Unicode standard, which does not change regardless of the system's locale settings [3][4].
- It does not account for language-specific requirements such as the Turkish "dotted/dotless I" distinction (where 'I' should map to 'ı' rather than 'i') [3][4].
- For case-insensitive comparisons, Python's `str.casefold` is generally preferred over `str.lower` because it handles more complex Unicode mappings (e.g., German 'ß' to 'ss') to facilitate caseless matching [3].

Summary of Differences

1. Locale Sensitivity: PostgreSQL `lower` is sensitive to the configured collation (locale) [1][5], whereas Python `str.lower` is locale-insensitive and always follows the same Unicode mapping [3][4].
2. Language Support: PostgreSQL can support language-specific casing rules (e.g., Turkish) if the appropriate ICU collation is used [5][8]. Python's built-in `str.lower` does not support these unless external libraries like PyICU are used [3].
3. Consistency: Python provides predictable, identical results across all environments [4]. PostgreSQL behavior for `lower` depends on the server's environment or collation configuration [1][6].
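
A few inputs make the Python half of this comparison concrete. The snippet only exercises built-in `str` methods; what PostgreSQL's `lower()` returns for the same characters depends on the collation and is not shown here.

```python
dotted_capital_i = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
sharp_s = "\u00df"           # ß, LATIN SMALL LETTER SHARP S

# str.lower() is locale-independent: "I" always becomes "i",
# never the Turkish dotless "ı".
print("I".lower() == "i")  # True

# İ lowercases to "i" plus a combining dot above, so the result
# is two code points long.
lowered_i = dotted_capital_i.lower()
print(len(lowered_i))  # 2

# ß is already lowercase, so lower() leaves it alone, while
# casefold() maps it to "ss" for caseless matching.
print(sharp_s.lower() == sharp_s)  # True
print(sharp_s.casefold())          # ss
print("STRASSE".casefold() == "stra\u00dfe".casefold())  # True
```

Any of these characters in document content is a case where a collation-dependent `lower()` on the database side could plausibly disagree with the Python-side key.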
🏁 Script executed:
Repository: plastic-labs/honcho
Length of output: 197
Unicode normalization can drift between Python and PostgreSQL here.
`str.strip()`/`str.lower()` are Unicode-aware, while `regexp_replace(..., '\\s', ...)` and PostgreSQL `lower()` are locale-dependent, so the prefetch can miss rows for inputs with non-ASCII whitespace or locale-specific casing and insert a duplicate. Match candidates in Python, or document the supported normalization boundary.
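
One reading of "match candidates in Python" is to fetch candidate rows by scope only and apply a single normalizer on the application side. The sketch below assumes that reading; `match_existing` and the `(row_id, content)` tuples are hypothetical and stand in for rows already filtered by workspace, observer, observed, and live status.

```python
def normalize_content(content: str) -> str:
    """Single source of truth for normalization: trim, then lowercase."""
    return content.strip().lower()


def match_existing(
    batch: list[str], candidate_rows: list[tuple[int, str]]
) -> dict[str, int]:
    """Map each normalized batch key to the first stored row id that matches it.

    candidate_rows is assumed to be pre-filtered by scope in SQL; content
    comparison happens here, so SQL and Python normalization cannot drift.
    """
    wanted = {normalize_content(content) for content in batch}
    matches: dict[str, int] = {}
    for row_id, content in candidate_rows:
        key = normalize_content(content)
        if key in wanted:
            matches.setdefault(key, row_id)
    return matches


rows = [(1, "\u00a0Likes tea\u00a0"), (2, "Owns a cat")]
matches = match_existing(["likes tea"], rows)
print(matches)
```

The trade-off is that every live row in scope is loaded instead of only the SQL-matched ones, which may be too costly for large scopes; documenting the supported normalization boundary is the cheaper alternative the comment mentions.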