feat: add exact content deduplication in document creation by Rajat-Ahuja1997 · Pull Request #861 · plastic-labs/honcho

Rajat-Ahuja1997 · 2026-06-30T17:33:26Z

Summary by CodeRabbit

Bug Fixes
- Improved document creation deduplication by normalizing text (case-insensitive and whitespace-trimmed) before matching.
- Exact duplicates within the same batch are collapsed to a single inserted record.
- Exact matches against existing live documents now reinforce the existing record (updating times_derived atomically) instead of creating a new row.
- When enabled, semantic deduplication continues to work after the exact-dedup step, preserving reinforcement outcomes.
Tests
- Added async coverage for normalized exact-duplicate collapsing, live-record reinforcement, times_derived max-style behavior, and exact/semantic dedup interaction.

coderabbitai · 2026-06-30T17:33:57Z

Walkthrough

Adds exact-content deduplication to create_documents() using normalized text matching, reinforces existing live rows on matches, and keeps semantic deduplication as an additional step. Four async tests cover batch duplicates, reinforcement, incoming times_derived, and interaction with semantic dedup.

Changes

Exact-content deduplication in create_documents

Layer / File(s)	Summary
Normalize helper and dedup logic `src/crud/document.py`	Adds `_normalize_content(content)` and updates `create_documents()` to normalize incoming content, match existing live rows by normalized SQL content, drop duplicate entries within the same batch, increment `times_derived` for exact matches, and update the reinforcement path in `is_rejected_duplicate()`. The docstring now distinguishes exact dedup from semantic dedup.
Exact dedup test cases `tests/crud/test_document.py`	Adds four async tests covering within-batch exact duplicates, reinforcement of an existing live row on a later exact match, exact dedup honoring incoming `times_derived`, and exact dedup behavior when semantic replacement runs with autoflush disabled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the text and trimmed the fuzz,
Lowercased the burrow’s buzz.
Old rows got a gentle nudge today,
New twins hopped away,
And times_derived still found its way.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main change: exact content deduplication during document creation.
Docstring Coverage	✅ Passed	Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch rajat/add-exact-dedupe-in-doc-creation

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/crud/document.py`:
- Around line 431-441: The normalization used by _normalize_content and the
matching SQL in create_documents can diverge for Unicode whitespace and
locale-specific casing, so update the deduplication flow to use the same
normalization boundary on both sides or explicitly constrain/document the
supported cases. In particular, review _normalize_content and the SQL filter it
must stay in sync with, and either move candidate matching to Python using the
same str.strip()/lower() behavior or adjust the PostgreSQL prefetch logic so it
cannot miss equivalent rows and create duplicates.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 69b297d5-4347-4f75-8492-815c2dd5d603

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 7dbe145.

📒 Files selected for processing (2)

src/crud/document.py
tests/crud/test_document.py

coderabbitai · 2026-06-30T17:39:18Z

+def _normalize_content(content: str) -> str:
+    """Normalize document content for exact-match deduplication.
+
+    Content is compared after trimming surrounding whitespace and lowercasing
+
+    The SQL filter in ``create_documents`` must stay in sync with this:
+    ``lower(regexp_replace(content, '^\\s+|\\s+$', '', 'g'))``. Postgres'
+    ``trim()`` only strips spaces, so a regex is used to match Python's
+    ``str.strip()`` across all whitespace.
+    """
+    return content.strip().lower()


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

PostgreSQL regexp_replace \s whitespace class match non-breaking space unicode

💡 Result:

In PostgreSQL, the \s shorthand in regular expressions represents the [[:space:]] POSIX character class [1][2]. Whether this class includes the non-breaking space (U+00A0) depends on your database's locale (LC_CTYPE) and encoding [3]. In many standard environments, [[:space:]] does not include the non-breaking space, which means \s will fail to match it [4][3]. To reliably match or replace non-breaking spaces along with standard whitespace, you must explicitly include the non-breaking space character in your regular expression [4][5]. You can use the following approach to replace both standard whitespace and non-breaking spaces globally: SELECT regexp_replace(your_column, '[\s\u00A0]+', ' ', 'g'); In this expression: - \s matches standard whitespace characters defined by your system's [[:space:]] class [1][2]. - \u00A0 explicitly adds the non-breaking space to the character class [4]. - The g flag ensures all occurrences are replaced, rather than just the first one [4]. If you need to target a wider range of Unicode whitespace or separator characters, you can expand the character class further [6]: SELECT regexp_replace(your_column, '[\s\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g'); Because the behavior of \s is locale-dependent, testing against your specific data and locale settings is recommended [3]. If you are only dealing with simple replacements, the non-regex replace() function is often more performant and avoids locale-related ambiguities [7].

Citations:

1: https://www.postgresql.org/docs/17/functions-matching.html

2: https://www.postgresql.org/docs/current/functions-matching.html

3: https://www.postgresql.org/message-id/17525.1478531130%40sss.pgh.pa.us

4: https://dba.stackexchange.com/questions/345484/nbsp-non-breaking-space-is-not-considered-whitespace-by-postgres

5: https://stackoverflow.com/questions/22699535/trim-trailing-spaces-with-postgresql

6: https://stackoverflow.com/questions/63302656/remove-all-unicode-space-separators-in-postgresql

7: https://stackoverflow.com/questions/36849886/replacing-nonbreaking-spaces-a0-in-postgres

🏁 Script executed:

#!/bin/bash set -euo pipefail # Map relevant files/symbols first git ls-files src/crud/document.py ast-grep outline src/crud/document.py --view expanded || true # Show the relevant section with line numbers sed -n '380,520p' src/crud/document.py # Find the SQL normalization expression and its callers rg -n "regexp_replace\\(|_normalize_content|create_documents|lower\\(" src/crud/document.py src -g '!**/__pycache__/**'

Repository: plastic-labs/honcho

Length of output: 10062

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect exact behavior of PostgreSQL regex whitespace semantics from local docs if present python3 - <<'PY' from pathlib import Path p = Path('src/crud/document.py') text = p.read_text() for needle in ["regexp_replace", "_normalize_content", "strip()", "lower()"]: print(f"FOUND {needle}:", needle in text) PY

Repository: plastic-labs/honcho

Length of output: 255

🏁 Script executed:

#!/bin/bash set -euo pipefail # Read the specific region only if it exists if [ -f src/crud/document.py ]; then nl -ba src/crud/document.py | sed -n '410,460p' fi

Repository: plastic-labs/honcho

Length of output: 197

🌐 Web query:

PostgreSQL lower() locale-dependent Unicode casing Python str.lower difference

💡 Result:

The primary difference between PostgreSQL's lower function and Python's str.lower method lies in their reliance on locale settings and their adherence to Unicode standards [1][2][3][4]. PostgreSQL lower PostgreSQL's lower function is locale-dependent [1][2]. Its behavior is governed by the database's collation (specifically the LC_CTYPE setting of the collation) [5][6][7]. - When using libc-based collations, lower relies on the operating system's locale definitions, which can vary significantly across environments [6][7]. - When using ICU-based collations, lower leverages ICU for Unicode-compliant case mapping, which provides more consistent and robust support for various languages, including those with special casing rules like Turkish ('I' to 'ı') [5][6][8][7]. - Because it is tied to the collation, the same query might produce different results if the database is initialized with a different locale or if the expression uses an explicit COLLATE clause [5][6]. Python str.lower Python's str.lower method implements a fixed, Unicode-standard-based conversion that is not locale-aware [3][4]. - It performs a "simple" case mapping according to the Unicode standard, which does not change regardless of the system's locale settings [3][4]. - It does not account for language-specific requirements such as the Turkish "dotted/dotless I" distinction (where 'I' should map to 'ı' rather than 'i') [3][4]. - For case-insensitive comparisons, Python's str.casefold is generally preferred over str.lower because it handles more complex Unicode mappings (e.g., German 'ß' to 'ss') to facilitate caseless matching [3]. Summary of Differences 1. Locale Sensitivity: PostgreSQL lower is sensitive to the configured collation (locale) [1][5], whereas Python str.lower is locale-insensitive and always follows the same Unicode mapping [3][4]. 2. Language Support: PostgreSQL can support language-specific casing rules (e.g., Turkish) if the appropriate ICU collation is used [5][8]. Python's built-in str.lower does not support these unless external libraries like PyICU are used [3]. 3. Consistency: Python provides predictable, identical results across all environments [4]. PostgreSQL behavior for lower depends on the server's environment or collation configuration [1][6].

Citations:

1: https://www.postgresql.org/docs/14/functions-string.html

2: https://www.postgresql.org/docs/current/functions-string.html

3: https://unicodefyi.com/series/developers-handbook/comparison-and-sorting/

4: More fully implement Unicode's case mappings python/cpython#82029

5: https://www.postgresql.org/docs/18/collation.html

6: https://www.postgresql.org/docs/18/locale.html

7: https://www.postgresql.org/docs/current/collation.html

8: https://www.postgresql.org/message-id/flat/1ecded1b-4ff8-46da-9b94-6e240db2a23b%40manitou-mail.org

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the surrounding create_documents logic and any dedup/uniqueness checks nl -ba src/crud/document.py | sed -n '300,520p'

Repository: plastic-labs/honcho

Length of output: 197

Unicode normalization can drift between Python and PostgreSQL here. str.strip()/str.lower() are Unicode-aware, while regexp_replace(..., '\\s', ...) and PostgreSQL lower() are locale-dependent, so the prefetch can miss rows for inputs with non-ASCII whitespace or locale-specific casing and insert a duplicate. Match candidates in Python, or document the supported normalization boundary.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/crud/document.py` around lines 431 - 441, The normalization used by _normalize_content and the matching SQL in create_documents can diverge for Unicode whitespace and locale-specific casing, so update the deduplication flow to use the same normalization boundary on both sides or explicitly constrain/document the supported cases. In particular, review _normalize_content and the SQL filter it must stay in sync with, and either move candidate matching to Python using the same str.strip()/lower() behavior or adjust the PostgreSQL prefetch logic so it cannot miss equivalent rows and create duplicates.

…ts and existing + 1

coderabbitai

🧹 Nitpick comments (1)

src/crud/document.py (1)

500-508: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Optional: avoid fetching embeddings in the dedup prefetch.

select(models.Document) hydrates the full row, including the large embedding Vector column, but this prefetch only needs content and times_derived (PK is always loaded) to build the map and reinforce matches. Deferring the embedding trims I/O and memory for each matched row.

♻️ Proposed change

+from sqlalchemy.orm import load_only
@@
         existing_result = await db.execute(
             select(models.Document).where(
                 models.Document.workspace_name == workspace_name,
                 models.Document.observer == observer,
                 models.Document.observed == observed,
                 models.Document.deleted_at.is_(None),
                 normalized_content_sql.in_(batch_normalized),
-            )
+            ).options(
+                load_only(models.Document.content, models.Document.times_derived)
+            )
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/crud/document.py` around lines 500 - 508, The dedup prefetch in the
document lookup is loading full Document rows, including the large embedding
field, even though the logic only needs content and times_derived. Update the
query built in the existing_result fetch path to load only the required
columns/attributes while keeping the same filters on models.Document, so the
match map can still be built without hydrating embeddings.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/crud/document.py`:
- Around line 500-508: The dedup prefetch in the document lookup is loading full
Document rows, including the large embedding field, even though the logic only
needs content and times_derived. Update the query built in the existing_result
fetch path to load only the required columns/attributes while keeping the same
filters on models.Document, so the match map can still be built without
hydrating embeddings.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c827cab1-ab88-4da8-8cf0-41298d260349

📥 Commits

Reviewing files that changed from the base of the PR and between a2fd3ba and 1e225f2.

📒 Files selected for processing (2)

src/crud/document.py
tests/crud/test_document.py

feat: add exact content deduplication in document creation

7dbe145

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

Rajat-Ahuja1997 and others added 2 commits June 30, 2026 13:50

feat: add comment for index

a2fd3ba

fix: harden times_derived logic across all callers to use max of inpu…

1e225f2

…ts and existing + 1

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add exact content deduplication in document creation#861

feat: add exact content deduplication in document creation#861
Rajat-Ahuja1997 wants to merge 3 commits into
mainfrom
rajat/add-exact-dedupe-in-doc-creation

Rajat-Ahuja1997 commented Jun 30, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Rajat-Ahuja1997 commented Jun 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Rajat-Ahuja1997 commented Jun 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading