Skip to content

feat: add exact content deduplication in document creation#861

Open
Rajat-Ahuja1997 wants to merge 3 commits into
mainfrom
rajat/add-exact-dedupe-in-doc-creation
Open

feat: add exact content deduplication in document creation#861
Rajat-Ahuja1997 wants to merge 3 commits into
mainfrom
rajat/add-exact-dedupe-in-doc-creation

Conversation

@Rajat-Ahuja1997

@Rajat-Ahuja1997 Rajat-Ahuja1997 commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Bug Fixes
    • Improved document creation deduplication by normalizing text (case-insensitive and whitespace-trimmed) before matching.
    • Exact duplicates within the same batch are collapsed to a single inserted record.
    • Exact matches against existing live documents now reinforce the existing record (updating times_derived atomically) instead of creating a new row.
    • When enabled, semantic deduplication continues to work after the exact-dedup step, preserving reinforcement outcomes.
  • Tests
    • Added async coverage for normalized exact-duplicate collapsing, live-record reinforcement, times_derived max-style behavior, and exact/semantic dedup interaction.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds exact-content deduplication to create_documents() using normalized text matching, reinforces existing live rows on matches, and keeps semantic deduplication as an additional step. Four async tests cover batch duplicates, reinforcement, incoming times_derived, and interaction with semantic dedup.

Changes

Exact-content deduplication in create_documents

Layer / File(s) Summary
Normalize helper and dedup logic
src/crud/document.py
Adds _normalize_content(content) and updates create_documents() to normalize incoming content, match existing live rows by normalized SQL content, drop duplicate entries within the same batch, increment times_derived for exact matches, and update the reinforcement path in is_rejected_duplicate(). The docstring now distinguishes exact dedup from semantic dedup.
Exact dedup test cases
tests/crud/test_document.py
Adds four async tests covering within-batch exact duplicates, reinforcement of an existing live row on a later exact match, exact dedup honoring incoming times_derived, and exact dedup behavior when semantic replacement runs with autoflush disabled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the text and trimmed the fuzz,
Lowercased the burrow’s buzz.
Old rows got a gentle nudge today,
New twins hopped away,
And times_derived still found its way.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: exact content deduplication during document creation.
Docstring Coverage ✅ Passed Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch rajat/add-exact-dedupe-in-doc-creation

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/crud/document.py`:
- Around line 431-441: The normalization used by _normalize_content and the
matching SQL in create_documents can diverge for Unicode whitespace and
locale-specific casing, so update the deduplication flow to use the same
normalization boundary on both sides or explicitly constrain/document the
supported cases. In particular, review _normalize_content and the SQL filter it
must stay in sync with, and either move candidate matching to Python using the
same str.strip()/lower() behavior or adjust the PostgreSQL prefetch logic so it
cannot miss equivalent rows and create duplicates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 69b297d5-4347-4f75-8492-815c2dd5d603

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 7dbe145.

📒 Files selected for processing (2)
  • src/crud/document.py
  • tests/crud/test_document.py

Comment thread src/crud/document.py
Comment on lines +431 to +441
def _normalize_content(content: str) -> str:
"""Normalize document content for exact-match deduplication.

Content is compared after trimming surrounding whitespace and lowercasing

The SQL filter in ``create_documents`` must stay in sync with this:
``lower(regexp_replace(content, '^\\s+|\\s+$', '', 'g'))``. Postgres'
``trim()`` only strips spaces, so a regex is used to match Python's
``str.strip()`` across all whitespace.
"""
return content.strip().lower()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

PostgreSQL regexp_replace \s whitespace class match non-breaking space unicode

💡 Result:

In PostgreSQL, the \s shorthand in regular expressions represents the [[:space:]] POSIX character class [1][2]. Whether this class includes the non-breaking space (U+00A0) depends on your database's locale (LC_CTYPE) and encoding [3]. In many standard environments, [[:space:]] does not include the non-breaking space, which means \s will fail to match it [4][3]. To reliably match or replace non-breaking spaces along with standard whitespace, you must explicitly include the non-breaking space character in your regular expression [4][5]. You can use the following approach to replace both standard whitespace and non-breaking spaces globally: SELECT regexp_replace(your_column, '[\s\u00A0]+', ' ', 'g'); In this expression: - \s matches standard whitespace characters defined by your system's [[:space:]] class [1][2]. - \u00A0 explicitly adds the non-breaking space to the character class [4]. - The g flag ensures all occurrences are replaced, rather than just the first one [4]. If you need to target a wider range of Unicode whitespace or separator characters, you can expand the character class further [6]: SELECT regexp_replace(your_column, '[\s\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g'); Because the behavior of \s is locale-dependent, testing against your specific data and locale settings is recommended [3]. If you are only dealing with simple replacements, the non-regex replace() function is often more performant and avoids locale-related ambiguities [7].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map relevant files/symbols first
git ls-files src/crud/document.py
ast-grep outline src/crud/document.py --view expanded || true

# Show the relevant section with line numbers
sed -n '380,520p' src/crud/document.py

# Find the SQL normalization expression and its callers
rg -n "regexp_replace\\(|_normalize_content|create_documents|lower\\(" src/crud/document.py src -g '!**/__pycache__/**'

Repository: plastic-labs/honcho

Length of output: 10062


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect exact behavior of PostgreSQL regex whitespace semantics from local docs if present
python3 - <<'PY'
from pathlib import Path
p = Path('src/crud/document.py')
text = p.read_text()
for needle in ["regexp_replace", "_normalize_content", "strip()", "lower()"]:
    print(f"FOUND {needle}:", needle in text)
PY

Repository: plastic-labs/honcho

Length of output: 255


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read the specific region only if it exists
if [ -f src/crud/document.py ]; then
  nl -ba src/crud/document.py | sed -n '410,460p'
fi

Repository: plastic-labs/honcho

Length of output: 197


🌐 Web query:

PostgreSQL lower() locale-dependent Unicode casing Python str.lower difference

💡 Result:

The primary difference between PostgreSQL's lower function and Python's str.lower method lies in their reliance on locale settings and their adherence to Unicode standards [1][2][3][4]. PostgreSQL lower PostgreSQL's lower function is locale-dependent [1][2]. Its behavior is governed by the database's collation (specifically the LC_CTYPE setting of the collation) [5][6][7]. - When using libc-based collations, lower relies on the operating system's locale definitions, which can vary significantly across environments [6][7]. - When using ICU-based collations, lower leverages ICU for Unicode-compliant case mapping, which provides more consistent and robust support for various languages, including those with special casing rules like Turkish ('I' to 'ı') [5][6][8][7]. - Because it is tied to the collation, the same query might produce different results if the database is initialized with a different locale or if the expression uses an explicit COLLATE clause [5][6]. Python str.lower Python's str.lower method implements a fixed, Unicode-standard-based conversion that is not locale-aware [3][4]. - It performs a "simple" case mapping according to the Unicode standard, which does not change regardless of the system's locale settings [3][4]. - It does not account for language-specific requirements such as the Turkish "dotted/dotless I" distinction (where 'I' should map to 'ı' rather than 'i') [3][4]. - For case-insensitive comparisons, Python's str.casefold is generally preferred over str.lower because it handles more complex Unicode mappings (e.g., German 'ß' to 'ss') to facilitate caseless matching [3]. Summary of Differences 1. Locale Sensitivity: PostgreSQL lower is sensitive to the configured collation (locale) [1][5], whereas Python str.lower is locale-insensitive and always follows the same Unicode mapping [3][4]. 2. Language Support: PostgreSQL can support language-specific casing rules (e.g., Turkish) if the appropriate ICU collation is used [5][8]. Python's built-in str.lower does not support these unless external libraries like PyICU are used [3]. 3. Consistency: Python provides predictable, identical results across all environments [4]. PostgreSQL behavior for lower depends on the server's environment or collation configuration [1][6].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the surrounding create_documents logic and any dedup/uniqueness checks
nl -ba src/crud/document.py | sed -n '300,520p'

Repository: plastic-labs/honcho

Length of output: 197


Unicode normalization can drift between Python and PostgreSQL here. str.strip()/str.lower() are Unicode-aware, while regexp_replace(..., '\\s', ...) and PostgreSQL lower() are locale-dependent, so the prefetch can miss rows for inputs with non-ASCII whitespace or locale-specific casing and insert a duplicate. Match candidates in Python, or document the supported normalization boundary.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/crud/document.py` around lines 431 - 441, The normalization used by
_normalize_content and the matching SQL in create_documents can diverge for
Unicode whitespace and locale-specific casing, so update the deduplication flow
to use the same normalization boundary on both sides or explicitly
constrain/document the supported cases. In particular, review _normalize_content
and the SQL filter it must stay in sync with, and either move candidate matching
to Python using the same str.strip()/lower() behavior or adjust the PostgreSQL
prefetch logic so it cannot miss equivalent rows and create duplicates.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/crud/document.py (1)

500-508: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Optional: avoid fetching embeddings in the dedup prefetch.

select(models.Document) hydrates the full row, including the large embedding Vector column, but this prefetch only needs content and times_derived (PK is always loaded) to build the map and reinforce matches. Deferring the embedding trims I/O and memory for each matched row.

♻️ Proposed change
+from sqlalchemy.orm import load_only
@@
         existing_result = await db.execute(
             select(models.Document).where(
                 models.Document.workspace_name == workspace_name,
                 models.Document.observer == observer,
                 models.Document.observed == observed,
                 models.Document.deleted_at.is_(None),
                 normalized_content_sql.in_(batch_normalized),
-            )
+            ).options(
+                load_only(models.Document.content, models.Document.times_derived)
+            )
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/crud/document.py` around lines 500 - 508, The dedup prefetch in the
document lookup is loading full Document rows, including the large embedding
field, even though the logic only needs content and times_derived. Update the
query built in the existing_result fetch path to load only the required
columns/attributes while keeping the same filters on models.Document, so the
match map can still be built without hydrating embeddings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/crud/document.py`:
- Around line 500-508: The dedup prefetch in the document lookup is loading full
Document rows, including the large embedding field, even though the logic only
needs content and times_derived. Update the query built in the existing_result
fetch path to load only the required columns/attributes while keeping the same
filters on models.Document, so the match map can still be built without
hydrating embeddings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c827cab1-ab88-4da8-8cf0-41298d260349

📥 Commits

Reviewing files that changed from the base of the PR and between a2fd3ba and 1e225f2.

📒 Files selected for processing (2)
  • src/crud/document.py
  • tests/crud/test_document.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants