feat: add exact content deduplication in document creation#861
feat: add exact content deduplication in document creation#861Rajat-Ahuja1997 wants to merge 3 commits into
Conversation
WalkthroughAdds exact-content deduplication to ChangesExact-content deduplication in
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/crud/document.py`:
- Around line 431-441: The normalization used by _normalize_content and the
matching SQL in create_documents can diverge for Unicode whitespace and
locale-specific casing, so update the deduplication flow to use the same
normalization boundary on both sides or explicitly constrain/document the
supported cases. In particular, review _normalize_content and the SQL filter it
must stay in sync with, and either move candidate matching to Python using the
same str.strip()/lower() behavior or adjust the PostgreSQL prefetch logic so it
cannot miss equivalent rows and create duplicates.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 69b297d5-4347-4f75-8492-815c2dd5d603
📒 Files selected for processing (2)
src/crud/document.pytests/crud/test_document.py
| def _normalize_content(content: str) -> str: | ||
| """Normalize document content for exact-match deduplication. | ||
|
|
||
| Content is compared after trimming surrounding whitespace and lowercasing | ||
|
|
||
| The SQL filter in ``create_documents`` must stay in sync with this: | ||
| ``lower(regexp_replace(content, '^\\s+|\\s+$', '', 'g'))``. Postgres' | ||
| ``trim()`` only strips spaces, so a regex is used to match Python's | ||
| ``str.strip()`` across all whitespace. | ||
| """ | ||
| return content.strip().lower() |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
PostgreSQL regexp_replace \s whitespace class match non-breaking space unicode
💡 Result:
In PostgreSQL, the \s shorthand in regular expressions represents the [[:space:]] POSIX character class [1][2]. Whether this class includes the non-breaking space (U+00A0) depends on your database's locale (LC_CTYPE) and encoding [3]. In many standard environments, [[:space:]] does not include the non-breaking space, which means \s will fail to match it [4][3]. To reliably match or replace non-breaking spaces along with standard whitespace, you must explicitly include the non-breaking space character in your regular expression [4][5]. You can use the following approach to replace both standard whitespace and non-breaking spaces globally: SELECT regexp_replace(your_column, '[\s\u00A0]+', ' ', 'g'); In this expression: - \s matches standard whitespace characters defined by your system's [[:space:]] class [1][2]. - \u00A0 explicitly adds the non-breaking space to the character class [4]. - The g flag ensures all occurrences are replaced, rather than just the first one [4]. If you need to target a wider range of Unicode whitespace or separator characters, you can expand the character class further [6]: SELECT regexp_replace(your_column, '[\s\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g'); Because the behavior of \s is locale-dependent, testing against your specific data and locale settings is recommended [3]. If you are only dealing with simple replacements, the non-regex replace() function is often more performant and avoids locale-related ambiguities [7].
Citations:
- 1: https://www.postgresql.org/docs/17/functions-matching.html
- 2: https://www.postgresql.org/docs/current/functions-matching.html
- 3: https://www.postgresql.org/message-id/17525.1478531130%40sss.pgh.pa.us
- 4: https://dba.stackexchange.com/questions/345484/nbsp-non-breaking-space-is-not-considered-whitespace-by-postgres
- 5: https://stackoverflow.com/questions/22699535/trim-trailing-spaces-with-postgresql
- 6: https://stackoverflow.com/questions/63302656/remove-all-unicode-space-separators-in-postgresql
- 7: https://stackoverflow.com/questions/36849886/replacing-nonbreaking-spaces-a0-in-postgres
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map relevant files/symbols first
git ls-files src/crud/document.py
ast-grep outline src/crud/document.py --view expanded || true
# Show the relevant section with line numbers
sed -n '380,520p' src/crud/document.py
# Find the SQL normalization expression and its callers
rg -n "regexp_replace\\(|_normalize_content|create_documents|lower\\(" src/crud/document.py src -g '!**/__pycache__/**'Repository: plastic-labs/honcho
Length of output: 10062
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect exact behavior of PostgreSQL regex whitespace semantics from local docs if present
python3 - <<'PY'
from pathlib import Path
p = Path('src/crud/document.py')
text = p.read_text()
for needle in ["regexp_replace", "_normalize_content", "strip()", "lower()"]:
print(f"FOUND {needle}:", needle in text)
PYRepository: plastic-labs/honcho
Length of output: 255
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Read the specific region only if it exists
if [ -f src/crud/document.py ]; then
nl -ba src/crud/document.py | sed -n '410,460p'
fiRepository: plastic-labs/honcho
Length of output: 197
🌐 Web query:
PostgreSQL lower() locale-dependent Unicode casing Python str.lower difference
💡 Result:
The primary difference between PostgreSQL's lower function and Python's str.lower method lies in their reliance on locale settings and their adherence to Unicode standards [1][2][3][4]. PostgreSQL lower PostgreSQL's lower function is locale-dependent [1][2]. Its behavior is governed by the database's collation (specifically the LC_CTYPE setting of the collation) [5][6][7]. - When using libc-based collations, lower relies on the operating system's locale definitions, which can vary significantly across environments [6][7]. - When using ICU-based collations, lower leverages ICU for Unicode-compliant case mapping, which provides more consistent and robust support for various languages, including those with special casing rules like Turkish ('I' to 'ı') [5][6][8][7]. - Because it is tied to the collation, the same query might produce different results if the database is initialized with a different locale or if the expression uses an explicit COLLATE clause [5][6]. Python str.lower Python's str.lower method implements a fixed, Unicode-standard-based conversion that is not locale-aware [3][4]. - It performs a "simple" case mapping according to the Unicode standard, which does not change regardless of the system's locale settings [3][4]. - It does not account for language-specific requirements such as the Turkish "dotted/dotless I" distinction (where 'I' should map to 'ı' rather than 'i') [3][4]. - For case-insensitive comparisons, Python's str.casefold is generally preferred over str.lower because it handles more complex Unicode mappings (e.g., German 'ß' to 'ss') to facilitate caseless matching [3]. Summary of Differences 1. Locale Sensitivity: PostgreSQL lower is sensitive to the configured collation (locale) [1][5], whereas Python str.lower is locale-insensitive and always follows the same Unicode mapping [3][4]. 2. Language Support: PostgreSQL can support language-specific casing rules (e.g., Turkish) if the appropriate ICU collation is used [5][8]. Python's built-in str.lower does not support these unless external libraries like PyICU are used [3]. 3. Consistency: Python provides predictable, identical results across all environments [4]. PostgreSQL behavior for lower depends on the server's environment or collation configuration [1][6].
Citations:
- 1: https://www.postgresql.org/docs/14/functions-string.html
- 2: https://www.postgresql.org/docs/current/functions-string.html
- 3: https://unicodefyi.com/series/developers-handbook/comparison-and-sorting/
- 4: More fully implement Unicode's case mappings python/cpython#82029
- 5: https://www.postgresql.org/docs/18/collation.html
- 6: https://www.postgresql.org/docs/18/locale.html
- 7: https://www.postgresql.org/docs/current/collation.html
- 8: https://www.postgresql.org/message-id/flat/1ecded1b-4ff8-46da-9b94-6e240db2a23b%40manitou-mail.org
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the surrounding create_documents logic and any dedup/uniqueness checks
nl -ba src/crud/document.py | sed -n '300,520p'Repository: plastic-labs/honcho
Length of output: 197
Unicode normalization can drift between Python and PostgreSQL here. str.strip()/str.lower() are Unicode-aware, while regexp_replace(..., '\\s', ...) and PostgreSQL lower() are locale-dependent, so the prefetch can miss rows for inputs with non-ASCII whitespace or locale-specific casing and insert a duplicate. Match candidates in Python, or document the supported normalization boundary.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/crud/document.py` around lines 431 - 441, The normalization used by
_normalize_content and the matching SQL in create_documents can diverge for
Unicode whitespace and locale-specific casing, so update the deduplication flow
to use the same normalization boundary on both sides or explicitly
constrain/document the supported cases. In particular, review _normalize_content
and the SQL filter it must stay in sync with, and either move candidate matching
to Python using the same str.strip()/lower() behavior or adjust the PostgreSQL
prefetch logic so it cannot miss equivalent rows and create duplicates.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/crud/document.py (1)
500-508: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueOptional: avoid fetching embeddings in the dedup prefetch.
select(models.Document)hydrates the full row, including the largeembeddingVector column, but this prefetch only needscontentandtimes_derived(PK is always loaded) to build the map and reinforce matches. Deferring the embedding trims I/O and memory for each matched row.♻️ Proposed change
+from sqlalchemy.orm import load_only @@ existing_result = await db.execute( select(models.Document).where( models.Document.workspace_name == workspace_name, models.Document.observer == observer, models.Document.observed == observed, models.Document.deleted_at.is_(None), normalized_content_sql.in_(batch_normalized), - ) + ).options( + load_only(models.Document.content, models.Document.times_derived) + ) )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/crud/document.py` around lines 500 - 508, The dedup prefetch in the document lookup is loading full Document rows, including the large embedding field, even though the logic only needs content and times_derived. Update the query built in the existing_result fetch path to load only the required columns/attributes while keeping the same filters on models.Document, so the match map can still be built without hydrating embeddings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/crud/document.py`:
- Around line 500-508: The dedup prefetch in the document lookup is loading full
Document rows, including the large embedding field, even though the logic only
needs content and times_derived. Update the query built in the existing_result
fetch path to load only the required columns/attributes while keeping the same
filters on models.Document, so the match map can still be built without
hydrating embeddings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c827cab1-ab88-4da8-8cf0-41298d260349
📒 Files selected for processing (2)
src/crud/document.pytests/crud/test_document.py
Summary by CodeRabbit
times_derivedatomically) instead of creating a new row.times_derivedmax-style behavior, and exact/semantic dedup interaction.