fix: support quoted exact-phrase full text search #204
The full text search uses flexsearch with the LatinBalance encoder, which maps similar-looking words to the same token. This is good for fuzzy recall but produces unwanted matches when the user knows exactly what they want. For example, searching for "aldi" surfaces "ALDEA HOMES" rows before the real "ALDI" rows. This adds an exact-match path: a query wrapped in double quotes is treated as a case-insensitive substring match against the original text instead of the fuzzy token search. The default unquoted behavior is unchanged, so existing fuzzy searches keep working. To make the index logic testable in node, the SearchIndex class moves into a standalone module that the worker imports and re-exports. The index now also keeps the original text per id so the exact path can scan it. Fixes apple#137
domoritz
left a comment
What if I want to search for "aldi" store? I guess that's not supported?
export function parseExactPhrase(query: string): string | null {
  if (query.length >= 2 && query.startsWith('"') && query.endsWith('"')) {
    let inner = query.slice(1, -1);
    return inner.length > 0 ? inner : null;
Would it be simpler to check for longer than 2 above instead of longer or equal?
Good catch, and your other comment about "aldi" store pushed me to rethink this whole parse step. I replaced parseExactPhrase with a parseQuery that walks the string quote by quote, so the special-case length guard is gone entirely. A quoted run becomes a phrase, anything outside the quotes stays free text, and empty quotes just contribute nothing. That removed the >= 2 vs > 2 ambiguity you flagged here.
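For readers following the thread, here is a minimal sketch of what a quote-by-quote parser along these lines could look like. It is not the code from the PR: the ParsedQuery shape, the field names, and the handling of an unterminated quote are assumptions made for illustration.

```typescript
// Hypothetical result shape; the PR's actual return type is not shown in this thread.
interface ParsedQuery {
  phrases: string[]; // quoted runs, to be matched as exact substrings
  freeText: string; // everything outside quotes, to be handed to the fuzzy search
}

function parseQuery(query: string): ParsedQuery {
  const phrases: string[] = [];
  const free: string[] = [];
  let pos = 0;
  while (pos < query.length) {
    const open = query.indexOf('"', pos);
    if (open === -1) {
      // No more quotes: the rest of the string is free text.
      free.push(query.slice(pos));
      break;
    }
    free.push(query.slice(pos, open));
    const close = query.indexOf('"', open + 1);
    if (close === -1) {
      // Assumption: an unterminated quote is treated as ordinary free text.
      free.push(query.slice(open + 1));
      break;
    }
    const inner = query.slice(open + 1, close);
    // Empty quotes contribute nothing, so no length guard is needed up front.
    if (inner.length > 0) phrases.push(inner);
    pos = close + 1;
  }
  // Collapse the whitespace left behind where phrases were cut out.
  const freeText = free.join(" ").replace(/\s+/g, " ").trim();
  return { phrases, freeText };
}
```

With this sketch, parseQuery('"aldi" store') yields one phrase (aldi) and the free text store, while parseQuery('""') yields no phrases and empty free text.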
Generalize the search parser so a query can combine exact phrases with fuzzy tokens. For example, the query "aldi" store requires the exact substring aldi and fuzzy-matches store. parseExactPhrase becomes parseQuery, which returns the list of quoted phrases plus the remaining free text. The query path filters candidates by every required phrase, narrowing to the fuzzy hits when free text is also present, and falls back to the original fuzzy path when there are no phrases.
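The query path described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: queryWithPhrases, the FuzzySearch callback (a stand-in for the flexsearch lookup), and the Map of original texts are hypothetical names.

```typescript
type Id = number;

// Stand-in for the flexsearch-backed lookup; hypothetical signature.
type FuzzySearch = (text: string) => Id[];

function queryWithPhrases(
  texts: Map<Id, string>, // original text per id
  fuzzy: FuzzySearch,
  phrases: string[],
  freeText: string,
  limit: number,
): Id[] {
  // No quoted phrases: fall back to the plain fuzzy path.
  if (phrases.length === 0) {
    return fuzzy(freeText).slice(0, limit);
  }
  const needles = phrases.map((p) => p.toLowerCase());
  // With free text, only the fuzzy hits are candidates; otherwise every row is.
  const candidates: Id[] =
    freeText.length > 0 ? fuzzy(freeText) : Array.from(texts.keys());
  const result: Id[] = [];
  for (const id of candidates) {
    if (result.length >= limit) break;
    const haystack = (texts.get(id) ?? "").toLowerCase();
    // Every required phrase must occur as a case-insensitive substring.
    if (needles.every((n) => haystack.includes(n))) result.push(id);
  }
  return result;
}
```

Filtering the fuzzy hits rather than intersecting two full result sets keeps the fuzzy ranking order for mixed queries, which is one plausible reason to structure it this way.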
It is now: I pushed 9d5bcaa to support exactly that. The query parser splits a string into quoted phrases plus the leftover free text. Concretely, with rows
Fixes #137
Root cause
The full text search indexes text with flexsearch using the LatinBalance encoder:
LatinBalance is a phonetic-style encoder that maps similar-looking words to the same token. That is great for fuzzy recall, but it means a query can match words the user did not intend. As reported in the issue, searching for "aldi" returns "ALDEA HOMES" rows ahead of the real "ALDI" rows because both encode to overlapping tokens. The reporter asked for a way to express an exact match, and a collaborator confirmed the encoder is the cause.
Fix
This adds an opt-in exact-match path. A query wrapped in double quotes, for example "aldi", is treated as a case-insensitive substring match against the original text rather than the fuzzy token search. Unquoted queries keep the existing fuzzy behavior, so nothing changes for current searches.
To support this, the index now keeps the original text per id alongside the flexsearch index, and the exact path scans those texts while still respecting the result limit.
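As a rough illustration of the idea, the per-id text store and limit-aware scan might look like the sketch below. The ExactTextStore class and its method names are hypothetical and stand apart from the real SearchIndex, whose API is not shown here.

```typescript
// Hypothetical helper, not the PR's SearchIndex: keeps the original text per id
// so an exact phrase can be checked against what the user actually typed in.
class ExactTextStore {
  private texts = new Map<number, string>();

  add(id: number, text: string): void {
    this.texts.set(id, text);
  }

  // Case-insensitive substring scan that stops as soon as `limit` hits are found.
  searchExact(phrase: string, limit: number): number[] {
    const needle = phrase.toLowerCase();
    const hits: number[] = [];
    if (needle.length === 0 || limit <= 0) return hits;
    // Map iteration follows insertion order, so results come back in add() order.
    for (const [id, text] of Array.from(this.texts.entries())) {
      if (text.toLowerCase().includes(needle)) {
        hits.push(id);
        if (hits.length >= limit) break;
      }
    }
    return hits;
  }
}
```

The scan is linear in the number of stored rows, which is the trade-off for not needing a second index; stopping at the limit bounds the work when matches are common.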
To make the logic testable in a node environment, the SearchIndex class moves out of the worker entry into a standalone module (search_index.ts) that the worker imports and re-exports. The public worker surface is unchanged.
Verification
Added test/search_index.test.ts. One test reproduces the issue by showing the fuzzy search returns the "ALDEA" rows for "aldi", and another asserts the quoted query "aldi" returns only the rows that actually contain that substring. The exact-match test fails before this change and passes after it.
Commands run in packages/viewer:
The unquoted fuzzy path is left untouched, so this is additive behavior gated behind quoting.