fix: support quoted exact-phrase full text search #204

Open
adityasingh2400 wants to merge 2 commits into apple:main from adityasingh2400:fix-search-exact-phrase

@adityasingh2400

Contributor

Fixes #137

Root cause

The full text search indexes text with flexsearch using the LatinBalance encoder:

const options: IndexOptions = {
  tokenize: "forward",
  encoder: Charset.LatinBalance,
};

LatinBalance is a phonetic-style encoder that maps similar-looking words to the same token. That is great for fuzzy recall, but it means a query can match words the user did not intend. As reported in the issue, searching for "aldi" returns "ALDEA HOMES" rows ahead of the real "ALDI" rows because both encode to overlapping tokens. The reporter asked for a way to express an exact match, and a collaborator confirmed the encoder is the cause.
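
To make the collision concrete, here is a minimal sketch of how a character-folding encoder combined with forward (prefix) tokenization can let "aldi" hit "ALDEA". The folding table and the helper names `toyEncode` and `forwardPrefixes` are assumptions for illustration only; they are not taken from flexsearch and may differ from what LatinBalance actually does.

```typescript
// Toy folding table (an assumption for illustration, NOT flexsearch's real mapping).
const FOLD = new Map<string, string>([
  ["d", "t"],
  ["i", "e"],
  ["y", "e"],
  ["u", "o"],
]);

// Lowercase the word and fold each character through the toy table.
function toyEncode(word: string): string {
  return Array.from(word.toLowerCase(), (ch) => FOLD.get(ch) ?? ch).join("");
}

// Forward tokenization, modeled here as indexing every prefix of the encoded word.
function forwardPrefixes(encoded: string): string[] {
  const prefixes: string[] = [];
  for (let end = 1; end <= encoded.length; end++) {
    prefixes.push(encoded.slice(0, end));
  }
  return prefixes;
}

const queryToken = toyEncode("aldi");
const aldeaPrefixes = forwardPrefixes(toyEncode("ALDEA"));

// True when the encoded query equals one of the indexed prefixes of "ALDEA".
const collides = aldeaPrefixes.includes(queryToken);
console.log(queryToken, aldeaPrefixes, collides);
```

Under this toy table both words fold to strings sharing the same leading characters, so a prefix index cannot tell them apart.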

Fix

This adds an opt-in exact-match path. A query wrapped in double quotes, for example "aldi", is treated as a case-insensitive substring match against the original text rather than the fuzzy token search. Unquoted queries keep the existing fuzzy behavior, so nothing changes for current searches.

To support this, the index now keeps the original text per id alongside the flexsearch index, and the exact path scans those texts while still respecting the result limit.
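
As a rough sketch of that exact path (not the code in this PR), the following assumes the original texts live in a `Map` keyed by a numeric row id. The function name `exactSearch`, the id type, and the simplified `parseExactPhrase` guard are hypothetical choices made for the example.

```typescript
// Returns the text inside a fully quoted query, or null when the query is not
// quoted or the quotes are empty. Simplified for this sketch.
function parseExactPhrase(query: string): string | null {
  if (query.length > 2 && query.startsWith('"') && query.endsWith('"')) {
    return query.slice(1, -1);
  }
  return null;
}

// Scans the stored original texts for a case-insensitive substring match and
// stops as soon as `limit` ids have been collected.
function exactSearch(
  texts: Map<number, string>,
  phrase: string,
  limit: number,
): number[] {
  const needle = phrase.toLowerCase();
  const hits: number[] = [];
  for (const [id, text] of texts) {
    if (hits.length >= limit) break;
    if (text.toLowerCase().includes(needle)) hits.push(id);
  }
  return hits;
}

// Example rows, using the names from the linked issue.
const sampleTexts = new Map<number, string>([
  [1, "ALDI Supermarket"],
  [2, "ALDEA HOMES"],
  [3, "Corner ALDI"],
]);

const phrase = parseExactPhrase('"aldi"');
if (phrase !== null) {
  console.log(exactSearch(sampleTexts, phrase, 10));
}
```

A linear scan is the simplest option here; it trades speed on large tables for not having to maintain a second index.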

To make the logic testable in a node environment, the SearchIndex class moves out of the worker entry into a standalone module (search_index.ts) that the worker imports and re-exports. The public worker surface is unchanged.

Verification

Added test/search_index.test.ts. One test reproduces the issue by showing the fuzzy search returns the "ALDEA" rows for "aldi", and another asserts the quoted query "aldi" returns only the rows that actually contain that substring. The exact-match test fails before this change and passes after it.

Commands run in packages/viewer:

  • npx vitest run, 40 passed across 3 files
  • npx prettier -c on the changed files, all clean

The unquoted fuzzy path is left untouched, so this is additive behavior gated behind quoting.

The full text search uses flexsearch with the LatinBalance encoder, which
maps similar-looking words to the same token. This is good for fuzzy recall
but produces unwanted matches when the user knows exactly what they want. For
example, searching for "aldi" surfaces "ALDEA HOMES" rows before the real
"ALDI" rows.

This adds an exact-match path: a query wrapped in double quotes is treated as
a case-insensitive substring match against the original text instead of the
fuzzy token search. The default unquoted behavior is unchanged, so existing
fuzzy searches keep working.

To make the index logic testable in node, the SearchIndex class moves into a
standalone module that the worker imports and re-exports. The index now also
keeps the original text per id so the exact path can scan it.

Fixes apple#137

@domoritz left a comment

Member

What if I want to search for "aldi" store? I guess that's not supported?

export function parseExactPhrase(query: string): string | null {
  if (query.length >= 2 && query.startsWith('"') && query.endsWith('"')) {
    let inner = query.slice(1, -1);
    return inner.length > 0 ? inner : null;

Member

Would it be simpler to check for longer than 2 above instead of longer or equal?

Contributor Author

Good catch, and your other comment about "aldi" store pushed me to rethink this whole parse step. I replaced parseExactPhrase with a parseQuery that walks the string quote by quote, so the special-case length guard is gone entirely. A quoted run becomes a phrase, anything outside the quotes stays free text, and empty quotes just contribute nothing. That removed the >= 2 vs > 2 ambiguity you flagged here.
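
For readers following along, a quote-by-quote parser of the kind described could look roughly like the sketch below. It is not the code pushed to this branch: the `ParsedQuery` shape is hypothetical, and treating the text after an unterminated quote as free text is an assumption, since the thread does not say how that case is handled.

```typescript
// Hypothetical result shape: required phrases plus the leftover free text.
interface ParsedQuery {
  phrases: string[];
  freeText: string;
}

function parseQuery(query: string): ParsedQuery {
  const phrases: string[] = [];
  const free: string[] = [];
  let pos = 0;
  while (pos < query.length) {
    const open = query.indexOf('"', pos);
    if (open === -1) {
      // No more quotes: the remainder is free text.
      free.push(query.slice(pos));
      break;
    }
    // Text before the opening quote is free text.
    free.push(query.slice(pos, open));
    const close = query.indexOf('"', open + 1);
    if (close === -1) {
      // Assumption: an unterminated quote degrades to free text.
      free.push(query.slice(open + 1));
      break;
    }
    const phrase = query.slice(open + 1, close);
    // Empty quotes contribute nothing.
    if (phrase.length > 0) phrases.push(phrase);
    pos = close + 1;
  }
  // Collapse whitespace left behind by the removed phrases.
  const freeText = free.join(" ").split(/\s+/).filter(Boolean).join(" ");
  return { phrases, freeText };
}

console.log(parseQuery('"aldi" store'));
console.log(parseQuery('"aldi" "downtown"'));
```

Because the loop only ever looks for the next quote character, there is no length check to get wrong.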

Generalize the search parser so a query can combine exact phrases with
fuzzy tokens, for example "aldi" store requires the exact substring
aldi and fuzzy-matches store. parseExactPhrase becomes parseQuery, which
returns the list of quoted phrases plus the remaining free text. The
query path filters candidates by every required phrase, narrowing to the
fuzzy hits when free text is also present, and falls back to the original
fuzzy path when there are no phrases.

@adityasingh2400

Contributor Author

> What if I want to search for "aldi" store? I guess that's not supported?

It is now: I pushed 9d5bcaa to support exactly that. The query parser splits a string into quoted phrases plus the leftover free text, so "aldi" store parses to one required phrase (aldi) and the free text store. The phrase is matched as an exact case-insensitive substring, the free text goes through the normal fuzzy index, and a row has to satisfy both. You can also stack phrases: "aldi" "downtown" requires both substrings.

Concretely, with rows ALDI Supermarket, Corner ALDI, and ALDI store downtown, searching "aldi" returns all three, while "aldi" store narrows to just ALDI store downtown. A plain unquoted query is unchanged: it is all free text with no phrases, so it takes the existing fuzzy path untouched. New tests cover the mixed, multi-phrase, and unterminated-quote cases.
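
The combined filtering can be sketched as follows, using the three rows from the example. This is an illustration under stated assumptions rather than the PR's implementation: the `search` signature and the `FuzzySearch` type are hypothetical, and `wordPrefixFuzzy` is a plain word-prefix matcher standing in for the real flexsearch index.

```typescript
// Hypothetical interface for the fuzzy side: returns the ids it considers hits.
type FuzzySearch = (text: string) => Set<number>;

function search(
  texts: Map<number, string>,
  parsed: { phrases: string[]; freeText: string },
  fuzzy: FuzzySearch,
  limit: number,
): number[] {
  // No phrases: fall back to the fuzzy path alone.
  if (parsed.phrases.length === 0) {
    return Array.from(fuzzy(parsed.freeText)).slice(0, limit);
  }
  const needles = parsed.phrases.map((p) => p.toLowerCase());
  // When free text is present, only fuzzy hits stay in the candidate set.
  const fuzzyHits = parsed.freeText.length > 0 ? fuzzy(parsed.freeText) : null;
  const hits: number[] = [];
  for (const [id, text] of texts) {
    if (hits.length >= limit) break;
    if (fuzzyHits !== null && !fuzzyHits.has(id)) continue;
    const lower = text.toLowerCase();
    // Every quoted phrase must appear as a case-insensitive substring.
    if (needles.every((n) => lower.includes(n))) hits.push(id);
  }
  return hits;
}

const rows = new Map<number, string>([
  [1, "ALDI Supermarket"],
  [2, "Corner ALDI"],
  [3, "ALDI store downtown"],
]);

// Stand-in for the fuzzy index (an assumption): case-insensitive word-prefix match.
const wordPrefixFuzzy: FuzzySearch = (q) => {
  const terms = q.toLowerCase().split(/\s+/).filter(Boolean);
  const ids = new Set<number>();
  for (const [id, text] of rows) {
    const words = text.toLowerCase().split(/\s+/);
    if (terms.every((t) => words.some((w) => w.startsWith(t)))) ids.add(id);
  }
  return ids;
};

console.log(search(rows, { phrases: ["aldi"], freeText: "" }, wordPrefixFuzzy, 10));
console.log(search(rows, { phrases: ["aldi"], freeText: "store" }, wordPrefixFuzzy, 10));
```

Passing the fuzzy search in as a function keeps the phrase filtering testable in node without loading the real index.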

Development

Successfully merging this pull request may close these issues.

Bug (?): False matches in full-text search

2 participants