fix: support quoted exact-phrase full text search by adityasingh2400 · Pull Request #204 · apple/embedding-atlas

adityasingh2400 · 2026-06-02T03:20:56Z

Fixes #137

Root cause

The full text search indexes text with flexsearch using the LatinBalance encoder:

const options: IndexOptions = {
  tokenize: "forward",
  encoder: Charset.LatinBalance,
};

LatinBalance is a phonetic-style encoder that maps similar-looking words to the same token. That is great for fuzzy recall, but it means a query can match words the user did not intend. As reported in the issue, searching for aldi (unquoted) returns "ALDEA HOMES" rows ahead of the real "ALDI" rows because both encode to overlapping tokens. The reporter asked for a way to express an exact match, and a collaborator confirmed the encoder is the cause.

Fix

This adds an opt-in exact-match path. A query wrapped in double quotes, for example "aldi", is treated as a case-insensitive substring match against the original text rather than the fuzzy token search. Unquoted queries keep the existing fuzzy behavior, so nothing changes for current searches.

To support this, the index now keeps the original text per id alongside the flexsearch index, and the exact path scans those texts while still respecting the result limit.
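As a rough illustration of the exact path described above, here is a minimal sketch. It is not the code in this PR: the exactSearch name, the Map-based storage, and the sample rows are assumptions made for the example.

```typescript
// Hypothetical helper (not the PR's implementation): scan the stored original
// texts for a case-insensitive substring and stop once `limit` ids are found.
function exactSearch(
  texts: Map<number, string>,
  phrase: string,
  limit: number,
): number[] {
  const needle = phrase.toLowerCase();
  const hits: number[] = [];
  for (const [id, text] of texts) {
    if (hits.length >= limit) break; // respect the result limit
    if (text.toLowerCase().includes(needle)) hits.push(id);
  }
  return hits;
}

// Sample rows, keyed by id, standing in for the original text kept per id.
const texts = new Map<number, string>([
  [0, "ALDEA HOMES"],
  [1, "ALDI Supermarket"],
  [2, "Corner ALDI"],
]);

// "ALDEA HOMES" does not contain the substring "aldi", so only ids 1 and 2 match.
const exactHits = exactSearch(texts, "aldi", 10); // [1, 2]
```

The scan is linear in the number of rows, which is the trade-off for bypassing the encoder entirely.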

To make the logic testable in a node environment, the SearchIndex class moves out of the worker entry into a standalone module (search_index.ts) that the worker imports and re-exports. The public worker surface is unchanged.

Verification

Added test/search_index.test.ts. One test reproduces the issue by showing that the fuzzy search returns the "ALDEA" rows for the unquoted query aldi, and another asserts that the quoted query "aldi" returns only the rows that actually contain that substring. The exact-match test fails before this change and passes after it.

Commands run in packages/viewer:

npx vitest run: 40 passed across 3 files
npx prettier -c on the changed files: all clean

The unquoted fuzzy path is left untouched, so this is additive behavior gated behind quoting.

The full text search uses flexsearch with the LatinBalance encoder, which maps similar-looking words to the same token. This is good for fuzzy recall but produces unwanted matches when the user knows exactly what they want. For example, searching for aldi (unquoted) surfaces "ALDEA HOMES" rows before the real "ALDI" rows. This adds an exact-match path: a query wrapped in double quotes is treated as a case-insensitive substring match against the original text instead of the fuzzy token search. The default unquoted behavior is unchanged, so existing fuzzy searches keep working. To make the index logic testable in node, the SearchIndex class moves into a standalone module that the worker imports and re-exports. The index now also keeps the original text per id so the exact path can scan it. Fixes apple#137

domoritz

What if I want to search for "aldi" store? I guess that's not supported?

domoritz · 2026-06-02T12:15:19Z

+export function parseExactPhrase(query: string): string | null {
+  if (query.length >= 2 && query.startsWith('"') && query.endsWith('"')) {
+    let inner = query.slice(1, -1);
+    return inner.length > 0 ? inner : null;


Would it be simpler to check for longer than 2 above instead of longer or equal?

Good catch, and your other comment about "aldi" store pushed me to rethink this whole parse step. I replaced parseExactPhrase with a parseQuery that walks the string quote by quote, so the special-case length guard is gone entirely. A quoted run becomes a phrase, anything outside the quotes stays free text, and empty quotes just contribute nothing. That removed the >= 2 vs > 2 ambiguity you flagged here.

Generalize the search parser so a query can combine exact phrases with fuzzy tokens. For example, "aldi" store requires the exact substring aldi and fuzzy-matches store. parseExactPhrase becomes parseQuery, which returns the list of quoted phrases plus the remaining free text. The query path filters candidates by every required phrase, narrowing to the fuzzy hits when free text is also present, and falls back to the original fuzzy path when there are no phrases.
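To make the parsing step concrete, a quote-by-quote parser along these lines might look like the sketch below. Only the parseQuery name comes from the thread; the ParsedQuery shape, the field names, and the handling of an unterminated quote (kept as free text) are assumptions, since the thread does not spell them out.

```typescript
// Assumed result shape; the field names are hypothetical.
interface ParsedQuery {
  phrases: string[]; // contents of each closed pair of double quotes
  text: string; // everything outside the quotes, whitespace-normalized
}

function parseQuery(query: string): ParsedQuery {
  const phrases: string[] = [];
  const free: string[] = [];
  let pos = 0;
  while (pos < query.length) {
    const open = query.indexOf('"', pos);
    if (open === -1) {
      free.push(query.slice(pos)); // no more quotes: the rest is free text
      break;
    }
    free.push(query.slice(pos, open));
    const close = query.indexOf('"', open + 1);
    if (close === -1) {
      // Assumption: an unterminated quote is treated as free text.
      free.push(query.slice(open + 1));
      break;
    }
    const inner = query.slice(open + 1, close);
    if (inner.length > 0) phrases.push(inner); // empty quotes contribute nothing
    pos = close + 1;
  }
  const text = free
    .join(" ")
    .split(/\s+/)
    .filter((part) => part.length > 0)
    .join(" ");
  return { phrases, text };
}

const mixed = parseQuery('"aldi" store'); // phrases: ["aldi"], text: "store"
const stacked = parseQuery('"aldi" "downtown"'); // phrases: ["aldi", "downtown"], text: ""
```

Because the loop never inspects the overall string length, there is no separate guard for the two-character "" case; it simply yields no phrase.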

adityasingh2400 · 2026-06-05T04:39:50Z

What if I want to search for "aldi" store? I guess that's not supported?

It is now. I pushed 9d5bcaa to support exactly that. The query parser splits a string into quoted phrases plus the leftover free text, so "aldi" store parses to one required phrase (aldi) and the free text store. The phrase is matched as an exact case-insensitive substring, the free text goes through the normal fuzzy index, and a row has to satisfy both. You can also stack phrases: "aldi" "downtown" requires both substrings.

Concretely, with rows ALDI Supermarket, Corner ALDI, and ALDI store downtown, searching "aldi" returns all three, while "aldi" store narrows to just ALDI store downtown. A plain unquoted query is unchanged: it is all free text with no phrases, so the existing fuzzy path is untouched. New tests cover the mixed, multi-phrase, and unterminated-quote cases.
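The combined filtering can be sketched as follows, using the three rows from the example. This is an illustration under stated assumptions, not the PR's code: fuzzyStandIn is a simple word-prefix matcher standing in for the flexsearch lookup, and the search signature is hypothetical.

```typescript
type Row = { id: number; text: string };

// Stand-in for the fuzzy index lookup. This is NOT flexsearch: it matches a row
// when every query word is a case-insensitive prefix of some word in the row.
function fuzzyStandIn(rows: Row[], text: string): Set<number> {
  const words = text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const hits = new Set<number>();
  for (const row of rows) {
    const rowWords = row.text.toLowerCase().split(/\s+/);
    if (words.every((w) => rowWords.some((rw) => rw.startsWith(w)))) {
      hits.add(row.id);
    }
  }
  return hits;
}

// Hypothetical query path: every phrase must appear as a substring, and when
// free text is present the row must also be one of the fuzzy hits.
function search(rows: Row[], phrases: string[], text: string, limit: number): number[] {
  if (phrases.length === 0) {
    // No phrases: fall back to the fuzzy path alone.
    return Array.from(fuzzyStandIn(rows, text)).slice(0, limit);
  }
  const fuzzyHits = text.length > 0 ? fuzzyStandIn(rows, text) : null;
  const needles = phrases.map((p) => p.toLowerCase());
  const result: number[] = [];
  for (const row of rows) {
    if (result.length >= limit) break;
    if (fuzzyHits !== null && !fuzzyHits.has(row.id)) continue;
    const lower = row.text.toLowerCase();
    if (needles.every((n) => lower.includes(n))) result.push(row.id);
  }
  return result;
}

const rows: Row[] = [
  { id: 0, text: "ALDI Supermarket" },
  { id: 1, text: "Corner ALDI" },
  { id: 2, text: "ALDI store downtown" },
];

const phraseOnly = search(rows, ["aldi"], "", 10); // [0, 1, 2]
const phraseAndText = search(rows, ["aldi"], "store", 10); // [2]
const twoPhrases = search(rows, ["aldi", "downtown"], "", 10); // [2]
```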

domoritz reviewed Jun 2, 2026

adityasingh2400 wants to merge 2 commits into apple:main from adityasingh2400:fix-search-exact-phrase
