
GROQ Text matching with _id #102

@maxyinger


Issue

Current api version: "2022-07-07"
(however, this behavior was seen on all other versions I tried as well)

Scenario
We have internationalization set up for documents, where ids follow the pattern:

{
  _id: i18n.<id>.<lang>
}
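To make the pattern concrete, here is a minimal TypeScript sketch of how such ids could be built and parsed. The helper names `buildI18nId` and `parseI18nId` are hypothetical (they are not part of any Sanity API), and the sketch assumes the language code itself never contains a dot.

```typescript
// Hypothetical helpers for illustration only; not part of any Sanity API.
interface I18nId {
  baseId: string; // the <id> part of i18n.<id>.<lang>
  lang: string;   // the <lang> part, assumed to contain no dots
}

// Builds an id of the form i18n.<id>.<lang>.
function buildI18nId(baseId: string, lang: string): string {
  return `i18n.${baseId}.${lang}`;
}

// Parses an id of the form i18n.<id>.<lang>; returns null if it does not fit.
function parseI18nId(id: string): I18nId | null {
  const match = /^i18n\.(.+)\.([^.]+)$/.exec(id);
  return match ? { baseId: match[1], lang: match[2] } : null;
}
```

For the ids used in this issue, `parseI18nId("i18n.page-2021.fr")` would yield the base id `page-2021` and the language `fr`.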

We started running into an issue when filtering on _id with the text-matching pattern:

*[_id match "*." + $lang]

This works for most documents, but it does not match a document with the following _id:

// query
*[_id match "*.fr"]

// data
[
  { _id: "i18n.page-2021.fr" }, // not matching
  { _id: "i18n.page-abcd.fr" } // matches correctly
]

Testing fields outside of _id

I think this is specific to the _id field, because the same string matches fine in GROQ Arcade when I put it in a different field, i.e.:

// query
*[title match "*.fr"]

// data
[
  { title: "i18n.page-2021.fr" } // matches fine here
]

Assumed Cause

I'm guessing it has something to do with an edge case in tokenization, where _id is handled differently than other fields, specifically when a number appears right before the matched text, i.e.:

// query
*[_id match "*.fr"]

// data
[
  { _id: "i18n.page-2021.fr" }, // not matching
  { _id: "i18n.page-abcd.fr" } // matches correctly
]  

Alternatives considered

I looked into path() filters, but it seems those only work when the wildcard characters are at the end of the path, i.e.:

*[_id in path("**.fr")]

// data
[
  { _id: "i18n.page-2021.fr" }, // no match
]  

Is there support for something like this with the path function? Is our id structure simply not compatible? Would you recommend a different approach to filter all documents whose _id ends in .$lang?
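For reference, the behavior we are after is a plain suffix check on _id. As a stopgap it could be applied client-side after fetching a broader set of documents. The following is only a minimal TypeScript sketch of that fallback, not a query-side solution: `filterByLang` and the `Doc` type are hypothetical names, and the sample data mirrors the ids above plus one assumed `.en` document.

```typescript
// Hypothetical client-side fallback; this does not use any Sanity API.
interface Doc {
  _id: string;
}

// Keeps only the documents whose _id ends in ".<lang>".
function filterByLang<T extends Doc>(docs: T[], lang: string): T[] {
  const suffix = `.${lang}`;
  return docs.filter((doc) => doc._id.endsWith(suffix));
}

// Sample data: the two ids from this issue plus an assumed English document.
const docs: Doc[] = [
  { _id: "i18n.page-2021.fr" },
  { _id: "i18n.page-abcd.fr" },
  { _id: "i18n.page-abcd.en" },
];

const french = filterByLang(docs, "fr");
```

Here `french` contains both `.fr` documents, including `i18n.page-2021.fr`, which is the result we would expect from the query itself. The obvious drawback is that the filtering happens after the fetch, so it does not reduce the amount of data returned.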

Documentation Feedback

In general, I found it tough to find resources on Sanity's tokenization approach and how it maps to text matching. There is this example at the bottom of the Text Matching section of the Query Cheat Sheet, but it doesn't elaborate beyond showing that the second match fails:

// Note how match operates on tokens!
"foo bar" match "fo*"  // -> true
"my-pretty-pony-123.jpg" match "my*.jpg"  // -> false

It might be nice to link to the Full-Text Search Operators Section in the cheat sheet.

References

https://www.sanity.io/answers/issue-with-filtering-documents-using-match-query-in-elasticsearch
sanity-io/sanity#1913
