-
Notifications
You must be signed in to change notification settings - Fork 204
feat: adding count with filtering operations to OpenSearchDocumentStore
#2653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: adding count with filtering operations to OpenSearchDocumentStore
#2653
Conversation
OpenSearchDocumentStore
|
Hey @tstadel I'd also appreciate your review on this since we want to make sure it will in platform as well. |
| # Fields that are not metadata (should stay at top level) | ||
| non_meta_fields = {"id", "content", "embedding", "blob", "sparse_embedding", "score"} | ||
|
|
||
| for hit in hits: | ||
| data = hit["_source"] | ||
| data = hit["_source"].copy() | ||
|
|
||
| # Reconstruct metadata dict from flattened fields | ||
| meta = {} | ||
| fields_to_remove = [] | ||
| for key, value in data.items(): | ||
| if key not in non_meta_fields: | ||
| meta[key] = value | ||
| fields_to_remove.append(key) | ||
|
|
||
| # Remove metadata fields from top level and add them to meta | ||
| for key in fields_to_remove: | ||
| data.pop(key, None) | ||
|
|
||
| if meta: | ||
| data["meta"] = meta | ||
|
|
||
| if "highlight" in hit: | ||
| data["metadata"]["highlighted"] = hit["highlight"] | ||
| if "meta" not in data: | ||
| data["meta"] = {} | ||
| data["meta"]["highlighted"] = hit["highlight"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain what was happening before making these changes? Before this were we throwing away all meta information when reconstructing the Document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could we add some integration tests in test_bm25_retriever.py and test_embedding_retriever.py to do a full check of all fields of a returned Document? It seems we are missing some tests to confirm that returned Docs are reconstructed properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it seems like there is another function called _deserialize_document that contained the same logic here but doesn't seem to be used anywhere. Could we remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed and was over-engineered.
I've added extensive tests to ensure that both BM25 and Embedding retrievers can store and retrieve documents with "complex" metadata. It's working with and without these changes. I will revert it.
Thanks for spotting this!
| """ | ||
| Builds cardinality aggregations for all metadata fields in the index mapping. | ||
| """ | ||
| special_fields = {"content", "embedding", "id", "score", "blob", "sparse_embedding"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this set of fields is reused a few times. Perhaps we could make it a global variable at the top of this file (or a class attribute) so we can have one source of truth?
| @staticmethod | ||
| def _build_cardinality_aggregations(index_mapping: dict[str, Any]) -> dict[str, Any]: | ||
| """ | ||
| Builds cardinality aggregations for all metadata fields in the index mapping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be helpful to link to the OpenSearch docs on cardinality aggregations https://docs.opensearch.org/latest/aggregations/metric/cardinality/ in the docstring
…ten and retrieved
Related Issues
OpenSearchDocumentStore#2635Proposed Changes:
count_documents_by_filter()- count documents matching filter criteriacount_distinct_values_by_filter()- get distinct value counts for metadata fields with optional filteringget_fields_info()- retrieve field type information from index mappingget_field_min_max()- get min/max values for numeric metadata fieldsget_field_unique_values()- get unique values for a field with pagination and content-based filteringquery_sql()- execute SQL queries against OpenSearch with support for multiple response formats (JSON, CSV, JDBC, RAW)How did you test it?
Notes for the reviewer
httpx>=0.28.1dependencyChecklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.