
Conversation

@davidsbatista
Contributor

@davidsbatista davidsbatista commented Jan 5, 2026

Related Issues

Proposed Changes:

  • count_documents_by_filter() - count documents matching filter criteria
  • count_distinct_values_by_filter() - get distinct value counts for metadata fields, with optional filtering
  • get_fields_info() - retrieve field type information from index mapping
  • get_field_min_max() - get min/max values for numeric metadata fields
  • get_field_unique_values() - get unique values for a field with pagination and content-based filtering
  • query_sql() - execute SQL queries against OpenSearch, with support for multiple response formats (JSON, CSV, JDBC, RAW); see the usage sketch after this list
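
A rough usage sketch of these methods, for illustration only: the method names come from this PR, but the constructor arguments, parameter names, filter shapes, and return values shown here are assumptions and may differ from the merged API.

from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# All parameter names below are assumed for illustration.
store = OpenSearchDocumentStore(hosts="http://localhost:9200", index="documents")

# Count documents matching a Haystack-style filter
n_news = store.count_documents_by_filter(
    filters={"field": "meta.category", "operator": "==", "value": "news"}
)

# Metadata introspection helpers
distinct_counts = store.count_distinct_values_by_filter(fields=["category"])
fields_info = store.get_fields_info()
min_price, max_price = store.get_field_min_max(field="price")
categories = store.get_field_unique_values(field="category", size=20)

# SQL over the index; non-JSON formats (CSV, JDBC, RAW) trigger a raw HTTP request
csv_rows = store.query_sql(
    query="SELECT category, COUNT(*) FROM documents GROUP BY category",
    format="csv",
)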

How did you test it?

  • added integration tests covering the new methods, for both the sync and async versions

Notes for the reviewer

  • added the httpx>=0.28.1 dependency
  • the query_sql() method performs a raw HTTP request (via httpx) if the specified response format is not JSON; a minimal sketch of such a request follows
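
A minimal sketch of such a raw request, assuming the documented OpenSearch SQL plugin endpoint (_plugins/_sql) and leaving out authentication and TLS handling; the helper name is hypothetical:

import httpx

def run_sql_csv(host: str, query: str) -> str:
    # POST the SQL statement to the OpenSearch SQL plugin and ask for CSV output.
    response = httpx.post(
        f"{host}/_plugins/_sql",
        params={"format": "csv"},
        json={"query": query},
        timeout=30.0,
    )
    response.raise_for_status()
    return response.text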

Checklist

@github-actions github-actions bot added the integration:opensearch and type:documentation labels Jan 5, 2026
@davidsbatista davidsbatista changed the title Feat/add count filtering to open search document store feat: adding count with filtering operations to open search document store Jan 5, 2026
@davidsbatista davidsbatista changed the title feat: adding count with filtering operations to open search document store feat: adding count with filtering operations to OpenSearchDocumentStore Jan 5, 2026
@davidsbatista davidsbatista marked this pull request as ready for review January 6, 2026 11:16
@davidsbatista davidsbatista requested a review from a team as a code owner January 6, 2026 11:16
@davidsbatista davidsbatista requested review from sjrl and removed request for a team January 6, 2026 11:16
@sjrl sjrl requested a review from tstadel January 7, 2026 08:36
@sjrl
Contributor

sjrl commented Jan 7, 2026

Hey @tstadel, I'd also appreciate your review on this, since we want to make sure it will work in the platform as well.

Comment on lines 352 to 376
        # Fields that are not metadata (should stay at top level)
        non_meta_fields = {"id", "content", "embedding", "blob", "sparse_embedding", "score"}

        for hit in hits:
-           data = hit["_source"]
+           data = hit["_source"].copy()
+
+           # Reconstruct metadata dict from flattened fields
+           meta = {}
+           fields_to_remove = []
+           for key, value in data.items():
+               if key not in non_meta_fields:
+                   meta[key] = value
+                   fields_to_remove.append(key)
+
+           # Remove metadata fields from top level and add them to meta
+           for key in fields_to_remove:
+               data.pop(key, None)
+
+           if meta:
+               data["meta"] = meta

            if "highlight" in hit:
-               data["metadata"]["highlighted"] = hit["highlight"]
+               if "meta" not in data:
+                   data["meta"] = {}
+               data["meta"]["highlighted"] = hit["highlight"]
Contributor

Could you explain what was happening before these changes? Before this, were we throwing away all meta information when reconstructing the Document?

Contributor

Also, could we add some integration tests in test_bm25_retriever.py and test_embedding_retriever.py to do a full check of all fields of a returned Document? It seems we are missing some tests confirming that returned Docs are reconstructed properly.

Contributor

Also, it seems there is another function, _deserialize_document, that contains the same logic but doesn't seem to be used anywhere. Could we remove it?

Contributor Author

This is not needed and was over-engineered.

I've added extensive tests to ensure that both BM25 and Embedding retrievers can store and retrieve documents with "complex" metadata. It's working with and without these changes. I will revert it.

Thanks for spotting this!
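
A sketch of the kind of round-trip check described above; the test name and the document_store fixture are assumptions:

from haystack import Document

def test_retrieval_preserves_complex_metadata(document_store):
    # Write a document with nested/typed metadata and read it back via a filter.
    doc = Document(
        content="OpenSearch flattens metadata fields at indexing time.",
        meta={"category": "news", "tags": ["a", "b"], "rating": 4.5},
    )
    document_store.write_documents([doc])
    retrieved = document_store.filter_documents(
        filters={"field": "meta.category", "operator": "==", "value": "news"}
    )
    assert retrieved[0].meta == doc.meta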

"""
Builds cardinality aggregations for all metadata fields in the index mapping.
"""
special_fields = {"content", "embedding", "id", "score", "blob", "sparse_embedding"}
Contributor

Seems like this set of fields is reused a few times. Perhaps we could make it a global variable at the top of this file (or a class attribute) so we can have one source of truth?
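
For example (the constant name is just a suggestion):

# Single source of truth for fields that are not user metadata
NON_METADATA_FIELDS = {"id", "content", "embedding", "blob", "sparse_embedding", "score"}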

    @staticmethod
    def _build_cardinality_aggregations(index_mapping: dict[str, Any]) -> dict[str, Any]:
        """
        Builds cardinality aggregations for all metadata fields in the index mapping.
Contributor

I think it could be helpful to link to the OpenSearch docs on cardinality aggregations (https://docs.opensearch.org/latest/aggregations/metric/cardinality/) in the docstring.
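
For reference, a cardinality aggregation request body looks roughly like this; the field names are illustrative:

# Approximate distinct-value counts per metadata field via cardinality aggregations.
body = {
    "size": 0,
    "aggs": {
        "category": {"cardinality": {"field": "category"}},
        "author": {"cardinality": {"field": "author"}},
    },
}
# client.search(index="documents", body=body)["aggregations"]["category"]["value"]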

@davidsbatista davidsbatista requested a review from sjrl January 9, 2026 11:21

Labels

integration:opensearch, type:documentation


Development

Successfully merging this pull request may close these issues.

add the following operations to OpenSearchDocumentStore

3 participants