Feature/fix opensearch vector mapping #2399

Merged

mauricioalarcon
Contributor

@mauricioalarcon mauricioalarcon commented Mar 18, 2025

Description

This PR also includes the changes proposed in #2376; I'm closing that one in favor of this one. (Mauricio)

The current OpenSearch configuration requires username and password for authentication, which is not compliant with security policies in many enterprises that enforce AWS IAM-based authentication (e.g., AWS SigV4 via SAML or IAM roles).

This feature request proposes adding support for AWS authentication methods such as AWSV4SignerAuth or AWS4Auth, which are already supported by the opensearch-py library. This would enable seamless authentication via AWS IAM roles, improving security, compliance, and ease of integration with AWS-hosted OpenSearch domains.
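As a rough sketch of what IAM-based authentication could look like with opensearch-py: `AWSV4SignerAuth` and `boto3.Session().get_credentials()` are real library calls, while `make_aws_auth` and `client_kwargs` are hypothetical helpers written for this illustration and are not the code in this PR. The AWS imports are kept inside the function so the rest of the snippet runs without those packages installed.

```python
from typing import Any, Optional


def make_aws_auth(region: str, service: str = "es") -> Any:
    """Build a SigV4 signer from the default boto3 credential chain.

    service is "es" for managed OpenSearch domains and "aoss" for
    OpenSearch Serverless.
    """
    # Lazy imports: only needed when AWS authentication is actually used.
    import boto3
    from opensearchpy import AWSV4SignerAuth

    credentials = boto3.Session().get_credentials()
    return AWSV4SignerAuth(credentials, region, service)


def client_kwargs(
    host: str,
    port: int = 443,
    *,
    http_auth: Optional[Any] = None,
    user: Optional[str] = None,
    password: Optional[str] = None,
) -> dict:
    """Keyword arguments for opensearchpy.OpenSearch (hypothetical helper).

    An explicit http_auth object (for example a SigV4 signer) takes
    precedence over a username/password pair.
    """
    if http_auth is None and user is not None:
        http_auth = (user, password)
    kwargs: dict = {
        "hosts": [{"host": host, "port": port}],
        "use_ssl": True,
        "verify_certs": True,
    }
    if http_auth is not None:
        kwargs["http_auth"] = http_auth
    return kwargs


# Usage (assumes opensearch-py and boto3 are installed, region is an example):
# from opensearchpy import OpenSearch, RequestsHttpConnection
# client = OpenSearch(
#     connection_class=RequestsHttpConnection,
#     **client_kwargs("my-domain.example.com", http_auth=make_aws_auth("us-east-1")),
# )
```

Accepting a ready-made `http_auth` object keeps the username/password path working for existing users while letting IAM users pass a signer instead.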

Fixes #2375

Issue

The current implementation of the OpenSearch integration has a critical limitation with vector search filtering. The create_index() method creates OpenSearch indices without explicitly specifying a vector engine, so OpenSearch defaults to nmslib. That engine doesn't support query filters during search; it only allows post-query filtering, which is less efficient.

When filters are applied with nmslib, the system:

  1. First performs the vector similarity search
  2. Then applies filters to the results after the search is complete

This approach is inefficient because it processes and ranks potentially irrelevant vectors that will later be filtered out.
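The difference in ordering can be illustrated with a toy brute-force search. This is only a sketch of the two step orders, not of HNSW or of OpenSearch internals; the documents, the `user_id` field, and both function names are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical documents: two belong to "alice", two to "bob".
DOCS = [
    {"id": "a", "vector": [1.0, 0.0], "user_id": "alice"},
    {"id": "b", "vector": [0.9, 0.1], "user_id": "bob"},
    {"id": "c", "vector": [0.8, 0.2], "user_id": "bob"},
    {"id": "d", "vector": [0.0, 1.0], "user_id": "alice"},
]


def post_filter_search(query, k, user_id):
    """Rank every document first, keep the top k, then drop non-matching ones."""
    ranked = sorted(DOCS, key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k] if d["user_id"] == user_id]


def filter_during_search(query, k, user_id):
    """Restrict the candidates to matching documents, then rank and keep the top k."""
    candidates = [d for d in DOCS if d["user_id"] == user_id]
    ranked = sorted(candidates, key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


print(post_filter_search([1.0, 0.0], k=2, user_id="alice"))    # ['a']
print(filter_during_search([1.0, 0.0], k=2, user_id="alice"))  # ['a', 'd']
```

In the post-filter variant, one of the two result slots is spent on a document that is then discarded, so the work done to rank it is wasted.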

Solution

This PR updates the create_index() method to explicitly configure the Lucene engine with HNSW algorithm for vector search, matching the configuration already present in the create_col() method:

"vector": {
    "type": "knn_vector", 
    "dimension": self.vector_dim,
    "method": {"engine": "lucene", "name": "hnsw", "space_type": "cosinesimil"}
}
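For context, a minimal sketch of how a full index body could be assembled around that field mapping. `knn_index_body` is a hypothetical helper, the index name in the usage comment is an example, and the payload fields the real integration stores are left out; `index.knn` is the OpenSearch setting that enables k-NN on an index.

```python
def knn_index_body(vector_dim: int, engine: str = "lucene") -> dict:
    """Build a create-index request body with an explicit k-NN engine."""
    return {
        # k-NN has to be enabled at the index level for knn_vector fields.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": vector_dim,
                    "method": {
                        "engine": engine,
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                    },
                }
            }
        },
    }


# Usage (assumes an existing opensearch-py client):
# client.indices.create(index="mem0", body=knn_index_body(1536))
```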

Benefits

With this change, the OpenSearch integration now:

  1. Supports true filter-during-search operations - Filters are applied during the vector similarity search, not after
  2. Improves search performance - Only relevant vectors (those matching the filter) are processed and ranked
  3. Creates consistency between the create_index() and create_col() methods, both using the same vector engine
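A filter-during-search request places the filter inside the `knn` clause, next to the query vector. The sketch below builds such a request body; `filtered_knn_query` and the `user_id` filter field are illustrative assumptions, not names taken from this PR.

```python
def filtered_knn_query(query_vector, k, filters=None, field="vector"):
    """Build a k-NN search body whose filter is evaluated during the search."""
    knn_clause = {"vector": list(query_vector), "k": k}
    if filters:
        # One exact-match term clause per filter key, all of which must hold.
        must = [{"term": {name: value}} for name, value in filters.items()]
        knn_clause["filter"] = {"bool": {"must": must}}
    return {"size": k, "query": {"knn": {field: knn_clause}}}


# Usage (assumes an existing opensearch-py client and index name):
# body = filtered_knn_query([0.1, 0.2, 0.3], k=5, filters={"user_id": "alice"})
# response = client.search(index="mem0", body=body)
```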

Technical Details

  • The Lucene engine with HNSW algorithm is well-suited for most vector search use cases
  • This configuration uses cosine similarity as the distance metric, which works well for most embedding models
  • The engine configuration happens at index creation time and can't be changed for existing indices
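Because the engine is fixed at creation time, indices created before this change keep their old configuration. One possible way to detect them is to inspect the mapping returned by `client.indices.get_mapping`; the helpers below are hypothetical and assume the usual response shape, `{index_name: {"mappings": {"properties": {...}}}}`.

```python
def vector_engine(mapping_response: dict, index: str, field: str = "vector"):
    """Return the engine declared for a knn_vector field, or None if none is set."""
    properties = mapping_response.get(index, {}).get("mappings", {}).get("properties", {})
    return properties.get(field, {}).get("method", {}).get("engine")


def needs_reindex(mapping_response: dict, index: str, wanted: str = "lucene") -> bool:
    """True when the index does not explicitly declare the wanted engine."""
    return vector_engine(mapping_response, index) != wanted


# Usage (assumes an existing opensearch-py client and index name):
# mapping = client.indices.get_mapping(index="mem0")
# if needs_reindex(mapping, "mem0"):
#     ...  # create a new index with the Lucene mapping and copy the documents over
```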

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g. code style improvements, linting)
  • Documentation update

How Has This Been Tested?


  • Unit Test

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed

This commit introduces a new test validating the initialization of `OpenSearchDB` with `AWSV4SignerAuth` for HTTP authentication. Additionally, it updates the `poetry.lock` file to reflect dependency changes with Poetry 2.1.1.

Related to mem0ai#2375

Member

@Dev-Khant Dev-Khant left a comment

Hey @mauricioalarcon, thanks for this. The changes look good to me. Can you please resolve the merge conflict in the poetry.lock file?

@mauricioalarcon
Contributor Author

@Dev-Khant Thank you so much -- I've just adjusted poetry.lock

@Dev-Khant
Member

Hey @mauricioalarcon Thanks for updating it, PR looks good to me.

@Dev-Khant Dev-Khant merged commit 7b51632 into mem0ai:main Mar 20, 2025
5 of 9 checks passed