-
Notifications
You must be signed in to change notification settings - Fork 204
Feature/valkey document store #2589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d9c48f3 to
2a84745
Compare
integrations/valkey/LICENSE.txt
Outdated
|
|
||
| To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. | ||
|
|
||
| Copyright [yyyy] [name of copyright owner] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Copyright [yyyy] [name of copyright owner] | |
| Copyright 2025 deepset GmbH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| @@ -0,0 +1,109 @@ | |||
| # valkey-haystack | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
@daric93 thanks for the contribution! Some feedback:
I'll continue with a more in-depth review, but we are quite busy so this might take some time. |
.github/workflows/valkey.yml
Outdated
| os: [ubuntu-latest] # Valkey service container only available on Linux | ||
| python-version: ["3.9", "3.13"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are in the process of deprecating python 3.9 support. So following this PR #2616 could you update the python-version being tested here to 3.10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
integrations/valkey/pyproject.toml
Outdated
| dynamic = ["version"] | ||
| description = '' | ||
| readme = "README.md" | ||
| requires-python = ">=3.9,<3.13" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed i the github workflow we test on python 3.13 but here you are saying valkey doesn't support 3.13. If it does we should update this to requires-python = ">=3.9"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
integrations/valkey/pyproject.toml
Outdated
| requires-python = ">=3.9,<3.13" | ||
| license = "Apache-2.0" | ||
| keywords = [] | ||
| authors = [{ name = "John Doe", email = "[email protected]" }] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| authors = [{ name = "John Doe", email = "[email protected]" }] | |
| authors = [{ name = "deepset", email = "[email protected]" }] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
integrations/valkey/pyproject.toml
Outdated
| test = [ | ||
| "pytest>=7.4", | ||
| "pytest-asyncio>=0.23", | ||
| "pytest-cov", | ||
| "pytest-rerunfailures", | ||
| "mypy", | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed since it's defined under [tool.hatch.envs.test], so let's remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
integrations/valkey/pyproject.toml
Outdated
| examples = [ | ||
| "markdown-it-py", | ||
| "mdit_plain", | ||
| "haystack-ai>=2.11.0", | ||
| "numpy<2.0.0", | ||
| "torch==2.2.2; sys_platform == 'darwin' and platform_machine == 'x86_64'", | ||
| "torch>=2.0.0; sys_platform != 'darwin' or platform_machine != 'x86_64'", | ||
| "sentence-transformers>=5.0.0", | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't add dependencies for examples in the pyproject.toml. So let's remove it from here, you can make a comment in the example files if there are extra deps to be installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| # Pipelines serialized with old versions of the component might not | ||
| # have the filter_policy field. | ||
| if filter_policy := data["init_parameters"].get("filter_policy"): | ||
| data["init_parameters"]["filter_policy"] = FilterPolicy.from_str(filter_policy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dev comment copied from an existing integration. Probably fine to leave the if check but I'd drop the dev comment since an old version of this component doesn't exist yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| # Security and authentication configuration | ||
| use_tls: bool = False, | ||
| username: str | None = None, | ||
| password: str | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments here. If you are planning to use the type operator | then we should make the min version of this integration python 3.10 and update the github workflow to test with python 3.10 and update pyproject.toml to run ruff using py310. We are in the process of deprecating python 3.9 from the Haystack ecosystem so it'd be fine if we make this new integration require python 3.10.
Second comment:
For sensitive information like username and password we use our Secret class as to make sure we never store this type of information as plain strings especially during serialization. Please checkout the OpenSearch integration to see how auth is handled. See how we use Secret for both username and password here. Note that this will impact the to_dict and from_dict methods to properly handle Secret serialization and deserialization. So check out how OpenSearch handles that here and here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| logger.error(msg) | ||
| raise ValkeyDocumentStoreError(msg) from e | ||
|
|
||
| async def async_filter_documents(self, filters: dict[str, Any] | None = None) -> list[Document]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the async counterparts of sync methods we always put the async modifier at the end so this should be filter_documents_async
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
| docs = await document_store.async_filter_documents( | ||
| filters={"field": "meta.category", "operator": "==", "value": "news"} | ||
| ) | ||
|
|
||
| # Filter by numeric range | ||
| docs = await document_store.async_filter_documents( | ||
| filters={"field": "meta.priority", "operator": ">=", "value": 5} | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to update these examples when the name is changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
| logger.error(msg) | ||
| raise ValkeyDocumentStoreError(msg) from e | ||
|
|
||
| async def async_count_documents(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as here
| async def async_count_documents(self) -> int: | |
| async def count_documents_async(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
| try: | ||
| # Valkey only performs vector similarity search | ||
| # here we are querying with a dummy vector and the max compatible limit | ||
| limit = self.count_documents() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be using the async version of count documents here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| def search( | ||
| self, | ||
| embedding: list[float], | ||
| filters: dict[str, Any] | None = None, | ||
| limit: int = 10, | ||
| *, | ||
| with_embedding: bool = True, | ||
| ) -> list[Document]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not part of our normal protocol for DocumentStores so I'd be inclined to remove this function. I do see that we have similar functions in Chroma and Astra but not in any of our other document stores.
@anakin87 I'd appreciate your opinion on this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, it's not problematic to add more methods even if not included in the Protocol.
But for consistency with other document stores, I'd call this _embedding_retrieval. See for example
Line 1192 in 59c8b0d
| def _embedding_retrieval( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
| msg = f"Failed to delete all documents: {e}" | ||
| raise ValkeyDocumentStoreError(msg) from e | ||
|
|
||
| async def async_delete_all_documents(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| async def async_delete_all_documents(self) -> None: | |
| async def delete_all_documents_async(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
|
|
||
| return written_count | ||
|
|
||
| async def async_write_documents( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| async def async_write_documents( | |
| async def write_documents_async( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
| """ | ||
| self._validate_documents(documents) | ||
| if len(documents) == 0: | ||
| logger.warning("Calling ValkeyDocumentStore.write_documents() with empty list") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| logger.warning("Calling ValkeyDocumentStore.write_documents() with empty list") | |
| logger.warning("Calling ValkeyDocumentStore.write_documents_async() with empty list") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| @component.output_types(documents=list[Document]) | ||
| def run( | ||
| self, | ||
| query_embedding: list[float], | ||
| filters: Optional[dict[str, Any]] = None, | ||
| top_k: Optional[int] = None, | ||
| ) -> dict[str, list[Document]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you went to the trouble of adding async versions of all the underlying document store methods could we also add a run_async to this component?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
|
||
| from .document_store import ValkeyDocumentStore, ValkeyDocumentStoreError | ||
|
|
||
| __all__ = ("ValkeyDocumentStore", "ValkeyDocumentStoreError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| __all__ = ("ValkeyDocumentStore", "ValkeyDocumentStoreError") | |
| __all__ = ["ValkeyDocumentStore", "ValkeyDocumentStoreError"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
Also the test folder structure should be updated. We follow the pattern that there should be one test file per real file in the integration. So we should only have:
So rename your existing files and merge the other ones ( |
| @@ -0,0 +1,608 @@ | |||
| # ruff: noqa: S110 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For tests in general, I recommend following these implementation tips.
As described in the tips, it's important to inherit testing mixins from https://github.com/deepset-ai/haystack/blob/main/haystack/testing/document_store.py
Pgvector tests are a good example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
2a84745 to
ccca143
Compare
|
Thanks for the review! I have addressed all the comments. Please take another look when you have time. |
Related Issues
Valkey Document Store
#2574
Proposed Changes:
Implemented ValkeyDocumentStore as a built-in integration for Haystack. This component follows Haystack’s DocumentStore interface using Valkey’s search module for vector similarity and metadata-based filtering.
Valkey supports all essential capabilities required by the DocumentStore interface.
How did you test it?
Checklist