Skip to content

Conversation

@daric93
Copy link

@daric93 daric93 commented Dec 5, 2025

Related Issues

Valkey Document Store
#2574

Proposed Changes:

Implemented ValkeyDocumentStore as a built-in integration for Haystack. This component follows Haystack’s DocumentStore interface using Valkey’s search module for vector similarity and metadata-based filtering.

Valkey supports all essential capabilities required by the DocumentStore interface.
How did you test it?

  • unit tests
  • integration tests
  • manual test (running example with haystack integration)

Checklist

@daric93 daric93 requested a review from a team as a code owner December 5, 2025 19:54
@daric93 daric93 requested review from sjrl and removed request for a team December 5, 2025 19:54
@github-actions github-actions bot added topic:CI type:documentation Improvements or additions to documentation labels Dec 5, 2025
@CLAassistant
Copy link

CLAassistant commented Dec 5, 2025

CLA assistant check
All committers have signed the CLA.

@daric93 daric93 force-pushed the feature/valkey-document-store branch 3 times, most recently from d9c48f3 to 2a84745 Compare December 8, 2025 16:00

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Copyright [yyyy] [name of copyright owner]
Copyright 2025 deepset GmbH

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,109 @@
# valkey-haystack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We keep our Readmes in this repo bare-bones (e.g. look at the opensearch example) and the content you have here will go into in our haystack-integrations repo (e.g. look at the opensearch integration page).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@sjrl
Copy link
Contributor

sjrl commented Dec 19, 2025

@daric93 thanks for the contribution!

Some feedback:

  • Update the labeler.yml file to include integration:valkey
  • Remove the file integrations/__init__.py which is here. This is purposely omitted since this messes up how namespacing works for our core integration packages.

I'll continue with a more in-depth review, but we are quite busy so this might take some time.

Comment on lines 33 to 34
os: [ubuntu-latest] # Valkey service container only available on Linux
python-version: ["3.9", "3.13"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in the process of deprecating python 3.9 support. So following this PR #2616 could you update the python-version being tested here to 3.10?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

dynamic = ["version"]
description = ''
readme = "README.md"
requires-python = ">=3.9,<3.13"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed i the github workflow we test on python 3.13 but here you are saying valkey doesn't support 3.13. If it does we should update this to requires-python = ">=3.9"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

requires-python = ">=3.9,<3.13"
license = "Apache-2.0"
keywords = []
authors = [{ name = "John Doe", email = "[email protected]" }]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
authors = [{ name = "John Doe", email = "[email protected]" }]
authors = [{ name = "deepset", email = "[email protected]" }]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 29 to 35
test = [
"pytest>=7.4",
"pytest-asyncio>=0.23",
"pytest-cov",
"pytest-rerunfailures",
"mypy",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed since it's defined under [tool.hatch.envs.test], so let's remove

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 36 to 44
examples = [
"markdown-it-py",
"mdit_plain",
"haystack-ai>=2.11.0",
"numpy<2.0.0",
"torch==2.2.2; sys_platform == 'darwin' and platform_machine == 'x86_64'",
"torch>=2.0.0; sys_platform != 'darwin' or platform_machine != 'x86_64'",
"sentence-transformers>=5.0.0",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't add dependencies for examples in the pyproject.toml. So let's remove it from here, you can make a comment in the example files if there are extra deps to be installed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 115 to 118
# Pipelines serialized with old versions of the component might not
# have the filter_policy field.
if filter_policy := data["init_parameters"].get("filter_policy"):
data["init_parameters"]["filter_policy"] = FilterPolicy.from_str(filter_policy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dev comment copied from an existing integration. Probably fine to leave the if check but I'd drop the dev comment since an old version of this component doesn't exist yet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 142 to 145
# Security and authentication configuration
use_tls: bool = False,
username: str | None = None,
password: str | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments here. If you are planning to use the type operator | then we should make the min version of this integration python 3.10 and update the github workflow to test with python 3.10 and update pyproject.toml to run ruff using py310. We are in the process of deprecating python 3.9 from the Haystack ecosystem so it'd be fine if we make this new integration require python 3.10.

Second comment:
For sensitive information like username and password we use our Secret class as to make sure we never store this type of information as plain strings especially during serialization. Please checkout the OpenSearch integration to see how auth is handled. See how we use Secret for both username and password here. Note that this will impact the to_dict and from_dict methods to properly handle Secret serialization and deserialization. So check out how OpenSearch handles that here and here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

logger.error(msg)
raise ValkeyDocumentStoreError(msg) from e

async def async_filter_documents(self, filters: dict[str, Any] | None = None) -> list[Document]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the async counterparts of sync methods we always put the async modifier at the end so this should be filter_documents_async

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

Comment on lines 517 to 524
docs = await document_store.async_filter_documents(
filters={"field": "meta.category", "operator": "==", "value": "news"}
)

# Filter by numeric range
docs = await document_store.async_filter_documents(
filters={"field": "meta.priority", "operator": ">=", "value": 5}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to update these examples when the name is changed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

logger.error(msg)
raise ValkeyDocumentStoreError(msg) from e

async def async_count_documents(self) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as here

Suggested change
async def async_count_documents(self) -> int:
async def count_documents_async(self) -> int:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

try:
# Valkey only performs vector similarity search
# here we are querying with a dummy vector and the max compatible limit
limit = self.count_documents()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be using the async version of count documents here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 808 to 815
def search(
self,
embedding: list[float],
filters: dict[str, Any] | None = None,
limit: int = 10,
*,
with_embedding: bool = True,
) -> list[Document]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not part of our normal protocol for DocumentStores so I'd be inclined to remove this function. I do see that we have similar functions in Chroma and Astra but not in any of our other document stores.

@anakin87 I'd appreciate your opinion on this too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it's not problematic to add more methods even if not included in the Protocol.

But for consistency with other document stores, I'd call this _embedding_retrieval. See for example

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

msg = f"Failed to delete all documents: {e}"
raise ValkeyDocumentStoreError(msg) from e

async def async_delete_all_documents(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def async_delete_all_documents(self) -> None:
async def delete_all_documents_async(self) -> None:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed


return written_count

async def async_write_documents(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def async_write_documents(
async def write_documents_async(

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

"""
self._validate_documents(documents)
if len(documents) == 0:
logger.warning("Calling ValkeyDocumentStore.write_documents() with empty list")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.warning("Calling ValkeyDocumentStore.write_documents() with empty list")
logger.warning("Calling ValkeyDocumentStore.write_documents_async() with empty list")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 121 to 125
@component.output_types(documents=list[Document])
def run(
self,
query_embedding: list[float],
filters: Optional[dict[str, Any]] = None,
top_k: Optional[int] = None,
) -> dict[str, list[Document]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you went to the trouble of adding async versions of all the underlying document store methods could we also add a run_async to this component?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added


from .document_store import ValkeyDocumentStore, ValkeyDocumentStoreError

__all__ = ("ValkeyDocumentStore", "ValkeyDocumentStoreError")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__all__ = ("ValkeyDocumentStore", "ValkeyDocumentStoreError")
__all__ = ["ValkeyDocumentStore", "ValkeyDocumentStoreError"]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@sjrl
Copy link
Contributor

sjrl commented Dec 19, 2025

Also the test folder structure should be updated. We follow the pattern that there should be one test file per real file in the integration. So we should only have:

  • test_document_store.py
  • test_document_store_async.py
  • test_embedding_retriever.py
  • test_filters.py

So rename your existing files and merge the other ones (test_static_methods.py and test_dict_converters.py)

@@ -0,0 +1,608 @@
# ruff: noqa: S110
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tests in general, I recommend following these implementation tips.
As described in the tips, it's important to inherit testing mixins from https://github.com/deepset-ai/haystack/blob/main/haystack/testing/document_store.py

Pgvector tests are a good example

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@daric93 daric93 force-pushed the feature/valkey-document-store branch from 2a84745 to ccca143 Compare December 23, 2025 00:36
@daric93
Copy link
Author

daric93 commented Dec 23, 2025

Thanks for the review! I have addressed all the comments. Please take another look when you have time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:CI type:documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants