Fix QdrantClient Document import issue and improve text processing for STORM #311
Open
Tim-nocode wants to merge 1 commit into stanford-oval:main from
Summary:
This commit updates the STORM repository to work with the latest versions of qdrant_client by:
Replacing the deprecated Document import from qdrant_client with PointStruct.
Ensuring compatibility with RecursiveCharacterTextSplitter from LangChain by converting PointStruct into LangChain Document.
Fixing potential issues with CSV parsing and content chunking before vectorization.
Key Changes:
1. Fixed incompatibility with newer qdrant_client versions
Removed:
from qdrant_client import Document
Reason: Document was removed in newer versions of qdrant_client, so the import fails.
Added instead:
from qdrant_client.models import PointStruct
Why? PointStruct is the correct way to structure documents before inserting them into Qdrant.
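For context, here is a minimal, self-contained sketch of how PointStruct is used with qdrant_client (the collection name, vector size, and values are hypothetical):
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # local in-memory instance, for illustration only
client.create_collection(
    collection_name="demo",  # hypothetical collection name
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)
client.upsert(
    collection_name="demo",
    points=[PointStruct(id=0, vector=[0.1, 0.2, 0.3, 0.4], payload={"content": "hello"})],
)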
2. Updated document processing to avoid conflicts with LangChain
Old version:
documents = [
Document(
page_content=row[content_column],
metadata={
"title": row.get(title_column, ""),
"url": row[url_column],
"description": row.get(desc_column, ""),
},
)
for row in df.to_dict(orient="records")
]
New version:
documents = [
PointStruct(
id=index, # Unique identifier
vector=[], # Empty vector (will be generated later)
payload={
"content": row[content_column],
"title": row.get(title_column, ""),
"url": row[url_column],
"description": row.get(desc_column, ""),
},
)
for index, row in enumerate(df.to_dict(orient="records"))
]
Why? This matches the point schema expected by newer qdrant_client versions and stores each metadata field as its own payload key.
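For illustration only (not part of this commit): before upserting, the empty vectors above would be filled with real embeddings, e.g. with a stand-in embed() function:
# Hypothetical stand-in for a real embedding model:
def embed(text: str) -> list[float]:
    return [0.0] * 384  # a real model would return meaningful values

for point in documents:  # `documents` is the PointStruct list built above
    point.vector = embed(point.payload["content"])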
3. Fixed compatibility with LangChain's RecursiveCharacterTextSplitter
Old version:
split_documents = text_splitter.split_documents(documents)
Issue: PointStruct does not have a page_content attribute, which text_splitter requires.
Fixed version:
from langchain.schema import Document as LangchainDocument
documents_langchain = [
LangchainDocument(
page_content=doc.payload["content"],
metadata=doc.payload
)
for doc in documents
]
split_documents = text_splitter.split_documents(documents_langchain)
Why? Converting each PointStruct into a LangChain Document restores the page_content attribute that RecursiveCharacterTextSplitter operates on.
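A short, self-contained example of the splitter on a converted document (chunk sizes and the URL are illustrative):
from langchain.schema import Document as LangchainDocument
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = [LangchainDocument(page_content="word " * 300, metadata={"url": "https://example.com"})]
chunks = text_splitter.split_documents(docs)  # each chunk inherits the original metadata
print(len(chunks), chunks[0].metadata["url"])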
4. Ensured correct CSV parsing and encoding
Added sep="|" and encoding="utf-8" in pd.read_csv():
df = pd.read_csv(file_path, sep="|", encoding="utf-8")
Why?
Without an explicit separator, pandas defaults to commas and reads each pipe-delimited row, including the header, as a single column.
Ensures compatibility with datasets that use | as a separator.
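As an illustrative guard (not part of this commit), the expected columns can be checked right after loading, which surfaces a missing content column immediately rather than during vectorization:
import pandas as pd

# file_path and the *_column names are the same variables used in the snippets above
df = pd.read_csv(file_path, sep="|", encoding="utf-8")
missing = {content_column, url_column} - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing expected columns: {missing}")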
5. Batch processing optimization
Ensured that data is properly batched before sending to Qdrant:
from tqdm import tqdm

num_batches = (len(split_documents) + batch_size - 1) // batch_size  # ceiling division
for i in tqdm(range(num_batches)):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(split_documents))
    qdrant.add_documents(
        documents=split_documents[start_idx:end_idx],
        batch_size=batch_size,
    )
Why? Prevents timeout errors when inserting large document sets, keeps memory usage bounded, and sends requests at a size the Qdrant API handles efficiently.
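As a quick sanity check of the ceiling division above: 1,000 chunks with batch_size = 64 give (1000 + 63) // 64 = 16 batches, the last holding the remaining 40 chunks. The expression is equivalent to math.ceil:
import math

batch_size = 64
for total in (1, 64, 65, 1000):
    assert (total + batch_size - 1) // batch_size == math.ceil(total / batch_size)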
Impact & Benefits:
✅ Fixes compatibility issues with the latest qdrant_client versions.
✅ Ensures correct document chunking for LangChain's text splitter.
✅ Prevents "Content column not found" errors in CSV parsing.
✅ Improves stability when inserting large documents into Qdrant.
This commit keeps STORM working with current Qdrant and LangChain releases while improving its document processing.
Next Steps:
Review and test with additional datasets.
Consider additional optimizations for embedding model selection.