Port to langchain #222

Conversation
Implements an agent-based architecture using LangChain 1.0.8+ with automated quality evaluation via LangSmith. This replaces the manual conversation generation workflow with quantitative metrics and continuous monitoring.

Key changes:
- Add LangChain and LangSmith dependencies
- Implement LangChainChatManager with RAG tools (a hedged sketch of this wiring follows below)
- Create automated evaluation scripts and evaluators
- Add comprehensive tests for new components
- Update documentation with migration guide and evaluation docs

Note: the CI/CD workflow file (.github/workflows/langsmith_evaluation.yml) needs to be added manually by a maintainer with workflow permissions.

This provides a path to enhanced observability, model flexibility, and continuous quality monitoring while maintaining compatibility with existing functionality.

Addresses codeforpdx#203

Co-authored-by: yangm2 <[email protected]>
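For orientation, here is a minimal, hedged sketch of the agent wiring this description implies. `create_agent` and `ChatVertexAI` are named elsewhere in this review; the tool name, its body, and the `vector_store` reference are illustrative assumptions, not the PR's actual code.

```python
# Hedged sketch only, not the PR's code. Assumes a module-level
# vector_store, as discussed in the inline comments below.
from langchain.agents import create_agent
from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI


@tool
def retrieve_state_law(query: str) -> str:
    """Retrieve state landlord-tenant law passages relevant to the query."""
    retriever = vector_store.as_retriever(search_kwargs={"k": 5})
    return "\n\n".join(doc.page_content for doc in retriever.invoke(query))


agent = create_agent(
    model=ChatVertexAI(model_name="gemini-2.5-pro"),
    tools=[retrieve_state_law],
)
```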
Claude finished @yangm2's task: Code Review Complete ✅

I've completed a comprehensive review of this LangChain migration PR. This is an ambitious architectural change that modernizes the codebase with significant benefits, but there are several critical issues that need to be addressed before merging.

🔴 Critical Issues (Must Fix)
1. Hardcoded Placeholder in Production Code
2. Broken Streaming Implementation
3. Non-Functional Tests
4. Evaluation Script Parameter Mismatch
5. Missing Integration in Main Application

🟡 High Priority Issues
- Security
- Architecture
- Type Safety

🟢 Code Quality Issues
- Maintainability
- Error Handling

✅ Positive Observations
- Excellent Documentation
- Good Engineering Practices
- Strategic Value

📊 Test Coverage Analysis
- Existing Tests
- Missing Tests

Recommendation: Add an integration test that validates the agent actually uses retrieval tools and generates proper citations (a hedged sketch follows at the end of this summary).

🔒 Security Assessment
- Identified Issues
- Mitigations Needed
- Existing Protections

⚡ Performance Considerations
- Potential Concerns
- Recommendations

📋 Pre-Merge Checklist
Before merging this PR, please address the issues listed above.

🎯 Recommendation
Status: Needs Work

This PR represents excellent architectural thinking and thorough documentation, but the critical implementation issues prevent it from being production-ready. The main concerns are the five critical issues listed above.

Suggested Path Forward: resolve the critical issues first, then the high-priority items. Once these are resolved, this will be a valuable modernization that sets up the project for future success.

Review Summary by Category
Total Issues: 20 items requiring attention (5 critical, 4 high priority, 11 quality improvements).

I've left inline comments on specific lines for all issues. Please review and address these before merging. Happy to answer questions or provide additional guidance!
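A hedged sketch of that integration test, assuming the `LangChainChatManager` interface described in the inline comments below (`generate_streaming_response` taking a message list plus `city`/`state`); the patch target and all names are assumptions:

```python
from unittest.mock import Mock, patch


def test_agent_uses_retrieval_and_cites():
    # Stub the retriever so the agent "retrieves" a known statute.
    with patch("tenantfirstaid.langchain_chat.vector_store") as mock_store:
        mock_doc = Mock(page_content="ORS 90.427 requires 30 days notice...")
        mock_store.as_retriever.return_value.invoke.return_value = [mock_doc]

        manager = LangChainChatManager()
        reply = "".join(
            manager.generate_streaming_response(
                [{"role": "user", "content": "How much notice before eviction?"}],
                city="portland",
                state="or",
            )
        )

    # The retrieval tool must have been exercised...
    mock_store.as_retriever.assert_called()
    # ...and the statute it returned must be cited in the answer.
    assert "ORS 90.427" in reply
```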
```python
    project_id=GOOGLE_CLOUD_PROJECT,
    region=GOOGLE_CLOUD_LOCATION,
    index_id=VERTEX_AI_DATASTORE,
    endpoint_id="fix-me-later",
```
Critical Issue: Hardcoded placeholder value

The `endpoint_id` parameter is set to `"fix-me-later"`, which will cause runtime errors when the vector store is actually used. This needs to be:
- Set to a real Vertex AI endpoint ID from your deployed vector search index
- Configured via environment variable like other GCP settings
- Properly documented in the setup guide
Recommendation:
```python
VERTEX_AI_ENDPOINT_ID = os.getenv("VERTEX_AI_ENDPOINT_ID")
if VERTEX_AI_ENDPOINT_ID is None:
    raise ValueError("VERTEX_AI_ENDPOINT_ID environment variable is not set.")

vector_store = VectorSearchVectorStoreDatastore.from_components(
    project_id=GOOGLE_CLOUD_PROJECT,
    region=GOOGLE_CLOUD_LOCATION,
    index_id=VERTEX_AI_DATASTORE,
    endpoint_id=VERTEX_AI_ENDPOINT_ID,
)
```

```python
MODEL = os.getenv("MODEL_NAME", "gemini-2.5-pro")
VERTEX_AI_DATASTORE = os.getenv("VERTEX_AI_DATASTORE")
GOOGLE_CLOUD_PROJECT = os.getenv("GOOGLE_CLOUD_PROJECT")
GOOGLE_CLOUD_LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION", "us-west1-c")
```
Issue: Inconsistent default location

`GOOGLE_CLOUD_LOCATION` defaults to `"us-west1-c"` here (line 27), but in the `ChatVertexAI` initialization (line 115) it defaults to `"us-central1"`. This inconsistency could cause issues:
- Vector store lookups might use a different region than the LLM
- Configuration errors may be hard to debug

Note also that `"us-west1-c"` is a zone, not a region; Vertex AI locations are regions such as `"us-west1"` or `"us-central1"`.

Recommendation:
Use a single default consistently across both, e.g. `"us-central1"`, which is more common for Vertex AI. A minimal sketch follows below.
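A minimal sketch of that fix, assuming both clients read one module-level constant (the `ChatVertexAI` keyword names are the library's standard ones, but treat the exact wiring as illustrative):

```python
import os

from langchain_google_vertexai import ChatVertexAI

# Single source of truth: a region (no zone suffix), defaulted once.
GOOGLE_CLOUD_LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")

llm = ChatVertexAI(model_name="gemini-2.5-pro", location=GOOGLE_CLOUD_LOCATION)
# ...and pass the same constant as region= to
# VectorSearchVectorStoreDatastore.from_components(...).
```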
```python
if GOOGLE_CLOUD_LOCATION is None:
    raise ValueError("GOOGLE_CLOUD_LOCATION environment variable is not set.")


vector_store = VectorSearchVectorStoreDatastore.from_components(
```
Potential Issue: Global vector store initialization

The vector store is initialized at module import time (lines 38-43), which means:
- Import errors will occur if environment variables are missing, even for unrelated imports
- All processes share the same vector store instance (may cause issues with connection pooling)
- Testing becomes harder because you can't easily mock this

Recommendation:
Consider lazy initialization in the `__init__` method of `LangChainChatManager`:

```python
class LangChainChatManager:
    def __init__(self):
        self.vector_store = VectorSearchVectorStoreDatastore.from_components(...)
        # ... rest of init
```

This would also make mocking easier for tests.
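An alternative sketch with the same effect: keep a single shared instance but defer construction until first use, so imports stay side-effect free (assumes the module's existing configuration constants):

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_vector_store():
    # Constructed on first call, then reused; tests can monkeypatch this
    # function (or call get_vector_store.cache_clear()) instead of the module.
    return VectorSearchVectorStoreDatastore.from_components(
        project_id=GOOGLE_CLOUD_PROJECT,
        region=GOOGLE_CLOUD_LOCATION,
        index_id=VERTEX_AI_DATASTORE,
        endpoint_id=VERTEX_AI_ENDPOINT_ID,
    )
```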
```python
rag = vector_store.as_retriever(
    search_kwargs={"k": 5},
    filter=f'city: ANY("{city.lower()}") AND state: ANY("{state.lower()}")',
```
Security Concern: SQL injection-style filter vulnerability

The filter strings are constructed using f-strings without any sanitization:

```python
filter=f'city: ANY("{city.lower()}") AND state: ANY("{state.lower()}")'
```

If `city` or `state` contain quotes or special characters, this could:
- Break the filter syntax
- Potentially allow filter injection attacks

Recommendation:
Add input validation:

```python
import re


def sanitize_filter_value(value: str) -> str:
    """Allow only alphanumerics, spaces, and hyphens in filter values."""
    # Quotes and other metacharacters are rejected outright, so no
    # escaping is needed afterwards.
    if not re.match(r"^[a-zA-Z0-9\s\-]+$", value):
        raise ValueError(f"Invalid characters in filter value: {value}")
    return value


# Then use:
filter=f'city: ANY("{sanitize_filter_value(city.lower())}") AND state: ANY("{sanitize_filter_value(state.lower())}")'
```

```python
current_query = messages[-1]["content"]


# Stream the agent response.
for chunk in agent.stream(
```
Critical Issue: Incorrect stream mode and response handling

Several problems with the streaming implementation:

1. Wrong stream mode: the code uses `stream_mode="messages"` but expects `"output"` in chunks (line 209). The correct mode for agent streaming is `stream_mode="values"` or `stream_mode="updates"`.
2. Incorrect chunk access: the code accesses `chunk["output"]`, but LangGraph agents don't return chunks in this format. You need to handle the actual message delta structure.
3. Missing error handling: there is no try/except around the streaming logic.

Recommendation:

```python
for chunk in agent.stream(
    {
        "messages": [HumanMessage(content=current_query)],
        "city": city,
        "state": state,
    },
    stream_mode="values",  # Get full state updates
):
    # Extract the latest AI message content.
    if "messages" in chunk:
        messages = chunk["messages"]
        if messages and isinstance(messages[-1], AIMessage):
            yield messages[-1].content
```

Also note: even this won't stream correctly, because you'd be yielding potentially duplicate content on each state update. Consider using `stream_mode="messages"` with proper delta handling instead (a hedged sketch follows below).
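For reference, a hedged sketch of that delta handling: in LangGraph, `stream_mode="messages"` yields `(message_chunk, metadata)` tuples where each `AIMessageChunk` carries only new tokens. The surrounding function and its inputs are assumptions:

```python
from langchain_core.messages import AIMessageChunk, HumanMessage


def stream_agent_deltas(agent, current_query: str, city: str, state: str):
    """Yield only the incremental text of each AI token chunk (sketch)."""
    for message_chunk, _metadata in agent.stream(
        {
            "messages": [HumanMessage(content=current_query)],
            "city": city,
            "state": state,
        },
        stream_mode="messages",
    ):
        if isinstance(message_chunk, AIMessageChunk) and message_chunk.content:
            yield message_chunk.content
```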
```python
# # Create prompt template with system message and conversation history.
# prompt = ChatPromptTemplate.from_messages(
#     [
#         ("system", system_prompt.text()),
#         MessagesPlaceholder(variable_name="chat_history", optional=True),
#         ("human", "{input}"),
#         MessagesPlaceholder(variable_name="agent_scratchpad"),
#     ]
# )
```
Code Smell: Commented-out code
This large block of commented-out code should either be:
- Removed if it's not needed
- Documented with a comment explaining why it's kept for reference
Keeping commented code in production reduces readability and creates confusion about the correct implementation.
```python
@pytest.fixture
def mock_vertex_ai():
    """Mock Vertex AI RAG responses."""
    with patch("tenantfirstaid.langchain_chat.rag.retrieve") as mock_retrieve:
```
Test Issue: Mock references non-existent code

The mock patches `tenantfirstaid.langchain_chat.rag.retrieve`, but looking at `langchain_chat.py`, there's no `rag` object with a `retrieve` method. The actual code uses `vector_store.as_retriever()`.

This test will not actually mock the intended functionality and may give false confidence.

Recommendation:

```python
@pytest.fixture
def mock_vertex_ai():
    """Mock Vertex AI RAG responses."""
    with patch("tenantfirstaid.langchain_chat.vector_store") as mock_store:
        mock_retriever = Mock()
        mock_doc = Mock()
        mock_doc.page_content = "ORS 90.427 requires 30 days notice..."
        mock_retriever.invoke.return_value = [mock_doc]
        mock_store.as_retriever.return_value = mock_retriever
        yield mock_store
```

```python
def test_retrieve_city_law_filters_correctly(mock_vertex_ai):
    """Test that city law retrieval uses correct filter."""
    d: Dict[str, str] = {
        "query": "eviction notice requirements",
        "city": "portland",
        "state": "or",
    }

    result = retrieve_city_law.invoke(d)

    # Verify filter was constructed correctly.
    call_args = mock_vertex_ai.call_args
    assert 'city: ANY("portland")' in str(call_args)
    assert 'state: ANY("or")' in str(call_args)
    assert "ORS 90.427" in result
```
Test Issue: Assertions don't match actual implementation

The test checks `mock_vertex_ai.call_args` (line 44), but the mock was set up on `rag.retrieve`, not on the retrieval tool functions. This test will likely fail or not test what it intends to test.

Also, the tool is invoked with a dictionary (lines 35-39) but the actual tool signature is `retrieve_city_law(query: str, city: str, state: str)`; this mismatch needs to be resolved. A hedged sketch of an aligned test follows below.
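A sketch of a test aligned with the corrected fixture above: the filter assertion now inspects the `as_retriever` call on the patched `vector_store`, and the dict-style `invoke` is kept, since LangChain tools accept a dict of their arguments. Details are assumptions about the tool's body:

```python
def test_retrieve_city_law_filters_correctly(mock_vertex_ai):
    """Sketch: assert against the patched vector_store, not rag.retrieve."""
    result = retrieve_city_law.invoke(
        {"query": "eviction notice requirements", "city": "portland", "state": "or"}
    )

    # Assumes the filter is passed as a keyword argument to as_retriever().
    call_kwargs = mock_vertex_ai.as_retriever.call_args.kwargs
    assert 'city: ANY("portland")' in call_kwargs["filter"]
    assert 'state: ANY("or")' in call_kwargs["filter"]
    assert "ORS 90.427" in result
```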
```python
# Evaluator 5: Citation Format (Heuristic).
def citation_format_evaluator(run, example) -> Evaluator:
```
Type Hint Issue: Incorrect return type

The function is annotated to return `Evaluator` (which is a class/type), but it actually returns a dictionary with evaluation results. This will cause type checking failures.

Recommendation:

```python
def citation_format_evaluator(run, example) -> dict:
    """Check if citations use proper HTML anchor tag format.

    Args:
        run: LangSmith run object containing outputs
        example: LangSmith example object (unused)

    Returns:
        Dictionary with evaluation results
    """
```

The same issue applies to `tool_usage_evaluator` (line 191) and `performance_evaluator` (line 222).
| """ | ||
| # Access trace to see which tools were called. | ||
| tool_calls = [] | ||
| for step in run.trace.get("steps", []): |
Potential Issue: Accessing trace structure may fail
The code accesses run.trace.get("steps", []) but this assumes the trace has a specific structure. If the trace format changes or isn't available:
- This will return an empty list (silent failure)
- The evaluator will always give a score of 0.0
Recommendation:
Add defensive checks and logging:
```python
if not hasattr(run, "trace") or not run.trace:
    return {
        "key": "tool_usage",
        "score": 0.0,
        "comment": "No trace available for evaluation",
    }

tool_calls = []
for step in run.trace.get("steps", []):
    if step.get("type") == "tool":
        tool_calls.append(step.get("name"))
```

```python
# Run agent on the first question.
response = agent.invoke(
    {
```
Bug: Incorrect parameter names for agent invocation

The agent is being invoked with parameters `input` and `chat_history` (lines 43-44), but looking at the `LangChainChatManager.generate_streaming_response()` implementation, the agent expects:

- `messages` (not `input`)
- context such as `city`/`state` (not `chat_history`)

This mismatch will cause the evaluation to fail at runtime.

Recommendation:

```python
response = agent.invoke(
    {
        "messages": [HumanMessage(content=inputs["first_question"])],
        "city": inputs["city"],
        "state": inputs["state"],
    }
)
```

```python
    ],
    experiment_prefix=experiment_prefix,
    max_concurrency=5,  # Run 5 evaluations in parallel.
    num_repetitions=num_samples,
```
Issue: Incorrect parameter name

The `evaluate()` function is called with `num_repetitions=num_samples` (line 92), but looking at the LangSmith API, this parameter is likely meant for running each example multiple times, not for sampling a subset.

To evaluate only a subset of examples, you should either:
- Slice the dataset before passing it to `evaluate()` (a sketch follows below)
- Use the `max_examples` parameter, if available

Also: the commented-out code (lines 97-102) should be removed or documented.
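A hedged sketch of the slicing approach, using `Client.list_examples(..., limit=...)` and passing the slice to `evaluate()`; the target function and evaluator list are placeholders, and `dataset_name`/`num_samples`/`experiment_prefix` are assumed to come from the surrounding script:

```python
from langsmith import Client, evaluate

client = Client()
# Take only the first num_samples examples instead of repeating all of them.
subset = list(client.list_examples(dataset_name=dataset_name, limit=num_samples))

results = evaluate(
    run_agent,  # placeholder target function
    data=subset,
    evaluators=[citation_format_evaluator],  # placeholder evaluator list
    experiment_prefix=experiment_prefix,
    max_concurrency=5,
)
```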
| / "generate_conversation" | ||
| / "tenant_questions_facts_full.csv" | ||
| ) | ||
| df = pd.read_csv(csv_path, encoding="cp1252") |
Encoding Issue: Hardcoded cp1252 encoding
The CSV is read with encoding="cp1252" (Windows-1252), which is unusual for modern datasets and may cause issues with:
- Cross-platform compatibility (Mac/Linux developers)
- Special characters in legal text
- Future data updates
Recommendation:
Use UTF-8 encoding (standard) or auto-detect:
```python
# Try UTF-8 first, fall back to cp1252 if needed.
try:
    df = pd.read_csv(csv_path, encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv(csv_path, encoding="cp1252")
```

Consider converting the source CSV to UTF-8 for consistency (a one-off conversion sketch follows below).
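A one-off conversion sketch: run once, then commit the re-encoded file. The path is assumed from the snippet above:

```python
import pandas as pd

# Read with the legacy encoding, write back as UTF-8.
df = pd.read_csv("tenant_questions_facts_full.csv", encoding="cp1252")
df.to_csv("tenant_questions_facts_full.csv", index=False, encoding="utf-8")
```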
```bash
export LANGSMITH_API_KEY=your-api-key
```
Documentation Issue: Incorrect environment variable name

The documentation lists `LANGSMITH_TRACING_V2=true`, but LangSmith's current convention uses `LANGCHAIN_TRACING_V2` (with "CHAIN", not "SMITH").

Please verify the correct variable name in the LangSmith documentation and update accordingly.
```diff
  pdf_buffer = BytesIO()
  pisa_status = pisa.CreatePDF(html_content, dest=pdf_buffer)
- if pisa_status.err:
+ if hasattr(pisa_status, "err"):
```
Good Fix: Improved error handling
Good defensive programming! The change from pisa_status.err to hasattr(pisa_status, "err") prevents AttributeError exceptions if the pisa library's API changes or returns unexpected types.
Minor suggestion: Consider also checking the truthiness:
```python
if hasattr(pisa_status, "err") and pisa_status.err:
    return None
```

```python
"python-dotenv",
"pandas>=2.3.0",
"vertexai>=1.43.0",
"langchain>=1.1.0",
```
Dependency Management: Consider version pinning
The new LangChain dependencies use minimum version constraints (>=), which could lead to:
- Breaking changes in future updates
- Inconsistent behavior across environments
- Difficult-to-reproduce bugs
Recommendation:
Consider using more restrictive version constraints for critical dependencies:
"langchain>=1.1.0,<2.0.0",
"langchain-google-vertexai>=3.1.0,<4.0.0",
"langsmith>=0.4.47,<0.5.0",Or use lock files to ensure reproducibility (which you already have with uv.lock).
| **Architecture Type**: Agent-based RAG with tool calling | ||
| - **Framework**: LangChain 1.0.8+ (monolithic package) | ||
| - **LLM Integration**: ChatVertexAI (langchain-google-vertexai 3.0.3+) | ||
| - **Agent Pattern**: `create_tool_calling_agent()` with custom RAG tools |
Documentation Issue: Incorrect agent function name

The documentation mentions `create_tool_calling_agent()`, but the actual code uses `create_agent()` from `langchain.agents` (see langchain_chat.py:153).

Please update to match the actual implementation.
````markdown
### Environment Variables
```bash
MODEL_NAME=gemini-2.5-pro  # LLM model name
```
````
Good Addition: Clear LangChain architecture documentation
Excellent documentation of the new architecture! The environment variables section is particularly helpful.
Minor suggestion: Consider adding a section about running the evaluation suite locally, since it's a key part of the quality assurance process:
````markdown
### Running Evaluations
```bash
# Run LangSmith evaluations (requires LANGSMITH_API_KEY)
uv run python scripts/run_langsmith_evaluation.py --num-samples 20
```
````
What type of PR is this? (check all applicable)
Description
Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.
Added/updated tests?
Documentation
Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?