
Conversation

@yangm2
Contributor

@yangm2 yangm2 commented Nov 25, 2025

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests have not been included
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

github-actions bot and others added 6 commits November 24, 2025 21:07
Implements agent-based architecture using LangChain 1.0.8+ with automated
quality evaluation via LangSmith. This replaces the manual conversation
generation workflow with quantitative metrics and continuous monitoring.

Key changes:
- Add LangChain and LangSmith dependencies
- Implement LangChainChatManager with RAG tools
- Create automated evaluation scripts and evaluators
- Add comprehensive tests for new components
- Update documentation with migration guide and evaluation docs

Note: CI/CD workflow file (.github/workflows/langsmith_evaluation.yml)
needs to be added manually by a maintainer with workflow permissions.

This provides a path to enhanced observability, model flexibility, and
continuous quality monitoring while maintaining compatibility with
existing functionality.

Addresses codeforpdx#203

Co-authored-by: yangm2 <[email protected]>
@yangm2 yangm2 self-assigned this Nov 25, 2025
@yangm2 yangm2 added documentation Improvements or additions to documentation enhancement New feature or request backend Bot implementation and other backend concerns labels Nov 25, 2025
@github-actions

github-actions bot commented Nov 25, 2025

Claude finished @yangm2's task


Code Review Complete ✅

I've completed a comprehensive review of this LangChain migration PR. This is an ambitious architectural change that modernizes the codebase with significant benefits, but there are several critical issues that need to be addressed before merging.


🔴 Critical Issues (Must Fix)

1. Hardcoded Placeholder in Production Code

  • Location: langchain_chat.py:42
  • Issue: endpoint_id="fix-me-later" will cause runtime failures
  • Impact: Vector store operations will fail
  • Action Required: Set proper VERTEX_AI_ENDPOINT_ID from environment

2. Broken Streaming Implementation

  • Location: langchain_chat.py:199-210
  • Issue: Incorrect stream_mode and chunk access pattern
  • Impact: Streaming responses won't work correctly
  • Action Required: Fix stream mode to "values" and handle message deltas properly

3. Non-Functional Tests

  • Location: test_langchain_chat.py:18, 33-47
  • Issue: Mocks reference non-existent code paths (rag.retrieve)
  • Impact: Tests pass but don't actually test the implementation
  • Action Required: Rewrite mocks to patch actual vector store methods

4. Evaluation Script Parameter Mismatch

  • Location: run_langsmith_evaluation.py:42-48
  • Issue: Agent invoked with wrong parameter names (input/chat_history vs messages/context)
  • Impact: Evaluation runs will fail at runtime
  • Action Required: Align parameters with actual agent interface

5. Missing Integration in Main Application

  • Location: app.py (unchanged)
  • Issue: New LangChainChatManager is never imported or used
  • Impact: The migration doesn't actually change the running application
  • Action Required: Update ChatView to use LangChainChatManager or document the rollout plan

🟡 High Priority Issues

Security

  1. Filter Injection Vulnerability (langchain_chat.py:63)
    • Unsanitized user input in filter strings (city/state)
    • Could break queries or enable filter injection
    • Recommendation: Add input validation and sanitization

Architecture

  1. Global State Initialization (langchain_chat.py:38)

    • Vector store initialized at module import time
    • Makes testing harder and could cause connection issues
    • Recommendation: Move to lazy initialization in __init__
  2. Inconsistent Default Locations (langchain_chat.py:27, 115)

    • Different defaults: us-west1-c vs us-central1
    • Could cause hard-to-debug region mismatch issues
    • Recommendation: Use single consistent default

Type Safety

  1. Incorrect Type Hints (langsmith_evaluators.py:151, 191, 222)
    • Functions return dict but annotated as returning Evaluator
    • Will cause mypy failures
    • Recommendation: Fix return type hints to dict

🟢 Code Quality Issues

Maintainability

  1. Commented Code (langchain_chat.py:142-150)

    • Large block of commented code reduces readability
    • Remove or document why it's kept
  2. Commented Output Code (run_langsmith_evaluation.py:97-102)

    • Production script with commented metrics output
    • Clean up or document
  3. Encoding Hardcoding (create_langsmith_dataset.py:24)

    • Hardcoded cp1252 encoding unusual for modern data
    • Consider UTF-8 with fallback

Error Handling

  1. Missing Defensive Checks (langsmith_evaluators.py:203)
    • Assumes trace structure exists without validation
    • Add checks before accessing run.trace

✅ Positive Observations

Excellent Documentation

  • Outstanding: Comprehensive documentation across multiple files
  • EVALUATION.md: Excellent guide to the evaluation framework
  • LANGCHAIN_MIGRATION.md: Clear migration phases and rollback plan
  • README_LANGCHAIN.md: Great developer documentation
  • Architecture diagram updates are clear and helpful

Good Engineering Practices

  1. Improved Error Handling (feedback.py:14)

    • Better defensive programming with hasattr() check
  2. Well-Structured Evaluation Framework

    • 7 comprehensive evaluators covering multiple quality dimensions
    • Good separation of concerns between heuristic and LLM-based evaluators
  3. Thoughtful Test Organization

    • Clear test structure (though implementations need fixes)
    • Good use of fixtures and mocking patterns

Strategic Value

  • Modernization to industry-standard patterns
  • Better observability with LangSmith integration
  • Automated quality evaluation replaces manual process
  • Positions codebase for future enhancements (model switching, advanced RAG)

📊 Test Coverage Analysis

Existing Tests

  • ✅ Unit tests for LangChainChatManager (needs mock fixes)
  • ✅ Unit tests for evaluators
  • ✅ Tests cover core functionality patterns

Missing Tests

  • ❌ Integration tests with real Vertex AI (or better mocks)
  • ❌ End-to-end streaming tests
  • ❌ Error handling tests (what happens when RAG fails?)
  • ❌ Location filtering accuracy tests
  • ❌ Agent tool selection tests

Recommendation: Add integration test that validates the agent actually uses retrieval tools and generates proper citations.
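As a lightweight starting point, the citation-format check itself can be exercised without Vertex AI at all. A minimal sketch (the anchor-tag pattern and helper name are assumptions based on the citation_format_evaluator described above, not code from this PR):

```python
import re

# Assumed citation format: HTML anchor tags, per the heuristic evaluator above.
CITATION_RE = re.compile(r'<a href="[^"]+">[^<]+</a>')

def has_citation(answer: str) -> bool:
    """Return True if the answer contains at least one anchor-tag citation."""
    return bool(CITATION_RE.search(answer))
```

A helper like this can back both the evaluator and a fast unit test, leaving the real integration test to verify tool selection end-to-end.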


🔒 Security Assessment

Identified Issues

  1. Filter Injection Risk: Unsanitized city/state in filter strings
  2. Input Validation: No validation on city/state values before use

Mitigations Needed

  • Add input sanitization for filter construction
  • Validate city/state against allowlist if possible
  • Add rate limiting on evaluation endpoints (if exposed)
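An allowlist check could be as simple as the following sketch (the state set and function name are illustrative, not from the PR):

```python
# Illustrative allowlist; a real deployment would enumerate all supported states.
SUPPORTED_STATES = {"or", "wa", "ca"}

def validate_state(state: str) -> str:
    """Normalize a state code and reject anything outside the allowlist."""
    normalized = state.strip().lower()
    if normalized not in SUPPORTED_STATES:
        raise ValueError(f"Unsupported state: {state!r}")
    return normalized
```

Because the return value can only ever be one of the allowlisted strings, it is safe to interpolate into a filter expression without further escaping.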

Existing Protections

  • ✅ Safety settings properly configured in LLM
  • ✅ Environment variable management for secrets
  • ✅ No hardcoded credentials

⚡ Performance Considerations

Potential Concerns

  1. Agent Overhead: Additional 100-200ms latency from agent reasoning
  2. Global Vector Store: All requests share single vector store instance
  3. Streaming Complexity: Current implementation may have inefficiencies

Recommendations

  • Profile the streaming performance once fixed
  • Consider connection pooling for vector store
  • Monitor token usage with LangSmith to optimize prompts

📋 Pre-Merge Checklist

Before merging this PR, please address:

  • Fix hardcoded endpoint_id - Replace "fix-me-later" with real endpoint
  • Fix streaming implementation - Correct stream_mode and chunk handling
  • Fix test mocks - Update to patch actual implementation
  • Fix evaluation script - Correct agent invocation parameters
  • Add input sanitization - Protect against filter injection
  • Resolve region inconsistency - Use single default location
  • Fix type hints - Correct evaluator return types
  • Clean up commented code - Remove or document
  • Update integration point - Either integrate in app.py or document phased rollout
  • Run full test suite - Verify all tests pass with fixes
  • Manual testing - Test streaming, tool usage, and citations end-to-end

🎯 Recommendation

Status: Needs Work ⚠️

This PR represents excellent architectural thinking and thorough documentation, but the critical implementation issues prevent it from being production-ready. The main concerns are:

  1. Core functionality won't work (hardcoded endpoint, broken streaming)
  2. Tests don't validate actual behavior (wrong mocks)
  3. Evaluation framework won't run (parameter mismatches)
  4. No clear integration path (not connected to main app)

Suggested Path Forward:

  1. Address the 5 critical issues listed above
  2. Add end-to-end integration test
  3. Test manually with real Vertex AI
  4. Document the deployment/rollout strategy
  5. Consider a phased rollout (feature flag to switch between old/new implementation)

Once these are resolved, this will be a valuable modernization that sets up the project for future success.
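A phased rollout could hinge on a single environment flag. A hedged sketch (the USE_LANGCHAIN_CHAT flag name and the stand-in classes are hypothetical, not part of this PR):

```python
import os

class LegacyChatManager:
    """Stand-in for the existing chat manager (hypothetical)."""
    name = "legacy"

class LangChainChatManager:
    """Stand-in for the new LangChain-based manager (hypothetical)."""
    name = "langchain"

def select_chat_manager():
    """Choose the chat backend from a feature flag, defaulting to the legacy path."""
    if os.getenv("USE_LANGCHAIN_CHAT", "false").lower() in ("1", "true", "yes"):
        return LangChainChatManager()
    return LegacyChatManager()
```

Defaulting to the legacy path keeps production behavior unchanged until the flag is flipped, and gives an instant rollback lever.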


Review Summary by Category

| Category | Issues Found | Severity |
| --- | --- | --- |
| Critical Bugs | 5 | 🔴 High |
| Security | 1 | 🟡 Medium |
| Architecture | 3 | 🟡 Medium |
| Code Quality | 4 | 🟢 Low |
| Documentation | 2 | 🟢 Low |
| Testing | 5 | 🔴 High |

Total Issues: 20 items requiring attention (5 critical, 4 high priority, 11 quality improvements)

I've left inline comments on specific lines for all issues. Please review and address these before merging. Happy to answer questions or provide additional guidance!


```python
    project_id=GOOGLE_CLOUD_PROJECT,
    region=GOOGLE_CLOUD_LOCATION,
    index_id=VERTEX_AI_DATASTORE,
    endpoint_id="fix-me-later",
```


Critical Issue: Hardcoded placeholder value

The endpoint_id parameter is set to "fix-me-later" which will cause runtime errors when the vector store is actually used. This needs to be:

  1. Set to a real Vertex AI endpoint ID from your deployed vector search index
  2. Configured via environment variable like other GCP settings
  3. Properly documented in the setup guide

Recommendation:

```python
VERTEX_AI_ENDPOINT_ID = os.getenv("VERTEX_AI_ENDPOINT_ID")
if VERTEX_AI_ENDPOINT_ID is None:
    raise ValueError("VERTEX_AI_ENDPOINT_ID environment variable is not set.")

vector_store = VectorSearchVectorStoreDatastore.from_components(
    project_id=GOOGLE_CLOUD_PROJECT,
    region=GOOGLE_CLOUD_LOCATION,
    index_id=VERTEX_AI_DATASTORE,
    endpoint_id=VERTEX_AI_ENDPOINT_ID,
)
```

```python
MODEL = os.getenv("MODEL_NAME", "gemini-2.5-pro")
VERTEX_AI_DATASTORE = os.getenv("VERTEX_AI_DATASTORE")
GOOGLE_CLOUD_PROJECT = os.getenv("GOOGLE_CLOUD_PROJECT")
GOOGLE_CLOUD_LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION", "us-west1-c")
```


Issue: Inconsistent default location

GOOGLE_CLOUD_LOCATION defaults to "us-west1-c" here (line 27), but in the ChatVertexAI initialization (line 115) it defaults to "us-central1". This inconsistency could cause issues:

  • Vector store lookups might use a different region than the LLM
  • Configuration errors may be hard to debug

Recommendation:
Use a single consistent default for both, e.g. "us-central1", which is common for Vertex AI. Note also that "us-west1-c" is a zone rather than a region; Vertex AI location settings expect a region such as "us-west1" or "us-central1".

```python
if GOOGLE_CLOUD_LOCATION is None:
    raise ValueError("GOOGLE_CLOUD_LOCATION environment variable is not set.")

vector_store = VectorSearchVectorStoreDatastore.from_components(
```


Potential Issue: Global vector store initialization

The vector store is initialized at module import time (line 38-43), which means:

  1. Import errors will occur if environment variables are missing, even for unrelated imports
  2. All processes share the same vector store instance (may cause issues with connection pooling)
  3. Testing becomes harder because you can't easily mock this

Recommendation:
Consider lazy initialization in the __init__ method of LangChainChatManager:

```python
class LangChainChatManager:
    def __init__(self):
        self.vector_store = VectorSearchVectorStoreDatastore.from_components(...)
        # ... rest of init
```

This would also make mocking easier for tests.


```python
rag = vector_store.as_retriever(
    search_kwargs={"k": 5},
    filter=f'city: ANY("{city.lower()}") AND state: ANY("{state.lower()}")',
```


Security Concern: SQL Injection-style filter vulnerability

The filter strings are constructed using f-strings without any sanitization:

```python
filter=f'city: ANY("{city.lower()}") AND state: ANY("{state.lower()}")'
```

If city or state contain quotes or special characters, this could:

  1. Break the filter syntax
  2. Potentially allow filter injection attacks

Recommendation:
Add input validation and sanitization:

```python
import re

def sanitize_filter_value(value: str) -> str:
    """Validate filter values before interpolation."""
    # Allow only alphanumeric characters, spaces, and hyphens; since quotes
    # can never pass this check, no additional escaping is needed.
    if not re.match(r'^[a-zA-Z0-9\s\-]+$', value):
        raise ValueError(f"Invalid characters in filter value: {value}")
    return value

# Then use:
filter=f'city: ANY("{sanitize_filter_value(city.lower())}") AND state: ANY("{sanitize_filter_value(state.lower())}")'
```

```python
current_query = messages[-1]["content"]

# Stream the agent response.
for chunk in agent.stream(
```


Critical Issue: Incorrect stream mode and response handling

Several problems with the streaming implementation:

  1. Wrong stream_mode: Using stream_mode="messages" but the code expects "output" in chunks (line 209). The correct mode for agent streaming is stream_mode="values" or stream_mode="updates".

  2. Incorrect chunk access: The code accesses chunk["output"] but LangGraph agents don't return chunks in this format. You need to handle the actual message delta structure.

  3. Missing error handling: No try/except around the streaming logic.

Recommendation:

```python
for chunk in agent.stream(
    {
        "messages": [HumanMessage(content=current_query)],
        "city": city,
        "state": state,
    },
    stream_mode="values",  # Get full state updates
):
    # Extract the latest AI message content
    if "messages" in chunk:
        messages = chunk["messages"]
        if messages and isinstance(messages[-1], AIMessage):
            yield messages[-1].content
```

Also note: This streaming implementation won't work correctly because you're yielding potentially duplicate content on each state update. Consider using stream_mode="messages" with proper delta handling.
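If stream_mode="values" is kept, one way to avoid re-yielding duplicate content is to track how much text has already been emitted and yield only the new suffix of each snapshot. A minimal sketch (the helper name is illustrative, not from the PR):

```python
from typing import Iterable, Iterator

def emit_deltas(snapshots: Iterable[str]) -> Iterator[str]:
    """Given successive full-text snapshots (e.g. the latest AI message content
    from each state update), yield only the newly appended text."""
    emitted = 0
    for text in snapshots:
        if len(text) > emitted:
            yield text[emitted:]
            emitted = len(text)
```

This keeps the streaming contract (each yielded string is new output) without depending on the exact chunk format of the agent runtime.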

Comment on lines +142 to +150
```python
# # Create prompt template with system message and conversation history.
# prompt = ChatPromptTemplate.from_messages(
#     [
#         ("system", system_prompt.text()),
#         MessagesPlaceholder(variable_name="chat_history", optional=True),
#         ("human", "{input}"),
#         MessagesPlaceholder(variable_name="agent_scratchpad"),
#     ]
# )
```


Code Smell: Commented-out code

This large block of commented-out code should either be:

  1. Removed if it's not needed
  2. Documented with a comment explaining why it's kept for reference

Keeping commented code in production reduces readability and creates confusion about the correct implementation.

```python
@pytest.fixture
def mock_vertex_ai():
    """Mock Vertex AI RAG responses."""
    with patch("tenantfirstaid.langchain_chat.rag.retrieve") as mock_retrieve:
```


Test Issue: Mock references non-existent code

The mock patches tenantfirstaid.langchain_chat.rag.retrieve but looking at langchain_chat.py, there's no rag object with a retrieve method. The actual code uses vector_store.as_retriever().

This test will not actually mock the intended functionality and may give false confidence.

Recommendation:

```python
@pytest.fixture
def mock_vertex_ai():
    """Mock Vertex AI RAG responses."""
    with patch("tenantfirstaid.langchain_chat.vector_store") as mock_store:
        mock_retriever = Mock()
        mock_doc = Mock()
        mock_doc.page_content = "ORS 90.427 requires 30 days notice..."
        mock_retriever.invoke.return_value = [mock_doc]
        mock_store.as_retriever.return_value = mock_retriever
        yield mock_store
```

Comment on lines +33 to +47
```python
def test_retrieve_city_law_filters_correctly(mock_vertex_ai):
    """Test that city law retrieval uses correct filter."""
    d: Dict[str, str] = {
        "query": "eviction notice requirements",
        "city": "portland",
        "state": "or",
    }

    result = retrieve_city_law.invoke(d)

    # Verify filter was constructed correctly.
    call_args = mock_vertex_ai.call_args
    assert 'city: ANY("portland")' in str(call_args)
    assert 'state: ANY("or")' in str(call_args)
    assert "ORS 90.427" in result
```


Test Issue: Assertions don't match actual implementation

The test checks mock_vertex_ai.call_args (line 44) but the mock was set up on rag.retrieve, not the retrieval tool functions. This test will likely fail or not test what it intends to test.

Also, the tool is invoked with a dictionary (line 35-39) but the actual tool signature is retrieve_city_law(query: str, city: str, state: str) - this mismatch needs to be resolved.



```python
# Evaluator 5: Citation Format (Heuristic).
def citation_format_evaluator(run, example) -> Evaluator:
```


Type Hint Issue: Incorrect return type

The function is annotated to return Evaluator (which is a class/type), but it actually returns a dictionary with evaluation results. This will cause type checking failures.

Recommendation:

```python
def citation_format_evaluator(run, example) -> dict:
    """Check if citations use proper HTML anchor tag format.

    Args:
        run: LangSmith run object containing outputs
        example: LangSmith example object (unused)

    Returns:
        Dictionary with evaluation results
    """
```

Same issue applies to tool_usage_evaluator (line 191) and performance_evaluator (line 222).
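If stronger typing than a bare dict is wanted, a TypedDict could name the result shape. A sketch (EvalResult, make_result, and the field set are assumed names, not from the PR):

```python
from typing import TypedDict

class EvalResult(TypedDict):
    """Shape of an evaluator's return value (assumed field set)."""
    key: str
    score: float
    comment: str

def make_result(key: str, score: float, comment: str = "") -> EvalResult:
    """Build a well-typed evaluator result."""
    return {"key": key, "score": score, "comment": comment}
```

Annotating the evaluators as returning EvalResult would satisfy mypy while documenting the expected keys in one place.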

"""
# Access trace to see which tools were called.
tool_calls = []
for step in run.trace.get("steps", []):


Potential Issue: Accessing trace structure may fail

The code accesses run.trace.get("steps", []) but this assumes the trace has a specific structure. If the trace format changes or isn't available:

  1. This will return an empty list (silent failure)
  2. The evaluator will always give a score of 0.0

Recommendation:
Add defensive checks and logging:

```python
if not hasattr(run, "trace") or not run.trace:
    return {
        "key": "tool_usage",
        "score": 0.0,
        "comment": "No trace available for evaluation",
    }

tool_calls = []
for step in run.trace.get("steps", []):
    if step.get("type") == "tool":
        tool_calls.append(step.get("name"))
```


```python
# Run agent on the first question.
response = agent.invoke(
    {
```


Bug: Incorrect parameter names for agent invocation

The agent is being invoked with parameters input and chat_history (lines 43-44), but looking at the LangChainChatManager.generate_streaming_response() implementation, the agent expects:

  • messages (not input)
  • context (not chat_history)

This mismatch will cause the evaluation to fail at runtime.

Recommendation:

```python
response = agent.invoke(
    {
        "messages": [HumanMessage(content=inputs["first_question"])],
        "city": inputs["city"],
        "state": inputs["state"],
    }
)
```

```python
    ],
    experiment_prefix=experiment_prefix,
    max_concurrency=5,  # Run 5 evaluations in parallel.
    num_repetitions=num_samples,
```


Issue: Incorrect parameter name

The evaluate() function is called with num_repetitions=num_samples (line 92), but looking at the LangSmith API, this parameter is likely meant for running each example multiple times, not for sampling a subset.

To evaluate only a subset of examples, you should either:

  1. Use dataset slicing before passing to evaluate()
  2. Use the max_examples parameter if available

Also: The commented-out code (lines 97-102) should be removed or documented.

/ "generate_conversation"
/ "tenant_questions_facts_full.csv"
)
df = pd.read_csv(csv_path, encoding="cp1252")


Encoding Issue: Hardcoded cp1252 encoding

The CSV is read with encoding="cp1252" (Windows-1252), which is unusual for modern datasets and may cause issues with:

  1. Cross-platform compatibility (Mac/Linux developers)
  2. Special characters in legal text
  3. Future data updates

Recommendation:
Use UTF-8 encoding (standard) or auto-detect:

```python
# Try UTF-8 first, fall back to cp1252 if needed.
try:
    df = pd.read_csv(csv_path, encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv(csv_path, encoding="cp1252")
```

Consider converting the source CSV to UTF-8 for consistency.
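A one-off conversion could look like this sketch (the function name and paths are illustrative):

```python
from pathlib import Path

def convert_to_utf8(src: Path, dst: Path, src_encoding: str = "cp1252") -> None:
    """Re-encode a text file from a legacy encoding to UTF-8."""
    dst.write_text(src.read_text(encoding=src_encoding), encoding="utf-8")
```

After converting the source CSV once, the pd.read_csv call can use plain UTF-8 with no fallback.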

```bash
export LANGSMITH_API_KEY=your-api-key
```


Documentation Issue: Incorrect environment variable name

The documentation lists LANGSMITH_TRACING_V2=true but LangSmith's current convention uses LANGCHAIN_TRACING_V2 (with "CHAIN" not "SMITH").

Please verify the correct variable name in the LangSmith documentation and update accordingly.

```diff
 pdf_buffer = BytesIO()
 pisa_status = pisa.CreatePDF(html_content, dest=pdf_buffer)
-if pisa_status.err:
+if hasattr(pisa_status, "err"):
```


Good Fix: Improved error handling

Good defensive programming! The change from pisa_status.err to hasattr(pisa_status, "err") prevents AttributeError exceptions if the pisa library's API changes or returns unexpected types.

Minor suggestion: Consider also checking the truthiness:

```python
if hasattr(pisa_status, "err") and pisa_status.err:
    return None
```

"python-dotenv",
"pandas>=2.3.0",
"vertexai>=1.43.0",
"langchain>=1.1.0",


Dependency Management: Consider version pinning

The new LangChain dependencies use minimum version constraints (>=), which could lead to:

  1. Breaking changes in future updates
  2. Inconsistent behavior across environments
  3. Difficult-to-reproduce bugs

Recommendation:
Consider using more restrictive version constraints for critical dependencies:

"langchain>=1.1.0,<2.0.0",
"langchain-google-vertexai>=3.1.0,<4.0.0",
"langsmith>=0.4.47,<0.5.0",

Or use lock files to ensure reproducibility (which you already have with uv.lock).

```markdown
**Architecture Type**: Agent-based RAG with tool calling
- **Framework**: LangChain 1.0.8+ (monolithic package)
- **LLM Integration**: ChatVertexAI (langchain-google-vertexai 3.0.3+)
- **Agent Pattern**: `create_tool_calling_agent()` with custom RAG tools
```


Documentation Issue: Incorrect agent function name

The documentation mentions create_tool_calling_agent() but the actual code uses create_agent() from langchain.agents (see langchain_chat.py:153).

Please update to match the actual implementation.


### Environment Variables
```bash
MODEL_NAME=gemini-2.5-pro  # LLM model name
```


Good Addition: Clear LangChain architecture documentation

Excellent documentation of the new architecture! The environment variables section is particularly helpful.

Minor suggestion: Consider adding a section about running the evaluation suite locally, since it's a key part of the quality assurance process:

### Running Evaluations
```bash
# Run LangSmith evaluations (requires LANGSMITH_API_KEY)
uv run python scripts/run_langsmith_evaluation.py --num-samples 20
```

