@ww2283
Contributor

@ww2283 ww2283 commented Oct 23, 2025

What does this PR do?

Add --show-metadata flag to display file metadata in search results and fix ZMQ
linking issues in Python extension build system.

Related Issues

Fixes #144

Checklist

  • Tests pass (uv run pytest)
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

@ww2283
Contributor Author

ww2283 commented Oct 23, 2025

PR Summary: Metadata Output Feature

What We Fixed

1. Metadata Output Feature

Added --show-metadata flag to display file path, name, creation/modified dates with emoji formatting in search results.
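For illustration, here is a minimal sketch of how such a display could be rendered. The field names match the chunk_metadata keys quoted later in this thread; the function name, labels, and emoji are assumptions made for the example and are not taken from the PR's code.

```python
from typing import Mapping, Optional

# Keys follow the chunk_metadata dict quoted in this thread.
# Labels and emoji are illustrative assumptions, not the PR's actual output.
FIELD_LABELS = [
    ("file_path", "📄 File"),
    ("file_name", "📝 Name"),
    ("creation_date", "🕐 Created"),
    ("last_modified_date", "🕑 Modified"),
]


def format_metadata(metadata: Optional[Mapping[str, object]]) -> str:
    """Render known metadata fields as indented lines, skipping empty ones."""
    if not metadata:
        return ""
    lines = []
    for key, label in FIELD_LABELS:
        value = metadata.get(key)
        if value:  # missing or empty fields are simply not shown
            lines.append(f"   {label}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_metadata({"file_path": "docs/a.md", "file_name": "a.md"}))
```

Skipping empty fields keeps the output compact for chunks that carry only partial metadata.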

2. ZMQ Linking Build System Fix

Fixed editable install (pip install -e) failures that were blocking testing of the metadata feature:

Why this was needed:

  • After implementing the metadata feature, testing with editable install failed with ImportError: symbol not found '_zmq_close'
  • PyPI wheel installations worked, but local development installs didn't
  • This blocked verification of the metadata feature changes

Root cause:
The Python extension wasn't properly linking the ZMQ library because of:

  1. Missing IMPORTED_TARGET in pkg_check_modules
  2. On macOS, Python::Module adds -undefined dynamic_lookup, which defers symbol resolution to load time, so the missing link only surfaced as an ImportError
  3. On ARM64 Macs, the build was finding the Intel Homebrew ZMQ instead of the ARM64 version

Fixes applied:

  • Used pkg_check_modules IMPORTED_TARGET to create proper CMake targets
  • Set PKG_CONFIG_PATH to prioritize ARM64 Homebrew on Apple Silicon
  • Overrode macOS -undefined dynamic_lookup to force symbol resolution
  • Used PUBLIC linkage for ZMQ in faiss library for transitive dependencies
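
The PKG_CONFIG_PATH prioritization can be sketched as follows. This is a Python illustration only; the actual fix presumably lives in the CMake build files. The function name is hypothetical, and the two directories are Homebrew's default prefixes (/opt/homebrew on Apple Silicon, /usr/local on Intel), which a custom install may not use.

```python
import os
import platform
from typing import Optional


def prioritized_pkg_config_path(machine: str, existing: Optional[str]) -> str:
    """Return a PKG_CONFIG_PATH value with the matching Homebrew dir first.

    Assumes Homebrew's default prefixes; hypothetical helper for illustration.
    """
    if machine == "arm64":
        preferred = "/opt/homebrew/lib/pkgconfig"
    else:
        preferred = "/usr/local/lib/pkgconfig"
    # Keep the other entries in their original order, without duplicating
    # the preferred directory.
    rest = [p for p in (existing or "").split(os.pathsep) if p and p != preferred]
    return os.pathsep.join([preferred] + rest)


if __name__ == "__main__":
    current = os.environ.get("PKG_CONFIG_PATH")
    print(prioritized_pkg_config_path(platform.machine(), current))
```

Putting the ARM64 directory first means pkg-config resolves libzmq from the native install even when an Intel Homebrew copy is also present.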

Known Limitations

1. AST Chunking Metadata Loss

--use-ast-chunking strips metadata because create_ast_chunks() returns plain strings instead of objects with metadata (line 118 in chunking_utils.py). Regular chunking preserves metadata correctly. Separate issue to be filed.
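
To make the limitation concrete, here is a minimal sketch. `split_plain` is a hypothetical stand-in for a chunker that returns bare strings (it is not the real create_ast_chunks()), and `attach_metadata` shows one possible way to carry the document's metadata along with each chunk; neither is taken from the repository.

```python
from typing import Any, Dict, List


def split_plain(text: str) -> List[str]:
    """Hypothetical chunker that returns plain strings (metadata is lost)."""
    return [part for part in text.split("\n\n") if part]


def attach_metadata(chunks: List[str], doc_metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Wrap each plain-string chunk with a copy of the document's metadata."""
    return [{"text": chunk, "metadata": dict(doc_metadata)} for chunk in chunks]


if __name__ == "__main__":
    doc_text = "def a():\n    pass\n\ndef b():\n    pass"
    doc_meta = {"file_path": "src/example.py", "file_name": "example.py"}
    for chunk in attach_metadata(split_plain(doc_text), doc_meta):
        print(chunk["metadata"]["file_name"], repr(chunk["text"]))
```

Copying the metadata dict per chunk avoids one chunk's later edits leaking into the others.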

2. Test Failure

test_astchunk_integration.py::test_document_rag_with_ast_chunking fails due to missing metadata file - pre-existing issue, unrelated to these changes.

Test Results

11/12 tests pass. The one failure is the AST chunking integration test described under Known Limitations. If you could merge these changes, I can continue fixing the AST chunking part in a follow-up.

@ww2283 ww2283 mentioned this pull request Oct 23, 2025
3 tasks
@yichuan-w yichuan-w requested a review from ASuresh0524 October 23, 2025 22:08
@ww2283
Contributor Author

ww2283 commented Oct 25, 2025

Merge Conflict Resolution

This PR was rebased on the latest upstream/main which includes PR #149 ("Display context chunks in ask and search results").

Conflict Resolution Decision

Both PRs (#149 and this PR #150) solve the same problem: displaying chunk source information. However, they use different approaches:

PR #149 approach (upstream):

text_with_source = "Chunk source:" + source_path + "\n" + node.get_content().replace("\n", " ")
builder.add_text(chunk_text, {"source": chunk_source})

  • Embeds source info directly in chunk text
  • Requires string parsing to extract metadata
  • Loses newlines in content

This PR's approach:

chunk_metadata = {
    "file_path": file_path or source_path,
    "file_name": doc.metadata.get("file_name", ""),
    "creation_date": ...,
    "last_modified_date": ...
}
builder.add_text(chunk["text"], metadata=chunk["metadata"])

  • Separates content from metadata
  • Structured data (no parsing needed)
  • Preserves original content
  • Supports richer metadata (dates, file info)
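
The difference between the two approaches can be sketched as below. `embed_source` mirrors the quoted #149 line; `parse_embedded` and `make_structured` are hypothetical helpers written for this comparison and do not come from either PR.

```python
from typing import Any, Dict, Tuple

PREFIX = "Chunk source:"


def embed_source(source_path: str, content: str) -> str:
    """Embed the source in the text, as in the quoted #149 line.

    Newlines in the content are flattened to spaces.
    """
    return PREFIX + source_path + "\n" + content.replace("\n", " ")


def parse_embedded(text: str) -> Tuple[str, str]:
    """Recover (source_path, content) by parsing the first line."""
    header, _, body = text.partition("\n")
    if not header.startswith(PREFIX):
        raise ValueError("chunk text has no embedded source header")
    return header[len(PREFIX):], body


def make_structured(source_path: str, content: str) -> Dict[str, Any]:
    """Keep content untouched and store the source as separate metadata."""
    return {"text": content, "metadata": {"file_path": source_path}}


if __name__ == "__main__":
    code = "def f():\n    return 1"
    path, body = parse_embedded(embed_source("src/a.py", code))
    print(path, body != code)  # the embedded form no longer matches the original
    print(make_structured("src/a.py", code)["text"] == code)
```

With the structured form, the display layer reads the metadata directly, and the stored text is exactly what was indexed.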

Why This Approach is Superior

  1. Cleaner separation of concerns - content and metadata are independent
  2. More maintainable - no brittle string parsing
  3. More extensible - easy to add new metadata fields
  4. Preserves content integrity - no content transformation required
  5. Better for --show-metadata flag - natural fit for structured display

The conflict was resolved by keeping the structured approach while maintaining compatibility with the existing API.

Collaborator

@ASuresh0524 ASuresh0524 left a comment

@ww2283 Great work on the metadata feature! The structured approach is much cleaner than string embedding.

Before we can fully approve I think a few changes should be made:

  1. Please resolve the merge conflicts with main (PR #149 was merged). Not too sure if I am missing anything here.
  2. Consider fixing the AST chunking metadata loss issue
  3. The ZMQ fixes look solid - good catch on the linking issues

Integration Note: I made a PR for the token limit fixes for #153 that will complement both this PR and #152. If you could look at the similarities between the two and check whether anything needs to be resolved, I think we can move forward. Once these are resolved, we can create a comprehensive solution for Ollama embeddings.

@ww2283
Contributor Author

ww2283 commented Oct 28, 2025

Hi @ASuresh0524, thanks for the review!

Addressing your feedback:

1. Merge Conflicts (PR #149)

Resolved. I merged upstream/main into both PR #150 and #152. The cli.py conflicts were resolved by keeping the structured metadata approach.

The remaining conflict is .gitmodules - see explanation below.

2. .gitmodules Conflict Explanation

The conflict exists because these PRs depend on ZMQ linking fixes in the faiss submodule:

Action needed: Once the faiss PR is merged, I'll update .gitmodules back to yichuan-w/faiss and the conflict will be resolved.

3. AST Chunking Metadata Loss

I'm deferring this fix until after PR #152 merges. Here's why:

Better approach: Fix AST metadata handling in a follow-up PR that builds on both #150 and #152.

4. Proposed Roadmap

Step 1: Merge faiss PR #3 (ZMQ linking fix)

Step 2: Merge PR #152 (Ollama batching optimization)

Step 3: Fix AST metadata loss (new PR, which I will open after #152 merges)

Step 4: Coordinate with PR #154 (token limit fixes, which I can help resolve)

  • Review overlap in embedding_compute.py
  • Resolve any conflicts between Ollama batching and token truncation

Summary

The PRs are ready code-wise. The only blocker is the faiss submodule dependency. Once the faiss PR merges upstream, I'll update .gitmodules and both #150/#152 will merge cleanly. Hope this is clear.

@yichuan-w
Owner

> Step 3: Fix AST metadata loss (new PR, which I will do it after 152 merging)
>
>   • Test with Ollama batch processing
>   • Ensure compatibility with #152 changes
>
> Step 4: Coordinate with PR #154 (token limit fixes, which I can help to resolve)
>
>   • Review overlap in embedding_compute.py
>   • Resolve any conflicts between Ollama batching and token truncation

Yeah, then we can start working on

@ww2283, let me know what your idea is here. Do you think #154 solves all of these?

@ww2283 ww2283 closed this Oct 31, 2025
@ww2283 ww2283 force-pushed the feature/add-metadata-output branch from 45b87ce to a85d0ad Compare October 31, 2025 15:15
@ww2283 ww2283 deleted the feature/add-metadata-output branch October 31, 2025 15:25
@ww2283
Contributor Author

ww2283 commented Oct 31, 2025

@yichuan-w Not really. I will help resolve the conflicts with the current #154. I'm working on a PR, which should land soon (today or tomorrow), that integrates the current #154 changes and resolves the conflicts while getting AST chunking to work with metadata.

@yichuan-w
Owner

Yeah, great, thanks for your contribution! Let me know when you need me to review. You can also join our Slack, where we can discuss actively!

@ASuresh0524
Collaborator

Just to clarify, @ww2283: will you be fixing the error on #154, or do you want me to help on that end? And thanks for the help, @yichuan-w!

@ASuresh0524
Collaborator

@ww2283 I just made my fixes for #154; let me know if they help in any way! I think it would be great if you could focus on AST chunking with metadata. If you want to review my PR and let me know whether any changes are needed, that would be great as well!

@ww2283
Contributor Author

ww2283 commented Nov 1, 2025

> @ww2283 Just made my fixes for #154 let me know if it helps in anyway! Think if you are able to focus on the AST chunking with the metadata that should be great, if you want to review my PR and let me know if there are any changes to look at, that would be great as well!

I have commented in the #154 thread. Sorry for the slow reply.

Development

Successfully merging this pull request may close these issues.

[FEAT] Output metadata (files, locations in files) in the search results
