
Conversation

@ww2283 (Contributor) commented Oct 23, 2025

Checklist

  • Tests pass (uv run pytest): as noted in Feature/add metadata output #150, the AST chunking tests are waiting to be synced with the fix in #150; DiskANN is not configured on my Mac, but #151 does not touch the DiskANN backend. All other tests pass
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

Summary

Fixes #151

This PR implements true batch processing for Ollama embeddings by migrating from the deprecated /api/embeddings
endpoint to the modern /api/embed endpoint.

Note: This PR is based on feature/add-metadata-output and includes changes from PR #150 (metadata output + ZMQ
fixes). The Ollama batching optimization is in the latest commit (8bb7743). PR #150 can be merged first, then this PR
will cleanly apply the batching changes only.

Commits in this PR

  1. d6a3c28 - feat: add metadata output to search results (from Feature/add metadata output #150)
  2. 76e1633 - fix: resolve ZMQ linking issues (from Feature/add metadata output #150)
  3. 5073f31 - style: apply ruff formatting (from Feature/add metadata output #150)
  4. 8bb7743 - feat: implement true batch processing for Ollama embeddings ⭐ (NEW)

Changes (Ollama Batching)

  • Endpoint migration: /api/embeddings → /api/embed (request shape sketched after this list)
  • Parameter update: "prompt" (single text) → "input" (array of texts)
  • Response parsing: Updated to handle batch embeddings array
  • Timeout: Increased to 60s for batch processing
  • Error handling: Improved for batch request failures
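
For illustration, the request shape roughly changes as in the sketch below. This is not the exact code from embedding_compute.py; the host, model name, and sample texts are assumptions based on the test setup described later in this PR.

import requests

host = "http://localhost:11434"  # assumed default local Ollama host
model = "qwen3-embedding:0.6b"
texts = ["first chunk", "second chunk", "third chunk"]

# Before: one request per text against the deprecated /api/embeddings endpoint
old_embeddings = []
for text in texts:
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": model, "prompt": text},  # single text per call
        timeout=30,  # illustrative per-text timeout
    )
    resp.raise_for_status()
    old_embeddings.append(resp.json()["embedding"])

# After: one request per batch against /api/embed, which accepts an array of inputs
resp = requests.post(
    f"{host}/api/embed",
    json={"model": model, "input": texts},  # whole batch in one call
    timeout=60,  # longer timeout for batch processing
)
resp.raise_for_status()
new_embeddings = resp.json()["embeddings"]  # one vector per input text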

Performance Impact

Before

  • Made 32 separate API calls per "batch" of 32 texts
  • Significant HTTP connection/handshake overhead
  • Used deprecated API endpoint

After

  • Makes 1 API call per batch of 32 texts
  • Reduces HTTP overhead by 32×
  • Uses modern, recommended API endpoint

Actual Results (tested on 2,374 chunks)

  • API calls reduced from ~76K to ~75 requests
  • Batch size: 32 for MPS/CPU, 128 for CUDA (device selection sketched after this list)
  • Processing time: Moderate improvement from reduced HTTP overhead
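
For illustration only, the device-dependent batch sizing could be selected along these lines; the torch check is an assumption about the environment, and the constants 32 and 128 simply mirror the figures above.

import torch

def pick_batch_size() -> int:
    # Larger batches on CUDA GPUs; smaller batches on Apple MPS or CPU
    if torch.cuda.is_available():
        return 128
    return 32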

Important Note

Through empirical testing on Ollama v0.12.6, we confirmed that Ollama processes embedding batch items sequentially
(not in parallel). The performance improvement comes from reduced HTTP overhead, not GPU parallelization.

For users requiring maximum performance, we recommend, for example:

--embedding-mode sentence-transformers --embedding-model Alibaba-NLP/gte-Qwen2-1.5B-instruct

This provides true GPU batch parallelization and is 10-20× faster than Ollama for large embedding workloads.
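
For reference, a minimal sentence-transformers sketch of what true GPU batch encoding looks like, assuming the package is installed and the model fits in memory; the batch size and trust_remote_code flag are illustrative choices, not settings taken from LEANN.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
texts = ["first chunk", "second chunk", "third chunk"]
# encode() batches internally and runs each batch through the GPU in parallel
embeddings = model.encode(texts, batch_size=32, show_progress_bar=False)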

Testing

Tested with:
- 2,374 chunks, avg 4,610 chars each
- Ollama v0.12.6
- Model: qwen3-embedding:0.6b
- Batch size: 32 texts per batch

Related Files

- packages/leann-core/src/leann/embedding_compute.py:570-861

Commit message for d6a3c28 (feat: add metadata output to search results, from PR #150):

- Add --show-metadata flag to display file paths in search results
- Preserve document metadata (file_path, file_name, timestamps) during chunking
- Update MCP tool schema to support show_metadata parameter
- Enhance CLI search output to display metadata when requested
- Fix pre-existing bug: args.backend -> args.backend_name

Resolves yichuan-w#144

Commit message for 76e1633 (fix: resolve ZMQ linking issues, from PR #150):

- Use pkg_check_modules IMPORTED_TARGET to create PkgConfig::ZMQ
- Set PKG_CONFIG_PATH to prioritize ARM64 Homebrew on Apple Silicon
- Override macOS -undefined dynamic_lookup to force proper symbol resolution
- Use PUBLIC linkage for ZMQ in faiss library for transitive linking
- Mark cppzmq includes as SYSTEM to suppress warnings

Fixes editable install ZMQ symbol errors while maintaining compatibility
across Linux, macOS Intel, and macOS ARM64 platforms.

@ww2283 (Contributor, Author) commented Oct 23, 2025

fix for #151

@yichuan-w requested a review from ASuresh0524 on October 23, 2025 at 22:08
Use ww2283/faiss fork with fix/zmq-linking branch to resolve CI checkout
failures. The ZMQ linking fixes are not yet merged upstream.
Resolved conflicts in cli.py by keeping structured metadata approach over
inline text concatenation from PR yichuan-w#149.

Our approach uses separate metadata dictionary which is cleaner and more
maintainable than parsing embedded strings.
Migrate from deprecated /api/embeddings to modern /api/embed endpoint
which supports batch inputs. This reduces HTTP overhead by sending
32 texts per request instead of making individual API calls.

Changes:
- Update endpoint from /api/embeddings to /api/embed
- Change parameter from 'prompt' (single) to 'input' (array)
- Update response parsing for batch embeddings array
- Increase timeout to 60s for batch processing
- Improve error handling for batch requests

Performance:
- Reduces API calls by 32x (batch size)
- Eliminates HTTP connection overhead per text
- Note: Ollama still processes batch items sequentially internally

Related: yichuan-w#151
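
A rough sketch of the batch-plus-fallback behavior this commit message describes; the function and variable names are illustrative, not the actual ones in embedding_compute.py.

import requests

def embed_batch(host: str, model: str, texts: list[str]) -> list[list[float]]:
    try:
        resp = requests.post(
            f"{host}/api/embed",
            json={"model": model, "input": texts},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["embeddings"]
    except requests.RequestException:
        # Fall back to one text per request if the batch call fails
        embeddings = []
        for text in texts:
            resp = requests.post(
                f"{host}/api/embed",
                json={"model": model, "input": [text]},
                timeout=60,
            )
            resp.raise_for_status()
            embeddings.append(resp.json()["embeddings"][0])
        return embeddings
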
@ww2283 (Contributor, Author) commented Oct 27, 2025

@ASuresh0524

Submodule Conflict Explanation

Why .gitmodules Has a Conflict

This PR temporarily changes the faiss submodule URL from yichuan-w/faiss to ww2283/faiss because:

  1. ZMQ Linking Fix Required: The changes in this PR depend on ZMQ linking fixes in the faiss submodule
  2. CI Was Failing: GitHub Actions couldn't checkout the faiss commit with the fix because it wasn't pushed upstream yet
  3. Temporary Fork: Using ww2283/faiss:fix/zmq-linking branch allows CI to pass while waiting for upstream merge

Upstream PR Created

Created PR to merge ZMQ fix upstream: yichuan-w/faiss#3

Resolution Plan

Once the faiss PR is merged:

  1. Update .gitmodules back to url = https://github.com/yichuan-w/faiss.git
  2. Remove the branch = fix/zmq-linking line
  3. Force push updated branches
  4. Conflict will be resolved

Alternative: Accept This Temporarily

If you'd prefer not to wait, this PR can be merged as-is:

  • ✅ All functionality works correctly
  • ✅ CI passes with the fork
  • ✅ Can switch back to upstream faiss in a follow-up PR after the fix is merged

@ASuresh0524 (Collaborator) left a comment:

LGTM

Positive Aspects:

  • 32x reduction in API calls - Major performance improvement
  • Modern API usage - Future-proof with /api/embed
  • Proper error handling - Graceful fallback to individual processing
  • Comprehensive testing - Tested with real workload (2,374 chunks)

Minor Suggestions:

  1. Token Limit Integration: This pairs well with token limit fixes for #153. Consider how these changes interact with token truncation.
  2. Error Logging: The batch error handling could benefit from more specific error messages for token limit violations (related to #153); one possible shape is sketched below.
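
As a purely illustrative sketch of what more specific token-limit logging could look like; the 512-token default and the whitespace-based token estimate are assumptions, not LEANN code.

import logging

logger = logging.getLogger(__name__)

def log_oversized_texts(texts: list[str], max_tokens: int = 512) -> None:
    # Rough whitespace token estimate; a real check would use the model's tokenizer
    for i, text in enumerate(texts):
        approx_tokens = len(text.split())
        if approx_tokens > max_tokens:
            logger.warning(
                "Batch item %d has ~%d tokens, exceeding the %d-token limit; "
                "it may be truncated or rejected by the embedding server",
                i, approx_tokens, max_tokens,
            )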

Approval Status:

LGTM - This is a solid performance improvement that should be merged.

Next Steps:

After this merges, I'll update my token limit fixes (#153) to work with the new /api/embed endpoint for a complete solution.

@ASuresh0524 mentioned this pull request Oct 28, 2025
@yichuan-w (Owner) commented:

Hi @ww2283, can you change to the new Faiss version (i.e., link to our backbone instead of yours) and then we can merge? I have merged the PR in faiss, thanks for the contribution!

Review thread on the new batch request code:

response = requests.post(
    f"{resolved_host}/api/embed",
    json={"model": model_name, "input": truncated_texts},
    timeout=60,  # Increased timeout for batch processing
)

@yichuan-w (Owner):

I am wondering if this will result in OOM.
If you have tested it at a large scale, I think I am fine with this.

@ww2283 (Contributor, Author) replied:

I will keep this in mind for the next step and closely monitor its behavior. Thanks for the merge! I will also double-check that the conflicts are resolved before the next step. Currently Ollama has a limitation: the batch is received correctly but is not actually processed as a batch internally, which differs from other clients such as LM Studio. LM Studio uses an OpenAI-style endpoint and does not hit OOM, so I assume Ollama should be fine as well, even if they later implement proper batching; for now, the batching path is ready on the Ollama side.

Sadly, headless-server model autoloading and unloading with proper JIT is still smoothest with Ollama; the next closest solution is llama-swap, but it is not as convenient. Currently, the fastest option on Apple Silicon is either Ollama with an MoE embedding model (of which we currently only have nomad v2) or LM Studio with embeddinggemma, which offers speed equivalent to that Ollama-hosted MoE. embeddinggemma has two big advantages: longer sequence length support (2048 vs 512) and template prepending, which should theoretically matter for better results.

@ww2283 (Contributor, Author) added:

On a side note, speed is important, at least to me, because I use a post-tool hook in Claude Code that re-embeds whenever it sees a git commit, to keep the codebase index up to date. So the embedding in LEANN has to be fast.

@yichuan-w (Owner) commented:

Let's merge this PR. It is an important issue, so let's merge ASAP.
Thanks for your contribution!!!!

@yichuan-w merged commit a85d0ad into yichuan-w:main on Oct 30, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

[BUG] Severe Performance Bottleneck in Ollama Embedding Generation due to Serial API Calls

3 participants