
Conversation

@ww2283 (Contributor) commented Oct 23, 2025

Checklist

  • Tests pass (uv run pytest): as noted in Feature/add metadata output #150, the AST chunking tests are waiting to be synced with the fix in #150; DiskANN is not configured on my Mac, but #151 does not touch the DiskANN backend. All other tests pass
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

Summary

Fixes #151

This PR implements true batch processing for Ollama embeddings by migrating from the deprecated /api/embeddings
endpoint to the modern /api/embed endpoint.

Note: This PR is based on feature/add-metadata-output and includes changes from PR #150 (metadata output + ZMQ
fixes). The Ollama batching optimization is in the latest commit (8bb7743). PR #150 can be merged first, then this PR
will cleanly apply the batching changes only.

Commits in this PR

  1. d6a3c28 - feat: add metadata output to search results (from Feature/add metadata output #150)
  2. 76e1633 - fix: resolve ZMQ linking issues (from Feature/add metadata output #150)
  3. 5073f31 - style: apply ruff formatting (from Feature/add metadata output #150)
  4. 8bb7743 - feat: implement true batch processing for Ollama embeddings ⭐ (NEW)

Changes (Ollama Batching)

  • Endpoint migration: /api/embeddings → /api/embed (request shape sketched after this list)
  • Parameter update: "prompt" (single text) → "input" (array of texts)
  • Response parsing: Updated to handle batch embeddings array
  • Timeout: Increased to 60s for batch processing
  • Error handling: Improved for batch request failures
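
For illustration, the request shape roughly changes as in the sketch below. This is not the exact code from embedding_compute.py; the host, model name, and sample texts are assumptions based on the test setup described later in this PR.

import requests

host = "http://localhost:11434"  # assumed default local Ollama host
model = "qwen3-embedding:0.6b"
texts = ["first chunk", "second chunk", "third chunk"]

# Before: one request per text against the deprecated /api/embeddings endpoint
old_embeddings = []
for text in texts:
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": model, "prompt": text},  # single text per call
        timeout=30,  # illustrative per-text timeout
    )
    resp.raise_for_status()
    old_embeddings.append(resp.json()["embedding"])

# After: one request per batch against /api/embed, which accepts an array of inputs
resp = requests.post(
    f"{host}/api/embed",
    json={"model": model, "input": texts},  # whole batch in one call
    timeout=60,  # longer timeout for batch processing
)
resp.raise_for_status()
new_embeddings = resp.json()["embeddings"]  # one vector per input text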

Performance Impact

Before

  • Made 32 separate API calls per "batch" of 32 texts
  • Significant HTTP connection/handshake overhead
  • Used deprecated API endpoint

After

  • Makes 1 API call per batch of 32 texts
  • Reduces HTTP overhead by 32×
  • Uses modern, recommended API endpoint

Actual Results (tested on 2,374 chunks)

  • API calls reduced from ~76K to ~75 requests
  • Batch size: 32 for MPS/CPU, 128 for CUDA (device selection sketched after this list)
  • Processing time: Moderate improvement from reduced HTTP overhead
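
For illustration only, the device-dependent batch sizing could be selected along these lines; the torch check is an assumption about the environment, and the constants 32 and 128 simply mirror the figures above.

import torch

def pick_batch_size() -> int:
    # Larger batches on CUDA GPUs; smaller batches on Apple MPS or CPU
    if torch.cuda.is_available():
        return 128
    return 32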

Important Note

Through empirical testing on Ollama v0.12.6, we confirmed that Ollama processes embedding batch items sequentially
(not in parallel). The performance improvement comes from reduced HTTP overhead, not GPU parallelization.

For users requiring maximum performance, we recommend, for example:

--embedding-mode sentence-transformers --embedding-model Alibaba-NLP/gte-Qwen2-1.5B-instruct

This provides true GPU batch parallelization and is 10-20× faster than Ollama for large embedding workloads.
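
For reference, a minimal sentence-transformers sketch of what true GPU batch encoding looks like, assuming the package is installed and the model fits in memory; the batch size and trust_remote_code flag are illustrative choices, not settings taken from LEANN.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
texts = ["first chunk", "second chunk", "third chunk"]
# encode() batches internally and runs each batch through the GPU in parallel
embeddings = model.encode(texts, batch_size=32, show_progress_bar=False)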

Testing

Tested with:
- 2,374 chunks, avg 4,610 chars each
- Ollama v0.12.6
- Model: qwen3-embedding:0.6b
- Batch size: 32 texts per batch

Related Files

- packages/leann-core/src/leann/embedding_compute.py:570-861

Commit message for d6a3c28 (feat: add metadata output to search results, from PR #150):

- Add --show-metadata flag to display file paths in search results
- Preserve document metadata (file_path, file_name, timestamps) during chunking
- Update MCP tool schema to support show_metadata parameter
- Enhance CLI search output to display metadata when requested
- Fix pre-existing bug: args.backend -> args.backend_name

Resolves yichuan-w#144

Commit message for 76e1633 (fix: resolve ZMQ linking issues, from PR #150):

- Use pkg_check_modules IMPORTED_TARGET to create PkgConfig::ZMQ
- Set PKG_CONFIG_PATH to prioritize ARM64 Homebrew on Apple Silicon
- Override macOS -undefined dynamic_lookup to force proper symbol resolution
- Use PUBLIC linkage for ZMQ in faiss library for transitive linking
- Mark cppzmq includes as SYSTEM to suppress warnings

Fixes editable install ZMQ symbol errors while maintaining compatibility
across Linux, macOS Intel, and macOS ARM64 platforms.

@ww2283 (Contributor, Author) commented Oct 23, 2025

fix for #151

@yichuan-w requested a review from ASuresh0524 on October 23, 2025 at 22:08
Use ww2283/faiss fork with fix/zmq-linking branch to resolve CI checkout
failures. The ZMQ linking fixes are not yet merged upstream.
Resolved conflicts in cli.py by keeping structured metadata approach over
inline text concatenation from PR yichuan-w#149.

Our approach uses separate metadata dictionary which is cleaner and more
maintainable than parsing embedded strings.
Migrate from deprecated /api/embeddings to modern /api/embed endpoint
which supports batch inputs. This reduces HTTP overhead by sending
32 texts per request instead of making individual API calls.

Changes:
- Update endpoint from /api/embeddings to /api/embed
- Change parameter from 'prompt' (single) to 'input' (array)
- Update response parsing for batch embeddings array
- Increase timeout to 60s for batch processing
- Improve error handling for batch requests

Performance:
- Reduces API calls by 32x (batch size)
- Eliminates HTTP connection overhead per text
- Note: Ollama still processes batch items sequentially internally

Related: yichuan-w#151
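
A rough sketch of the batch-plus-fallback behavior this commit message describes; the function and variable names are illustrative, not the actual ones in embedding_compute.py.

import requests

def embed_batch(host: str, model: str, texts: list[str]) -> list[list[float]]:
    try:
        resp = requests.post(
            f"{host}/api/embed",
            json={"model": model, "input": texts},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["embeddings"]
    except requests.RequestException:
        # Fall back to one text per request if the batch call fails
        embeddings = []
        for text in texts:
            resp = requests.post(
                f"{host}/api/embed",
                json={"model": model, "input": [text]},
                timeout=60,
            )
            resp.raise_for_status()
            embeddings.append(resp.json()["embeddings"][0])
        return embeddings
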
@ww2283 (Contributor, Author) commented Oct 27, 2025

@ASuresh0524

Submodule Conflict Explanation

Why .gitmodules Has a Conflict

This PR temporarily changes the faiss submodule URL from yichuan-w/faiss to ww2283/faiss because:

  1. ZMQ Linking Fix Required: The changes in this PR depend on ZMQ linking fixes in the faiss submodule
  2. CI Was Failing: GitHub Actions couldn't checkout the faiss commit with the fix because it wasn't pushed upstream yet
  3. Temporary Fork: Using ww2283/faiss:fix/zmq-linking branch allows CI to pass while waiting for upstream merge

Upstream PR Created

Created PR to merge ZMQ fix upstream: yichuan-w/faiss#3

Resolution Plan

Once the faiss PR is merged:

  1. Update .gitmodules back to url = https://github.com/yichuan-w/faiss.git
  2. Remove the branch = fix/zmq-linking line
  3. Force push updated branches
  4. Conflict will be resolved

Alternative: Accept This Temporarily

If you'd prefer not to wait, this PR can be merged as-is:

  • ✅ All functionality works correctly
  • ✅ CI passes with the fork
  • ✅ Can switch back to upstream faiss in a follow-up PR after the fix is merged

@ASuresh0524 (Collaborator) left a comment:

LGTM

Positive Aspects:

  • 32x reduction in API calls - Major performance improvement
  • Modern API usage - Future-proof with /api/embed
  • Proper error handling - Graceful fallback to individual processing
  • Comprehensive testing - Tested with real workload (2,374 chunks)

Minor Suggestions:

  1. Token Limit Integration: This pairs well with token limit fixes for #153. Consider how these changes interact with token truncation.
  2. Error Logging: The batch error handling could benefit from more specific error messages for token limit violations (related to #153); one possible shape is sketched below.
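
As a purely illustrative sketch of what more specific token-limit logging could look like; the 512-token default and the whitespace-based token estimate are assumptions, not LEANN code.

import logging

logger = logging.getLogger(__name__)

def log_oversized_texts(texts: list[str], max_tokens: int = 512) -> None:
    # Rough whitespace token estimate; a real check would use the model's tokenizer
    for i, text in enumerate(texts):
        approx_tokens = len(text.split())
        if approx_tokens > max_tokens:
            logger.warning(
                "Batch item %d has ~%d tokens, exceeding the %d-token limit; "
                "it may be truncated or rejected by the embedding server",
                i, approx_tokens, max_tokens,
            )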

Approval Status:

LGTM - This is a solid performance improvement that should be merged.

Next Steps:

After this merges, I'll update my token limit fixes (#153) to work with the new /api/embed endpoint for a complete solution.

@ASuresh0524 mentioned this pull request Oct 28, 2025
@yichuan-w (Owner) commented:

Hi @ww2283, can you change to the new Faiss version (i.e., link to our backbone instead of yours) and then we can merge? I have merged the PR in faiss, thanks for the contribution!

Review thread on the new batch request code:

response = requests.post(
    f"{resolved_host}/api/embed",
    json={"model": model_name, "input": truncated_texts},
    timeout=60,  # Increased timeout for batch processing
)

@yichuan-w (Owner):

I am wondering if this will result in OOM.
If you have tested it at a large scale, I think I am fine with this.

@ww2283 (Contributor, Author) replied:

I will keep this in mind for the next step and closely monitor its behavior. Thanks for the merge! I will also double-check that the conflicts are resolved before the next step. Currently Ollama has a limitation: the batch is received correctly but is not actually processed as a batch internally, which differs from other clients such as LM Studio. LM Studio uses an OpenAI-style endpoint and does not hit OOM, so I assume Ollama should be fine as well, even if they later implement proper batching; for now, the batching path is ready on the Ollama side.

Sadly, headless-server model autoloading and unloading with proper JIT is still smoothest with Ollama; the next closest solution is llama-swap, but it is not as convenient. Currently, the fastest option on Apple Silicon is either Ollama with an MoE embedding model (of which we currently only have nomad v2) or LM Studio with embeddinggemma, which offers speed equivalent to that Ollama-hosted MoE. embeddinggemma has two big advantages: longer sequence length support (2048 vs 512) and template prepending, which should theoretically matter for better results.

@ww2283 (Contributor, Author) added:

On a side note, speed is important, at least to me, because I use a post-tool hook in Claude Code that re-embeds whenever it sees a git commit, to keep the codebase index up to date. So the embedding in LEANN has to be fast.

@yichuan-w (Owner) commented:

Let's merge this PR. It is an important issue, so let's merge ASAP.
Thanks for your contribution!!!!

@yichuan-w merged commit a85d0ad into yichuan-w:main on Oct 30, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

[BUG] Severe Performance Bottleneck in Ollama Embedding Generation due to Serial API Calls

3 participants