Feature/optimize ollama batching #152
Conversation
- Add --show-metadata flag to display file paths in search results
- Preserve document metadata (file_path, file_name, timestamps) during chunking
- Update MCP tool schema to support show_metadata parameter
- Enhance CLI search output to display metadata when requested
- Fix pre-existing bug: args.backend -> args.backend_name

Resolves yichuan-w#144
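For illustration, here is a minimal sketch of the metadata-preservation idea described in this commit. The function and field names are hypothetical, not the actual LEANN implementation: each chunk carries a copy of its source document's metadata, and results only show it when show_metadata is requested.

```python
# Hypothetical sketch: keep document metadata on every chunk so search
# results can surface file paths when --show-metadata is passed.

def chunk_document(text: str, metadata: dict, chunk_size: int = 512) -> list[dict]:
    """Split text into fixed-size chunks, copying metadata onto each chunk."""
    return [
        {
            "text": text[start:start + chunk_size],
            "metadata": dict(metadata),  # file_path, file_name, timestamps, ...
        }
        for start in range(0, len(text), chunk_size)
    ]


def format_result(chunk: dict, show_metadata: bool = False) -> str:
    """Render a search hit, appending the source path only when requested."""
    line = chunk["text"][:80]
    if show_metadata:
        line += f"  [{chunk['metadata'].get('file_path', 'unknown')}]"
    return line
```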
- Use pkg_check_modules IMPORTED_TARGET to create PkgConfig::ZMQ
- Set PKG_CONFIG_PATH to prioritize ARM64 Homebrew on Apple Silicon
- Override macOS -undefined dynamic_lookup to force proper symbol resolution
- Use PUBLIC linkage for ZMQ in the faiss library for transitive linking
- Mark cppzmq includes as SYSTEM to suppress warnings

Fixes editable-install ZMQ symbol errors while maintaining compatibility across Linux, macOS Intel, and macOS ARM64 platforms.
fix for #151
Use ww2283/faiss fork with fix/zmq-linking branch to resolve CI checkout failures. The ZMQ linking fixes are not yet merged upstream.
Resolved conflicts in cli.py by keeping the structured metadata approach over the inline text concatenation from PR yichuan-w#149. Our approach uses a separate metadata dictionary, which is cleaner and more maintainable than parsing embedded strings.
Migrate from the deprecated /api/embeddings endpoint to the modern /api/embed endpoint, which supports batch inputs. This reduces HTTP overhead by sending 32 texts per request instead of making individual API calls.

Changes:
- Update endpoint from /api/embeddings to /api/embed
- Change parameter from 'prompt' (single text) to 'input' (array of texts)
- Update response parsing for the batch embeddings array
- Increase timeout to 60s for batch processing
- Improve error handling for batch requests

Performance:
- Reduces API calls by 32x (batch size)
- Eliminates per-text HTTP connection overhead
- Note: Ollama still processes batch items sequentially internally

Related: yichuan-w#151
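As a hedged before/after sketch of this migration (host, model, and function names are assumptions; the actual code from the PR appears in the review snippet further down): the deprecated /api/embeddings endpoint takes a single "prompt" and returns one "embedding", while /api/embed accepts an "input" array and returns an "embeddings" list.

```python
import requests

OLLAMA_HOST = "http://localhost:11434"  # assumed default Ollama host


def embed_one(model: str, text: str) -> list[float]:
    # Old style: one request per text via the deprecated /api/embeddings endpoint.
    r = requests.post(
        f"{OLLAMA_HOST}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["embedding"]


def embed_many(model: str, texts: list[str]) -> list[list[float]]:
    # New style: one request per batch (e.g. 32 texts) via /api/embed.
    r = requests.post(
        f"{OLLAMA_HOST}/api/embed",
        json={"model": model, "input": texts},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["embeddings"]  # one embedding per input text, in order
```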
Submodule Conflict Explanation
ASuresh0524 left a comment
LGTM
Positive Aspects:
- 32x reduction in API calls - Major performance improvement
- Modern API usage - Future-proof with /api/embed
- Proper error handling - Graceful fallback to individual processing
- Comprehensive testing - Tested with real workload (2,374 chunks)
Minor Suggestions:
- Token Limit Integration: This pairs well with token limit fixes for #153. Consider how these changes interact with token truncation.
- Error Logging: The batch error handling could benefit from more specific error messages for token limit violations (related to #153).
Approval Status:
LGTM - This is a solid performance improvement that should be merged.
Next Steps:
After this merges, I'll update my token limit fixes (#153) to work with the new /api/embed endpoint for a complete solution.
Hi @ww2283, can you change to the new Faiss version (i.e. link to our backbone instead of yours) so that we can merge? I have merged the PR in faiss. Thanks for the contribution!
```python
response = requests.post(
    f"{resolved_host}/api/embed",
    json={"model": model_name, "input": truncated_texts},
    timeout=60,  # Increased timeout for batch processing
)
```
I am wondering if this will result in OOM?
If you have tested this at a large scale, I think I am fine with it.
I will keep this in mind as I closely monitor its behavior during the next step. Thanks for the merge! I will check around to make sure the conflicts are resolved before the next step.

Currently Ollama has a limitation: the batch is received correctly, but it is not actually processed as a batch internally, which is different from other clients, e.g. LM Studio. LM Studio uses the OpenAI-style endpoint and does not OOM, so I assume Ollama should be fine, even if they later decide to do proper batching; for now the batching is ready on the Ollama side. Sadly, headless-server auto-loading and unloading of models with proper JIT is still smoothest with Ollama; the next-closest solution is llama-swap, but it is not as convenient.

Currently, the fastest option on Apple Silicon is either Ollama with an MoE embedding model (for which we currently only have that nomic v2) or LM Studio with embeddinggemma, which offers speed equivalent to the Ollama-hosted MoE. embeddinggemma has two great advantages: longer sequence length support (2048 vs 512) and template prepending, which should theoretically matter for better results.
On a side note, speed is important, at least to me, because I use a post-tool hook in Claude Code that triggers embedding whenever it sees a git commit, to keep the codebase index up to date. So the embedding in LEANN has to be fast.
Let's merge this PR. It addresses an important issue; let's merge ASAP.
What does this PR do?
Related Issues
Fixes #
Checklist
- Tests (uv run pytest): as mentioned in Feature/add metadata output #150, the AST chunking part is waiting to be synced with the fix in #150; DiskANN is not configured on my Mac, but #151 is irrelevant to the DiskANN backend. All other tests pass.
- Formatting and lint (ruff format and ruff check)
- Pre-commit hooks (pre-commit run --all-files)

Summary
Fixes #151
This PR implements true batch processing for Ollama embeddings by migrating from the deprecated /api/embeddings endpoint to the modern /api/embed endpoint.

Commits in this PR
- d6a3c28 - feat: add metadata output to search results (from Feature/add metadata output #150)
- 76e1633 - fix: resolve ZMQ linking issues (from Feature/add metadata output #150)
- 5073f31 - style: apply ruff formatting (from Feature/add metadata output #150)
- 8bb7743 - feat: implement true batch processing for Ollama embeddings ⭐ (NEW)

Changes (Ollama Batching)
- Endpoint: /api/embeddings → /api/embed
- Parameter: "prompt" (single text) → "input" (array of texts)

Performance Impact
Before
After
Actual Results (tested on 2,374 chunks)
Important Note
Through empirical testing on Ollama v0.12.6, we confirmed that Ollama processes embedding batch items sequentially
(not in parallel). The performance improvement comes from reduced HTTP overhead, not GPU parallelization.
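As a rough worked example of the request-count reduction reported in this PR (2,374 chunks at a batch size of 32):

```python
import math

total_chunks = 2374  # workload reported above
batch_size = 32

old_requests = total_chunks                          # one /api/embeddings call per chunk
new_requests = math.ceil(total_chunks / batch_size)  # one /api/embed call per batch

print(old_requests, new_requests)  # 2374 -> 75 HTTP requests (~32x fewer)
```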
For users requiring maximum performance, we recommend, e.g.: