Add Jinja template support #11016
base: master
Conversation
Feel free to add the option to llama-run for basic testing also, @ochafik.
I approve the llama-run parts at least, but the more code we can share with llama-server etc. the better; there's probably room for more de-duplication.
IMO we can extend …
@ngxson updated the code along those lines, ptal :-)
common/common.h (outdated diff):

```diff
@@ -3,7 +3,9 @@
 #pragma once

 #include "llama-cpp.h"
+#include "chat-template.hpp"
```
Including this here will pollute a lot of source files with a json.hpp dependency. We should avoid this.
Good point, thanks; it's now forward declared.
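For context, a minimal sketch of the forward-declaration pattern being described, assuming chat-template.hpp defines a `minja::chat_template` class; the helper function below is hypothetical and not the PR's actual API:

```cpp
// common/common.h (sketch): forward-declare the template class instead of
// including chat-template.hpp, so json.hpp is not pulled into every source
// file that includes this header.
#pragma once

#include "llama-cpp.h"

#include <string>

namespace minja {
    class chat_template;  // forward declaration only; assumed to live in chat-template.hpp
}

// Hypothetical helper: pointers/references to an incomplete type are enough
// for declarations, so the full class definition is not needed in this header.
std::string render_chat_prompt(const minja::chat_template & tmpl,
                               const std::string & messages_json);

// Only common/common.cpp would then do:
//   #include "chat-template.hpp"   // brings in json.hpp, confined to one translation unit
```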
It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja
@ericcurtin Wow, didn't realize ollama templates were a thing 😖. If you're keen, it might make sense to incubate these within https://github.com/google/minja or use it as a dependency (it has a full-ish jinja AST; it would be easy to add pretty-printing to it).
So Ollama are basically forking a little bit of everything to try and achieve vendor lock-in. Some examples: https://ollama.com/library/granite-code/blobs/977871d28ce4 etc. So we started a project called RamaLama to unfork all these bits (it can also just be used as an Ollama replacement, and it can run models from Ollama), so people can keep using their existing infrastructure, such as OCI registries, to transport models. We are also trying to add all sorts of container features to RamaLama: Podman support, Docker support, and an option to use vLLM for inference as an alternative (Ollama doesn't have this). This is essentially why we are creating tools like RamaLama, linenoise.cpp, ollama2jinja, llama-run, lm-pull, etc. Happy to contribute to minja too 😄
We are basically trying to push community-friendly approaches and avoid vendor lock-in. I would have liked to contribute these features to Ollama and make it better, but I quickly learned that unless you have an @ollama.com email address you can't really get anything significant merged in that project.
gemma.cpp looks interesting too, FWIW; I recently made a small contribution there.
After looking more into this, I think that the issue is different behavior of …
@slaren thanks for digging, I'd embarrassingly forgotten to finish that fix, done (google/minja#22)
Maybe I can try and add this in. I think if one were pulling from Ollama, it might be easiest to do a quick once-off conversion to Jinja post-pull: some-model.ollama.template -> some-model.jinja.template. That seems like it could be easier than maintaining both a Jinja parser and an Ollama parser in minja, but I'm on the fence; what do you think? Another reason I like the once-off conversion is that vLLM can parse Jinja too, so a once-off conversion is a solution for vLLM as well.
Maybe we could do something like this: …
@ericcurtin So, I took a look at Ollama's templates, and they seem to have done some creative approximations of the original templates when converting the jinjas to their templating system (compare the jinja vs. ollama templates for Llama 3.3, for instance). I think it would be quite suboptimal to convert their template back to jinja, as it would incur hefty double losses in translation (Go templates don't seem trivially convertible to jinja at a glance). For a fraction of the energy, we could instead fetch the template from the original chat model. It looks like Ollama's metadata only lists the repo_url of the base model, not that of the fine-tune (a very weird choice; maybe something they'd be keen to adjust?). One could probably approximate the fine-tune repo as …
My biggest concern is when people publish to the Ollama registry and the template from the original chat model and the Ollama template don't match. Publishers to the Ollama registry assume the Ollama one will be used. But yes, it all makes sense :)
Subset of #9639 with just the Jinja templating support.

Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow-up PR.

- `--jinja` flag added to llama-server, llama-cli, llama-run
- `--chat-template-file` flag added to llama-server, llama-cli (related: Added chat template support to llama-run #11215)
- Uses the model's `tokenizer.chat_template` (or `tokenizer.chat_template.tool_use` if defined, only when the request has tools)
- Templates are rendered with `trim_blocks = true, lstrip_blocks = true`

Example usage:
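An illustrative sketch of how the new flags might be invoked (the model and template paths below are placeholders, not taken from the PR):

```sh
# Render prompts with the model's built-in tokenizer.chat_template via the Jinja engine
llama-server --jinja -m models/some-model.gguf

# Same flag for llama-cli
llama-cli --jinja -m models/some-model.gguf -p "Hello"

# Override the template with a Jinja file from disk
llama-server --jinja --chat-template-file some-template.jinja -m models/some-model.gguf
```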
TODO: