Add Jinja template support #11016

Open · wants to merge 33 commits into master

Conversation

@ochafik (Collaborator) commented Dec 30, 2024

Subset of #9639 with just the Jinja templating support.

Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow-up PR.

  • Copies minja.hpp & chat-template.hpp from google/minja (created for this 😅) at this commit
  • Adds --jinja flag to llama-server, llama-cli, llama-run
  • Adds --chat-template-file flag to llama-server, llama-cli (related: Added chat template support to llama-run #11215)
  • Loads tokenizer.chat_template (or tokenizer.chat_template.tool_use if defined, and only when the request has tools; see the sketch below).
  • Dual testing in test-chat-template.cpp of the legacy ad-hoc templating and the Jinja route. Wherever the expected outputs diverge, the Jinja expectations should be more correct (note that templates are run with trim_blocks = true, lstrip_blocks = true).
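
A minimal sketch of the template-selection behaviour described in the bullets above (function and variable names are illustrative, not the PR's actual code):

    #include <string>

    // Prefer the tool_use variant of the chat template only when it exists
    // and the incoming request actually carries tools; otherwise fall back
    // to the default tokenizer.chat_template.
    static std::string select_chat_template(
            const std::string & default_tmpl,   // tokenizer.chat_template
            const std::string & tool_use_tmpl,  // tokenizer.chat_template.tool_use (may be empty)
            bool                request_has_tools) {
        if (request_has_tools && !tool_use_tmpl.empty()) {
            return tool_use_tmpl;
        }
        return default_tmpl;
    }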

Example usage:

# Launch in background
./build/bin/llama-server \
  -hfr bartowski/Qwen2.5-7B-Instruct-GGUF \
  -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --jinja &

curl http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ipython",
          "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
          "parameters": {
            "type": "object",
            "properties": {
              "code": {
                "type": "string",
                "description": "The code to run in the ipython interpreter."
              }
            },
            "required": ["code"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Print a hello world message with python (using single quotes '"'"' for strings)."
      }
    ]
  }'
Output:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>",
        "role": "assistant"
      }
    }
  ],
  "created": 1736811609,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4494-a57bb94e",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 205,
    "total_tokens": 230
  },
  "id": "chatcmpl-5YJXFVhvjoMDlLx1asuWNdSO3JVWWsUF",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 155.151,
    "prompt_per_token_ms": 155.151,
    "prompt_per_second": 6.445333900522716,
    "predicted_n": 25,
    "predicted_ms": 419.714,
    "predicted_per_token_ms": 16.78856,
    "predicted_per_second": 59.56437002339688
  }
}

TODO:

  • Add cross-testing in test-chat-template.cpp (note that minja is tested against a lot of templates in its own repo)
  • Add some instructions here
  • Add more server tests to exercise the template overrides.

@ericcurtin (Collaborator)

Feel free to add the option to llama-run for basic testing also @ochafik

@ericcurtin (Collaborator) left a comment

I approve the llama-run parts at least, but the more code we can share with llama-server etc., the better; there's probably room for more de-duplication.

@ngxson (Collaborator) commented Jan 14, 2025

IMO we can extend common_chat_apply_template to add bool jinja and reuse this function in other examples.
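
A rough sketch of what such an extension could look like (the exact signature, parameter names, and the common_chat_msg type here are assumed, not the PR's final code):

    // common.h (sketch): same helper as before, with an extra flag that
    // routes formatting through the minja-based Jinja path when true,
    // so llama-server, llama-cli, llama-run etc. all share one entry point.
    std::string common_chat_apply_template(
            const struct llama_model *           model,
            const std::string &                  tmpl,
            const std::vector<common_chat_msg> & msgs,
            bool                                 add_ass,
            bool                                 use_jinja);  // new parameter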

@ochafik (Collaborator, Author) commented Jan 18, 2025

> IMO we can extend common_chat_apply_template to add bool jinja and reuse this function in other examples.

@ngxson updated the code along those lines, ptal :-)

common/common.h Outdated
@@ -3,7 +3,9 @@
#pragma once

#include "llama-cpp.h"
#include "chat-template.hpp"
Owner commented:

Including this here will pollute a lot of source files with a json.hpp dependency. We should avoid this.

Author (@ochafik) replied:

Good point, thanks; now forward-declared.
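
For illustration, a minimal sketch of the forward-declaration approach (assuming the class is minja::chat_template; the exact layout in the PR may differ):

    // common.h (sketch): forward-declare the minja type instead of
    // including chat-template.hpp, so json.hpp stops leaking into every
    // file that includes common.h.
    namespace minja {
        class chat_template;
    }

    // Code here may only hold pointers/references to minja::chat_template;
    // the full definition is included in common.cpp alone.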

@ericcurtin (Collaborator)

It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja

@ochafik (Collaborator, Author) commented Jan 18, 2025

> It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja

@ericcurtin Wow, didn't realize Ollama templates were a thing 😖. If you're keen, it might make sense to incubate these within https://github.com/google/minja or to use it as a dependency (it has a full-ish Jinja AST; it would be easy to add pretty-printing to it).

@ericcurtin (Collaborator) commented Jan 18, 2025

So Ollama are basically forking a little bit of everything to try and achieve vendor lock-in. Some examples:

  1. The Ollama transport protocol is just a slightly forked version of the OCI protocol (they are ex-Docker guys), forked just enough that one can't use Docker Hub, quay.io, etc. (so people will have to buy Ollama Enterprise servers or whatever).

  2. They have forked llama.cpp (I would much rather we upstreamed to llama.cpp than forked, like upstreaming to Linus's kernel tree).

  3. They don't use Jinja like everyone else; they use this:

https://ollama.com/library/granite-code/blobs/977871d28ce4

etc.

So we started a project called RamaLama to unfork all these bits (it can also just be used as an Ollama replacement; it can run models from Ollama), so people can use their existing infrastructure, i.e. OCI registries, to transport models. We are also trying to add all sorts of container features to RamaLama: Podman support, Docker support, and an option to use vLLM for inference instead (Ollama doesn't have this).

This is essentially why we are creating tools like RamaLama, linenoise.cpp, ollama2jinja, llama-run, lm-pull, etc.

Happy to contribute to minja too 😄

@ericcurtin (Collaborator) commented Jan 18, 2025

Basically, we are trying to push community-friendly approaches and avoid vendor lock-in.

I would have liked to contribute these features to Ollama and make it better, but I quickly learned that unless you have an @ollama.com email address you can't really get anything significant merged in that project.

@ericcurtin (Collaborator)

gemma.cpp looks interesting too, FWIW; I recently made a small contribution there.

@slaren (Collaborator) commented Jan 18, 2025

I can reproduce the test that fails on Windows. The output seems to be missing some new lines. Maybe caused by normalize_newlines?

[Screenshot: output of the failing Windows test run]

@slaren (Collaborator) commented Jan 18, 2025

After looking more into this, I think that the issue is the different behavior of ^ and $ in regex with MSVC (ref). Changing the implementation of strip in minja.hpp to the commented version seems to fix the issue. I am wondering if the use of different line endings on WIN32 was an attempt to work around this? std::ostringstream shouldn't need to use \r\n line endings; this should only be relevant when using files in text mode.
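
For reference, a generic regex-free strip along those lines (a sketch, not necessarily the commented version in minja.hpp):

    #include <string>

    // Trim leading/trailing whitespace without std::regex, so the result
    // does not depend on MSVC's interpretation of ^ and $.
    static std::string strip(const std::string & s) {
        static const char ws[] = " \t\n\r";
        const auto start = s.find_first_not_of(ws);
        if (start == std::string::npos) {
            return "";
        }
        const auto end = s.find_last_not_of(ws);
        return s.substr(start, end - start + 1);
    }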

@ochafik (Collaborator, Author) commented Jan 18, 2025

> After looking more into this, I think that the issue is the different behavior of ^ and $ in regex with MSVC (ref). Changing the implementation of strip in minja.hpp to the commented version seems to fix the issue. I am wondering if the use of different line endings on WIN32 was an attempt to work around this? std::ostringstream shouldn't need to use \r\n line endings; this should only be relevant when using files in text mode.

@slaren thanks for digging; I'd embarrassingly forgotten to finish that fix. Done (google/minja#22).

@ericcurtin (Collaborator) commented Jan 19, 2025

> It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja
>
> @ericcurtin Wow, didn't realize Ollama templates were a thing 😖. If you're keen, it might make sense to incubate these within https://github.com/google/minja or to use it as a dependency (it has a full-ish Jinja AST; it would be easy to add pretty-printing to it).

Maybe I can try to add this in. I think if one were pulling from Ollama, it might be easiest to do a quick one-off conversion to Jinja post-pull:

some-model.ollama.template -> some-model.jinja.template

That seems like it could be easier than maintaining both a Jinja parser and an Ollama parser in minja, but I'm on the fence; what do you think?

Another reason I like the one-off conversion is that vLLM can parse Jinja too, so a one-off conversion is a solution for vLLM as well.

@ericcurtin (Collaborator)

Maybe we could do something like this:

    const std::string tmpl = minja::Converter::ollama2jinja(some_ollama_string);

@ochafik (Collaborator, Author) commented Jan 20, 2025

@ericcurtin So, I took a look at ollama's templates and they seem to have done some creative approximations of the original templates when converting the jinjas to their templating system (compare jinja vs. ollama for llama 3.3 for instance).

I think it would be quite suboptimal to convert their template back to jinja as it would incur hefty double losses in translation (go-templates don't seem trivially convertible to jinja at a glance). For a fraction of the energy, we could instead fetch the template from the original chat model.

It looks like ollama's metadata only lists the repo_url of the base model, not that of the fine-tune (a very weird choice; maybe something they'd be keen to adjust?). One could probably approximate the fine-tune repo as general.base_model.0.repo_url + '-' + general.finetune, which might work-ish for enough models (definitely more reliably than trying to convert templates, and one can special-case some mappings to the "right" repo). Then fetch it like scripts/get_hf_chat_template.py does, although gated repos might get in the way.
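
A hypothetical sketch of that heuristic (get_meta is a stand-in for however the GGUF metadata would actually be read; the key names follow the ones mentioned above):

    #include <functional>
    #include <string>

    // Guess the fine-tune repo from the base-model metadata, e.g.
    // ".../Qwen2.5-7B" + "-" + "Instruct" -> ".../Qwen2.5-7B-Instruct".
    static std::string guess_finetune_repo(
            const std::function<std::string(const std::string &)> & get_meta) {
        const std::string base_repo = get_meta("general.base_model.0.repo_url");
        const std::string finetune  = get_meta("general.finetune");
        if (base_repo.empty() || finetune.empty()) {
            return ""; // not enough metadata; fall back to special-cased mappings
        }
        return base_repo + "-" + finetune;
    }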

@ericcurtin (Collaborator)

My biggest concern is the case where people publish to the Ollama registry and the template from the original chat model and the Ollama template don't match. Publishers to the Ollama registry assume the Ollama one will be used. But yes, it all makes sense :)
