Add Jinja template support #11016

Open · wants to merge 33 commits into master

Conversation

@ochafik (Collaborator) commented Dec 30, 2024

Subset of #9639 with just the Jinja templating support.

Proper tool support (grammar constraints, lazy grammar triggering, tool call parsing & stop reason) will come in a follow-up PR.

  • Copies minja.hpp & chat-template.hpp from google/minja (created for this 😅) at this commit
  • Adds --jinja flag to llama-server, llama-cli, llama-run
  • Adds --chat-template-file flag to llama-server, llama-cli (related: Added chat template support to llama-run #11215)
  • Loads tokenizer.chat_template (or tokenizer.chat_template.tool_use if defined, and only when the request has tools; see the sketch below).
  • Dual testing in test-chat-template.cpp of the legacy ad-hoc templating and the Jinja route. Wherever the expected outputs diverge, the Jinja expectations should be more correct (note that templates are run with trim_blocks = true, lstrip_blocks = true).
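
A minimal sketch of the template-selection behaviour described in the bullets above (function and variable names are illustrative, not the PR's actual code):

    #include <string>

    // Prefer the tool_use variant of the chat template only when it exists
    // and the incoming request actually carries tools; otherwise fall back
    // to the default tokenizer.chat_template.
    static std::string select_chat_template(
            const std::string & default_tmpl,   // tokenizer.chat_template
            const std::string & tool_use_tmpl,  // tokenizer.chat_template.tool_use (may be empty)
            bool                request_has_tools) {
        if (request_has_tools && !tool_use_tmpl.empty()) {
            return tool_use_tmpl;
        }
        return default_tmpl;
    }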

Example usage:

# Launch in background
./build/bin/llama-server \
  -hfr bartowski/Qwen2.5-7B-Instruct-GGUF \
  -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --jinja &

curl http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "ipython",
          "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
          "parameters": {
            "type": "object",
            "properties": {
              "code": {
                "type": "string",
                "description": "The code to run in the ipython interpreter."
              }
            },
            "required": ["code"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Print a hello world message with python (using single quotes '"'"' for strings)."
      }
    ]
  }'
Output:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<tool_call>\n{\"name\": \"ipython\", \"arguments\": {\"code\": \"print('Hello world!')\"}}\n</tool_call>",
        "role": "assistant"
      }
    }
  ],
  "created": 1736811609,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4494-a57bb94e",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 205,
    "total_tokens": 230
  },
  "id": "chatcmpl-5YJXFVhvjoMDlLx1asuWNdSO3JVWWsUF",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 155.151,
    "prompt_per_token_ms": 155.151,
    "prompt_per_second": 6.445333900522716,
    "predicted_n": 25,
    "predicted_ms": 419.714,
    "predicted_per_token_ms": 16.78856,
    "predicted_per_second": 59.56437002339688
  }
}

TODO:

  • Add cross-testing in test-chat-template.cpp (note that minja is tested against a lot of templates in its own repo)
  • Add some instructions here
  • Add more server tests to exercise the template overrides.

@ericcurtin (Collaborator)

Feel free to add the option to llama-run for basic testing also @ochafik

@ericcurtin (Collaborator) left a comment

I approve the llama-run parts at least, but the more code we can share with llama-server etc., the better; there's probably room for more de-duplication.

@ngxson (Collaborator) commented Jan 14, 2025

IMO we can extend common_chat_apply_template to add bool jinja and reuse this function in other examples.
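
A rough sketch of what such an extension could look like (the exact signature, parameter names, and the common_chat_msg type here are assumed, not the PR's final code):

    // common.h (sketch): same helper as before, with an extra flag that
    // routes formatting through the minja-based Jinja path when true,
    // so llama-server, llama-cli, llama-run etc. all share one entry point.
    std::string common_chat_apply_template(
            const struct llama_model *           model,
            const std::string &                  tmpl,
            const std::vector<common_chat_msg> & msgs,
            bool                                 add_ass,
            bool                                 use_jinja);  // new parameter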

@ochafik (Collaborator, Author) commented Jan 18, 2025

> IMO we can extend common_chat_apply_template to add bool jinja and reuse this function in other examples.

@ngxson updated the code along those lines, ptal :-)

common/common.h Outdated
@@ -3,7 +3,9 @@
#pragma once

#include "llama-cpp.h"
#include "chat-template.hpp"
Owner commented:

Including this here will pollute a lot of source files with a json.hpp dependency. We should avoid this.

Author (@ochafik) replied:

Good point, thanks; now forward-declared.
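
For illustration, a minimal sketch of the forward-declaration approach (assuming the class is minja::chat_template; the exact layout in the PR may differ):

    // common.h (sketch): forward-declare the minja type instead of
    // including chat-template.hpp, so json.hpp stops leaking into every
    // file that includes common.h.
    namespace minja {
        class chat_template;
    }

    // Code here may only hold pointers/references to minja::chat_template;
    // the full definition is included in common.cpp alone.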

@ericcurtin (Collaborator)

It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja

@ochafik (Collaborator, Author) commented Jan 18, 2025

> It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja

@ericcurtin Wow, didn't realize Ollama templates were a thing 😖. If you're keen, it might make sense to incubate these within https://github.com/google/minja or to use it as a dependency (it has a full-ish Jinja AST; it would be easy to add pretty-printing to it).

@ericcurtin (Collaborator) commented Jan 18, 2025

So Ollama are basically forking a little bit of everything to try and achieve vendor lock-in. Some examples:

  1. The Ollama transport protocol is just a slightly forked version of the OCI protocol (they are ex-Docker guys), forked just enough that one can't use Docker Hub, quay.io, etc. (so people will have to buy Ollama Enterprise servers or whatever).

  2. They have forked llama.cpp (I would much rather we upstreamed to llama.cpp than forked, like upstreaming to Linus's kernel tree).

  3. They don't use Jinja like everyone else; they use this:

https://ollama.com/library/granite-code/blobs/977871d28ce4

etc.

So we started a project called RamaLama to unfork all these bits (it can also just be used as an Ollama replacement; it can run models from Ollama), so people can use their existing infrastructure, i.e. OCI registries, to transport models. We are also trying to add all sorts of container features to RamaLama: Podman support, Docker support, and an option to use vLLM for inference instead (Ollama doesn't have this).

This is essentially why we are creating tools like RamaLama, linenoise.cpp, ollama2jinja, llama-run, lm-pull, etc.

Happy to contribute to minja too 😄

@ericcurtin (Collaborator) commented Jan 18, 2025

Basically, we are trying to push community-friendly approaches and avoid vendor lock-in.

I would have liked to contribute these features to Ollama and make it better, but I quickly learned that unless you have an @ollama.com email address you can't really get anything significant merged in that project.

@ericcurtin (Collaborator)

gemma.cpp looks interesting too, FWIW; I recently made a small contribution there.

@slaren (Collaborator) commented Jan 18, 2025

I can reproduce the test that fails on Windows. The output seems to be missing some new lines. Maybe caused by normalize_newlines?

[Screenshot: output of the failing Windows test run]

@slaren (Collaborator) commented Jan 18, 2025

After looking more into this, I think that the issue is the different behavior of ^ and $ in regex with MSVC (ref). Changing the implementation of strip in minja.hpp to the commented version seems to fix the issue. I am wondering if the use of different line endings on WIN32 was an attempt to work around this? std::ostringstream shouldn't need to use \r\n line endings; this should only be relevant when using files in text mode.
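
For reference, a generic regex-free strip along those lines (a sketch, not necessarily the commented version in minja.hpp):

    #include <string>

    // Trim leading/trailing whitespace without std::regex, so the result
    // does not depend on MSVC's interpretation of ^ and $.
    static std::string strip(const std::string & s) {
        static const char ws[] = " \t\n\r";
        const auto start = s.find_first_not_of(ws);
        if (start == std::string::npos) {
            return "";
        }
        const auto end = s.find_last_not_of(ws);
        return s.substr(start, end - start + 1);
    }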

@ochafik (Collaborator, Author) commented Jan 18, 2025

> After looking more into this, I think that the issue is the different behavior of ^ and $ in regex with MSVC (ref). Changing the implementation of strip in minja.hpp to the commented version seems to fix the issue. I am wondering if the use of different line endings on WIN32 was an attempt to work around this? std::ostringstream shouldn't need to use \r\n line endings; this should only be relevant when using files in text mode.

@slaren thanks for digging; I'd embarrassingly forgotten to finish that fix. Done (google/minja#22).

@ericcurtin (Collaborator) commented Jan 19, 2025

> It's not working yet or anything, but I intend to write ollama2jinja and jinja2ollama tools: https://github.com/ericcurtin/ollama2jinja
>
> @ericcurtin Wow, didn't realize Ollama templates were a thing 😖. If you're keen, it might make sense to incubate these within https://github.com/google/minja or to use it as a dependency (it has a full-ish Jinja AST; it would be easy to add pretty-printing to it).

Maybe I can try to add this in. I think if one were pulling from Ollama, it might be easiest to do a quick one-off conversion to Jinja post-pull:

some-model.ollama.template -> some-model.jinja.template

That seems like it could be easier than maintaining both a Jinja parser and an Ollama parser in minja, but I'm on the fence; what do you think?

Another reason I like the one-off conversion is that vLLM can parse Jinja too, so a one-off conversion is a solution for vLLM as well.

@ericcurtin (Collaborator)

Maybe we could do something like this:

    const std::string tmpl = minja::Converter::ollama2jinja(some_ollama_string);

@ochafik (Collaborator, Author) commented Jan 20, 2025

@ericcurtin So, I took a look at ollama's templates and they seem to have done some creative approximations of the original templates when converting the jinjas to their templating system (compare jinja vs. ollama for llama 3.3 for instance).

I think it would be quite suboptimal to convert their template back to jinja as it would incur hefty double losses in translation (go-templates don't seem trivially convertible to jinja at a glance). For a fraction of the energy, we could instead fetch the template from the original chat model.

It looks like ollama's metadata only lists the repo_url of the base model, not that of the fine-tune (a very weird choice; maybe something they'd be keen to adjust?). One could probably approximate the fine-tune repo as general.base_model.0.repo_url + '-' + general.finetune, which might work-ish for enough models (definitely more reliably than trying to convert templates, and one can special-case some mappings to the "right" repo). Then fetch it like scripts/get_hf_chat_template.py does, although gated repos might get in the way.
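
A hypothetical sketch of that heuristic (get_meta is a stand-in for however the GGUF metadata would actually be read; the key names follow the ones mentioned above):

    #include <functional>
    #include <string>

    // Guess the fine-tune repo from the base-model metadata, e.g.
    // ".../Qwen2.5-7B" + "-" + "Instruct" -> ".../Qwen2.5-7B-Instruct".
    static std::string guess_finetune_repo(
            const std::function<std::string(const std::string &)> & get_meta) {
        const std::string base_repo = get_meta("general.base_model.0.repo_url");
        const std::string finetune  = get_meta("general.finetune");
        if (base_repo.empty() || finetune.empty()) {
            return ""; // not enough metadata; fall back to special-cased mappings
        }
        return base_repo + "-" + finetune;
    }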

@ericcurtin (Collaborator)

My biggest concern is the case where people publish to the Ollama registry and the template from the original chat model and the Ollama template don't match. Publishers to the Ollama registry assume the Ollama one will be used. But yes, it all makes sense :)
