[BUG] Duplicate BOS token prepended when using Llama-3/3.1 chat interface #388

@rubik-hua

Description
When using model.chat() with Llama-3/3.1 models, the framework inadvertently prepends two <|begin_of_text|> (BOS, token ID 128000) tokens to the prompt_token_ids. This shifts the RoPE positional encodings by 1, causing the greedy decoding output to diverge significantly from HuggingFace.

Root Cause
The root cause lies in the interaction between the Chat Template and the default behavior of the Llama Fast Tokenizer:

Chat Template adds BOS text: Llama-3/3.1’s tokenizer_config.json explicitly includes <|begin_of_text|> at the very beginning of its chat template. When the prompt is rendered, the string already starts with this special token.
Tokenizer adds BOS token again: Later, BasicLLMProcessor.__call__ passes this rendered string to self.tokenizer(prompt), which defaults to add_special_tokens=True.
What add_special_tokens=True actually does for Llama: By default, AutoTokenizer loads the Fast Tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/data/rubik/models/Meta-Llama-3.1-8B-Instruct")
print(type(tokenizer))

Output: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>

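In the source code at transformers/models/llama/tokenization_llama_fast.py, LlamaTokenizerFast is initialized with the default parameters add_bos_token=True and add_eos_token=False. The Llama-3/3.1 checkpoint loads as PreTrainedTokenizerFast (see the output above), where the same BOS-prepending behavior appears to be configured by the post-processor shipped in tokenizer.json.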

Therefore, when add_special_tokens=True is passed, the tokenizer’s Rust backend PostProcessor enforces this configuration and automatically prepends a second BOS token (128000) to the sequence, regardless of the text content.
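One possible fix, sketched under assumptions: skip special-token insertion whenever the rendered prompt already begins with the BOS text. `encode_rendered_prompt` is a hypothetical helper name, not existing InfiniLM code; it only assumes the tokenizer is callable with `add_special_tokens` and exposes `bos_token`, as HuggingFace tokenizers do.

```python
def encode_rendered_prompt(tokenizer, rendered_prompt):
    """Tokenize a chat-template-rendered prompt without doubling BOS.

    Hypothetical helper for illustration; not part of InfiniLM or transformers.
    """
    bos = getattr(tokenizer, "bos_token", None)
    # The chat template may already have written the BOS text into the string.
    already_has_bos = bool(bos) and rendered_prompt.startswith(bos)
    # Only let the tokenizer add special tokens when the template did not.
    encoded = tokenizer(rendered_prompt, add_special_tokens=not already_has_bos)
    return list(encoded["input_ids"])
```

Checking the rendered string rather than hard-coding add_special_tokens=False keeps the behavior unchanged for templates that do not emit BOS themselves.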

Verification
Comparing the prompt_token_ids reveals the inconsistency:

HuggingFace (Expected): [128000, 128006, 9125, 128007, 271, ...] (Single BOS)

InfiniLM (Actual): [128000, 128000, 128006, 9125, 128007, 271, ...] (Duplicate BOS)

Because of this positional shift, greedy decoding (temperature=0.0, top_k=1) produces entirely different text from HF. Manually forcing HF to use the double-BOS input reproduces InfiniLM’s output, confirming the duplicate token as the sole cause.
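For a regression check, the comparison above can be expressed as a small guard on prompt_token_ids. This is a minimal sketch: the helper names are hypothetical, and the ID lists are only the prefixes quoted in this issue, not full prompts.

```python
LLAMA3_BOS_ID = 128000  # <|begin_of_text|>, as reported in this issue


def count_leading_bos(token_ids, bos_id=LLAMA3_BOS_ID):
    """Return how many consecutive BOS tokens start the sequence."""
    count = 0
    for token_id in token_ids:
        if token_id != bos_id:
            break
        count += 1
    return count


def collapse_leading_bos(token_ids, bos_id=LLAMA3_BOS_ID):
    """Drop all but one leading BOS token; other tokens are untouched."""
    extra = max(count_leading_bos(token_ids, bos_id) - 1, 0)
    return list(token_ids[extra:])


# Prefixes copied from the HuggingFace and InfiniLM outputs above.
hf_prefix = [128000, 128006, 9125, 128007, 271]
infinilm_prefix = [128000, 128000, 128006, 9125, 128007, 271]
print(count_leading_bos(infinilm_prefix))
print(collapse_leading_bos(infinilm_prefix) == hf_prefix)
```

Collapsing after the fact is only a safety net; avoiding the second BOS at tokenization time is the cleaner fix.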
