[BUG] Duplicate BOS token prepended when using Llama-3/3.1 chat interface

Description
When model.chat() is used with Llama-3/3.1 models, prompt_token_ids ends up starting with two <|begin_of_text|> tokens (BOS, token ID 128000) instead of one. The extra token shifts every subsequent position by 1, and with it the RoPE positional encodings, so the greedy decoding output diverges significantly from HuggingFace.

Root Cause
The root cause lies in the interaction between the Chat Template and the default behavior of the Llama Fast Tokenizer:

Chat Template adds BOS text: Llama-3/3.1’s tokenizer_config.json explicitly includes <|begin_of_text|> at the very beginning of its chat template. When the prompt is rendered, the string already starts with this special token.
Tokenizer adds BOS token again: Later, BasicLLMProcessor.__call__ passes this rendered string to self.tokenizer(prompt), which defaults to add_special_tokens=True.
What add_special_tokens=True actually does for Llama: By default, AutoTokenizer loads the Fast Tokenizer:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/data/rubik/models/Meta-Llama-3.1-8B-Instruct")
    print(type(tokenizer))
    # Output: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>

In the source at transformers/models/llama/tokenization_llama_fast.py, LlamaTokenizerFast is initialized with the defaults add_bos_token=True and add_eos_token=False.

Therefore, when add_special_tokens=True is passed, the tokenizer’s Rust backend PostProcessor enforces this configuration and automatically prepends a second BOS token (128000) to the sequence, regardless of the text content.
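
One possible guard against the double tokenization described above is sketched here. It is only an illustration, not the project's actual fix: encode_rendered_prompt is a hypothetical helper, and FakeTokenizer is a stand-in that mimics just the BOS behavior reported in this issue (it produces dummy IDs for everything else). The only real tokenizer features assumed are the bos_token attribute and the add_special_tokens keyword of the tokenizer call.

```python
BOS_TEXT = "<|begin_of_text|>"
BOS_ID = 128000  # BOS token ID for Llama-3/3.1, as reported in this issue


def encode_rendered_prompt(tokenizer, prompt):
    """Tokenize a chat-template-rendered prompt without adding a second BOS.

    Hypothetical helper: if the rendered string already starts with the BOS
    text, disable the tokenizer's own special-token insertion.
    """
    bos = getattr(tokenizer, "bos_token", None)
    already_has_bos = bool(bos) and prompt.startswith(bos)
    encoded = tokenizer(prompt, add_special_tokens=not already_has_bos)
    return encoded["input_ids"]


class FakeTokenizer:
    """Stand-in for demonstration only; it is NOT the real Llama tokenizer."""

    bos_token = BOS_TEXT

    def __call__(self, text, add_special_tokens=True):
        ids = []
        # BOS text that appears in the string is mapped to the BOS ID.
        if text.startswith(BOS_TEXT):
            ids.append(BOS_ID)
            text = text[len(BOS_TEXT):]
        # Dummy IDs for the remaining whitespace-separated words.
        ids.extend(1000 + i for i, _ in enumerate(text.split()))
        # Post-processing prepends BOS regardless of the text content.
        if add_special_tokens:
            ids.insert(0, BOS_ID)
        return {"input_ids": ids}


tok = FakeTokenizer()
rendered = BOS_TEXT + "hello world"

print(tok(rendered)["input_ids"])             # [128000, 128000, 1000, 1001]
print(encode_rendered_prompt(tok, rendered))  # [128000, 1000, 1001]
```

A check of this kind could sit at the point where BasicLLMProcessor.__call__ invokes self.tokenizer(prompt). Prompts that do not start with the BOS text still go through the default path and receive a single BOS from the tokenizer.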


Verification
Comparing the prompt_token_ids reveals the inconsistency:

HuggingFace (Expected): [128000, 128006, 9125, 128007, 271, ...] (Single BOS)


<img width="1448" height="459" alt="Image" src="https://github.com/user-attachments/assets/cfb10ea6-fceb-49b9-98ed-f022f91b4b70" />

<img width="1152" height="612" alt="Image" src="https://github.com/user-attachments/assets/76f2973b-dbe9-40bb-993e-34aa2ae0773c" />

InfiniLM (Actual): [128000, 128000, 128006, 9125, 128007, 271, ...] (Duplicate BOS)

<img width="1081" height="751" alt="Image" src="https://github.com/user-attachments/assets/df599a9b-5caf-4a4e-8914-0840d6ccbbdb" />

<img width="1468" height="503" alt="Image" src="https://github.com/user-attachments/assets/8828cb9a-4a0c-4e46-aebe-bc23094bbc33" />



Because of this positional shift, greedy decoding (temperature=0.0, top_k=1) produces entirely different text from HF. Manually feeding HF the double-BOS input reproduces InfiniLM’s output, confirming the duplicate token as the sole cause.
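
The comparison above can also be checked programmatically. The following is a minimal sketch using only the ID prefixes quoted in this report (the full sequences are truncated there); count_leading_bos and first_mismatch are hypothetical helper names introduced for illustration.

```python
BOS_ID = 128000  # <|begin_of_text|> for Llama-3/3.1, per this report


def count_leading_bos(token_ids, bos_id=BOS_ID):
    """Return how many consecutive BOS tokens start the sequence."""
    count = 0
    for tid in token_ids:
        if tid != bos_id:
            break
        count += 1
    return count


def first_mismatch(a, b):
    """Return the first index where two ID lists differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Prefixes copied from the verification section; the remaining IDs are elided.
hf_ids = [128000, 128006, 9125, 128007, 271]
infinilm_ids = [128000, 128000, 128006, 9125, 128007, 271]

print(count_leading_bos(hf_ids))             # 1
print(count_leading_bos(infinilm_ids))       # 2
print(first_mismatch(hf_ids, infinilm_ids))  # 1
print(infinilm_ids[1:] == hf_ids)            # True: same tokens, shifted by one
```

A count greater than 1 from count_leading_bos on prompt_token_ids would flag the duplicate before generation, which may be cheaper than comparing decoded text against HF.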

