Phi 3 Mini 128K leads to Tokenization Mismatch #34

Open
ritwickchaudhry opened this issue Aug 1, 2024 · 4 comments

Comments

ritwickchaudhry commented Aug 1, 2024

Hi!
Thanks for the amazing work. I am trying to use the Phi 3 Mini 128K model, but I get a tokenization mismatch error (relevant code). I get the same error even with the 4K model. Can you please explain why this issue exists and/or what changes I need to make to the preprocessing code to support these models? I think it's mainly due to the changes made to the Phi 3 models in July.
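
For reference, here is the minimal check I'm using to see how the updated tokenizer splits a round (a sketch using the stock transformers AutoTokenizer, not the repo's code); if these counts differ between the pre- and post-July checkpoints, any hard-coded offsets in the preprocessor will drift by the same amount:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# One serialized round in the phi3-instruct format
prompt = "\n<|user|>\nHello<|end|>\n<|assistant|>\nHi there!<|end|>"
ids = tok(prompt).input_ids

print(len(ids))
print(tok.convert_ids_to_tokens(ids))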

@ZichenMiao

Same problem here.

arvillion commented Oct 17, 2024

I encountered the same type of error when training with the Phi-3 mini-4k model. I then changed the following lines in train.py and conversations.py respectively, and it seemed to work well.

# def preprocess_phi3(
- else:
-     round_len -= 2
-     instruction_len -= 2
+ else:
+     round_len += 1
+     instruction_len += 1

# conv_phi3_instruct = Conversation(
- roles=("\n<|user|>\n", "\n<|assistant|>\n"),
+ roles=("<|user|>", "<|assistant|>"),
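
A quick way to check whether the offsets line up after a change like this (my own sketch; preprocess_phi3, tokenizer, and IGNORE_INDEX are the ones already defined in train.py, and source is a single conversation from the dataset): decode only the unmasked target positions and confirm they contain exactly the assistant responses.

batch = preprocess_phi3([source], tokenizer, has_image=False)
labels = batch["labels"][0]

# Everything except the assistant responses should be IGNORE_INDEX;
# if system/user text shows up here, the offsets are still wrong.
kept = batch["input_ids"][0][labels != IGNORE_INDEX]
print(tokenizer.decode(kept))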

lst627 commented Dec 3, 2024

Same problem here. I tried @arvillion's method but it did not work, so I used the old Phi-3 model (from before the July update) and it worked well.
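
If it helps anyone, pinning the pre-July checkpoint can be done through the revision argument (a sketch; the hash below is a placeholder, look up the actual commit in the History tab of the model repo on the Hub):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
rev = "pre-july-commit-hash"  # placeholder, not a real commit hash

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=rev)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=rev, trust_remote_code=True)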

@hunarbatra

Faced the same problem! I've updated the phi-3 preprocessor as follows:

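# Drop-in replacement for preprocess_phi3 in train.py; it relies on the
# module's existing imports (torch, transformers, typing.Dict,
# conversation_lib, tokenizer_image_token, IGNORE_INDEX,
# IS_TOKENIZER_GREATER_THAN_0_14).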
def preprocess_phi3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False
) -> Dict:
    conv = conversation_lib.conv_templates["phi3-instruct"].copy()
    roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

    # Apply prompt templates
    conversations = []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != conv.roles[0]:
            # Skip the first one if it is not from human
            source = source[1:]

        conv.messages = []
        for j, sentence in enumerate(source):
            role = roles[sentence["from"]]
            assert role == conv.roles[j % 2], f"{i}"
            conv.append_message(role, sentence["value"])
        conversations.append(conv.get_prompt())

    # Tokenize conversations
    if has_image:
        input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
    else:
        input_ids = tokenizer(
            conversations,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        ).input_ids

    targets = input_ids.clone()
    assert conv.sep_style == conversation_lib.SeparatorStyle.MPT

    # Mask targets: fill system and user prompts with the ignore index so the
    # model is only trained to generate the assistant responses.
    # input_ids contain every token in the conversation, including the system
    # and user prompts; only the targets are masked, input_ids are not modified.
    sep = '<|end|>' + conv.roles[1]
    
    for conversation, target in zip(conversations, targets):
        total_len = int(target.ne(tokenizer.pad_token_id).sum())

        # Split on the end-of-turn token and regroup: the first chunk keeps
        # system + user + assistant together, later chunks pair user + assistant
        rounds = conversation.split('<|end|>')
        re_rounds = ['<|end|>'.join(rounds[:3])]  # system + user + gpt
        for conv_idx in range(3, len(rounds), 2):
            re_rounds.append('<|end|>'.join(rounds[conv_idx:conv_idx+2]))  # user + gpt

        cur_len = 0
        target[:cur_len] = IGNORE_INDEX

        for i, rou in enumerate(re_rounds):
            if rou == "":
                break

            parts = rou.split(sep)
            if len(parts) != 2:
                break

            parts[0] += sep

            if has_image:
                round_len = len(tokenizer_image_token(rou, tokenizer))
                instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
            else:
                round_len = len(tokenizer(rou).input_ids)
                instruction_len = len(tokenizer(parts[0]).input_ids) - 1

            # Length adjustments for special tokens around round boundaries
            if i == 0:
                round_len += 1
                instruction_len += 1
            else:
                round_len -= 2
                instruction_len -= 2

            if i != 0 and getattr(tokenizer, 'legacy', False) and IS_TOKENIZER_GREATER_THAN_0_14:
                round_len += 1
                instruction_len += 1

            # Mask everything up to the end of this round's instruction
            target[cur_len : cur_len + instruction_len] = IGNORE_INDEX

            cur_len += round_len

        # Mask any remaining tokens (e.g. padding) after the last round
        target[cur_len:] = IGNORE_INDEX

        if cur_len < tokenizer.model_max_length:
            if cur_len != total_len:
                target[:] = IGNORE_INDEX
                print(
                    f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                    f" (ignored)"
                )

    return dict(
        input_ids=input_ids,
        labels=targets,
    )
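
To make the regrouping concrete, here is a standalone illustration with a hand-written prompt (not repo code): the first chunk keeps system + user + assistant together, and each later chunk pairs one user turn with one assistant turn.

prompt = (
    "<|system|>\nYou are helpful.<|end|>"
    "\n<|user|>\nQ1<|end|>\n<|assistant|>\nA1<|end|>"
    "\n<|user|>\nQ2<|end|>\n<|assistant|>\nA2<|end|>"
)
rounds = prompt.split('<|end|>')
re_rounds = ['<|end|>'.join(rounds[:3])]       # system + user + gpt
for idx in range(3, len(rounds), 2):
    re_rounds.append('<|end|>'.join(rounds[idx:idx + 2]))  # user + gpt
for r in re_rounds:
    print(repr(r))  # the final chunk is "" and the masking loop breaks on it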

and this is the conv_template for phi-3 that I'm using:

conv_phi3_instruct = Conversation(
    system="""<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.""",
    roles=("\n<|user|>\n", "\n<|assistant|>\n"),
    version="phi3",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.MPT,
    sep="<|end|>",
)
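
For reference, under LLaVA-style SeparatorStyle.MPT rendering (my reading of conversations.py: ret = system + sep, then role + message + sep per turn), this template serializes as shown by hand below:

system = ("<|system|>\nA chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite answers to "
          "the user's questions.")
sep = "<|end|>"
roles = ("\n<|user|>\n", "\n<|assistant|>\n")
messages = [(roles[0], "Hello"), (roles[1], "Hi there!")]

# MPT-style rendering: system prompt + sep, then role + message + sep per turn
ret = system + sep
for role, msg in messages:
    ret += role + msg + sep
print(ret)
# <|system|>
# A chat between a curious user ... questions.<|end|>
# <|user|>
# Hello<|end|>
# <|assistant|>
# Hi there!<|end|>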
