Phi 3 Mini 128K leads to Tokenization Mismatch #34

Open
ritwickchaudhry opened this issue Aug 1, 2024 · 4 comments

Comments

ritwickchaudhry commented Aug 1, 2024

Hi!
Thanks for the amazing work. I am trying to use the Phi 3 Mini 128K model, but I get a tokenization mismatch error (relevant code). I get the same error even with the 4K model. Can you please explain why this issue exists and/or what changes I need to make to the preprocessing code to support these models? I think it's mainly due to the changes made to the Phi 3 models in July.
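
For reference, here is the minimal check I'm using to see how the updated tokenizer splits a round (a sketch using the stock transformers AutoTokenizer, not the repo's code); if these counts differ between the pre- and post-July checkpoints, any hard-coded offsets in the preprocessor will drift by the same amount:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# One serialized round in the phi3-instruct format
prompt = "\n<|user|>\nHello<|end|>\n<|assistant|>\nHi there!<|end|>"
ids = tok(prompt).input_ids

print(len(ids))
print(tok.convert_ids_to_tokens(ids))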

@ZichenMiao

Same problem here.

arvillion commented Oct 17, 2024

I encountered the same type of error when training with the Phi-3 mini-4k model. I then changed the following lines in train.py and conversations.py respectively, and it seemed to work well.

# def preprocess_phi3(
- else:
-     round_len -= 2
-     instruction_len -= 2
+ else:
+     round_len += 1
+     instruction_len += 1

# conv_phi3_instruct = Conversation(
- roles=("\n<|user|>\n", "\n<|assistant|>\n"),
+ roles=("<|user|>", "<|assistant|>"),
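
A quick way to check whether the offsets line up after a change like this (my own sketch; preprocess_phi3, tokenizer, and IGNORE_INDEX are the ones already defined in train.py, and source is a single conversation from the dataset): decode only the unmasked target positions and confirm they contain exactly the assistant responses.

batch = preprocess_phi3([source], tokenizer, has_image=False)
labels = batch["labels"][0]

# Everything except the assistant responses should be IGNORE_INDEX;
# if system/user text shows up here, the offsets are still wrong.
kept = batch["input_ids"][0][labels != IGNORE_INDEX]
print(tokenizer.decode(kept))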

lst627 commented Dec 3, 2024

Same problem here. I tried @arvillion's method but it did not work, so I used the old Phi-3 model (from before the July update) and it worked well.
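
If it helps anyone, pinning the pre-July checkpoint can be done through the revision argument (a sketch; the hash below is a placeholder, look up the actual commit in the History tab of the model repo on the Hub):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
rev = "pre-july-commit-hash"  # placeholder, not a real commit hash

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=rev)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=rev, trust_remote_code=True)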

@hunarbatra

Faced the same problem! I've updated the phi-3 preprocessor as follows:

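# Drop-in replacement for preprocess_phi3 in train.py; it relies on the
# module's existing imports (torch, transformers, typing.Dict,
# conversation_lib, tokenizer_image_token, IGNORE_INDEX,
# IS_TOKENIZER_GREATER_THAN_0_14).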
def preprocess_phi3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False
) -> Dict:
    conv = conversation_lib.conv_templates["phi3-instruct"].copy()
    roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

    # Apply prompt templates
    conversations = []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != conv.roles[0]:
            # Skip the first one if it is not from human
            source = source[1:]

        conv.messages = []
        for j, sentence in enumerate(source):
            role = roles[sentence["from"]]
            assert role == conv.roles[j % 2], f"{i}"
            conv.append_message(role, sentence["value"])
        conversations.append(conv.get_prompt())

    # Tokenize conversations
    if has_image:
        input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
    else:
        input_ids = tokenizer(
            conversations,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        ).input_ids

    targets = input_ids.clone()
    assert conv.sep_style == conversation_lib.SeparatorStyle.MPT

    # Mask targets: fill system and user prompts with the ignore index so the
    # model is only trained to generate the assistant responses.
    # input_ids contain every token in the conversation, including the system
    # and user prompts; only the targets are masked, input_ids are not modified.
    sep = '<|end|>' + conv.roles[1]
    
    for conversation, target in zip(conversations, targets):
        total_len = int(target.ne(tokenizer.pad_token_id).sum())

        # Split on the end-of-turn token and regroup: the first chunk keeps
        # system + user + assistant together, later chunks pair user + assistant
        rounds = conversation.split('<|end|>')
        re_rounds = ['<|end|>'.join(rounds[:3])]  # system + user + gpt
        for conv_idx in range(3, len(rounds), 2):
            re_rounds.append('<|end|>'.join(rounds[conv_idx:conv_idx+2]))  # user + gpt

        cur_len = 0
        target[:cur_len] = IGNORE_INDEX

        for i, rou in enumerate(re_rounds):
            if rou == "":
                break

            parts = rou.split(sep)
            if len(parts) != 2:
                break

            parts[0] += sep

            if has_image:
                round_len = len(tokenizer_image_token(rou, tokenizer))
                instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
            else:
                round_len = len(tokenizer(rou).input_ids)
                instruction_len = len(tokenizer(parts[0]).input_ids) - 1

            # Length adjustments for special tokens around round boundaries
            if i == 0:
                round_len += 1
                instruction_len += 1
            else:
                round_len -= 2
                instruction_len -= 2

            if i != 0 and getattr(tokenizer, 'legacy', False) and IS_TOKENIZER_GREATER_THAN_0_14:
                round_len += 1
                instruction_len += 1

            # Mask everything up to the end of this round's instruction
            target[cur_len : cur_len + instruction_len] = IGNORE_INDEX

            cur_len += round_len

        # Mask any remaining tokens (e.g. padding) after the last round
        target[cur_len:] = IGNORE_INDEX

        if cur_len < tokenizer.model_max_length:
            if cur_len != total_len:
                target[:] = IGNORE_INDEX
                print(
                    f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                    f" (ignored)"
                )

    return dict(
        input_ids=input_ids,
        labels=targets,
    )
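
To make the regrouping concrete, here is a standalone illustration with a hand-written prompt (not repo code): the first chunk keeps system + user + assistant together, and each later chunk pairs one user turn with one assistant turn.

prompt = (
    "<|system|>\nYou are helpful.<|end|>"
    "\n<|user|>\nQ1<|end|>\n<|assistant|>\nA1<|end|>"
    "\n<|user|>\nQ2<|end|>\n<|assistant|>\nA2<|end|>"
)
rounds = prompt.split('<|end|>')
re_rounds = ['<|end|>'.join(rounds[:3])]       # system + user + gpt
for idx in range(3, len(rounds), 2):
    re_rounds.append('<|end|>'.join(rounds[idx:idx + 2]))  # user + gpt
for r in re_rounds:
    print(repr(r))  # the final chunk is "" and the masking loop breaks on it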

and this is the conv_template for phi-3 that I'm using:

conv_phi3_instruct = Conversation(
    system="""<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.""",
    roles=("\n<|user|>\n", "\n<|assistant|>\n"),
    version="phi3",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.MPT,
    sep="<|end|>",
)
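
For reference, under LLaVA-style SeparatorStyle.MPT rendering (my reading of conversations.py: ret = system + sep, then role + message + sep per turn), this template serializes as shown by hand below:

system = ("<|system|>\nA chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite answers to "
          "the user's questions.")
sep = "<|end|>"
roles = ("\n<|user|>\n", "\n<|assistant|>\n")
messages = [(roles[0], "Hello"), (roles[1], "Hi there!")]

# MPT-style rendering: system prompt + sep, then role + message + sep per turn
ret = system + sep
for role, msg in messages:
    ret += role + msg + sep
print(ret)
# <|system|>
# A chat between a curious user ... questions.<|end|>
# <|user|>
# Hello<|end|>
# <|assistant|>
# Hi there!<|end|>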
