Phi 3 Mini 128K leads to Tokenization Mismatch #34
Comments
Same problem here.
I encountered the same type of error when training with the Phi-3 mini-4k model. Later I changed the following lines in `preprocess_phi3`:

```diff
         else:
-            round_len -= 2
-            instruction_len -= 2
+            round_len += 1
+            instruction_len += 1
```

and in the `conv_phi3_instruct = Conversation(...)` template:

```diff
-    roles=("\n<|user|>\n", "\n<|assistant|>\n"),
+    roles=("<|user|>", "<|assistant|>"),
```
Same problem here. I tried @arvillion's method but it did not work, so I tried using the old Phi-3 model (from before the July update) and it worked well.
Faced the same problem! I've updated the phi-3 preprocessor as follows:

```python
def preprocess_phi3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False
) -> Dict:
    # Drop-in replacement; assumes the surrounding training script already provides
    # torch, transformers, Dict, conversation_lib, tokenizer_image_token,
    # IGNORE_INDEX and IS_TOKENIZER_GREATER_THAN_0_14, as in LLaVA-style codebases.
    conv = conversation_lib.conv_templates["phi3-instruct"].copy()
    roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

    # Apply prompt templates
    conversations = []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != conv.roles[0]:
            # Skip the first one if it is not from human
            source = source[1:]

        conv.messages = []
        for j, sentence in enumerate(source):
            role = roles[sentence["from"]]
            assert role == conv.roles[j % 2], f"{i}"
            conv.append_message(role, sentence["value"])
        conversations.append(conv.get_prompt())

    # Tokenize conversations
    if has_image:
        input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
    else:
        input_ids = tokenizer(
            conversations,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        ).input_ids

    targets = input_ids.clone()
    assert conv.sep_style == conversation_lib.SeparatorStyle.MPT

    # Mask targets - mask system and user prompts with the ignore index so that
    # the model is trained only to generate the assistant responses.
    # input_ids contain all the tokens in the conversation, including the system
    # and user prompts; only the targets are masked, the input_ids are not modified.
    sep = '<|end|>' + conv.roles[1]
    for conversation, target in zip(conversations, targets):
        total_len = int(target.ne(tokenizer.pad_token_id).sum())

        rounds = conversation.split('<|end|>')
        re_rounds = ['<|end|>'.join(rounds[:3])]  # system + user + gpt
        for conv_idx in range(3, len(rounds), 2):
            re_rounds.append(conv.sep.join(rounds[conv_idx:conv_idx + 2]))  # user + gpt

        cur_len = 0
        target[:cur_len] = IGNORE_INDEX
        for i, rou in enumerate(re_rounds):
            if rou == "":
                break

            parts = rou.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep

            if has_image:
                round_len = len(tokenizer_image_token(rou, tokenizer))
                instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
            else:
                round_len = len(tokenizer(rou).input_ids)
                instruction_len = len(tokenizer(parts[0]).input_ids) - 1

            if i == 0:
                round_len += 1
                instruction_len += 1
            else:
                round_len -= 2
                instruction_len -= 2

            if i != 0 and getattr(tokenizer, 'legacy', False) and IS_TOKENIZER_GREATER_THAN_0_14:
                round_len += 1
                instruction_len += 1

            target[cur_len:cur_len + instruction_len] = IGNORE_INDEX
            cur_len += round_len
        target[cur_len:] = IGNORE_INDEX

        if cur_len < tokenizer.model_max_length:
            if cur_len != total_len:
                target[:] = IGNORE_INDEX
                print(
                    f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                    f" (ignored)"
                )

    return dict(
        input_ids=input_ids,
        labels=targets,
    )
```

and this is the conv_template for phi-3 that I'm using:

```python
conv_phi3_instruct = Conversation(
    system="""<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.""",
    roles=("\n<|user|>\n", "\n<|assistant|>\n"),
    version="phi3",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.MPT,
    sep="<|end|>",
)
```
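If it helps anyone debugging this, a quick hedged sanity check (not from the repo; it assumes `preprocess_phi3`, a loaded Phi-3 `tokenizer`, and `IGNORE_INDEX` are already in scope, and the toy conversation below is illustrative): the tokens left unmasked in `labels` should decode to exactly the assistant reply.

```python
# Hedged sanity check: assumes preprocess_phi3, tokenizer and IGNORE_INDEX
# are in scope; the toy conversation is illustrative.
sources = [[
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 equals 4."},
]]
batch = preprocess_phi3(sources, tokenizer, has_image=False)
labels = batch["labels"][0]
kept = labels[labels != IGNORE_INDEX]  # the tokens the loss is computed on
print(tokenizer.decode(kept))
# Expect roughly "2 + 2 equals 4.<|end|>"; if everything was masked instead,
# the mismatch warning fired and the offsets are wrong for this tokenizer.
```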
Hi!
Thanks for the amazing work. I am trying to use the Phi 3 Mini 128K model. Unfortunately, I get a tokenization mismatch error (relevant code). The error also occurs with the 4K model. Could you please explain why this issue exists, and/or what changes to the preprocessing code are needed to support these models? I think it is mainly due to the change made to the Phi 3 models in July.
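For reference, one hedged way to see what the July update changed is to inspect how the current tokenizer handles the template's separators (the model ID is the one from this issue; everything else is illustrative). If `<|end|>` or the role tags now tokenize to a different number of tokens, every `round_len` in the preprocessor shifts by a constant and the mismatch check fails:

```python
from transformers import AutoTokenizer

# Illustrative diagnostic: inspect how the separators used by the phi-3
# template are tokenized by the current (post-July) tokenizer.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
print(tok.special_tokens_map)
print(tok("<|end|>", add_special_tokens=False).input_ids)            # cost of the separator
print(tok("\n<|assistant|>\n", add_special_tokens=False).input_ids)  # cost of the role tag used in sep
```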