T5 Tokenizer prepending extra space on decode #329
Description
When decoding a response from a fine-tuned T5 tokenizer, a leading space is prepended.
```
tokenIds:     [571, 33, 25, 58]
tokenStrings: ["▁How", "▁are", "▁you", "?"]
decoded:      [" How", " are", " you", "?"]
```
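To illustrate the mechanism, here is a minimal sketch (the `metaspace_decode` helper is hypothetical, not the library code) of how Metaspace decoding maps token strings back to text, and why an `addPrefixSpace` that defaults to `false` leaves the extra leading space in place:

```python
def metaspace_decode(tokens, replacement="▁", add_prefix_space=True):
    """Minimal sketch of Metaspace decoding (illustration only)."""
    pieces = []
    for i, tok in enumerate(tokens):
        text = tok.replace(replacement, " ")
        # When a prefix space was added at encode time, the decoder is
        # expected to strip it from the first token.
        if i == 0 and add_prefix_space and text.startswith(" "):
            text = text[1:]
        pieces.append(text)
    return "".join(pieces)

tokens = ["▁How", "▁are", "▁you", "?"]
print(metaspace_decode(tokens, add_prefix_space=True))   # "How are you?"
print(metaspace_decode(tokens, add_prefix_space=False))  # " How are you?"
```

With `add_prefix_space=False` (the fallback default in the Swift decoder), the leading space survives, matching the behavior reported above.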
The issue is probably how the `MetaspaceDecoder` is initialized:

```swift
addPrefixSpace = config.addPrefixSpace.boolean(or: false)
```
Or possibly how the `MetaspaceTokenizer` is initialized and handles legacy keys:

```swift
// prepend_scheme supersedes add_prefix_space per tokenizers PR #1357.
```
Newer versions of the Python transformers library drop or rewrite keys in `tokenizer.json`. The decoder looks only for `addPrefixSpace`, which is never derived from `prepend_scheme` when a newer tokenizer config is supplied.

I don't know the correct derivation, or the general plans for tokenizer migration, so hopefully this is enough information to make the correct fix. The Python `T5Tokenizer` has some fallback code on initialization that obscures the data-model change, but it wasn't clear whether that is appropriate here.
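One possible fallback derivation (an assumption on my part, not confirmed against the tokenizers library): treat `prepend_scheme` values `"always"` and `"first"` as implying a prefix space, and fall back to that mapping when the legacy `add_prefix_space` key is absent. Sketched in Python against the two decoder config shapes shown in the diff below:

```python
def derive_add_prefix_space(decoder_cfg, default=False):
    """Hypothetical fallback: derive add_prefix_space when the newer
    tokenizer.json format carries only prepend_scheme."""
    if "add_prefix_space" in decoder_cfg:       # legacy format
        return bool(decoder_cfg["add_prefix_space"])
    if "prepend_scheme" in decoder_cfg:         # newer format
        return decoder_cfg["prepend_scheme"] in ("always", "first")
    return default

# Old- and new-format Metaspace decoder configs, as in the diff below.
old_cfg = {"type": "Metaspace", "replacement": "▁",
           "add_prefix_space": True, "str_rep": "▁"}
new_cfg = {"type": "Metaspace", "replacement": "▁",
           "prepend_scheme": "always", "split": True}

print(derive_add_prefix_space(old_cfg))  # True
print(derive_add_prefix_space(new_cfg))  # True
```

Under this reading, both config formats would yield the same prefix-space behavior, which is presumably the intent of the format migration.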
The following (run with transformers-5.3) generates a tokenizer that shows the problem:

```python
from transformers import T5Tokenizer

mpath = "google-t5/t5-small"
tokenizer = T5Tokenizer.from_pretrained(mpath)
tokenizer.save_pretrained("./t5-fine-tune")
```
Here are the key changes:

```diff
> diff -u <(jq -S . <./google-t5/t5-small/tokenizer.json) <(jq -S . < ./t5-fine-tune/tokenizer.json)
--- /dev/fd/63 2026-03-07 06:55:12
+++ /dev/fd/62 2026-03-07 06:55:12
@@ -929,12 +929,14 @@
     }
   ],
   "decoder": {
-    "add_prefix_space": true,
+    "prepend_scheme": "always",
     "replacement": "▁",
-    "str_rep": "▁",
+    "split": true,
     "type": "Metaspace"
   },
   "model": {
+    "byte_fallback": false,
+    "type": "Unigram",
     "unk_id": 2,
     "vocab": [
       [
@@ -129404,9 +129406,9 @@
       "type": "WhitespaceSplit"
     },
     {
-      "add_prefix_space": true,
+      "prepend_scheme": "always",
       "replacement": "▁",
-      "str_rep": "▁",
+      "split": true,
       "type": "Metaspace"
     }
   ],
```