T5 Tokenizer prepending extra space on decode #329

@msolo

When decoding a response with a fine-tuned T5 tokenizer, the decoded output has an extra leading space.

tokenIds: [571, 33, 25, 58]
tokenStrings: ["▁How", "▁are", "▁you", "?"]
decoded: [" How", " are", " you", "?"]

The issue is probably how the MetaspaceDecoder is initialized:

addPrefixSpace = config.addPrefixSpace.boolean(or: false)

Or possibly how the MetaspaceTokenizer is initialized and handles legacy keys:

// prepend_scheme supersedes add_prefix_space per tokenizers PR #1357.

Newer versions of Python transformers drop or rewrite keys in tokenizer.json. The decoder looks only for addPrefixSpace, which is never derived from prepend_scheme when a tokenizer config in the newer format is loaded.

I don't know the correct derivation, or the general plans for tokenizer migration, so hopefully this is enough information to make the correct fix. The Python version of T5Tokenizer has some fallback code on initialization that obscures the data model change, but it wasn't clear whether that is appropriate here.
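For discussion, here is a minimal Python sketch of one possible derivation. It is an assumption, not the confirmed upstream behaviour: it treats any prepend_scheme other than "never" as equivalent to add_prefix_space = true, and falls back to the legacy key (defaulting to false, as the current initializer does) when prepend_scheme is absent. The helper names derive_add_prefix_space and metaspace_decode are hypothetical and exist only for illustration.

```python
from typing import Any, List, Mapping


def derive_add_prefix_space(config: Mapping[str, Any], default: bool = False) -> bool:
    """Hypothetical helper: resolve the prefix-space flag for a Metaspace config.

    Assumption: a prepend_scheme of "always" or "first" implies a prefix marker
    was added on encode, so only "never" maps to False.
    """
    scheme = config.get("prepend_scheme")
    if scheme is not None:
        return scheme != "never"
    # Legacy configs only carry add_prefix_space.
    return bool(config.get("add_prefix_space", default))


def metaspace_decode(tokens: List[str], add_prefix_space: bool,
                     replacement: str = "▁") -> List[str]:
    """Simplified per-token decode: replace the marker with a space and drop
    the leading space of the first token when a prefix space was added."""
    pieces = []
    for index, token in enumerate(tokens):
        text = token.replace(replacement, " ")
        if index == 0 and add_prefix_space and text.startswith(" "):
            text = text[1:]
        pieces.append(text)
    return pieces


# Decoder section as written by the newer tokenizer.json format.
new_style = {"type": "Metaspace", "replacement": "▁",
             "prepend_scheme": "always", "split": True}
tokens = ["▁How", "▁are", "▁you", "?"]

# Reading only add_prefix_space (missing here) reproduces the reported output.
buggy = metaspace_decode(tokens, bool(new_style.get("add_prefix_space", False)))
# Deriving the flag from prepend_scheme strips the leading space.
fixed = metaspace_decode(tokens, derive_add_prefix_space(new_style))
```

With the new-style config above, buggy keeps the leading space on the first token while fixed does not, and a legacy config with add_prefix_space: true resolves the same way as before.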

The following generates a tokenizer that shows the problem (this is transformers-5.3):

from transformers import T5Tokenizer

mpath = "google-t5/t5-small"
tokenizer = T5Tokenizer.from_pretrained(mpath)
# Re-saving rewrites tokenizer.json in the newer format.
tokenizer.save_pretrained("./t5-fine-tune")

Here are the key changes:

> diff -u <(jq -S . <./google-t5/t5-small/tokenizer.json) <(jq -S . < ./t5-fine-tune/tokenizer.json) 
--- /dev/fd/63	2026-03-07 06:55:12
+++ /dev/fd/62	2026-03-07 06:55:12
@@ -929,12 +929,14 @@
     }
   ],
   "decoder": {
-    "add_prefix_space": true,
+    "prepend_scheme": "always",
     "replacement": "▁",
-    "str_rep": "▁",
+    "split": true,
     "type": "Metaspace"
   },
   "model": {
+    "byte_fallback": false,
+    "type": "Unigram",
     "unk_id": 2,
     "vocab": [
       [
@@ -129404,9 +129406,9 @@
         "type": "WhitespaceSplit"
       },
       {
-        "add_prefix_space": true,
+        "prepend_scheme": "always",
         "replacement": "▁",
-        "str_rep": "▁",
+        "split": true,
         "type": "Metaspace"
       }
     ],
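To check which key style a given tokenizer.json uses, a small script along these lines may help. It is a sketch under assumptions: it only walks the top-level "decoder" and "pre_tokenizer" sections, and the function names find_metaspace, key_style, and report are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def find_metaspace(node: Any, path: str = "") -> Iterator[Tuple[str, dict]]:
    """Recursively yield (path, component) for every Metaspace component."""
    if isinstance(node, dict):
        if node.get("type") == "Metaspace":
            yield path, node
        for key, value in node.items():
            yield from find_metaspace(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_metaspace(value, f"{path}/{index}")


def key_style(component: dict) -> str:
    """Classify a Metaspace component by which prefix-space key it carries."""
    if "prepend_scheme" in component:
        return "new (prepend_scheme)"
    if "add_prefix_space" in component:
        return "legacy (add_prefix_space)"
    return "unspecified"


def report(tokenizer_json: dict) -> Dict[str, str]:
    """Map each Metaspace component path to its key style."""
    result = {}
    # Assumption: only these two sections can hold Metaspace components here.
    for section in ("decoder", "pre_tokenizer"):
        subtree = tokenizer_json.get(section)
        for path, component in find_metaspace(subtree, f"/{section}"):
            result[path] = key_style(component)
    return result


def report_file(filename: str) -> Dict[str, str]:
    """Load a tokenizer.json from disk and report its Metaspace key styles."""
    with open(filename, encoding="utf-8") as handle:
        return report(json.load(handle))
```

For example, report_file("./t5-fine-tune/tokenizer.json") should list both Metaspace components from the diff above as using prepend_scheme, while the original google-t5/t5-small file should show the legacy key.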
