Improve tokenizer decode #403

vturrisi · 2023-02-02T10:53:46Z

Right now the tokenizer's decode method supports only a single instance at a time. I think it would be good to have a batch_decode function and to also support skip_special_tokens and clean_up_tokenization_spaces, as in Hugging Face.
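
To make the request concrete, here is a minimal sketch of what a batch interface could look like. Everything in it is an assumption for illustration: the toy vocabulary, the single-sequence `decode`, and the `clean_up_spaces` helper are hypothetical and not this library's code. The cleanup step only removes the space before a few punctuation marks, which is a simplification of what Hugging Face does.

```python
from typing import Dict, List, Sequence

# Toy vocabulary, purely illustrative (not the real tokenizer's vocabulary).
TOY_VOCAB: Dict[int, str] = {0: "hello", 1: "world", 2: ",", 3: "!"}


def clean_up_spaces(text: str) -> str:
    """Remove the space that naive joining leaves before punctuation."""
    for mark in (".", ",", "!", "?"):
        text = text.replace(" " + mark, mark)
    return text


def decode(token_ids: Sequence[int], clean_up_tokenization_spaces: bool = True) -> str:
    """Decode one sequence of ids (hypothetical single-instance decode)."""
    text = " ".join(TOY_VOCAB[i] for i in token_ids)
    return clean_up_spaces(text) if clean_up_tokenization_spaces else text


def batch_decode(
    batch: Sequence[Sequence[int]], clean_up_tokenization_spaces: bool = True
) -> List[str]:
    """Decode every sequence in the batch by mapping decode over it."""
    return [decode(ids, clean_up_tokenization_spaces) for ids in batch]


if __name__ == "__main__":
    print(batch_decode([[0, 2, 1, 3], [0, 3]]))
```

Keeping `batch_decode` as a thin wrapper over `decode` means both entry points share the same flags and cannot drift apart.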

gpucce · 2023-02-07T16:51:18Z

@vturrisi I'll get to this as soon as I can. What is the skip_special_tokens arg meant to do?

vturrisi · 2023-02-07T16:52:52Z

No worries, @gpucce. It basically removes the SOS and EOS tokens and any padding from the decoded string. See https://huggingface.co/docs/transformers/main_classes/tokenizer
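
As a rough illustration of that behavior, the sketch below filters special ids before mapping ids back to text. The ids chosen for SOS, EOS, and padding, the toy vocabulary, and the `decode_tokens` name are all hypothetical; the real tokenizer's ids and method names may differ.

```python
from typing import Dict, Sequence

# Hypothetical special-token ids, chosen only for this example.
PAD_ID, SOS_ID, EOS_ID = 0, 1, 2
SPECIAL_IDS = frozenset({PAD_ID, SOS_ID, EOS_ID})

# Toy vocabulary, purely illustrative.
TOY_VOCAB: Dict[int, str] = {
    0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "a", 4: "photo",
}


def decode_tokens(token_ids: Sequence[int], skip_special_tokens: bool = False) -> str:
    """Map ids to text, optionally dropping SOS, EOS, and padding ids first."""
    if skip_special_tokens:
        token_ids = [i for i in token_ids if i not in SPECIAL_IDS]
    return " ".join(TOY_VOCAB[i] for i in token_ids)


if __name__ == "__main__":
    ids = [1, 3, 4, 2, 0, 0]  # <sos> a photo <eos> <pad> <pad>
    print(decode_tokens(ids))
    print(decode_tokens(ids, skip_special_tokens=True))
```

Filtering on ids rather than on the decoded string avoids accidentally stripping user text that happens to look like a special token.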
