
Extract pre-tokenization out of tokenization models #441

Merged

robertknight merged 7 commits into main on Dec 4, 2024

Conversation

robertknight (Owner) commented Dec 4, 2024

As part of #427, extract the input splitting logic out of the WordPiece and Bpe tokenization models and make it a separate pre-tokenization step in the pipeline executed by Tokenizer.

TODO:

  • Decide on pre_tokenizer vs pretokenizer in naming
  • Parse pretokenizer field from tokenizer.json files, for already-supported pre-tokenizers
  • Refactor away duplication in tokenization pipeline for the first and second sequences in a pair
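The pipeline ordering this PR establishes (normalize, then pre-tokenize, then feed each piece to the tokenization model) can be sketched roughly as follows. All names here are illustrative stand-ins, not the actual rten-text API, and the model step is a trivial placeholder rather than real WordPiece/BPE encoding:

```rust
// Hypothetical sketch of the pipeline order described in this PR.
// Pre-tokenization is its own step, run by the tokenizer after
// normalization and before the tokenization model, so any model can
// reuse the same splitting logic.

/// Placeholder normalization step (real pipelines may do Unicode
/// normalization, lowercasing, etc.).
fn normalize(input: &str) -> String {
    input.to_lowercase()
}

/// Placeholder pre-tokenizer: split the normalized text into pieces.
/// (The PR's pre-tokenizers include regex-based splitting.)
fn pre_tokenize(normalized: &str) -> Vec<&str> {
    normalized.split_whitespace().collect()
}

/// Placeholder tokenization model: stands in for WordPiece / BPE.
/// Here, one "token id" per byte of the piece.
fn model_encode(piece: &str) -> Vec<u32> {
    piece.bytes().map(u32::from).collect()
}

/// The tokenizer drives the pipeline: normalize -> pre-tokenize -> model.
fn encode(input: &str) -> Vec<u32> {
    let normalized = normalize(input);
    pre_tokenize(&normalized)
        .into_iter()
        .flat_map(model_encode)
        .collect()
}

fn main() {
    println!("{:?}", encode("Hi yo"));
}
```

Because splitting now happens outside the model, the same pre-tokenizer can be paired with either the WordPiece or the BPE model.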

Commit messages:

  • This will be useful for getting the offsets of subslices yielded by pre-tokenization of the normalized input.
  • Move the regex splitting pre-tokenization out of the BPE model and into a separate pipeline step called by `Tokenizer` after normalization. This makes this pre-tokenization method usable with other tokenization models and is part of aligning the tokenization pipeline in rten-text with Hugging Face Tokenizers.
  • Align with the naming convention in the tokenizers crate.
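The first commit mentions tracking the offsets of subslices yielded by pre-tokenization. A minimal, dependency-free sketch of that idea is below; the type and function names are hypothetical, and whitespace splitting stands in for the PR's regex-based splitting:

```rust
// Hypothetical sketch: a pre-tokenizer that yields subslices of the
// normalized input together with their byte offsets, so token positions
// can later be mapped back to the original text.

/// A piece of the input plus the byte offset where it starts.
#[derive(Debug, PartialEq)]
struct PreTokenizedSlice<'a> {
    offset: usize,
    text: &'a str,
}

/// Split on whitespace while recording byte offsets. (Whitespace
/// splitting is used here only to keep the sketch dependency-free;
/// the PR's pre-tokenizer uses regex splitting.)
fn pre_tokenize(input: &str) -> Vec<PreTokenizedSlice<'_>> {
    let mut pieces = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in input.char_indices() {
        if ch.is_whitespace() {
            // Close the current piece, if any.
            if let Some(s) = start.take() {
                pieces.push(PreTokenizedSlice { offset: s, text: &input[s..i] });
            }
        } else if start.is_none() {
            // Begin a new piece at this byte offset.
            start = Some(i);
        }
    }
    if let Some(s) = start {
        pieces.push(PreTokenizedSlice { offset: s, text: &input[s..] });
    }
    pieces
}

fn main() {
    for p in pre_tokenize("hello  world") {
        println!("{:?} at byte {}", p.text, p.offset);
    }
}
```

Returning borrowed subslices with offsets, rather than owned strings, keeps the mapping from tokens back to the normalized input cheap and lossless.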
@robertknight force-pushed the byte-level-pre-tokenizer branch from a69e7dc to d49d6b2 on December 4, 2024 at 20:04.
This is now subsumed by `TokenizerError::PreTokenizeFailed`.
@robertknight marked this pull request as ready for review on December 4, 2024 at 20:57.
@robertknight merged commit d51994e into main on Dec 4, 2024.
2 checks passed