Align tokenizer pipeline and terminology with Hugging Face tokenizers #427

robertknight · 2024-12-02T08:06:17Z

The main function of the rten-text crate is to provide a relatively lightweight, Rust-only implementation of popular tokenizers (WordPiece, BPE, etc.).

The separation of concerns is currently a little muddled. Since most projects and examples instantiate tokenizers from Hugging Face tokenizer.json files, it would make sense to align the crate with the tokenization pipeline that these files capture.

The sub-tasks are not well-defined yet, but the outcome should be that rten-text's tokenizers are a fairly straightforward implementation of this pipeline.
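For reference, the pipeline described by a Hugging Face tokenizer.json file runs in four stages: normalization, pre-tokenization, the model, and post-processing. A minimal sketch of that separation follows. All function names, the whitespace splitting, the whole-word vocabulary lookup, and the special token IDs are assumptions made for illustration; none of this is rten-text's actual API.

```rust
use std::collections::HashMap;

/// Stage 1 (hypothetical): clean up raw text. Here this only lowercases.
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Stage 2 (hypothetical): split normalized text into word-level pieces.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Stage 3 (hypothetical): map each piece to a token ID.
/// A real model (WordPiece, BPE) would split pieces into sub-words;
/// this stand-in does a whole-word lookup and falls back to `unk_id`.
fn model(pieces: &[&str], vocab: &HashMap<&str, u32>, unk_id: u32) -> Vec<u32> {
    pieces
        .iter()
        .map(|piece| vocab.get(piece).copied().unwrap_or(unk_id))
        .collect()
}

/// Stage 4 (hypothetical): wrap the IDs in start and end special tokens.
fn post_process(ids: Vec<u32>, start_id: u32, end_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len() + 2);
    out.push(start_id);
    out.extend(ids);
    out.push(end_id);
    out
}

/// Run the four stages in order. The IDs 0 (unknown), 1 (start) and
/// 2 (end) are arbitrary values chosen for this sketch.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> Vec<u32> {
    let normalized = normalize(text);
    let pieces = pre_tokenize(&normalized);
    let ids = model(&pieces, vocab, 0);
    post_process(ids, 1, 2)
}
```

Keeping each stage behind its own function (or trait) means a tokenizer.json file can select one implementation per stage without the stages knowing about each other.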


Add a `Normalizer` trait which provides the common interface for normalizers and rename the previous `Normalizer` struct to `BertNormalizer`, to align with the name in tokenizer.json files. Change other parts of the tokenization process to use the normalizer via a trait object. Part of #427
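The shape of this change can be sketched roughly as below. The method signature, the `lowercase` field, and the `Tokenizer` struct layout are assumptions for illustration only; the actual trait and types in rten-text may differ.

```rust
/// Hypothetical common interface for normalizers.
trait Normalizer {
    fn normalize(&self, text: &str) -> String;
}

/// Simplified stand-in for a BERT-style normalizer.
/// This sketch only handles optional lowercasing.
struct BertNormalizer {
    lowercase: bool,
}

impl Normalizer for BertNormalizer {
    fn normalize(&self, text: &str) -> String {
        if self.lowercase {
            text.to_lowercase()
        } else {
            text.to_string()
        }
    }
}

/// Hypothetical tokenizer that holds its normalizer as a trait object,
/// so any `Normalizer` implementation can be plugged in at runtime.
struct Tokenizer {
    normalizer: Option<Box<dyn Normalizer>>,
}

impl Tokenizer {
    /// Apply the configured normalizer, or pass the text through unchanged.
    fn normalized(&self, text: &str) -> String {
        match &self.normalizer {
            Some(normalizer) => normalizer.normalize(text),
            None => text.to_string(),
        }
    }
}
```

Using a trait object here lets the normalizer type be chosen when a tokenizer.json file is loaded, rather than at compile time.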


robertknight added the tokenizers label Dec 2, 2024

robertknight mentioned this issue Dec 2, 2024

Convert Normalizer into a trait #428

Merged

robertknight added a commit that referenced this issue Dec 3, 2024

Rename Encoder trait to Model in rten-text

6bcdedb

Rename `Encoder` to `Model` to align with the terminology used in Hugging Face tokenizer.json files. Part of #427.

robertknight added a commit that referenced this issue Dec 3, 2024

Extract tokenization models into models module

d78da72

Part of #427.

robertknight added a commit that referenced this issue Dec 3, 2024

Extract tokenization models into models module

8248b0e

Part of #427.

This was referenced Dec 3, 2024

Rename Encoder trait in rten-text to Model #430

Merged

Move normalization from model into Tokenizer #440

Merged

Extract pre-tokenization out of tokenization models #441

Merged

Update the front page documentation for rten-text #452

Merged
