Could at least de-tokenization be done directly on CUDA? For example, like my hack `bpedecode_vec` in pytorch/pytorch#135704 (comment), which indexes into a detokenization vocab byte table via `repeat_interleave`.
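For reference, the indexing scheme described above can be sketched on CPU with NumPy. The tiny vocab and the helper name `detokenize` are illustrative assumptions, not an existing API; the same operations (`repeat_interleave`/`np.repeat`, `cumsum`, and advanced indexing into a flat byte table) are all available on CUDA tensors in PyTorch:

```python
import numpy as np

# Hypothetical tiny vocab for illustration: token id -> byte string.
vocab = [b"Hel", b"lo", b", ", b"world", b"!"]

# Build the flat byte table plus per-token starts/lengths once
# (in the CUDA version these would live on the device).
lens = np.array([len(t) for t in vocab], dtype=np.int64)
starts = np.concatenate(([0], np.cumsum(lens)[:-1]))
flat = np.frombuffer(b"".join(vocab), dtype=np.uint8)

def detokenize(ids: np.ndarray) -> bytes:
    tok_lens = lens[ids]                          # bytes contributed by each token
    tok_starts = starts[ids]                      # table offset of each token
    total = int(tok_lens.sum())
    # where each token's bytes begin in the output
    out_start = np.concatenate(([0], np.cumsum(tok_lens)[:-1]))
    # position of every output byte within its own token
    intra = np.arange(total) - np.repeat(out_start, tok_lens)
    # gather: each byte's token start + its offset inside that token
    return flat[np.repeat(tok_starts, tok_lens) + intra].tobytes()

print(detokenize(np.array([0, 1, 2, 3, 4])))  # b'Hello, world!'
```

Everything here is elementwise or a gather, so a PyTorch port runs entirely on-device with no Python-level loop over tokens.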
Also, for better CUDAGraph capturability / to avoid CPU syncs, maybe there should be a static-sized, pre-allocated `out=` variant, like `torch.nonzero_static`?
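To make the requested contract concrete, here is a CPU sketch in NumPy (the function name is hypothetical) of `nonzero_static`-style semantics: a fixed output size chosen up front, with truncation or `fill_value` padding, so the output shape never depends on the data and no device-to-host sync is needed:

```python
import numpy as np

def nonzero_static_like(x: np.ndarray, size: int, fill_value: int = -1) -> np.ndarray:
    """Sketch of a static-shaped op: always returns exactly `size` indices,
    truncating extras and padding the tail with `fill_value`."""
    idx = np.flatnonzero(x)[:size]
    out = np.full(size, fill_value, dtype=np.int64)
    out[: len(idx)] = idx
    return out

print(nonzero_static_like(np.array([0, 3, 0, 5, 7]), size=4))
```

A detokenizer with the same shape discipline (fixed byte capacity, padded tail, returned length) would be capturable in a CUDA graph, since every intermediate allocation has a known size.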
Off-topic: the naming is also a bit inconsistent between `batch_decode` and `batch_encode_plus`... What is the motivation for the `_plus` suffix?