Commit 4a13290

add tokenizer tutorial for FastBPE
1 parent edb3531 commit 4a13290

7 files changed: +21020 -10 lines changed


.github/workflows/main.yml

Lines changed: 1 addition & 1 deletion
@@ -48,4 +48,4 @@ jobs:
       # Run unittest
       - name: Test
         run: |
-          python -m pytest
+          python -m unittest

.pylintrc

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ unsafe-load-any-extension=no
 # A comma-separated list of package or module names from where C extensions may
 # be loaded. Extensions are loading into the active Python interpreter and may
 # run arbitrary code
-extension-pkg-whitelist=
+extension-pkg-whitelist=fastBPE

 [MESSAGES CONTROL]

README.md

Lines changed: 22 additions & 8 deletions
@@ -1,7 +1,19 @@
 #   ![Joey-NMT](joey2-small.png) Joey NMT
-[![build](https://github.com/joeynmt/joeynmt/actions/workflows/main.yml/badge.svg)](https://github.com/joeynmt/joeynmt/actions/workflows/main.yml)
+[![build](https://github.com/may-/joeynmt/actions/workflows/main.yml/badge.svg)](https://github.com/may-/joeynmt/actions/workflows/main.yml)
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

+## What's new
+- 4th September 2022: upgraded to JoeyNMT v2.1.0
+
+- 17th August 2022: [Joey S2T Hands-on Tutorial](https://github.com/may-/joeys2t/blob/main/notebooks/joeyS2T_ASR_tutorial.ipynb)
+The article [Learning Neural Machine Translation with "Python+Pytorch" and "JoeyNMT", Part 3](https://atmarkit.itmedia.co.jp/ait/articles/2208/17/news002.html) has been published on [@IT](https://atmarkit.itmedia.co.jp).
+
+- 21st July 2022: [Joey NMT v2.0 Hands-on Tutorial](notebooks/tokenizer_tutorial_ja.ipynb) (in Japanese)
+The article [Learning Neural Machine Translation with "Python+Pytorch" and "JoeyNMT", Part 2](https://atmarkit.itmedia.co.jp/ait/articles/2207/21/news006.html) has been published on [@IT](https://atmarkit.itmedia.co.jp).
+
+- 29th June 2022: [Joey NMT v2.0 Hands-on Tutorial](notebooks/fine_tuning_tutorial_enja.ipynb) (in Japanese)
+The article [Learning Neural Machine Translation with "Python+Pytorch" and "JoeyNMT", Part 1](https://atmarkit.itmedia.co.jp/ait/articles/2206/29/news008.html) has been published on [@IT](https://atmarkit.itmedia.co.jp).
+


 ## Goal and Purpose
 :koala: Joey NMT framework is developed for educational purposes.
@@ -44,7 +56,7 @@ Joey NMT implements the following features (aka the minimalist toolkit of NMT :w

 ## Installation
 Joey NMT is built on [PyTorch](https://pytorch.org/). Please make sure you have a compatible environment.
-We tested Joey NMT 2.0 with
+We tested Joey NMT v2.1 with
 - python 3.10
 - torch 1.12.1
 - cuda 11.6
@@ -70,7 +82,7 @@ $ pip install joeynmt
 ### B. From source (for local development)
 1. Clone this repository:
 ```bash
-$ git clone https://github.com/joeynmt/joeynmt.git
+$ git clone https://github.com/may-/joeynmt.git
 $ cd joeynmt
 ```
 2. Install Joey NMT and its requirements:
@@ -88,9 +100,10 @@ $ pip install joeynmt
 - upgrade to python 3.10, torch 1.12
 - replace Automated Mixed Precision from NVIDIA's amp to Pytorch's amp package
 - replace [discord.py](https://github.com/Rapptz/discord.py) with [pycord](https://github.com/Pycord-Development/pycord) in the Discord Bot demo
-- Data Iterator refactoring
+- data iterator refactoring
 - add wmt14 ende / deen benchmark trained on v2 from scratch
-- bugfixes
+- add tokenizer tutorial
+- minor bugfixes

 <details><summary>previous releases</summary>

@@ -133,8 +146,9 @@ We also updated the [documentation](https://joeynmt.readthedocs.io) thoroughly f
 For details, follow the tutorials in [notebooks](notebooks) dir.
 #### v2.x
 - [quick start with joeynmt2](notebooks/joey_v2_demo.ipynb)
-- [tokenizer tutorial](https://github.com/may-/joeynmt/blob/main/notebooks/tokenizer_tutorial_en.ipynb)
-- [joeyS2T ASR tutorial](https://github.com/may-/joeynmt/blob/joeyS2T/notebooks/joeyS2T_ASR_tutorial.ipynb)
+- [fine tuning tutorial](notebooks/fine_tuning_tutorial_enja.ipynb)
+- [tokenizer tutorial](notebooks/tokenizer_tutorial_en.ipynb)
+- [joeyS2T ASR tutorial](https://github.com/may-/joeys2t/blob/main/notebooks/joeyS2T_ASR_tutorial.ipynb)

 #### v1.x
 - [demo notebook](notebooks/joey_v1_demo.ipynb)
@@ -348,7 +362,7 @@ Here we'll collect projects and repositories that are based on Joey NMT, so you
 inspiration and examples on how to modify and extend the code.

 ### Joey NMT v2.x
-- :ear: **JoeyS2T**. Joey NMT is extended for Speech-to-Text tasks! [Code](https://github.com/may-/joeynmt/tree/joeyS2T)
+- :ear: **JoeyS2T**. Joey NMT is extended for Speech-to-Text tasks! [Code](https://github.com/may-/joeys2t)
 - :right_anger_bubble: **Discord Joey**. This script demonstrates how to deploy Joey NMT models as a Chatbot on Discord. [Code](scripts/discord_joey.py)

 ### Joey NMT v1.x

joeynmt/tokenizers.py

Lines changed: 53 additions & 0 deletions
@@ -312,6 +312,49 @@ def __repr__(self):
                 f"separator={self.separator}, dropout={self.dropout})")


+class FastBPETokenizer(SubwordNMTTokenizer):
+
+    def __init__(
+        self,
+        level: str = "bpe",
+        lowercase: bool = False,
+        normalize: bool = False,
+        max_length: int = -1,
+        min_length: int = -1,
+        **kwargs,
+    ):
+        try:
+            import fastBPE  # pylint: disable=import-outside-toplevel
+        except ImportError as e:
+            logger.error(e)
+            raise ImportError from e
+        super(SubwordNMTTokenizer, self).__init__(level, lowercase, normalize,
+                                                  max_length, min_length, **kwargs)
+        assert self.level == "bpe"
+
+        # set codes file path
+        self.codes: Path = Path(kwargs["codes"])
+        assert self.codes.is_file(), f"codes file {self.codes} not found."
+
+        # instantiate fastBPE object
+        self.bpe = fastBPE.fastBPE(self.codes.as_posix())
+        self.separator = "@@"
+        self.dropout = 0.0
+
+    def __call__(self, raw_input: str, is_train: bool = False) -> List[str]:
+        # fastBPE.apply()
+        tokenized = self.bpe.apply([raw_input])
+        tokenized = tokenized[0].strip().split()
+
+        # check if the input sequence length stays within the valid length range
+        if is_train and self._filter_by_length(len(tokenized)):
+            return None
+        return tokenized
+
+    def set_vocab(self, itos: List[str]) -> None:
+        pass
+
+
 def _build_tokenizer(cfg: Dict) -> BasicTokenizer:
     """Builds tokenizer."""
     tokenizer = None
@@ -352,6 +395,16 @@ def _build_tokenizer(cfg: Dict) -> BasicTokenizer:
                 min_length=cfg.get("min_length", -1),
                 **tokenizer_cfg,
             )
+        elif tokenizer_type == "fastbpe":
+            assert "codes" in tokenizer_cfg
+            tokenizer = FastBPETokenizer(
+                level=cfg["level"],
+                lowercase=cfg.get("lowercase", False),
+                normalize=cfg.get("normalize", False),
+                max_length=cfg.get("max_length", -1),
+                min_length=cfg.get("min_length", -1),
+                **tokenizer_cfg,
+            )
         else:
             raise ConfigurationError(f"{tokenizer_type}: Unknown tokenizer type.")
     else:
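
Note on the `FastBPETokenizer.__call__` flow added above: the following is a minimal, self-contained sketch of the same three steps (apply BPE to a one-element batch, split on whitespace, drop out-of-range sequences during training). It runs without the `fastBPE` extension. `StubBPE` and `tokenize` are hypothetical names introduced only for illustration, the stub's 4-character chunking is a toy rule and not real BPE, and the length-filter condition is an assumption about `_filter_by_length`, which this diff does not show.

```python
from typing import List, Optional


class StubBPE:
    """Hypothetical stand-in for the object created by fastBPE.fastBPE(codes)."""

    def apply(self, sentences: List[str]) -> List[str]:
        out = []
        for sent in sentences:
            pieces = []
            for word in sent.split():
                # Toy segmentation (NOT real BPE): cut each word into 4-char chunks
                # and mark every non-final chunk with the "@@" separator.
                chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
                pieces.extend(chunk + "@@" for chunk in chunks[:-1])
                pieces.append(chunks[-1])
            out.append(" ".join(pieces))
        return out


def tokenize(
    bpe,
    raw_input: str,
    is_train: bool = False,
    min_length: int = -1,
    max_length: int = -1,
) -> Optional[List[str]]:
    """Same steps as __call__ in the diff: apply, split, then length-filter."""
    tokenized = bpe.apply([raw_input])[0].strip().split()
    length = len(tokenized)
    # Assumed semantics of _filter_by_length: True when the length is out of
    # range; a bound of -1 (or 0) disables that side of the check.
    out_of_range = length > max_length > 0 or min_length > length > 0
    if is_train and out_of_range:
        return None  # the caller is expected to skip this training example
    return tokenized


if __name__ == "__main__":
    bpe = StubBPE()
    print(tokenize(bpe, "tokenization works"))
    # ['toke@@', 'niza@@', 'tion', 'work@@', 's']
    print(tokenize(bpe, "tokenization works", is_train=True, max_length=3))
    # None
```

Because the filter only fires when `is_train` is true, the same sentence is returned in full at inference time, so callers on the training path need to handle a `None` result even though the method is annotated as returning `List[str]`.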
