Commit adc8f2b

Merge branch 'main' of https://github.com/may-/joeynmt into main

2 parents: 4dcddd5 + 4a13290


46 files changed: +27183 −6725 lines

.github/workflows/main.yml

Lines changed: 3 additions & 3 deletions
@@ -35,8 +35,8 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install --upgrade torch==1.11.0+cu115 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu115
-          pip install -e .
+          python -m pip install --upgrade torch torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
+          python -m pip install -e .

       # Check code format
       - name: Lint
@@ -48,4 +48,4 @@ jobs:
       # Run unittest
       - name: Test
         run: |
-          python -m pytest
+          python -m unittest
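Context for this change: GitHub-hosted runners have no GPU, so pulling pinned CUDA 11.5 wheels only inflated CI time; the CPU-only wheel index keeps the install small, and dropping the version pins lets CI track current PyTorch releases. As a minimal sketch of the same pattern in a self-contained workflow (job layout, runner image, and action versions are illustrative assumptions, not part of this commit):

```yaml
# Minimal illustrative workflow reusing the install pattern above.
# Job name, runner image, and action versions are assumptions.
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          # CPU-only wheels: hosted runners have no GPU
          python -m pip install --upgrade torch torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
          python -m pip install -e .
      - name: Test
        run: python -m unittest
```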

.pylintrc

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ unsafe-load-any-extension=no
 # A comma-separated list of package or module names from where C extensions may
 # be loaded. Extensions are loading into the active Python interpreter and may
 # run arbitrary code
-extension-pkg-whitelist=
+extension-pkg-whitelist=fastBPE

 [MESSAGES CONTROL]

README.md

Lines changed: 8 additions & 11 deletions
@@ -14,7 +14,7 @@ Joey S2T implements the following features:
 - CMVN, SpecAugment
 - WER evaluation

-Furthermore, all the functionalities in JoeyNMT v2.0 are also available from JoeyS2T:
+Furthermore, all the functionalities in JoeyNMT v2 are also available from JoeyS2T:
 - BLEU and ChrF evaluation
 - BPE tokenization (with BPE dropout option)
 - Beam search and greedy decoding (with repetition penalty, ngram blocker)
@@ -26,31 +26,30 @@ Furthermore, all the functionalities in JoeyNMT v2.0 are also available from Joe


 ## Installation
+
 JoeyS2T is built on [PyTorch](https://pytorch.org/). Please make sure you have a compatible environment.
 We tested JoeyS2T with
 - python 3.10
-- torch 1.11.0
-- cuda 11.5
+- torch 1.12.1
+- cuda 11.6

 Clone this repository and install via pip:
 ```bash
 $ git clone https://github.com/may-/joeys2t.git
 $ cd joeynmt
-$ pip install . -e
-```
-Run the unit tests:
-```bash
-$ python -m unittest
+$ pip install -e .
 ```


+
 ## Documentation & Tutorials

 Please check the JoeyNMT's [documentation](https://joeynmt.readthedocs.io) first, if you are not familiar with JoeyNMT yet.

 For details, follow the tutorials in [notebooks](notebooks) dir.
+
 - [quick-start-with-joeynmt2](notebooks/quick-start-with-joeynmt2.ipynb)
-- [speech-to-text-with-joeynmt2](notebooks/joeyS2T_ASR_tutorial.ipynb)
+- [speech-to-text-with-joeynmt2](notebooks/joeyS2T_ASR_tutorial.ipynb)



@@ -67,5 +66,3 @@ Please leave an issue if you have found a bug in the code.

 For general questions, email me at `ohta <at> cl.uni-heidelberg.de`.

-
-
configs/iwslt14_deen_sp.yaml

Lines changed: 5 additions & 4 deletions
@@ -17,10 +17,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "test/data/iwslt14_sp.vocab"
+        voc_file: "test/data/iwslt14/sp.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "test/data/iwslt14_sp.model"
+            model_file: "test/data/iwslt14/sp.model"
             model_type: "unigram"
             character_coverage: 1.0
             alpha: 0.1
@@ -33,10 +33,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "test/data/iwslt14_sp.vocab"
+        voc_file: "test/data/iwslt14/sp.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "test/data/iwslt14_sp.model"
+            model_file: "test/data/iwslt14/sp.model"
             model_type: "unigram"
             character_coverage: 1.0
             alpha: 0.1
@@ -82,6 +82,7 @@ training:
     overwrite: False
     shuffle: True
     use_cuda: True
+    fp16: True
     print_valid_sents: [0, 1, 2, 3]
     keep_best_ckpts: 5
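The one functional addition here is `fp16: True`, which enables mixed-precision training. A minimal sketch of where the flag sits in a JoeyNMT-style training block (only `fp16` is taken from this hunk; the surrounding keys and values are illustrative assumptions):

```yaml
# Illustrative training block; only fp16 comes from this commit,
# the remaining keys and values are assumed for context.
training:
    random_seed: 42
    optimizer: "adam"
    learning_rate: 0.0002
    batch_size: 4096
    batch_type: "token"
    use_cuda: True
    fp16: True          # mixed-precision training
    shuffle: True
    keep_best_ckpts: 5
```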

configs/jparacrawl_enja_sp.yaml

Lines changed: 7 additions & 7 deletions
@@ -8,7 +8,7 @@ data:
     test: "../datasets/datasets/kftt"
     dataset_type: "huggingface"
     dataset_cfg:
-        name: "en-ja"
+        name: "ja-en"
     sample_train_subset: -1
     sample_dev_subset: 200
     src:
@@ -19,10 +19,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "/scratch5t/ohta/jparacrawl_v3/spm_en.vocab"
+        voc_file: "subwords/jparacrawl_en.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "/scratch5t/ohta/jparacrawl_v3/spm_en.model"
+            model_file: "subwords/jparacrawl_en.model"
             model_type: "unigram"
             character_coverage: 1.0
             nbest_size: 10
@@ -35,10 +35,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "/scratch5t/ohta/jparacrawl_v3/spm_ja.vocab"
+        voc_file: "subwords/jparacrawl_ja.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "/scratch5t/ohta/jparacrawl_v3/spm_ja.model"
+            model_file: "subwords/jparacrawl_ja.model"
             model_type: "unigram"
             character_coverage: 0.995
             nbest_size: 10
@@ -61,8 +61,8 @@ testing:
     tokenize: "ja-mecab"

 training:
-    #load_model: "/workspace/mitarb/ohta/models/jparacrawl_enja_seed456/best.ckpt"
-    random_seed: 456
+    #load_model: "models/jparacrawl_enja/best.ckpt"
+    random_seed: 42
     optimizer: "adam"
     normalization: "tokens"
     adam_betas: [0.9, 0.98]
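The `name` fix from "en-ja" to "ja-en" is more than cosmetic: for Hugging Face-backed datasets, `name` selects the dataset config, and requesting a config that does not exist fails at load time. A hedged sketch of how the pieces of such a data block fit together (the key layout mirrors the hunks above; the exact values are assumptions):

```yaml
# Illustrative data block for a Hugging Face-backed dataset in this fork;
# the key layout follows the diff above, values are assumptions.
data:
    train: "jparacrawl"
    dev: "wmt21"
    test: "kftt"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "ja-en"           # dataset config name (language pair)
    sample_train_subset: -1     # -1 presumably means: use the full training set
    sample_dev_subset: 200
```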

configs/jparacrawl_jaen_sp.yaml

Lines changed: 8 additions & 8 deletions
@@ -2,9 +2,9 @@ name: "jparacrawl_jaen_sp"
 joeynmt_version: "2.0.0"

 data:
-    train: "../datasets/datasets/jparacrawl"
-    dev: "../datasets/datasets/wmt21"
-    test: "../datasets/datasets/kftt"
+    train: "jparacrawl"
+    dev: "wmt21"
+    test: "kftt"
     dataset_type: "huggingface"
     dataset_cfg:
         name: "ja-en"
@@ -18,10 +18,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "data/jparacrawl_v3/spm_ja.vocab"
+        voc_file: "subwords/jparacrawl_ja.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "data/jparacrawl_v3/spm_ja.model"
+            model_file: "subwords/jparacrawl_ja.model"
             model_type: "unigram"
             character_coverage: 1.0
             nbest_size: 10
@@ -34,10 +34,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "data/jparacrawl_v3/spm_en.vocab"
+        voc_file: "subwords/jparacrawl_en.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "data/jparacrawl_v3/spm_en.model"
+            model_file: "subwords/jparacrawl_en.model"
             model_type: "unigram"
             character_coverage: 1.0
             nbest_size: 10
@@ -60,7 +60,7 @@ testing:
     tokenize: "intl"

 training:
-    #load_model: "models/jparacrawl_enja/best.ckpt"
+    #load_model: "models/jparacrawl_jaen/best.ckpt"
     random_seed: 42
     optimizer: "adam"
     normalization: "tokens"
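Both sides here use SentencePiece unigram models with `nbest_size: 10`, i.e. subword regularization: during training, segmentations are sampled from the n-best list instead of always taking the single best split. A consolidated sketch of one source-side block (paths mirror the hunks above; `lang`, `max_length`, and `alpha` are illustrative assumptions):

```yaml
# Illustrative source-side block with SentencePiece subword sampling;
# paths follow the diff above, the other values are assumptions.
src:
    lang: "ja"
    max_length: 100
    level: "bpe"
    voc_limit: 32000
    voc_min_freq: 1
    voc_file: "subwords/jparacrawl_ja.vocab"
    tokenizer_type: "sentencepiece"
    tokenizer_cfg:
        model_file: "subwords/jparacrawl_ja.model"
        model_type: "unigram"
        character_coverage: 1.0
        nbest_size: 10          # sample among the 10 best segmentations
        alpha: 0.1              # sampling smoothing (training only)
```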

configs/rnn_small.yaml

Lines changed: 2 additions & 2 deletions
@@ -113,8 +113,8 @@ model: # specify your model architecture here
     initializer: "xavier_uniform"  # initializer for all trainable weights (xavier_uniform, xavier_normal, zeros, normal, uniform)
     init_weight: 0.01              # weight to initialize; for uniform, will use [-weight, weight]
     init_gain: 1.0                 # gain for Xavier initializer (default: 1.0)
-    bias_initializer: "zeros"      # initializer for bias terms (xavier, zeros, normal, uniform)
-    embed_initializer: "normal"    # initializer for embeddings (xavier, zeros, normal, uniform)
+    bias_initializer: "zeros"      # initializer for bias terms (xavier_uniform, xavier_normal, zeros, normal, uniform)
+    embed_initializer: "normal"    # initializer for embeddings (xavier_uniform, xavier_normal, zeros, normal, uniform)
     embed_init_weight: 0.1         # weight to initialize; for uniform, will use [-weight, weight]
     embed_init_gain: 1.0           # gain for Xavier initializer for embeddings (default: 1.0)
     init_rnn_orthogonal: False     # use orthogonal initialization for recurrent weights (default: False)

configs/transformer_reverse.yaml

Lines changed: 2 additions & 2 deletions
@@ -59,9 +59,9 @@ model:
     initializer: "xavier_uniform"        # initializer for all trainable weights (xavier_uniform, xavier_normal, zeros, normal, uniform)
     init_gain: 1.0                       # gain for Xavier initializer (default: 1.0)
     bias_initializer: "zeros"            # initializer for bias terms (xavier_uniform, xavier_normal, zeros, normal, uniform)
-    embed_initializer: "xavier_uniform"  # initializer for embeddings (xavier_uniform, xavier_normal, zeros, normal, uniform)
+    embed_initializer: "xavier_uniform"  # initializer for embeddings (xavier_uniform, xavier_normal, zeros, normal, uniform)
     embed_init_gain: 1.0                 # gain for Xavier initializer for embeddings (default: 1.0)
-    tied_embeddings: True                # tie src and trg embeddings, only applicable if vocabularies are the same, default: False
+    tied_embeddings: True                # tie src and trg embeddings, only applicable if vocabularies are the same, default: False
     tied_softmax: True
     encoder:
         type: "transformer"

configs/transformer_small.yaml

Lines changed: 4 additions & 0 deletions
@@ -114,7 +114,11 @@ model: # specify your model architecture here
     initializer: "xavier_uniform"        # initializer for all trainable weights (xavier_uniform, xavier_normal, zeros, normal, uniform)
     init_gain: 1.0                       # gain for Xavier initializer (default: 1.0)
     bias_initializer: "zeros"            # initializer for bias terms (xavier_uniform, xavier_normal, zeros, normal, uniform)
+<<<<<<< HEAD
     embed_initializer: "xavier_uniform"  # initializer for embeddings (xavier_uniform, xavier_normal, zeros, normal, uniform)
+=======
+    embed_initializer: "xavier_uniform"  # initializer for embeddings (xavier_uniform, xavier_normal, zeros, normal, uniform)
+>>>>>>> 4a132900d3ae55d5df9bae11196ae32a5014efd1
     embed_init_gain: 1.0                 # gain for Xavier initializer for embeddings (default: 1.0)
     tied_embeddings: False               # tie src and trg embeddings, only applicable if vocabularies are the same, default: False
     tied_softmax: True

configs/wmt17_ende_bpe.yaml

Lines changed: 4 additions & 4 deletions
@@ -17,11 +17,11 @@ data:
         normalize: True
         level: "bpe"
         voc_min_freq: 1
-        voc_file: "data/subwords/wmt17_bpe.vocab"
+        voc_file: "subwords/wmt17_bpe.vocab"
         tokenizer_type: "subword-nmt"
         tokenizer_cfg:
             num_merges: 32000
-            codes: "data/subwords/wmt17_bpe.codes"
+            codes: "subwords/wmt17_bpe.codes"
             dropout: 0.1
             pretokenizer: "moses"
     trg:
@@ -32,11 +32,11 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "data/subwords/wmt17_bpe.vocab"
+        voc_file: "subwords/wmt17_bpe.vocab"
         tokenizer_type: "subword-nmt"
         tokenizer_cfg:
             num_merges: 32000
-            codes: "data/subwords/wmt17_bpe.codes"
+            codes: "subwords/wmt17_bpe.codes"
             dropout: 0.1
             pretokenizer: "moses"
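Source and target share one subword-nmt BPE code and vocabulary file here, and `dropout: 0.1` enables BPE-dropout, which randomly drops merge operations at training time as a regularizer. A consolidated sketch of the source side after this change (paths mirror the hunks above; `lang` and `lowercase` are illustrative assumptions):

```yaml
# Illustrative source-side block with shared subword-nmt BPE;
# paths follow the diff above, the remaining values are assumptions.
src:
    lang: "en"
    lowercase: False
    normalize: True
    level: "bpe"
    voc_limit: 32000
    voc_min_freq: 1
    voc_file: "subwords/wmt17_bpe.vocab"    # shared by src and trg
    tokenizer_type: "subword-nmt"
    tokenizer_cfg:
        num_merges: 32000
        codes: "subwords/wmt17_bpe.codes"   # shared BPE merge codes
        dropout: 0.1                        # BPE-dropout, training only
        pretokenizer: "moses"
```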

configs/wmt17_ende_sp.yaml

Lines changed: 4 additions & 4 deletions
@@ -18,10 +18,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "data/spm/wmt17_sp.vocab"
+        voc_file: "subwords/wmt17_sp.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "data/spm/wmt17_sp.model"
+            model_file: "subwords/wmt17_sp.model"
             model_type: "unigram"
             character_coverage: 1.0
             nbest_size: 10
@@ -35,10 +35,10 @@ data:
         level: "bpe"
         voc_limit: 32000
         voc_min_freq: 1
-        voc_file: "data/spm/wmt17_sp.vocab"
+        voc_file: "subwords/wmt17_sp.vocab"
         tokenizer_type: "sentencepiece"
         tokenizer_cfg:
-            model_file: "data/spm/wmt17_sp.model"
+            model_file: "subwords/wmt17_sp.model"
             model_type: "unigram"
             character_coverage: 1.0
             nbest_size: 10
