Update simultaneous translation docs (facebookresearch#1767)
Summary: Pull Request resolved: fairinternal/fairseq-py#1767

This pull request contains:
- An example of an En-Ja simultaneous text-to-text model
- Reorganized simultaneous translation docs
- Removal of out-of-date files

Test Plan: Imported from GitHub, without a `Test Plan:` line.
Reviewed By: jmp84
Differential Revision: D27467907
Pulled By: xutaima
fbshipit-source-id: 137165b007cf5301bdc51a0a277ba91cbf733092
1 parent 579a48f · commit 14807a3
Showing 20 changed files with 454 additions and 1,324 deletions.
@@ -1,106 +1,5 @@

# Simultaneous Machine Translation

This directory contains the code for the paper [Monotonic Multihead Attention](https://openreview.net/forum?id=Hyg96gBKPS)

## Prepare Data

[Please follow the instructions to download and preprocess the WMT'15 En-De dataset.](https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/translation#prepare-wmt14en2desh)

## Training

- MMA-IL

```shell
fairseq-train \
    data-bin/wmt15_en_de_32k \
    --simul-type infinite_lookback \
    --user-dir $FAIRSEQ/examples/simultaneous_translation \
    --mass-preservation \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --latency-weight-avg 0.1 \
    --max-update 50000 \
    --arch transformer_monotonic_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler 'inverse_sqrt' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 \
    --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584
```

- MMA-H

```shell
fairseq-train \
    data-bin/wmt15_en_de_32k \
    --simul-type hard_aligned \
    --user-dir $FAIRSEQ/examples/simultaneous_translation \
    --mass-preservation \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --latency-weight-var 0.1 \
    --max-update 50000 \
    --arch transformer_monotonic_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler 'inverse_sqrt' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 \
    --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584
```

- wait-k

```shell
fairseq-train \
    data-bin/wmt15_en_de_32k \
    --simul-type wait-k \
    --waitk-lagging 3 \
    --user-dir $FAIRSEQ/examples/simultaneous_translation \
    --mass-preservation \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --max-update 50000 \
    --arch transformer_monotonic_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler 'inverse_sqrt' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 \
    --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584
```

## Evaluation

More details on evaluation can be found [here](https://github.com/pytorch/fairseq/blob/simulastsharedtask/examples/simultaneous_translation/docs/evaluation.md).

### Start the server

```shell
python ./eval/server.py \
    --src-file $SRC_FILE \
    --ref-file $TGT_FILE
```

### Run the client

```shell
python ./evaluate.py \
    --data-bin data-bin/wmt15_en_de_32k \
    --model-path ./checkpoints/checkpoint_best.pt \
    --scores --output $RESULT_DIR
```

### Run evaluation locally without server

```shell
python ./eval/evaluate.py \
    --local \
    --src-file $SRC_FILE \
    --tgt-file $TGT_FILE \
    --data-bin data-bin/wmt15_en_de_32k \
    --model-path ./checkpoints/checkpoint_best.pt \
    --scores --output $RESULT_DIR
```

# Simultaneous Translation

Examples of simultaneous translation in fairseq:
- [English-to-Japanese text-to-text wait-k model](docs/enja-waitk.md)
- [English-to-German text-to-text monotonic multihead attention model](docs/ende-mma.md)
- [English-to-German speech-to-text simultaneous translation model](../speech_to_text/docs/simulst_mustc_example.md)
@@ -0,0 +1,74 @@

# Simultaneous Machine Translation

This directory contains the code for the paper [Monotonic Multihead Attention](https://openreview.net/forum?id=Hyg96gBKPS)

## Prepare Data

[Please follow the instructions to download and preprocess the WMT'15 En-De dataset.](https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/translation#prepare-wmt14en2desh)

Another example of training an English-to-Japanese model can be found [here](docs/enja.md).

## Training

- MMA-IL

```shell
fairseq-train \
    data-bin/wmt15_en_de_32k \
    --simul-type infinite_lookback \
    --user-dir $FAIRSEQ/examples/simultaneous_translation \
    --mass-preservation \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --latency-weight-avg 0.1 \
    --max-update 50000 \
    --arch transformer_monotonic_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler 'inverse_sqrt' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 \
    --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584
```

- MMA-H

```shell
fairseq-train \
    data-bin/wmt15_en_de_32k \
    --simul-type hard_aligned \
    --user-dir $FAIRSEQ/examples/simultaneous_translation \
    --mass-preservation \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --latency-weight-var 0.1 \
    --max-update 50000 \
    --arch transformer_monotonic_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler 'inverse_sqrt' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 \
    --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584
```

- wait-k

```shell
fairseq-train \
    data-bin/wmt15_en_de_32k \
    --simul-type wait-k \
    --waitk-lagging 3 \
    --user-dir $FAIRSEQ/examples/simultaneous_translation \
    --mass-preservation \
    --criterion latency_augmented_label_smoothed_cross_entropy \
    --max-update 50000 \
    --arch transformer_monotonic_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler 'inverse_sqrt' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 \
    --lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
    --dropout 0.3 \
    --label-smoothing 0.1 \
    --max-tokens 3584
```
@@ -0,0 +1,106 @@

# An Example of an English-to-Japanese Simultaneous Translation System

This is an example of training and evaluating a transformer *wait-k* English-to-Japanese simultaneous text-to-text translation model.

## Data Preparation

This section introduces the data preparation for training and evaluation.
If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation).

For illustration, we use only the following subsets of the data available from the [WMT20 news translation task](http://www.statmt.org/wmt20/translation-task.html), which results in 7,815,391 sentence pairs:
- News Commentary v16
- Wiki Titles v3
- WikiMatrix V1
- Japanese-English Subtitle Corpus
- The Kyoto Free Translation Task Corpus

We use the WMT20 development data as the development set. Training a `transformer_vaswani_wmt_en_de_big` model on this amount of data yields 17.3 BLEU with greedy search and 19.7 with beam search (beam size 10); see the offline decoding sketch below. Note that better performance can be achieved with the full WMT training data.
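
Those offline numbers come from standard full-sentence decoding; the following is a minimal sketch, assuming the binarized data `${WMT20_ENJA_DATA_BIN}` prepared in the next subsection and a trained checkpoint at `${SAVE_DIR}/checkpoint_best.pt` (the checkpoint name, `--gen-subset`, and batch size here are illustrative assumptions):

```bash
# Offline (full-sentence) decoding: --beam 1 gives greedy search,
# --beam 10 the beam-10 result. --remove-bpe sentencepiece strips the
# sentencepiece segmentation before fairseq computes its internal BLEU.
fairseq-generate ${WMT20_ENJA_DATA_BIN} \
    --path ${SAVE_DIR}/checkpoint_best.pt \
    --gen-subset test \
    --beam 10 \
    --remove-bpe sentencepiece \
    --max-tokens 8000
```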

We use the [sentencepiece](https://github.com/google/sentencepiece) toolkit to tokenize the data with a vocabulary size of 32000,
and filter out sentences longer than 200 words after tokenization; a minimal sketch of the tokenization step follows.
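
This sketch assumes the raw parallel text lives at `${RAW_DIR}/{train,dev,test}.{en,ja}`; the `${RAW_DIR}` path and the per-language models are assumptions for illustration, not part of the original recipe:

```bash
# Train one sentencepiece model per language with a 32000-token vocabulary,
# then encode every split into ${DATA_DIR}. The English model (spm_en.model)
# can later serve as ${SRC_SPM_PATH} during evaluation.
for LANG in en ja; do
    spm_train \
        --input=${RAW_DIR}/train.${LANG} \
        --model_prefix=spm_${LANG} \
        --vocab_size=32000
    for SPLIT in train dev test; do
        spm_encode \
            --model=spm_${LANG}.model \
            --output_format=piece \
            < ${RAW_DIR}/${SPLIT}.${LANG} \
            > ${DATA_DIR}/${SPLIT}.${LANG}
    done
done
```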

Assuming the tokenized text data is saved at `${DATA_DIR}`,
we prepare the binarized data with the following command:

```bash
fairseq-preprocess \
    --source-lang en --target-lang ja \
    --trainpref ${DATA_DIR}/train \
    --validpref ${DATA_DIR}/dev \
    --testpref ${DATA_DIR}/test \
    --destdir ${WMT20_ENJA_DATA_BIN} \
    --nwordstgt 32000 --nwordssrc 32000 \
    --workers 20
```

## Simultaneous Translation Model Training

To train a wait-k (k=10) model:

```bash
fairseq-train ${WMT20_ENJA_DATA_BIN} \
    --save-dir ${SAVE_DIR} \
    --simul-type waitk \
    --waitk-lagging 10 \
    --max-epoch 70 \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --clip-norm 10.0 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584
```
This command trains on 8 GPUs. Equivalently, the model can be trained on a single GPU with `--update-freq 8`, as in the sketch below.
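
A single-GPU variant might look like the following sketch; `--update-freq 8` accumulates gradients over 8 batches per optimizer step, matching the effective batch size (8 x `--max-tokens`) of the 8-GPU run, and `CUDA_VISIBLE_DEVICES=0` merely pins the device:

```bash
# Single-GPU training sketch: identical hyperparameters, with gradient
# accumulation standing in for data parallelism.
CUDA_VISIBLE_DEVICES=0 fairseq-train ${WMT20_ENJA_DATA_BIN} \
    --save-dir ${SAVE_DIR} \
    --simul-type waitk --waitk-lagging 10 \
    --max-epoch 70 \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --stop-min-lr 1e-09 --clip-norm 10.0 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 8
```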

## Inference & Evaluation

First, install [SimulEval](https://github.com/facebookresearch/SimulEval) for evaluation.

```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
```

The following command runs the evaluation.
Assume the source and reference files are `${SRC_FILE}` and `${TGT_FILE}`,
and the sentencepiece model file for English is saved at `${SRC_SPM_PATH}`:

```bash
simuleval \
    --source ${SRC_FILE} \
    --target ${TGT_FILE} \
    --data-bin ${WMT20_ENJA_DATA_BIN} \
    --sacrebleu-tokenizer ja-mecab \
    --eval-latency-unit char \
    --no-space \
    --src-splitter-type sentencepiecemodel \
    --src-splitter-path ${SRC_SPM_PATH} \
    --agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
    --model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores
```

The `--data-bin` should be the same as in the previous sections if you prepared the data from scratch.
For evaluation only, a prepared data directory can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz), and a pretrained wait-k=10 checkpoint [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt); see the download sketch below.
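
A minimal download sketch, assuming the archive unpacks into a directory that can serve as `${WMT20_ENJA_DATA_BIN}` and that the checkpoint goes under `${SAVE_DIR}` (both placements are assumptions):

```bash
# Fetch the prepared binarized data and the pretrained wait-k=10 checkpoint.
wget https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz
tar -xzf wmt20_enja_medium_databin.tgz
mkdir -p ${SAVE_DIR}
wget -P ${SAVE_DIR} https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt
```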

The output should look like this:

```bash
{
    "Quality": {
        "BLEU": 11.442253287568398
    },
    "Latency": {
        "AL": 8.6587861866951,
        "AP": 0.7863304776251316,
        "DAL": 9.477850951194764
    }
}
```
Latency is measured in characters on the target side (`--eval-latency-unit char`). Quality is evaluated with `sacrebleu` using the `MeCab` tokenizer (`--sacrebleu-tokenizer ja-mecab`). `--no-space` indicates that no space is added when merging the predicted words.

If the `--output ${OUTPUT}` option is used, the detailed log and the scores will be stored under the `${OUTPUT}` directory.
Three further files in this commit were deleted (removal of out-of-date files); their contents are not shown.