Update simultaneous translation docs (facebookresearch#1767)
Summary: Pull Request resolved: fairinternal/fairseq-py#1767

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

This pull request contains:
- An example of an English-to-Japanese (EN→JA) simultaneous text-to-text model
- A reorganization of the simultaneous translation docs
- Removal of out-of-date files

Reviewed By: jmp84

Differential Revision: D27467907

Pulled By: xutaima

fbshipit-source-id: 137165b007cf5301bdc51a0a277ba91cbf733092
xutaima authored and facebook-github-bot committed Apr 1, 2021
1 parent 579a48f commit 14807a3
Showing 20 changed files with 454 additions and 1,324 deletions.
111 changes: 5 additions & 106 deletions examples/simultaneous_translation/README.md
@@ -1,106 +1,5 @@
# Simultaneous Machine Translation

This directory contains the code for the paper [Monotonic Multihead Attention](https://openreview.net/forum?id=Hyg96gBKPS)

## Prepare Data

[Please follow the instructions to download and preprocess the WMT'15 En-De dataset.](https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/translation#prepare-wmt14en2desh)

## Training

- MMA-IL

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type infinite_lookback \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-avg 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- MMA-H

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type hard_aligned \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-var 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- wait-k

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type wait-k \
--waitk-lagging 3 \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```


## Evaluation

More details on evaluation can be found [here](https://github.com/pytorch/fairseq/blob/simulastsharedtask/examples/simultaneous_translation/docs/evaluation.md)

### Start the server

```shell
python ./eval/server.py \
--src-file $SRC_FILE \
--ref-file $TGT_FILE
```

### Run the client

```shell
python ./evaluate.py \
--data-bin data-bin/wmt15_en_de_32k \
--model-path ./checkpoints/checkpoint_best.pt \
--scores --output $RESULT_DIR
```

### Run evaluation locally without server

```shell
python ./eval/evaluate.py \
--local \
--src-file $SRC_FILE \
--tgt-file $TGT_FILE \
--data-bin data-bin/wmt15_en_de_32k \
--model-path ./checkpoints/checkpoint_best.pt \
--scores --output $RESULT_DIR
```
# Simultaneous Translation
Examples of simultaneous translation in fairseq
- [English-to-Japanese text-to-text wait-k model](docs/enja-waitk.md)
- [English-to-German text-to-text monotonic multihead attention model](docs/ende-mma.md)
- [English-to-German speech-to-text simultaneous translation model](../speech_to_text/docs/simulst_mustc_example.md)
74 changes: 74 additions & 0 deletions examples/simultaneous_translation/docs/ende-mma.md
@@ -0,0 +1,74 @@
# Simultaneous Machine Translation

This directory contains the code for the paper [Monotonic Multihead Attention](https://openreview.net/forum?id=Hyg96gBKPS)

## Prepare Data

[Please follow the instructions to download and preprocess the WMT'15 En-De dataset.](https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/translation#prepare-wmt14en2desh)

Another example of training an English-to-Japanese model can be found [here](enja-waitk.md).

## Training

- MMA-IL

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type infinite_lookback \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-avg 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- MMA-H

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type hard_aligned \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-var 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- wait-k

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type wait-k \
--waitk-lagging 3 \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```
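
For context, the `latency_augmented_label_smoothed_cross_entropy` criterion used in the MMA-IL and MMA-H commands above adds latency regularization to the label-smoothed cross-entropy. Roughly, as a sketch following the Monotonic Multihead Attention paper (where `C` denotes a differentiable latency measure computed from the heads' expected delays, and the weights correspond to `--latency-weight-avg` and `--latency-weight-var`):

```math
\mathcal{L}(\theta) = \mathcal{L}_{\text{label-smoothed CE}}(\theta)
    + \lambda_{\text{avg}} \, \overline{C}
    + \lambda_{\text{var}} \, \operatorname{Var}(C)
```

Here the average term encourages low overall latency, while the variance term discourages the attention heads from diverging too far from one another.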
106 changes: 106 additions & 0 deletions examples/simultaneous_translation/docs/enja-waitk.md
@@ -0,0 +1,106 @@
# An Example of an English-to-Japanese Simultaneous Translation System

This is an example of training and evaluating a transformer *wait-k* English to Japanese simultaneous text-to-text translation model.
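
For intuition, the wait-k policy reads `k` source tokens before writing each target token, then alternates single reads and writes until the source is exhausted. The sketch below illustrates this schedule only; it is not fairseq's actual agent (that lives under `examples/simultaneous_translation/agents/`), and `generate_token` is a hypothetical stand-in for the model's next-token prediction.

```python
def waitk_decode(source_stream, k, generate_token, eos="</s>"):
    """Illustrative wait-k schedule: read until the source prefix leads the
    target by k tokens, write one token, repeat; keep writing after the
    source ends (tail decoding)."""
    read, written = [], []
    source_done = False
    while True:
        # READ while fewer than len(written) + k source tokens are available.
        while not source_done and len(read) < len(written) + k:
            try:
                read.append(next(source_stream))
            except StopIteration:
                source_done = True
        # WRITE one target token conditioned on the source prefix read so far.
        token = generate_token(read, written)
        if token == eos:
            return written
        written.append(token)
```

With `k=10`, as trained below, the decoder stays roughly ten source tokens behind the input stream.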

## Data Preparation
This section describes the data preparation for training and evaluation.
If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation).

For illustration, we use only the following subsets of the data available from the [WMT20 news translation task](http://www.statmt.org/wmt20/translation-task.html), which results in 7,815,391 sentence pairs.
- News Commentary v16
- Wiki Titles v3
- WikiMatrix V1
- Japanese-English Subtitle Corpus
- The Kyoto Free Translation Task Corpus

We use the WMT20 development data as the development set. Training a `transformer_vaswani_wmt_en_de_big` model on this amount of data yields 17.3 BLEU with greedy search and 19.7 with beam search (beam size 10). Note that better performance can be achieved with the full WMT training data.

We use the [sentencepiece](https://github.com/google/sentencepiece) toolkit to tokenize the data with a vocabulary size of 32000.
Additionally, we filter out sentences longer than 200 words after tokenization (a short sketch of this step follows the command below).
Assuming the tokenized text data is saved at `${DATA_DIR}`,
we binarize the data with the following command.

```bash
fairseq-preprocess \
--source-lang en --target-lang ja \
--trainpref ${DATA_DIR}/train \
--validpref ${DATA_DIR}/dev \
--testpref ${DATA_DIR}/test \
--destdir ${WMT20_ENJA_DATA_BIN} \
--nwordstgt 32000 --nwordssrc 32000 \
--workers 20
```
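
For reference, the tokenization and length filtering described above could look roughly like the sketch below. The sentencepiece model path, the file handling, and the interpretation of the 200-length limit as sentencepiece pieces are assumptions for illustration, not part of the official pipeline.

```python
import sentencepiece as spm

# Hypothetical path to a sentencepiece model trained with a 32k vocabulary.
sp = spm.SentencePieceProcessor(model_file="spm.enja.model")

def tokenize_and_filter(src_lines, tgt_lines, max_len=200):
    """Yield sentencepiece-tokenized pairs, dropping any pair where either
    side exceeds max_len pieces after tokenization."""
    for src, tgt in zip(src_lines, tgt_lines):
        src_pieces = sp.encode(src, out_type=str)
        tgt_pieces = sp.encode(tgt, out_type=str)
        if len(src_pieces) <= max_len and len(tgt_pieces) <= max_len:
            yield " ".join(src_pieces), " ".join(tgt_pieces)
```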

## Simultaneous Translation Model Training
To train a wait-k (`k=10`) model:
```bash
fairseq-train ${WMT20_ENJA_DATA_BIN} \
--save-dir ${SAVE_DIR} \
--simul-type waitk \
--waitk-lagging 10 \
--max-epoch 70 \
--arch transformer_monotonic_vaswani_wmt_en_de_big \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--lr 0.0005 \
--stop-min-lr 1e-09 \
--clip-norm 10.0 \
--dropout 0.3 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 3584
```
This command is for training on 8 GPUs. Equivalently, the model can be trained on one GPU with `--update-freq 8`.

## Inference & Evaluation
First of all, install [SimulEval](https://github.com/facebookresearch/SimulEval) for evaluation.

```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
```

The following command runs the evaluation.
It assumes the source and target files are `${SRC_FILE}` and `${TGT_FILE}`, and that the sentencepiece model for English is saved at `${SRC_SPM_PATH}`.


```bash
simuleval \
--source ${SRC_FILE} \
--target ${TGT_FILE} \
--data-bin ${WMT20_ENJA_DATA_BIN} \
--sacrebleu-tokenizer ja-mecab \
--eval-latency-unit char \
--no-space \
--src-splitter-type sentencepiecemodel \
--src-splitter-path ${SRC_SPM_PATH} \
--agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
--model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
--output ${OUTPUT} \
--scores
```

The `--data-bin` should be the same as in the previous sections if you prepared the data from scratch.
If you only want to run evaluation, a prepared data directory can be found [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz) and a pretrained checkpoint (wait-k=10 model) can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt).

The output should look like this:
```bash
{
"Quality": {
"BLEU": 11.442253287568398
},
"Latency": {
"AL": 8.6587861866951,
"AP": 0.7863304776251316,
"DAL": 9.477850951194764
}
}
```
Latency is measured in characters on the target side (`--eval-latency-unit char`). Quality (BLEU) is computed with `sacrebleu` using the `MeCab` tokenizer (`--sacrebleu-tokenizer ja-mecab`). `--no-space` indicates that no space is added when merging the predicted words.
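
For reference, the Average Lagging (AL) value reported above follows the definition from the wait-k (STACL) paper by Ma et al. (2019). SimulEval computes it for you; the sketch below is only meant to illustrate what the number measures.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging: delays[t-1] is how much of the source (in source
    units) had been read before emitting target unit t; src_len and tgt_len
    are the source and target lengths in their respective units."""
    gamma = tgt_len / src_len
    # tau: index of the first target unit emitted after the full source is read.
    tau = next((t for t, d in enumerate(delays, 1) if d >= src_len), tgt_len)
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Example: an ideal wait-3 system on a length-6 pair lags by 3 units.
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))  # 3.0
```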

If the `--output ${OUTPUT}` option is used, detailed logs and scores will be stored under the `${OUTPUT}` directory.
4 changes: 0 additions & 4 deletions examples/simultaneous_translation/eval/__init__.py

This file was deleted.

24 changes: 0 additions & 24 deletions examples/simultaneous_translation/eval/agents/__init__.py

This file was deleted.

67 changes: 0 additions & 67 deletions examples/simultaneous_translation/eval/agents/agent.py

This file was deleted.

