Update simultaneous translation docs (facebookresearch#1767)
Summary: Pull Request resolved: fairinternal/fairseq-py#1767

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

This pull request contains:
- An example of an English-to-Japanese (EN→JA) simultaneous text-to-text model
- A reorganization of the simultaneous translation docs
- Removal of out-of-date files

Reviewed By: jmp84

Differential Revision: D27467907

Pulled By: xutaima

fbshipit-source-id: 137165b007cf5301bdc51a0a277ba91cbf733092
xutaima authored and facebook-github-bot committed Apr 1, 2021
1 parent 579a48f commit 14807a3
Showing 20 changed files with 454 additions and 1,324 deletions.
111 changes: 5 additions & 106 deletions examples/simultaneous_translation/README.md
@@ -1,106 +1,5 @@
# Simultaneous Machine Translation

This directory contains the code for the paper [Monotonic Multihead Attention](https://openreview.net/forum?id=Hyg96gBKPS)

## Prepare Data

[Please follow the instructions to download and preprocess the WMT'15 En-De dataset.](https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/translation#prepare-wmt14en2desh)

## Training

- MMA-IL

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type infinite_lookback \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-avg 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- MMA-H

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type hard_aligned \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-var 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- wait-k

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type wait-k \
--waitk-lagging 3 \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```


## Evaluation

More details on evaluation can be found [here](https://github.com/pytorch/fairseq/blob/simulastsharedtask/examples/simultaneous_translation/docs/evaluation.md)

### Start the server

```shell
python ./eval/server.py \
--src-file $SRC_FILE \
--ref-file $TGT_FILE
```

### Run the client

```shell
python ./evaluate.py \
--data-bin data-bin/wmt15_en_de_32k \
--model-path ./checkpoints/checkpoint_best.pt \
--scores --output $RESULT_DIR
```

### Run evaluation locally without server

```shell
python ./eval/evaluate.py \
--local \
--src-file $SRC_FILE \
--tgt-file $TGT_FILE \
--data-bin data-bin/wmt15_en_de_32k \
--model-path ./checkpoints/checkpoint_best.pt \
--scores --output $RESULT_DIR
```
# Simultaneous Translation
Examples of simultaneous translation in fairseq
- [English-to-Japanese text-to-text wait-k model](docs/enja-waitk.md)
- [English-to-German text-to-text monotonic multihead attention model](docs/ende-mma.md)
- [English-to-German speech-to-text simultaneous translation model](../speech_to_text/docs/simulst_mustc_example.md)
74 changes: 74 additions & 0 deletions examples/simultaneous_translation/docs/ende-mma.md
@@ -0,0 +1,74 @@
# Simultaneous Machine Translation

This directory contains the code for the paper [Monotonic Multihead Attention](https://openreview.net/forum?id=Hyg96gBKPS)

## Prepare Data

[Please follow the instructions to download and preprocess the WMT'15 En-De dataset.](https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/translation#prepare-wmt14en2desh)

Another example of training an English-to-Japanese model can be found [here](enja-waitk.md).

## Training

- MMA-IL

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type infinite_lookback \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-avg 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- MMA-H

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type hard_aligned \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-var 0.1 \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```

- wait-k

```shell
fairseq-train \
data-bin/wmt15_en_de_32k \
--simul-type wait-k \
--waitk-lagging 3 \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--mass-preservation \
--criterion latency_augmented_label_smoothed_cross_entropy \
--max-update 50000 \
--arch transformer_monotonic_iwslt_de_en \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler 'inverse_sqrt' \
--warmup-init-lr 1e-7 --warmup-updates 4000 \
--lr 5e-4 --stop-min-lr 1e-9 --clip-norm 0.0 --weight-decay 0.0001 \
--dropout 0.3 \
--label-smoothing 0.1 \
--max-tokens 3584
```
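
For context, the `latency_augmented_label_smoothed_cross_entropy` criterion used in the MMA-IL and MMA-H commands above adds latency regularization to the label-smoothed cross-entropy. Roughly, as a sketch following the Monotonic Multihead Attention paper (where `C` denotes a differentiable latency measure computed from the heads' expected delays, and the weights correspond to `--latency-weight-avg` and `--latency-weight-var`):

```math
\mathcal{L}(\theta) = \mathcal{L}_{\text{label-smoothed CE}}(\theta)
    + \lambda_{\text{avg}} \, \overline{C}
    + \lambda_{\text{var}} \, \operatorname{Var}(C)
```

Here the average term encourages low overall latency, while the variance term discourages the attention heads from diverging too far from one another.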
106 changes: 106 additions & 0 deletions examples/simultaneous_translation/docs/enja-waitk.md
@@ -0,0 +1,106 @@
# An Example of an English-to-Japanese Simultaneous Translation System

This is an example of training and evaluating a transformer *wait-k* English to Japanese simultaneous text-to-text translation model.
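
For intuition, the wait-k policy reads `k` source tokens before writing each target token, then alternates single reads and writes until the source is exhausted. The sketch below illustrates this schedule only; it is not fairseq's actual agent (that lives under `examples/simultaneous_translation/agents/`), and `generate_token` is a hypothetical stand-in for the model's next-token prediction.

```python
def waitk_decode(source_stream, k, generate_token, eos="</s>"):
    """Illustrative wait-k schedule: read until the source prefix leads the
    target by k tokens, write one token, repeat; keep writing after the
    source ends (tail decoding)."""
    read, written = [], []
    source_done = False
    while True:
        # READ while fewer than len(written) + k source tokens are available.
        while not source_done and len(read) < len(written) + k:
            try:
                read.append(next(source_stream))
            except StopIteration:
                source_done = True
        # WRITE one target token conditioned on the source prefix read so far.
        token = generate_token(read, written)
        if token == eos:
            return written
        written.append(token)
```

With `k=10`, as trained below, the decoder stays roughly ten source tokens behind the input stream.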

## Data Preparation
This section describes the data preparation for training and evaluation.
If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation).

For illustration, we use only the following subsets of the data available from the [WMT20 news translation task](http://www.statmt.org/wmt20/translation-task.html), which results in 7,815,391 sentence pairs.
- News Commentary v16
- Wiki Titles v3
- WikiMatrix V1
- Japanese-English Subtitle Corpus
- The Kyoto Free Translation Task Corpus

We use the WMT20 development data as the development set. Training a `transformer_vaswani_wmt_en_de_big` model on this amount of data yields 17.3 BLEU with greedy search and 19.7 with beam search (beam size 10). Note that better performance can be achieved with the full WMT training data.

We use the [sentencepiece](https://github.com/google/sentencepiece) toolkit to tokenize the data with a vocabulary size of 32000.
Additionally, we filter out sentences longer than 200 words after tokenization (a short sketch of this step follows the command below).
Assuming the tokenized text data is saved at `${DATA_DIR}`,
we binarize the data with the following command.

```bash
fairseq-preprocess \
--source-lang en --target-lang ja \
--trainpref ${DATA_DIR}/train \
--validpref ${DATA_DIR}/dev \
--testpref ${DATA_DIR}/test \
--destdir ${WMT20_ENJA_DATA_BIN} \
--nwordstgt 32000 --nwordssrc 32000 \
--workers 20
```
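
For reference, the tokenization and length filtering described above could look roughly like the sketch below. The sentencepiece model path, the file handling, and the interpretation of the 200-length limit as sentencepiece pieces are assumptions for illustration, not part of the official pipeline.

```python
import sentencepiece as spm

# Hypothetical path to a sentencepiece model trained with a 32k vocabulary.
sp = spm.SentencePieceProcessor(model_file="spm.enja.model")

def tokenize_and_filter(src_lines, tgt_lines, max_len=200):
    """Yield sentencepiece-tokenized pairs, dropping any pair where either
    side exceeds max_len pieces after tokenization."""
    for src, tgt in zip(src_lines, tgt_lines):
        src_pieces = sp.encode(src, out_type=str)
        tgt_pieces = sp.encode(tgt, out_type=str)
        if len(src_pieces) <= max_len and len(tgt_pieces) <= max_len:
            yield " ".join(src_pieces), " ".join(tgt_pieces)
```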

## Simultaneous Translation Model Training
To train a wait-k (`k=10`) model:
```bash
fairseq-train ${WMT20_ENJA_DATA_BIN} \
--save-dir ${SAVE_DIR} \
--simul-type waitk \
--waitk-lagging 10 \
--max-epoch 70 \
--arch transformer_monotonic_vaswani_wmt_en_de_big \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--lr 0.0005 \
--stop-min-lr 1e-09 \
--clip-norm 10.0 \
--dropout 0.3 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 3584
```
This command is for training on 8 GPUs. Equivalently, the model can be trained on one GPU with `--update-freq 8`.

## Inference & Evaluation
First of all, install [SimulEval](https://github.com/facebookresearch/SimulEval) for evaluation.

```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
```

The following command runs the evaluation.
It assumes the source and target files are `${SRC_FILE}` and `${TGT_FILE}`, and that the sentencepiece model for English is saved at `${SRC_SPM_PATH}`.


```bash
simuleval \
--source ${SRC_FILE} \
--target ${TGT_FILE} \
--data-bin ${WMT20_ENJA_DATA_BIN} \
--sacrebleu-tokenizer ja-mecab \
--eval-latency-unit char \
--no-space \
--src-splitter-type sentencepiecemodel \
--src-splitter-path ${SRC_SPM_PATH} \
--agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
--model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
--output ${OUTPUT} \
--scores
```

The `--data-bin` should be the same as in the previous sections if you prepared the data from scratch.
If you only want to run evaluation, a prepared data directory can be found [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz) and a pretrained checkpoint (wait-k=10 model) can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt).

The output should look like this:
```bash
{
"Quality": {
"BLEU": 11.442253287568398
},
"Latency": {
"AL": 8.6587861866951,
"AP": 0.7863304776251316,
"DAL": 9.477850951194764
}
}
```
Latency is measured in characters on the target side (`--eval-latency-unit char`). Quality (BLEU) is computed with `sacrebleu` using the `MeCab` tokenizer (`--sacrebleu-tokenizer ja-mecab`). `--no-space` indicates that no space is added when merging the predicted words.
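
For reference, the Average Lagging (AL) value reported above follows the definition from the wait-k (STACL) paper by Ma et al. (2019). SimulEval computes it for you; the sketch below is only meant to illustrate what the number measures.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging: delays[t-1] is how much of the source (in source
    units) had been read before emitting target unit t; src_len and tgt_len
    are the source and target lengths in their respective units."""
    gamma = tgt_len / src_len
    # tau: index of the first target unit emitted after the full source is read.
    tau = next((t for t, d in enumerate(delays, 1) if d >= src_len), tgt_len)
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Example: an ideal wait-3 system on a length-6 pair lags by 3 units.
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))  # 3.0
```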

If the `--output ${OUTPUT}` option is used, detailed logs and scores will be stored under the `${OUTPUT}` directory.
4 changes: 0 additions & 4 deletions examples/simultaneous_translation/eval/__init__.py

This file was deleted.

24 changes: 0 additions & 24 deletions examples/simultaneous_translation/eval/agents/__init__.py

This file was deleted.

67 changes: 0 additions & 67 deletions examples/simultaneous_translation/eval/agents/agent.py

This file was deleted.

