Tags: #paper-review #machine-learning #nlp
"Attention is All You Need" by Vaswani et al (2017) introduced the [[Transformer Architecture|Transformer architecture]] for NLP tasks, replacing recurrent network architectures (e.g. RNNs, LSTMs, and GRUs).
Reports [[Bilingual Evaluation Understudy (BLEU) Metric|BLEU]] of 28.4 for [[WMT 2014 Language Translation Dataset|WMT 2014]] English-to-German translation (4.5M sentences), and 41.8 for WMT 2014 English-to-French translation (36M sentences). Scores are from supervised training since the source and target language sentences are available.
F1 score of 92.7 on English constituency parsing (WSJ, semi-supervised setting).
Self-attention provides several key benefits:
- Removes distance limitations on positions' influence. The first symbol can influence the last symbol (and all in between) regardless of sequence length.
- Attention heads reduce to matrix multiplications and element-wise operations, which execute very efficiently (see the sketch after this list).
- All symbols in the sequence are processed simultaneously.
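A minimal NumPy sketch of scaled dot-product attention, $\text{softmax}(QK^T/\sqrt{d_k})V$; names and shapes are illustrative, not the Tensor2Tensor implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len_q, seq_len_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # (seq_len_q, d_v)

# Every query position attends to every key position in one matrix product,
# so the first symbol can influence the last regardless of sequence length.
q = k = v = np.random.randn(10, 64)
out = scaled_dot_product_attention(q, k, v)          # (10, 64)
```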
Symbol position is encoded with a sinusoidal embedding that is added to the input embedding. A learned positional embedding performed about the same, so the authors kept the sinusoidal version (it may extrapolate to longer sequences).
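A short sketch of the sinusoidal encoding, $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$; array sizes here are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]                      # (max_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# The encoding is simply added to the token embeddings before the first layer.
embeddings = np.random.randn(10, 512)
inputs = embeddings + sinusoidal_positional_encoding(10, 512)
```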
The reference implementation is Tensor2Tensor in TensorFlow. It integrates a lot of features and is not easy to follow; bin/t2t_trainer.py is a good starting point.
Baseline model is $\approx 65$M parameters and has 6 layers, each with 8 self-attention heads and an embedding size of 512. BLEU scores of 27.3 (EN-DE) and 38.1 (EN-FR). Trains 100,000 steps in 12 hours (0.4 s/step) with batches of roughly 25,000 source and 25,000 target tokens.
The large ("big") model is $\approx 213$M parameters, with $d_{model} = 1024$, 16 attention heads, and $d_{ff} = 4096$; it trains for 300,000 steps (1.0 s/step).
Training was done on 8x NVIDIA P100 GPUs: 12 hours for the baseline model, 3.5 days for the big model. Even the big model trained roughly 4x faster than the SOTA models of the time.
Model averaging was performed over the last N checkpoints, with checkpoints saved every 10 minutes. The baseline model averaged the last 5 checkpoints ($\approx$7,500 steps) and the big model averaged the last 20 checkpoints ($\approx$12,000 steps).
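A minimal sketch of the averaging step, assuming checkpoints are available as name $\to$ array dicts (a simplification; real checkpoints live in framework-specific files):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of model parameters over the last N checkpoints."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = sum(ckpt[name] for ckpt in checkpoints) / n
    return averaged

# e.g. base model: average the last 5 checkpoints saved at 10-minute intervals
last_five = [{"w": np.random.randn(4, 4)} for _ in range(5)]
avg_params = average_checkpoints(last_five)
```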
[[Label Smoothing|Label smoothing]] was used with $\epsilon_{ls} = 0.1$. This hurts perplexity, since the model learns to be less certain, but improves accuracy and BLEU.
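A sketch of one common label-smoothing formulation with $\epsilon_{ls} = 0.1$ (vocabulary size and names are illustrative):

```python
import numpy as np

def smoothed_targets(labels, vocab_size, eps=0.1):
    """Replace one-hot targets with (1 - eps) on the true token and
    eps / (vocab_size - 1) spread over every other token."""
    targets = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Cross-entropy against the smoothed distribution discourages the model from
# becoming over-confident: perplexity gets worse, but accuracy/BLEU improve.
labels = np.array([2, 5, 1])
t = smoothed_targets(labels, vocab_size=8)
```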
Dropout was used on residual connections with $P_{drop} = 0.1$ for the base model (0.3 for the big EN-DE model). It is applied to each sub-layer's output before the residual add and layer norm, and also to the sums of the embeddings and positional encodings.
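A sketch of where the dropout sits in a sub-layer, with a hypothetical stand-in for the sub-layer itself:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer, p_drop=0.1, training=True):
    """Dropout hits the sub-layer output *before* the residual add and
    normalization: LayerNorm(x + Dropout(Sublayer(x)))."""
    y = sublayer(x)
    if training:
        mask = (np.random.rand(*y.shape) >= p_drop) / (1.0 - p_drop)
        y = y * mask
    return layer_norm(x + y)

# e.g. wrapping a (hypothetical) stand-in sub-layer
x = np.random.randn(10, 512)
out = residual_sublayer(x, sublayer=lambda h: np.maximum(0, h))
```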
Learning rate schedule has a linear warmup over the first 4,000 steps and then an inverse-square-root decay: $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$.
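The schedule as a small helper (parameter names are illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for `warmup_steps`, then decay proportional to the
    inverse square root of the step number."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# peak learning rate occurs at step == warmup_steps
rates = [transformer_lr(s) for s in (1, 1000, 4000, 100_000)]
```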
The baseline architecture for the ablations is the base model above (6 layers, 8 heads, $d_{model} = 512$, $d_{ff} = 2048$, $P_{drop} = 0.1$, $\epsilon_{ls} = 0.1$), evaluated on the EN-DE dev set (newstest2013).
Too few layers hurt the most (a 2-layer model lost about 2 BLEU on the dev set).
Varying the number of attention heads mattered less: a single head was about 0.9 BLEU worse than 8 heads, and too many heads also degraded quality slightly.
Making the embedding size ($d_{model}$) larger improved performance.