Can Active Memory Replace Attention? #2

YeonwooSung opened this issue Aug 25, 2020 · 0 comments
Abstract

  • Yes for the case of soft attention, with somewhat mixed results across tasks.
  • Active memory operates on all of the memory in parallel in a uniform way, bringing improvements on algorithmic tasks, image processing, and generative modeling.
  • Does active memory perform well in machine translation? [YES]

Details

Attention

  • Only a small part of the memory changes at every step, or the memory remains constant.
  • An important limitation of the attention mechanism is that, due to the nature of softmax, it tends to focus on only a single element of the memory at each step (see the sketch below).
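A minimal PyTorch sketch of dot-product soft attention (names and shapes are illustrative, not from the paper): with peaked scores, the softmax read concentrates almost all of its weight on a single memory slot.

```python
# Minimal sketch of soft (dot-product) attention over a memory of n items.
# Names (query, memory) and sizes are illustrative, not from the paper.
import torch

def soft_attention(query, memory):
    # query: (d,), memory: (n, d)
    scores = memory @ query                  # (n,) similarity per memory slot
    weights = torch.softmax(scores, dim=0)   # softmax tends to put most mass on one slot
    return weights @ memory                  # weighted read: a single summary vector

memory = torch.randn(6, 8)
query = memory[2] * 4.0                      # a query aligned with slot 2, scaled up
out = soft_attention(query, memory)
# With peaked scores the weights approach a one-hot vector, i.e. the read
# effectively selects a single memory element per step.
```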

Active Memory

  • Any model where every part of the memory undergoes an active change at every step (sketched below).
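For contrast, a hedged sketch of one active-memory step, assuming a 1-D memory tensor and a convolutional update (the layer choice and shapes are illustrative): every cell is transformed in parallel by the same operator, rather than a single attended slot being read.

```python
# Sketch of an "active memory" step: every memory cell is transformed in
# parallel by the same learned operator (here a 1-D convolution), instead of
# reading one attended slot. Shapes and layer choice are illustrative.
import torch
import torch.nn as nn

d, n = 8, 6                                   # feature depth, number of memory cells
memory = torch.randn(1, d, n)                 # (batch, channels, length)
update = nn.Conv1d(d, d, kernel_size=3, padding=1)

new_memory = torch.tanh(update(memory))       # all n cells change at every step
assert new_memory.shape == memory.shape
```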

NMT with Neural GPU

  • parallel encoding and decoding
  • BLEU < 5
  • conditional dependence between outputs is not modeled (see the sketch below)
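A rough sketch of why this hurts: after the parallel processing, each target position is read out independently from its own memory column, so previously emitted tokens never influence later ones (the readout layer and all shapes below are hypothetical).

```python
# Sketch of fully parallel decoding with no output dependencies: each target
# position is predicted independently from its own memory column.
import torch
import torch.nn as nn

depth, length, vocab = 16, 10, 100
final_memory = torch.randn(1, depth, length)     # state after parallel processing
readout = nn.Linear(depth, vocab)                # hypothetical per-position readout

logits = readout(final_memory.squeeze(0).t())    # (length, vocab), one row per position
tokens = logits.argmax(dim=-1)                   # all positions decoded independently
```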

NMT with Markovian Neural GPU

  • parallel encoding and 1-step conditioned decoding (sketched below)
  • BLEU < 5
  • Perhaps the Markovian dependence of the outputs is too weak for this problem; a full recurrent dependence of the state is needed for good performance
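A hedged sketch of the 1-step (Markovian) conditioning, with hypothetical names and shapes: each output position sees only the embedding of the immediately preceding token, not the full decoded prefix.

```python
# Sketch of Markovian decoding: token t is conditioned on memory column t and
# the previous token only (1-step dependence). Names and shapes are illustrative.
import torch
import torch.nn as nn

depth, length, vocab = 16, 10, 100
final_memory = torch.randn(1, depth, length)
embed = nn.Embedding(vocab, depth)
readout = nn.Linear(2 * depth, vocab)            # memory column + previous-token embedding

prev = torch.zeros(depth)                        # embedding of a start symbol
tokens = []
with torch.no_grad():
    for t in range(length):
        column = final_memory[0, :, t]
        logits = readout(torch.cat([column, prev]))
        tok = logits.argmax()
        prev = embed(tok)                        # only the last token is remembered
        tokens.append(tok.item())
```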

NMT with Extended Neural GPU

  • parallel encoding and sequential decoding
  • BLEU = 29.6 (WMT 14 En-Fr)
  • the active-memory decoder state (d) carries the recurrent state of decoding, while the output tape tensor (p) holds the embedded outputs decoded so far; both feed into CGRU^d (see the sketch below)
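A structural sketch of the decoding loop, not the paper's exact implementation (the plain convolution standing in for CGRU^d, the readout layer, and all shapes are simplifications): the state d is updated everywhere in parallel conditioned on the tape p, one column is read out per step, and the chosen symbol's embedding is written back onto the tape.

```python
# Structural sketch (not the exact paper implementation) of the Extended
# Neural GPU decoder: a recurrent active-memory state `d` is updated by a
# tape-conditioned gated convolution, one output column is read per step,
# and the embedding of the chosen symbol is written back into the tape `p`.
import torch
import torch.nn as nn

depth, length, vocab = 16, 10, 100
d = torch.zeros(1, depth, length)                     # decoder active-memory state
p = torch.zeros(1, depth, length)                     # output tape (past decoded symbols)
cgru_d = nn.Conv1d(2 * depth, depth, 3, padding=1)    # stand-in for CGRU^d (state + tape)
readout = nn.Linear(depth, vocab)                     # maps a memory column to logits
embed = nn.Embedding(vocab, depth)

outputs = []
with torch.no_grad():
    for t in range(length):
        d = torch.tanh(cgru_d(torch.cat([d, p], dim=1)))  # update all cells, conditioned on tape
        logits = readout(d[0, :, t])                      # read column t -> logits for step t
        symbol = logits.argmax()                          # greedy choice of output token
        p[0, :, t] = embed(symbol)                        # write the decoded symbol onto the tape
        outputs.append(symbol.item())
```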

CGRU

  • a GRU-style gated (recurrent) update in which the linear transformations are replaced by convolutions over the memory tensor (sketched below)
  • stacking CGRUs expands the receptive field of the convolution
  • the output tape tensor acts as an external memory of the decoded outputs
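A minimal sketch of a CGRU step, assuming a 1-D memory for brevity (the paper operates on a multi-dimensional memory tensor with 2-D convolutions, and kernel sizes here are illustrative): a GRU-style gated update whose transformations are convolutions applied to every memory cell in parallel.

```python
# Sketch of a CGRU (convolutional GRU) step, following the gated update
# s' = u * s + (1 - u) * tanh(U conv (r * s) + B), where the transformations
# are convolutions over the whole memory tensor. 1-D memory used for brevity.
import torch
import torch.nn as nn

class CGRU(nn.Module):
    def __init__(self, depth, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv1d(depth, depth, kernel_size, padding=pad)    # gate u
        self.reset = nn.Conv1d(depth, depth, kernel_size, padding=pad)     # gate r
        self.candidate = nn.Conv1d(depth, depth, kernel_size, padding=pad)

    def forward(self, s):                      # s: (batch, depth, length) memory tensor
        u = torch.sigmoid(self.update(s))      # update gate, computed for every cell
        r = torch.sigmoid(self.reset(s))       # reset gate
        c = torch.tanh(self.candidate(r * s))  # candidate state from the reset memory
        return u * s + (1 - u) * c             # every memory cell changes in parallel

s = torch.randn(1, 16, 10)
layer = CGRU(16)
print(layer(s).shape)                          # torch.Size([1, 16, 10])
```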

Personal Thoughts

  • Same architecture, but encoder and decoder hidden states may be doing different things
    • encoder: embeds semantics locally
    • decoder: tracks how much it has decoded, using the tape tensor to hold what has been decoded so far
  • Will it work for languages with a different word order?
  • What part of the translation problem can we treat as convolutional?
  • Is "Transformer" a combination of attention and active memory?

Link: https://arxiv.org/pdf/1610.08613.pdf
Authors: Łukasz Kaiser and Samy Bengio (Google Brain), 2016
