Can Active Memory Replace Attention? #2

YeonwooSung opened this issue Aug 25, 2020 · 0 comments
Abstract

  • Yes for the case of soft attention, with somewhat mixed results across tasks.
  • Active memory operates on all of the memory in parallel in a uniform way, bringing improvements on algorithmic tasks, image processing, and generative modeling.
  • Does active memory perform well in machine translation? [YES]

Details

Attention

  • Only a small part of the memory changes at every step, or the memory remains constant.
  • An important limitation of the attention mechanism is that, due to the nature of softmax, it tends to focus on only a single element of the memory at each step (see the sketch below).
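A minimal PyTorch sketch of dot-product soft attention (names and shapes are illustrative, not from the paper): with peaked scores, the softmax read concentrates almost all of its weight on a single memory slot.

```python
# Minimal sketch of soft (dot-product) attention over a memory of n items.
# Names (query, memory) and sizes are illustrative, not from the paper.
import torch

def soft_attention(query, memory):
    # query: (d,), memory: (n, d)
    scores = memory @ query                  # (n,) similarity per memory slot
    weights = torch.softmax(scores, dim=0)   # softmax tends to put most mass on one slot
    return weights @ memory                  # weighted read: a single summary vector

memory = torch.randn(6, 8)
query = memory[2] * 4.0                      # a query aligned with slot 2, scaled up
out = soft_attention(query, memory)
# With peaked scores the weights approach a one-hot vector, i.e. the read
# effectively selects a single memory element per step.
```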

Active Memory

  • Any model where every part of the memory undergoes an active change at every step (sketched below).
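For contrast, a hedged sketch of one active-memory step, assuming a 1-D memory tensor and a convolutional update (the layer choice and shapes are illustrative): every cell is transformed in parallel by the same operator, rather than a single attended slot being read.

```python
# Sketch of an "active memory" step: every memory cell is transformed in
# parallel by the same learned operator (here a 1-D convolution), instead of
# reading one attended slot. Shapes and layer choice are illustrative.
import torch
import torch.nn as nn

d, n = 8, 6                                   # feature depth, number of memory cells
memory = torch.randn(1, d, n)                 # (batch, channels, length)
update = nn.Conv1d(d, d, kernel_size=3, padding=1)

new_memory = torch.tanh(update(memory))       # all n cells change at every step
assert new_memory.shape == memory.shape
```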

NMT with Neural GPU

  • parallel encoding and decoding
  • BLEU < 5
  • conditional dependence between outputs is not modeled (see the sketch below)
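A rough sketch of why this hurts: after the parallel processing, each target position is read out independently from its own memory column, so previously emitted tokens never influence later ones (the readout layer and all shapes below are hypothetical).

```python
# Sketch of fully parallel decoding with no output dependencies: each target
# position is predicted independently from its own memory column.
import torch
import torch.nn as nn

depth, length, vocab = 16, 10, 100
final_memory = torch.randn(1, depth, length)     # state after parallel processing
readout = nn.Linear(depth, vocab)                # hypothetical per-position readout

logits = readout(final_memory.squeeze(0).t())    # (length, vocab), one row per position
tokens = logits.argmax(dim=-1)                   # all positions decoded independently
```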

NMT with Markovian Neural GPU

  • parallel encoding and 1-step conditioned decoding (sketched below)
  • BLEU < 5
  • Perhaps the Markovian dependence of the outputs is too weak for this problem; a full recurrent dependence of the state is needed for good performance
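A hedged sketch of the 1-step (Markovian) conditioning, with hypothetical names and shapes: each output position sees only the embedding of the immediately preceding token, not the full decoded prefix.

```python
# Sketch of Markovian decoding: token t is conditioned on memory column t and
# the previous token only (1-step dependence). Names and shapes are illustrative.
import torch
import torch.nn as nn

depth, length, vocab = 16, 10, 100
final_memory = torch.randn(1, depth, length)
embed = nn.Embedding(vocab, depth)
readout = nn.Linear(2 * depth, vocab)            # memory column + previous-token embedding

prev = torch.zeros(depth)                        # embedding of a start symbol
tokens = []
with torch.no_grad():
    for t in range(length):
        column = final_memory[0, :, t]
        logits = readout(torch.cat([column, prev]))
        tok = logits.argmax()
        prev = embed(tok)                        # only the last token is remembered
        tokens.append(tok.item())
```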

NMT with Extended Neural GPU

  • parallel encoding and sequential decoding
  • BLEU = 29.6 (WMT 14 En-Fr)
  • the active-memory decoder state (d) carries the recurrent state of decoding, while the output tape tensor (p) holds the embedded outputs decoded so far; both feed into CGRU^d (see the sketch below)
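A structural sketch of the decoding loop, not the paper's exact implementation (the plain convolution standing in for CGRU^d, the readout layer, and all shapes are simplifications): the state d is updated everywhere in parallel conditioned on the tape p, one column is read out per step, and the chosen symbol's embedding is written back onto the tape.

```python
# Structural sketch (not the exact paper implementation) of the Extended
# Neural GPU decoder: a recurrent active-memory state `d` is updated by a
# tape-conditioned gated convolution, one output column is read per step,
# and the embedding of the chosen symbol is written back into the tape `p`.
import torch
import torch.nn as nn

depth, length, vocab = 16, 10, 100
d = torch.zeros(1, depth, length)                     # decoder active-memory state
p = torch.zeros(1, depth, length)                     # output tape (past decoded symbols)
cgru_d = nn.Conv1d(2 * depth, depth, 3, padding=1)    # stand-in for CGRU^d (state + tape)
readout = nn.Linear(depth, vocab)                     # maps a memory column to logits
embed = nn.Embedding(vocab, depth)

outputs = []
with torch.no_grad():
    for t in range(length):
        d = torch.tanh(cgru_d(torch.cat([d, p], dim=1)))  # update all cells, conditioned on tape
        logits = readout(d[0, :, t])                      # read column t -> logits for step t
        symbol = logits.argmax()                          # greedy choice of output token
        p[0, :, t] = embed(symbol)                        # write the decoded symbol onto the tape
        outputs.append(symbol.item())
```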

CGRU

  • a GRU-style gated (recurrent) update in which the linear transformations are replaced by convolutions over the memory tensor (sketched below)
  • stacking CGRUs expands the receptive field of the convolution
  • the output tape tensor acts as an external memory of the decoded outputs
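A minimal sketch of a CGRU step, assuming a 1-D memory for brevity (the paper operates on a multi-dimensional memory tensor with 2-D convolutions, and kernel sizes here are illustrative): a GRU-style gated update whose transformations are convolutions applied to every memory cell in parallel.

```python
# Sketch of a CGRU (convolutional GRU) step, following the gated update
# s' = u * s + (1 - u) * tanh(U conv (r * s) + B), where the transformations
# are convolutions over the whole memory tensor. 1-D memory used for brevity.
import torch
import torch.nn as nn

class CGRU(nn.Module):
    def __init__(self, depth, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv1d(depth, depth, kernel_size, padding=pad)    # gate u
        self.reset = nn.Conv1d(depth, depth, kernel_size, padding=pad)     # gate r
        self.candidate = nn.Conv1d(depth, depth, kernel_size, padding=pad)

    def forward(self, s):                      # s: (batch, depth, length) memory tensor
        u = torch.sigmoid(self.update(s))      # update gate, computed for every cell
        r = torch.sigmoid(self.reset(s))       # reset gate
        c = torch.tanh(self.candidate(r * s))  # candidate state from the reset memory
        return u * s + (1 - u) * c             # every memory cell changes in parallel

s = torch.randn(1, 16, 10)
layer = CGRU(16)
print(layer(s).shape)                          # torch.Size([1, 16, 10])
```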

Personal Thoughts

  • Same architecture, but encoder and decoder hidden states may be doing different things
    • encoder: embeds semantics locally
    • decoder: tracks how much it has decoded, using the tape tensor to hold what has been decoded so far
  • Will it work for languages with a different word order?
  • What part of the translation problem can we treat as convolutional?
  • Is "Transformer" a combination of attention and active memory?

Link: https://arxiv.org/pdf/1610.08613.pdf
Authors: Łukasz Kaiser and Samy Bengio (Google Brain), 2016
