README.md (9 additions, 2 deletions)
@@ -6,7 +6,7 @@ Implementation of <a href="https://arxiv.org/abs/2203.07852">Block Recurrent Tra
 This design is SOTA for the recurrent transformers line of research, afaict.
 
-It will also include <a href="https://arxiv.org/abs/2205.14135">flash attention</a> as well as <a href="https://arxiv.org/abs/2203.08913">KNN attention layers</a>
+It will also include <a href="https://arxiv.org/abs/2205.14135">flash attention</a> as well as routed memories of up to 250k tokens using ideas from <a href="https://github.com/lucidrains/CoLT5-attention">this paper</a>

@@ -73,4 +73,4 @@
-- [ ] add ability to gate in memorizing transformers knn attention layers
+- [ ] try routing long distance memories of up to 250k using coordinate descent (Wright et al.)
 
 ## Citations
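As context for the routing change in the hunk above: CoLT5-style routing scores every memory token and forwards only (a soft approximation of) the top-k of them, with the soft top-k obtained from a few iterations of coordinate descent on the dual of an entropy-regularized selection problem. The sketch below is only an illustration of that idea in plain PyTorch, under assumed names (`soft_topk_coordinate_descent`, `MemoryRouter`, `num_routed`); it is not the code in this repository or in lucidrains/CoLT5-attention.

```python
import torch
import torch.nn.functional as F
from torch import nn

def soft_topk_coordinate_descent(scores, k, n_iters=20, eps=0.1):
    """
    Differentiable (soft) top-k over the last dimension, solved by a few
    iterations of coordinate descent on the dual of an entropy-regularized
    selection problem. Returns weights in [0, 1] summing approximately to k.
    """
    b = -scores
    constant = eps * torch.log(torch.tensor(float(k), device=scores.device))

    for _ in range(n_iters):
        # dual variable enforcing "weights sum to k"
        a = constant - eps * torch.logsumexp((scores + b) / eps, dim=-1, keepdim=True)
        # dual variables enforcing "each weight is at most 1"
        b = -F.relu(scores + a)

    return torch.exp((scores + a + b) / eps)


class MemoryRouter(nn.Module):
    """
    Hypothetical sketch: scores a long sequence of memory tokens and routes
    only a top-k subset onward (e.g. into an attention layer), so that very
    long memories need not all be attended to.
    """
    def __init__(self, dim, num_routed):
        super().__init__()
        self.num_routed = num_routed
        self.to_score = nn.Linear(dim, 1, bias=False)

    def forward(self, memories):                       # (batch, num_memories, dim)
        scores = self.to_score(memories).squeeze(-1)   # (batch, num_memories)
        weights = soft_topk_coordinate_descent(scores, self.num_routed)
        # hard-select the top scoring memories, keeping gradients through the soft weights
        topk = weights.topk(self.num_routed, dim=-1)
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, memories.shape[-1])
        routed = memories.gather(1, idx)
        return routed * topk.values.unsqueeze(-1)      # weighted routed memories


if __name__ == "__main__":
    router = MemoryRouter(dim=512, num_routed=64)
    mems = torch.randn(2, 4096, 512)                   # stand-in for a much longer memory
    print(router(mems).shape)                          # torch.Size([2, 64, 512])
```

Hard selection keeps downstream attention cost proportional to the routed subset, while the soft weights keep the router trainable.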
@@ -111,5 +111,12 @@ $ python train.py
 }
 ```
 
+```bibtex
+@inproceedings{Ainslie2023CoLT5FL,
+    title  = {CoLT5: Faster Long-Range Transformers with Conditional Computation},
+    author = {Joshua Ainslie and Tao Lei and Michiel de Jong and Santiago Onta{\~n}{\'o}n and Siddhartha Brahma and Yury Zemlyanskiy and David Uthus and Mandy Guo and James Lee-Thorp and Yi Tay and Yun-Hsuan Sung and Sumit Sanghai},