In the original paper (Weighted Transformer), the authors mention that "all bounds are respected during each training step by projection."
I don't understand what "by projection" means, or how to maintain the constraints sum(k)=1 and sum(α)=1 during training.
As far as I can tell, this repository does nothing to enforce these constraints beyond the initialization. Could you please explain?
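
My best guess is that "projection" means a Euclidean projection of each weight vector back onto the probability simplex (non-negative entries that sum to 1) after every optimizer step. Below is a minimal sketch of what I imagine, in plain Python. The function name `project_to_simplex` and the example values are my own assumptions, not taken from the paper or from this repository:

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}.

    Sort-based method: find the shift theta such that clipping
    (v - theta) at zero yields entries that sum to 1.
    """
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumsum += u_j
        t = (cumsum - 1.0) / j
        # The condition holds for a prefix of the sorted values;
        # the last j that satisfies it gives the final theta.
        if u_j - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


# Hypothetical values of the weights right after a gradient step:
# they no longer sum to 1 and one entry has gone negative.
kappa_after_step = [0.5, 0.7, -0.1]

kappa = project_to_simplex(kappa_after_step)
print(kappa, sum(kappa))
```

If this is the right idea, I would expect the training loop to apply such a projection to both sets of weights after each `optimizer.step()`, but I could not find anything like that here.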