The torch-scatter dependency is a huge PITA: its binary wheels have to match the exact PyTorch and CUDA versions, which makes installation fragile.
We currently use it to reduce over the sequences in a ragged batch and compute the inputs to the value head:
https://github.com/entity-neural-network/incubator/blob/85cd666f3401ca0d9eebfd0b6603e14de2311b4a/rogue_net/rogue_net/actor.py#L110-L112
We could add a less efficient pure-torch implementation of this operation, dynamically detect whether torch-scatter is installed, and fall back to the pure-torch version when it isn't. That way torch-scatter becomes an optional dependency. Rough sketch below.
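A minimal sketch of what this could look like, assuming the reduction we need is a mean over embeddings grouped by sequence index (the function names here are hypothetical, not the actual `actor.py` API):

```python
import torch

try:
    from torch_scatter import scatter
except ImportError:
    scatter = None  # torch-scatter not installed, use pure-torch fallback


def _scatter_mean_torch(src: torch.Tensor, index: torch.Tensor, dim_size: int) -> torch.Tensor:
    """Pure-torch mean of the rows of `src` grouped by `index`, via index_add_.

    src: (N, D) flat batch of embeddings, index: (N,) sequence id per row.
    Returns a (dim_size, D) tensor of per-sequence means.
    """
    out = torch.zeros(dim_size, src.size(-1), dtype=src.dtype, device=src.device)
    out.index_add_(0, index, src)
    # Count rows per sequence so we can divide sums into means.
    counts = torch.zeros(dim_size, dtype=src.dtype, device=src.device)
    counts.index_add_(0, index, torch.ones_like(index, dtype=src.dtype))
    return out / counts.clamp(min=1).unsqueeze(-1)


def scatter_mean(src: torch.Tensor, index: torch.Tensor, dim_size: int) -> torch.Tensor:
    """Use torch-scatter when available, otherwise the pure-torch fallback."""
    if scatter is not None:
        return scatter(src, index, dim=0, dim_size=dim_size, reduce="mean")
    return _scatter_mean_torch(src, index, dim_size)
```

The fallback does two `index_add_` passes instead of one fused kernel, so it allocates an extra counts tensor and will be somewhat slower, but it only touches core torch ops and should work on any device.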