We should have a good theoretical understanding and/or empirical answers to the following questions:
- When a policy controls multiple entities, should the policy/entropy losses from each entity be summed, averaged, or combined in some other way?
- How should the losses from different actions be combined?
- Do losses from different action types need different weights, and if so, is there a good way to find those weights?
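
To make the alternatives in these questions concrete, here is a minimal sketch of two candidate reductions (sum and mean over entities) followed by a weighted sum over action types. The function names, the action names `move` and `attack`, the loss values, and the weights are all hypothetical and chosen only for illustration. Nothing here implies that either reduction or this weighting scheme is the right answer.

```python
from typing import Dict, List


def combine_entity_losses(per_entity: List[float], mode: str = "mean") -> float:
    """Reduce the per-entity losses of one action type to a scalar."""
    if not per_entity:
        return 0.0
    total = sum(per_entity)
    if mode == "sum":
        return total
    if mode == "mean":
        return total / len(per_entity)
    raise ValueError(f"unknown mode: {mode}")


def combine_action_losses(
    losses_by_action: Dict[str, List[float]],
    weights: Dict[str, float],
    mode: str = "mean",
) -> float:
    """Weighted sum over action types of the reduced per-entity losses."""
    return sum(
        # Action types without an explicit weight default to 1.0 (an assumption).
        weights.get(name, 1.0) * combine_entity_losses(losses, mode)
        for name, losses in losses_by_action.items()
    )


# Hypothetical per-entity policy losses for two action types.
losses = {"move": [0.5, 1.5, 1.0], "attack": [2.0]}
weights = {"move": 1.0, "attack": 0.5}

summed = combine_action_losses(losses, weights, mode="sum")
averaged = combine_action_losses(losses, weights, mode="mean")
```

With `mode="sum"`, the loss magnitude grows with the number of entities acting in a step, so action types used by many entities dominate the gradient. With `mode="mean"`, each action type contributes at a scale independent of entity count, but each individual entity's signal shrinks as the count grows. Which behavior is preferable is exactly what the questions above leave open.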