Open
Description
For MuJoCo envs, it's standard practice to normalize rewards by a running estimate of their standard deviation (e.g. VecNormalize in baselines, NormalizedEnv in rllab). Without it, performance is noticeably worse. For example, in the current PPO implementation the value function fails to converge because the return magnitudes are too high, and the algorithm takes around 3x as many iterations to converge as the normalized implementations.
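
To make the request concrete, here is a minimal sketch of what such normalization could look like. It is not the code from baselines or rllab: the class names `RunningMeanStd` and `RewardNormalizer` and the defaults for `gamma`, `clip`, and `epsilon` are assumptions chosen for illustration. It also assumes the variant that scales rewards by the running standard deviation of the discounted return, without subtracting a mean, which is how I understand VecNormalize to behave.

```python
import numpy as np


class RunningMeanStd:
    """Running mean and variance of a scalar stream (parallel-variance update)."""

    def __init__(self, epsilon=1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # small prior count avoids division by zero

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean, batch_var, batch_count = batch.mean(), batch.var(), batch.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Combine the old and new sums of squared deviations.
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean = self.mean + delta * batch_count / total
        self.var = m2 / total
        self.count = total


class RewardNormalizer:
    """Hypothetical helper: scales rewards by the running std of discounted returns."""

    def __init__(self, num_envs, gamma=0.99, clip=10.0, epsilon=1e-8):
        self.rms = RunningMeanStd()
        self.returns = np.zeros(num_envs)
        self.gamma = gamma
        self.clip = clip
        self.epsilon = epsilon

    def __call__(self, rewards, dones):
        rewards = np.asarray(rewards, dtype=np.float64)
        # Track a per-env discounted return and update its running variance.
        self.returns = self.returns * self.gamma + rewards
        self.rms.update(self.returns)
        scaled = rewards / np.sqrt(self.rms.var + self.epsilon)
        # Reset the accumulated return for envs whose episode just ended.
        self.returns[np.asarray(dones, dtype=bool)] = 0.0
        return np.clip(scaled, -self.clip, self.clip)
```

In a PPO rollout loop this would be called once per vectorized step, e.g. `rewards = normalizer(rewards, dones)`, before the rewards are stored for advantage and value-target computation. Only the scale changes, so the sign of each reward is preserved.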