Normalizing environment wrapper

For MuJoCo environments, it's standard practice to normalize rewards by a running estimate of their standard deviation (e.g. VecNormalize in baselines, NormalizedEnv in rllab). Without this normalization, performance is noticeably worse. In the current PPO implementation, for example, the value function fails to converge because the return magnitudes are too large, and the algorithm takes roughly 3x as many iterations to converge as the normalized implementations do.
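To make the idea concrete, here is a minimal single-environment sketch of running-std reward scaling. `RunningMeanStd` and `RewardNormalizer` are hypothetical helpers written for illustration, not the baselines or rllab API. The sketch assumes the VecNormalize-style variant, which tracks the variance of the running discounted return rather than of the raw rewards, divides each reward by that standard deviation without subtracting a mean, and clips the result. The defaults for `gamma`, `epsilon`, and `clip` are assumed values.

```python
import math


class RunningMeanStd:
    """Running mean/variance of a scalar stream (parallel-variance update).

    Starts from a weak prior (mean 0, var 1, tiny pseudo-count) so the
    variance is defined before the first sample arrives.
    """

    def __init__(self, epsilon: float = 1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon

    def update(self, x: float) -> None:
        # Merge one new sample (batch of size 1, batch variance 0).
        delta = x - self.mean
        total = self.count + 1.0
        self.mean += delta / total
        m2 = self.var * self.count + delta * delta * self.count / total
        self.var = m2 / total
        self.count = total


class RewardNormalizer:
    """Hypothetical reward scaler: divide by the running std of the discounted return."""

    def __init__(self, gamma: float = 0.99, epsilon: float = 1e-8, clip: float = 10.0):
        self.gamma = gamma
        self.epsilon = epsilon
        self.clip = clip
        self.ret = 0.0  # discounted return accumulated within the current episode
        self.stats = RunningMeanStd()

    def __call__(self, reward: float, done: bool) -> float:
        # Update the discounted return and its running statistics.
        self.ret = self.ret * self.gamma + reward
        self.stats.update(self.ret)
        # Scale only (no mean subtraction), then clip to [-clip, clip].
        scaled = reward / math.sqrt(self.stats.var + self.epsilon)
        if done:
            self.ret = 0.0  # reset the accumulator at episode boundaries
        return max(-self.clip, min(self.clip, scaled))
```

Because the scale estimate adapts to whatever magnitude the environment produces, multiplying every reward by a constant leaves the normalized rewards (after a warm-up period) essentially unchanged. This is the property that keeps value-function targets in a well-conditioned range. Leaving rewards uncentered preserves their sign, which is presumably why the mean is not subtracted.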

Normalizing environment wrapper #115