
Normalize rewards by standard deviation of discounted return in MuJoCo#149

Open
vzhuang wants to merge 1 commit into astooke:master from vzhuang:normalize_rewards

Conversation


@vzhuang vzhuang commented Apr 21, 2020

Averaged results over 10 runs for PPO on Walker2d-v3:

[plot: walker2dv3normtest — PPO return curves on Walker2d-v3, averaged over 10 runs]

vzhuang (Author) commented Apr 21, 2020

#115
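The scheme proposed here (and requested in #115) divides rewards by a running standard deviation of the discounted return, in the style of the baselines-era `VecNormalize` wrapper. A minimal NumPy sketch of that idea — the `RunningMeanStd` and `discount_return` helpers below are illustrative stand-ins, not rlpyt's actual implementations:

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance via the parallel (Chan et al.) update.
    Illustrative stand-in for a baselines-style running-stats helper."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        batch_mean, batch_var, n = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        tot = self.count + n
        m2 = self.var * self.count + batch_var * n + delta ** 2 * self.count * n / tot
        self.mean = self.mean + delta * n / tot
        self.var, self.count = m2 / tot, tot

def discount_return(reward, done, bootstrap, discount):
    """Per-step discounted return over a segment, resetting at episode ends."""
    T = len(reward)
    ret = np.zeros(T)
    running = bootstrap
    for t in reversed(range(T)):
        running = reward[t] + discount * running * (1.0 - done[t])
        ret[t] = running
    return ret

# Update the running stats with the (un-bootstrapped) discounted returns,
# then scale the raw rewards by the running std before they reach the algorithm.
rets_rms = RunningMeanStd()
reward = np.array([1.0, 1.0, 1.0, 10.0])
done = np.array([0.0, 0.0, 1.0, 0.0])
ret = discount_return(reward, done, bootstrap=0.0, discount=0.99)
rets_rms.update(ret)
norm_reward = reward / np.sqrt(rets_rms.var + 1e-8)
```

Note the mean is tracked but not subtracted: only the scale of the rewards is normalized, so the sign and relative structure of the reward signal are preserved.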


codecov-io commented Apr 21, 2020

Codecov Report

Merging #149 into master will decrease coverage by 0.00%.
The diff coverage is 20.58%.


@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   22.56%   22.56%   -0.01%     
==========================================
  Files         128      128              
  Lines        7987     8014      +27     
==========================================
+ Hits         1802     1808       +6     
- Misses       6185     6206      +21     
Flag          Coverage Δ
#unittests    22.56% <20.58%> (-0.01%) ⬇️
Impacted Files                                      Coverage Δ
rlpyt/algos/pg/a2c.py 0.00% <0.00%> (ø)
rlpyt/algos/pg/base.py 0.00% <0.00%> (ø)
rlpyt/algos/pg/ppo.py 0.00% <0.00%> (ø)
rlpyt/experiments/configs/mujoco/pg/mujoco_a2c.py 0.00% <ø> (ø)
rlpyt/experiments/configs/mujoco/pg/mujoco_ppo.py 0.00% <ø> (ø)
rlpyt/samplers/base.py 80.00% <ø> (ø)
rlpyt/samplers/collections.py 96.29% <ø> (ø)
rlpyt/samplers/collectors.py 81.03% <ø> (ø)
rlpyt/samplers/parallel/gpu/collectors.py 0.00% <0.00%> (ø)
rlpyt/samplers/serial/sampler.py 97.72% <ø> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 668290d...a15a93b. Read the comment docs.

astooke (Owner) commented Jun 30, 2020

OK, this is interesting, and I think it can be made a lot simpler. As far as I can tell from the PR, the same could be achieved by changing process_returns() of the policy gradient class, right after the following lines:

reward, done, value, bv = (samples.env.reward, samples.env.done,
    samples.agent.agent_info.value, samples.agent.bootstrap_value)
done = done.type(reward.dtype)

by inserting:

if self.normalize_reward:
    return_ = discount_return(reward, done, 0., self.discount)  # NO bootstrapping of value
    self.rets_rms.update(return_.view(-1, 1))  # matching the shape you used; not sure if the extra dim is needed?
    std_dev = torch.sqrt(self.rets_rms.var)
    reward = torch.div(reward, std_dev)

# proceed with computing discounted returns or GAE returns using the scaled reward

I think that accomplishes the same math? And doesn't need to change any files in the sampler :)
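The "same math" intuition can be checked directly: both the discounted return and GAE are linear in the rewards and values, so dividing everything by a constant (here, the running std) divides every advantage by that constant — it does not matter whether the scaling happens in the sampler or at the top of process_returns(). A quick NumPy check with an illustrative `gae` helper (not rlpyt's actual implementation):

```python
import numpy as np

def gae(reward, value, done, bootstrap, discount, lam):
    """Generalized advantage estimation over a T-step segment."""
    T = len(reward)
    adv = np.zeros(T)
    next_value = bootstrap
    running = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - done[t]
        delta = reward[t] + discount * next_value * not_done - value[t]
        running = delta + discount * lam * not_done * running
        adv[t] = running
        next_value = value[t]
    return adv

reward = np.array([1.0, 0.5, 2.0])
value = np.array([0.3, 0.2, 0.1])
done = np.zeros(3)
c = 3.7  # any positive scale, standing in for the running std

# Scaling rewards and values by 1/c scales every advantage by 1/c.
a_scaled = gae(reward / c, value / c, done, 0.0, discount=0.99, lam=0.95)
a_plain = gae(reward, value, done, 0.0, discount=0.99, lam=0.95) / c
```

`a_scaled` and `a_plain` agree to floating-point precision, which is the linearity property the in-algorithm placement relies on (in practice the value net simply learns to predict in the rescaled units).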

astooke (Owner) commented Sep 5, 2020

Any more comments? Has anyone else used this?
