Simple policy gradient: mean over episodes? #59
Comments
@neighthan `loss = -tf.reduce_sum(weights_ph * log_probs) / number_of_trajectories_in_the_epoch`, where `number_of_trajectories_in_the_epoch` can be calculated by counting how many times the code enters the `if done:` branch:

```python
if done:
    number_of_trajectories_in_the_epoch += 1
    # if episode is over, record info about episode
    ep_ret, ep_len = sum(ep_rews), len(ep_rews)
    batch_rets.append(ep_ret)
    batch_lens.append(ep_len)
    ...
```

I have tried this change in the code, but the performance got worse.
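A minimal sketch of how that counter could be wired into the TF1 graph (not the repo's code; `n_traj_ph` and the layer sizes are made up here), since the count is only known on the Python side after the rollout:

```python
import tensorflow as tf  # TF 1.x, as in the simple policy gradient example

obs_dim, n_acts = 4, 2  # illustrative sizes (e.g. CartPole-v0)

obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)
weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
# hypothetical extra placeholder: number of episodes in the current batch
n_traj_ph = tf.placeholder(shape=(), dtype=tf.float32)

# small policy network and per-step log-probabilities of the taken actions
logits = tf.layers.dense(tf.layers.dense(obs_ph, 32, tf.tanh), n_acts)
log_probs = tf.reduce_sum(tf.one_hot(act_ph, n_acts) * tf.nn.log_softmax(logits), axis=1)

# sum over every step in the epoch, divided by the episode count,
# rather than tf.reduce_mean over all steps; at train time, feed the
# Python-side trajectory counter into n_traj_ph with the rest of the batch
loss = -tf.reduce_sum(weights_ph * log_probs) / n_traj_ph
```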
My understanding is: this sounds right, since we don't care about the value of the loss itself, only about the parameters that minimize it. Parameters that minimize one version also minimize the other (since the difference is just a constant multiplier, as mentioned above). It would be good to know if this was chosen for simplicity or for some other reason.
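To spell out the constant-factor argument (notation introduced here, not from the repo: $T$ total steps and $N$ episodes in one batch, $w_i$ the per-step weight):

$$
L_{\text{steps}}(\theta) = -\frac{1}{T}\sum_{i=1}^{T} w_i \log \pi_\theta(a_i \mid s_i),
\qquad
L_{\text{eps}}(\theta) = -\frac{1}{N}\sum_{i=1}^{T} w_i \log \pi_\theta(a_i \mid s_i),
$$

so $L_{\text{eps}} = \tfrac{T}{N}\, L_{\text{steps}}$ with $\tfrac{T}{N} > 0$, and hence $\nabla_\theta L_{\text{eps}} = \tfrac{T}{N}\, \nabla_\theta L_{\text{steps}}$. The gradient direction and the minimizers are identical; only the effective step size changes (and since $T/N$ varies from batch to batch, switching the normalization without also retuning the learning rate can still change training behavior).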
As others have mentioned, the two methods have the same optima, since they differ only by a multiplicative scalar. However, I wanted to point out that many formulations of the policy gradient use an expectation over actions sampled from the current policy and states sampled from the environment (under the current policy). In that case, the mean is over the number of steps, not the number of trajectories. The samples don't even need to come in whole trajectories, e.g. you could downsample steps from your trajectories to reduce temporal correlations before sending them to the gradient (sketched below).

I think the stranger issue is that the batch is only checked for completion when an episode finishes, so every episode in the batch runs to termination and the total number of steps can overshoot the nominal batch size. For non-vanilla PG methods this actually becomes more complicated, since we are estimating values/advantages. The original PPO paper mentions that the rollout length T should be "much less than the episode length" (apparently quoting the A3C paper, but I don't see this mentioned there?), though I see this violated all the time in practice.

EDIT: Also noticed that the PPO introduction (see Algorithm 1) divides by the number of trajectories multiplied by the trajectory length (fixed in this case, since rollouts are of fixed length), i.e. the total number of steps.

EDIT 2: This paper mentions that the number of independent samples from the policy is the number of trajectories (not steps). That doesn't really answer the original question, but it's related to the idea of steps vs. episodes vs. fixed-length trajectories; there is a relevant quote about this in the paper.
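A small sketch of that step-level subsampling (array names and sizes are invented here, mirroring the flat per-step batches in the simple policy gradient code):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for the flat per-step arrays collected over one epoch
# (500 total steps across however many episodes, obs_dim = 4)
batch_obs = rng.normal(size=(500, 4))
batch_acts = rng.integers(0, 2, size=500)
batch_weights = rng.normal(size=500)  # e.g. R(tau) copied onto every step of tau

# treat each step as a sample: draw a minibatch of steps, ignoring episode
# boundaries, to reduce temporal correlation before the gradient step
idx = rng.choice(len(batch_obs), size=128, replace=False)
mb_obs, mb_acts, mb_weights = batch_obs[idx], batch_acts[idx], batch_weights[idx]
# mb_* would then be fed to the loss exactly as the full batch is
```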
Agreed. But for beginners, we'd better not confuse them. Just add
In the simple policy gradient implementation here, all of the observations, actions, and rewards for one "epoch" (potentially consisting of multiple episodes) are gathered into the same lists. These are then all passed into the MLP that parametrizes the policy. Then we have this line, `loss = -tf.reduce_mean(weights_ph * log_probs)`, where `log_probs` is just a 1-D vector of the log-probabilities of all the actions taken in the epoch (regardless of which episode they were from). But on the introduction page we have this equation:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

It seems like the implementation takes the mean over the number of elements in both sums (the total number of actions) instead of just over the number of elements in the outer sum (the number of episodes). Am I misunderstanding this, or is this an error? Or is it just implemented like this for simplicity, since scaling the loss by a constant factor only scales the gradient without changing its direction?

It's also clearly nicer to store all observations, etc. in single vectors, because different episodes will generally have different numbers of steps, so we can't, e.g., have a matrix with one row of actions per episode without padding. But we could, e.g., pass the number of episodes directly to the model, and then, even with all actions etc. in single vectors, we could still get the result shown in the equation above by summing across the vector and dividing by the given number of episodes (a sketch of this bookkeeping follows below).
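For what it's worth, a tiny sketch (names invented here) of that bookkeeping: keep the flat per-step vectors, track the episode lengths, and divide the summed per-step terms by the number of episodes:

```python
import numpy as np

# flat per-step data for one epoch: 3 episodes of different lengths
batch_acts = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1])
batch_lens = [4, 2, 3]              # steps per episode
n_episodes = len(batch_lens)

# per-episode segments can be recovered from the flat vector without padding
episode_acts = np.split(batch_acts, np.cumsum(batch_lens)[:-1])

# the estimator in the equation: sum the per-step terms over all steps,
# then divide by the number of episodes rather than the number of steps
per_step_terms = np.ones(len(batch_acts))  # stand-in for w_i * grad log pi
grad_scale = per_step_terms.sum() / n_episodes
print([len(a) for a in episode_acts], grad_scale)  # [4, 2, 3] 3.0
```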
Thanks for your help understanding this!