Proximal Policy Optimization Algorithms


Summary

A new family of policy gradient methods for reinforcement learning is proposed, which alternates between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.

Q-learning fails on many simple problems and is poorly understood; vanilla policy gradient methods have poor data efficiency and robustness; and trust region policy optimization (TRPO) is relatively complicated and is not compatible with architectures that include noise (such as dropout) or parameter sharing.

Standard policy gradient methods perform one gradient update per data sample. A novel objective function is proposed that enables multiple epochs of minibatch updates with clipped probability ratios. To optimize policies, the method alternates between sampling data from the policy and performing several epochs of optimization on the sampled data.

Background: Policy Optimization

Policy Gradient Methods

Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ is an estimator of the advantage function at timestep $t$. The expectation $\hat{\mathbb{E}}_t[\ldots]$ indicates the empirical average over a finite batch of samples.

An objective function whose gradient is the policy gradient estimator is constructed; the estimator is obtained by differentiating the objective

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right].$$
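As an illustration, a minimal sketch of this objective in PyTorch (assumed here; the paper does not prescribe a framework), where `log_probs` and `advantages` are illustrative tensor names:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative L^PG, to be minimized with a standard optimizer.

    log_probs:  log pi_theta(a_t | s_t) for the sampled actions (differentiable).
    advantages: advantage estimates A_hat_t (treated as constants).
    """
    # Differentiating this loss yields (the negative of) the policy gradient
    # estimator g_hat = E_t[ grad_theta log pi_theta(a_t | s_t) * A_hat_t ].
    return -(log_probs * advantages.detach()).mean()
```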

Trust Region Methods

In TRPO, an objective function is maximized subject to a constraint on the size of the policy update. Specifically,

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$$

$$\text{subject to}\ \hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta.$$

Here, $\theta_\text{old}$ is the vector of policy parameters before the update. This problem can be solved approximately and efficiently using the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.

The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$

for some coefficient $\beta$. This follows from the fact that a certain surrogate objective forms a lower bound on the performance of the policy.

Clipped Surrogate Objective

Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$, so $r_t(\theta_\text{old}) = 1$. TRPO maximizes a "surrogate" objective:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$$

The superscript CPI refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update. The main objective is therefore

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \text{clip}\left(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\right) \hat{A}_t\right)\right]$$

where $\epsilon$ is a hyperparameter.

The clip term modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1 - \epsilon,\ 1 + \epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound on the unclipped objective.
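A minimal sketch of $L^{CLIP}$, again assuming PyTorch and working from log-probabilities rather than raw probabilities (a common implementation choice, not mandated by the paper):

```python
import torch

def clipped_surrogate_loss(log_probs: torch.Tensor,
                           old_log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Negative L^CLIP, to be minimized."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise minimum of the clipped and unclipped terms, so the result
    # is a lower bound on the unclipped surrogate.
    return -torch.min(unclipped, clipped).mean()
```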

Adaptive KL Penalty Coefficient

A penalty on the KL divergence can be used, with the penalty coefficient adapted so that some target value $d_\text{targ}$ of the KL divergence is achieved on each policy update.

  • Using several epochs of minibatch SGD, optimize the KL-penalized objective $L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t - \beta\, \text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$

  • Compute $d = \hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$

    • If $d < d_\text{targ} / 1.5$, $\beta \leftarrow \beta / 2$
    • If $d > d_\text{targ} \times 1.5$, $\beta \leftarrow \beta \times 2$

The updated $\beta$ is used for the next policy update.
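A sketch of this variant, with illustrative names (`kl` is a per-timestep estimate of the KL divergence between the old and new policies; the halving/doubling constants follow the heuristic above):

```python
import torch

def kl_penalized_loss(log_probs, old_log_probs, advantages, kl, beta):
    """Negative L^KLPEN: the unclipped surrogate minus beta times the KL penalty."""
    ratio = torch.exp(log_probs - old_log_probs.detach())
    return -(ratio * advantages - beta * kl).mean()

def update_beta(beta, d, d_targ):
    """Adapt the penalty coefficient after each policy update."""
    if d < d_targ / 1.5:
        beta /= 2.0      # KL well below target: relax the penalty
    elif d > d_targ * 1.5:
        beta *= 2.0      # KL well above target: strengthen the penalty
    return beta
```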

Algorithm

The surrogate losses from the previous sections can be computed and differentiated with a minor change to a typical policy gradient implementation. Most techniques for computing variance-reduced advantage-function estimators make use of a learned state-value function $V(s)$.

If a neural network architecture that shares parameters between the policy and value function is used, a loss function that combines the policy surrogate and a value-function error term must be used. This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration. Combining these terms, the following objective is obtained, which is (approximately) maximized each iteration:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$

where $c_1, c_2$ are coefficients, $S$ denotes an entropy bonus, and $L_t^{VF}$ is a squared-error loss $\left(V_\theta(s_t) - V_t^\text{targ}\right)^2$.
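A sketch of the combined objective, assuming PyTorch; the coefficient values `c1` and `c2` below are illustrative defaults, not values prescribed by this summary:

```python
import torch

def combined_loss(log_probs, old_log_probs, advantages,
                  values, value_targets, entropy,
                  eps=0.2, c1=0.5, c2=0.01):
    """Negative of L^{CLIP+VF+S}, to be minimized."""
    ratio = torch.exp(log_probs - old_log_probs.detach())
    clip_obj = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    value_loss = ((values - value_targets) ** 2).mean()   # L^VF: squared-error loss
    # Maximizing L^CLIP - c1 * L^VF + c2 * S  ==  minimizing its negation.
    return -(clip_obj - c1 * value_loss + c2 * entropy.mean())
```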

The advantage estimator runs the policy for $T$ timesteps and is given by

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T).$$

Generalizing this choice to a truncated version of generalized advantage estimation, we get

$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}, \quad \text{where}\ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).$$
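A sketch of the truncated GAE computation, iterating backwards over a $T$-step segment (tensor names and shapes are illustrative; `values` is assumed to be detached from the computation graph):

```python
import torch

def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae, next_value = 0.0, last_value
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_hat_t = delta_t + (gamma * lam) * A_hat_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```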

The resulting proximal policy optimization (PPO) algorithm uses fixed-length trajectory segments. Each iteration, each of $N$ parallel actors collects $T$ timesteps of data. The surrogate loss is then constructed on these $NT$ timesteps of data and optimized using minibatch SGD for $K$ epochs.
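The loop below is a self-contained toy sketch of this collect-then-optimize structure (assumed PyTorch, a single actor, and random synthetic data standing in for real environment interaction and advantage estimation, which the algorithm performs with $N$ parallel actors):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, T, K, eps = 4, 2, 128, 4, 0.2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_fn.parameters()), lr=3e-4)

for iteration in range(10):
    # Stand-in for "each actor collects T timesteps of data with the current policy".
    obs = torch.randn(T, obs_dim)
    with torch.no_grad():
        dist = torch.distributions.Categorical(logits=policy(obs))
        actions = dist.sample()
        old_log_probs = dist.log_prob(actions)
        values = value_fn(obs).squeeze(-1)
    rewards = torch.randn(T)              # placeholder rewards
    returns = rewards + values            # placeholder value targets
    advantages = returns - values         # placeholder advantage estimates

    # K epochs of minibatch optimization of the clipped surrogate + value loss.
    for epoch in range(K):
        for idx in torch.randperm(T).split(32):
            dist = torch.distributions.Categorical(logits=policy(obs[idx]))
            log_probs = dist.log_prob(actions[idx])
            ratio = torch.exp(log_probs - old_log_probs[idx])
            clip_obj = torch.min(ratio * advantages[idx],
                                 torch.clamp(ratio, 1 - eps, 1 + eps) * advantages[idx]).mean()
            value_loss = ((value_fn(obs[idx]).squeeze(-1) - returns[idx]) ** 2).mean()
            loss = -(clip_obj - 0.5 * value_loss + 0.01 * dist.entropy().mean())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```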