Proximal Policy Optimization Algorithms


Summary

A new family of policy gradient methods for reinforcement learning is proposed, which alternates between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.

Q-learning fails on many simple problems and is poorly understood; vanilla policy gradient methods have poor data efficiency and robustness; and trust region policy optimization (TRPO) is relatively complicated and is not compatible with architectures that include noise (such as dropout) or parameter sharing.

Standard policy gradient methods perform one gradient update per data sample. A novel objective function is proposed that enables multiple epochs of minibatch updates with clipped probability ratios. To optimize policies, the method alternates between sampling data from the policy and performing several epochs of optimization on the sampled data.

Background: Policy Optimization

Policy Gradient Methods

Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ is an estimator of the advantage function at timestep $t$. The expectation $\hat{\mathbb{E}}_t[\ldots]$ indicates the empirical average over a finite batch of samples.

An objective function whose gradient is the policy gradient estimator is constructed; the estimator is obtained by differentiating the objective

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right].$$
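As an illustration, a minimal sketch of this objective in PyTorch (assumed here; the paper does not prescribe a framework), where `log_probs` and `advantages` are illustrative tensor names:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative L^PG, to be minimized with a standard optimizer.

    log_probs:  log pi_theta(a_t | s_t) for the sampled actions (differentiable).
    advantages: advantage estimates A_hat_t (treated as constants).
    """
    # Differentiating this loss yields (the negative of) the policy gradient
    # estimator g_hat = E_t[ grad_theta log pi_theta(a_t | s_t) * A_hat_t ].
    return -(log_probs * advantages.detach()).mean()
```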

Trust Region Methods

In TRPO, an objective function is maximized subject to a constraint on the size of the policy update. Specifically,

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$$

$$\text{subject to}\ \hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta.$$

Here, $\theta_\text{old}$ is the vector of policy parameters before the update. This problem can be solved approximately and efficiently using the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.

The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$

for some coefficient $\beta$. This follows from the fact that a certain surrogate objective forms a lower bound on the performance of the policy.

Clipped Surrogate Objective

Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$, so $r_t(\theta_\text{old}) = 1$. TRPO maximizes a "surrogate" objective:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$$

The superscript CPI refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update. The main objective is therefore

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \text{clip}\left(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\right) \hat{A}_t\right)\right]$$

where $\epsilon$ is a hyperparameter.

The clip term modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1 - \epsilon,\ 1 + \epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound on the unclipped objective.
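A minimal sketch of $L^{CLIP}$, again assuming PyTorch and working from log-probabilities rather than raw probabilities (a common implementation choice, not mandated by the paper):

```python
import torch

def clipped_surrogate_loss(log_probs: torch.Tensor,
                           old_log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Negative L^CLIP, to be minimized."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise minimum of the clipped and unclipped terms, so the result
    # is a lower bound on the unclipped surrogate.
    return -torch.min(unclipped, clipped).mean()
```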

Adaptive KL Penalty Coefficient

A penalty on the KL divergence can be used, with the penalty coefficient adapted so that some target value $d_\text{targ}$ of the KL divergence is achieved on each policy update.

  • Using several epochs of minibatch SGD, optimize the KL-penalized objective $L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t - \beta\, \text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$

  • Compute $d = \hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$

    • If $d < d_\text{targ} / 1.5$, $\beta \leftarrow \beta / 2$
    • If $d > d_\text{targ} \times 1.5$, $\beta \leftarrow \beta \times 2$

The updated $\beta$ is used for the next policy update.
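A sketch of this variant, with illustrative names (`kl` is a per-timestep estimate of the KL divergence between the old and new policies; the halving/doubling constants follow the heuristic above):

```python
import torch

def kl_penalized_loss(log_probs, old_log_probs, advantages, kl, beta):
    """Negative L^KLPEN: the unclipped surrogate minus beta times the KL penalty."""
    ratio = torch.exp(log_probs - old_log_probs.detach())
    return -(ratio * advantages - beta * kl).mean()

def update_beta(beta, d, d_targ):
    """Adapt the penalty coefficient after each policy update."""
    if d < d_targ / 1.5:
        beta /= 2.0      # KL well below target: relax the penalty
    elif d > d_targ * 1.5:
        beta *= 2.0      # KL well above target: strengthen the penalty
    return beta
```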

Algorithm

The surrogate losses from the previous sections can be computed and differentiated with a minor change to a typical policy gradient implementation. Most techniques for computing variance-reduced advantage-function estimators make use of a learned state-value function $V(s)$.

If a neural network architecture that shares parameters between the policy and value function is used, a loss function that combines the policy surrogate and a value-function error term must be used. This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration. Combining these terms, the following objective is obtained, which is (approximately) maximized each iteration:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$

where $c_1, c_2$ are coefficients, $S$ denotes an entropy bonus, and $L_t^{VF}$ is a squared-error loss $\left(V_\theta(s_t) - V_t^\text{targ}\right)^2$.
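A sketch of the combined objective, assuming PyTorch; the coefficient values `c1` and `c2` below are illustrative defaults, not values prescribed by this summary:

```python
import torch

def combined_loss(log_probs, old_log_probs, advantages,
                  values, value_targets, entropy,
                  eps=0.2, c1=0.5, c2=0.01):
    """Negative of L^{CLIP+VF+S}, to be minimized."""
    ratio = torch.exp(log_probs - old_log_probs.detach())
    clip_obj = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    value_loss = ((values - value_targets) ** 2).mean()   # L^VF: squared-error loss
    # Maximizing L^CLIP - c1 * L^VF + c2 * S  ==  minimizing its negation.
    return -(clip_obj - c1 * value_loss + c2 * entropy.mean())
```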

The advantage estimator runs the policy for $T$ timesteps and is given by

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T).$$

Generalizing this choice to a truncated version of generalized advantage estimation, we get

$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}, \quad \text{where}\ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).$$
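A sketch of the truncated GAE computation, iterating backwards over a $T$-step segment (tensor names and shapes are illustrative; `values` is assumed to be detached from the computation graph):

```python
import torch

def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae, next_value = 0.0, last_value
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_hat_t = delta_t + (gamma * lam) * A_hat_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```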

The resulting proximal policy optimization (PPO) algorithm uses fixed-length trajectory segments. Each iteration, each of $N$ parallel actors collects $T$ timesteps of data. The surrogate loss is then constructed on these $NT$ timesteps of data and optimized using minibatch SGD for $K$ epochs.
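The loop below is a self-contained toy sketch of this collect-then-optimize structure (assumed PyTorch, a single actor, and random synthetic data standing in for real environment interaction and advantage estimation, which the algorithm performs with $N$ parallel actors):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, T, K, eps = 4, 2, 128, 4, 0.2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_fn.parameters()), lr=3e-4)

for iteration in range(10):
    # Stand-in for "each actor collects T timesteps of data with the current policy".
    obs = torch.randn(T, obs_dim)
    with torch.no_grad():
        dist = torch.distributions.Categorical(logits=policy(obs))
        actions = dist.sample()
        old_log_probs = dist.log_prob(actions)
        values = value_fn(obs).squeeze(-1)
    rewards = torch.randn(T)              # placeholder rewards
    returns = rewards + values            # placeholder value targets
    advantages = returns - values         # placeholder advantage estimates

    # K epochs of minibatch optimization of the clipped surrogate + value loss.
    for epoch in range(K):
        for idx in torch.randperm(T).split(32):
            dist = torch.distributions.Categorical(logits=policy(obs[idx]))
            log_probs = dist.log_prob(actions[idx])
            ratio = torch.exp(log_probs - old_log_probs[idx])
            clip_obj = torch.min(ratio * advantages[idx],
                                 torch.clamp(ratio, 1 - eps, 1 + eps) * advantages[idx]).mean()
            value_loss = ((value_fn(obs[idx]).squeeze(-1) - returns[idx]) ** 2).mean()
            loss = -(clip_obj - 0.5 * value_loss + 0.01 * dist.entropy().mean())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```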