Why is there a minus sign in the loss function of simplest policy gradient in part 3? #436

inv1s10n opened this issue Jan 9, 2025 · 1 comment

inv1s10n commented Jan 9, 2025

[image: screenshot of the loss function from part 3 of the Simplest Policy Gradient docs]

I don't understand why there's a minus sign here; there's no minus sign in my formula.

burichh commented Jan 30, 2025

In supervised learning (so NOT reinforcement learning) the network is trained with gradient descent. This means that during optimization we want to decrease the specified loss function. This is done by updating the parameters in the direction of the negative gradient, because we want the loss to go down. Mathematically (using plain SGD as the simplest example):

$$\theta_{i+1} = \theta_i - \alpha \cdot \nabla L(\theta) | _{\theta_i}$$

where $$\theta_i$$ are the parameters of the network at step $$i$$, $$\alpha$$ is the learning rate and $$\nabla L(\theta) | _{\theta_i}$$ is the gradient of the loss with respect to the parameters. Note the negative sign! The gradient vector points in the direction of steepest increase, but we want to decrease the loss, so we take a step in the opposite direction.
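To make the update concrete, here is a minimal PyTorch sketch of one descent step done by hand on a toy loss $$L(\theta) = \theta^2$$ (the toy loss is just for illustration, it is not anything from Spinning Up):

```python
import torch

theta = torch.tensor([3.0], requires_grad=True)  # theta_i
alpha = 0.1                                      # learning rate

loss = (theta ** 2).sum()   # L(theta)
loss.backward()             # theta.grad now holds dL/dtheta evaluated at theta_i

with torch.no_grad():
    theta -= alpha * theta.grad   # theta_{i+1} = theta_i - alpha * grad (note the minus!)
```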

Now, any time you use torch.optim.SGD or torch.optim.Adam, or any other algorithm from the optimizer package, you are basically using this formula: update the parameters by adding the negative gradient (times the learning rate). In other words, every torch.optim optimizer steps in the direction of the negative gradient, because it was designed to minimize the loss function.
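For example, the same toy step written with torch.optim.SGD (again just an illustration of which direction the optimizer moves, using the toy loss from above):

```python
import torch

theta = torch.tensor([3.0], requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)

loss = (theta ** 2).sum()
opt.zero_grad()
loss.backward()   # gradients are accumulated into theta.grad
opt.step()        # applies theta <- theta - lr * theta.grad, i.e. it minimizes the loss
```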

NOW!

In reinforcement learning, we want to increase the expected reward!

The story is very similar: take the parameters and update them according to the gradient, but in this case use the positive gradient!

$$\theta_{i+1} = \theta_i + \alpha \cdot \nabla J(\theta) | _{\theta_i}$$

where $$\nabla J(\theta) | _{\theta_i}$$ is the gradient of the expected reward with respect to the policy network's parameters. Note the PLUS sign: this is gradient ascent, not descent.

With the trick described in part 3 of the Spinning Up docs, we can compute a very close approximation of $$\nabla J(\theta) | _{\theta_i}$$ by taking the gradient of the log probabilities weighted by the rewards. Nice, we have a good approximation of $$\nabla J(\theta) | _{\theta_i}$$, but the problem is that when we call a torch.optim optimizer on the model parameters $$\theta$$, it will subtract this gradient, not add it. That is why we need the minus sign in the loss: subtracting the gradient of the negated objective is the same as adding the gradient of the objective (times the learning rate), which turns gradient descent into gradient ascent.
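Here is a minimal sketch of that pseudo-loss (the names logp and weights are illustrative, not necessarily identical to the Spinning Up source):

```python
import torch

def compute_loss(logp, weights):
    # logp:    log pi_theta(a_t | s_t) for each step in the collected batch
    # weights: the return R(tau) associated with each step
    #
    # The gradient of (logp * weights).mean() approximates +grad J(theta).
    # The optimizer will SUBTRACT the gradient of whatever we return, so we
    # negate it: subtracting -grad J(theta) is the same as adding +grad J(theta),
    # which turns the optimizer's descent into ascent on the expected reward.
    return -(logp * weights).mean()
```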
