Why is there a minus sign in the loss function of simplest policy gradient in part 3? #436

inv1s10n opened this issue Jan 9, 2025 · 1 comment

inv1s10n commented Jan 9, 2025

[image: screenshot of the loss function from part 3 of the Simplest Policy Gradient docs]

I don't understand why there's a minus sign here; there's no minus sign in my formula.

burichh commented Jan 30, 2025

In supervised learning (so NOT reinforcement learning) the network is trained with gradient descent. This means that during optimization we want to decrease the specified loss function. This is done by updating the parameters in the direction of the negative gradient, because we want the loss to go down. Mathematically (using plain SGD as the simplest example):

$$\theta_{i+1} = \theta_i - \alpha \cdot \nabla L(\theta) | _{\theta_i}$$

where $$\theta_i$$ are the parameters of the network at step $$i$$, $$\alpha$$ is the learning rate and $$\nabla L(\theta) | _{\theta_i}$$ is the gradient of the loss with respect to the parameters. Note the negative sign! The gradient vector points in the direction of steepest increase, but we want to decrease the loss, so we take a step in the opposite direction.
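To make the update concrete, here is a minimal PyTorch sketch of one descent step done by hand on a toy loss $$L(\theta) = \theta^2$$ (the toy loss is just for illustration, it is not anything from Spinning Up):

```python
import torch

theta = torch.tensor([3.0], requires_grad=True)  # theta_i
alpha = 0.1                                      # learning rate

loss = (theta ** 2).sum()   # L(theta)
loss.backward()             # theta.grad now holds dL/dtheta evaluated at theta_i

with torch.no_grad():
    theta -= alpha * theta.grad   # theta_{i+1} = theta_i - alpha * grad (note the minus!)
```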

Now, any time you use torch.optim.SGD or torch.optim.Adam, or any other algorithm from the optimizer package, you are basically using this formula: update the parameters by adding the negative gradient (times the learning rate). In other words, every torch.optim optimizer steps in the direction of the negative gradient, because it was designed to minimize the loss function.
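For example, the same toy step written with torch.optim.SGD (again just an illustration of which direction the optimizer moves, using the toy loss from above):

```python
import torch

theta = torch.tensor([3.0], requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)

loss = (theta ** 2).sum()
opt.zero_grad()
loss.backward()   # gradients are accumulated into theta.grad
opt.step()        # applies theta <- theta - lr * theta.grad, i.e. it minimizes the loss
```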

NOW!

In reinforcement learning, we want to increase the expected reward!

The story is very similar: take the parameters and update them according to the gradient, but in this case use the positive gradient!

$$\theta_{i+1} = \theta_i + \alpha \cdot \nabla J(\theta) | _{\theta_i}$$

where $$\nabla J(\theta) | _{\theta_i}$$ is the gradient of the expected reward with respect to the policy network's parameters. Note the PLUS sign: this is gradient ascent, not descent.

With the trick described in part 3 of the Spinning Up docs, we can compute a very close approximation of $$\nabla J(\theta) | _{\theta_i}$$ by taking the gradient of the log probabilities weighted by the rewards. Nice, we have a good approximation of $$\nabla J(\theta) | _{\theta_i}$$, but the problem is that when we call a torch.optim optimizer on the model parameters $$\theta$$, it will subtract this gradient, not add it. That is why we need the minus sign in the loss: subtracting the gradient of the negated objective is the same as adding the gradient of the objective (times the learning rate), which turns gradient descent into gradient ascent.
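Here is a minimal sketch of that pseudo-loss (the names logp and weights are illustrative, not necessarily identical to the Spinning Up source):

```python
import torch

def compute_loss(logp, weights):
    # logp:    log pi_theta(a_t | s_t) for each step in the collected batch
    # weights: the return R(tau) associated with each step
    #
    # The gradient of (logp * weights).mean() approximates +grad J(theta).
    # The optimizer will SUBTRACT the gradient of whatever we return, so we
    # negate it: subtracting -grad J(theta) is the same as adding +grad J(theta),
    # which turns the optimizer's descent into ascent on the expected reward.
    return -(logp * weights).mean()
```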
