In supervised learning (so NOT reinforcement learning) the network is trained with gradient descent. This means that during optimization we want to decrease the specified loss function. The way this is done is by changing the parameters according to the negative gradient, because you want to decrease the loss. Mathematically (for the sake of example, this is the simplest SGD algorithm):

$$\theta_{i+1} = \theta_i - \alpha \, \nabla L(\theta) |_{\theta_i}$$
where $$\theta_i$$ are the parameters of the network at step $$i$$, $$\alpha$$ is the learning rate and $$\nabla L(\theta) |_{\theta_i}$$ is the gradient of the loss with respect to the parameters. Note the negative sign! This is because the gradient vector points in the direction of greatest increase, but we want to decrease the loss, so we take a step in the opposite direction.
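A minimal sketch of that update with raw tensors (the toy loss and learning rate are made up, just to show the minus sign):

```python
import torch

theta = torch.randn(3, requires_grad=True)   # parameters theta_i
alpha = 0.1                                  # learning rate

loss = (theta ** 2).sum()   # some differentiable loss L(theta)
loss.backward()             # populates theta.grad with dL/dtheta

with torch.no_grad():
    theta -= alpha * theta.grad   # theta_{i+1} = theta_i - alpha * grad  (note the minus)
    theta.grad.zero_()
```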
Now, anytime you use torch.optim.SGD or torch.optim.Adam, or any algorithm from the torch.optim package, you basically use this formula: update the parameters by adding the negative gradient (times the learning rate). In other words, any of the torch.optim optimizers will apply the negative gradient, because they were designed to minimize the loss function.
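The same update via the optimizer, again with a made-up toy loss; `opt.step()` performs the subtraction for you:

```python
import torch

theta = torch.randn(3, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)

loss = (theta ** 2).sum()
opt.zero_grad()
loss.backward()
opt.step()   # internally: theta <- theta - lr * theta.grad, i.e. descent
```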
NOW!
In reinforcement learning, we want to increase the expected reward!
The story is very similar: take the parameters and update them according to the gradient, but in this case use the positive gradient:

$$\theta_{i+1} = \theta_i + \alpha \, \nabla J(\theta) |_{\theta_i}$$
where $$\nabla J(\theta) |_{\theta_i}$$ is the derivative of the expected reward with respect to the policy network's parameters. Note the PLUS sign: this is gradient ascent, not descent.
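Done by hand, ascent just flips the sign of the step (the toy objective below is made up; in RL you never have $$J(\theta)$$ in closed form like this):

```python
import torch

theta = torch.randn(3, requires_grad=True)
alpha = 0.1

J = -(theta - 2.0).pow(2).sum()   # toy objective we want to MAXIMIZE
J.backward()                      # theta.grad now holds dJ/dtheta

with torch.no_grad():
    theta += alpha * theta.grad   # theta_{i+1} = theta_i + alpha * grad  (note the plus: ascent)
    theta.grad.zero_()
```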
With the trick described in part 3 of Spinning Up's docs, we can calculate a very close approximation of $$\nabla J(\theta) |_{\theta_i}$$ by multiplying the derivative of the log probabilities with the rewards. Nice, we have a good approximation of $$\nabla J(\theta) |_{\theta_i}$$, but the problem is that when we call a torch.optim optimizer on the model parameters $$\theta$$, it will subtract this gradient, not add it. Thus we need a minus sign there, so that the optimizer ends up adding the gradient (times the learning rate), which turns the update into gradient ascent instead of descent.
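A sketch of how the two minus signs cancel in practice. The tiny `policy` network, the batch of observations, actions, and returns are all made up for illustration; the point is only the sign of the pseudo-loss:

```python
import torch
from torch.distributions import Categorical

# Hypothetical policy network and batch data, just to illustrate the sign.
policy = torch.nn.Linear(4, 2)                  # maps observations to action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

obs = torch.randn(8, 4)                         # batch of observations
acts = torch.randint(0, 2, (8,))                # actions that were taken
rets = torch.randn(8)                           # returns/weights for those actions

logp = Categorical(logits=policy(obs)).log_prob(acts)

# "Pseudo-loss": its gradient is (an estimate of) -grad J(theta).
# The optimizer SUBTRACTS this gradient, so the two minus signs cancel
# and the net effect on theta is gradient ASCENT on J.
loss = -(logp * rets).mean()

opt.zero_grad()
loss.backward()
opt.step()
```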
I don't understand why there's a minus sign here when there's no minus sign in my formula.