Why use V(s) instead of Q(s, a) in computing TD error in Actor-Critic agent? #30
-
I notice that when you teach SARSA and Q-learning in Chapter 2, you always use Q(s_t, a_t) when computing the TD error; the only difference is that SARSA uses Q(s_{t+1}, a_{t+1}) while Q-learning uses max_a Q(s_{t+1}, a):

SARSA: delta_t = r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
Q-learning: delta_t = r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)

HOWEVER, when you teach Actor-Critic, it seems that you use V(s) instead of Q(s_t, a_t) to compute the TD error, AND the Critic network is designed to output length 1 instead of length action_dim. Is there any specific reason for this design?

Thanks in advance.
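In code, the tabular TD errors I mean look roughly like this (a minimal sketch with made-up shapes and gamma, not the book's code):

```python
import numpy as np

# Minimal tabular sketch (num_states, num_actions, gamma are made up).
num_states, num_actions = 16, 4
gamma = 0.99
Q = np.zeros((num_states, num_actions))

def sarsa_td_error(Q, s, a, r, s_next, a_next):
    # SARSA: bootstrap on the action actually taken in the next state.
    return r + gamma * Q[s_next, a_next] - Q[s, a]

def q_learning_td_error(Q, s, a, r, s_next):
    # Q-learning: bootstrap on the greedy (max) action in the next state.
    return r + gamma * np.max(Q[s_next, :]) - Q[s, a]
```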
-
Good observation!

That's because, in the Actor-Critic agent recipe, the policy gradient is computed using the (one-step) TD error, and for this the Critic network only needs to learn the state-value function V(s), which is a scalar (length=1) for a given state. (TD Actor-Critic)

There's an equivalent policy-gradient form called Q Actor-Critic that uses the action-value function Q(s, a), but then the Critic network needs to learn Q(s, a), whose output would be a vector (length=action_dim). Due to the higher-dimensional nature of Q(s, a), it is a harder function for a neural network to approximate/learn, and in practice it's often unstable. (Q Actor-Critic)

There's yet another equivalent policy-gradient form called Advantage Actor-Critic that uses the advantage function A(s, a), but then the Critic network needs to learn both V(s) and Q(s, a), since A(s, a) = Q(s, a) - V(s). This would require two sets of Critic parameters (or a shared network with two output heads): one for approximating/learning V(s) and another for Q(s, a). (Advantage Actor-Critic)

Using the TD error (TD Actor-Critic) gives you the benefit of the advantage function without that extra cost, because the TD error r_t + gamma * V(s_{t+1}) - V(s_t) is an unbiased estimate of A(s_t, a_t). It requires only one set of Critic parameters, for the state-value function V(s), which is a scalar (length=1) for a given state, and is therefore easier for a neural network to approximate/learn and relatively stable (compared to learning Q-values).

Hope this helps.
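To make the output-shape difference concrete, here is a minimal NumPy sketch; linear functions stand in for the Critic networks, and state_dim, action_dim, and the random weights are assumptions, not the cookbook's recipe code:

```python
import numpy as np

# Illustrative stand-ins for the Critic networks (not the book's recipe code).
state_dim, action_dim = 4, 2
gamma = 0.99
rng = np.random.default_rng(0)

w_v = rng.normal(size=state_dim)                # V-critic parameters: one scalar output
W_q = rng.normal(size=(state_dim, action_dim))  # Q-critic parameters: action_dim outputs

def v(s):
    return w_v @ s          # V(s): scalar (length=1)

def q(s):
    return s @ W_q          # Q(s, .): vector (length=action_dim)

s = rng.normal(size=state_dim)
s_next = rng.normal(size=state_dim)
r, done = 1.0, False

# TD Actor-Critic: the Critic only needs V(s); the TD error doubles as an
# estimate of the advantage A(s, a) used to scale the Actor's policy gradient.
td_error = r + gamma * v(s_next) * (1.0 - done) - v(s)

# Q Actor-Critic would instead require learning the full Q(s, .) vector:
q_values = q(s)             # shape: (action_dim,)
```

The point is simply that the V-critic's output stays length 1 no matter how large action_dim is, which is what makes the TD Actor-Critic form easier and more stable to train.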