Why use V(s) instead of Q(s, a) in computing TD error in Actor-Critic agent? #30
-
I notice that when you teach SARSA and Q-learning in Chapter 2, you always use Q(s_t, a_t) when computing the TD error; the only difference is that SARSA uses Q(s_{t+1}, a_{t+1}) while Q-learning uses max_a Q(s_{t+1}, a):

SARSA: delta_t = r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
Q-learning: delta_t = r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)

HOWEVER, when you teach Actor-Critic, it seems that you use V(s) instead of Q(s_t, a_t) to compute the TD error, AND the Critic network is designed to output length 1 instead of length action_dim. Is there any specific reason for this design?

Thanks in advance.
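In code, the tabular TD errors I mean look roughly like this (a minimal sketch with made-up shapes and gamma, not the book's code):

```python
import numpy as np

# Minimal tabular sketch (num_states, num_actions, gamma are made up).
num_states, num_actions = 16, 4
gamma = 0.99
Q = np.zeros((num_states, num_actions))

def sarsa_td_error(Q, s, a, r, s_next, a_next):
    # SARSA: bootstrap on the action actually taken in the next state.
    return r + gamma * Q[s_next, a_next] - Q[s, a]

def q_learning_td_error(Q, s, a, r, s_next):
    # Q-learning: bootstrap on the greedy (max) action in the next state.
    return r + gamma * np.max(Q[s_next, :]) - Q[s, a]
```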
-
Good observation!

That's because, in the Actor-Critic agent recipe, the policy gradient is computed using the (one-step) TD error, and for this the Critic network only needs to learn the state-value function V(s), which is a scalar (length=1) for a given state. (TD Actor-Critic)

There's an equivalent policy-gradient form called Q Actor-Critic that uses the action-value function Q(s, a), but then the Critic network needs to learn Q(s, a), whose output would be a vector (length=action_dim). Due to the higher-dimensional nature of Q(s, a), it is a harder function for a neural network to approximate/learn, and in practice it's often unstable. (Q Actor-Critic)

There's yet another equivalent policy-gradient form called Advantage Actor-Critic that uses the advantage function A(s, a), but then the Critic network needs to learn both V(s) and Q(s, a), since A(s, a) = Q(s, a) - V(s). This would require two sets of Critic parameters (or a shared network with two output heads): one for approximating/learning V(s) and another for Q(s, a). (Advantage Actor-Critic)

Using the TD error (TD Actor-Critic) gives you the benefit of the advantage function without that extra cost, because the TD error r_t + gamma * V(s_{t+1}) - V(s_t) is an unbiased estimate of A(s_t, a_t). It requires only one set of Critic parameters, for the state-value function V(s), which is a scalar (length=1) for a given state, and is therefore easier for a neural network to approximate/learn and relatively stable (compared to learning Q-values).

Hope this helps.
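To make the output-shape difference concrete, here is a minimal NumPy sketch; linear functions stand in for the Critic networks, and state_dim, action_dim, and the random weights are assumptions, not the cookbook's recipe code:

```python
import numpy as np

# Illustrative stand-ins for the Critic networks (not the book's recipe code).
state_dim, action_dim = 4, 2
gamma = 0.99
rng = np.random.default_rng(0)

w_v = rng.normal(size=state_dim)                # V-critic parameters: one scalar output
W_q = rng.normal(size=(state_dim, action_dim))  # Q-critic parameters: action_dim outputs

def v(s):
    return w_v @ s          # V(s): scalar (length=1)

def q(s):
    return s @ W_q          # Q(s, .): vector (length=action_dim)

s = rng.normal(size=state_dim)
s_next = rng.normal(size=state_dim)
r, done = 1.0, False

# TD Actor-Critic: the Critic only needs V(s); the TD error doubles as an
# estimate of the advantage A(s, a) used to scale the Actor's policy gradient.
td_error = r + gamma * v(s_next) * (1.0 - done) - v(s)

# Q Actor-Critic would instead require learning the full Q(s, .) vector:
q_values = q(s)             # shape: (action_dim,)
```

The point is simply that the V-critic's output stays length 1 no matter how large action_dim is, which is what makes the TD Actor-Critic form easier and more stable to train.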