Hi OpenAI developers,

Big love for this tutorial about RL! I just started studying the materials, and in the Introduction to RL, Part 1, I found a few places where the notation is somewhat inconsistent.
In the Reward and Return section, the reward function is denoted as $$R$$, while in the Bellman equation section, the reward function is denoted as $$r$$.
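For concreteness, here is how I currently read the two symbols (my own interpretation, not a quote from the docs): the uppercase $$R$$ is the reward function itself, and the lowercase $$r$$ in the Bellman equations is its value at a given step, i.e.

$$r_t = R(s_t, a_t, s_{t+1}),$$

sometimes shortened to $$r(s, a)$$. If that is the intended reading, stating it once would already resolve the ambiguity.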
In the Policies section, a (stochastic) policy $$\pi$$ is defined as a distribution over actions at time $$t$$. However, throughout the tutorials, notations like $$\tau\sim\pi$$ and $$a\sim\pi$$ are both used frequently, which creates some confusion about what distribution $$\pi$$ really is. (I can see why $$\tau\sim\pi$$ is reasonable when a trajectory $$\tau$$ is generated by a policy $$\pi$$, but that is not what the definition promises; see the sketch below.)
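To make the distinction I mean explicit (again my own sketch, using the common convention that $$\rho_0$$ is the start-state distribution and $$P$$ the transition dynamics): the definition says $$\pi$$ is a distribution over actions,

$$a_t \sim \pi(\cdot \mid s_t),$$

whereas $$\tau\sim\pi$$ only makes sense as shorthand for sampling from the trajectory distribution that $$\pi$$ induces,

$$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t).$$

A one-line remark along these lines in the text would clear up the overloaded notation.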
These are by no means errors, just notational issues that may cause confusion. I hope they can be fixed, or, if the notations are indeed used interchangeably in the literature and you decide to keep them, a short explanation would be just fine.
Thanks!