diff --git a/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4 b/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4
index c9ae206f8..18c761662 100644
Binary files a/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4 and b/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4 differ
diff --git a/docs/basics/DeepRLTutorial/TutorialDeepRL.md b/docs/basics/DeepRLTutorial/TutorialDeepRL.md
index aae3033e5..939e11447 100644
--- a/docs/basics/DeepRLTutorial/TutorialDeepRL.md
+++ b/docs/basics/DeepRLTutorial/TutorialDeepRL.md
@@ -12,8 +12,8 @@ Imports
```python
from rlberry.envs import gym_make
from rlberry.manager import plot_writer_data, ExperimentManager, evaluate_agents
-from rlberry_research.agents.torch import A2CAgent
-from rlberry_research.agents.torch.utils.training import model_factory_from_env
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from stable_baselines3 import PPO
```
Reminder of the RL setting
@@ -48,9 +48,9 @@ In this tutorial we are going to use the [Gymnasium library (previously
OpenAI's Gym)](https://gymnasium.farama.org/api/env/). This library
provides a large number of environments to test RL algorithm.
-We will focus only on the **CartPole-v1** environment, although we
-recommend experimenting with other environments such as **Acrobot-v1**
-and **MountainCar-v0**. The following table presents some basic
+We will focus only on the **Acrobot-v1** environment, although you can
+experiment with other environments such as **CartPole-v1**
+or **MountainCar-v0**. The following table presents some basic
components of the three environments, such as the dimensions of their
observation and action spaces and the rewards occurring at each step.
@@ -60,92 +60,33 @@ observation and action spaces and the rewards occurring at each step.
| **Action Space** | Discrete(2)| Discrete(3) | Discrete(3) |
| **Rewards** | 1 per step | -1 if not terminal else 0 | -1 per step |
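+
+If you want to check these numbers yourself, the minimal snippet below (using the `gym_make` helper imported above) prints the observation and action spaces of the environment used in this tutorial:
+
+```python
+from rlberry.envs import gym_make
+
+# Instantiate the tutorial environment and inspect its spaces.
+env = gym_make(id="Acrobot-v1")
+print(env.observation_space)  # a 6-dimensional Box for Acrobot-v1
+print(env.action_space)  # Discrete(3)
+```
+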
-Actor-Critic algorithms and A2C
--------------------------------
-**Actor-Critic algorithms** methods consist of two models, which may
-optionally share parameters:
-
-- Critic updates the value function parameters w and depending on the
-algorithm it could be action-value $Q_{\varphi}(s,a )$ or state-value
-$V_{\varphi}(s)$.
-- Actor updates the policy parameters $\theta$ for
-$\pi_{\theta}(a \mid s)$, in the direction suggested by the critic.
-
-**A2C** is an Actor-Critic algorithm and it is part of the on-policy
-family, which means that we are learning the value function for one
-policy while following it. The original paper in which it was proposed
-can be found [here](https://arxiv.org/pdf/1602.01783.pdf) and the
-pseudocode of the algorithm is the following:
-
-- Initialize the actor $\pi_{\theta}$ and the critic $V_{\varphi}$
- with random weights.
-- Observe the initial state $s_{0}$.
-- for $t \in\left[0, T_{\text {total }}\right]$ :
- - Initialize empty episode minibatch.
- - for $k \in[0, n]:$ \# Sample episode
- - Select a action $a_{k}$ using the actor $\pi_{\theta}$.
- - Perform the action $a_{k}$ and observe the next state
- $s_{k+1}$ and the reward $r_{k+1}$.
- - Store $\left(s_{k}, a_{k}, r_{k+1}\right)$ in the episode
- minibatch.
- - if $s_{n}$ is not terminal: set
- $R=V_{\varphi}\left(s_{n}\right)$ with the critic, else $R=0$.
- - Reset gradient $d \theta$ and $d \varphi$ to 0 .
- - for $k \in[n-1,0]$ : \# Backwards iteration over the episode
- - Update the discounted sum of rewards
- $R \leftarrow r_{k}+\gamma R$
-
- - Accumulate the policy gradient using the critic:
-
- $$d \theta \leftarrow d \theta+\nabla_{\theta} \log \pi_{\theta}\left(a_{k}\mid s_{k}\right)\left(R-V_{\varphi}\left(s_{k}\right)\right)$$
-
- - Accumulate the critic gradient:
-
-$$d \varphi \leftarrow d \varphi+\nabla_{\varphi}\left(R-V_{\varphi}\left(s_{k}\right)\right)^{2}$$
-
-- Update the actor and the critic with the accumulated gradients using
- gradient descent or similar:
-
-$$\theta \leftarrow \theta+\eta d \theta \quad \varphi \leftarrow \varphi+\eta d \varphi$$
-
-Running A2C on CartPole
+Running PPO on Acrobot-v1
-----------------------
-⚠ **warning :** depending on the seed, you may get different results, and if you're (un)lucky, your default agent may learn and be better than the tuned agent. ⚠
+⚠ **warning:** depending on the seed, you may get different results. ⚠
-In the next example we use default parameters for both the Actor and the
-Critic and we use rlberry to train and evaluate our A2C agent. The
-default networks are:
-
-- a dense neural network with two hidden layers of 64 units for the
- **Actor**, the input layer has the dimension of the state space
- while the output layer has the dimension of the action space. The
- activations are RELU functions and we have a softmax in the last
- layer.
-- a dense neural network with two hidden layers of 64 units for the
- **Critic**, the input layer has the dimension of the state space
- while the output has dimension 1. The activations are RELU functions
- apart from the last layer that has a linear activation.
+In the next example we use PPO with its default parameters, and we use rlberry to train and evaluate the [Stable Baselines](https://github.com/DLR-RM/stable-baselines3) PPO agent.
```python
"""
The ExperimentManager class is a compact way of experimenting with a deepRL agent.
"""
default_xp = ExperimentManager(
- A2CAgent, # The Agent class.
- (gym_make, dict(id="CartPole-v1")), # The Environment to solve.
- fit_budget=3e5, # The number of interactions
+ StableBaselinesAgent, # The Agent class.
+ (gym_make, dict(id="Acrobot-v1")), # The Environment to solve.
+ fit_budget=1e5, # The number of interactions
# between the agent and the
# environment during training.
+ init_kwargs=dict(algo_cls=PPO), # Init value for StableBaselinesAgent
eval_kwargs=dict(eval_horizon=500), # The number of interactions
# between the agent and the
# environment during evaluations.
- n_fit=1, # The number of agents to train.
+ n_fit=3, # The number of agents to train.
# Usually, it is good to do more
# than 1 because the training is
# stochastic.
- agent_name="A2C default", # The agent's name.
+ agent_name="PPO default", # The agent's name.
)
print("Training ...")
@@ -155,76 +96,27 @@ default_xp.fit() # Trains the agent on fit_budget steps!
# Plot the training data:
_ = plot_writer_data(
[default_xp],
- tag="episode_rewards",
+ tag="rollout/ep_rew_mean",
title="Training Episode Cumulative Rewards",
show=True,
)
```
-```none
-[INFO] Running ExperimentManager fit() for A2C default with n_fit = 1 and max_workers = None.
-INFO: Making new env: CartPole-v1
-INFO: Making new env: CartPole-v1
-[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
-```
-
-
-
```none
Training ...
-```
-
-
-
-```none
-[INFO] [A2C default[worker: 0]] | max_global_step = 5644 |episode_rewards = 196.0 | total_episodes = 111 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 9551 | episode_rewards = 380.0 | total_episodes = 134 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 13128 | episode_rewards = 125.0 | total_episodes = 182 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 16617 | episode_rewards = 246.0 | total_episodes = 204 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 20296 | episode_rewards = 179.0 | total_episodes = 222 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 23633 | episode_rewards = 120.0 | total_episodes = 240 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 26193 | episode_rewards = 203.0 | total_episodes = 252 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 28969 | episode_rewards = 104.0 | total_episodes = 271 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 34757 | episode_rewards = 123.0 | total_episodes = 335 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 41554 | episode_rewards = 173.0 | total_episodes = 373 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 48418 | episode_rewards = 217.0 | total_episodes = 423 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 55322 | episode_rewards = 239.0 | total_episodes = 446 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 62193 | episode_rewards = 218.0 | total_episodes = 471 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 69233 | episode_rewards = 377.0 | total_episodes = 509 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 76213 | episode_rewards = 211.0 | total_episodes = 536 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 83211 | episode_rewards = 212.0 | total_episodes = 562 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 90325 | episode_rewards = 211.0 | total_episodes = 586 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 97267 | episode_rewards = 136.0 | total_episodes = 631 | [INFO] [A2C default[worker: 0]] | max_global_step = 104280 | episode_rewards = 175.0 | total_episodes = 686 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 111194 | episode_rewards = 258.0 | total_episodes = 722 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 118067 | episode_rewards = 235.0 | total_episodes = 755 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 125040 | episode_rewards = 500.0 | total_episodes = 777 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 132478 | episode_rewards = 500.0 | total_episodes = 792 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 139591 | episode_rewards = 197.0 | total_episodes = 813 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 146462 | episode_rewards = 500.0 | total_episodes = 835 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 153462 | episode_rewards = 500.0 | total_episodes = 849 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 160462 | episode_rewards = 500.0 | total_episodes = 863 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 167462 | episode_rewards = 500.0 | total_episodes = 877 | [INFO] [A2C default[worker: 0]] | max_global_step = 174462 | episode_rewards = 500.0 | total_episodes = 891 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 181462 | episode_rewards = 500.0 | total_episodes = 905 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 188462 | episode_rewards = 500.0 | total_episodes = 919 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 195462 | episode_rewards = 500.0 | total_episodes = 933 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 202520 | episode_rewards = 206.0 | total_episodes = 957 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 209932 | episode_rewards = 500.0 | total_episodes = 978 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 216932 | episode_rewards = 500.0 | total_episodes = 992 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 223932 | episode_rewards = 500.0 | total_episodes = 1006 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 230916 | episode_rewards = 214.0 | total_episodes = 1024 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 235895 | episode_rewards = 500.0 | total_episodes = 1037 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 242782 | episode_rewards = 118.0 | total_episodes = 1072 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 249695 | episode_rewards = 131.0 | total_episodes = 1111 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 256649 | episode_rewards = 136.0 | total_episodes = 1160 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 263674 | episode_rewards = 100.0 | total_episodes = 1215 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 270727 | episode_rewards = 136.0 | total_episodes = 1279 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 277588 | episode_rewards = 275.0 | total_episodes = 1313 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 284602 | episode_rewards = 136.0 | total_episodes = 1353 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 291609 | episode_rewards = 117.0 | total_episodes = 1413 |
-[INFO] [A2C default[worker: 0]] | max_global_step = 298530 | episode_rewards = 147.0 | total_episodes = 1466 |
-[INFO] ... trained!
-INFO: Making new env: CartPole-v1 INFO: Making new env: CartPole-v1
-[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
+[INFO] 09:31: Running ExperimentManager fit() for PPO default with n_fit = 3 and max_workers = None.
+[INFO] 09:31: [PPO default[worker: 0]] | max_global_step = 4096 | time/iterations = 1 | rollout/ep_rew_mean = -500.0 | rollout/ep_len_mean = 500.0 | time/fps = 791 | time/time_elapsed = 2 | time/total_timesteps = 2048 | train/learning_rate = 0.0003 |
+[INFO] 09:31: [PPO default[worker: 1]] | max_global_step = 4096 | time/iterations = 1 | rollout/ep_rew_mean = -500.0 | rollout/ep_len_mean = 500.0 | time/fps = 741 | time/time_elapsed = 2 | time/total_timesteps = 2048 | train/learning_rate = 0.0003 |
+[INFO] 09:31: [PPO default[worker: 2]] | max_global_step = 4096 | time/iterations = 1 | rollout/ep_rew_mean = -500.0 | rollout/ep_len_mean = 500.0 | time/fps = 751 | time/time_elapsed = 2 | time/total_timesteps = 2048 | train/learning_rate = 0.0003 |
+[INFO] 09:32: [PPO default[worker: 0]] | max_global_step = 6144 | time/iterations = 2 | rollout/ep_rew_mean = -500.0 | rollout/ep_len_mean = 500.0 | time/fps = 617 | time/time_elapsed = 6 | time/total_timesteps = 4096 | train/learning_rate = 0.0003 | train/entropy_loss = -1.0967000976204873 | train/policy_gradient_loss = -0.0017652213326073251 | train/value_loss = 139.4249062538147 | train/approx_kl = 0.004285778850317001 | train/clip_fraction = 0.0044921875 | train/loss = 16.845857620239258 | train/explained_variance = -0.0011605024337768555 | train/n_updates = 10 | train/clip_range = 0.2 |
+...
+...
+...
+[INFO] 09:35: [PPO default[worker: 1]] | max_global_step = 100352 | time/iterations = 48 | rollout/ep_rew_mean = -89.81 | rollout/ep_len_mean = 90.8 | time/fps = 486 | time/time_elapsed = 202 | time/total_timesteps = 98304 | train/learning_rate = 0.0003 | train/entropy_loss = -0.19921453138813378 | train/policy_gradient_loss = -0.002730156043253373 | train/value_loss = 21.20977843105793 | train/approx_kl = 0.0014179411809891462 | train/clip_fraction = 0.017626953125 | train/loss = 9.601455688476562 | train/explained_variance = 0.8966712430119514 | train/n_updates = 470 | train/clip_range = 0.2 |
+[INFO] 09:35: [PPO default[worker: 0]] | max_global_step = 100352 | time/iterations = 48 | rollout/ep_rew_mean = -83.22 | rollout/ep_len_mean = 84.22 | time/fps = 486 | time/time_elapsed = 202 | time/total_timesteps = 98304 | train/learning_rate = 0.0003 | train/entropy_loss = -0.14615743807516993 | train/policy_gradient_loss = -0.002418491238495335 | train/value_loss = 22.7100858271122 | train/approx_kl = 0.0006727844011038542 | train/clip_fraction = 0.010546875 | train/loss = 8.74121379852295 | train/explained_variance = 0.8884317129850388 | train/n_updates = 470 | train/clip_range = 0.2 |
+[INFO] 09:35: ... trained!
+[INFO] 09:35: Saved ExperimentManager(PPO default) using pickle.
+[INFO] 09:35: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO default_2024-04-24_09-31-51_be15b329/manager_obj.pickle'
```
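+
+Note that the logging tags now come from Stable Baselines (e.g. `rollout/ep_rew_mean`) rather than rlberry's own `episode_rewards`. If you are unsure which tags were recorded, one way to list them is to read the writer data directly; the sketch below assumes the DataFrame returned by `read_writer_data` has a `tag` column (check the API reference if it differs).
+
+```python
+from rlberry.manager import read_writer_data
+
+# Gather the training logs of the fitted agents into a pandas DataFrame
+# and list the available tags (assumed column name: "tag").
+writer_df = read_writer_data(default_xp)
+print(writer_df["tag"].unique())
+```
+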
@@ -241,69 +133,11 @@ _ = evaluate_agents(
# 10 simulations of 500 steps each.
```
-```none
-[INFO] Evaluating A2C default...
-```
-
-
```none
Evaluating ...
-```
-
-
-
-```none
-[INFO][eval]... simulation 1/50
-[INFO][eval]... simulation 2/50
-[INFO][eval]... simulation 3/50
-[INFO][eval]... simulation 4/50
-[INFO][eval]... simulation 5/50
-[INFO][eval]... simulation 6/50
-[INFO][eval]... simulation 7/50
-[INFO][eval]... simulation 8/50
-[INFO][eval]... simulation 9/50
-[INFO][eval]... simulation 10/50
-[INFO][eval]... simulation 11/50
-[INFO][eval]... simulation 12/50
-[INFO][eval]... simulation 13/50
-[INFO][eval]... simulation 14/50
-[INFO][eval]... simulation 15/50
-[INFO][eval]... simulation 16/50
-[INFO][eval]... simulation 17/50
-[INFO][eval]... simulation 18/50
-[INFO][eval]... simulation 19/50
-[INFO][eval]... simulation 20/50
-[INFO][eval]... simulation 21/50
-[INFO][eval]... simulation 22/50
-[INFO][eval]... simulation 23/50
-[INFO][eval]... simulation 24/50
-[INFO][eval]... simulation 25/50
-[INFO][eval]... simulation 26/50
-[INFO][eval]... simulation 27/50
-[INFO][eval]... simulation 28/50
-[INFO][eval]... simulation 29/50
-[INFO][eval]... simulation 30/50
-[INFO][eval]... simulation 31/50
-[INFO][eval]... simulation 32/50
-[INFO][eval]... simulation 33/50
-[INFO][eval]... simulation 34/50
-[INFO][eval]... simulation 35/50
-[INFO][eval]... simulation 36/50
-[INFO][eval]... simulation 37/50
-[INFO][eval]... simulation 38/50
-[INFO][eval]... simulation 39/50
-[INFO][eval]... simulation 40/50
-[INFO][eval]... simulation 41/50
-[INFO][eval]... simulation 42/50
-[INFO][eval]... simulation 43/50
-[INFO][eval]... simulation 44/50
-[INFO][eval]... simulation 45/50
-[INFO][eval]... simulation 46/50
-[INFO][eval]... simulation 47/50
-[INFO][eval]... simulation 48/50
-[INFO][eval]... simulation 49/50
-[INFO][eval]... simulation 50/50
+[INFO] 09:36: Evaluating PPO default...
+[INFO] Evaluation:.................................................. Evaluation finished
```
@@ -313,57 +147,42 @@ Evaluating ...
:align: center
```
-Let's try to change the neural networks' architectures and see if we can
-beat our previous result. This time we use a smaller learning rate and
-bigger batch size to have more stable training.
+Let's try to change the hyperparameters and see whether that changes the previous result.
+
+⚠ **warning:** The aim of this section is to show that hyperparameters have an effect on agent training, and to demonstrate that it is possible to modify them.
+
+For pedagogical purposes, since the default hyperparameters already work well on these simple environments, we'll compare the default agent with an agent deliberately tuned with bad hyperparameters, which degrades the results. Obviously, in practical cases, the aim is to find hyperparameters that improve results... not degrade them. ⚠
+
-```python
-policy_configs = {
- "type": "MultiLayerPerceptron", # A network architecture
- "layer_sizes": (64, 64), # Network dimensions
- "reshape": False,
- "is_policy": True, # The network should output a distribution
- # over actions
-}
-
-critic_configs = {
- "type": "MultiLayerPerceptron",
- "layer_sizes": (64, 64),
- "reshape": False,
- "out_size": 1, # The critic network is an approximator of
- # a value function V: States -> |R
-}
-```
```python
tuned_xp = ExperimentManager(
- A2CAgent, # The Agent class.
- (gym_make, dict(id="CartPole-v1")), # The Environment to solve.
+ StableBaselinesAgent, # The Agent class.
+ (gym_make, dict(id="Acrobot-v1")), # The Environment to solve.
init_kwargs=dict( # Where to put the agent's hyperparameters
- policy_net_fn=model_factory_from_env, # A policy network constructor
- policy_net_kwargs=policy_configs, # Policy network's architecure
- value_net_fn=model_factory_from_env, # A Critic network constructor
- value_net_kwargs=critic_configs, # Critic network's architecure.
- optimizer_type="ADAM", # What optimizer to use for policy
+ algo_cls=PPO,
-        # gradient descent steps.
-        learning_rate=1e-3, # Size of the policy gradient
-        # descent steps.
- entr_coef=0.0, # How much to force exploration.
- batch_size=1024 # Number of interactions used to
- # estimate the policy gradient
- # for each policy update.
+ ent_coef=0.10, # How much to force exploration.
+        normalize_advantage=False, # Whether to normalize the advantage
+        gae_lambda=0.90, # Factor for trade-off of bias vs variance for Generalized Advantage Estimator
+        n_epochs=20, # Number of epochs when optimizing the surrogate loss
+        n_steps=64, # The number of steps to run for each environment per update
+ learning_rate=1e-3,
+ batch_size=32,
),
- fit_budget=3e5, # The number of interactions
+ fit_budget=1e5, # The number of interactions
# between the agent and the
# environment during training.
eval_kwargs=dict(eval_horizon=500), # The number of interactions
# between the agent and the
# environment during evaluations.
- n_fit=1, # The number of agents to train.
+ n_fit=3, # The number of agents to train.
# Usually, it is good to do more
# than 1 because the training is
# stochastic.
- agent_name="A2C tuned", # The agent's name.
+ agent_name="PPO incorrectly tuned", # The agent's name.
)
@@ -374,75 +193,33 @@ tuned_xp.fit() # Trains the agent on fit_budget steps!
# Plot the training data:
_ = plot_writer_data(
[default_xp, tuned_xp],
- tag="episode_rewards",
+ tag="rollout/ep_rew_mean",
title="Training Episode Cumulative Rewards",
show=True,
)
```
```none
-[INFO] Running ExperimentManager fit() for A2C tuned with n_fit = 1
-and max_workers = None.
-INFO: Making new env: CartPole-v1
-INFO: Making new env: CartPole-v1
-[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
+Training ...
```
```none
-Training ...
+[INFO] 09:37: Running ExperimentManager fit() for PPO incorrectly tuned with n_fit = 3 and max_workers = None.
```
```none
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 6777 | episode_rewards = 15.0 | total_episodes = 314 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 13633 | episode_rewards = 14.0 | total_episodes = 602 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 20522 | episode_rewards = 41.0 | total_episodes = 854 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 27531 | episode_rewards = 13.0 | total_episodes = 1063 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 34398 | episode_rewards = 42.0 | total_episodes = 1237 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 41600 | episode_rewards = 118.0 | total_episodes = 1389 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 48593 | episode_rewards = 50.0 | total_episodes = 1511 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 55721 | episode_rewards = 113.0 | total_episodes = 1603 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 62751 | episode_rewards = 41.0 | total_episodes = 1687 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 69968 | episode_rewards = 344.0 | total_episodes = 1741 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 77259 | episode_rewards = 418.0 | total_episodes = 1787 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 84731 | episode_rewards = 293.0 | total_episodes = 1820 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 91890 | episode_rewards = 185.0 | total_episodes = 1853 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 99031 | episode_rewards = 278.0 | total_episodes = 1876 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 106305 | episode_rewards = 318.0 | total_episodes = 1899 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 113474 | episode_rewards = 500.0 | total_episodes = 1921 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 120632 | episode_rewards = 370.0 | total_episodes = 1941 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 127753 | episode_rewards = 375.0 | total_episodes = 1962 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 135179 | episode_rewards = 393.0 | total_episodes = 1987 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 142433 | episode_rewards = 500.0 | total_episodes = 2005 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 149888 | episode_rewards = 500.0 | total_episodes = 2023 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 157312 | episode_rewards = 467.0 | total_episodes = 2042 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 164651 | episode_rewards = 441.0 | total_episodes = 2060 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 172015 | episode_rewards = 500.0 | total_episodes = 2076 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 178100 | episode_rewards = 481.0 | total_episodes = 2089 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 183522 | episode_rewards = 462.0 | total_episodes = 2101 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 190818 | episode_rewards = 500.0 | total_episodes = 2117 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 198115 | episode_rewards = 500.0 | total_episodes = 2135 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 205097 | episode_rewards = 500.0 | total_episodes = 2151 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 212351 | episode_rewards = 500.0 | total_episodes = 2166 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 219386 | episode_rewards = 500.0 | total_episodes = 2181 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 226386 | episode_rewards = 500.0 | total_episodes = 2195 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 233888 | episode_rewards = 500.0 | total_episodes = 2211 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 241388 | episode_rewards = 500.0 | total_episodes = 2226 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 248287 | episode_rewards = 500.0 | total_episodes = 2240 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 255483 | episode_rewards = 500.0 | total_episodes = 2255 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 262845 | episode_rewards = 500.0 | total_episodes = 2270 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 270032 | episode_rewards = 500.0 | total_episodes = 2285 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 277009 | episode_rewards = 498.0 | total_episodes = 2301 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 284044 | episode_rewards = 255.0 | total_episodes = 2318 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 291189 | episode_rewards = 500.0 | total_episodes = 2334 |
-[INFO] [A2C tuned[worker: 0]] | max_global_step = 298619 | episode_rewards = 500.0 | total_episodes = 2350 |
-[INFO] ... trained!
-INFO: Making new env: CartPole-v1
-INFO: Making new env: CartPole-v1
-[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
+[INFO] 09:37: [PPO incorrectly tuned[worker: 1]] | max_global_step = 832 | time/iterations = 12 | time/fps = 260 | time/time_elapsed = 2 | time/total_timesteps = 768 | train/learning_rate = 0.001 | train/entropy_loss = -0.9725531369447709 | train/policy_gradient_loss = 5.175539326667786 | train/value_loss = 17.705344581604002 | train/approx_kl = 0.028903376311063766 | train/clip_fraction = 0.33828125 | train/loss = 8.651824951171875 | train/explained_variance = 0.03754150867462158 | train/n_updates = 220 | train/clip_range = 0.2 | rollout/ep_rew_mean = -251.0 | rollout/ep_len_mean = 252.0 |
+[INFO] 09:37: [PPO incorrectly tuned[worker: 2]] | max_global_step = 832 | time/iterations = 12 | time/fps = 260 | time/time_elapsed = 2 | time/total_timesteps = 768 | train/learning_rate = 0.001 | train/entropy_loss = -1.0311604633927345 | train/policy_gradient_loss = 5.122353088855744 | train/value_loss = 18.54480469226837 | train/approx_kl = 0.02180374786257744 | train/clip_fraction = 0.359375 | train/loss = 9.690193176269531 | train/explained_variance = -0.00020706653594970703 | train/n_updates = 220 | train/clip_range = 0.2 | rollout/ep_rew_mean = -500.0 | rollout/ep_len_mean = 500.0 |
+...
+...
+...
+[INFO] 09:45: ... trained!
+[INFO] 09:45: Saved ExperimentManager(PPO incorrectly tuned) using pickle.
+[INFO] 09:45: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO incorrectly tuned_2024-04-24_09-37-32_33d1646b/manager_obj.pickle'
```
@@ -451,6 +228,7 @@ INFO: Making new env: CartPole-v1
```{image} output_9_3.png
:align: center
```
+Here, we can see that modifying the hyperparameters has changed the learning process (for the worse): the agent learns faster at first, but its final performance is lower...
☀ : For more information on plots and visualization, you can check [here (in construction)](visualization_page)
@@ -470,108 +248,10 @@ Evaluating ...
```none
-[INFO] Evaluating A2C default...
-[INFO] [eval]... simulation 1/50
-[INFO] [eval]... simulation 2/50
-[INFO] [eval]... simulation 3/50
-[INFO] [eval]... simulation 4/50
-[INFO] [eval]... simulation 5/50
-[INFO] [eval]... simulation 6/50
-[INFO] [eval]... simulation 7/50
-[INFO] [eval]... simulation 8/50
-[INFO] [eval]... simulation 9/50
-[INFO] [eval]... simulation 10/50
-[INFO] [eval]... simulation 11/50
-[INFO] [eval]... simulation 12/50
-[INFO] [eval]... simulation 13/50
-[INFO] [eval]... simulation 14/50
-[INFO] [eval]... simulation 15/50
-[INFO] [eval]... simulation 16/50
-[INFO] [eval]... simulation 17/50
-[INFO] [eval]... simulation 18/50
-[INFO] [eval]... simulation 19/50
-[INFO] [eval]... simulation 20/50
-[INFO] [eval]... simulation 21/50
-[INFO] [eval]... simulation 22/50
-[INFO] [eval]... simulation 23/50
-[INFO] [eval]... simulation 24/50
-[INFO] [eval]... simulation 25/50
-[INFO] [eval]... simulation 26/50
-[INFO] [eval]... simulation 27/50
-[INFO] [eval]... simulation 28/50
-[INFO] [eval]... simulation 29/50
-[INFO] [eval]... simulation 30/50
-[INFO] [eval]... simulation 31/50
-[INFO] [eval]... simulation 32/50
-[INFO] [eval]... simulation 33/50
-[INFO] [eval]... simulation 34/50
-[INFO] [eval]... simulation 35/50
-[INFO] [eval]... simulation 36/50
-[INFO] [eval]... simulation 37/50
-[INFO] [eval]... simulation 38/50
-[INFO] [eval]... simulation 39/50
-[INFO] [eval]... simulation 40/50
-[INFO] [eval]... simulation 41/50
-[INFO] [eval]... simulation 42/50
-[INFO] [eval]... simulation 43/50
-[INFO] [eval]... simulation 44/50
-[INFO] [eval]... simulation 45/50
-[INFO] [eval]... simulation 46/50
-[INFO] [eval]... simulation 47/50
-[INFO] [eval]... simulation 48/50
-[INFO] [eval]... simulation 49/50
-[INFO] [eval]... simulation 50/50
-[INFO] Evaluating A2C tuned...
-[INFO] [eval]... simulation 1/50
-[INFO] [eval]... simulation 2/50
-[INFO] [eval]... simulation 3/50
-[INFO] [eval]... simulation 4/50
-[INFO] [eval]... simulation 5/50
-[INFO] [eval]... simulation 6/50
-[INFO] [eval]... simulation 7/50
-[INFO] [eval]... simulation 8/50
-[INFO] [eval]... simulation 9/50
-[INFO] [eval]... simulation 10/50
-[INFO] [eval]... simulation 11/50
-[INFO] [eval]... simulation 12/50
-[INFO] [eval]... simulation 13/50
-[INFO] [eval]... simulation 14/50
-[INFO] [eval]... simulation 15/50
-[INFO] [eval]... simulation 16/50
-[INFO] [eval]... simulation 17/50
-[INFO] [eval]... simulation 18/50
-[INFO] [eval]... simulation 19/50
-[INFO] [eval]... simulation 20/50
-[INFO] [eval]... simulation 21/50
-[INFO] [eval]... simulation 22/50
-[INFO] [eval]... simulation 23/50
-[INFO] [eval]... simulation 24/50
-[INFO] [eval]... simulation 25/50
-[INFO] [eval]... simulation 26/50
-[INFO] [eval]... simulation 27/50
-[INFO] [eval]... simulation 28/50
-[INFO] [eval]... simulation 29/50
-[INFO] [eval]... simulation 30/50
-[INFO] [eval]... simulation 31/50
-[INFO] [eval]... simulation 32/50
-[INFO] [eval]... simulation 33/50
-[INFO] [eval]... simulation 34/50
-[INFO] [eval]... simulation 35/50
-[INFO] [eval]... simulation 36/50
-[INFO] [eval]... simulation 37/50
-[INFO] [eval]... simulation 38/50
-[INFO] [eval]... simulation 39/50
-[INFO] [eval]... simulation 40/50
-[INFO] [eval]... simulation 41/50
-[INFO] [eval]... simulation 42/50
-[INFO] [eval]... simulation 43/50
-[INFO] [eval]... simulation 44/50
-[INFO] [eval]... simulation 45/50
-[INFO] [eval]... simulation 46/50
-[INFO] [eval]... simulation 47/50
-[INFO] [eval]... simulation 48/50
-[INFO] [eval]... simulation 49/50
-[INFO] [eval]... simulation 50/50
+[INFO] 09:47: Evaluating PPO default...
+[INFO] Evaluation:.................................................. Evaluation finished
+[INFO] 09:47: Evaluating PPO incorrectly tuned...
+[INFO] Evaluation:.................................................. Evaluation finished
```
diff --git a/docs/basics/DeepRLTutorial/output_10_3.png b/docs/basics/DeepRLTutorial/output_10_3.png
index 8a6c39010..6981fb863 100644
Binary files a/docs/basics/DeepRLTutorial/output_10_3.png and b/docs/basics/DeepRLTutorial/output_10_3.png differ
diff --git a/docs/basics/DeepRLTutorial/output_5_3.png b/docs/basics/DeepRLTutorial/output_5_3.png
index ebc942831..b63d693d2 100644
Binary files a/docs/basics/DeepRLTutorial/output_5_3.png and b/docs/basics/DeepRLTutorial/output_5_3.png differ
diff --git a/docs/basics/DeepRLTutorial/output_6_3.png b/docs/basics/DeepRLTutorial/output_6_3.png
index cebfcbd05..c9139d724 100644
Binary files a/docs/basics/DeepRLTutorial/output_6_3.png and b/docs/basics/DeepRLTutorial/output_6_3.png differ
diff --git a/docs/basics/DeepRLTutorial/output_9_3.png b/docs/basics/DeepRLTutorial/output_9_3.png
index 80a69998b..44a4d2ab3 100644
Binary files a/docs/basics/DeepRLTutorial/output_9_3.png and b/docs/basics/DeepRLTutorial/output_9_3.png differ
diff --git a/docs/basics/comparison.md b/docs/basics/comparison.md
index 75041a9c0..38d8194da 100644
--- a/docs/basics/comparison.md
+++ b/docs/basics/comparison.md
@@ -48,7 +48,8 @@ We compute the performances of one agent as follows:
```python
import numpy as np
from rlberry.envs import gym_make
-from rlberry.agents.torch import A2CAgent
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from stable_baselines3 import A2C
from rlberry.manager import AgentManager, evaluate_agents
env_ctor = gym_make
@@ -58,8 +59,9 @@ n_simulations = 50
n_fit = 8
rbagent = AgentManager(
- A2CAgent,
+ StableBaselinesAgent,
(env_ctor, env_kwargs),
+ init_kwargs=dict(algo_cls=A2C), # Init value for StableBaselinesAgent
agent_name="A2CAgent",
fit_budget=3e4,
eval_kwargs=dict(eval_horizon=500),
@@ -78,32 +80,36 @@ The evaluation and statistical hypothesis testing is handled through the functio
For example we may compare PPO, A2C and DQNAgent on Cartpole with the following code.
-``` python
-from rlberry.agents.torch import A2CAgent, PPOAgent, DQNAgent
+```python
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from stable_baselines3 import A2C, PPO, DQN
from rlberry.manager.comparison import compare_agents
agents = [
AgentManager(
- A2CAgent,
+ StableBaselinesAgent,
(env_ctor, env_kwargs),
+ init_kwargs=dict(algo_cls=A2C), # Init value for StableBaselinesAgent
agent_name="A2CAgent",
- fit_budget=3e4,
+ fit_budget=1e5,
eval_kwargs=dict(eval_horizon=500),
n_fit=n_fit,
),
AgentManager(
- PPOAgent,
+ StableBaselinesAgent,
(env_ctor, env_kwargs),
+ init_kwargs=dict(algo_cls=PPO), # Init value for StableBaselinesAgent
agent_name="PPOAgent",
- fit_budget=3e4,
+ fit_budget=1e5,
eval_kwargs=dict(eval_horizon=500),
n_fit=n_fit,
),
AgentManager(
- DQNAgent,
+ StableBaselinesAgent,
(env_ctor, env_kwargs),
+ init_kwargs=dict(algo_cls=DQN), # Init value for StableBaselinesAgent
agent_name="DQNAgent",
- fit_budget=3e4,
+ fit_budget=1e5,
eval_kwargs=dict(eval_horizon=500),
n_fit=n_fit,
),
@@ -116,12 +122,12 @@ print(compare_agents(agents))
```
```
- Agent1 vs Agent2 mean Agent1 mean Agent2 mean diff std diff decisions p-val significance
-0 A2CAgent vs PPOAgent 213.600875 423.431500 -209.830625 144.600160 reject 0.002048 **
-1 A2CAgent vs DQNAgent 213.600875 443.296625 -229.695750 152.368506 reject 0.000849 ***
-2 PPOAgent vs DQNAgent 423.431500 443.296625 -19.865125 104.279024 accept 0.926234
+ Agent1 vs Agent2 mean Agent1 mean Agent2 mean diff std diff decisions p-val significance
+0 A2CAgent vs PPOAgent 416.9975 500.00000 -83.00250 147.338488 accept 0.266444
+1 A2CAgent vs DQNAgent 416.9975 260.38375 156.61375 179.503659 reject 0.017001 *
+2 PPOAgent vs DQNAgent 500.0000 260.38375 239.61625 80.271521 reject 0.000410 ***
```
-The results of `compare_agents(agents)` show the p-values and significance level if the method is `tukey_hsd` and in all the cases it shows the decision accept or reject of the test with Family-wise error controlled by $0.05$. In our case, we see that A2C seems significantly worst than both PPO and DQN but the difference between PPO and DQN is not statistically significant. Remark that no significance (which is to say, decision to accept $H_0$) does not necessarily mean that the algorithms perform the same, it can be that there is not enough data.
+The results of `compare_agents(agents)` show the p-values and significance levels when the method is `tukey_hsd`, and in all cases they show the decision (accept or reject) of the test with the family-wise error controlled at $0.05$. In our case, we see that DQN performs significantly worse than both A2C and PPO, but the difference between PPO and A2C is not statistically significant. Remark that no significance (which is to say, a decision to accept $H_0$) does not necessarily mean that the algorithms perform the same; it can be that there is not enough data (and that is likely the case here).
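+
+The test can also be made stricter or more lenient. The sketch below assumes that `compare_agents` exposes `method` and `alpha` keyword arguments, as in the current rlberry API; check the API reference if these names differ.
+
+```python
+from rlberry.manager.comparison import compare_agents
+
+# Same managers as above, but control the family-wise error at 1% instead of 5%.
+strict_comparison = compare_agents(agents, method="tukey_hsd", alpha=0.01)
+print(strict_comparison)
+```
+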
*Remark*: the comparison we do here is a black-box comparison in the sense that we don't care how the algorithms were tuned or how many training steps are used, we suppose that the user already tuned these parameters adequately for a fair comparison.
diff --git a/docs/basics/quick_start_rl/quickstart.md b/docs/basics/quick_start_rl/quickstart.md
index 8b83aec11..516ca4214 100644
--- a/docs/basics/quick_start_rl/quickstart.md
+++ b/docs/basics/quick_start_rl/quickstart.md
@@ -17,8 +17,8 @@ import numpy as np
import pandas as pd
import time
from rlberry.agents import AgentWithSimplePolicy
-from rlberry_research.agents import UCBVIAgent
-from rlberry_research.envs import Chain
+from rlberry_scool.agents import UCBVIAgent
+from rlberry_scool.envs import Chain
from rlberry.manager import (
ExperimentManager,
evaluate_agents,
@@ -26,7 +26,6 @@ from rlberry.manager import (
read_writer_data,
)
from rlberry.wrappers import WriterWrapper
-from IPython.display import Image
```
Choosing an RL environment
@@ -59,8 +58,6 @@ env.save_gif("gif_chain.gif")
# clear rendering data
env.clear_render_buffer()
env.disable_rendering()
-# view result
-Image(open("gif_chain.gif", "rb").read())
```
@@ -76,7 +73,7 @@ Defining an agent and a baseline
--------------------------------
-We will compare a RandomAgent (which select random action) to the
-UCBVIAgent(from [rlberry_research](https://github.com/rlberry-py/rlberry-research)), which is an algorithm that is designed to perform an
+We will compare a RandomAgent (which selects random actions) to the
+UCBVIAgent (from [rlberry_scool](https://github.com/rlberry-py/rlberry-scool)), which is an algorithm that is designed to perform an
efficient exploration. Our goal is then to assess the performance of the
two algorithms.
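+
+As a reminder of what the random baseline looks like, here is a minimal sketch of such an agent built on `AgentWithSimplePolicy` (imported above), assuming the gymnasium-style `step` API; it is an illustration of the idea, not necessarily the exact agent used below.
+
+```python
+from rlberry.agents import AgentWithSimplePolicy
+
+
+class RandomAgent(AgentWithSimplePolicy):
+    name = "RandomAgent"
+
+    def __init__(self, env, **kwargs):
+        AgentWithSimplePolicy.__init__(self, env, **kwargs)
+
+    def fit(self, budget=100, **kwargs):
+        # Interact with the environment for `budget` steps, acting at random.
+        observation, info = self.env.reset()
+        for _ in range(budget):
+            action = self.policy(observation)
+            observation, reward, terminated, truncated, info = self.env.step(action)
+            if terminated or truncated:
+                observation, info = self.env.reset()
+
+    def policy(self, observation):
+        # A random agent simply samples an action uniformly at every step.
+        return self.env.action_space.sample()
+```
+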
@@ -288,7 +285,7 @@ iteration, the environment takes 100 steps (`horizon`) times the
-Finally, we plot the reward: Here you can see the mean value over the 10 fited agent, with 2 options (raw and smoothed). Note that, to be able to see the smoothed version, you must have installed the extra package `scikit-fda`, (For more information, you can check the options on the [install page](../../installation.md#options)).
+Finally, we plot the reward. Here you can see the mean value over the 10 fitted agents, with two options (raw and smoothed). Note that, to be able to see the smoothed version, you must have installed the extra package `scikit-fda` (for more information, you can check the options on the [install page](../../installation.md#options)).
```python
# Plot of the reward.
diff --git a/docs/basics/userguide/adastop.md b/docs/basics/userguide/adastop.md
index 89d7002ce..c22ba79f7 100644
--- a/docs/basics/userguide/adastop.md
+++ b/docs/basics/userguide/adastop.md
@@ -1,7 +1,7 @@
(adastop_userguide)=
-# AdaStop
+# Adaptive hypothesis testing for comparison of RL agents with AdaStop
diff --git a/docs/basics/userguide/agent.md b/docs/basics/userguide/agent.md
index ee7de57a6..86f66479a 100644
--- a/docs/basics/userguide/agent.md
+++ b/docs/basics/userguide/agent.md
@@ -7,11 +7,11 @@ In rlberry, you can use existing Agent, or create your own custom Agent. You can
## Use rlberry Agent
An agent needs an environment to train. We'll use the same environment as in the [environment](environment_page) section of the user guide.
-("Chain" environment from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)")
+("Chain" environment from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)")
### without agent
```python
-from rlberry_research.envs.finite import Chain
+from rlberry_scool.envs.finite import Chain
env = Chain(10, 0.1)
env.enable_rendering()
@@ -37,7 +37,7 @@ With the same environment, we will use an Agent to choose the actions instead of
For this example, you can use "ValueIterationAgent" Agent from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)"
```python
-from rlberry_research.envs.finite import Chain
+from rlberry_scool.envs.finite import Chain
from rlberry_scool.agents.dynprog import ValueIterationAgent
env = Chain(10, 0.1) # same env
diff --git a/docs/basics/userguide/environment.md b/docs/basics/userguide/environment.md
index 4afaae1ce..96dd3a12b 100644
--- a/docs/basics/userguide/environment.md
+++ b/docs/basics/userguide/environment.md
@@ -7,9 +7,9 @@ This is the world with which the agent interacts. The agent can observe this env
## Use rlberry environment
You can find some environments in our other projects "[rlberry-research](https://github.com/rlberry-py/rlberry-research)" and "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)".
-For this example, you can use "Chain" environment from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)"
+For this example, you can use "Chain" environment from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)"
```python
-from rlberry_research.envs.finite import Chain
+from rlberry_scool.envs.finite import Chain
env = Chain(10, 0.1)
env.enable_rendering()
diff --git a/docs/basics/userguide/expManager_multieval.png b/docs/basics/userguide/expManager_multieval.png
index 2ceeb5ff2..92dc25586 100644
Binary files a/docs/basics/userguide/expManager_multieval.png and b/docs/basics/userguide/expManager_multieval.png differ
diff --git a/docs/basics/userguide/experimentManager.md b/docs/basics/userguide/experimentManager.md
index a68008d1d..03e04af61 100644
--- a/docs/basics/userguide/experimentManager.md
+++ b/docs/basics/userguide/experimentManager.md
@@ -6,13 +6,14 @@ It's the element that allows you to make your experiments on [Agent](agent_page)
You can use it to train, optimize hyperparameters, evaluate, compare, and gather statistics about your agent on a specific environment. You can find the API doc [here](rlberry.manager.ExperimentManager).
It's not the only solution, but it's the compact (and recommended) way of doing experiments with an agent.
-For these examples, you will use the "PPO" torch agent from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)"
+For this example, you will use the "PPO" agent from "[StableBaselines3](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html)" and wrap it in an rlberry Agent. To do that, you need to use [StableBaselinesAgent](rlberry.agents.stable_baselines.StableBaselinesAgent). More information [here](stable_baselines).
## Create your experiment
```python
from rlberry.envs import gym_make
-from rlberry_research.agents.torch import PPOAgent
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from stable_baselines3 import PPO
from rlberry.manager import ExperimentManager, evaluate_agents
@@ -23,8 +24,9 @@ env_kwargs = dict(id=env_id) # give the id of the env inside the kwargs
first_experiment = ExperimentManager(
- PPOAgent, # Agent Class
+    StableBaselinesAgent, # Agent class to manage Stable Baselines agents
(env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs)
+ init_kwargs=dict(algo_cls=PPO, verbose=1), # Init value for StableBaselinesAgent
fit_budget=int(100), # Budget used to call our agent "fit()"
eval_kwargs=dict(
eval_horizon=1000
@@ -43,17 +45,37 @@ print(output)
```
```none
-[INFO] 14:26: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
-[INFO] 14:26: ... trained!
-[INFO] 14:26: Evaluating PPO_first_experimentCartPole-v1...
-[INFO] Evaluation:..... Evaluation finished
-
+[INFO] 09:18: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
+Using cpu device
+Wrapping the env with a `Monitor` wrapper
+Wrapping the env in a DummyVecEnv.
+---------------------------------
+| rollout/ | |
+| ep_len_mean | 23.9 |
+| ep_rew_mean | 23.9 |
+| time/ | |
+| fps | 2977 |
+| iterations | 1 |
+| time_elapsed | 0 |
+| total_timesteps | 2048 |
+---------------------------------
+[INFO] 09:18: ... trained!
+Using cpu device
+Wrapping the env with a `Monitor` wrapper
+Wrapping the env in a DummyVecEnv.
+[INFO] 09:18: Saved ExperimentManager(PPO_first_experimentCartPole-v1) using pickle.
+[INFO] 09:18: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO_first_experimentCartPole-v1_2024-04-12_09-18-10_3a9fa8ad/manager_obj.pickle'
+[INFO] 09:18: Evaluating PPO_first_experimentCartPole-v1...
+[INFO] Evaluation:Using cpu device
+Wrapping the env with a `Monitor` wrapper
+Wrapping the env in a DummyVecEnv.
+..... Evaluation finished
PPO_first_experimentCartPole-v1
-0 15.0
-1 18.4
-2 21.4
-3 22.3
-4 23.0
+0 89.0
+1 64.0
+2 82.0
+3 121.0
+4 64.0
```
@@ -66,11 +88,11 @@ Now you can compare this agent with another one. Here, we are going to compare i
⚠ **warning :** add this code after the previous one. ⚠
```python
second_experiment = ExperimentManager(
- PPOAgent, # Agent Class
+ StableBaselinesAgent, # Agent Class
(env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs)
fit_budget=int(10000), # Budget used to call our agent "fit()"
init_kwargs=dict(
- batch_size=24, n_steps=96, device="cpu"
+ algo_cls=PPO, batch_size=24, n_steps=96, device="cpu"
), # Arguments for the Agent’s constructor.
eval_kwargs=dict(
eval_horizon=1000
@@ -90,22 +112,23 @@ print(output)
```
```none
-[INFO] 14:39: Running ExperimentManager fit() for PPO_second_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
-[INFO] 14:39: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 2496 | fit/policy_loss = -0.0443466454744339 | fit/value_loss = 33.09639358520508 | fit/entropy_loss = 0.6301112174987793 | fit/approx_kl = 0.0029671359807252884 | fit/clipfrac = 0.0 | fit/explained_variance = 0.4449042081832886 | fit/learning_rate = 0.0003 |
-[INFO] 14:39: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 5472 | fit/policy_loss = -0.020021788775920868 | fit/value_loss = 171.70037841796875 | fit/entropy_loss = 0.5415757298469543 | fit/approx_kl = 0.001022467389702797 | fit/clipfrac = 0.0 | fit/explained_variance = 0.1336498260498047 | fit/learning_rate = 0.0003 |
-[INFO] 14:39: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 8256 | fit/policy_loss = -0.016511857509613037 | fit/value_loss = 199.02989196777344 | fit/entropy_loss = 0.5490894317626953 | fit/approx_kl = 0.022175027057528496 | fit/clipfrac = 0.27083333395421505 | fit/explained_variance = 0.19932276010513306 | fit/learning_rate = 0.0003 |
-[INFO] 14:39: ... trained!
-[INFO] 14:39: Evaluating PPO_first_experimentCartPole-v1...
+[INFO] 09:29: Running ExperimentManager fit() for PPO_second_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
+[INFO] 09:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 2688 | time/iterations = 27 | rollout/ep_rew_mean = 57.044444444444444 | rollout/ep_len_mean = 57.044444444444444 | time/fps = 888 | time/time_elapsed = 2 | time/total_timesteps = 2592 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6261792600154876 | train/policy_gradient_loss = -0.001418954369607306 | train/value_loss = 87.49215440750122 | train/approx_kl = 0.0018317258218303323 | train/clip_fraction = 0.0 | train/loss = 31.3124942779541 | train/explained_variance = -0.33643925189971924 | train/n_updates = 260 | train/clip_range = 0.2 |
+[INFO] 09:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 5568 | time/iterations = 57 | rollout/ep_rew_mean = 85.19354838709677 | rollout/ep_len_mean = 85.19354838709677 | time/fps = 916 | time/time_elapsed = 5 | time/total_timesteps = 5472 | train/learning_rate = 0.0003 | train/entropy_loss = -0.617610102891922 | train/policy_gradient_loss = 0.0007477130696315725 | train/value_loss = 66.27523021697998 | train/approx_kl = 1.8932236343971454e-05 | train/clip_fraction = 0.0 | train/loss = 21.402034759521484 | train/explained_variance = 0.46521711349487305 | train/n_updates = 560 | train/clip_range = 0.2 |
+[INFO] 09:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 8640 | time/iterations = 89 | rollout/ep_rew_mean = 107.29113924050633 | rollout/ep_len_mean = 107.29113924050633 | time/fps = 946 | time/time_elapsed = 9 | time/total_timesteps = 8544 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5820738852024079 | train/policy_gradient_loss = -0.008271816929482156 | train/value_loss = 279.90625591278075 | train/approx_kl = 0.005026700906455517 | train/clip_fraction = 0.03750000102445483 | train/loss = 192.93894958496094 | train/explained_variance = 0.00014603137969970703 | train/n_updates = 880 | train/clip_range = 0.2 |
+[INFO] 09:29: ... trained!
+[INFO] 09:29: Saved ExperimentManager(PPO_second_experimentCartPole-v1) using pickle.
+[INFO] 09:29: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO_second_experimentCartPole-v1_2024-04-12_09-29-45_77245043/manager_obj.pickle'
+[INFO] 09:29: Evaluating PPO_first_experimentCartPole-v1...
[INFO] Evaluation:..... Evaluation finished
-[INFO] 14:39: Evaluating PPO_second_experimentCartPole-v1...
+[INFO] 09:29: Evaluating PPO_second_experimentCartPole-v1...
[INFO] Evaluation:..... Evaluation finished
-
PPO_first_experimentCartPole-v1 PPO_second_experimentCartPole-v1
-0 20.6 200.6
-1 20.5 286.7
-2 18.9 238.6
-3 18.2 248.2
-4 17.7 271.9
+0 108.0 500.0
+1 97.0 500.0
+2 130.0 500.0
+3 166.0 500.0
+4 81.0 500.0
```
-As we can see in the output or in the following image, the second agent succeed better.
+As we can see in the output or in the following image, the second agent performs better.
@@ -136,11 +159,13 @@ eval_env_kwargs = { # kwars for eval env (with wrapper)
}
third_experiment = ExperimentManager(
- PPOAgent, # Agent Class
+ StableBaselinesAgent, # Agent Class
(env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs)
fit_budget=int(10000), # Budget used to call our agent "fit()"
eval_env=(eval_env_ctor, eval_env_kwargs), # Evaluation environment as tuple
- init_kwargs=dict(batch_size=24, n_steps=96, device="cpu"), # settings for the Agent
+ init_kwargs=dict(
+ algo_cls=PPO, batch_size=24, n_steps=96, device="cpu"
+ ), # settings for the Agent
eval_kwargs=dict(
eval_horizon=1000
), # Arguments required to call rlberry.agents.agent.Agent.eval().
@@ -159,12 +184,13 @@ print(output3)
```None
-[INFO] 17:03: Running ExperimentManager fit() for PPO_third_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
-[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 1536 | fit/policy_loss = -0.0001924981625052169 | fit/value_loss = 34.07163619995117 | fit/entropy_loss = 0.6320618987083435 | fit/approx_kl = 0.00042163082980550826 | fit/clipfrac = 0.0 | fit/explained_variance = -0.05607199668884277 | fit/learning_rate = 0.0003 |
-[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 3744 | fit/policy_loss = -0.02924121916294098 | fit/value_loss = 0.8705029487609863 | fit/entropy_loss = 0.6485489010810852 | fit/approx_kl = 0.0006057650898583233 | fit/clipfrac = 0.0 | fit/explained_variance = 0.9505079835653305 | fit/learning_rate = 0.0003 |
-[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 5856 | fit/policy_loss = -0.008760576136410236 | fit/value_loss = 2.063389778137207 | fit/entropy_loss = 0.5526289343833923 | fit/approx_kl = 0.017247432842850685 | fit/clipfrac = 0.08645833283662796 | fit/explained_variance = 0.9867914840579033 | fit/learning_rate = 0.0003 |
-[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 8256 | fit/policy_loss = -0.016511857509613037 | fit/value_loss = 199.02989196777344 | fit/entropy_loss = 0.5490894317626953 | fit/approx_kl = 0.022175027057528496 | fit/clipfrac = 0.27083333395421505 | fit/explained_variance = 0.19932276010513306 | fit/learning_rate = 0.0003 |
-[INFO] 09:45: Evaluating PPO_third_experimentCartPole-v1...
+[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 1920 | time/iterations = 19 | rollout/ep_rew_mean = 44.146341463414636 | rollout/ep_len_mean = 44.146341463414636 | time/fps = 687 | time/time_elapsed = 2 | time/total_timesteps = 1824 | train/learning_rate = 0.0003 | train/entropy_loss = -0.612512381374836 | train/policy_gradient_loss = -0.004653797230503187 | train/value_loss = 75.76153821945191 | train/approx_kl = 0.008641918189823627 | train/clip_fraction = 0.03333333339542151 | train/loss = 35.162071228027344 | train/explained_variance = 0.3032127618789673 | train/n_updates = 180 | train/clip_range = 0.2 |
+[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 4704 | time/iterations = 48 | rollout/ep_rew_mean = 79.20689655172414 | rollout/ep_len_mean = 79.20689655172414 | time/fps = 804 | time/time_elapsed = 5 | time/total_timesteps = 4608 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5940127298235893 | train/policy_gradient_loss = -0.016441003710982238 | train/value_loss = 154.39369611740113 | train/approx_kl = 0.010226544924080372 | train/clip_fraction = 0.07500000102445484 | train/loss = 48.81913375854492 | train/explained_variance = 0.005669653415679932 | train/n_updates = 470 | train/clip_range = 0.2 |
+[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 7392 | time/iterations = 76 | rollout/ep_rew_mean = 96.08108108108108 | rollout/ep_len_mean = 96.08108108108108 | time/fps = 826 | time/time_elapsed = 8 | time/total_timesteps = 7296 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5620817124843598 | train/policy_gradient_loss = -0.0007149307257350301 | train/value_loss = 89.1684087753296 | train/approx_kl = 0.00030671278364025056 | train/clip_fraction = 0.0 | train/loss = 26.46017837524414 | train/explained_variance = 0.4496734142303467 | train/n_updates = 750 | train/clip_range = 0.2 |
+[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 9984 | time/iterations = 103 | rollout/ep_rew_mean = 113.64285714285714 | rollout/ep_len_mean = 113.64285714285714 | time/fps = 832 | time/time_elapsed = 11 | time/total_timesteps = 9888 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5782853797078132 | train/policy_gradient_loss = -0.012480927801546693 | train/value_loss = 27.679842436313628 | train/approx_kl = 0.013762158341705799 | train/clip_fraction = 0.04479166660457849 | train/loss = 3.8429009914398193 | train/explained_variance = -0.32027459144592285 | train/n_updates = 1020 | train/clip_range = 0.2 |
+[INFO] 09:36: ... trained!
+[INFO] 09:36: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO_third_experimentCartPole-v1_2024-04-12_09-36-09_da4411b3/manager_obj.pickle'
+[INFO] 09:36: Evaluating PPO_third_experimentCartPole-v1...
[INFO] Evaluation:Moviepy - Building video /CartPole-v1-episode-0.mp4.
Moviepy - Writing video CartPole-v1-episode-0.mp4
@@ -175,14 +201,23 @@ Moviepy - Writing video /CartPole-v1-episode-1.mp4
Moviepy - Done !
Moviepy - video ready /CartPole-v1-episode-1.mp4
-.... Evaluation finished
-
- PPO_third_experimentCartPole-v1
-0 175.0
-1 189.0
-2 234.0
-3 146.0
-4 236.0
+....... Evaluation finished
+ PPO_third_experimentCartPole-v1
+0 500.0
+1 500.0
+2 500.0
+3 500.0
+4 500.0
+5 500.0
+6 500.0
+7 500.0
+8 500.0
+9 500.0
+10 500.0
+11 500.0
+12 500.0
+13 500.0
+14 500.0
```