
Commit

Copy of UCBVI from rlberry_research to rlberry_scool and misc changes to doc (#451)

* move UCBVI from rlberry_research to rlberry_scool
* update script to test markdown
* add toggleable menu
* update user guide to NOT use rlberry_research
* remove use of IPython
* update tuto deepRL
* update tuto deepRL
* update contributing guideline (agents are not in rlberry_main anymore)
* add doc to monthly test

---------

Co-authored-by: Timothee Mathieu <[email protected]>
JulienT01 and TimotheeMathieu authored Apr 24, 2024
1 parent 730b92b commit 6f933f5
Showing 42 changed files with 738 additions and 819 deletions.
Binary file not shown.
452 changes: 66 additions & 386 deletions docs/basics/DeepRLTutorial/TutorialDeepRL.md

Large diffs are not rendered by default.

Binary file modified docs/basics/DeepRLTutorial/output_10_3.png
Binary file modified docs/basics/DeepRLTutorial/output_5_3.png
Binary file modified docs/basics/DeepRLTutorial/output_6_3.png
Binary file modified docs/basics/DeepRLTutorial/output_9_3.png
36 changes: 21 additions & 15 deletions docs/basics/comparison.md
````diff
@@ -48,7 +48,8 @@ We compute the performances of one agent as follows:
 ```python
 import numpy as np
 from rlberry.envs import gym_make
-from rlberry.agents.torch import A2CAgent
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from stable_baselines3 import A2C
 from rlberry.manager import AgentManager, evaluate_agents

 env_ctor = gym_make
@@ -58,8 +59,9 @@ n_simulations = 50
 n_fit = 8

 rbagent = AgentManager(
-    A2CAgent,
+    StableBaselinesAgent,
     (env_ctor, env_kwargs),
+    init_kwargs=dict(algo_cls=A2C),  # Init value for StableBaselinesAgent
     agent_name="A2CAgent",
     fit_budget=3e4,
     eval_kwargs=dict(eval_horizon=500),
@@ -78,32 +80,36 @@ The evaluation and statistical hypothesis testing is handled through the functio

 For example we may compare PPO, A2C and DQNAgent on Cartpole with the following code.

-``` python
-from rlberry.agents.torch import A2CAgent, PPOAgent, DQNAgent
+```python
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from stable_baselines3 import A2C, PPO, DQN
 from rlberry.manager.comparison import compare_agents

 agents = [
     AgentManager(
-        A2CAgent,
+        StableBaselinesAgent,
         (env_ctor, env_kwargs),
+        init_kwargs=dict(algo_cls=A2C),  # Init value for StableBaselinesAgent
         agent_name="A2CAgent",
-        fit_budget=3e4,
+        fit_budget=1e5,
         eval_kwargs=dict(eval_horizon=500),
         n_fit=n_fit,
     ),
     AgentManager(
-        PPOAgent,
+        StableBaselinesAgent,
         (env_ctor, env_kwargs),
+        init_kwargs=dict(algo_cls=PPO),  # Init value for StableBaselinesAgent
         agent_name="PPOAgent",
-        fit_budget=3e4,
+        fit_budget=1e5,
         eval_kwargs=dict(eval_horizon=500),
         n_fit=n_fit,
     ),
     AgentManager(
-        DQNAgent,
+        StableBaselinesAgent,
         (env_ctor, env_kwargs),
+        init_kwargs=dict(algo_cls=DQN),  # Init value for StableBaselinesAgent
         agent_name="DQNAgent",
-        fit_budget=3e4,
+        fit_budget=1e5,
         eval_kwargs=dict(eval_horizon=500),
         n_fit=n_fit,
     ),
@@ -116,12 +122,12 @@ print(compare_agents(agents))
 ```

 ```
-                Agent1 vs Agent2  mean Agent1  mean Agent2   mean diff    std diff decisions     p-val significance
-0  A2CAgent vs PPOAgent   213.600875   423.431500 -209.830625  144.600160    reject  0.002048            **
-1  A2CAgent vs DQNAgent   213.600875   443.296625 -229.695750  152.368506    reject  0.000849           ***
-2  PPOAgent vs DQNAgent   423.431500   443.296625  -19.865125  104.279024    accept  0.926234
+     Agent1 vs Agent2  mean Agent1  mean Agent2  mean diff    std diff decisions     p-val significance
+0  A2CAgent vs PPOAgent     416.9975    500.00000  -83.00250  147.338488    accept  0.266444
+1  A2CAgent vs DQNAgent     416.9975    260.38375  156.61375  179.503659    reject  0.017001             *
+2  PPOAgent vs DQNAgent     500.0000    260.38375  239.61625   80.271521    reject  0.000410           ***
 ```

-The results of `compare_agents(agents)` show the p-values and significance level if the method is `tukey_hsd` and in all the cases it shows the decision accept or reject of the test with Family-wise error controlled by $0.05$. In our case, we see that A2C seems significantly worst than both PPO and DQN but the difference between PPO and DQN is not statistically significant. Remark that no significance (which is to say, decision to accept $H_0$) does not necessarily mean that the algorithms perform the same, it can be that there is not enough data.
+The results of `compare_agents(agents)` show the p-values and significance level if the method is tukey_hsd and it shows the decision accept or reject of the test with Family-wise error controlled by $0.05$. In our case, we see that DQN is worse than A2C and PPO but the difference between PPO and A2C is not statistically significant. Remark that no significance (which is to say, decision to accept $H_0$) does not necessarily mean that the algorithms perform the same, it can be that there is not enough data (and it is likely that it is the case here).

 *Remark*: the comparison we do here is a black-box comparison in the sense that we don't care how the algorithms were tuned or how many training steps are used, we suppose that the user already tuned these parameters adequately for a fair comparison.
````
11 changes: 4 additions & 7 deletions docs/basics/quick_start_rl/quickstart.md
````diff
@@ -17,16 +17,15 @@ import numpy as np
 import pandas as pd
 import time
 from rlberry.agents import AgentWithSimplePolicy
-from rlberry_research.agents import UCBVIAgent
-from rlberry_research.envs import Chain
+from rlberry_scool.agents import UCBVIAgent
+from rlberry_scool.envs import Chain
 from rlberry.manager import (
     ExperimentManager,
     evaluate_agents,
     plot_writer_data,
     read_writer_data,
 )
 from rlberry.wrappers import WriterWrapper
-from IPython.display import Image
 ```

 Choosing an RL environment
@@ -59,8 +58,6 @@ env.save_gif("gif_chain.gif")
 # clear rendering data
 env.clear_render_buffer()
 env.disable_rendering()
-# view result
-Image(open("gif_chain.gif", "rb").read())
 ```


@@ -76,7 +73,7 @@ Defining an agent and a baseline
 --------------------------------

 We will compare a RandomAgent (which select random action) to the
-UCBVIAgent(from [rlberry_research](https://github.com/rlberry-py/rlberry-research)), which is an algorithm that is designed to perform an
+UCBVIAgent(from [rlberry_scool](https://github.com/rlberry-py/rlberry-scool)), which is an algorithm that is designed to perform an
 efficient exploration. Our goal is then to assess the performance of the
 two algorithms.

@@ -288,7 +285,7 @@ iteration, the environment takes 100 steps (`horizon`) times the



-Finally, we plot the reward: Here you can see the mean value over the 10 fited agent, with 2 options (raw and smoothed). Note that, to be able to see the smoothed version, you must have installed the extra package `scikit-fda`, (For more information, you can check the options on the [install page](../../installation.md#options)).
+Finally, we plot the reward. Here you can see the mean value over the 10 fitted agent, with 2 options (raw and smoothed). Note that, to be able to see the smoothed version, you must have installed the extra package `scikit-fda`, (For more information, you can check the options on the [install page](../../installation.md#options)).

 ```python
 # Plot of the reward.
````
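The quickstart that these hunks modify goes on to define a random baseline and to train the UCBVIAgent through an ExperimentManager. Below is a minimal sketch of that setup using the new `rlberry_scool` imports, assuming the gymnasium-style `reset`/`step` API used elsewhere on this page; the `Chain` keyword names, the budgets, and the exact `RandomAgent` body are illustrative assumptions, not the file's literal content.

```python
from rlberry.agents import AgentWithSimplePolicy
from rlberry.manager import ExperimentManager, evaluate_agents
from rlberry_scool.agents import UCBVIAgent
from rlberry_scool.envs import Chain

env_ctor = Chain
env_kwargs = dict(L=10, fail_prob=0.1)  # assumed keyword names for Chain(10, 0.1)


class RandomAgent(AgentWithSimplePolicy):
    """Baseline that picks a uniformly random action at every step."""

    name = "RandomAgent"

    def fit(self, budget=100, **kwargs):
        observation, info = self.env.reset()
        for _ in range(int(budget)):
            action = self.policy(observation)
            observation, reward, terminated, truncated, info = self.env.step(action)
            if terminated or truncated:
                observation, info = self.env.reset()

    def policy(self, observation):
        return self.env.action_space.sample()


managers = [
    ExperimentManager(
        agent_class,
        (env_ctor, env_kwargs),
        fit_budget=100,
        eval_kwargs=dict(eval_horizon=100),
        n_fit=10,
    )
    for agent_class in (RandomAgent, UCBVIAgent)
]
for manager in managers:
    manager.fit()

# Mean evaluation over 50 Monte-Carlo simulations per trained instance.
evaluate_agents(managers, n_simulations=50, show=False)
```

`plot_writer_data` can then be called on `managers` to produce the reward curves referenced in the last hunk above.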
2 changes: 1 addition & 1 deletion docs/basics/userguide/adastop.md
````diff
@@ -1,7 +1,7 @@
 (adastop_userguide)=


-# AdaStop
+# Adaptive hypothesis testing for comparison of RL agents with AdaStop



````
6 changes: 3 additions & 3 deletions docs/basics/userguide/agent.md
````diff
@@ -7,11 +7,11 @@ In rlberry, you can use existing Agent, or create your own custom Agent. You can

 ## Use rlberry Agent
 An agent needs an environment to train. We'll use the same environment as in the [environment](environment_page) section of the user guide.
-("Chain" environment from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)")
+("Chain" environment from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)")

 ### without agent
 ```python
-from rlberry_research.envs.finite import Chain
+from rlberry_scool.envs.finite import Chain

 env = Chain(10, 0.1)
 env.enable_rendering()
@@ -37,7 +37,7 @@ With the same environment, we will use an Agent to choose the actions instead of
 For this example, you can use "ValueIterationAgent" Agent from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)"

 ```python
-from rlberry_research.envs.finite import Chain
+from rlberry_scool.envs.finite import Chain
 from rlberry_scool.agents.dynprog import ValueIterationAgent

 env = Chain(10, 0.1) # same env
````
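For context, the agent.md section these hunks touch continues by fitting the ValueIterationAgent and rolling out its policy. A hedged sketch of that usage, assuming the gymnasium-style `reset`/`step` API; the `gamma` value, rollout length, and output filename are illustrative, not the file's literal content.

```python
from rlberry_scool.envs.finite import Chain
from rlberry_scool.agents.dynprog import ValueIterationAgent

env = Chain(10, 0.1)  # same env as above
agent = ValueIterationAgent(env, gamma=0.95)  # gamma is an assumed value
agent.fit()  # runs value iteration on the known finite MDP

# Roll out the learned policy and record a gif, as in the "without agent" example.
env.enable_rendering()
observation, info = env.reset()
for _ in range(50):
    action = agent.policy(observation)
    observation, reward, terminated, truncated, info = env.step(action)
env.save_gif("gif_chain_value_it.gif")  # hypothetical output name
env.clear_render_buffer()
env.disable_rendering()
```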
4 changes: 2 additions & 2 deletions docs/basics/userguide/environment.md
````diff
@@ -7,9 +7,9 @@ This is the world with which the agent interacts. The agent can observe this env

 ## Use rlberry environment
 You can find some environments in our other projects "[rlberry-research](https://github.com/rlberry-py/rlberry-research)" and "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)".
-For this example, you can use "Chain" environment from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)"
+For this example, you can use "Chain" environment from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)"
 ```python
-from rlberry_research.envs.finite import Chain
+from rlberry_scool.envs.finite import Chain

 env = Chain(10, 0.1)
 env.enable_rendering()
````
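Similarly, environment.md continues with a short random rollout on the Chain environment. A minimal sketch, again assuming the gymnasium-style `reset`/`step` API; the rollout length is illustrative.

```python
from rlberry_scool.envs.finite import Chain

env = Chain(10, 0.1)
env.enable_rendering()

# Random rollout, just to see the environment evolve.
observation, info = env.reset()
for _ in range(30):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

env.save_gif("gif_chain.gif")  # filename taken from the quickstart diff above
env.clear_render_buffer()
env.disable_rendering()
```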
Binary file modified docs/basics/userguide/expManager_multieval.png