revise paper title and docs
aiueola committed Nov 18, 2023
1 parent f163ef4 commit 92c0fcd
Showing 20 changed files with 104 additions and 68 deletions.
16 changes: 8 additions & 8 deletions README_ja.md
@@ -44,9 +44,9 @@ SCOPE-RL covers everything from data collection to offline policy learning and off-policy performance evaluation

In particular, SCOPE-RL makes it easy to conduct evaluation and algorithm comparison on the following research topics:

- **Offline Reinforcement Learning**: Offline RL aims to learn a new policy using only offline logged data collected by a behavior policy. SCOPE-RL enables flexible experiments with data collected by various behavior policies and environments.
- **Offline Reinforcement Learning**: Offline RL aims to learn a new policy using only offline logged data collected by a data collection policy. SCOPE-RL enables flexible experiments with data collected by various data collection policies and environments.

- **Off-Policy Evaluation (OPE)**: OPE aims to evaluate the performance of a new policy (different from the behavior policy) using only offline logged data collected by the behavior policy. SCOPE-RL implements abstract classes that make it possible to implement a wide range of off-policy estimators, as well as experimental procedures for evaluating and comparing those estimators. The advanced OPE methods implemented and released in SCOPE-RL include estimators based on state-action density estimation and cumulative distribution estimation.
- **Off-Policy Evaluation (OPE)**: OPE aims to evaluate the performance of a new policy (different from the data collection policy) using only offline logged data collected by the data collection policy. SCOPE-RL implements abstract classes that make it possible to implement a wide range of off-policy estimators, as well as experimental procedures for evaluating and comparing those estimators. The advanced OPE methods implemented and released in SCOPE-RL include estimators based on state-action density estimation and cumulative distribution estimation.

- **Off-Policy Selection (OPS)**: OPS aims to identify the best-performing policy among several candidate policies using offline logged data (see the sketch below). SCOPE-RL not only implements various policy selection criteria but also provides several metrics for evaluating the outcome of policy selection.
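
To make the goal of off-policy selection concrete, here is a minimal, library-agnostic sketch: candidate policies are ranked by their OPE-estimated values, and the top-k are kept as deployment candidates. This is not the SCOPE-RL API, and the policy names and values below are purely illustrative.

```Python
# Minimal, library-agnostic sketch of off-policy selection (OPS);
# NOT the SCOPE-RL API. Policy names and estimated values are illustrative.
estimated_policy_values = {
    "ddqn": 12.3,
    "cql": 14.1,
    "random": 4.2,
    "bcq": 13.0,
}
k = 2
# rank candidates by their estimated values and keep the top-k
top_k_policies = sorted(
    estimated_policy_values, key=estimated_policy_values.get, reverse=True
)[:k]
print(top_k_policies)  # ['cql', 'bcq']
```

The quality of such a selection is then assessed with the OPS metrics provided by SCOPE-RL.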

@@ -58,11 +58,11 @@ SCOPE-RL covers everything from data collection to offline policy learning and off-policy performance evaluation
*SCOPE-RL* mainly consists of the following three modules.

- [**dataset module**](./_gym/dataset): This module provides tools to generate synthetic data from any environment with an interface similar to [OpenAI Gym](http://gym.openai.com/) or [Gymnasium](https://gymnasium.farama.org/). It also provides tools for preprocessing logged data.
- [**policy module**](./_gym/policy): This module provides wrapper classes for d3rlpy, enabling flexible data collection with a variety of behavior policies.
- [**policy module**](./_gym/policy): This module provides wrapper classes for d3rlpy, enabling flexible data collection with a variety of data collection policies.
- [**ope module**](./_gym/ope): This module provides generic abstract classes for implementing off-policy estimators. It also provides several convenient tools for performing off-policy selection (a condensed sketch of how the three modules fit together is shown below).
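
The following condensed sketch shows roughly how the modules compose; the import paths and the `epsilon`, `name`, and `random_state` arguments are assumptions here, so check the quickstart below and the API reference for the exact signatures.

```Python
# Rough end-to-end sketch: the policy module wraps a d3rlpy algorithm as a
# data collection policy, and the dataset module generates logged data.
# Import paths and some keyword arguments are assumptions, not verified API.
import gym
import rtbgym  # assumed to register "RTBEnv-discrete-v0"
from d3rlpy.algos import DoubleDQNConfig
from scope_rl.dataset import SyntheticDataset    # dataset module (assumed path)
from scope_rl.policy import EpsilonGreedyHead    # policy module (assumed path)

env = gym.make("RTBEnv-discrete-v0")
ddqn = DoubleDQNConfig().create(device=False)    # set device="cuda:0" to use a GPU
# (in practice, train ddqn online or offline first, as in the quickstart below)

# wrap the algorithm as a stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,              # assumed argument
    name="ddqn_epsilon_0.3",  # assumed argument
    random_state=12345,       # assumed argument
)

# generate synthetic logged data from the environment
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)
train_logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=10000,
)
```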

<details>
<summary><strong>Behavior policies</strong> (click to expand)</summary>
<summary><strong>Data collection policies</strong> (click to expand)</summary>

- Discrete
- Epsilon Greedy
@@ -181,7 +181,7 @@ env = gym.make("RTBEnv-discrete-v0")
# (1) Learn a base policy in an online environment (using d3rlpy)
# initialize the algorithm
ddqn = DoubleDQNConfig().create(device=device)
# train an online behavior policy
# train an online data collection policy
# this takes about 5 minutes
ddqn.fit_online(
env,
@@ -193,7 +193,7 @@ ddqn.fit_online(
)

# (2) Generate logged data
# convert the ddqn policy into a stochastic behavior policy
# convert the ddqn policy into a stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
ddqn,
n_actions=env.action_space.n,
@@ -206,7 +206,7 @@ dataset = SyntheticDataset(
env=env,
max_episode_steps=env.step_per_episode,
)
# the behavior policy collects some logged data
# the data collection policy collects some logged data
train_logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
@@ -250,7 +250,7 @@ cql.fit(

### Standard Off-Policy Evaluation

Next, we evaluate the performance of several evaluation policies (ddqn, cql, random) using the offline logged data collected by the behavior policy. Specifically, we compare the estimation results of various off-policy estimators, including Direct Method (DM), Trajectory-wise Importance Sampling (TIS), Per-Decision Importance Sampling (PDIS), and Doubly Robust (DR).
Next, we evaluate the performance of several evaluation policies (ddqn, cql, random) using the offline logged data collected by the data collection policy. Specifically, we compare the estimation results of various off-policy estimators, including Direct Method (DM), Trajectory-wise Importance Sampling (TIS), Per-Decision Importance Sampling (PDIS), and Doubly Robust (DR).

```Python
# implement the basic OPE procedure using SCOPE-RL
# ...
```
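
As background for the comparison above, the following library-independent NumPy sketch illustrates what the importance-sampling estimators (TIS and PDIS) compute from logged trajectories. The array shapes and function names are assumptions for illustration only; this is not the SCOPE-RL API.

```Python
# Illustrative sketch of TIS and PDIS (NOT the SCOPE-RL implementation).
# Inputs are assumed to be arrays of shape (n_trajectories, horizon):
#   rewards, pscore_behavior (behavior policy's probabilities of the logged
#   actions), and pscore_evaluation (evaluation policy's probabilities of the
#   same logged actions).
import numpy as np

def trajectory_wise_is(rewards, pscore_behavior, pscore_evaluation, gamma=1.0):
    horizon = rewards.shape[1]
    discounts = gamma ** np.arange(horizon)
    # one importance weight per trajectory (product over all steps)
    trajectory_weight = np.prod(pscore_evaluation / pscore_behavior, axis=1)
    return np.mean(trajectory_weight * (rewards * discounts).sum(axis=1))

def per_decision_is(rewards, pscore_behavior, pscore_evaluation, gamma=1.0):
    horizon = rewards.shape[1]
    discounts = gamma ** np.arange(horizon)
    # cumulative importance weight up to each step
    step_weight = np.cumprod(pscore_evaluation / pscore_behavior, axis=1)
    return np.mean((step_weight * rewards * discounts).sum(axis=1))
```

DM instead fits a value model from the logged data, and DR combines the model-based estimate with per-decision importance weights to reduce both bias and variance.
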
Binary file removed docs/_static/images/benchmark_acrobot.png
Binary file added docs/_static/images/benchmark_mountaincar.png
Binary file modified docs/_static/images/benchmark_sharpe_ratio_4.png
Binary file added docs/_static/images/empirical_comparison.png
Binary file modified docs/_static/images/offline_rl_workflow.png
Binary file modified docs/_static/images/ops_workflow.png
Binary file modified docs/_static/images/real_world_interaction.png
Binary file modified docs/_static/images/sharpe_ratio_1.png
Binary file modified docs/_static/images/sharpe_ratio_2.png
Binary file removed docs/_static/images/topk_metrics_acrobot.png
Binary file added docs/_static/images/topk_metrics_mountaincar.png
10 changes: 5 additions & 5 deletions docs/documentation/distinctive_features.rst
@@ -7,10 +7,10 @@ Why SCOPE-RL?
Motivation
~~~~~~~~~~

Sequential decision making is ubiquitous in many real-world applications, including recommender, search, and advertising systems.
Sequential decision making is ubiquitous in many real-world applications, including healthcare, education, recommender systems, and robotics.
While a *logging* or *behavior* policy interacts with users to optimize such sequential decision making, it also produces logged data valuable for learning and evaluating future policies.
For example, a search engine often records a user's search query (state), the document presented by the behavior policy (action), the user response such as a click observed for the presented document (reward), and the next user behavior including a more specific search query (next state).
Making most of these logged data to evaluate a counterfactual policy is particularly beneficial in practice, as it can be a safe and cost-effective substitute for online A/B tests.
For example, a medical agency often records each patient's condition (state), the treatment chosen by the expert or behavior policy (action), the patient's health index after the treatment, such as vital signs (reward), and the patient's condition in the next time period (next state).
Making the most of such logged data to evaluate a counterfactual policy is particularly beneficial in practice, as it can be a safe and cost-effective substitute for online A/B tests or clinical trials.

.. card::
:width: 75%
@@ -230,7 +230,7 @@ Moreover, we streamline the evaluation protocol of OPE/OPS with the following metrics
* Sharpe ratio (our proposal)

Note that, among the above top-:math:`k` metrics, SharpeRatio is the proposal in our research paper **"
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning"**.
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation"**.
The page :doc:`sharpe_ratio` describes the above metrics and the contribution of SharpeRatio@k in detail. We also discuss these metrics briefly in :ref:`the later sub-section <feature_sharpe_ratio>`.
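
As a rough illustration (assuming a definition analogous to the financial Sharpe ratio, with the value of the behavior policy playing the role of the risk-free baseline; see :doc:`sharpe_ratio` for the exact definition), a SharpeRatio@k-style computation can be sketched as follows.

.. code-block:: python

    # Illustrative sketch of a SharpeRatio@k-style metric; treat this as an
    # assumption-based example, not the SCOPE-RL implementation.
    import numpy as np

    def sharpe_ratio_at_k(policy_values_ranked_by_ope, behavior_policy_value, k):
        """policy_values_ranked_by_ope: values of the candidate policies,
        ordered from best to worst according to an OPE method's ranking."""
        top_k = np.asarray(policy_values_ranked_by_ope[:k])
        best_at_k = top_k.max()        # return of the top-k deployment
        std_at_k = top_k.std() + 1e-8  # risk (spread) among the top-k policies
        return (best_at_k - behavior_policy_value) / std_at_k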

.. _feature_cd_ope:
@@ -314,7 +314,7 @@ we measure risk, return, and efficiency of the selected top-:math:`k` policy with
.. seealso::

Among the top-:math:`k` risk-return tradeoff metrics, SharpeRatio is the main proposal of our research paper
**"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning"**.
**"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation"**.
We describe the motivation and contributions of the SharpeRatio metric in :doc:`sharpe_ratio`.


11 changes: 5 additions & 6 deletions docs/documentation/index.rst
@@ -194,7 +194,7 @@ OPS metrics (performance of top :math:`k` deployment policies)
.. seealso::

Among the top-:math:`k` risk-return tradeoff metrics, **SharpeRatio** is the main proposal of our research paper
**"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning."**
**"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation."**
We describe the motivation and contributions of the SharpeRatio metric in :doc:`sharpe_ratio`.

.. seealso::
@@ -214,13 +214,13 @@ If you use our pipeline in your work, please cite our paper below.
.. card::

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection**
| **SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**
| (a preprint is coming soon)
.. code-block::
@article{kiyohara2023scope,
title={SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection},
title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
year={2023}
@@ -231,13 +231,13 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your
.. card::

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning**
| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**
| (a preprint is coming soon)
.. code-block::
@article{kiyohara2023towards,
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning},
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
year={2023}
@@ -265,7 +265,6 @@ Table of Contents

installation
quickstart
.. _autogallery/index
distinctive_features

.. toctree::
8 changes: 4 additions & 4 deletions docs/documentation/installation.rst
@@ -32,13 +32,13 @@ If you use our pipeline in your work, please cite our paper below.
.. card::

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection**
| **SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**
| (a preprint is coming soon)
.. code-block::
@article{kiyohara2023scope,
title={SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection},
title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
year={2023}
@@ -49,13 +49,13 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your
.. card::

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning**
| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**
| (a preprint is coming soon)
.. code-block::
@article{kiyohara2023towards,
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning},
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
year={2023}
5 changes: 5 additions & 0 deletions docs/documentation/news.rst
@@ -6,4 +6,9 @@ Follow us on `Google Group ([email protected]) <https://groups.google.co
2023
~~~~~~~~~~

**2023.11.xx** Preprints of our papers: (1) [SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation]() ([slides](), [slides in Japanese]())
and (2) [Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation]() ([slides](), [slides in Japanese]()) are now available on arXiv!

**2023.7.30** Released :class:`v0.2.1` of SCOPE-RL! This release upgrades the version of d3rlpy from `1.1.1` to `2.0.4`.

**2023.7.21** Released :class:`v0.1.1` of SCOPE-RL! [`PyPI <https://pypi.org/project/scope-rl/>`_] [`Release Note <https://github.com/hakuhodo-technologies/scope-rl/releases>`_]