
Commit

Merge pull request #23 from hakuhodo-technologies/paper
Paper
aiueola authored Dec 1, 2023
2 parents 5d9864c + 406a756 commit b1c9e9f
Showing 22 changed files with 97 additions and 39 deletions.
29 changes: 24 additions & 5 deletions README.md
@@ -9,7 +9,8 @@
[![GitHub last commit](https://img.shields.io/github/last-commit/hakuhodo-technologies/scope-rl)](https://github.com/hakuhodo-technologies/scope-rl/graphs/commit-activity)
[![Documentation Status](https://readthedocs.org/projects/scope-rl/badge/?version=latest)](https://scope-rl.readthedocs.io/en/latest/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![arXiv](https://img.shields.io/badge/arXiv-23xx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/23xx.xxxxx)
[![arXiv](https://img.shields.io/badge/arXiv-2311.18206-b31b1b.svg)](https://arxiv.org/abs/2311.18206)
[![arXiv](https://img.shields.io/badge/arXiv-2311.18207-b31b1b.svg)](https://arxiv.org/abs/2311.18207)

<details>
<summary><strong>Table of Contents </strong>(click to expand)</summary>
@@ -36,6 +37,8 @@

**Stable versions are available at [PyPI](https://pypi.org/project/scope-rl/)**

**Slides are available [here](https://speakerdeck.com/harukakiyohara_/scope-rl)**

**A Japanese version is available [here](README_ja.md)**

## Overview
@@ -56,6 +59,13 @@ This software is inspired by [Open Bandit Pipeline](https://github.com/st-tech/z

### Implementations

<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/scope_workflow.png" width="100%"/></div>
<figcaption>
<p align="center">
End-to-end workflow of offline RL and OPE with SCOPE-RL
</p>
</figcaption>

*SCOPE-RL* mainly consists of the following three modules.
- [**dataset module**](./scope_rl/dataset): This module provides tools to generate synthetic data from any environment with an [OpenAI Gym](http://gym.openai.com/)- or [Gymnasium](https://gymnasium.farama.org/)-like interface. It also provides tools to preprocess the logged data.
- [**policy module**](./scope_rl/policy/): This module provides a wrapper class for [d3rlpy](https://github.com/takuseno/d3rlpy) to enable flexible data collection.
@@ -130,6 +140,15 @@ This software is inspired by [Open Bandit Pipeline](https://github.com/st-tech/z

Note that in addition to the above OPE and OPS methods, researchers can easily implement and compare their own estimators through a generic abstract class implemented in SCOPE-RL. Moreover, practitioners can apply the above methods to their real-world data to evaluate and choose counterfactual policies for their own practical situations.
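
As a rough, self-contained illustration of what such an estimator computes, the snippet below implements trajectory-wise importance sampling in plain NumPy. This is only a sketch for intuition: the function name and signature are made up for this example, and it does not use SCOPE-RL's actual abstract estimator class or API.

```python
import numpy as np

def tis_policy_value(reward, behavior_pscore, evaluation_pscore, gamma=0.99):
    """Trajectory-wise importance sampling (illustrative sketch, not the SCOPE-RL API).

    Each argument is an array of shape (n_trajectories, horizon): per-step rewards
    and the action-choice probabilities of the behavior / evaluation policies.
    """
    # trajectory-wise importance weight: product of per-step probability ratios
    traj_weight = np.prod(evaluation_pscore / behavior_pscore, axis=1)
    # discounted return of each logged trajectory
    discount = gamma ** np.arange(reward.shape[1])
    traj_return = (reward * discount).sum(axis=1)
    # importance-weighted average over logged trajectories
    return float(np.mean(traj_weight * traj_return))

# toy usage with random logged data
rng = np.random.default_rng(0)
reward = rng.random((100, 10))
behavior_pscore = rng.uniform(0.4, 0.9, size=(100, 10))
evaluation_pscore = rng.uniform(0.4, 0.9, size=(100, 10))
print(tis_policy_value(reward, behavior_pscore, evaluation_pscore))
```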

The distinctive features of the OPE/OPS modules of SCOPE-RL are summarized as follows.

<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/ope_features.png" width="100%"/></div>
<figcaption>
<p align="center">
Four distinctive features of the OPE/OPS implementation of SCOPE-RL
</p>
</figcaption>

As examples of customized experiments imitating practical setups, we also provide [RTBGym](./rtbgym) and [RecGym](./recgym), RL environments for real-time bidding (RTB) and recommender systems.

## Installation
@@ -426,14 +445,14 @@ If you use our software in your work, please cite our paper:

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.<br>
**SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**<br>
[link]() (a preprint coming soon..)
[[arXiv](https://arxiv.org/abs/2311.18206)] [[slides](https://speakerdeck.com/harukakiyohara_/scope-rl)]

Bibtex:
```
@article{kiyohara2023scope,
author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
title = {SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18206},
year={2023},
}
```
@@ -442,14 +461,14 @@ If you use our proposed metric "SharpeRatio@k" in your work, please cite our pap

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.<br>
**Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**<br>
[link]() (a preprint coming soon..)
[[arXiv](https://arxiv.org/abs/2311.18207)] [[slides](https://speakerdeck.com/harukakiyohara_/towards-risk-return-assessment-of-ope)]

Bibtex:
```
@article{kiyohara2023towards,
author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
title = {Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18207},
year={2023},
}
```
30 changes: 25 additions & 5 deletions README_ja.md
@@ -9,7 +9,8 @@
[![GitHub last commit](https://img.shields.io/github/last-commit/hakuhodo-technologies/scope-rl)](https://github.com/hakuhodo-technologies/scope-rl/graphs/commit-activity)
[![Documentation Status](https://readthedocs.org/projects/scope-rl/badge/?version=latest)](https://scope-rl.readthedocs.io/en/latest/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![arXiv](https://img.shields.io/badge/arXiv-23xx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/23xx.xxxxx)
[![arXiv](https://img.shields.io/badge/arXiv-2311.18206-b31b1b.svg)](https://arxiv.org/abs/2311.18206)
[![arXiv](https://img.shields.io/badge/arXiv-2311.18207-b31b1b.svg)](https://arxiv.org/abs/2311.18207)

<details>
<summary><strong>Table of Contents</strong> (click to expand)</summary>
Expand All @@ -36,6 +37,9 @@

**Stable versions are available at [PyPI](https://pypi.org/project/scope-rl/)**

**Slides (in Japanese) are available [here](https://speakerdeck.com/aiueola/scope-rl-ja)**

## Overview
SCOPE-RL is an open-source Python software for implementing the end-to-end procedure from data collection to offline policy learning, off-policy performance evaluation, and policy selection. The software includes a series of modules for synthetic data generation, data preprocessing, off-policy evaluation (OPE) estimators, and off-policy selection (OPS) methods.

@@ -55,6 +59,13 @@ SCOPE-RL は,データ収集からオフ方策学習,オフ方策性能評

### Implementations

<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/scope_workflow.png" width="100%"/></div>
<figcaption>
<p align="center">
End-to-end workflow of offline RL and OPE with SCOPE-RL
</p>
</figcaption>

*SCOPE-RL* mainly consists of the following three modules.

- [**dataset module**](./_gym/dataset): This module provides tools to generate synthetic data from any environment with an [OpenAI Gym](http://gym.openai.com/)- or [Gymnasium](https://gymnasium.farama.org/)-like interface. It also provides tools to preprocess the logged data.
@@ -130,6 +141,15 @@ SCOPE-RL は,データ収集からオフ方策学習,オフ方策性能評

From a research standpoint, the advantage of SCOPE-RL is that, by using its abstract classes, researchers can easily implement and compare their own estimators in addition to the off-policy evaluation and off-policy selection methods already implemented. From a practical standpoint, a variety of off-policy estimators can be applied to real-world data to evaluate and select policies suited to one's own situation.

Moreover, going beyond the features of existing packages, SCOPE-RL enables a more practice-oriented implementation of off-policy evaluation, as shown in the figure below.

<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/ope_features_ja.png" width="100%"/></div>
<figcaption>
<p align="center">
Four distinctive features emphasized in the OPE/OPS modules of SCOPE-RL
</p>
</figcaption>

SCOPE-RL also provides sub-packages: [BasicGym](./basicgym), a simple synthetic environment, as well as [RTBGym](./rtbgym) and [RecGym](./recgym), RL environments for real-time bidding (RTB) and recommender systems that simulate practical settings.


@@ -434,14 +454,14 @@ ops.visualize_conditional_value_at_risk_for_validation(

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.<br>
**SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**<br>
[link]() (a preprint coming soon..)
[[arXiv](https://arxiv.org/abs/2311.18206)] [[Japanese slides](https://speakerdeck.com/aiueola/scope-rl-ja)]

Bibtex:
```
@article{kiyohara2023scope,
author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
title = {SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18206},
year={2023},
}
```
@@ -450,14 +470,14 @@ Bibtex:

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.<br>
**Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**<br>
[link]() (a preprint coming soon..)
[[arXiv](https://arxiv.org/abs/2311.18207)] [[Japanese slides](https://speakerdeck.com/aiueola/towards-risk-return-assessment-of-ope-ja)]

Bibtex:
```
@article{kiyohara2023towards,
author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
title = {Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18207},
year={2023},
}
```
3 changes: 1 addition & 2 deletions basicgym/README.md
@@ -245,14 +245,13 @@ If you use our software in your work, please cite our paper:

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.<br>
**SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**<br>
[link]() (a preprint coming soon..)

Bibtex:
```
@article{kiyohara2023scope,
author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
title = {SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18206},
year = {2023},
}
```
3 changes: 1 addition & 2 deletions basicgym/README_ja.md
@@ -246,14 +246,13 @@ class CustomizedRewardFunction(BaseRewardFunction):

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.<br>
**SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**<br>
[link]() (a preprint coming soon..)

Bibtex:
```
@article{kiyohara2023scope,
author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
title = {SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18206},
year = {2023},
}
```
Binary file added docs/_static/images/ope_features.png
Binary file modified docs/_static/images/scope_workflow.png
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -81,7 +81,7 @@
"icon_links": [
{
"name": "Speaker Deck",
"url": "https://speakerdeck.com/aiueola/ofrl-designing-an-offline-reinforcement-learning-and-policy-evaluation-platform-from-practical-perspectives",
"url": "https://speakerdeck.com/harukakiyohara_/scope-rl",
"icon": "fa-brands fa-speaker-deck",
"type": "fontawesome",
},
12 changes: 12 additions & 0 deletions docs/documentation/distinctive_features.rst
@@ -106,6 +106,18 @@ advanced ones :cite:`kallus2020double, uehara2020minimax, liu2018breaking, yang2
Moreover, we provide the meta-class to handle OPE/OPS experiments and the abstract base implementation of OPE estimators.
This allows researchers to quickly test their own algorithms with this platform and also helps practitioners empirically learn the property of various OPE methods.
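
For intuition, the general OPS protocol that this meta-class streamlines, ranking candidate policies by their estimated values and inspecting the top-k selection, can be sketched in a few lines of standalone NumPy. The variable names and toy numbers below are hypothetical and do not reflect the SCOPE-RL API.

.. code-block:: python

    import numpy as np

    # hypothetical true and estimated values of 10 candidate policies
    true_value = np.array([8.2, 7.9, 7.5, 7.1, 6.8, 6.0, 5.5, 5.1, 4.2, 3.0])
    estimated_value = np.array([7.0, 8.5, 6.9, 7.3, 5.9, 6.2, 5.0, 5.4, 4.5, 3.1])

    k = 3
    # OPS step: rank candidate policies by their estimated values
    topk = np.argsort(estimated_value)[::-1][:k]

    # two conventional views of how good the selection is
    best_at_k = true_value[topk].max()                     # best true value among the selected policies
    regret_at_1 = true_value.max() - true_value[topk[0]]   # value lost by deploying the top-ranked policy
    print(f"best@{k} = {best_at_k:.2f}, regret@1 = {regret_at_1:.2f}")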

.. card::
:width: 75%
:margin: auto
:img-top: ../_static/images/ope_features.png
:text-align: center

Four key features of OPE/OPS modules of SCOPE-RL

.. raw:: html

<div class="white-space-20px"></div>

.. _feature_variety_ope:

Variety of OPE estimators and evaluation protocol of OPE
4 changes: 4 additions & 0 deletions docs/documentation/examples/index.rst
@@ -6,6 +6,10 @@ Example Codes
SCOPE-RL
----------

.. seealso::

Please also refer to :doc:`SCOPE-RL package reference </documentation/scope_rl_api>` for APIs.

.. _basic_ope_example:

Basic and High-Confidence Off-Policy Evaluation (OPE):
5 changes: 2 additions & 3 deletions docs/documentation/index.rst
@@ -215,14 +215,13 @@ If you use our pipeline in your work, please cite our paper below.

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation**
| (a preprint is coming soon..)
.. code-block::
@article{kiyohara2023scope,
title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18206},
year={2023}
}
@@ -239,7 +238,7 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your
@article{kiyohara2023towards,
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18207},
year={2023}
}
5 changes: 2 additions & 3 deletions docs/documentation/installation.rst
@@ -40,7 +40,7 @@ If you use our pipeline in your work, please cite our paper below.
@article{kiyohara2023scope,
title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18206},
year={2023}
}
@@ -50,14 +50,13 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**
| (a preprint is coming soon..)
.. code-block::
@article{kiyohara2023towards,
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18207},
year={2023}
}
4 changes: 2 additions & 2 deletions docs/documentation/news.rst
@@ -6,8 +6,8 @@ Follow us on `Google Group ([email protected]) <https://groups.google.co
2023
~~~~~~~~~~

**2023.11.xx** Preprints of our papers: (1) [SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation]() ([slides](), [日本語スライド]()),
and (2) [Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation]() ([slides](), [日本語スライド]()) are now available at arXiv!
**2023.12.01** Preprints of our twin papers: (1) `SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation <https://arxiv.org/abs/2311.18206>`_ (`slides <https://speakerdeck.com/harukakiyohara_/scope-rl>`_, `日本語スライド <https://speakerdeck.com/aiueola/scope-rl-ja>`_),
and (2) `Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation <https://arxiv.org/abs/2311.18207>`_ (`slides <https://speakerdeck.com/harukakiyohara_/towards-risk-return-assessment-of-ope>`_, `日本語スライド <https://speakerdeck.com/aiueola/towards-risk-return-assessment-of-ope-ja>`_) are now available at arXiv!

**2023.7.30** Released :class:`v0.2.1` of SCOPE-RL! This release upgrades the version of d3rlpy from `1.1.1` to `2.0.4`.

14 changes: 13 additions & 1 deletion docs/documentation/quickstart.rst
@@ -2,7 +2,19 @@ Quickstart
==========

We show an example workflow of synthetic dataset collection, offline Reinforcement Learning (RL), to Off-Policy Evaluation (OPE).
The workflow mainly consists of the following three steps:
The workflow mainly consists of the following three steps (and a validation step):

.. card::
:width: 75%
:margin: auto
:img-top: ../_static/images/scope_workflow.png
:text-align: center

Workflow of offline RL and OPE streamlined by SCOPE-RL

.. raw:: html

<div class="white-space-20px"></div>

* **Synthetic Dataset Generation and Data Preprocessing**:
The initial step is to collect logged data using a behavior policy. In a synthetic setup, we first train a behavior policy through online interaction and then generate dataset(s) with the behavior policy. In a practical situation, we should use the preprocessed logged data obtained from real-world applications.
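
  As a standalone sketch of this first step, the snippet below logs transitions from a toy Gymnasium environment under a uniform-random behavior policy. It is only an illustration of the data-collection loop; SCOPE-RL's dataset module wraps this procedure behind its own classes, whose names and signatures are not shown here.

.. code-block:: python

    import gymnasium as gym
    import numpy as np

    # toy stand-ins for the environment and behavior policy (illustrative only)
    env = gym.make("CartPole-v1")
    rng = np.random.default_rng(0)

    logged_data = []  # (state, action, reward, done) transitions
    for _ in range(10):  # collect 10 logged trajectories
        state, _ = env.reset(seed=int(rng.integers(0, 1_000_000)))
        done = False
        while not done:
            action = env.action_space.sample()  # behavior policy: uniform random
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            logged_data.append((state, action, reward, done))
            state = next_state

    print(f"collected {len(logged_data)} transitions")
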
8 changes: 3 additions & 5 deletions docs/documentation/sharpe_ratio.rst
@@ -10,9 +10,8 @@ Note that for the basic problem formulation of Off-Policy Evaluation and Selecti
.. seealso::

The **SharpeRatio@k** metric is the main contribution of our paper **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation."**
Our paper is currently under submission, and the arXiv version of the paper will come soon..
Our paper is currently under submission, and the arXiv version of the paper is available `here <https://arxiv.org/abs/2311.18207>`_.

.. A preprint is available at `arXiv <>`_.
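
As a rough orientation before the formal treatment below, SharpeRatio@k can be sketched as follows. This is a standalone NumPy illustration based on our reading of the metric, the return of the top-k policies selected by an estimator (relative to the behavior policy) divided by the standard deviation of their values; the precise definition in the paper and library takes precedence over this sketch.

.. code-block:: python

    import numpy as np

    def sharpe_ratio_at_k(true_value, estimated_value, behavior_policy_value, k=3):
        """Risk-return metric over the top-k policies chosen by an estimator (sketch only)."""
        topk = np.argsort(estimated_value)[::-1][:k]  # policies the estimator would deploy
        best_at_k = true_value[topk].max()            # return: best true value among the top-k
        std_at_k = true_value[topk].std()             # risk: dispersion of true values among the top-k
        return (best_at_k - behavior_policy_value) / (std_at_k + 1e-9)

    # toy example with hypothetical candidate policies
    true_value = np.array([8.2, 7.9, 7.5, 7.1, 6.8, 6.0, 5.5])
    estimated_value = np.array([7.0, 8.5, 6.9, 7.3, 5.9, 6.2, 5.0])
    print(sharpe_ratio_at_k(true_value, estimated_value, behavior_policy_value=6.5))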

Background
~~~~~~~~~~
@@ -281,7 +280,7 @@ The above figure reports the benchmark results of OPE estimators with SharpeRati
1. Future research in OPE should include the assessment of estimators based on SharpeRatio@k:

The findings from the previous section suggest that SharpeRatio@k provides more actionable insights compared to traditional accuracy metrics.
The benchmark results using SharpeRatio@k, as shown in Figure~\ref{fig:sharpe_ratio_benchmark}, often significantly differ from those obtained with conventional accuracy metrics.
The benchmark results using SharpeRatio@k (particularly as shown in the figures of Result 2), often significantly differ from those obtained with conventional accuracy metrics.
This highlights the importance of integrating SharpeRatio@k into future research to more effectively evaluate the efficiency of OPE estimators.

2. A new estimator that explicitly optimizes the risk-return tradeoff:
@@ -312,14 +311,13 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your

| Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**
| (a preprint is coming soon..)
.. code-block::
@article{kiyohara2023towards,
title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
journal={arXiv preprint arXiv:23xx.xxxxx},
journal={arXiv preprint arXiv:2311.18207},
year={2023}
}