diff --git a/README_ja.md b/README_ja.md index b2964286..f4283b81 100644 --- a/README_ja.md +++ b/README_ja.md @@ -44,9 +44,9 @@ SCOPE-RL は,データ収集からオフ方策学習,オフ方策性能評 特に,SCOPE-RLは以下の研究トピックに関連する評価とアルゴリズム比較を簡単に行えます: -- **オフライン強化学習**:オフライン強化学習は,挙動方策によって収集されたオフラインのログデータのみから新しい方策を学習することを目的としています.SCOPE-RLは,様々な挙動方策と環境によって収集されたデータによる柔軟な実験を可能にします. +- **オフライン強化学習**:オフライン強化学習は,データ収集方策によって収集されたオフラインのログデータのみから新しい方策を学習することを目的としています.SCOPE-RLは,様々なデータ収集方策と環境によって収集されたデータによる柔軟な実験を可能にします. -- **オフ方策評価(OPE)**:オフ方策評価は,挙動方策により集められたオフラインのログデータのみを使用して(挙動方策とは異なる)新たな方策の性能を評価することを目的とします.SCOPE-RLは多くのオフ方策推定量の実装可能にする抽象クラスや、推定量を評価し比較するための実験手順を実装しています.また、SCOPE-RLが実装し公開している発展的なオフ方策評価手法には、状態-行動密度推定や累積分布推定に基づく推定量なども含まれます. +- **オフ方策評価(OPE)**:オフ方策評価は,データ収集方策により集められたオフラインのログデータのみを使用して(データ収集方策とは異なる)新たな方策の性能を評価することを目的とします.SCOPE-RLは多くのオフ方策推定量の実装を可能にする抽象クラスや,推定量を評価し比較するための実験手順を実装しています.また,SCOPE-RLが実装し公開している発展的なオフ方策評価手法には,状態-行動密度推定や累積分布推定に基づく推定量なども含まれます. - **オフ方策選択(OPS)**:オフ方策選択は,オフラインのログデータを使用して,いくつかの候補方策の中から最も性能の良い方策を特定することを目的とします.SCOPE-RLは様々な方策選択の基準を実装するだけでなく,方策選択の結果を評価するためのいくつかの指標を提供します. @@ -58,11 +58,11 @@ SCOPE-RL は,データ収集からオフ方策学習,オフ方策性能評 *SCOPE-RL* は主に以下の3つのモジュールから構成されています. - [**dataset module**](./_gym/dataset): このモジュールは,[OpenAI Gym](http://gym.openai.com/) や[Gymnasium](https://gymnasium.farama.org/)のようなインターフェイスに基づく任意の環境から人工データを生成するためのツールを提供します.また,ログデータの前処理を行うためのツールも提供します. -- [**policy module**](./_gym/policy): このモジュールはd3rlpyのwrapperクラスを提供し,様々な挙動方策による柔軟なデータ収集を可能にします. +- [**policy module**](./_gym/policy): このモジュールはd3rlpyのwrapperクラスを提供し,様々なデータ収集方策による柔軟なデータ収集を可能にします. - [**ope module**](./_gym/ope): このモジュールは,オフ方策推定量を実装するための汎用的な抽象クラスを提供します.また,オフ方策選択を実行するために便利ないくつかのツールも提供します.
-挙動方策(クリックして展開) +データ収集方策(クリックして展開) - Discrete - Epsilon Greedy @@ -181,7 +181,7 @@ env = gym.make("RTBEnv-discrete-v0") # (1) オンライン環境で基本方策を学習する(d3rlpyを使用) # アルゴリズムを初期化する ddqn = DoubleDQNConfig().create(device=device) -# オンライン挙動方策を訓練する +# オンラインデータ収集方策を訓練する # 約5分かかる ddqn.fit_online( env, @@ -193,7 +193,7 @@ ddqn.fit_online( ) # (2) ログデータを生成する -# ddqn方策を確率的な挙動方策に変換する +# ddqn方策を確率的なデータ収集方策に変換する behavior_policy = EpsilonGreedyHead( ddqn, n_actions=env.action_space.n, @@ -206,7 +206,7 @@ dataset = SyntheticDataset( env=env, max_episode_steps=env.step_per_episode, ) -# 挙動方策がいくつかのログデータを収集する +# データ収集方策がいくつかのログデータを収集する train_logged_dataset = dataset.obtain_episodes( behavior_policies=behavior_policy, n_trajectories=10000, @@ -250,7 +250,7 @@ cql.fit( ### 標準的なオフ方策評価 -次に,挙動方策によって収集されたオフラインのログデータを使用して,いくつかの評価方策 (ddqn,cql,random) のパフォーマンスを評価します.具体的には,Direct Method (DM),Trajectory-wise Importance Sampling (TIS),Per-Decision Importance Sampling (PDIS),Doubly Robust (DR) を含む様々なオフ方策推定量の推定結果を比較します. +次に,データ収集方策によって収集されたオフラインのログデータを使用して,いくつかの評価方策 (ddqn,cql,random) のパフォーマンスを評価します.具体的には,Direct Method (DM),Trajectory-wise Importance Sampling (TIS),Per-Decision Importance Sampling (PDIS),Doubly Robust (DR) を含む様々なオフ方策推定量の推定結果を比較します. ```Python # SCOPE-RLを使用して基本的なOPE手順を実装する diff --git a/docs/_static/images/benchmark_acrobot.png b/docs/_static/images/benchmark_acrobot.png deleted file mode 100644 index 179f41e4..00000000 Binary files a/docs/_static/images/benchmark_acrobot.png and /dev/null differ diff --git a/docs/_static/images/benchmark_mountaincar.png b/docs/_static/images/benchmark_mountaincar.png new file mode 100644 index 00000000..57927c1a Binary files /dev/null and b/docs/_static/images/benchmark_mountaincar.png differ diff --git a/docs/_static/images/benchmark_sharpe_ratio_4.png b/docs/_static/images/benchmark_sharpe_ratio_4.png index a8bbb82d..2eb793b0 100644 Binary files a/docs/_static/images/benchmark_sharpe_ratio_4.png and b/docs/_static/images/benchmark_sharpe_ratio_4.png differ diff --git a/docs/_static/images/empirical_comparison.png b/docs/_static/images/empirical_comparison.png new file mode 100644 index 00000000..3c25782d Binary files /dev/null and b/docs/_static/images/empirical_comparison.png differ diff --git a/docs/_static/images/offline_rl_workflow.png b/docs/_static/images/offline_rl_workflow.png index 0053007d..c592a8d5 100644 Binary files a/docs/_static/images/offline_rl_workflow.png and b/docs/_static/images/offline_rl_workflow.png differ diff --git a/docs/_static/images/ops_workflow.png b/docs/_static/images/ops_workflow.png index c6d9d5ec..e8058ca3 100644 Binary files a/docs/_static/images/ops_workflow.png and b/docs/_static/images/ops_workflow.png differ diff --git a/docs/_static/images/real_world_interaction.png b/docs/_static/images/real_world_interaction.png index fa603b18..6a6c8419 100644 Binary files a/docs/_static/images/real_world_interaction.png and b/docs/_static/images/real_world_interaction.png differ diff --git a/docs/_static/images/sharpe_ratio_1.png b/docs/_static/images/sharpe_ratio_1.png index dcbd38fa..884aa8f4 100644 Binary files a/docs/_static/images/sharpe_ratio_1.png and b/docs/_static/images/sharpe_ratio_1.png differ diff --git a/docs/_static/images/sharpe_ratio_2.png b/docs/_static/images/sharpe_ratio_2.png index 208bb3ee..66242678 100644 Binary files a/docs/_static/images/sharpe_ratio_2.png and b/docs/_static/images/sharpe_ratio_2.png differ diff --git a/docs/_static/images/topk_metrics_acrobot.png b/docs/_static/images/topk_metrics_acrobot.png deleted file mode 
100644 index d2dba0a4..00000000 Binary files a/docs/_static/images/topk_metrics_acrobot.png and /dev/null differ diff --git a/docs/_static/images/topk_metrics_mountaincar.png b/docs/_static/images/topk_metrics_mountaincar.png new file mode 100644 index 00000000..7d611a6a Binary files /dev/null and b/docs/_static/images/topk_metrics_mountaincar.png differ diff --git a/docs/documentation/distinctive_features.rst b/docs/documentation/distinctive_features.rst index b2a07666..c0561d9f 100644 --- a/docs/documentation/distinctive_features.rst +++ b/docs/documentation/distinctive_features.rst @@ -7,10 +7,10 @@ Why SCOPE-RL? Motivation ~~~~~~~~~~ -Sequential decision making is ubiquitous in many real-world applications, including recommender, search, and advertising systems. +Sequential decision making is ubiquitous in many real-world applications, including healthcare, education, recommender systems, and robotics. While a *logging* or *behavior* policy interacts with users to optimize such sequential decision making, it also produces logged data valuable for learning and evaluating future policies. -For example, a search engine often records a user's search query (state), the document presented by the behavior policy (action), the user response such as a click observed for the presented document (reward), and the next user behavior including a more specific search query (next state). -Making most of these logged data to evaluate a counterfactual policy is particularly beneficial in practice, as it can be a safe and cost-effective substitute for online A/B tests. +For example, a medical agency often records patients' condition (state), the treatment chosen by the expert or behavior policy (action), the patients' health index after the treatment such as vitals (reward), and the patients' condition in the next time period (next state). +Making most of these logged data to evaluate a counterfactual policy is particularly beneficial in practice, as it can be a safe and cost-effective substitute for online A/B tests or clinical trials. .. card:: :width: 75% @@ -230,7 +230,7 @@ Moreover, we streamline the evaluation protocol of OPE/OPS with the following me * Sharpe ratio (our proposal) Note that, among the above top-:math:`k` metrics, SharpeRatio is the proposal in our research paper **" -Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning"**. +Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation"**. The page: :doc:`sharpe_ratio` describe the above metrics and the contribution of SharpeRatio@k in details. We also discuss these metrics briefly in :ref:`the later sub-section `. .. _feature_cd_ope: @@ -314,7 +314,7 @@ we measure risk, return, and efficiency of the selected top-:math:`k` policy wit .. seealso:: Among the top-:math:`k` risk-return tradeoff metrics, SharpeRatio is the main proposal of our research paper - **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning"**. + **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation"**. We describe the motivation and contributions of the SharpeRatio metric in :doc:`sharpe_ratio`. diff --git a/docs/documentation/index.rst b/docs/documentation/index.rst index c5ce59ff..8ad626bd 100644 --- a/docs/documentation/index.rst +++ b/docs/documentation/index.rst @@ -194,7 +194,7 @@ OPS metrics (performance of top :math:`k` deployment policies) .. 
seealso:: Among the top-:math:`k` risk-return tradeoff metrics, **SharpeRatio** is the main proposal of our research paper - **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning."** + **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation."** We describe the motivation and contributions of the SharpeRatio metric in :doc:`sharpe_ratio`. .. seealso:: @@ -214,13 +214,13 @@ If you use our pipeline in your work, please cite our paper below. .. card:: | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. - | **SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection** + | **SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation** | (a preprint is coming soon..) .. code-block:: @article{kiyohara2023scope, - title={SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection}, + title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta}, journal={arXiv preprint arXiv:23xx.xxxxx}, year={2023} @@ -231,13 +231,13 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your .. card:: | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. - | **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning** + | **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation** | (a preprint is coming soon..) .. code-block:: @article{kiyohara2023towards, - title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning}, + title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta}, journal={arXiv preprint arXiv:23xx.xxxxx}, year={2023} @@ -265,7 +265,6 @@ Table of Contents installation quickstart - .. _autogallery/index distinctive_features .. toctree:: diff --git a/docs/documentation/installation.rst b/docs/documentation/installation.rst index 44494d09..a5954a51 100644 --- a/docs/documentation/installation.rst +++ b/docs/documentation/installation.rst @@ -32,13 +32,13 @@ If you use our pipeline in your work, please cite our paper below. .. card:: | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. - | **SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection** + | **SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation** | (a preprint is coming soon..) .. code-block:: @article{kiyohara2023scope, - title={SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection}, + title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta}, journal={arXiv preprint arXiv:23xx.xxxxx}, year={2023} @@ -49,13 +49,13 @@ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your .. 
card:: | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. - | **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning** + | **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation** | (a preprint is coming soon..) .. code-block:: @article{kiyohara2023towards, - title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning}, + title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta}, journal={arXiv preprint arXiv:23xx.xxxxx}, year={2023} diff --git a/docs/documentation/news.rst b/docs/documentation/news.rst index 743d9cf5..97950e55 100644 --- a/docs/documentation/news.rst +++ b/docs/documentation/news.rst @@ -6,4 +6,9 @@ Follow us on `Google Group (scope-rl@googlegroups.com) `_] [`Release Note `_] \ No newline at end of file diff --git a/docs/documentation/sharpe_ratio.rst b/docs/documentation/sharpe_ratio.rst index d1a8e87e..4a8db435 100644 --- a/docs/documentation/sharpe_ratio.rst +++ b/docs/documentation/sharpe_ratio.rst @@ -9,7 +9,7 @@ Note that for the basic problem formulation of Off-Policy Evaluation and Selecti .. seealso:: - The **SharpeRatio@k** metric is the main contribution of our paper **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning."** + The **SharpeRatio@k** metric is the main contribution of our paper **"Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation."** Our paper is currently under submission, and the arXiv version of the paper will come soon.. .. A preprint is available at `arXiv <>`_. @@ -49,7 +49,7 @@ To evaluate and compare the performance of OPE estimators, the following three m In the above metrics, MSE measures the accuracy of OPE estimation, while the latter two assess the accuracy of downstream policy selection tasks. By combining these metrics, especially the latter two, we can quantify how likely an OPE estimator can choose a near-optimal policy in policy selection when solely relying on the OPE result. However, a critical shortcoming of the current evaluation protocol is that these metrics do not assess potential risks experienced during online A/B tests in more practical two-stage selection combined with online A/B tests. -For instance, let us now consider the following toy situation as an illustrative example. +For instance, let us now consider the following situation as an illustrative example. .. card:: :width: 75% @@ -57,7 +57,7 @@ For instance, let us now consider the following toy situation as an illustrative :img-top: ../_static/images/toy_example_1.png :text-align: center - Toy example 1: overestimation vs. underestimation + Example 1: overestimation vs. underestimation .. raw:: html @@ -110,7 +110,7 @@ Below, we showcase how SharpeRatio@k provides valuable insights for comparing OP
-**Toy example 1: Overestimation vs. Underestimation.** +**Example 1: Overestimation vs. Underestimation.** The first case is the previously mentioned example of evaluating estimator X (which underestimates the near-best policy) and estimator Y (which overestimates the poor-performing policies) in the above figure. While the conventional metrics fail to distinguish the two estimators, SharpeRatio@k reports the following results: @@ -118,7 +118,7 @@ While the conventional metrics fail to distinguish the two estimators, SharpeRat :img-top: ../_static/images/sharpe_ratio_1.png :text-align: center - SharpeRatio@k of the toy example 1 + SharpeRatio@k of example 1 .. raw:: html @@ -134,7 +134,7 @@ Therefore, in terms of SharpeRatio@k, estimator X is preferable to Y, while the
-**Toy example 2: Conservative vs. High-Stakes.** +**Example 2: Conservative vs. High-Stakes.** Another example involves evaluating a conservative OPE (estimator W, which always underestimates) and a uniform random OPE (estimator Z) as shown in the following figure. .. card:: @@ -143,7 +143,7 @@ Another example involves evaluating a conservative OPE (estimator W, which alway :img-top: ../_static/images/toy_example_2.png :text-align: center - Toy example 2: conservative vs. high-stakes + Example 2: conservative vs. high-stakes .. raw:: html @@ -168,7 +168,7 @@ In contrast, our top-:math:`k` RRT metrics report the following results, which c :img-top: ../_static/images/sharpe_ratio_2.png :text-align: center - SharpeRatio@k the toy example 2 + SharpeRatio@k of example 2 .. raw:: html @@ -189,13 +189,43 @@ For the detailed settings, please refer to Section 4.1 of our paper.
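Before the benchmark results below, it may help to recall the standard forms of the conventional metrics referenced throughout this section. The following is a sketch in generic notation (the benchmarked nMSE and nRegret@1 are normalized variants whose exact scaling is defined in the paper, so treat that as an assumption here):

.. math::

    \mathrm{MSE} = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \big( \hat{J}(\pi) - J(\pi) \big)^2,
    \qquad
    \mathrm{Regret@1} = \max_{\pi \in \Pi} J(\pi) - J(\hat{\pi}_{1}),

where :math:`\Pi` is the set of candidate policies, :math:`\hat{J}(\pi)` is the estimated and :math:`J(\pi)` the true policy value, and :math:`\hat{\pi}_{1}` is the policy ranked first by the estimator. RankCorr is the (Spearman) rank correlation between the candidate rankings induced by :math:`\hat{J}` and by :math:`J`.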
-**Result 1: SharpeRatio@k is more appropriate and informative than conventional accuracy metrics.** +**Result 1: SharpeRatio reports the performance of OPE estimators differently from conventional metrics.** .. card:: - :img-top: ../_static/images/benchmark_acrobot.png + :img-top: ../_static/images/empirical_comparison.png :text-align: left - **Result 1-1**: Estimators' performance comparison based on **SharpeRatio@k** (the left figure) and **conventional metrics including nMSE, RankCorr, and nRegret@1** (the right three figures) in **Acrobot**. + (Left) Comparison of **SharpeRatio@4** and **conventional metrics (RankCorr, nRegret, nMSE)** in assessing OPE estimators. + (Right) **The number of trials in which the best estimator, selected by SharpeRatio@4 (SR@4) and conventional metrics, aligns.** Both figures report the results of 70 trials, consisting of 7 tasks and 10 random seeds for each. A lower value is better for nMSE and nRegret, while a higher value is better for RankCorr and SharpeRatio@4. + + +.. raw:: html + +
+ +The left figure illustrates the correlation and divergence between SharpeRatio@4 and conventional metrics in evaluating OPE estimators across various RL tasks. +Each point in the figure represents the metrics for five estimators over 70 trials, consisting of 7 different tasks and 10 random seeds. +The right figure presents the number of trials where the best estimators, as identified by SharpeRatio@4 and each conventional metric, coincide. + +The above figures reveal that superior conventional metric values (i.e., higher RankCorr and lower nRegret and nMSE) do not consistently correspond to higher SharpeRatio@4 values. +The most significant deviation of SharpeRatio@4 is from nMSE, which is understandable given that nMSE focuses solely on the estimation accuracy of OPE without considering policy selection effectiveness. +In contrast, SharpeRatio@4 shows some correlation with policy selection metrics (RankCorr and nRegret). +Nonetheless, the best estimator chosen by SharpeRatio@4 often differs from those selected by RankCorr and nRegret. +SharpeRatio@4 and nRegret align in only 8 of the 70 trials, and RankCorr, despite being the most closely correlated metric with SharpeRatio, diverges in the choice of the estimator in over 40\% of the trials (29 out of 70). + +The following sections explore specific instances where SharpeRatio@k and conventional metrics diverge, demonstrating how SharpeRatio@k effectively validates the risk-return trade-off, while conventional metrics fall short. + +.. raw:: html + +
+ +**Result 2: SharpeRatio@k is more appropriate and informative than conventional accuracy metrics.** + +.. card:: + :img-top: ../_static/images/benchmark_mountaincar.png + :text-align: left + + **Result 2-1**: Estimators' performance comparison based on **SharpeRatio@k** (the left figure) and **conventional metrics including nMSE, RankCorr, and nRegret@1** (the right three figures) in **MountainCar**. A lower value is better for nMSE and nRegret@1, while a higher value is better for RankCorr and SharpeRatio@k. The stars ( :math:`\star`) indicate the best estimator(s) under each metric. .. raw:: html @@ -204,10 +234,10 @@ For the detailed settings, please refer to Section 4.1 of our paper. .. card:: - :img-top: ../_static/images/topk_metrics_acrobot.png + :img-top: ../_static/images/topk_metrics_mountaincar.png :text-align: left - **Result 1-2**: **Reference statistics of the top-** :math:`k` **policy portfolio** formed by each estimator in **Acrobot** + **Result 2-2**: **Reference statistics of the top-** :math:`k` **policy portfolio** formed by each estimator in **MountainCar** "best" is used as the numerator of SharpeRatio@k, while "std" is used as its denominator. A higher value is better for "best" and " :math:`k`-th best policy's performance", while a lower value is better for "std". The dark red lines show the performance of :math:`\pi_b`, which is the risk-free baseline of SharpeRatio@k. @@ -216,27 +246,25 @@ For the detailed settings, please refer to Section 4.1 of our paper.
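For reference, the quantities above ("best" as the numerator, "std" as the denominator, and :math:`J(\pi_b)` as the risk-free baseline) combine into the following form of the metric; this is a sketch consistent with the description here, and the paper gives the precise definition and notation:

.. math::

    \textrm{SharpeRatio@}k = \frac{\textrm{best@}k - J(\pi_b)}{\textrm{std@}k},

where :math:`\textrm{best@}k` is the best (true) policy value within the top-:math:`k` portfolio formed by an estimator and :math:`\textrm{std@}k` is the standard deviation of the policy values within that portfolio.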
-The above figure (Result 1-1.) presents a comparison between the benchmark results under SharpeRatio@k and those under conventional metrics in Acrobot. -The next figure (Result 1-2.) reports some reference statistics about the top- :math:`k` policy portfolios formed by each estimator, where " :math:`k`-th best policy's performance" shows the performance of the policy ranked :math:`k`-th among the candidates by each estimator. +The top figure (Result 2-1) contrasts the benchmark results obtained using SharpeRatio@k with those derived from conventional metrics in the MountainCar task. +The bottom figure (Result 2-2) details reference statistics for the top-:math:`k` policy portfolios created by each estimator. +Notably, the ":math:`k`-th best policy's performance" indicates how well the policy, ranked :math:`k`-th by each estimator, performs. -First, Result 1-1. shows that both conventional metrics and SharpeRatio@k acknowledge the advantage of MDR, which is ranked the best in SharpeRatio@k ( :math:`4 \leq k \leq 8`) and the second-best according to conventional metrics. -In contrast, there exists a substantial difference in the evaluation of MIS and DM between SharpeRatio@k and the other metrics. -This discrepancy arises because, as shown in " :math:`k`-th best policy's performance" of Result 1-2, MIS overestimates one of the worst policies, even though it ranks the other policies in a nearly perfect order (which parallels that of estimator Y in the toy example 2). -Thus, conventional metrics evaluate MIS as the most "accurate" estimator, neglecting the evident risk of implementing a detrimental policy. -On the other hand, SharpeRatio@k successfully detects this risky conduct of MIS by taking "std" (risk metric) into account, gives more preference to MDR and DM for :math:`k \ge 4`, as they perform safer than MIS. +These results highlight that the preferred OPE estimator varies significantly based on the evaluation metric used. +For instance, nMSE and nRegret favor MIS as the best estimator, while RankCorr and SharpeRatio@7 select DM, and SharpeRatio@4 opts for PDIS. +Upon examining these three estimators through the reference statistics in the bottom figure (Result 2-2), it becomes evident that conventional metrics tend to overlook the risk that an OPE estimator includes suboptimal policies in its portfolio. +Specifically, nMSE and nRegret fail to recognize the danger of MIS implementing an almost worst-case policy for :math:`k \leq 4`. +Additionally, RankCorr does not acknowledge the risk involved with PDIS implementing a nearly worst-case policy for :math:`k \leq 6`, and it inappropriately ranks PDIS higher than MDR, which avoids deploying a suboptimal policy until the last deployment (:math:`k=9, 10`). -It is worth noticing that SharpeRatio@k evaluates DM as the best estimator when :math:`k \geq 6`, whereas it is among the worst estimators under conventional metrics. -This contrast can be attributed to DM's weakness in accurately ranking the top candidate policies. -As we can see in " :math:`k`-th best policy's performance" of Result 1-2, DM is also able to avoid selecting the worse policy until the very last ( :math:`k=10`) in this environment. -SharpeRatio@k captures this particular characteristic of DM and precisely evaluates its risk-return tradeoffs with varying online evaluation budgets ( :math:`k`), while existing accuracy metrics fail to do so.
+In contrast, SharpeRatio@k effectively discerns the varied characteristics of policy portfolios and identifies a safe and efficient estimator in a way that adapts to the specific budget (:math:`k`) or problem instance (:math:`J(\pi_b)`). -Overall, the benchmark results suggest that SharpeRatio@k provides a more practically meaningful comparison of OPE estimators than conventional accuracy metrics. +Overall, the benchmark findings suggest that SharpeRatio@k offers a more practically meaningful comparison of OPE estimators than existing accuracy metrics. .. raw:: html
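To make the portfolio-level reading of these results easy to reproduce, here is a minimal, self-contained sketch of the computation described above, written as a hypothetical helper on plain NumPy arrays (not the SCOPE-RL API); the policy values and baseline below are illustrative only:

.. code-block:: python

    import numpy as np

    def sharpe_ratio_at_k(values_ranked_by_estimator: np.ndarray, behavior_policy_value: float, k: int) -> float:
        """Risk-return tradeoff of the top-k policy portfolio formed by an OPE estimator.

        values_ranked_by_estimator: true policy values J(pi), ordered by the estimator's ranking.
        behavior_policy_value: J(pi_b), used as the risk-free baseline.
        """
        portfolio = values_ranked_by_estimator[:k]
        best = portfolio.max()   # "best@k": return of the top-k portfolio
        std = portfolio.std()    # "std@k": risk of the top-k portfolio
        return (best - behavior_policy_value) / std

    # Illustrative usage: an estimator that ranks a near-optimal policy first
    # but lets a poor policy slip into its top-4 portfolio.
    true_values_in_estimated_order = np.array([9.0, 8.5, 1.0, 7.0, 6.5])
    print(sharpe_ratio_at_k(true_values_in_estimated_order, behavior_policy_value=5.0, k=4))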
-**Result 2: Comprehensive results and suggested future works** +**Result 3: Comprehensive results and suggested future works** .. card:: :img-top: ../_static/images/benchmark_sharpe_ratio_4.png @@ -248,28 +276,32 @@ Overall, the benchmark results suggest that SharpeRatio@k provides a more practi
-The above figure reports the benchmark results of the OPE estimators with SharpeRatio@4 in various benchmark environments, providing the following directions and suggestions for future OPE research. - -1. Future research in OPE should include assessments of estimators under SharpeRatio@k: +The above figure reports the benchmark results of OPE estimators with SharpeRatio@4 in various RL environments, providing the following directions and suggestions for future OPE research. - We observe in the previous Acrobot case that SharpeRatio@k offers more practical insights than conventional accuracy metrics, and the benchmark results under SharpeRatio@k sometimes diverge substantially from those under conventional accuracy metrics (See our paper for the details). - This indicates that future research should, at least additionally, employ SharpeRatio@k to assess OPE estimators in their experiments. +1. Future research in OPE should include the assessment of estimators based on SharpeRatio@k: + The findings from the previous section suggest that SharpeRatio@k provides more actionable insights compared to traditional accuracy metrics. + The benchmark results using SharpeRatio@k, as shown in the figure above, often significantly differ from those obtained with conventional accuracy metrics. + This highlights the importance of integrating SharpeRatio@k into future research to more effectively evaluate the efficiency of OPE estimators. + 2. A new estimator that explicitly optimizes the risk-return tradeoff: - Even though DR and MDR are generally considered more sophisticated in existing research, they do not always outperform DM, PDIS, and MIS under SharpeRatio@k in the above figure. - This is because they are not specifically designed to enhance the risk-return tradeoff and associated efficiency. - Therefore, it would be a valuable direction to develop a novel estimator that more explicitly optimizes the risk-return tradeoff than existing methods. + While DR and MDR are generally regarded as advanced in the existing literature, they do not consistently outperform DM, PDIS, and MIS according to SharpeRatio@k, as indicated in the figure. + This is attributable to their lack of specific design for optimizing the risk-return tradeoff and efficiency. + Consequently, a promising research avenue would be to create a new estimator that explicitly focuses more on optimizing this risk-return tradeoff than existing methods. 3. A data-driven estimator selection method: - The results demonstrate that the most *efficient* estimator can change greatly across environments, suggesting that adaptively selecting an appropriate estimator is critical for a reliable OPE in practice. - Since existing methods in estimator selection mostly focus on the "accuracy" metrics such as MSE and Regret, developing a novel estimator selection method that can account for risks and efficiency would also be an interesting direction for future studies. + The results show that the most *efficient* estimator varies significantly across different environments, underscoring the need for adaptively selecting the most suitable estimator for reliable OPE. + Given that existing estimator selection methods predominantly focus on "accuracy" metrics like MSE and Regret, there is an intriguing opportunity for future research to develop a novel estimator selection method that considers risks and efficiency. +.. raw:: html + +
.. seealso:: - More results and discussions are available in our research paper. + More results and discussions are available in the Appendix of our research paper. Citation ~~~~~~~~~~ If you use the proposed metric (SharpeRatio@k) or refer to our findings in your work, please cite our paper below. .. card:: | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. - | **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning** + | **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation** | (a preprint is coming soon..) .. code-block:: @article{kiyohara2023towards, - title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning}, + title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta}, journal={arXiv preprint arXiv:23xx.xxxxx}, year={2023} diff --git a/docs/documentation/subpackages/basicgym_about.rst b/docs/documentation/subpackages/basicgym_about.rst index 4f52ef49..7dc2df99 100644 --- a/docs/documentation/subpackages/basicgym_about.rst +++ b/docs/documentation/subpackages/basicgym_about.rst @@ -212,14 +212,14 @@ Citation If you use our pipeline in your work, please cite our paper below. | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. -| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning** +| **Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation** | (a preprint coming soon..) .. code-block:: @article{kiyohara2023towards, author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nataka, Kazuhide and Saito, Yuta}, - title = {Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning}, + title = {Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation}, journal = {A GitHub repository}, pages = {xxx--xxx}, year = {2023}, diff --git a/docs/index.rst b/docs/index.rst index 45b4cc63..408c124c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -298,13 +298,13 @@ If you use our pipeline in your work, please cite our paper below. .. card:: | Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. - | **SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection** + | **SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation** | (a preprint is coming soon..) .. code-block:: @article{kiyohara2023scope, - title={SCOPE-RL: A Python Library for Offline Reinforcement Learning, Off-Policy Evaluation, and Policy Selection}, + title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta}, journal={arXiv preprint arXiv:23xx.xxxxx}, year={2023} diff --git a/experiments/README.md b/experiments/README.md index d8103dc8..5ce357f5 100644 --- a/experiments/README.md +++ b/experiments/README.md @@ -3,13 +3,13 @@ This directory includes the code to replicate the benchmark experiment done in the following paper. Haruka Kiyohara, Ren Kishimoto, Kousuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
-**Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning**
+**Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation**
[link]() (a preprint coming soon..) If you find this code useful in your research then please cite: ``` @article{kiyohara2023towards, - title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning}, + title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation}, author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nataka, Kazuhide and Saito, Yuta}, journal = {A github repository}, pages = {xxx--xxx},