Skip to content

Re-implementation of the original TD3+BC algorithm for offline reinforcement learning by Fujimoto & Gu, 2021.

License

Notifications You must be signed in to change notification settings

holdenb/TD3-BC-from-scratch

Repository files navigation

Re-implementation of "A Minimalist Approach to Offline Reinforcement Learning"

Original TD3 Implementation

Original TD3+BC Implementation

Original BCQ Implementation

TD3+BC is a simple approach to offline RL where only two changes are made to TD3: (1) a weighted behavior cloning loss is added to the policy update and (2) the states are normalized. Unlike competing methods there are no changes to architecture or underlying hyperparameters (Fujimoto & Gu, 2021). The paper can be found here.

Details

Original Paper results were collected with MuJoCo 1.50 (and mujoco-py 1.50.1.1) in OpenAI gym 0.17.0 with the D4RL datasets. Networks are trained using PyTorch 1.4.0 and Python 3.6.

The experiments utilize our own rewritten version of TD3+BC utilizing the dependencies described above and utilize the DR4L benchmark of OpenAI gym MuJoCo tasks. All D4RL datasets use the v0 version to replicate experiments and baselines conducted by Fujimoto et al. in the original paper.

Description

Each simulated environment has five difficulties: Random, Medium, Medium Replay, Medium Expert, and Expert. Each simulation contains a goal, and an agent is scored at the end of an evaluation based on appropriate actions that the agent took to attain the goal. Our re-implementation of TD3+BC is trained for for 1 million time steps and evaluated every 5000 time steps. Each evaluation consists of 10 episodes of the gym simulation. The average normalized D4RL score over the final 10 evaluations and 5 seeds is then used to plot the learning curves for each task, where a learning curve consists of the normalized score over the entire duration of training.

Fujimoto et al. notes in Addressing Function Approximation Error in Actor-Critic Methods that the original TD3 implementation utilizes a two layer feed-forward neural network of 400 and 300 hidden nodes respectively with rectified linear units (ReLU) between each layer for both the actor and critic, and a final tanh unit following the output of the actor. TD3+BC utilizes 256 hidden nodes between both the actor and critic. Our implementation adds the original 400 and 300 hidden nodes to TD3+BC. Our implementation continues to use a batch size of 256.

How to Run the Experiments

After installing the python dependencies simply run the provided script

./run_experiments.sh

This should kick off a training run on each experiment listed. The TD3+BC agent is trained on each environment as stated in the description.

This script is directly referenced from the original implementation. This is needed to reproduce results across the environments used in the original testing.

To visualize results, the TD3_BC_results.ipynb can be used to plot the normalized DR4L scores at each time step. The script should overlay runs of each environment. The num_eval_per_scenario parameter can be modified according to how many runs were produced per scenario i.e. [hopper-random-v0, hopper-medium-v0, hopper-expert-v0] would be 3 evaluations for the hopper scenario.

Datasets

As mentioned in the details, the datasets used are OpenAI gym 0.17.0 with the D4RL datasets. Offline RL utilizes batch-constrained learning, so on a given run a single batch is acquired from the gym environment and used throughout training and evaluation. The results are evaluated using the D4RL normalized scores and are based on each specific MuJoCu task.

TODO

  • Run v2 envs and compare against paper v0 baselines
  • Add dockerfile for easy reproduction of the experiments
  • Fix args to allow for model saving & hyperparam tuning
  • Add option for internally benchmarking each evaluation phase (save to npy similar to D4RL scores)
  • Update to use Gymnasium-Robotics and Minari. D4RL is planning to support Minari.
  • Replicate the D4RL normalized score for the Minari envs

BibTex

Please cite the primary authors if any code is used or referenced.

TD3+BC Reference & Paper

@inproceedings{fujimoto2021minimalist,
 title={A Minimalist Approach to Offline Reinforcement Learning},
 author={Scott Fujimoto and Shixiang Shane Gu},
 booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
 year={2021},
}

TD3 (Online Algorithm) Reference & Paper

@inproceedings{fujimoto2018addressing,
  title={Addressing Function Approximation Error in Actor-Critic Methods},
  author={Fujimoto, Scott and Hoof, Herke and Meger, David},
  booktitle={International Conference on Machine Learning},
  pages={1582--1591},
  year={2018}
}

Batch-Constrained Deep Q-Learning (BCQ) References & Papers

@inproceedings{fujimoto2019off,
  title={Off-Policy Deep Reinforcement Learning without Exploration},
  author={Fujimoto, Scott and Meger, David and Precup, Doina},
  booktitle={International Conference on Machine Learning},
  pages={2052--2062},
  year={2019}
}
@article{fujimoto2019benchmarking,
  title={Benchmarking Batch Deep Reinforcement Learning Algorithms},
  author={Fujimoto, Scott and Conti, Edoardo and Ghavamzadeh, Mohammad and Pineau, Joelle},
  journal={arXiv preprint arXiv:1910.01708},
  year={2019}
}

About

Re-implementation of the original TD3+BC algorithm for offline reinforcement learning by Fujimoto & Gu, 2021.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published