A comprehensive collection of reinforcement learning algorithm implementations, from classic dynamic programming to modern deep reinforcement learning methods.
This repository contains implementations of various reinforcement learning algorithms, organized by lecture topics. Each algorithm is implemented in Python with clean, well-documented code and includes visualizations of the learning process.
These implementations are based on the course "Mathematical Foundation of Reinforcement Learning" available at:
https://github.com/MathFoundationRL/Book-Mathmatical-Foundation-of-Reinforcement-Learning
The code follows the mathematical formulations and algorithmic approaches presented in the course, with additional documentation and visualizations for better understanding.
- Python 3.7 or higher
- pip package manager
- Clone the repository:
```bash
git clone https://github.com/yourusername/reinforcement-learning-notes.git
cd reinforcement-learning-notes
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

To run all algorithms in sequence:

```bash
./run_all.sh
```

To run a single algorithm, navigate to its lecture directory and run the script:

```bash
cd lecture4
python policy_iteration.py
```

- Policy Iteration: Iterative policy evaluation and improvement
- Value Iteration: Simultaneous value and policy updates
- MC Basic: First-visit Monte Carlo with policy iteration
- MC Epsilon-Greedy: Monte Carlo with ε-greedy exploration
- MC Exploring Starts: Monte Carlo with exploring starts assumption
- SARSA: On-policy TD control
- Q-Learning (On-Policy): Q-learning with ε-greedy policy
- Q-Learning (Off-Policy): Off-policy Q-learning with experience replay
- Q-Learning with Function Approximation: Linear function approximation for Q-learning
- SARSA with Function Approximation: Linear function approximation for SARSA
- DQN: Deep Q-Network with experience replay and target network
- REINFORCE: Monte Carlo policy gradient method
- QAC: Q-Actor-Critic with Q-value critic
- A2C: Advantage Actor-Critic with state-value critic
- A2C Off-Policy: Off-policy A2C with importance sampling
- PPO: Proximal Policy Optimization with GAE
| Algorithm | Environment | Key Features |
|---|---|---|
| Policy Iteration | GridWorld | Iterative policy evaluation and improvement |
| Value Iteration | GridWorld | Simultaneous value and policy updates |
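To make the dynamic-programming backup concrete, here is a minimal value-iteration sketch on a hypothetical 4x4 GridWorld (deterministic moves, -1 reward per step, absorbing terminal in the bottom-right corner). The environment and constants are invented for illustration and do not reflect the repo's GridWorld API.

```python
import numpy as np

# Hypothetical 4x4 grid: -1 per step, terminal at (3, 3).
N, GAMMA, THETA = 4, 0.9, 1e-6
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINAL = (N - 1, N - 1)

def step(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    if state == TERMINAL:
        return state, 0.0  # terminal state is absorbing with zero reward
    r = min(N - 1, max(0, state[0] + action[0]))
    c = min(N - 1, max(0, state[1] + action[1]))
    return (r, c), -1.0

V = np.zeros((N, N))
while True:
    delta = 0.0
    for i in range(N):
        for j in range(N):
            s = (i, j)
            # Bellman optimality backup: v(s) <- max_a [r + gamma * v(s')]
            best = max(rew + GAMMA * V[ns] for ns, rew in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
    if delta < THETA:  # stop once the value function has (numerically) converged
        break
```

The greedy policy can then be read off per state as `argmax_a [r + gamma * V[s']]`; policy iteration differs in alternating full evaluation sweeps with that greedy improvement step.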
| Algorithm | Environment | Exploration | Key Features |
|---|---|---|---|
| MC Basic | GridWorld | Policy-based | First-visit MC, policy iteration |
| MC Epsilon-Greedy | GridWorld | ε-greedy | Decaying ε, every-visit MC |
| MC Exploring Starts | GridWorld | Exploring starts | Greedy policy improvement |
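As a concrete illustration of the Monte Carlo control loop, here is an every-visit MC sketch with a decaying ε-greedy policy on a hypothetical five-state corridor (terminal at the right end, -1 per step). The environment, decay schedule, and episode counts are invented for the example, not taken from the repo.

```python
import random
from collections import defaultdict

random.seed(0)
N_STATES, GAMMA = 5, 1.0  # states 0..4, terminal at 4
Q = defaultdict(float)
counts = defaultdict(int)

def policy(s, eps):
    """Epsilon-greedy over actions -1 (left) and +1 (right)."""
    if random.random() < eps:
        return random.choice([-1, 1])
    return max([-1, 1], key=lambda a: Q[(s, a)])

for ep in range(2000):
    eps = max(0.05, 1.0 / (1 + ep / 50))  # decaying exploration rate
    s, traj = 0, []
    while s != N_STATES - 1 and len(traj) < 100:
        a = policy(s, eps)
        traj.append((s, a, -1.0))
        s = min(N_STATES - 1, max(0, s + a))
    G = 0.0
    for (s, a, r) in reversed(traj):  # every-visit backward return
        G = r + GAMMA * G
        counts[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # incremental mean

greedy = [max([-1, 1], key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy should move right in every state
```

Exploring starts replaces the ε-greedy behavior with episodes launched from random state-action pairs, which lets the improvement step stay fully greedy.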
| Algorithm | Environment | On/Off-Policy | Key Features |
|---|---|---|---|
| SARSA | GridWorld | On-policy | ε-greedy, TD(0) |
| Q-Learning (On-Policy) | GridWorld | On-policy | ε-greedy, max Q-value |
| Q-Learning (Off-Policy) | GridWorld | Off-policy | Experience replay |
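The difference between the two TD methods comes down to the bootstrap target. A side-by-side sketch of the core updates (function names and the `alpha`/`gamma` defaults are illustrative, not the repo's API): SARSA bootstraps from the action actually taken next, Q-learning from the greedy (max) action, which is what makes it off-policy.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, ns, na, alpha=0.1, gamma=0.9):
    # On-policy TD(0): target uses the next action na sampled from the behavior policy.
    Q[(s, a)] += alpha * (r + gamma * Q[(ns, na)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, ns, actions, alpha=0.1, gamma=0.9):
    # Off-policy TD(0): target uses the greedy action, regardless of what was taken.
    target = r + gamma * max(Q[(ns, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
q_learning_update(Q, 0, 1, 1.0, 1, actions=[0, 1])
print(Q[(0, 1)])  # 0.1 = 0.1 * (1.0 + 0.9 * 0 - 0)
```

Because the Q-learning target is independent of the behavior policy, its transitions can be stored and replayed later, which is what the experience-replay variant exploits.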
| Algorithm | Environment | Function Approximator | Key Features |
|---|---|---|---|
| Q-Learning (FA) | GridWorld | Linear | One-hot features |
| SARSA (FA) | GridWorld | Linear | One-hot features |
| DQN | LunarLander-v2 | Neural Network | Experience replay, target network |
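For the linear case, the semi-gradient Q-learning update can be sketched as follows; with one-hot state-action features, the weight update reduces exactly to the tabular rule. State/action counts and step sizes here are invented for illustration.

```python
import numpy as np

N_S, N_A, ALPHA, GAMMA = 4, 2, 0.1, 0.9
w = np.zeros(N_S * N_A)  # one weight per (state, action) pair

def phi(s, a):
    """One-hot feature vector for the (state, action) pair."""
    f = np.zeros(N_S * N_A)
    f[s * N_A + a] = 1.0
    return f

def q(s, a):
    return float(w @ phi(s, a))  # linear approximation q(s, a) = w . phi(s, a)

def update(s, a, r, ns):
    global w
    # Semi-gradient step: the target is treated as a constant.
    target = r + GAMMA * max(q(ns, b) for b in range(N_A))
    w = w + ALPHA * (target - q(s, a)) * phi(s, a)

update(0, 1, 1.0, 1)
print(q(0, 1))  # 0.1 after one step from zero weights
```

DQN keeps the same semi-gradient target but swaps the linear model for a neural network, samples minibatches from a replay buffer, and computes the target with a periodically synced copy of the network to stabilize training.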
| Algorithm | Environment | Key Features |
|---|---|---|
| REINFORCE | GridWorld | Monte Carlo policy gradient, return normalization |
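The REINFORCE update with return normalization can be sketched on a deliberately tiny problem: a one-step, two-action bandit (action 1 pays +1, action 0 pays 0) with a softmax policy. Everything here (environment, batch size, learning rate) is invented for illustration; the repo's GridWorld version follows the same gradient.

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]  # softmax policy logits, one per action
ALPHA = 0.1

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

for _ in range(500):
    # Collect a small batch of one-step episodes.
    batch = []
    for _ in range(16):
        p = softmax(theta)
        a = 0 if random.random() < p[0] else 1
        batch.append((a, 1.0 if a == 1 else 0.0))
    returns = [g for _, g in batch]
    mean = sum(returns) / len(returns)
    std = (sum((g - mean) ** 2 for g in returns) / len(returns)) ** 0.5 + 1e-8
    for a, g in batch:
        adv = (g - mean) / std  # return normalization reduces gradient variance
        p = softmax(theta)
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - p[i]  # d log pi(a) / d theta_i
            theta[i] += ALPHA * adv * grad / len(batch)

prob_best = softmax(theta)[1]
print(prob_best)  # probability of the rewarded action climbs toward 1
```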
| Algorithm | Environment | Critic Type | Key Features |
|---|---|---|---|
| QAC | LunarLander-v2 | Q-value | SARSA-style TD error |
| A2C | LunarLander-v2 | State-value | Advantage estimation |
| A2C Off-Policy | LunarLander-v2 | State-value | Importance sampling |
| PPO | LunarLander-v2 | State-value | Clipped objective, GAE |
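The two pieces named in the PPO row can be written compactly. Below is a scalar pure-Python sketch of the clipped surrogate objective and of generalized advantage estimation (GAE); the repo's implementation presumably operates on batched tensors, and the function names here are illustrative.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: ratio = pi_new(a|s) / pi_old(a|s), eps = clip range."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min makes the objective pessimistic: large policy moves
    # cannot increase it, which bounds the update size.
    return min(ratio * advantage, clipped * advantage)

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory (backward pass)."""
    adv, last, v_next = [0.0] * len(rewards), 0.0, next_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * v_next - values[t]  # one-step TD error
        last = delta + gamma * lam * last  # exponentially weighted sum of deltas
        adv[t] = last
        v_next = values[t]
    return adv
```

Setting `lam=0` recovers the one-step TD advantage used by plain A2C, while `lam=1` recovers the full Monte Carlo advantage; intermediate values trade bias against variance.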
Recommended order for learning:
- Lecture 4: Start with Dynamic Programming (Policy Iteration → Value Iteration)
- Lecture 5: Learn Monte Carlo methods (Basic → ε-Greedy → Exploring Starts)
- Lecture 7: Understand Temporal Difference learning (SARSA → Q-Learning)
- Lecture 8: Explore function approximation (Linear → Deep with DQN)
- Lecture 9: Study policy gradient methods (REINFORCE)
- Lecture 10: Master Actor-Critic methods (QAC → A2C → PPO)
- Course: Mathematical Foundation of Reinforcement Learning