
Changelog

[v1.0.0] - 2023-05-13

Major Framework Transformation: LM Human Preferences → Gymformer

  • Complete rebranding from "LM Human Preferences" to "Gymformer" to better reflect the framework's capabilities
  • Fully embraced Gymnasium as the unified interface for all environments
  • Created a modular architecture with extensible components:
    • env: Added CoTEnv for Chain-of-Thought math reasoning alongside RLHF
    • lm: Reorganized reward model implementations for better interoperability
    • rl: Enhanced PPO implementation with direct transformer model support

New Features

  • Custom language model environments now fully compatible with Gymnasium API
  • Unified training interface through train_agent() for all environment types (see the usage sketch after this list)
  • Support for both language tasks (RLHF, CoT) and classic RL problems with the same API
  • Simplified extension points for creating custom environments and models
  • Enhanced documentation with code examples for multiple use cases
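
The following is a minimal usage sketch of the unified interface described above. The train_agent() name comes from this release, but its import path, parameters, and the environment IDs shown are illustrative assumptions rather than the published API.

```python
# Hypothetical usage sketch: the module path, the train_agent() signature,
# and the CoT environment ID are assumptions, not the published API.
import gymnasium as gym

from gymformer.rl import train_agent  # assumed import path

# Classic RL problem: any registered Gymnasium environment works.
cartpole_env = gym.make("CartPole-v1")
train_agent(cartpole_env, total_steps=50_000)

# Language task: a custom RLHF or CoT environment exposing the same
# Gymnasium API (reset/step plus observation_space and action_space).
cot_env = gym.make("gymformer/CoTMath-v0")  # hypothetical env ID
train_agent(cot_env, total_steps=10_000)
```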

Technical Changes

  • Completely refactored directory structure for better modularity
  • Improved environment registration system for custom environments
  • Better reward model implementation with clearer integration points
  • Standardized observation and action space handling for language models (a minimal environment sketch follows this list)
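
To make the registration and space-handling points concrete, here is a small sketch of a custom text environment registered with Gymnasium. Only the Gymnasium calls (gym.Env, spaces.Text, gym.register, gym.make) are standard API; the class name, environment ID, and reward logic are illustrative assumptions.

```python
import gymnasium as gym
from gymnasium import spaces


class EchoTextEnv(gym.Env):
    """Toy language environment: the agent is rewarded for echoing the prompt."""

    def __init__(self, max_length: int = 64):
        super().__init__()
        # String-valued observation and action spaces, as a stand-in for
        # the standardized language-model space handling described above.
        self.observation_space = spaces.Text(max_length=max_length)
        self.action_space = spaces.Text(max_length=max_length)
        self._prompt = "hello"

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self._prompt, {}

    def step(self, action: str):
        reward = 1.0 if action == self._prompt else 0.0
        # observation, reward, terminated, truncated, info
        return self._prompt, reward, True, False, {}


# Register under a namespaced ID so gym.make() can construct it.
gym.register(id="gymformer/EchoText-v0", entry_point=EchoTextEnv)
env = gym.make("gymformer/EchoText-v0")
obs, info = env.reset()
```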

[v0.last] - 2025-04-23

Legacy architecture final state

  • This commit marks the last state of the original architecture before a major migration.
  • All further development will move towards a new architecture, which may introduce breaking changes.
  • The v0.last tag and branch serve as a reference point for the legacy codebase and for any hotfixes or maintenance required on the old architecture.

Architecture Overview

  • The core architecture uses a custom reward model implemented as a subclass of torch.nn.Module, trained on human preference data (a rough sketch follows this list).
  • This reward model is integrated with the TRL (Transformer Reinforcement Learning) PPO (Proximal Policy Optimization) trainer to fine-tune language models.
  • The system is modular and reproducible, with consistent seed and device management, and comprehensive experiment tracking using Weights & Biases.
  • Configuration is managed through YAML files for both reward model and language model training.
  • The codebase is organized for extensibility and maintainability, following clear inheritance and modular design patterns.
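
As a rough illustration of the reward-model design described above, the sketch below shows the general shape of a preference-trained reward model as a torch.nn.Module. The class name, backbone choice, pooling strategy, and loss are assumptions about the legacy code, not a copy of it, and the TRL PPO integration itself is not reproduced here.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class PreferenceRewardModel(nn.Module):
    """Scores a response with a scalar; trained on pairwise human preference data."""

    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Summarize each sequence by its last non-padded token, then
        # project that hidden state to a scalar reward.
        last_index = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_index]
        return self.value_head(pooled).squeeze(-1)


def preference_loss(chosen_rewards, rejected_rewards):
    # Pairwise (Bradley-Terry style) objective: the human-preferred
    # response should receive a higher score than the rejected one.
    return -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
```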

Planned Milestone

1.X: Performance-focused implementation