Releases · alibaba/RecIS
Release v1.1.0
We are excited to announce the release of RecIS v1.1.0. This version marks a significant milestone with the introduction of Model Bank 1.0, native ROCm support, and substantial performance optimizations for large-scale embedding tables.
🌟 Key Highlights
| Category | Description |
|---|---|
| 🏆 Framework | Model Bank 1.0 officially arrives; New Negative Sampler and RTP Exporter support. |
| ⚡ Performance | Introduction of Auto-resizing Hash Tables and Fused AdamW TF CUDA operations. |
| 🌐 Compatibility | Expanded hardware support to AMD ROCm; fixed kernel launches on non-NVIDIA devices. |
| 🛡️ Robustness | Improved multi-node synchronization and robust handling for empty tensor edge cases. |
📝 Detailed Changelog
Bug Fixes
- checkpoint: fix mos version format; update to use openlm api (854bbb3)
- checkpoint: refine torch_rank_weights_embs_table_multi_shard.json format (d5e7a5c)
- checkpoint: work around save bug; handle xpfs model path (ae99728)
- embedding: fix empty kernel launch in non-nvidia device (2e310d0)
- embedding: fix insert when size == 1 (7702c9e)
- framework: add an option for algo_config for export (0ad4c3f)
- framework: fix bugs of invalid index, grad accumulation; add clear child feat (1e7acf9)
- framework: fix eval in trainer (676a053)
- framework: fix fg && exporter bugs (3964ce2)
- framework: fix load extra info not in ckpt (a64cd00)
- framework: fix loss backward (7d9a41b)
- framework: fix some bug of model bank (be196db)
- framework: fix window io failover (cde3049)
- framework: reset io state when start another epoch (f918f24)
- io: fix batch_convert row_splits when dataset read empty data (44661ab)
- io: fix None data when window switch (e788b4d)
- io: fix odps import bug (7c13f09)
- io: use openstorage get_table_size directly (d5c0952)
- ops: fix bug in fast atomic operations (fea8d47)
- ops: fix dense_to_ragged op when check_invalid=False (#14) (300a77b)
- ops: fix edge cases for empty tensors and improve CUDA kernel handling (794be12)
- ops: fix emb segment reduce mean op (3f82b9c)
- ops: handle empty tensor inputs in ragged ops (a39fc2a)
- optimizer: step add 1 should be in-place (cdb3632)
- serialize: fix bug of file sync of multi node (822af49)
- serialize: fix bug of load tensor (e25eee4)
- serialize: fix bug when load by oname (e5ca3d7)
- serialize: fix bug when tensor num < parallel num (a02aded)
- tools: fix torch_fx_tool string format (1d426f8)
Features
- checkpoint: add label for ckpt (5436b5b)
- checkpoint: load dense optimizer by named_parameters (a07dbaf)
- docs: add model bank docs (ff0d23e)
- embedding: add monitor for ids/embs (2f268eb)
- embedding: expose methods to retrieve child ids and embs from the coalesced hashtable; fix clear method of hashtable (b5de207)
- framework,checkpoint: change checkpointmanager to save/load hooks (eb3b441)
- framework: [internal] add negative sampler (8c21517)
- framework: add exporter for rtp (b8af849)
- framework: add skip option in model bank (00828ce)
- framework: add some utility to RaggedTensor (78eca0a)
- framework: add window_iter for window pipeline (87886a0)
- framework: collect eval result for hooks and fix after_data bug (81d3723)
- framework: enable amp by options (db5bbe7)
- framework: impl-independent monitor (24a1631)
- framework: model bank 1.0 (488672b)
- framework: support filter hashtable for saver, update hook for window, fix metric (01eb2ae)
- io: add adaptor filter by scene (c3e6738)
- io: add new dedup option for neg sampler (61b2cb7)
- io: add standard fg for input features (2deedff)
- ops: add fused AdamW TF CUDA operation (05dba24)
- ops: add parse_sample_id ops (78674cd)
- packaging: support ROCm (7a626d3)
- serialize: update load metric interface (66b085d)
- update column-io to support ROCm device (7907158)
Performance Improvements
- embedding: use auto-resizing hash table (2f53f53)
Release v1.0.0
🎉 Initial Release
RecIS (Recommendation Intelligence System) v1.0.0 is now officially released! RecIS is a unified-architecture deep learning framework designed specifically for ultra-large-scale sparse models, built on the PyTorch open-source ecosystem. It is widely used in advertising, recommendation, search, and other scenarios at Alibaba.
✨ New Features
Core Architecture
- ColumnIO: Data Reading
- Supports distributed sharded data reading
- Performs simple feature pre-computation during the reading phase
- Assembles samples into Torch Tensors and provides data prefetching functionality
- Feature Engine: Feature Processing
- Provides feature engineering and feature transformation processing capabilities, including Hash / Mod / Bucketize, etc.
- Supports automatic operator fusion optimization strategies
- Embedding Engine: Embedding Management and Computing
- Provides conflict-free, scalable KV storage embedding tables
- Provides multi-table fusion optimization capabilities for better memory access performance
- Supports feature admission and eviction strategies
- Saver: Parameter Saving and Loading
- Provides sparse parameter storage and delivery capabilities in SafeTensors standard format
- Pipelines: Training Process Orchestration
- Connects the above components and encapsulates training processes
- Supports complex training workflows such as multi-stage (training/testing interleaved) and multi-objective computation
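To make the orchestration concrete, here is a minimal sketch in plain PyTorch of how the stages above could fit together for one training step. All names here (the model, the feature columns, the lookup) are illustrative assumptions, not the actual RecIS APIs; see the documentation for the real interfaces.

```python
import torch
import torch.nn.functional as F

def train_step(batch, embedding, model, optimizer):
    # Feature Engine-style transform: hash/mod raw ids into the table's id space.
    ids = batch["item_id"] % embedding.num_embeddings
    # Embedding Engine-style lookup.
    embs = embedding(ids)
    # Multi-objective computation: sum of per-task losses (e.g. CTR + CVR).
    ctr_logit, cvr_logit = model(embs)
    loss = F.binary_cross_entropy_with_logits(ctr_logit, batch["click"]) + \
           F.binary_cross_entropy_with_logits(cvr_logit, batch["conversion"])
    # The Saver and Pipelines components would checkpoint and orchestrate around
    # many such steps (multi-stage training/testing, prefetching, etc.).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```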
🛠️ Installation & Compatibility
System Requirements
- Python: 3.10+
- PyTorch: 2.4+
- CUDA: 12.4
Installation Methods
- Docker Installation: Pre-built Docker images for PyTorch 2.4.0/2.5.1/2.6.0
- Source Installation: Complete build system with CMake and setuptools
Dependencies
- torch>=2.4
- accelerate==0.29.2
- simple-parsing
- pyarrow (for ORC support)
📚 Documentation
- Complete English and Chinese documentation
- Quick start tutorials with CTR model examples
- Comprehensive API reference
- Installation guides for different environments
- FAQ and troubleshooting guides
📦 Package Structure
- Core Library: recis/ - Main framework code
- C++ Extensions: csrc/ - High-performance C++ implementations
- Documentation: docs/ - Comprehensive documentation in RST format
- Examples: examples/ - Practical usage examples
- Tools: tools/ - Data conversion and utility tools
- Tests: tests/ - Comprehensive test suite
🚀 Key Optimizations
Efficient Dynamic Embedding
The RecIS framework implements efficient dynamic embedding (HashTable) through a two-level storage architecture:
- IDMap: Serves as first-level storage, using feature ID as key and Offset as value
- EmbeddingBlocks:
- Serves as second-level storage, continuous sharded memory blocks for storing embedding parameters and optimizer states
- Supports dynamic sharding with flexible expansion capabilities
- Flexible Hardware Adaptation Strategy: Supports both GPU and CPU placement for IDMap and EmbeddingBlocks
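As a rough illustration of this two-level design, the sketch below mimics an IDMap plus EmbeddingBlocks in pure PyTorch. The real implementation lives in C++/CUDA; the class and method names here are hypothetical.

```python
import torch

class DynamicEmbedding(torch.nn.Module):
    """Toy two-level hashtable embedding: IDMap (dict) + EmbeddingBlocks (shards)."""

    def __init__(self, dim: int, block_rows: int = 1024):
        super().__init__()
        self.dim = dim
        self.block_rows = block_rows
        self.id_map: dict[int, int] = {}        # IDMap: feature id -> offset
        self.blocks = torch.nn.ParameterList()  # EmbeddingBlocks: sharded storage

    def _ensure_capacity(self, rows_needed: int) -> None:
        # Dynamic sharding: append a new block whenever capacity runs out.
        while rows_needed > len(self.blocks) * self.block_rows:
            self.blocks.append(
                torch.nn.Parameter(torch.randn(self.block_rows, self.dim) * 0.01))

    def lookup(self, ids: torch.Tensor) -> torch.Tensor:
        offsets = []
        for fid in ids.tolist():
            if fid not in self.id_map:          # admit a new feature id on first sight
                self.id_map[fid] = len(self.id_map)
            offsets.append(self.id_map[fid])
        self._ensure_capacity(len(self.id_map))
        rows = []
        for off in offsets:
            block, row = divmod(off, self.block_rows)
            rows.append(self.blocks[block][row])
        return torch.stack(rows)
```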
Distributed Optimization
- Parameter Aggregation and Sharding:
- During the model creation phase, merges parameter tables with identical properties (dimensions, initializers, etc.) into one logical table
- Parameters are evenly distributed across compute nodes
- Request Merging and Splitting:
- During forward computation, merges requests for parameter tables with identical properties and computes sharding information with deduplication
- Obtains embedding vectors from various compute nodes through All-to-All collective communication
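The sketch below walks through this flow with plain torch.distributed. Modulo sharding, the local row layout, and the tensor dtypes are simplifying assumptions for illustration, not the RecIS internals.

```python
import torch
import torch.distributed as dist

def distributed_lookup(ids: torch.Tensor, local_table: torch.Tensor) -> torch.Tensor:
    # Assumes dist.init_process_group() has already been called on every rank.
    world = dist.get_world_size()
    dim = local_table.shape[1]
    # 1) Merge and deduplicate the request: each unique id is fetched only once.
    uniq, inverse = torch.unique(ids, return_inverse=True)
    # 2) Route each unique id to the rank that owns its shard (modulo sharding).
    send_ids = [uniq[uniq % world == r].contiguous() for r in range(world)]
    # 3) Exchange request sizes, then the ids themselves (All-to-All).
    send_counts = torch.tensor([t.numel() for t in send_ids])
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    recv_ids = [torch.empty(int(n), dtype=uniq.dtype) for n in recv_counts]
    dist.all_to_all(recv_ids, send_ids)
    # 4) Look up locally owned rows and send the embeddings back (All-to-All).
    send_embs = [local_table[i // world] for i in recv_ids]
    recv_embs = [torch.empty(t.numel(), dim) for t in send_ids]
    dist.all_to_all(recv_embs, send_embs)
    # 5) Scatter results back into the deduplicated order, then expand duplicates.
    uniq_embs = torch.empty(uniq.numel(), dim)
    for r in range(world):
        uniq_embs[uniq % world == r] = recv_embs[r]
    return uniq_embs[inverse]
```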
Efficient Hardware Resource Utilization
- GPU Concurrency Optimization:
- Supports feature processing operator fusion optimization, significantly reducing operator count and launch overhead
- Parameter Table Fusion Optimization:
- Supports merging parameter tables with identical properties, reducing feature lookup frequency, significantly decreasing operator count, and improving memory space utilization (see the sketch after this list)
- Operator Implementation Optimization:
- Operator implementations use vectorized memory access to improve memory bandwidth utilization
- Optimizes reduction operators through warp-level merging, reducing atomic operations and improving memory access efficiency
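As a concrete illustration of the parameter-table fusion idea, the sketch below coalesces several logical tables of the same dimension into one physical table so that one batched lookup replaces many small ones. The table sizes and names are hypothetical; this is not RecIS code.

```python
import torch

dim = 16
table_sizes = [1000, 500, 2000]                    # three logical tables, same embedding dim
offsets = torch.tensor([0, 1000, 1500])            # row offset of each logical table
fused = torch.nn.Embedding(sum(table_sizes), dim)  # one physical (coalesced) table

def fused_lookup(ids_per_table: list[torch.Tensor]) -> list[torch.Tensor]:
    # Shift each table's ids by its offset and issue ONE lookup instead of three,
    # which cuts operator count and kernel-launch overhead.
    shifted = torch.cat([ids + off for ids, off in zip(ids_per_table, offsets)])
    embs = fused(shifted)
    return list(embs.split([ids.numel() for ids in ids_per_table]))

# One batched lookup serves requests for all three logical tables.
out = fused_lookup([torch.tensor([1, 3]), torch.tensor([0]), torch.tensor([42, 7, 9])])
```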
🤝 Community & Support
- Open source under Apache 2.0 license
- Issue tracking and community support
- Active development by the XDL Team
For detailed usage instructions, please refer to our documentation and quick start guide.