Releases: alibaba/RecIS

Release v1.1.0

29 Dec 07:05
7456f88

We are excited to announce the release of RecIS v1.1.0. This version marks a significant milestone with the introduction of Model Bank 1.0, native ROCm support, and substantial performance optimizations for large-scale embedding tables.

🌟 Key Highlights

| Category | Description |
| --- | --- |
| 🏆 Framework | Model Bank 1.0 officially arrives; new Negative Sampler and RTP Exporter support. |
| ⚡ Performance | Introduction of auto-resizing hash tables and fused AdamW TF CUDA operations. |
| 🌐 Compatibility | Expanded hardware support for AMD ROCm; fixed kernel launches on non-NVIDIA devices. |
| 🛡️ Robustness | Improved multi-node synchronization and robust handling of empty-tensor edge cases. |

📝 Detailed Changelog

Bug Fixes

  • checkpoint: fix mos version format, update use openlm api (854bbb3)
  • checkpoint: refine torch_rank_weights_embs_table_multi_shard.json format (d5e7a5c)
  • checkpoint: work around save bug, deal with xpfs model path (ae99728)
  • embedding: fix empty kernel launch in non-nvidia device (2e310d0)
  • embedding: fix insert when size == 1 (7702c9e)
  • framework: add an option for algo_config for export (0ad4c3f)
  • framework: fix bugs of invalid index, grad accumulation; add clear child feat (1e7acf9)
  • framework: fix eval in trainer (676a053)
  • framework: fix fg && exporter bugs (3964ce2)
  • framework: fix load extra info not in ckpt (a64cd00)
  • framework: fix loss backward (7d9a41b)
  • framework: fix some bug of model bank (be196db)
  • framework: fix window io failover (cde3049)
  • framework: reset io state when start another epoch (f918f24)
  • io: fix batch_convert row_splits when dataset read empty data (44661ab)
  • io: fix None data when window switch (e788b4d)
  • io: fix odps import bug (7c13f09)
  • io: use openstorage get_table_size directly (d5c0952)
  • ops: fix bug in fast atomic operations (fea8d47)
  • ops: fix dense_to_ragged op when check_invalid=False (#14) (300a77b)
  • ops: fix edge cases for empty tensors and improve CUDA kernel handling (794be12)
  • ops: fix emb segment reduce mean op (3f82b9c)
  • ops: handle empty tensor inputs in ragged ops (a39fc2a)
  • optimizer: step add 1 should be in-place (cdb3632)
  • serialize: fix bug of file sync of multi node (822af49)
  • serialize: fix bug of load tensor (e25eee4)
  • serialize: fix bug when load by oname (e5ca3d7)
  • serialize: fix bug when tensor num < parallel num (a02aded)
  • tools: fix torch_fx_tool string format (1d426f8)

Features

  • checkpoint: add label for ckpt (5436b5b)
  • checkpoint: load dense optimizer by named_parameters (a07dbaf)
  • docs: add model bank docs (ff0d23e)
  • embedding: add monitor for ids/embs (2f268eb)
  • embedding: expose methods to retrieve child ids and embs from the coalesced hashtable; fix clear method of hashtable (b5de207)
  • framework,checkpoint: change checkpointmanager to save/load hooks (eb3b441)
  • framework: [internal] add negative sampler (8c21517)
  • framework: add exporter for rtp (b8af849)
  • framework: add skip option in model bank (00828ce)
  • framework: add some utility to RaggedTensor (78eca0a)
  • framework: add window_iter for window pipeline (87886a0)
  • framework: collect eval result for hooks and fix after_data bug (81d3723)
  • framework: enable amp by options (db5bbe7)
  • framework: impl-independent monitor (24a1631)
  • framework: model bank 1.0 (488672b)
  • framework: support filter hashtable for saver, update hook for window, fix metric (01eb2ae)
  • io: add adaptor filter by scene (c3e6738)
  • io: add new dedup option for neg sampler (61b2cb7)
  • io: add standard fg for input features (2deedff)
  • ops: add fused AdamW TF CUDA operation (05dba24)
  • ops: add parse_sample_id ops (78674cd)
  • packaging: support ROCm (7a626d3)
  • serialize: update load metric interface (66b085d)
  • update column-io to support ROCm device (7907158)

Performance Improvements

  • embedding: use auto-resizing hash table (2f53f53)

Release v1.0.0

18 Sep 13:39

🎉 Initial Release

RecIS (Recommendation Intelligence System) v1.0.0 is now officially released! RecIS is a unified-architecture deep learning framework designed specifically for ultra-large-scale sparse models, built on the PyTorch open-source ecosystem. It is widely used in Alibaba's advertising, recommendation, search, and other scenarios.

✨ New Features

Core Architecture

  • ColumnIO: Data Reading
    • Supports distributed sharded data reading
    • Completes simple feature pre-computation during the reading phase
    • Assembles samples into Torch Tensors and provides data prefetching functionality
  • Feature Engine: Feature Processing
    • Provides feature engineering and feature transformation processing capabilities, including Hash / Mod / Bucketize, etc.
    • Supports automatic operator fusion optimization strategies
  • Embedding Engine: Embedding Management and Computing
    • Provides conflict-free, scalable KV storage embedding tables
    • Provides multi-table fusion optimization capabilities for better memory access performance
    • Supports feature elimination and admission strategies
  • Saver: Parameter Saving and Loading
    • Provides sparse parameter storage and delivery capabilities in SafeTensors standard format
  • Pipelines: Training Process Orchestration
    • Connects the above components and encapsulates training processes
    • Supports complex training workflows such as multi-stage (training/testing interleaved) and multi-objective computation
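The five components above can be pictured as a single data path: read columns, transform features, look up embeddings, then train. The sketch below wires those stages together in plain Python; all names here are illustrative and do not mirror the actual RecIS API.

```python
# Hypothetical sketch of the ColumnIO -> Feature Engine -> Embedding Engine
# flow described above. None of these names are the real RecIS API.

def read_batch(rows):
    """ColumnIO stage: assemble raw rows into column-oriented batches."""
    batch = {}
    for row in rows:
        for col, val in row.items():
            batch.setdefault(col, []).append(val)
    return batch

def hash_bucketize(values, num_buckets):
    """Feature Engine stage: map raw feature values into bucket ids."""
    return [hash(v) % num_buckets for v in values]

class DynamicEmbedding:
    """Embedding Engine stage: a toy conflict-free KV embedding table."""
    def __init__(self, dim):
        self.dim = dim
        self.table = {}  # feature id -> embedding vector
    def lookup(self, ids):
        # Unseen ids are admitted with a fresh zero vector.
        return [self.table.setdefault(i, [0.0] * self.dim) for i in ids]

# One "pipeline" step wiring the stages together on a two-row batch.
rows = [{"user": "u1", "item": "i9"}, {"user": "u2", "item": "i9"}]
batch = read_batch(rows)
user_ids = hash_bucketize(batch["user"], num_buckets=1024)
emb = DynamicEmbedding(dim=4)
user_embs = emb.lookup(user_ids)
```

In the real framework the Pipelines component also handles prefetching, multi-stage scheduling, and checkpointing, which this toy loop omits.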

🛠️ Installation & Compatibility

System Requirements

  • Python: 3.10+
  • PyTorch: 2.4+
  • CUDA: 12.4

Installation Methods

  • Docker Installation: Pre-built Docker images for PyTorch 2.4.0/2.5.1/2.6.0
  • Source Installation: Complete build system with CMake and setuptools

Dependencies

  • torch>=2.4
  • accelerate==0.29.2
  • simple-parsing
  • pyarrow (for ORC support)

📚 Documentation

  • Complete English and Chinese documentation
  • Quick start tutorials with CTR model examples
  • Comprehensive API reference
  • Installation guides for different environments
  • FAQ and troubleshooting guides

📦 Package Structure

  • Core Library: recis/ - Main framework code
  • C++ Extensions: csrc/ - High-performance C++ implementations
  • Documentation: docs/ - Comprehensive documentation in RST format
  • Examples: examples/ - Practical usage examples
  • Tools: tools/ - Data conversion and utility tools
  • Tests: tests/ - Comprehensive test suite

🚀 Key Optimizations

Efficient Dynamic Embedding

The RecIS framework implements efficient dynamic embedding (HashTable) through a two-level storage architecture:

  • IDMap: Serves as first-level storage, using feature ID as key and Offset as value
  • EmbeddingBlocks:
    • Serves as second-level storage, continuous sharded memory blocks for storing embedding parameters and optimizer states
    • Supports dynamic sharding with flexible expansion capabilities
  • Flexible Hardware Adaptation Strategy: Supports both GPU and CPU placement for IDMap and EmbeddingBlocks
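The two-level layout above can be sketched as a dictionary pointing into a list of fixed-size memory blocks. The block size and growth policy below are assumptions for illustration, not RecIS's actual values.

```python
# Minimal sketch of the two-level HashTable storage described above:
# IDMap (feature id -> offset) over EmbeddingBlocks (sharded memory).

class DynamicHashTable:
    def __init__(self, dim, block_rows=4):
        self.dim = dim
        self.block_rows = block_rows  # rows per EmbeddingBlock shard
        self.id_map = {}              # level 1: feature id -> offset
        self.blocks = []              # level 2: list of memory blocks

    def _ensure_capacity(self, offset):
        # Grow by appending a new fixed-size block (dynamic sharding).
        while offset >= len(self.blocks) * self.block_rows:
            self.blocks.append(
                [[0.0] * self.dim for _ in range(self.block_rows)]
            )

    def lookup(self, feature_id):
        # Admission: an unseen id gets the next free offset, so there
        # are no hash conflicts between distinct feature ids.
        if feature_id not in self.id_map:
            offset = len(self.id_map)
            self._ensure_capacity(offset)
            self.id_map[feature_id] = offset
        offset = self.id_map[feature_id]
        block, row = divmod(offset, self.block_rows)
        return self.blocks[block][row]

table = DynamicHashTable(dim=8)
vec = table.lookup(12345)  # first lookup admits the id
```

Because capacity grows one block at a time, expansion never copies existing embeddings, which is the property that makes the real EmbeddingBlocks design cheap to scale.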

Distributed Optimization

  • Parameter Aggregation and Sharding:
    • During model creation phase, merges parameter tables with identical properties (dimensions, initializers, etc.) into one logical table
    • Parameters are evenly distributed across compute nodes
  • Request Merging and Splitting:
    • During forward computation, merges requests for parameter tables with identical properties and computes sharding information with deduplication
    • Obtains embedding vectors from various compute nodes through All-to-All collective communication
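The merge/dedup step can be shown with a toy sharding plan. The mod-based shard rule and the simulated exchange below are assumptions for illustration; the real implementation uses All-to-All collective communication rather than a local loop.

```python
# Sketch of the request dedup + sharding step described above.

def plan_lookup(ids, num_workers):
    """Dedup the requested ids and bucket each unique id to its owner."""
    unique = sorted(set(ids))                 # deduplication
    per_worker = [[] for _ in range(num_workers)]
    for i in unique:
        per_worker[i % num_workers].append(i)  # assumed rule: id mod workers
    return unique, per_worker

def all_to_all_lookup(per_worker, tables):
    """Simulated all-to-all: each worker answers the lookups it owns."""
    result = {}
    for w, ids in enumerate(per_worker):
        for i in ids:
            result[i] = tables[w][i]
    return result

# Two workers, each holding its shard of the (merged) logical table.
tables = [{0: [0.1], 2: [0.2]}, {1: [0.3], 3: [0.4]}]
unique, plan = plan_lookup([1, 3, 1, 0], num_workers=2)
embs = all_to_all_lookup(plan, tables)
```

Deduplicating before the exchange means each unique id crosses the network once, no matter how often it repeats in the batch.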

Efficient Hardware Resource Utilization

  • GPU Concurrency Optimization:
    • Supports feature processing operator fusion optimization, significantly reducing operator count and launch overhead
  • Parameter Table Fusion Optimization:
    • Supports merging parameter tables with identical properties, reducing feature lookup frequency, significantly decreasing operator count, and improving memory space utilization efficiency
  • Operator Implementation Optimization:
    • Operator implementations use vectorized memory access to improve memory utilization
    • Optimizes reduction operators through warp-level merging, reducing atomic operations and improving memory access utilization
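The warp-level merging idea can be illustrated without CUDA: instead of issuing one atomic update per element, contiguous elements sharing a segment id are accumulated locally and written once. This is a pure-Python stand-in for the pattern, not RecIS's kernel.

```python
# Two ways to compute a segment sum, illustrating why merging before
# writing reduces contention on the output.

def segment_sum_atomic_style(values, segment_ids, num_segments):
    out = [0.0] * num_segments
    for v, s in zip(values, segment_ids):
        out[s] += v                     # one "atomic" update per element
    return out

def segment_sum_merged(values, segment_ids, num_segments):
    out = [0.0] * num_segments
    i = 0
    while i < len(values):
        # Accumulate a contiguous run sharing one segment id locally...
        s, acc = segment_ids[i], 0.0
        while i < len(values) and segment_ids[i] == s:
            acc += values[i]
            i += 1
        out[s] += acc                   # ...then write a single update
    return out

vals = [1.0, 2.0, 3.0, 4.0]
segs = [0, 0, 1, 1]
```

On a GPU the "run" is a warp's worth of lanes merged with shuffle intrinsics, so the number of atomic operations drops by up to the warp width.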

🤝 Community & Support

  • Open source under Apache 2.0 license
  • Issue tracking and community support
  • Active development by XDL Team

For detailed usage instructions, please refer to our documentation and quick start guide.