
Commit 45223e0

duduyi2013 authored and facebook-github-bot committed

chunking opt split and fix duplicate flush (#3042)
Summary: Pull Request resolved: #3042. X-link: pytorch/FBGEMM#4260. X-link: facebookresearch/FBGEMM#1338

Changes:
1. In ZeroCollisionKeyValueEmbedding, split_embedding_weights forced a flush on every call. Remove that so the cached weights are reused within the same global step.
2. For the split embedding optimizer, rocksdb has to read the whole value part (embedding + optimizer) into DRAM. Without chunking, everything is read at once, causing a temporarily huge memory spike; with chunked loading, the spike stays low.

Reviewed By: steven1327, emlin

Differential Revision: D75988991

fbshipit-source-id: 414fab2aad45e05e1da12f95a7ab99fb82c4f8aa
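The chunked-loading idea from change 2 can be sketched as below. This is a hypothetical illustration, not the actual FBGEMM/torchrec API: the `read_range` callback and `chunk_rows` parameter are invented stand-ins for whatever range-read primitive the rocksdb backend exposes. The point is that a generator bounds peak DRAM usage by `chunk_rows` instead of the full table size.

```python
from typing import Callable, Iterator, List


def load_all_at_once(read_range: Callable[[int, int], List[int]],
                     total_rows: int) -> List[int]:
    # One bulk read: peak memory is proportional to the whole
    # (embedding + optimizer) value column.
    return read_range(0, total_rows)


def load_in_chunks(read_range: Callable[[int, int], List[int]],
                   total_rows: int,
                   chunk_rows: int) -> Iterator[List[int]]:
    # Same rows, read chunk by chunk: peak memory is proportional
    # to chunk_rows, so the transient spike stays low.
    for start in range(0, total_rows, chunk_rows):
        end = min(start + chunk_rows, total_rows)
        yield read_range(start, end)


backing = list(range(10))  # stand-in for the rocksdb value column

def read_range(start: int, end: int) -> List[int]:
    return backing[start:end]

chunks = list(load_in_chunks(read_range, len(backing), chunk_rows=4))
# 3 chunks of at most 4 rows, instead of one 10-row read
```

Consumers can process and release each chunk before requesting the next, which is what keeps the spike low.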
1 parent e21e2ed · commit 45223e0

File tree

1 file changed: +1 −2 lines changed


torchrec/distributed/batched_embedding_kernel.py

Lines changed: 1 addition & 2 deletions
@@ -1435,7 +1435,6 @@ def _init_sharded_split_embedding_weights(
 
         pmt_list, weight_ids_list, bucket_cnt_list = self.split_embedding_weights(
             no_snapshot=False,
-            should_flush=True,
         )
         emb_table_config_copy = copy.deepcopy(self._config.embedding_tables)
         for emb_table in emb_table_config_copy:
@@ -1528,7 +1527,7 @@ def purge(self) -> None:
 
     # pyre-ignore [15]
     def split_embedding_weights(
-        self, no_snapshot: bool = True, should_flush: bool = True
+        self, no_snapshot: bool = True, should_flush: bool = False
     ) -> Tuple[
         Union[List[PartiallyMaterializedTensor], List[torch.Tensor]],
         Optional[List[torch.Tensor]],
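The effect of flipping the default can be sketched with a simplified stand-in for the method (the real signature returns tensors; this toy version just reports whether a flush would happen). After this patch, a plain call reuses cached weights, and callers that still need a fresh flush must opt in explicitly:

```python
def split_embedding_weights(no_snapshot: bool = True,
                            should_flush: bool = False) -> dict:
    # Toy stand-in: records whether the call would trigger a flush.
    return {"no_snapshot": no_snapshot, "flushed": should_flush}


# Default call no longer flushes, so cached weights from the same
# global step are reused:
default_call = split_embedding_weights()

# Callers that need freshly flushed weights opt in explicitly:
explicit_call = split_embedding_weights(should_flush=True)
```

This matches change 1 above: `_init_sharded_split_embedding_weights` drops its explicit `should_flush=True`, so checkpointing paths that hit it within the same step avoid a duplicate flush.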
