Due to the behavior of the CUDACachingAllocator, `record_stream` leads to a delayed memory free, which has a significant impact on peak memory (see the known FSDP1 issue caused by `recordStream`). In PyTorch 2.8+, c10d has removed all `recordStream` calls from collective communication. Given that, are there plans to remove `record_stream` here as well and use a reference stash to handle the multi-stream scenario?
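
For context, here is a minimal sketch of the two patterns as I understand them (the `stash` container and function names are illustrative, not from any library): with `record_stream`, the allocator defers reuse of the block until the recorded stream's pending work completes, and it only checks this lazily at later allocations, so the effective free is late; with a reference stash, the tensor is kept alive on the Python side and dropped eagerly once an event recorded on the consumer stream has completed.

```python
import torch

def with_record_stream():
    # Pattern A: record_stream. After `buf` is dropped, the caching
    # allocator will not reuse its block until all work enqueued on
    # `side` (as of the free) has completed -- and this is only checked
    # lazily at later allocations, so the effective free is delayed.
    side = torch.cuda.Stream()
    buf = torch.empty(1 << 20, device="cuda")
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        out = buf * 2
        buf.record_stream(side)  # allocator now tracks `side` for this block
    del buf  # block is NOT immediately reusable by other allocations
    # (synchronizing `out` back to the current stream is omitted for brevity)
    return out

def with_ref_stash(stash):
    # Pattern B: reference stash. Keep `buf` alive explicitly until an
    # event recorded on the consumer stream completes, then drop the
    # reference so the block returns to the pool promptly.
    side = torch.cuda.Stream()
    buf = torch.empty(1 << 20, device="cuda")
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        out = buf * 2
    done = torch.cuda.Event()
    done.record(side)
    stash.append((done, buf))  # hold the reference instead of record_stream
    # Reap completed entries: dropping the reference frees the block now,
    # at a point the caller controls.
    stash[:] = [(e, t) for (e, t) in stash if not e.query()]
    return out
```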