Conversation

@kaiyuan-li
Contributor

Before this change, when fetching a tensor, we always fetch the whole tensor that's stored on each volume, e.g.
Full Tensor: [[0, 1], [2, 3]]
Storage Volume 0: [[0, 1]]
Storage Volume 1: [[2, 3]]

When we want to get a tensor slice of shape [2, 1] and offset [0, 0], the get() method first fetches the full tensor [[0, 1], [2, 3]] and then extracts the first column from it. This is not efficient.

After this change, we only fetch [[0]] from volume 0 and [[2]] from volume 1 and assemble them together.

Added an example/dtensor.py to help with development. We can remove it later.
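
To make the idea concrete, here is a minimal, self-contained sketch of the new behavior (plain torch, not the actual torchstore API; the volume layout and helper logic below are illustrative assumptions):

    import torch

    # Toy setup matching the example above: each "volume" stores one row
    # of the full tensor together with that shard's global offsets.
    full = torch.tensor([[0, 1], [2, 3]])
    volumes = {0: (full[0:1, :], (0, 0)),   # volume 0 stores [[0, 1]] at offset (0, 0)
               1: (full[1:2, :], (1, 0))}   # volume 1 stores [[2, 3]] at offset (1, 0)

    req_shape, req_offsets = (2, 1), (0, 0)  # requested slice: the first column
    out = torch.empty(req_shape, dtype=full.dtype)

    for stored, off in volumes.values():
        # Per-dimension overlap between the request and this volume's shard.
        starts = [max(o, r) for o, r in zip(off, req_offsets)]
        stops = [min(o + s, r + q) for o, s, r, q in
                 zip(off, stored.shape, req_offsets, req_shape)]
        if any(b <= a for a, b in zip(starts, stops)):
            continue  # this volume holds nothing we need
        # Copy only the overlapping region: [[0]] from volume 0, [[2]] from volume 1.
        local = stored[tuple(slice(a - o, b - o) for a, b, o in zip(starts, stops, off))]
        out[tuple(slice(a - r, b - r) for a, b, r in zip(starts, stops, req_offsets))] = local

    print(out)  # tensor([[0], [2]])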

@kaiyuan-li kaiyuan-li requested a review from LucasLLC September 19, 2025 16:22
@meta-cla meta-cla bot added the CLA Signed label Sep 19, 2025
) -> torch.Tensor:
"""Fetches slices from all volume storages and stitch together to return the whole tensor"""

# dtensor_slice = None
Contributor Author

This is a switch for me to test the old (dtensor_slice=None) and new code paths during development.


return assembled_tensor

def _compute_slice_intersection(
Contributor

I've been putting these in utils. We could consider creating a dtensor_utils file.

stored_tensor, stored_slice, request.tensor_slice
)

if extracted_tensor is not None:
Contributor

What does it mean for the extracted tensor to be None here?

Contributor Author

The workflow has 2 steps:

  1. In the client, we try to find the overlapping tensor slice.
  2. We send the overlapping tensor slice to the volume. When the volume is fetching, it uses the slice info passed in from the client. In theory, the slice passed in by the client MUST overlap with what's stored in the volume. If there's no overlap, extracted_tensor will be None here (see the sketch below).
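
A minimal sketch of that contract (hypothetical names, not the real _compute_slice_intersection signature): the helper returns the overlapping region, or None when the slice sent by the client does not overlap the stored shard at all.

    from typing import Optional, Tuple

    def compute_slice_intersection(
        stored_offsets: Tuple[int, ...], stored_shape: Tuple[int, ...],
        req_offsets: Tuple[int, ...], req_shape: Tuple[int, ...],
    ) -> Optional[Tuple[Tuple[int, ...], Tuple[int, ...]]]:
        # Overlap of two axis-aligned boxes, one dimension at a time.
        starts = tuple(max(s, r) for s, r in zip(stored_offsets, req_offsets))
        stops = tuple(min(s + n, r + m) for s, n, r, m in
                      zip(stored_offsets, stored_shape, req_offsets, req_shape))
        if any(b <= a for a, b in zip(starts, stops)):
            return None  # no overlap -> the caller sees None, as discussed above
        return starts, tuple(b - a for a, b in zip(starts, stops))

    # A shard at offset (1, 0) with shape (1, 2) vs. a request at (0, 0) with shape (1, 2):
    assert compute_slice_intersection((1, 0), (1, 2), (0, 0), (1, 2)) is None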

Contributor

Should we raise an error in this case? It sounds like at this point the code expects the shard to exist.

Contributor Author

Discussed offline; will address in a follow-up to request all tensors inside the whole volume.

Contributor
@LucasLLC left a comment

Lgtm! Tysm!

tensor_slice, dtensor_slice
)

if tensor_slice is None:
Contributor

This looks great to me. Another thing to consider -- if we've already fetched the entire tensor slice region, we can avoid fetching it again. DTensor is often replicated.

Contributor Author

I'll create an issue for this and follow up. Probably need a discussion with you on how to create a test case with replicated dtensor first.

assert device_mesh_shape == tensor_slice.mesh_shape

return assemble_global_tensor(
assembled_tensor = assemble_global_tensor(
Contributor

Is this safe because we never access regions of the tensor which are not initialized? Is there any danger here of us returning uninitialized memory?

Contributor Author

Can you give more information on this? I'm not sure what you mean by "access regions of the tensor which are not initialized". I thought that if we are able to access the volume info, the tensor is already properly initialized?

Contributor

In assemble_global_tensor, the first thing we do is create a torch.empty tensor of the correct global size --

    global_tensor = torch.empty(

Now that I'm thinking about it, for particularly large tensors the behavior here is somewhat unwanted as well. Ideally we would create a tensor of the correct local size and correct for the offsets. devmate may be able to help with the mapping logic; can you give it a shot?
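
One possible shape for that mapping (a sketch under the assumption that the shards have already been intersected with the request; assemble_requested_region is a hypothetical name, not the existing assemble_global_tensor):

    import torch

    def assemble_requested_region(shards, req_offsets, req_shape, dtype):
        # shards: iterable of (tensor, global_offsets) pairs that lie inside the request.
        out = torch.empty(req_shape, dtype=dtype)  # allocate the local size, not the global size
        for shard, offsets in shards:
            # Shift each shard's global offsets into the request's coordinate frame.
            idx = tuple(slice(o - r, o - r + n)
                        for o, r, n in zip(offsets, req_offsets, shard.shape))
            out[idx] = shard
        return out

    # e.g. the two column pieces from the PR description:
    pieces = [(torch.tensor([[0]]), (0, 0)), (torch.tensor([[2]]), (1, 0))]
    print(assemble_requested_region(pieces, (0, 0), (2, 1), torch.int64))  # tensor([[0], [2]])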


async def _get_distributed_whole_tensor(self, key: str) -> torch.Tensor:
"""Fetches slices from all volume storages and stitch together to return the whole tensor"""
async def _get_distributed_whole_tensor(
Contributor

Should we rename this function now that it's not technically fetching the whole_tensor?

f"Tensor slice {request.tensor_slice} not found in any stored shards for {key}"
)

def _extract_tensor_subset(
Contributor

For this to be correct in the current implementation, I believe it must always return None if the entire requested_slice is not present.

await transport_buffer.write_from(shard["tensor"])
stored_slice = shard["slice"]
stored_tensor = shard["tensor"]

Contributor
@LucasLLC Oct 2, 2025

What might be nice here is to have an if statement like: requested_tensor_slice is in shard

if request_slice not in shard:
    continue

Contributor Author

Verification added to return early if there's no intersection, or if the intersection is not exactly the tensor slice in the request.

@@ -0,0 +1,127 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor

nit: why _ext for filename?

Contributor Author

It's for "extended". Running the whole test_sharding suite is too much (>30 tests) and CI would fail, so I picked some of the sharding tests (~20) to be "test_sharding_basic" and the rest of them (more complete) to be "test_sharding_ext".

TORCHSTORE_RDMA_ENABLED=0 \
pytest tests/test_tensor_slice.py \
--cov=. --cov-report=xml -vv -s
- name: Run test_resharding_basic tests with coverage
Contributor

ty!


if stored_object_type is ObjectType.TENSOR:
full_tensor = await self._get_tensor(key)
# TODO: we should get the part of interest in this branch.
Contributor

Could you explain this to me? Or rather confirm my understanding is correct:

If the stored object is a tensor, then we always fetch the entire tensor and then slice for the requested spec since the tensor can only be in one storage volume?

Contributor Author

Yes, your description is accurate.
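
For reference, the single-volume branch boils down to cutting the requested (offsets, local_shape) region out of the fetched tensor; a rough sketch with torch.narrow (illustrative only, not the code in this PR):

    import torch

    def slice_full_tensor(full_tensor, offsets, local_shape):
        # Narrow one dimension at a time; narrow() returns a view, so nothing is copied.
        for dim, (off, length) in enumerate(zip(offsets, local_shape)):
            full_tensor = full_tensor.narrow(dim, off, length)
        return full_tensor

    full = torch.arange(6).reshape(2, 3)
    print(slice_full_tensor(full, (0, 1), (2, 2)))  # tensor([[1, 2], [4, 5]])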

intersection_slice is None
or intersection_slice.local_shape != request.tensor_slice.local_shape
or intersection_slice.offsets != request.tensor_slice.offsets
):
Contributor

iiuc, we only return here if we have complete overlap between the requested tensor slice and the storage volume slice?

I think this is reasonable since the client has knowledge in advance of what's being requested?

Contributor Author

iiuc, we only return here if we have complete overlap between the requested tensor slice and the storage volume slice?

Right. To clarify, "complete overlap" here means the requested slice is a subset of what's stored in the volume.

I think this is reasonable since the client has knowledge in advance of what's being requested?

There's a consistency model here between the metadata and the actual storage: we always peek at the metadata first, then do the actual fetch. So as long as the metadata never lies, the actual data fetch should go smoothly.
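
That subset condition can be written as a simple per-dimension containment check; a sketch assuming offsets/local_shape tuples (hypothetical helper, not the code in the diff):

    def slice_contains(stored_offsets, stored_shape, req_offsets, req_shape):
        # The request lies entirely inside the stored shard iff, in every
        # dimension, the requested interval is within the stored interval.
        return all(so <= ro and ro + rn <= so + sn
                   for so, sn, ro, rn in zip(stored_offsets, stored_shape,
                                             req_offsets, req_shape))

    assert slice_contains((0, 0), (2, 2), (0, 0), (2, 1))      # column fits inside the shard
    assert not slice_contains((0, 0), (1, 2), (0, 0), (2, 1))  # request spans two shards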



# A dev print util.
def color_print(s, color=None, **kwargs):
Contributor

Small nit: we're accumulating a lot in utils.py. It might be worth placing this under example/ for now, or we can think a bit about the overall folder structure.

Contributor Author

Yes, this is very personalized. I just removed it since it's not used.

full_tensor,
request.tensor_slice.local_shape,
request.tensor_slice.offsets,
# Stored object is a DTensor. Return full tensor if
Contributor

Can we assert this is the case here?

Contributor Author

Done.

LucasLLC and others added 6 commits October 14, 2025 08:25
* latest rdma updates from monarch

* remove test code

* remove test code
#56)

* latest rdma updates from monarch

* remove test code

* remove test code

* working v1

* removing test code

* v1

* add v1 gate

* nits

* linter
@kaiyuan-li kaiyuan-li merged commit 2f2902c into main Oct 14, 2025
5 checks passed
@kaiyuan-li kaiyuan-li deleted the lky_get_optimization branch October 14, 2025 16:01