Implement XLAShardedTensor._spec and test #9488
base: master
Conversation
Force-pushed from bb4eb3b to 28c4f6f.
mesh = DeviceMesh("xla", | ||
torch.tensor(device_list).reshape(self.mesh_shape)) | ||
else: | ||
# default to 1D mesh |
Maybe this should be an error.
placements.append(
    Shard(tensor_dim) if tensor_dim is not None else Replicate())
else:
  placements = [Replicate()]
Same here. Should be an error.
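For what it's worth, a minimal sketch of the raise-instead-of-default shape being suggested (this does not reproduce the PR's exact placement mapping, only the control flow; build_placements is a hypothetical name):

from torch.distributed.tensor.placement_types import Shard, Replicate

def build_placements(partition_spec):
  # Hypothetical helper: refuse to guess when no sharding metadata was
  # recorded, rather than silently falling back to [Replicate()].
  if partition_spec is None:
    raise ValueError(
        "XLAShardedTensor has no partition_spec; cannot build placements")
  return [
      Shard(tensor_dim) if tensor_dim is not None else Replicate()
      for tensor_dim in partition_spec
  ]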
@@ -651,7 +651,9 @@ def mark_sharding(t: Union[torch.Tensor, XLAShardedTensor], mesh: Mesh,
  op_sharding = mesh.get_op_sharding(partition_spec)
  annotate_func = torch_xla._XLAC._xla_mark_sharding
  annotate_func(unwrap_sharded_tensor(t), op_sharding)
  return wrap_as_sharded_tensor(t)
  # Pass mesh and partition spec information for DTensor compatibility
  return wrap_as_sharded_tensor(
Let's do the same for the other APIs above, like annotate_custom_sharding and enable_manual_sharding?
# use existing mesh_shape
if hasattr(self, 'mesh_shape') and self.mesh_shape:
  import torch_xla.runtime as xr
nit: move the import outside the if
mesh = DeviceMesh("xla", | ||
torch.tensor(device_list).reshape(self.mesh_shape)) | ||
else: | ||
# default to 1D mesh |
Why do we have a default instead of throwing an error? Is it for auto wrapping?
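If the intent is to require recorded metadata, a sketch of the else branch could look like this (purely hypothetical, replacing the 1D default shown in the diff above):

else:
  # Hypothetical: refuse to invent a mesh when no mesh_shape was recorded
  # on this tensor, instead of defaulting to a flat 1D mesh.
  raise ValueError(
      "XLAShardedTensor has no recorded mesh_shape; cannot build a DeviceMesh")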
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.placement_types import Shard, Replicate

# use existing mesh_shape
It's better to extract the conversion into a function and put it near def convert_to_xla_mesh(dt_mesh: DeviceMesh) -> "Mesh" (xla/torch_xla/distributed/spmd/api.py, line 49 in e7dcc7b).
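A sketch of what such an extracted helper might look like, mirroring the inline code in this PR (the function name convert_to_device_mesh is hypothetical; convert_to_xla_mesh already covers the opposite direction):

import torch
import torch_xla.runtime as xr
from torch.distributed.device_mesh import DeviceMesh

def convert_to_device_mesh(mesh_shape):
  # Hypothetical inverse of convert_to_xla_mesh: build a DeviceMesh over all
  # global runtime devices, reshaped to the recorded XLA mesh shape.
  device_count = xr.global_runtime_device_count()
  device_ids = torch.tensor(list(range(device_count))).reshape(mesh_shape)
  return DeviceMesh("xla", device_ids)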
import torch_xla.runtime as xr
device_count = xr.global_runtime_device_count()
device_list = list(range(device_count))
mesh = DeviceMesh("xla",
We need to take care of mesh dim names, too
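For example, DeviceMesh accepts a mesh_dim_names keyword argument, so the recorded axis names could be forwarded along with the shape. A fragment of the constructor call quoted above, with placeholder axis names:

# Hypothetical: forward the recorded axis names instead of dropping them.
mesh = DeviceMesh(
    "xla",
    torch.tensor(device_list).reshape(self.mesh_shape),
    mesh_dim_names=("data", "model"))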
device_count = xr.global_runtime_device_count()
mesh = DeviceMesh("xla", list(range(device_count)))

# use existing partition_spec
Same as above
converted_spec = xla_tensor._spec

assert converted_spec.mesh.device_type == "xla"
assert converted_spec.mesh.size() == device_count
May need to assert on the mesh size of each dim, too
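For instance, the test could compare the full shape and each dimension (a sketch; mesh_shape here stands for whatever shape the test constructs):

# Hypothetical additional assertions: check per-dimension sizes, not just
# the total device count.
assert tuple(converted_spec.mesh.shape) == tuple(mesh_shape)
for dim, dim_size in enumerate(mesh_shape):
  assert converted_spec.mesh.size(dim) == dim_size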
Force-pushed from 28c4f6f to 78a5639.
Force-pushed from 78a5639 to 4dd06ad.
# Return cached spec if available
if hasattr(self, '_cached_spec'):
  return self._cached_spec
What if a call to wrap_as_sharded_tensor changes self.mesh_shape and/or self.partition_spec? Will you still get this cached value even though it's out of date?
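One hedged way to address this, sketched against the names in this PR (the invalidation itself is hypothetical): have wrap_as_sharded_tensor clear the cache whenever it rewrites the metadata on an existing XLAShardedTensor.

# Hypothetical fragment inside wrap_as_sharded_tensor: dropping the cached
# spec forces the next ._spec access to rebuild it from the new metadata.
if isinstance(t, XLAShardedTensor):
  if mesh_shape is not None:
    t.mesh_shape = mesh_shape
  if partition_spec is not None:
    t.partition_spec = partition_spec
  t._cached_spec = None  # invalidate any stale DTensorSpec
  return t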
return self._cached_spec

# use existing mesh_shape
if hasattr(self, 'mesh_shape') and self.mesh_shape:
Can self.mesh_shape and self.partition_spec be set to None at initialization? That removes the need to check hasattr. I don't think there's a semantic difference between the attribute not existing and its value being None, is there?
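A minimal sketch of that suggestion (ignoring the __new__/_make_wrapper_subclass plumbing of the real class; the point is only the None defaults):

# Hypothetical: default the metadata to None so callers can write
# `x.mesh_shape is None` instead of hasattr(x, 'mesh_shape').
def __init__(self, global_tensor, mesh_shape=None, partition_spec=None):
  self.global_tensor = global_tensor
  self.mesh_shape = mesh_shape
  self.partition_spec = partition_spec
  self._cached_spec = None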
def find_sharded_info(x):
  nonlocal mesh_shape, partition_spec
  if isinstance(x, XLAShardedTensor):
    if hasattr(x, 'mesh_shape') and x.mesh_shape:
      mesh_shape = x.mesh_shape
    if hasattr(x, 'partition_spec') and x.partition_spec:
      partition_spec = x.partition_spec

tree_map(find_sharded_info, args)
if kwargs:
  tree_map(find_sharded_info, kwargs)
I don't have enough experience with this codebase to understand the context. What are *args and **kwargs in practice, and why do we expect they would have sharding information that is relevant for elem?
Also, is this "elem is not an XLAShardedTensor but there exists sharding information we want to acquire" path tested?
assert second_access_time * 10 < first_access_time, \
    f"Cached access should be much faster: {first_access_time:.6f}s vs {second_access_time:.6f}s"
These sorts of tests that rely on the wall clock often lead to annoying flakes in my experience. I think it's sufficient to just test that self._cached_spec has a permanent value after the first call. If you really want to assert that a certain code path is called, you could do something with mocks, but that seems like overkill for this, which is simple to confirm by looking at the code ("if the attribute exists, return it" is the first line of the property getter).
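For example, a timing-free version of the check could look like this (a sketch; xt stands for whatever sharded tensor the test already builds):

# Hypothetical replacement for the timing assertion: the spec is built once
# and the same object is returned on every later access.
spec_first = xt._spec
spec_second = xt._spec
assert spec_first is spec_second
assert xt._cached_spec is spec_first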
return XLAShardedTensor(t)
return t
return XLAShardedTensor(
    t, mesh_shape=mesh_shape, partition_spec=partition_spec)
The fact that calling wrap_as_sharded_tensor on an XLAShardedTensor now does something other than trivially returning t is logically new. Is this resharding logic tested?
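A sketch of the kind of test being asked for (the keyword arguments come from this PR's diff; the mesh values and whether the second call updates metadata in place are assumptions the actual test would need to pin down):

# Hypothetical test: re-wrapping an existing XLAShardedTensor with new
# metadata should be reflected in its attributes, not silently ignored.
st = wrap_as_sharded_tensor(torch.randn(8, 8))
st = wrap_as_sharded_tensor(st, mesh_shape=(1, 8), partition_spec=(0, None))
assert st.mesh_shape == (1, 8)
assert st.partition_spec == (0, None)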
Convert XLA sharding information to DTensorSpec for DTensor interface compatibility.
"""
# Return cached spec if available
if hasattr(self, '_cached_spec'):
Can we assign self._cached_spec = None in the constructor, then check self._cached_spec is None here?
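A sketch of the property getter under that suggestion (assuming the constructor sets self._cached_spec = None as proposed; _build_dtensor_spec is a hypothetical name for the construction logic that follows in this diff):

@property
def _spec(self):
  # Hypothetical: a None sentinel set in the constructor replaces the
  # hasattr check; the spec is still built lazily on first access.
  if self._cached_spec is None:
    self._cached_spec = self._build_dtensor_spec()
  return self._cached_spec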
@@ -91,10 +95,15 @@ class XLAShardedTensor(torch.Tensor):
  # >> assert len(input.shape) == len(partition_spec)
  partition_spec: Tuple[int, None]

  __slots__ = ['global_tensor']
  __slots__ = ['global_tensor', 'mesh_shape', 'partition_spec', '_cached_spec']
Can we get rid of __slots__? Is it just for performance?
    if hasattr(x, 'partition_spec') and x.partition_spec:
      partition_spec = x.partition_spec

tree_map(find_sharded_info, args)
You can do tree_map_only(type, callable, args); then you can skip the isinstance check inside of the callable.
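For instance (tree_map_only lives in torch.utils._pytree; this mirrors the find_sharded_info closure quoted earlier, so it still assumes an enclosing function for the nonlocal declarations):

from torch.utils._pytree import tree_map_only

def find_sharded_info(x):
  # tree_map_only only invokes this for XLAShardedTensor leaves, so the
  # isinstance check inside the callable is no longer needed.
  nonlocal mesh_shape, partition_spec
  if x.mesh_shape:
    mesh_shape = x.mesh_shape
  if x.partition_spec:
    partition_spec = x.partition_spec

tree_map_only(XLAShardedTensor, find_sharded_info, args)
if kwargs:
  tree_map_only(XLAShardedTensor, find_sharded_info, kwargs)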
Implements and adds tests for XLAShardedTensor._spec, with regard to #9418.