Description
🐛 Describe the bug
The following cases failed with accuracy assertions such as "AssertionError: Tensor-likes are not close!" in the ww17 weekly test. The test platform is IDC PVC 1100 with XELINK.
Data Center GPU Max 1100 OpenCL 3.0 NEO [25.05.32567]
libigc2 2.7.11-1099~22.04
|  | GPU 0/0 | GPU 1/0 | GPU 2/0 | GPU 3/0 | GPU 4/0 | GPU 5/0 | GPU 6/0 | GPU 7/0 | CPU Affinity |
|---|---|---|---|---|---|---|---|---|---|
| GPU 0/0 | S | XL8 | XL8 | XL8 | SYS | SYS | SYS | SYS | 0-47,96-143 |
| GPU 1/0 | XL8 | S | XL8 | XL8 | SYS | SYS | SYS | SYS | 0-47,96-143 |
| GPU 2/0 | XL8 | XL8 | S | XL8 | SYS | SYS | SYS | SYS | 0-47,96-143 |
| GPU 3/0 | XL8 | XL8 | XL8 | S | SYS | SYS | SYS | SYS | 0-47,96-143 |
| GPU 4/0 | SYS | SYS | SYS | SYS | S | XL8 | XL8 | XL8 | 48-95,144-191 |
| GPU 5/0 | SYS | SYS | SYS | SYS | XL8 | S | XL8 | XL8 | 48-95,144-191 |
| GPU 6/0 | SYS | SYS | SYS | SYS | XL8 | XL8 | S | XL8 | 48-95,144-191 |
| GPU 7/0 | SYS | SYS | SYS | SYS | XL8 | XL8 | XL8 | S | 48-95,144-191 |
ZE_AFFINITY_MASK=0,1,2,3
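A quick sanity check that the mask is honored by the process (a minimal sketch, assuming a PyTorch build with XPU support):

```python
import os
import torch

# With ZE_AFFINITY_MASK=0,1,2,3 exported before the process starts, an XPU-enabled
# PyTorch build should report 4 visible devices.
print("ZE_AFFINITY_MASK =", os.environ.get("ZE_AFFINITY_MASK"))
if torch.xpu.is_available():
    print("visible XPU devices:", torch.xpu.device_count())
```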
| Test suite | Test case | Result | Error | Notes |
|---|---|---|---|---|
| test.distributed._composable.fsdp.test_fully_shard_compile.TestFullyShardCompile | test_transformer_backend_inductor_fullgraph_True | failed | AssertionError: Scalars are not equal! | To be investigated |
| test.distributed.fsdp.test_fsdp_comm_hooks.TestCommunicationHooks | test_fp16_hook_has_wrapping_False_sharding_strategy0 | failed | AssertionError: Tensor-likes are not close! | To be investigated (MKL) |
| test.distributed.fsdp.test_fsdp_use_orig_params.TestFSDPUseOrigParamsMultipleParamGroups | test_fsdp_compile | failed | AssertionError: Scalars are not close! | To be investigated |
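All three assertions are raised by assertEqual from torch.testing._internal.common_utils, which compares results with the default per-dtype tolerances of torch.testing.assert_close (for float32, rtol=1.3e-06 and atol=1e-05, matching the limits printed in the logs below). A minimal standalone sketch of the kind of check that fails:

```python
import torch

# Minimal sketch of the comparison these tests perform (default float32 tolerances:
# rtol=1.3e-06, atol=1e-05). A mismatch beyond tolerance raises
# "AssertionError: Tensor-likes are not close!".
expected = torch.randn(316)
actual = expected.clone()
actual[134] += 0.7  # perturb one element well beyond the allowed tolerance

torch.testing.assert_close(actual, expected)  # raises AssertionError
```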
- test.distributed._composable.fsdp.test_fully_shard_compile.TestFullyShardCompile | test_transformer_backend_inductor_fullgraph_True | failed | AssertionError: Scalars are not equal!:
Reproduce command:
pytest -vs test/distributed/_composable/fsdp/test_fully_shard_compile.py -k <case_name> (e.g. test_transformer_backend_inductor_fullgraph_True)
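The same selection can also be run from Python, if that is more convenient (a sketch using pytest's public API; the -k expression is the case name):

```python
import pytest

# Programmatic equivalent of the pytest command above; -k selects the test case.
pytest.main([
    "-vs",
    "test/distributed/_composable/fsdp/test_fully_shard_compile.py",
    "-k", "test_transformer_backend_inductor_fullgraph_True",
])
```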
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] Traceback (most recent call last):
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in run_test
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] getattr(self, test_name)()
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 634, in wrapper
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] fn()
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3155, in wrapper
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] method(*args, **kwargs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1876, in wrapper
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return fn(*args, **kwargs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/contextlib.py", line 79, in inner
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return func(*args, **kwds)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 974, in test_transformer_backend_inductor_fullgraph_True
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] _, triton_codes = run_and_get_code(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/utils.py", line 1727, in run_and_get_code
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] result = fn(*args, **kwargs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 975, in <lambda>
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] lambda: self._test_traceable_fsdp(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 578, in _test_traceable_fsdp
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] losses_compiled = test_compiled()
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 538, in test_compiled
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] res = run_iters(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 514, in run_iters
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] loss = fwd_bwd_func(inp)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 675, in _fn
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1571, in _call_user_compiler
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] raise BackendCompilerFailed(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1546, in _call_user_compiler
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] compiled_fn = compiler_fn(gm, self.example_inputs())
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] compiled_gm = compiler_fn(gm, example_inputs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/__init__.py", line 2365, in __call__
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return compile_fx(model_, inputs_, config_patches=self.config)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 2256, in compile_fx
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return aot_autograd(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 106, in __call__
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1176, in aot_module_simplified
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] compiled_fn = dispatch_and_compile()
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1150, in dispatch_and_compile
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] compiled_fn, _ = create_aot_dispatcher_function(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 574, in create_aot_dispatcher_function
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return _create_aot_dispatcher_function(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 824, in _create_aot_dispatcher_function
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] compiled_fn, fw_metadata = compiler_fn(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 1132, in aot_dispatch_autograd
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 483, in __call__
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return self.compiler_fn(gm, example_inputs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 2095, in fw_compiler_base
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return inner_compile(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 707, in compile_fx_inner
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 124, in debug_wrapper
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] inner_compiled_fn = compiler_fn(gm, example_inputs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 817, in _compile_fx_inner
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] mb_compiled_graph = fx_codegen_and_compile(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1436, in fx_codegen_and_compile
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1121, in codegen_and_compile
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] _recursive_post_grad_passes(gm, is_inference=is_inference)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 467, in _recursive_post_grad_passes
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] post_grad_passes(gm, is_inference)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/_inductor/fx_passes/post_grad.py", line 177, in post_grad_passes
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] GraphTransformObserver(gm, "post_grad_custom_post_pass").apply_graph_pass(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/fx/passes/graph_transform_observer.py", line 85, in apply_graph_pass
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return pass_fn(self.gm.graph)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 297, in _check_fsdp_copy_and_resize_ops_count_in_graph
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] _check_count(fwd_copy_count, fwd_resize_count) # fwd graph
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/_composable/fsdp/test_fully_shard_compile.py", line 281, in _check_count
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] self.assertEqual(
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4096, in assertEqual
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] raise error_metas.pop()[0].to_error( # type: ignore[index]
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] AssertionError: Scalars are not equal!
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] Expected 4 but got 0.
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] Absolute difference: 4
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] Relative difference: 1.0
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] Unexpected number of fsdp.copy_ ops (expected 4, got 0) in graph: graph():
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] %primals_1 : [num_users=1] = placeholder[target=primals_1]
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] %sum_1 : [num_users=1] = call_function[target=torch.ops.aten.sum.default](args = (%primals_1,), kwargs = {})
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] return (sum_1,)
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] To execute this test, run the following from the base repo dir:
[rank1]:E0425 19:24:56.377000 2457312 site-packages/torch/testing/_internal/common_distributed.py:768] PYTORCH_TEST_WITH_SLOW=1 python test/distributed/_composable/fsdp/test_fully_shard_compile.py TestFullyShardCompile.test_transformer_backend_inductor_fullgraph_True
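The backend failure above comes from the test's post-grad graph check (_check_fsdp_copy_and_resize_ops_count_in_graph), which expects 4 fsdp.copy_ ops in the forward graph but finds 0 in the Inductor post-grad graph dumped above. A rough sketch of counting such ops in an FX graph (illustrative only; matching node targets by substring is an assumption, not the test's exact implementation):

```python
import torch.fx

def count_ops(graph: "torch.fx.Graph", op_substring: str) -> int:
    # Count call_function nodes whose target name contains the given substring,
    # e.g. "fsdp.copy_" for the copy ops the test expects in the forward graph.
    return sum(
        1
        for node in graph.nodes
        if node.op == "call_function" and op_substring in str(node.target)
    )

# Example: count_ops(gm.graph, "fsdp.copy_") should return 4 for the expected forward
# graph, but the dumped graph above contains only a placeholder, a sum, and an output.
```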
- test.distributed.fsdp.test_fsdp_comm_hooks.TestCommunicationHooks | test_fp16_hook_has_wrapping_False_sharding_strategy0 | failed | AssertionError: Tensor-likes are not close!:
../../../../test/distributed/fsdp/test_fsdp_comm_hooks.py::TestCommunicationHooks::test_fp16_hook_has_wrapping_False_sharding_strategy0 2025:04:25-15:46:48:(2164912) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-15:46:48:(2164912) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:04:25-15:46:48:(2164913) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-15:46:48:(2164913) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:04:25-15:46:48:(2164914) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-15:46:48:(2164914) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:04:25-15:46:48:(2164915) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:04:25-15:46:48:(2164915) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/fsdp/test_fsdp_comm_hooks.py:179: FutureWarning: The NO_SHARD sharding strategy is deprecated. If having issues, please use DistributedDataParallel instead.
return FSDP(
(this FutureWarning is emitted by each of the four ranks)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] Caught exception:
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] Traceback (most recent call last):
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in run_test
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] getattr(self, test_name)()
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 634, in wrapper
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] fn()
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3155, in wrapper
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] method(*args, **kwargs)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 552, in instantiated_test
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] test(self, **param_kwargs)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 209, in wrapper
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] return func(*args, **kwargs)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/fsdp/test_fsdp_comm_hooks.py", line 401, in test_fp16_hook
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] self._check_low_precision_hook(
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/fsdp/test_fsdp_comm_hooks.py", line 382, in _check_low_precision_hook
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] self.assertEqual(hook_param.grad, mp_param.grad)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4096, in assertEqual
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] raise error_metas.pop()[0].to_error( # type: ignore[index]
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] AssertionError: Tensor-likes are not close!
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] Mismatched elements: 13 / 316 (4.1%)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] Greatest absolute difference: 0.71484375 at index (134,) (up to 1e-05 allowed)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] Greatest relative difference: 1.5662751197814941 at index (137,) (up to 1.3e-06 allowed)
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] To execute this test, run the following from the base repo dir:
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768] PYTORCH_TEST_WITH_SLOW=1 python test/distributed/fsdp/test_fsdp_comm_hooks.py TestCommunicationHooks.test_fp16_hook_has_wrapping_False_sharding_strategy0
[rank2]:E0425 15:47:00.631000 2164914 site-packages/torch/testing/_internal/common_distributed.py:768]
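For context, the fp16 hook and the fp16 MixedPrecision reduce path are both expected to cast gradients to float16 for communication and back to float32 afterwards, so the test compares the two resulting gradients under tight default tolerances. A rough standalone sketch of that round-trip (an assumption about the mechanism for illustration, not the test code itself):

```python
import torch

# Simulate the fp16 communication round-trip both code paths are expected to apply:
# cast the float32 gradient to float16 for the reduce, then back to float32.
grad = torch.randn(316)
roundtrip_a = grad.to(torch.float16).to(torch.float32)
roundtrip_b = grad.to(torch.float16).to(torch.float32)

# If both paths apply the same cast, the results match exactly; the logged 0.71
# absolute difference is far larger than float16 rounding error at this magnitude.
torch.testing.assert_close(roundtrip_a, roundtrip_b)
```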
- test.distributed.fsdp.test_fsdp_use_orig_params.TestFSDPUseOrigParamsMultipleParamGroups | test_fsdp_compile | failed | AssertionError: Scalars are not close!:
/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_comparison.py:1085: UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
Consider using tensor.detach() first. (Triggered internally at /home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/aten/src/ATen/native/Scalar.cpp:22.)
actual.item(),
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] Caught exception:
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] Traceback (most recent call last):
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in run_test
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] getattr(self, test_name)()
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 634, in wrapper
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] fn()
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3155, in wrapper
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] method(*args, **kwargs)
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 209, in wrapper
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] return func(*args, **kwargs)
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/fsdp/test_fsdp_use_orig_params.py", line 226, in test_fsdp_compile
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] self.run_subtests(
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_fsdp.py", line 1188, in run_subtests
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] return run_subtests(self, *args, **kwargs)
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1030, in run_subtests
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] test_fn(*test_args, **test_kwargs, **subtest_kwargs)
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/jenkins/workspace/distributed_ut_regular/pytorch/test/distributed/fsdp/test_fsdp_use_orig_params.py", line 274, in _test_fsdp_compile
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] self.assertEqual(losses[0], losses[1])
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] File "/home/sdp/.conda/envs/2025_ww17/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4096, in assertEqual
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] raise error_metas.pop()[0].to_error( # type: ignore[index]
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] AssertionError: Scalars are not close!
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] Expected -298.0321044921875 but got -294.53314208984375.
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] Absolute difference: 3.49896240234375 (up to 1e-05 allowed)
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] Relative difference: 0.011740219760235496 (up to 1.3e-06 allowed)
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] To execute this test, run the following from the base repo dir:
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] PYTORCH_TEST_WITH_SLOW=1 python test/distributed/fsdp/test_fsdp_use_orig_params.py TestFSDPUseOrigParamsMultipleParamGroups.test_fsdp_compile
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768]
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
[rank1]:E0425 17:51:20.500000 2256818 site-packages/torch/testing/_internal/common_distributed.py:768] exiting process 1 with exit code: 10
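This assertion compares the eager and torch.compile losses with the default scalar tolerances, and the UserWarning above additionally suggests detaching before converting to a Python scalar. A small sketch of both points (the loss values are taken from the log; the rest is illustrative):

```python
import torch

# Illustrative only: compare an eager loss against a compiled loss the way the test
# does, detaching before .item() as the UserWarning above recommends.
loss_eager = torch.tensor(-298.0321044921875, requires_grad=True)
loss_compiled = torch.tensor(-294.53314208984375, requires_grad=True)

eager_val = loss_eager.detach().item()      # .detach() avoids the requires_grad warning
compiled_val = loss_compiled.detach().item()

# Default scalar tolerances (rtol=1.3e-06, atol=1e-05) reject the ~3.5 absolute /
# ~1.2e-02 relative difference, so this raises "Scalars are not close!".
torch.testing.assert_close(compiled_val, eager_val)
```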
Versions
https://wiki.ith.intel.com/pages/viewpage.action?pageId=4126570065#distributedww17-Configuration