fix validate nightly binaries #3117

Closed
wants to merge 1 commit

Conversation

Contributor

@TroyGarden TroyGarden commented Jun 20, 2025

context

  • validate_binaries.sh installs the torchrec dependencies manually, and this list often gets out of sync with the actual requirements, as in the failure below:
```
+++ conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ local cmd=run
+++ case "$cmd" in
+++ __conda_exe run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ /opt/conda/bin/conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
WARNING: overwriting environment variables set in the machine
overwriting variable {'LD_LIBRARY_PATH'}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/pytorch/torchrec/torchrec/__init__.py", line 10, in <module>
    import torchrec.distributed  # noqa
  File "/pytorch/torchrec/torchrec/distributed/__init__.py", line 38, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/pytorch/torchrec/torchrec/distributed/model_parallel.py", line 18, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/split_table_batched_embeddings_ops_training.py", line 54, in <module>
    from fbgemm_gpu.tbe.stats import TBEBenchmarkParamsReporter
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/__init__.py", line 10, in <module>
    from .bench_params_reporter import TBEBenchmarkParamsReporter  # noqa F401
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/bench_params_reporter.py", line 19, in <module>
    from fbgemm_gpu.tbe.bench.tbe_data_config import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/__init__.py", line 12, in <module>
    from .bench_config import (  # noqa F401
Traceback (most recent call last):
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/bench_config.py", line 14, in <module>
    import click
ModuleNotFoundError: No module named 'click'

ERROR conda.cli.main_run:execute(47): `conda run python -c import torch; import fbgemm_gpu; import torchrec` failed. (See above for error)
    main()
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 96827edf14ff626b7bc16b6cfaa56aa27b4b660029e1fd7755d14bf20a3c4e96 /exec failed with exit code 1
Error: Process completed with exit code 1.
```
  • this diff installs the dependencies from requirements.txt instead
    NOTE: the paths in the workflow yaml file need to be wrapped in '' to protect them

@facebook-github-bot facebook-github-bot added the CLA Signed label Jun 20, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D76875546

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jun 20, 2025
Summary:


# context
* the original diff D74366343 broke the cogwheel test and was reverted
* the error stack P1844048578 is shown below:
```
  File "/dev/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/dev/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/dev/torchrec/distributed/train_pipeline/runtime_forwards.py", line 84, in __call__
    data = request.wait()
  File "/dev/torchrec/distributed/types.py", line 334, in wait
    ret: W = self._wait_impl()
  File "/dev/torchrec/distributed/embedding_sharding.py", line 655, in _wait_impl
    kjts.append(w.wait())
  File "/dev/torchrec/distributed/types.py", line 334, in wait
    ret: W = self._wait_impl()
  File "/dev/torchrec/distributed/dist_data.py", line 426, in _wait_impl
    return type(self._input).dist_init(
  File "/dev/torchrec/sparse/jagged_tensor.py", line 2993, in dist_init
    return kjt.sync()
  File "/dev/torchrec/sparse/jagged_tensor.py", line 2067, in sync
    self.length_per_key()
  File "/dev/torchrec/sparse/jagged_tensor.py", line 2281, in length_per_key
    _length_per_key = _maybe_compute_length_per_key(
  File "/dev/torchrec/sparse/jagged_tensor.py", line 1192, in _maybe_compute_length_per_key
    _length_per_key_from_stride_per_key(lengths, stride_per_key)
  File "/dev/torchrec/sparse/jagged_tensor.py", line 1144, in _length_per_key_from_stride_per_key
    if _use_segment_sum_csr(stride_per_key):
  File "/dev/torchrec/sparse/jagged_tensor.py", line 1131, in _use_segment_sum_csr
    elements_per_segment = sum(stride_per_key) / len(stride_per_key)
ZeroDivisionError: division by zero
```
* the complaint is that `stride_per_key` is an empty list, which comes from the following function call:
```
        stride_per_key = _maybe_compute_stride_per_key(
            self._stride_per_key,
            self._stride_per_key_per_rank,
            self.stride(),
            self._keys,
        )
```
* the only place this `stride_per_key` can be empty is when `stride_per_key_per_rank.dim() != 2` (a small repro sketch follows the function below):
```
def _maybe_compute_stride_per_key(
    stride_per_key: Optional[List[int]],
    stride_per_key_per_rank: Optional[torch.IntTensor],
    stride: Optional[int],
    keys: List[str],
) -> Optional[List[int]]:
    if stride_per_key is not None:
        return stride_per_key
    elif stride_per_key_per_rank is not None:
        if stride_per_key_per_rank.dim() != 2:
            # after permute the kjt could be empty
            return []
        rt: List[int] = stride_per_key_per_rank.sum(dim=1).tolist()
        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            pt2_checks_all_is_size(rt)
        return rt
    elif stride is not None:
        return [stride] * len(keys)
    else:
        return None
```
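
To make the failure path concrete, here is a minimal standalone sketch (not the torchrec source; the helper name is made up for illustration) of how a tensor whose rank is not 2 produces the empty `stride_per_key` and then the ZeroDivisionError above:
```
import torch

def stride_per_key_from_per_rank(stride_per_key_per_rank: torch.Tensor):
    # mirrors the dim() check in _maybe_compute_stride_per_key above
    if stride_per_key_per_rank.dim() != 2:
        return []  # "after permute the kjt could be empty"
    return stride_per_key_per_rank.sum(dim=1).tolist()

# a 3-D tensor, e.g. the result of indexing a 2-D tensor with a nested list
bad = torch.ones(2, 2, 2, dtype=torch.int64)
stride_per_key = stride_per_key_from_per_rank(bad)  # -> []

# downstream average, as in _use_segment_sum_csr
elements_per_segment = sum(stride_per_key) / len(stride_per_key)  # ZeroDivisionError: division by zero
```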
# the main change from D74366343 is how the staggered `stride_per_key_per_rank` is computed in `dist_init` (a comparison sketch follows the three snippets below):
* baseline
```
            if stagger > 1:
                stride_per_key_per_rank_stagger: List[List[int]] = []
                local_world_size = num_workers // stagger
                for i in range(len(keys)):
                    stride_per_rank_stagger: List[int] = []
                    for j in range(local_world_size):
                        stride_per_rank_stagger.extend(
                            stride_per_key_per_rank[i][j::local_world_size]
                        )
                    stride_per_key_per_rank_stagger.append(stride_per_rank_stagger)
                stride_per_key_per_rank = stride_per_key_per_rank_stagger
```
* D76875546 (correct, this diff)
```
            if stagger > 1:
                indices = torch.arange(num_workers).view(stagger, -1).T.reshape(-1)
                stride_per_key_per_rank = stride_per_key_per_rank[:, indices]
```
* D74366343 (incorrect, reverted)
```
            if stagger > 1:
                local_world_size = num_workers // stagger
                indices = [
                    list(range(i, num_workers, local_world_size))
                    for i in range(local_world_size)
                ]
                stride_per_key_per_rank = stride_per_key_per_rank[:, indices]
```
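
For reference, a minimal sketch with illustrative values (`num_workers=4`, `stagger=2`, 2 keys; not taken from the actual run) showing that the one-liner in this diff reproduces the baseline ordering, while the reverted nested-list indexing changes the tensor rank and so trips the `dim() != 2` branch above:
```
import torch

num_workers, stagger = 4, 2                      # illustrative values
local_world_size = num_workers // stagger
stride_per_key_per_rank = torch.arange(8).view(2, num_workers)  # 2 keys x 4 workers

# D76875546: a 1-D index tensor, the result stays 2-D
idx_1d = torch.arange(num_workers).view(stagger, -1).T.reshape(-1)
print(idx_1d.tolist())                           # [0, 2, 1, 3] -- same order as the baseline loop
print(stride_per_key_per_rank[:, idx_1d].shape)  # torch.Size([2, 4])

# D74366343 (reverted): a nested list is a 2-D index, so advanced indexing
# returns a 3-D tensor, and _maybe_compute_stride_per_key then returns []
idx_2d = [list(range(i, num_workers, local_world_size)) for i in range(local_world_size)]
print(idx_2d)                                    # [[0, 2], [1, 3]]
print(stride_per_key_per_rank[:, idx_2d].shape)  # torch.Size([2, 2, 2])
```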

Differential Revision: D76875546
@TroyGarden TroyGarden changed the title from "fix stride_per_key_per_rank in stagger scenario in D74366343 (#3111)" to "fix validate nightly binaries" Jun 20, 2025
Summary:
# context
* the nightly binary validation fails with the import error below:
```
+++ conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ local cmd=run
+++ case "$cmd" in
+++ __conda_exe run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ /opt/conda/bin/conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
WARNING: overwriting environment variables set in the machine
overwriting variable {'LD_LIBRARY_PATH'}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/pytorch/torchrec/torchrec/__init__.py", line 10, in <module>
    import torchrec.distributed  # noqa
  File "/pytorch/torchrec/torchrec/distributed/__init__.py", line 38, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/pytorch/torchrec/torchrec/distributed/model_parallel.py", line 18, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/split_table_batched_embeddings_ops_training.py", line 54, in <module>
    from fbgemm_gpu.tbe.stats import TBEBenchmarkParamsReporter
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/__init__.py", line 10, in <module>
    from .bench_params_reporter import TBEBenchmarkParamsReporter  # noqa F401
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/bench_params_reporter.py", line 19, in <module>
    from fbgemm_gpu.tbe.bench.tbe_data_config import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/__init__.py", line 12, in <module>
    from .bench_config import (  # noqa F401
Traceback (most recent call last):
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/bench_config.py", line 14, in <module>
    import click
ModuleNotFoundError: No module named 'click'

ERROR conda.cli.main_run:execute(47): `conda run python -c import torch; import fbgemm_gpu; import torchrec` failed. (See above for error)
    main()
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 96827edf14ff626b7bc16b6cfaa56aa27b4b660029e1fd7755d14bf20a3c4e96 /exec failed with exit code 1
Error: Process completed with exit code 1.
```

Differential Revision: D76875546

@TroyGarden TroyGarden closed this Jun 20, 2025
@TroyGarden TroyGarden deleted the export-D76875546 branch June 21, 2025 01:49