fix validate nightly binaries #3117

Closed
wants to merge 1 commit

Conversation

Contributor

@TroyGarden TroyGarden commented Jun 20, 2025

context

  • validate_binaries.sh installs the torchrec dependencies manually, and this list often gets out of sync with the actual requirements, as in the failure below:
```
+++ conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ local cmd=run
+++ case "$cmd" in
+++ __conda_exe run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ /opt/conda/bin/conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
WARNING: overwriting environment variables set in the machine
overwriting variable {'LD_LIBRARY_PATH'}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/pytorch/torchrec/torchrec/__init__.py", line 10, in <module>
    import torchrec.distributed  # noqa
  File "/pytorch/torchrec/torchrec/distributed/__init__.py", line 38, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/pytorch/torchrec/torchrec/distributed/model_parallel.py", line 18, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/split_table_batched_embeddings_ops_training.py", line 54, in <module>
    from fbgemm_gpu.tbe.stats import TBEBenchmarkParamsReporter
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/__init__.py", line 10, in <module>
    from .bench_params_reporter import TBEBenchmarkParamsReporter  # noqa F401
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/bench_params_reporter.py", line 19, in <module>
    from fbgemm_gpu.tbe.bench.tbe_data_config import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/__init__.py", line 12, in <module>
    from .bench_config import (  # noqa F401
Traceback (most recent call last):
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/bench_config.py", line 14, in <module>
    import click
ModuleNotFoundError: No module named 'click'

ERROR conda.cli.main_run:execute(47): `conda run python -c import torch; import fbgemm_gpu; import torchrec` failed. (See above for error)
    main()
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 96827edf14ff626b7bc16b6cfaa56aa27b4b660029e1fd7755d14bf20a3c4e96 /exec failed with exit code 1
Error: Process completed with exit code 1.
```
  • this diff installs the dependencies from requirements.txt instead
    NOTE: the paths in the workflow yaml file need to be wrapped in '' to protect them

@facebook-github-bot facebook-github-bot added the CLA Signed label Jun 20, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D76875546

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jun 20, 2025
Summary:


# context
* the original diff D74366343 broke the cogwheel test and was reverted
* the error stack P1844048578 is shown below:
```
  File "/dev/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/dev/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/dev/torchrec/distributed/train_pipeline/runtime_forwards.py", line 84, in __call__
    data = request.wait()
  File "/dev/torchrec/distributed/types.py", line 334, in wait
    ret: W = self._wait_impl()
  File "/dev/torchrec/distributed/embedding_sharding.py", line 655, in _wait_impl
    kjts.append(w.wait())
  File "/dev/torchrec/distributed/types.py", line 334, in wait
    ret: W = self._wait_impl()
  File "/dev/torchrec/distributed/dist_data.py", line 426, in _wait_impl
    return type(self._input).dist_init(
  File "/dev/torchrec/sparse/jagged_tensor.py", line 2993, in dist_init
    return kjt.sync()
  File "/dev/torchrec/sparse/jagged_tensor.py", line 2067, in sync
    self.length_per_key()
  File "/dev/torchrec/sparse/jagged_tensor.py", line 2281, in length_per_key
    _length_per_key = _maybe_compute_length_per_key(
  File "/dev/torchrec/sparse/jagged_tensor.py", line 1192, in _maybe_compute_length_per_key
    _length_per_key_from_stride_per_key(lengths, stride_per_key)
  File "/dev/torchrec/sparse/jagged_tensor.py", line 1144, in _length_per_key_from_stride_per_key
    if _use_segment_sum_csr(stride_per_key):
  File "/dev/torchrec/sparse/jagged_tensor.py", line 1131, in _use_segment_sum_csr
    elements_per_segment = sum(stride_per_key) / len(stride_per_key)
ZeroDivisionError: division by zero
```
* the complaint is that `stride_per_key` is an empty list, which comes from the following function call:
```
        stride_per_key = _maybe_compute_stride_per_key(
            self._stride_per_key,
            self._stride_per_key_per_rank,
            self.stride(),
            self._keys,
        )
```
* the only place this `stride_per_key` can be empty is when `stride_per_key_per_rank.dim() != 2` (a small repro sketch follows the function below):
```
def _maybe_compute_stride_per_key(
    stride_per_key: Optional[List[int]],
    stride_per_key_per_rank: Optional[torch.IntTensor],
    stride: Optional[int],
    keys: List[str],
) -> Optional[List[int]]:
    if stride_per_key is not None:
        return stride_per_key
    elif stride_per_key_per_rank is not None:
        if stride_per_key_per_rank.dim() != 2:
            # after permute the kjt could be empty
            return []
        rt: List[int] = stride_per_key_per_rank.sum(dim=1).tolist()
        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            pt2_checks_all_is_size(rt)
        return rt
    elif stride is not None:
        return [stride] * len(keys)
    else:
        return None
```
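
To make the failure path concrete, here is a minimal standalone sketch (not the torchrec source; the helper name is made up for illustration) of how a tensor whose rank is not 2 produces the empty `stride_per_key` and then the ZeroDivisionError above:
```
import torch

def stride_per_key_from_per_rank(stride_per_key_per_rank: torch.Tensor):
    # mirrors the dim() check in _maybe_compute_stride_per_key above
    if stride_per_key_per_rank.dim() != 2:
        return []  # "after permute the kjt could be empty"
    return stride_per_key_per_rank.sum(dim=1).tolist()

# a 3-D tensor, e.g. the result of indexing a 2-D tensor with a nested list
bad = torch.ones(2, 2, 2, dtype=torch.int64)
stride_per_key = stride_per_key_from_per_rank(bad)  # -> []

# downstream average, as in _use_segment_sum_csr
elements_per_segment = sum(stride_per_key) / len(stride_per_key)  # ZeroDivisionError: division by zero
```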
# the main change from D74366343 is how the staggered `stride_per_key_per_rank` is computed in `dist_init` (a comparison sketch follows the three snippets below):
* baseline
```
            if stagger > 1:
                stride_per_key_per_rank_stagger: List[List[int]] = []
                local_world_size = num_workers // stagger
                for i in range(len(keys)):
                    stride_per_rank_stagger: List[int] = []
                    for j in range(local_world_size):
                        stride_per_rank_stagger.extend(
                            stride_per_key_per_rank[i][j::local_world_size]
                        )
                    stride_per_key_per_rank_stagger.append(stride_per_rank_stagger)
                stride_per_key_per_rank = stride_per_key_per_rank_stagger
```
* D76875546 (correct, this diff)
```
            if stagger > 1:
                indices = torch.arange(num_workers).view(stagger, -1).T.reshape(-1)
                stride_per_key_per_rank = stride_per_key_per_rank[:, indices]
```
* D74366343 (incorrect, reverted)
```
            if stagger > 1:
                local_world_size = num_workers // stagger
                indices = [
                    list(range(i, num_workers, local_world_size))
                    for i in range(local_world_size)
                ]
                stride_per_key_per_rank = stride_per_key_per_rank[:, indices]
```
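
For reference, a minimal sketch with illustrative values (`num_workers=4`, `stagger=2`, 2 keys; not taken from the actual run) showing that the one-liner in this diff reproduces the baseline ordering, while the reverted nested-list indexing changes the tensor rank and so trips the `dim() != 2` branch above:
```
import torch

num_workers, stagger = 4, 2                      # illustrative values
local_world_size = num_workers // stagger
stride_per_key_per_rank = torch.arange(8).view(2, num_workers)  # 2 keys x 4 workers

# D76875546: a 1-D index tensor, the result stays 2-D
idx_1d = torch.arange(num_workers).view(stagger, -1).T.reshape(-1)
print(idx_1d.tolist())                           # [0, 2, 1, 3] -- same order as the baseline loop
print(stride_per_key_per_rank[:, idx_1d].shape)  # torch.Size([2, 4])

# D74366343 (reverted): a nested list is a 2-D index, so advanced indexing
# returns a 3-D tensor, and _maybe_compute_stride_per_key then returns []
idx_2d = [list(range(i, num_workers, local_world_size)) for i in range(local_world_size)]
print(idx_2d)                                    # [[0, 2], [1, 3]]
print(stride_per_key_per_rank[:, idx_2d].shape)  # torch.Size([2, 2, 2])
```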

Differential Revision: D76875546
@TroyGarden TroyGarden changed the title from "fix stride_per_key_per_rank in stagger scenario in D74366343 (#3111)" to "fix validate nightly binaries" Jun 20, 2025
Summary:
# context
* the nightly binary validation fails with the import error below:
```
+++ conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ local cmd=run
+++ case "$cmd" in
+++ __conda_exe run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
+++ /opt/conda/bin/conda run -n build_binary python -c 'import torch; import fbgemm_gpu; import torchrec'
WARNING: overwriting environment variables set in the machine
overwriting variable {'LD_LIBRARY_PATH'}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/pytorch/torchrec/torchrec/__init__.py", line 10, in <module>
    import torchrec.distributed  # noqa
  File "/pytorch/torchrec/torchrec/distributed/__init__.py", line 38, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/pytorch/torchrec/torchrec/distributed/model_parallel.py", line 18, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/split_table_batched_embeddings_ops_training.py", line 54, in <module>
    from fbgemm_gpu.tbe.stats import TBEBenchmarkParamsReporter
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/__init__.py", line 10, in <module>
    from .bench_params_reporter import TBEBenchmarkParamsReporter  # noqa F401
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/stats/bench_params_reporter.py", line 19, in <module>
    from fbgemm_gpu.tbe.bench.tbe_data_config import (
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/__init__.py", line 12, in <module>
    from .bench_config import (  # noqa F401
Traceback (most recent call last):
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
  File "/opt/conda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/tbe/bench/bench_config.py", line 14, in <module>
    import click
ModuleNotFoundError: No module named 'click'

ERROR conda.cli.main_run:execute(47): `conda run python -c import torch; import fbgemm_gpu; import torchrec` failed. (See above for error)
    main()
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/torchrec/torchrec/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 96827edf14ff626b7bc16b6cfaa56aa27b4b660029e1fd7755d14bf20a3c4e96 /exec failed with exit code 1
Error: Process completed with exit code 1.
```

Differential Revision: D76875546

@TroyGarden TroyGarden closed this Jun 20, 2025
@TroyGarden TroyGarden deleted the export-D76875546 branch June 21, 2025 01:49