[WIP] Streaming DiLoCo prototype #203

Draft
wants to merge 1 commit into main from streaming_diloco

Conversation

H-Huang
Member

@H-Huang commented May 28, 2025

Creating a small script to quickly hack on an implementation of streaming DiLoCo.

Run with the following (start the lighthouse first using the command in README.md):

cd streaming_diloco_prototype
torchx run
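
For context, the core idea being prototyped is to split the model's parameters into fragments and synchronize each fragment's outer (pseudo-gradient) update on a staggered schedule, so communication for one fragment can overlap with the inner optimization steps of the others. The snippet below is only a self-contained sketch of that scheduling, not the contents of train.py; the names (`fragment_parameters`, `should_sync`, `sync_every`) are made up for illustration.

```python
import torch

def fragment_parameters(params, num_fragments):
    """Split a parameter list into roughly equal, contiguous fragments."""
    per_frag = (len(params) + num_fragments - 1) // num_fragments
    return [params[i:i + per_frag] for i in range(0, len(params), per_frag)]

def should_sync(step, fragment_idx, sync_every, num_fragments):
    """Stagger fragment syncs so only one fragment communicates per sync slot."""
    offset = fragment_idx * (sync_every // num_fragments)
    return step > 0 and (step - offset) % sync_every == 0

# 8 dummy "parameters", 4 fragments, outer sync every 8 inner steps.
params = [torch.zeros(4, 4) for _ in range(8)]
fragments = fragment_parameters(params, num_fragments=4)
for step in range(1, 17):
    due = [i for i in range(len(fragments)) if should_sync(step, i, 8, len(fragments))]
    if due:
        print(f"step {step}: sync fragment(s) {due}")
```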

Issues found:

  1. Quantization only supports 2D tensors (can work around; see the sketch after this traceback)
replica_0/0     File "/home/howardhuang/.conda/envs/torchft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
replica_0/0       return f(*args, **kwargs)
replica_0/0     File "/data/users/howardhuang/torchft/streaming_diloco_prototype/train.py", line 275, in streaming_diloco
replica_0/0       fut = allreduce_quantized(params_data, ReduceOp.AVG, pg)
replica_0/0     File "/data/users/howardhuang/torchft/torchft/collectives.py", line 104, in allreduce_quantized
replica_0/0       quantized_tensors = fused_quantize_into_fp8(tensors, world_size)
replica_0/0     File "/data/users/howardhuang/torchft/torchft/quantization.py", line 520, in fused_quantize_into_fp8
replica_0/0       ) = _prepare_quantize_fp8(inputs, all_reduce_group_size)
replica_0/0     File "/data/users/howardhuang/torchft/torchft/quantization.py", line 450, in _prepare_quantize_fp8
replica_0/0       assert len(inputs[i].shape) == 2, "Only 2D tensors are supported"
replica_0/0   AssertionError: Only 2D tensors are supported
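
One possible workaround (an illustrative sketch, not part of this PR) is to view any non-2D tensor as 2D before calling allreduce_quantized, so the assertion in _prepare_quantize_fp8 passes. Because a view aliases the original storage, the averaged result lands back in the original tensors. The helper name is hypothetical, and the ReduceOp import is an assumption based on the call shown in the traceback.

```python
import torch
from torch.distributed import ReduceOp

from torchft.collectives import allreduce_quantized

def allreduce_quantized_any_shape(tensors, pg):
    # View every non-2D tensor as (1, N) so the "Only 2D tensors are supported"
    # assertion in _prepare_quantize_fp8 passes. view() requires contiguous
    # tensors and shares storage, so the averaged values propagate back.
    reshaped = [t if t.dim() == 2 else t.view(1, -1) for t in tensors]
    return allreduce_quantized(reshaped, ReduceOp.AVG, pg)
```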
  2. Triton runtime JIT issue when calling _fused_kernel_quantize_into_fp8[grid]
replica_0/0     File "/data/users/howardhuang/torchft/streaming_diloco_prototype/train.py", line 275, in streaming_diloco
replica_0/0       fut = allreduce_quantized(params_data, ReduceOp.AVG, pg)
replica_0/0     File "/data/users/howardhuang/torchft/torchft/collectives.py", line 104, in allreduce_quantized
replica_0/0       quantized_tensors = fused_quantize_into_fp8(tensors, world_size)
replica_0/0     File "/data/users/howardhuang/torchft/torchft/quantization.py", line 531, in fused_quantize_into_fp8
replica_0/0       _fused_kernel_quantize_into_fp8[grid](
replica_0/0     File "/home/howardhuang/.conda/envs/torchft/lib/python3.10/site-packages/triton/runtime/jit.py", line 499, in run
replica_0/0       if key not in self.cache[device]:
replica_0/0   TypeError: unhashable type: 'constexpr'

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on May 28, 2025
@H-Huang force-pushed the streaming_diloco branch 2 times, most recently from bbc34ff to 77d7c42 on May 28, 2025 17:29
@H-Huang force-pushed the streaming_diloco branch from 77d7c42 to 04b3a37 on May 30, 2025 13:10