
# AllReduce Benchmark

## Minikube

- Batch size: 64
- Number of batches per task: 50
- Dataset: CIFAR-10, image size (32, 32, 3)
- Worker resources: `cpu=0.3,memory=2048Mi,ephemeral-storage=1024Mi`
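For reference, a minimal sketch of an input pipeline matching this configuration, using plain `tf.data` (the benchmark's actual pipeline code is not shown on this page):

```python
import tensorflow as tf

BATCH_SIZE = 64        # batch size from the setup above
BATCHES_PER_TASK = 50  # number of batches per task

# CIFAR-10 images have shape (32, 32, 3).
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

# One "task" worth of data: 50 batches of 64 images.
task_dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
    .shuffle(1024)
    .batch(BATCH_SIZE)
    .take(BATCHES_PER_TASK)
)
```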

### ResNet50

ResNet50 is a computation-intensive model; its CIFAR-10 variant has 23,555,082 trainable parameters.

| Workers | Computation/communication | Speed | Speedup ratio |
|---------|---------------------------|-------|---------------|
| 1       | 0%                        | 3.1 images/s  | 1    |
| 2       | 10 : 1                    | 5.65 images/s | 1.82 |
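The speedup ratio in these tables is simply measured throughput normalized by the single-worker run, for example:

```python
def speedup_ratio(speed_n_workers, speed_one_worker):
    """Throughput relative to the single-worker baseline."""
    return speed_n_workers / speed_one_worker

print(round(speedup_ratio(5.65, 3.1), 2))  # 2-worker ResNet50 -> 1.82
```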

### MobileNetV2

MobileNetV2 is a communication-intensive model; it has 2,236,682 trainable parameters.

| Workers | Computation/communication | Speed | Speedup ratio |
|---------|---------------------------|-------|---------------|
| 1       | -                         | 29 images/s   | 1    |
| 2       | 10 : 3                    | 44.7 images/s | 1.54 |
| 3       | 10 : 6                    | 57.2 images/s | 1.97 |
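Both parameter counts quoted above happen to match the stock Keras applications models with 10 output classes, so the models can be reproduced for inspection roughly as follows (an assumption about the exact model code, which is not shown on this page):

```python
import tensorflow as tf

def trainable_param_count(model):
    # Sum over trainable weight tensors only (excludes BN moving statistics).
    return sum(w.shape.num_elements() for w in model.trainable_weights)

resnet = tf.keras.applications.ResNet50(
    weights=None, input_shape=(32, 32, 3), classes=10)
mobilenet = tf.keras.applications.MobileNetV2(
    weights=None, input_shape=(32, 32, 3), classes=10)

print(trainable_param_count(resnet))     # 23,555,082
print(trainable_param_count(mobilenet))  # 2,236,682
```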

## ASI

### CPU only

Worker resources: `cpu=4,memory=8192Mi,ephemeral-storage=1024Mi`

#### MobileNetV2

| Workers | Communication share | Speed | Speedup ratio |
|---------|---------------------|-------|---------------|
| 1       | 0%                  | 353.6 images/s | 1    |
| 2       | 24%                 | 503 images/s   | 1.42 |
| 4       | 44.7%               | 680 images/s   | 1.92 |
| 8       | 66.7%               | 648 images/s   | 1.83 |

#### ResNet50

| Workers | Communication share | Speed | Speedup ratio |
|---------|---------------------|-------|---------------|
| 1       | 0%                  | 26.7 images/s  | 1    |
| 2       | 18%                 | 41 images/s    | 1.57 |
| 4       | 25%                 | 68.4 images/s  | 2.56 |
| 8       | 32%                 | 123 images/s   | 4.61 |

### GPU

#### Setup

- Dataset: ImageNet, image shape (256, 256, 3)
- Mini-batch size: 64
- Images per task: 1024
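In the tables below, speed is consistent with (workers × 1024 images) / total task time. The per-phase columns (allreduce, `tensor.numpy()`, `apply_gradients`) suggest timers wrapped around each phase of the training step. A minimal sketch of such instrumentation, assuming a Horovod-style allreduce API rather than the benchmark's actual internals:

```python
import time
import tensorflow as tf
import horovod.tensorflow as hvd  # assumption: a Horovod-style allreduce

hvd.init()

def timed_step(model, optimizer, loss_fn, images, labels):
    """Run one training step and time each phase, mirroring the table columns."""
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)

    t0 = time.perf_counter()
    grads = [hvd.allreduce(g) for g in grads]   # "allreduce time"
    t1 = time.perf_counter()
    _ = [g.numpy() for g in grads]              # "tensor.numpy() time"
    t2 = time.perf_counter()
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    t3 = time.perf_counter()
    return t1 - t0, t2 - t1, t3 - t2
```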

#### ResNet50

- Trainable weight tensors: 214
- Trainable params: 23,739,492
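The weight-tensor count matters because each gradient tensor is typically a separate allreduce call. Both numbers can be read off a Keras model directly; 214 tensors matches the stock Keras ResNet50, and the parameter count matches `classes=100` (an assumption, since the benchmark's model code is not shown here):

```python
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None, classes=100)

# Number of gradient tensors to allreduce per step.
print(len(model.trainable_weights))  # 214

# Total trainable parameters.
print(sum(w.shape.num_elements() for w in model.trainable_weights))  # 23,739,492
```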

| Workers   | Speed | Total task time | Allreduce time | `tensor.numpy()` time | `apply_gradients` time |
|-----------|-------|-----------------|----------------|-----------------------|------------------------|
| 1 (local) | 168 images/s | 6.1s   | -      | -     | 4.16s |
| 2         | 148 images/s | 13.76s | 10.36s | 5.04s | 1.35s |
| 4         | 228 images/s | 18s    | 14.67s | 5.14s | 1.30s |


#### MobileNetV2

- Trainable weight tensors: 158
- Trainable params: 2,386,084

| Workers   | Speed | Total task time | Allreduce time | `tensor.numpy()` time | `apply_gradients` time |
|-----------|-------|-----------------|----------------|-----------------------|------------------------|
| 1 (local) | 169 images/s | 6.06s  | -     | -     | 5.59s |
| 2         | 246 images/s | 8.34s  | 7.25s | 5.79s | 0.6s  |
| 4         | 401 images/s | 10.20s | 8.9s  | 5.78s | 0.71s |


#### Compression model with Conv2DTranspose

- Trainable weight tensors: 34
- Trainable params: 11,238,723

| Workers   | Speed | Total task time | Allreduce time | `tensor.numpy()` time | `apply_gradients` time |
|-----------|-------|-----------------|----------------|-----------------------|------------------------|
| 1 (local) | 109 images/s | 9.36s  | -     | -     | 8.95s |
| 2         | 176 images/s | 11.65s | 1.47s | 9.36s | 0.42s |
| 4         | 328 images/s | 12.47s | 2.44s | 9.32s | 0.37s |
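The page does not include this model's definition. As an illustration only, a hypothetical encoder-decoder in the same spirit (a handful of large `Conv2D`/`Conv2DTranspose` kernels, hence very few gradient tensors) could look like:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_compression_model(input_shape=(256, 256, 3)):
    # Hypothetical sketch, NOT the benchmarked network: a shallow
    # encoder-decoder whose parameters sit in a few large kernels.
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(256, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2DTranspose(3, 5, strides=2, padding="same")(x)
    return tf.keras.Model(inputs, outputs)
```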


Summary for GPU:

1. The speedup ratio is better when a model has fewer trainable weight tensors, since each tensor triggers a separate allreduce call.
2. The speedup ratio is better when the per-step computation is heavier, since the communication cost is then a smaller share of each step.
3. It is odd that `apply_gradients` is so slow, especially in the single-worker runs (e.g. 4.16s of the 6.1s ResNet50 task).
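Points 1 and 2 line up with a back-of-the-envelope view of the per-step communication load: each trainable tensor is one allreduce call, and each float32 parameter contributes 4 bytes of gradient traffic:

```python
# (tensor count, trainable params) from the tables above.
MODELS = {
    "ResNet50": (214, 23_739_492),
    "MobileNetV2": (158, 2_386_084),
    "Conv2DTranspose model": (34, 11_238_723),
}

for name, (tensors, params) in MODELS.items():
    payload_mb = params * 4 / 1e6  # float32 gradients, 4 bytes each
    print(f"{name}: {tensors} allreduce calls/step, ~{payload_mb:.1f} MB/step")
```

The Conv2DTranspose model allreduces ~45 MB per step but in only 34 calls, and it shows both the shortest allreduce times and the best scaling in the tables above.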