# allreduce benchmark

Qinlong Wang edited this page Jul 29, 2020 · 15 revisions
Batch size: 64. Number of batches per task: 50. Dataset: CIFAR-10, image size (32, 32, 3).

Worker resource: cpu=0.3, memory=2048Mi, ephemeral-storage=1024Mi

ResNet50 is a computation-intensive model; it has 23,555,082 trainable parameters for CIFAR-10.
| Workers | computation/communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | 0% | 3.1 images/s | 1 |
| 2 | 10:1 | 5.65 images/s | 1.82 |
MobileNetV2 is a communication-intensive model; it has 2,236,682 trainable parameters.
| Workers | computation/communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | - | 29 images/s | 1 |
| 2 | 10:3 | 44.7 images/s | 1.54 |
| 3 | 10:6 | 57.2 images/s | 1.97 |
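The speedup ratios above are simply the multi-worker throughput divided by the single-worker throughput. A minimal sketch, with the numbers taken from the tables above:

```python
def speedup_ratio(multi_worker_speed, single_worker_speed):
    """Speedup ratio = throughput with N workers / throughput with 1 worker."""
    return round(multi_worker_speed / single_worker_speed, 2)

# ResNet50 on CIFAR-10: 2 workers vs. 1 worker
print(speedup_ratio(5.65, 3.1))   # 1.82
# MobileNetV2: 3 workers vs. 1 worker
print(speedup_ratio(57.2, 29))    # 1.97
```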
Worker resource: cpu=4, memory=8192Mi, ephemeral-storage=1024Mi

**MobileNetV2**
| Workers | communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | 0% | 353.6 images/s | 1 |
| 2 | 24% | 503 images/s | 1.42 |
| 4 | 44.7% | 680 images/s | 1.92 |
| 8 | 66.7% | 648 images/s | 1.83 |
**ResNet50**
| Workers | communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | 0% | 26.7 images/s | 1 |
| 2 | 18% | 41 images/s | 1.57 |
| 4 | 25% | 68.4 images/s | 2.56 |
| 8 | 32% | 123 images/s | 4.61 |
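One way to read these tables is through parallel efficiency (speedup divided by worker count): as the communication share grows, efficiency drops, and for MobileNetV2 it drops enough that 8 workers are slower than 4. A minimal sketch using the numbers above:

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling actually achieved."""
    return round(speedup / workers, 2)

# MobileNetV2 with cpu=4 workers: efficiency falls as communication share rises.
for workers, speedup in [(2, 1.42), (4, 1.92), (8, 1.83)]:
    print(workers, parallel_efficiency(speedup, workers))
# 2 0.71
# 4 0.48
# 8 0.23
```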
Data: ImageNet, image shape (256, 256, 3)
Mini-batch size: 64
Images per task: 1024

214 trainable weight tensors, 23,739,492 trainable parameters:
| Workers | Speed | Total task time | Allreduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 168 images/s | 6.1s | - | - | 4.16s |
| 2 | 148 images/s | 13.76s | 10.36s | 5.04s | 1.35s |
| 4 | 228 images/s | 18s | 14.67s | 5.14s | 1.30s |
158 trainable weight tensors, 2,386,084 trainable parameters:
| Workers | Speed | Total task time | Allreduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 169 images/s | 6.06s | - | - | 5.59s |
| 2 | 246 images/s | 8.34s | 7.25s | 5.79s | 0.6s |
| 4 | 401 images/s | 10.20s | 8.9s | 5.78s | 0.71s |
34 trainable weight tensors, 11,238,723 trainable parameters:
| Workers | Speed | Total task time | Allreduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 109 images/s | 9.36s | - | - | 8.95s |
| 2 | 176 images/s | 11.65s | 1.47s | 9.36s | 0.42s |
| 4 | 328 images/s | 12.47s | 2.44s | 9.32s | 0.37s |
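The per-phase columns above (allreduce time, `tensor.numpy()` time, `apply_gradients` time) can be collected by wrapping each phase of the training step with `time.perf_counter()`. A minimal sketch with stubbed phases; the names `do_allreduce`, `to_numpy`, and `apply_gradients` here are placeholders for the real training-step calls, not ElasticDL APIs:

```python
import time

def timed(accumulator, key, fn, *args):
    """Run fn, add its wall-clock duration to accumulator[key], return its result."""
    start = time.perf_counter()
    result = fn(*args)
    accumulator[key] = accumulator.get(key, 0.0) + time.perf_counter() - start
    return result

# Placeholder phases standing in for the real per-batch work.
def do_allreduce(grads):     return list(grads)  # would average grads across workers
def to_numpy(grads):         return grads        # would call tensor.numpy() per grad
def apply_gradients(grads):  return None         # would update the model weights

timings = {}
grads = [1.0, 2.0, 3.0]
grads = timed(timings, "allreduce", do_allreduce, grads)
grads = timed(timings, "numpy", to_numpy, grads)
timed(timings, "apply_gradients", apply_gradients, grads)
print(sorted(timings))   # ['allreduce', 'apply_gradients', 'numpy']
```

Summing the accumulated values over the 1024 images of a task gives per-task totals like those in the tables.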
Summary for GPU:
- The speedup ratio is higher when the model has fewer trainable weight tensors.
- The speedup ratio is higher when the computation is more complex.
- It is surprising that `apply_gradients` is so slow.
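A plausible explanation for the first point is per-tensor allreduce overhead: each weight tensor pays a fixed launch/latency cost on top of the bandwidth-bound transfer cost, so many small tensors communicate less efficiently than a few large ones. A toy cost model sketching this; the latency and per-parameter constants are made-up illustrative values, not measurements:

```python
def allreduce_cost(num_tensors, num_params, latency=1e-3, secs_per_param=1e-8):
    """Toy model: fixed per-tensor latency plus bandwidth-bound transfer time."""
    return num_tensors * latency + num_params * secs_per_param

# Tensor/parameter counts from the GPU tables above.
cost_214_tensors = allreduce_cost(214, 23_739_492)
cost_34_tensors = allreduce_cost(34, 11_238_723)
print(cost_214_tensors > cost_34_tensors)   # True
```

Under this model, gradient fusion (concatenating small tensors before the allreduce) would reduce the per-tensor term.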