# allreduce benchmark

Qinlong Wang edited this page Jul 29, 2020 · 15 revisions
Batch size: 64. Number of batches per task: 50. Dataset: CIFAR-10, image size (32, 32, 3).

Worker resource: cpu=0.3, memory=2048Mi, ephemeral-storage=1024Mi

ResNet50 is a computation-intensive model; it has 23,555,082 trainable parameters for CIFAR-10.
| Workers | computation/communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | 0% | 3.1 images/s | 1 |
| 2 | 10:1 | 5.65 images/s | 1.82 |
MobileNetV2 is a communication-intensive model; it has 2,236,682 trainable parameters.
| Workers | computation/communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | - | 29 images/s | 1 |
| 2 | 10:3 | 44.7 images/s | 1.54 |
| 3 | 10:6 | 57.2 images/s | 1.97 |
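The speedup ratios above are simply the multi-worker throughput divided by the single-worker throughput. A minimal sketch, with the numbers taken from the tables above:

```python
def speedup_ratio(multi_worker_speed, single_worker_speed):
    """Speedup ratio = throughput with N workers / throughput with 1 worker."""
    return round(multi_worker_speed / single_worker_speed, 2)

# ResNet50 on CIFAR-10: 2 workers vs. 1 worker
print(speedup_ratio(5.65, 3.1))   # 1.82
# MobileNetV2: 3 workers vs. 1 worker
print(speedup_ratio(57.2, 29))    # 1.97
```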
Worker resource: cpu=4, memory=8192Mi, ephemeral-storage=1024Mi

**MobileNetV2**
| Workers | communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | 0% | 353.6 images/s | 1 |
| 2 | 24% | 503 images/s | 1.42 |
| 4 | 44.7% | 680 images/s | 1.92 |
| 8 | 66.7% | 648 images/s | 1.83 |
**ResNet50**
| Workers | communication | Speed | Speedup Ratio |
|---|---|---|---|
| 1 | 0% | 26.7 images/s | 1 |
| 2 | 18% | 41 images/s | 1.57 |
| 4 | 25% | 68.4 images/s | 2.56 |
| 8 | 32% | 123 images/s | 4.61 |
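One way to read these tables is through parallel efficiency (speedup divided by worker count): as the communication share grows, efficiency drops, and for MobileNetV2 it drops enough that 8 workers are slower than 4. A minimal sketch using the numbers above:

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling actually achieved."""
    return round(speedup / workers, 2)

# MobileNetV2 with cpu=4 workers: efficiency falls as communication share rises.
for workers, speedup in [(2, 1.42), (4, 1.92), (8, 1.83)]:
    print(workers, parallel_efficiency(speedup, workers))
# 2 0.71
# 4 0.48
# 8 0.23
```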
Data: ImageNet, image shape (256, 256, 3)
Mini-batch size: 64
Images per task: 1024

214 trainable weight tensors, 23,739,492 trainable parameters:
| Workers | Speed | Total task time | Allreduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 168 images/s | 6.1s | - | - | 4.16s |
| 2 | 148 images/s | 13.76s | 10.36s | 5.04s | 1.35s |
| 4 | 228 images/s | 18s | 14.67s | 5.14s | 1.30s |
158 trainable weight tensors, 2,386,084 trainable parameters:
| Workers | Speed | Total task time | Allreduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 169 images/s | 6.06s | - | - | 5.59s |
| 2 | 246 images/s | 8.34s | 7.25s | 5.79s | 0.6s |
| 4 | 401 images/s | 10.20s | 8.9s | 5.78s | 0.71s |
34 trainable weight tensors, 11,238,723 trainable parameters:
| Workers | Speed | Total task time | Allreduce time | tensor.numpy() time | apply_gradients time |
|---|---|---|---|---|---|
| 1 (local) | 109 images/s | 9.36s | - | - | 8.95s |
| 2 | 176 images/s | 11.65s | 1.47s | 9.36s | 0.42s |
| 4 | 328 images/s | 12.47s | 2.44s | 9.32s | 0.37s |
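The per-phase columns above (allreduce time, `tensor.numpy()` time, `apply_gradients` time) can be collected by wrapping each phase of the training step with `time.perf_counter()`. A minimal sketch with stubbed phases; the names `do_allreduce`, `to_numpy`, and `apply_gradients` here are placeholders for the real training-step calls, not ElasticDL APIs:

```python
import time

def timed(accumulator, key, fn, *args):
    """Run fn, add its wall-clock duration to accumulator[key], return its result."""
    start = time.perf_counter()
    result = fn(*args)
    accumulator[key] = accumulator.get(key, 0.0) + time.perf_counter() - start
    return result

# Placeholder phases standing in for the real per-batch work.
def do_allreduce(grads):     return list(grads)  # would average grads across workers
def to_numpy(grads):         return grads        # would call tensor.numpy() per grad
def apply_gradients(grads):  return None         # would update the model weights

timings = {}
grads = [1.0, 2.0, 3.0]
grads = timed(timings, "allreduce", do_allreduce, grads)
grads = timed(timings, "numpy", to_numpy, grads)
timed(timings, "apply_gradients", apply_gradients, grads)
print(sorted(timings))   # ['allreduce', 'apply_gradients', 'numpy']
```

Summing the accumulated values over the 1024 images of a task gives per-task totals like those in the tables.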
Summary for GPU:
- The speedup ratio is higher when the model has fewer trainable weight tensors.
- The speedup ratio is higher when the computation is more complex.
- It is surprising that `apply_gradients` is so slow.
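A plausible explanation for the first point is per-tensor allreduce overhead: each weight tensor pays a fixed launch/latency cost on top of the bandwidth-bound transfer cost, so many small tensors communicate less efficiently than a few large ones. A toy cost model sketching this; the latency and per-parameter constants are made-up illustrative values, not measurements:

```python
def allreduce_cost(num_tensors, num_params, latency=1e-3, secs_per_param=1e-8):
    """Toy model: fixed per-tensor latency plus bandwidth-bound transfer time."""
    return num_tensors * latency + num_params * secs_per_param

# Tensor/parameter counts from the GPU tables above.
cost_214_tensors = allreduce_cost(214, 23_739_492)
cost_34_tensors = allreduce_cost(34, 11_238_723)
print(cost_214_tensors > cost_34_tensors)   # True
```

Under this model, gradient fusion (concatenating small tensors before the allreduce) would reduce the per-tensor term.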