Understanding TPU efficiency in the examples #1408
-
Hello, this is more a question about TPUs than about Flax, but this seemed like the place to ask it. I'm trying to understand why TPUs seem to be extremely fast in some cases and not in others. In particular, I'm looking at the benchmark numbers reported for the examples. For ImageNet, it looks like 8x TPU v3 are much faster than 8x V100 (and about the same compared to the GPUs running with mixed precision). For the PixelCNN, it looks like 8x TPU v3 are 3x slower than 8x V100! Is this correct, and if so, why is there such a large difference? Thanks!
-
The runtime comparisons for ImageNet are what we usually see when comparing GPUs and TPUs. However, since TPU uses `bfloat16` for matrix multiplications, this can in some extreme cases affect training stability, which is probably what happens in PixelCNN++. In that example we expect the test loss to be below 2.92, which requires a very precise setup, and using TPU here actually slows down training. Please note the slowness in PixelCNN++ is a known issue (#458). Copying @j-towns's latest response on that issue: "Based on some other generative modelling work which I've been doing on TPU lately, it seems the precision parameter to layers like Conv makes a small but noticeable difference to training stability and to test performance. It might be worth finding out whether this affects PixelCNN++ and perhaps adding a command line argument to enable higher precisions." We should probably add a link to the issue in the PixelCNN++ readme!
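For reference, here is a minimal sketch (not taken from the PixelCNN++ example itself) of what raising the matmul precision looks like in Flax. The `precision` argument on `nn.Conv` and the `jax.lax.Precision` enum are the real knobs being discussed above; the `SmallConvNet` module is just a hypothetical illustration.

```python
# Sketch: raising matmul precision on TPU via the `precision` argument.
import jax
import jax.numpy as jnp
from flax import linen as nn


class SmallConvNet(nn.Module):
    """Toy module showing how per-layer precision can be raised."""
    precision: jax.lax.Precision = jax.lax.Precision.DEFAULT

    @nn.compact
    def __call__(self, x):
        # On TPU, DEFAULT lowers matmul inputs to bfloat16; HIGHEST keeps float32.
        x = nn.Conv(features=32, kernel_size=(3, 3), precision=self.precision)(x)
        x = nn.relu(x)
        x = nn.Conv(features=32, kernel_size=(3, 3), precision=self.precision)(x)
        return x


# Example: instantiate with full precision, e.g. gated by a command-line flag.
model = SmallConvNet(precision=jax.lax.Precision.HIGHEST)
params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 28, 28, 1)))
```

Note that `Precision.HIGHEST` trades speed for numerical accuracy, so it would likely narrow (or erase) the TPU's usual throughput advantage while improving stability.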
-
Great, thank you for the very comprehensive response!