-
To demonstrate how to participate in a Kaggle competition with Flax/JAX, I have made a Kaggle notebook where I apply transfer learning with a pre-trained ResNet. Currently I am running the notebook on GPU because TPUs are always in short supply. I found that the training speed on GPU is surprisingly slow (~2 sec/iteration). I have been debugging it for a while but haven't had any luck, so I would like to share my work here and see if any experienced developers could help identify the issues in my notebook. There is a lack of Flax use cases in the Kaggle community, so I believe that, if the issue gets fixed, this notebook could serve as a great reference for Flax users who want to take part in Kaggle. Any help would be appreciated!
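For context, here is a minimal sketch of the kind of transfer-learning setup the notebook uses: a pre-trained backbone kept frozen, with a fresh classification head trained on top, the freezing done via `optax.multi_transform`. The `TinyBackbone` module below is a hypothetical stand-in for the pre-trained ResNet (the real notebook loads actual pre-trained parameters), and the shapes and hyperparameters are assumptions, not the notebook's actual values:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax
from flax import traverse_util
from flax.core import unfreeze


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the pre-trained ResNet; in the real notebook
    the backbone and its parameters come from a pre-trained checkpoint."""

    @nn.compact
    def __call__(self, x):
        x = nn.Conv(features=64, kernel_size=(3, 3))(x)
        x = nn.relu(x)
        return jnp.mean(x, axis=(1, 2))  # global average pooling


class TransferModel(nn.Module):
    backbone: nn.Module
    num_classes: int

    @nn.compact
    def __call__(self, x):
        feats = self.backbone(x)                  # pre-trained feature extractor
        return nn.Dense(self.num_classes)(feats)  # new head, trained from scratch


model = TransferModel(backbone=TinyBackbone(), num_classes=5)
params = unfreeze(
    model.init(jax.random.PRNGKey(0), jnp.ones((1, 224, 224, 3)))["params"]
)

# Freeze the backbone by routing its parameters to a zero-update transform,
# so only the classification head actually trains.
flat = traverse_util.flatten_dict(params)
labels = traverse_util.unflatten_dict(
    {k: "frozen" if k[0] == "backbone" else "trainable" for k in flat}
)
tx = optax.multi_transform(
    {"trainable": optax.adam(1e-3), "frozen": optax.set_to_zero()}, labels
)
opt_state = tx.init(params)
```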
-
Hi @riven314, great that you are experimenting with this! Some questions:
Thanks in advance!
-
Hi @marcvanzee
Thanks for your response! After some debugging, I found the main bottleneck to be on-the-fly resizing during data loading.
Replacing it with a pre-resized dataset significantly improved the training speed; a sketch of the fix is below.
As the bottleneck is not related to Flax, I will close this discussion for now.
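In case it helps other Flax users on Kaggle, here is a minimal sketch of the fix: resize every image once, up front, then point the data loader at the pre-resized copies instead of resizing on every epoch. The directory names and the 224x224 target size are assumptions for illustration, not the notebook's actual values:

```python
from pathlib import Path
from PIL import Image

SRC_DIR = Path("train_images")       # hypothetical: original competition images
DST_DIR = Path("train_images_224")   # hypothetical: pre-resized copies
TARGET_SIZE = (224, 224)             # assumed ResNet input size

DST_DIR.mkdir(exist_ok=True)
for src in SRC_DIR.glob("*.jpg"):
    # Resize once, offline, so the training loop never touches full-size images.
    img = Image.open(src).convert("RGB")
    img.resize(TARGET_SIZE, Image.BILINEAR).save(DST_DIR / src.name, quality=95)
```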