The claim in this article and the notebook is quite misleading.
PyTorch's CNN implementation already uses cuDNN or the faster fbfft module. For smaller kernel sizes, the Winograd algorithm is even faster.
Your comparison is against convolution done in position space with a Python for loop, not against any of these optimized implementations.
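To make the point about baselines concrete, here is a minimal 1-D NumPy sketch. It is not the notebook's code and not what PyTorch does internally; the function names `conv_loop` and `conv_fft` and the array sizes are my own choices for illustration. It checks that a Python-loop convolution, a compiled direct convolution (`np.convolve`), and an FFT-based convolution all give the same result, then times each so the FFT version can be judged against a compiled baseline rather than only against the loop.

```python
import timeit

import numpy as np


def conv_loop(x, h):
    """Direct 'full' convolution with explicit Python loops (the slow baseline)."""
    out = np.zeros(len(x) + len(h) - 1)
    for i in range(len(x)):
        for j in range(len(h)):
            out[i + j] += x[i] * h[j]
    return out


def conv_fft(x, h):
    """'Full' convolution via the convolution theorem (zero-padded real FFT)."""
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)


rng = np.random.default_rng(0)
x = rng.standard_normal(2048)  # signal length chosen arbitrarily
h = rng.standard_normal(5)     # small kernel, as in typical CNN layers

# np.convolve is a direct convolution too, but it runs in compiled code.
ref = np.convolve(x, h)
assert np.allclose(conv_loop(x, h), ref)
assert np.allclose(conv_fft(x, h), ref)

# Time all three; the numbers depend on the machine, so none are quoted here.
for name, fn in [("python loop", conv_loop),
                 ("np.convolve", np.convolve),
                 ("fft", conv_fft)]:
    seconds = timeit.timeit(lambda: fn(x, h), number=5) / 5
    print(f"{name:12s} {seconds * 1e3:8.3f} ms")
```

The loop version and `np.convolve` implement the same algorithm, so any gap between them reflects interpreter overhead rather than the merit of the FFT approach. The fair question is how `conv_fft` compares with the compiled direct version at the kernel sizes of interest.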