Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification #1658


Open · wants to merge 7 commits into main

Conversation


@chunhuanMeng chunhuanMeng commented May 14, 2025

This PR refactors and enhances the adaptive_avg_pool2d_backward_kernel implementation in src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp. Key changes include removing a redundant template parameter, adding a new kernel functor for the channels-last memory format, and optimizing memory usage and thread configurations for better performance and maintainability.

Refactoring and Simplification:

  • Removed the is_channels_last template parameter from both AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor, simplifying their implementations and eliminating conditional logic that branched on memory format (a before/after sketch follows).
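
A minimal before/after sketch of the shape of this change (illustrative only; the real functor bodies and signatures in the PR differ):

```cpp
#include <sycl/sycl.hpp>

namespace before {
// One template instantiated per memory format, branching internally.
template <typename scalar_t, typename index_t, bool is_channels_last>
struct AdaptiveAvgPool2dBwdKernelFunctor {
  void operator()(sycl::nd_item<1>) const {
    if constexpr (is_channels_last) {
      // channels-last gradient-accumulation path
    } else {
      // contiguous (NCHW) gradient-accumulation path
    }
  }
};
} // namespace before

namespace after {
// The flag is gone: this functor keeps only the contiguous path, and
// channels-last input is served by a dedicated functor instead.
template <typename scalar_t, typename index_t>
struct AdaptiveAvgPool2dBwdKernelFunctor {
  void operator()(sycl::nd_item<1>) const {
    // contiguous (NCHW) gradient-accumulation path only
  }
};
} // namespace after
```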

New Kernel Functor:

  • Introduced AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast, designed specifically for the channels-last memory format. This functor precomputes input-window indices and pooling factors for efficient gradient computation, caching them in shared local memory (SLM); the underlying index math is sketched below.
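
The quantities being precomputed follow standard adaptive-pooling index math, the same formulas as the START_IND_INT/END_IND_INT macros quoted in the review below. A self-contained sketch (helper names are hypothetical):

```cpp
#include <cstdint>

// Window bounds for output index `o`, pooling an input extent `in` down
// to `out` (same math as the PR's START_IND_INT / END_IND_INT macros).
inline int64_t start_ind(int64_t o, int64_t out, int64_t in) {
  return (o * in) / out;                   // floor(o * in / out)
}
inline int64_t end_ind(int64_t o, int64_t out, int64_t in) {
  return ((o + 1) * in + out - 1) / out;   // ceil((o + 1) * in / out)
}

// Pooling factor cached per output element: backward scales grad_output
// by the reciprocal of its window area before scattering into grad_input.
inline float pooling_factor(
    int64_t oh, int64_t ow,
    int64_t osizeH, int64_t osizeW,
    int64_t isizeH, int64_t isizeW) {
  int64_t kh = end_ind(oh, osizeH, isizeH) - start_ind(oh, osizeH, isizeH);
  int64_t kw = end_ind(ow, osizeW, isizeW) - start_ind(ow, osizeW, isizeW);
  return 1.0f / static_cast<float>(kh * kw);
}
```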

Memory and Thread Optimization:

  • Added constants (XPU_MAX_THREADS, GROUP_STRIDE) and optimized thread-group configurations to improve performance and reduce the number of groups launched.
  • Updated shared-memory usage calculations and added logic to dynamically shrink the thread configuration when the shared local memory limit would be exceeded (see the sketch after this list).
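
A sketch of that fallback logic under stated assumptions: the helper names and the per-thread SLM accounting here are invented for illustration, and only the general do/while shape matches the PR (see the review excerpt further down).

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

constexpr int XPU_MAX_THREADS = 1024; // ceiling; shrunk below if SLM is tight

// Hypothetical SLM footprint model: a fixed cache plus per-thread scratch.
size_t slm_bytes_needed(int threads, size_t fixed_bytes, size_t per_thread_bytes) {
  return fixed_bytes + static_cast<size_t>(threads) * per_thread_bytes;
}

// Halve the work-group size until the kernel's SLM demand fits the device.
int pick_group_size(const sycl::device& dev, size_t fixed_bytes, size_t per_thread_bytes) {
  const size_t local_mem = dev.get_info<sycl::info::device::local_mem_size>();
  int max_threads = XPU_MAX_THREADS;
  bool done = false;
  do {
    if (slm_bytes_needed(max_threads, fixed_bytes, per_thread_bytes) <= local_mem) {
      done = true;
    } else {
      max_threads /= 2; // smaller group, smaller SLM footprint
    }
  } while (!done && max_threads);
  return max_threads; // 0 means even the smallest group cannot fit
}
```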

General Improvements:

  • Replaced hardcoded dimensions with dynamically calculated values (isizeH, isizeW, osizeH, osizeW) for better readability and maintainability; see the short sketch after this list.
  • Removed unused or redundant code.
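
With the usual ATen conventions (the tensor names here are assumptions), those values are read from the tensors at runtime:

```cpp
#include <ATen/core/Tensor.h>

// Derive spatial extents at runtime instead of hardcoding them, so one
// launch path handles arbitrary input/output shapes.
struct PoolSizes { int64_t isizeH, isizeW, osizeH, osizeW; };

PoolSizes pool_sizes(const at::Tensor& input, const at::Tensor& grad_output) {
  return {
      input.size(-2),       // input height
      input.size(-1),       // input width
      grad_output.size(-2), // output height
      grad_output.size(-1)  // output width
  };
}
```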

@chunhuanMeng chunhuanMeng changed the title from "Update AdaptiveAveragePooling2dKernels.cpp" to "Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification" May 14, 2025
@chunhuanMeng chunhuanMeng requested a review from Copilot May 14, 2025 02:29
@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance, simplify code logic, and add a new optimized kernel for channels-last format.

  • Removed the now redundant is_channels_last template parameter and its branches.
  • Introduced a new kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory and group-based processing for enhanced performance.
  • Updated kernel launch configurations and added utility macros for standardized index calculations.

```cpp
#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)

#define XPU_MAX_THREADS 1024 // this is safe, in reality 256 is our limit
```
Copilot AI · May 14, 2025

[nitpick] Consider clarifying the comment on XPU_MAX_THREADS to explain why 1024 is used despite the realistic limit being 256, to avoid future confusion for maintainers.
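
A possible clarified wording (an illustrative suggestion, not text from the PR):

```cpp
// Upper bound used when sizing launch configurations. In practice the
// device caps work-group size at 256, and the launch logic shrinks the
// group further if shared local memory would be exceeded, so 1024 is a
// safe ceiling rather than an expected runtime value.
#define XPU_MAX_THREADS 1024
```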


```cpp
  grad_input = at::empty_like(input_, smf);
}
template <typename index_t, typename scalar_t>
struct AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast
```
Copilot AI · May 14, 2025

[nitpick] It would be beneficial to add inline comments describing the strategy of shared memory caching and the layout calculation in this new channels-last kernel to help future readers understand the complex index and memory computations.


@chunhuanMeng chunhuanMeng requested a review from Copilot May 14, 2025 06:35
@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance and simplify the code by removing redundant paths and introducing a new kernel optimized for channels-last memory format. Key changes include:

  • Removal of the is_channels_last template parameter to streamline the kernel functors.
  • Addition of a new channels-last kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory caching.
  • Dynamic kernel launch configuration adjustments that ensure shared memory limits are respected.
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:440

  • [nitpick] Consider adding an inline comment explaining the rationale behind dynamically reducing max_threads in the do-while loop, to aid clarity and future maintenance.

```cpp
do { ... max_threads adjustment ... } while (!done && max_threads);
```

@chunhuanMeng (Author) commented:

| dtype | op | shape | ChannelsLast | output_size | original | optimized |
| --- | --- | --- | --- | --- | --- | --- |
| torch.bfloat16 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 153.176 | 96.264 |
| torch.float16 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 151.984 | 96.392 |
| torch.float32 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 152.44 | 99.832 |
| torch.bfloat16 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 211.68 | 161.728 |
| torch.float16 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 210.32 | 160.368 |
| torch.float32 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 210.312 | 151.248 |

Lower is better; across these shapes the optimized kernel is between 1.3x and 1.6x faster.
