Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification #1658
Pull Request Overview
This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance, simplify code logic, and add a new optimized kernel for channels-last format.
- Removed the now redundant is_channels_last template parameter and its branches.
- Introduced a new kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory and group-based processing for enhanced performance.
- Updated kernel launch configurations and added utility macros for standardized index calculations.
#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)

#define XPU_MAX_THREADS 1024 // this is safe, in reality 256 is our limit
[nitpick] Consider clarifying the comment on XPU_MAX_THREADS to explain why 1024 is used despite the realistic limit being 256, to avoid future confusion for maintainers.
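For readers unfamiliar with the index macros quoted above, the following is a minimal sketch of the arithmetic they encode, assuming they are used in the usual adaptive-pooling way: `START_IND_INT(a, b, c)` is floor(a*c/b) and `END_IND_INT(a, b, c)` is ceil((a+1)*c/b). The function names (`start_ind_int`, `end_ind_int`, `input_window`, `covering_outputs`, `ranges_agree`) and the `Range` struct are hypothetical and do not come from the PR. The sketch checks by brute force that the range of output indices computed for an input index in the backward pass matches the forward pooling windows.

```cpp
// Same arithmetic as the quoted macros, written as functions so that
// arguments such as `a + 1` cannot be mis-parsed (the macros do not
// parenthesize their parameters).
inline int start_ind_int(int a, int b, int c) { return (a * c) / b; }
inline int end_ind_int(int a, int b, int c) { return ((a + 1) * c + b - 1) / b; }

// Half-open index range [begin, end). Hypothetical helper type.
struct Range {
  int begin;
  int end;
};

// Forward pass: input indices averaged into output index `o`.
inline Range input_window(int o, int isize, int osize) {
  return Range{start_ind_int(o, osize, isize), end_ind_int(o, osize, isize)};
}

// Backward pass: output indices whose window contains input index `i`.
inline Range covering_outputs(int i, int isize, int osize) {
  return Range{start_ind_int(i, isize, osize), end_ind_int(i, isize, osize)};
}

// Brute-force check that both views describe the same (input, output) pairs.
inline bool ranges_agree(int isize, int osize) {
  for (int i = 0; i < isize; ++i) {
    for (int o = 0; o < osize; ++o) {
      const Range w = input_window(o, isize, osize);
      const Range r = covering_outputs(i, isize, osize);
      const bool in_window = (w.begin <= i) && (i < w.end);
      const bool in_range = (r.begin <= o) && (o < r.end);
      if (in_window != in_range) {
        return false;
      }
    }
  }
  return true;
}
```

With an input size of 10 and an output size of 4, for example, output index 1 averages input indices 2 to 4, and input index 2 receives gradient from output indices 0 and 1.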
grad_input = at::empty_like(input_, smf);
}
template <typename index_t, typename scalar_t>
struct AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast
[nitpick] It would be beneficial to add inline comments describing the strategy of shared memory caching and the layout calculation in this new channels-last kernel to help future readers understand the complex index and memory computations.
Pull Request Overview
This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance and simplify the code by removing redundant paths and introducing a new kernel optimized for channels-last memory format. Key changes include:
- Removal of the is_channels_last template parameter to streamline the kernel functors.
- Addition of a new channels-last kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory caching.
- Dynamic kernel launch configuration adjustments that ensure shared memory limits are respected.
Comments suppressed due to low confidence (1)
src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:440
- [nitpick] Consider adding an inline comment explaining the rationale behind dynamically reducing max_threads in the do-while loop to aid clarity and future maintenance.
do { ... max_threads adjustment ... } while (!done && max_threads);
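The loop shape quoted above can be illustrated with a small host-side sketch. This is not the PR's code: the halving policy, the `LaunchConfig` struct, the `pick_threads` function and the way the shared-memory requirement is modeled (`fixed_bytes + bytes_per_thread * threads`) are all assumptions made for illustration. Only the `do { ... } while (!done && max_threads)` structure is taken from the review comment.

```cpp
#include <cstddef>

// Hypothetical result type: the chosen work-group size and whether any
// size fit within the shared-memory budget.
struct LaunchConfig {
  int threads;
  std::size_t slm_bytes;
  bool ok;
};

// Shrinks max_threads until the (assumed) shared-memory requirement fits.
// The cost model and the halving step are illustrative assumptions.
inline LaunchConfig pick_threads(int max_threads,
                                 std::size_t bytes_per_thread,
                                 std::size_t fixed_bytes,
                                 std::size_t slm_limit) {
  LaunchConfig cfg{0, 0, false};
  if (max_threads <= 0) {
    return cfg;
  }
  bool done = false;
  do {
    const std::size_t need =
        fixed_bytes + bytes_per_thread * static_cast<std::size_t>(max_threads);
    if (need <= slm_limit) {
      cfg = LaunchConfig{max_threads, need, true};
      done = true;
    } else {
      max_threads /= 2;  // eventually reaches 0, which ends the loop
    }
  } while (!done && max_threads);
  return cfg;
}
```

If the loop exits with `max_threads == 0`, no configuration fits, and a caller would presumably fall back to a kernel that does not use shared memory.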
Co-authored-by: Copilot <[email protected]>
Refactors and enhances the adaptive_avg_pool2d_backward_kernel implementation in the src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp file. Key changes include removing redundant template parameters, adding a new kernel functor for channels-last memory format, and optimizing memory usage and thread configurations for better performance and maintainability.
Refactoring and Simplification:
- Removed the is_channels_last template parameter from both AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor, simplifying their implementations. This eliminates conditional logic based on memory format.
New Kernel Functor:
- Added AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast, specifically designed to handle the channels-last memory format. This functor precomputes indices and pooling factors for efficient gradient computation, leveraging shared memory for intermediate storage.
Memory and Thread Optimization:
- Introduced new constants (XPU_MAX_THREADS, GROUP_STRIDE) and optimized thread group configurations to improve performance and reduce the number of groups launched.
General Improvements:
- Consistent naming of size variables (isizeH, isizeW, osizeH, osizeW) for better readability and maintainability.