Group Norm Backward Optimization with vectorization and parallel reduction#1652

Merged
toyxu merged 17 commits intomainfrom
yucai/gn_bw
May 30, 2025
Conversation

@yucai-intel (Contributor) commented May 11, 2025

  • Add vectorization implementations of group norm backward kernels, which increases the bandwidth of data reading and thus improves performance.
  • Optimize GroupReduceSum function with parallel reduction, which improves computational efficiency.

@toyxu (Contributor) commented May 12, 2025

Please show the performance impact.

@EikanWang EikanWang requested a review from Copilot May 13, 2025 15:09
Copilot AI left a comment · May 13, 2025

Pull Request Overview

This PR adds a vectorized functor version for the Group Norm Backward kernel to improve performance on systems supporting vectorized operations. Key changes include:

  • Addition of ComputeInternalGradientsVectorizedFunctor with vectorized reduction logic.
  • Conditional kernel launch based on vectorization capability.
  • Updated work-group size computation to accommodate the vectorized implementation.

Comment on lines +961 to +962
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
Copilot AI · May 13, 2025

It appears that inside the inner loop the value of sum1_vec[v] is overwritten in each iteration rather than accumulated. Consider using '+=' to aggregate results across iterations if that was the intended behavior.

Suggested change
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);

Comment on lines +961 to +962
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
Copilot AI · May 13, 2025

Similar to the sum1_vec update, sum2_vec[v] is overwritten on each iteration of the inner loop instead of accumulating the results. If accumulation is intended, replace '=' with '+='.

Suggested change
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);

@yucai-intel (Contributor, Author)

The performance is improved by 10%-40% under different shape settings.
[image: performance data attachment]

@EikanWang (Contributor)

Please update the PR description to elaborate on why the changes improve performance, and include the detailed performance data.

@EikanWang (Contributor) left a comment

Informative PR description and comments are required.

@toyxu toyxu requested a review from EikanWang May 27, 2025 06:44
@EikanWang (Contributor) left a comment

In general, the optimization looks good to me. However, please address two common issues:

  • Please avoid using non-common abbreviations.
  • Update the PR description to elaborate on the detailed optimization ideas and the detailed performance improvements.

using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;

[[intel::reqd_sub_group_size(SIMD)]] void operator()(
Contributor

@xytintel , @fengyuan14 , @gujinghui , could you help check the behavior of [[intel::reqd_sub_group_size(SIMD)]] on the latest XE?

Comment on lines +940 to +942
using T_ACC = acc_type_device<T, kXPU>;
using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
Contributor

What's the rule for choosing UPPER case versus lower case in these using-declarations?

Comment on lines +940 to +942
using T_ACC = acc_type_device<T, kXPU>;
using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
Contributor

What are the meanings of _t and _td accordingly?

Contributor (Author)

Renamed to acc_vec_t to align with the overall code. vec_t and acc_vec_t represent vectors created with the corresponding data type.

sycl::nd_item<1> item) const {
vec_td sum1_vec = {};
vec_td sum2_vec = {};
auto g_start = item.get_group(0) * VEC_SIZE;
Contributor

What's the meaning of g_? group or global?

Contributor (Author)

It means group; renamed to group_start.


#pragma unroll
for (int v = 0; v < VEC_SIZE; ++v) {
const int64_t nc = g_start + v;
Contributor

v is a variable, so why is nc declared as a constant?

Contributor

What's the abbreviation of nc?

Contributor (Author)

nc is not an abbreviation; it means n*c in NCHW, and CUDA also uses this variable name in the same context.
Although v is a variable, it does not change within a single iteration, so nc can be const.

Contributor

Then why is nc defined within the loop?

Contributor

Regarding nc, please add comments.

Comment thread src/ATen/native/xpu/sycl/GroupNormKernels.cpp
@yucai-intel yucai-intel changed the title Add vectorized functor version for Group Norm Backward Group Norm Backward Optimization with vectorization and parallel reduction May 27, 2025
@toyxu toyxu requested a review from EikanWang May 28, 2025 01:29
Comment thread src/ATen/native/xpu/sycl/GroupNormKernels.cpp Outdated
@toyxu toyxu enabled auto-merge May 30, 2025 01:19
@toyxu toyxu dismissed EikanWang’s stale review May 30, 2025 01:29

All the requested changes have been updated.

@toyxu toyxu added this pull request to the merge queue May 30, 2025
Merged via the queue into main with commit 5907931 May 30, 2025
7 checks passed
@toyxu toyxu deleted the yucai/gn_bw branch May 30, 2025 01:29
@EikanWang (Contributor)

@xytintel, I requested changes for this PR. May I know why you landed it directly? Meanwhile, my comments have not been fully addressed.

@EikanWang (Contributor)

Add vectorization implementations of group norm backward kernels, which increases the bandwidth of data reading and thus improves performance.

Is there any data to support the conclusion "which increases the bandwidth of data reading and thus improves performance"? Please show the data.

@EikanWang (Contributor)

ditto - Optimize GroupReduceSum function with parallel reduction, which improves computational efficiency.
