Group Norm Backward Optimization with vectorization and parallel reduction#1652

Merged
toyxu merged 17 commits intomainfrom
yucai/gn_bw
May 30, 2025
Conversation

@yucai-intel (Contributor) commented May 11, 2025

  • Add vectorization implementations of group norm backward kernels, which increases the bandwidth of data reading and thus improves performance.
  • Optimize GroupReduceSum function with parallel reduction, which improves computational efficiency.

@toyxu (Contributor) commented May 12, 2025

Please show the performance impact.

@EikanWang EikanWang requested a review from Copilot May 13, 2025 15:09
Copilot AI left a comment · May 13, 2025

Pull Request Overview

This PR adds a vectorized functor version for the Group Norm Backward kernel to improve performance on systems supporting vectorized operations. Key changes include:

  • Addition of ComputeInternalGradientsVectorizedFunctor with vectorized reduction logic.
  • Conditional kernel launch based on vectorization capability.
  • Updated work-group size computation to accommodate the vectorized implementation.

Comment on lines +961 to +962
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
Copilot AI · May 13, 2025

It appears that inside the inner loop the value of sum1_vec[v] is overwritten in each iteration rather than accumulated. Consider using '+=' to aggregate results across iterations if that was the intended behavior.

Suggested change
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);

Comment on lines +961 to +962
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
Copilot AI · May 13, 2025

Similar to the sum1_vec update, sum2_vec[v] is overwritten on each iteration of the inner loop instead of accumulating the results. If accumulation is intended, replace '=' with '+='.

Suggested change
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);

@yucai-intel (Contributor, Author)

The performance is improved by 10%-40% under different shape settings.
[image: performance data attachment]

@EikanWang (Contributor)

Please update the PR description to elaborate on why the changes improve performance, and include the detailed performance data.

@EikanWang (Contributor) left a comment

Informative PR description and comments are required.

@toyxu toyxu requested a review from EikanWang May 27, 2025 06:44
@EikanWang (Contributor) left a comment

In general, the optimization looks good to me. However, please address two common issues:

  • Please avoid using non-common abbreviations.
  • Update the PR description to elaborate on the detailed optimization ideas and the detailed performance improvements.

using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;

[[intel::reqd_sub_group_size(SIMD)]] void operator()(
Contributor

@xytintel , @fengyuan14 , @gujinghui , could you help check the behavior of [[intel::reqd_sub_group_size(SIMD)]] on the latest XE?

Comment on lines +940 to +942
using T_ACC = acc_type_device<T, kXPU>;
using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
Contributor

What's the rule for choosing UPPER case versus lower case in these using-declarations?

Comment on lines +940 to +942
using T_ACC = acc_type_device<T, kXPU>;
using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
Contributor

What are the meanings of _t and _td accordingly?

Contributor (Author)

Renamed to acc_vec_t to align with the overall code. vec_t and acc_vec_t represent vectors created with the corresponding data type.

sycl::nd_item<1> item) const {
vec_td sum1_vec = {};
vec_td sum2_vec = {};
auto g_start = item.get_group(0) * VEC_SIZE;
Contributor

What's the meaning of g_? group or global?

Contributor (Author)

It means group; renamed to group_start.


#pragma unroll
for (int v = 0; v < VEC_SIZE; ++v) {
const int64_t nc = g_start + v;
Contributor

v is a variable, so why is nc declared as a constant?

Contributor

What's the abbreviation of nc?

Contributor (Author)

nc is not an abbreviation; it means n*c in NCHW, and CUDA also uses this variable name in the same context.
Although v is a variable, it does not change within a single iteration, so nc can be const.

Contributor

Then why is nc defined within the loop?

Contributor

Regarding nc, please add comments.

Comment thread src/ATen/native/xpu/sycl/GroupNormKernels.cpp
@yucai-intel yucai-intel changed the title Add vectorized functor version for Group Norm Backward Group Norm Backward Optimization with vectorization and parallel reduction May 27, 2025
@toyxu toyxu requested a review from EikanWang May 28, 2025 01:29
Comment thread src/ATen/native/xpu/sycl/GroupNormKernels.cpp Outdated
@toyxu toyxu enabled auto-merge May 30, 2025 01:19
@toyxu toyxu dismissed EikanWang’s stale review May 30, 2025 01:29

All the requested changes have been updated.

@toyxu toyxu added this pull request to the merge queue May 30, 2025
Merged via the queue into main with commit 5907931 May 30, 2025
7 checks passed
@toyxu toyxu deleted the yucai/gn_bw branch May 30, 2025 01:29
@EikanWang (Contributor)

@xytintel, I requested changes for this PR. May I know why you landed it directly? Meanwhile, my comments have not been fully addressed.

@EikanWang (Contributor)

Add vectorization implementations of group norm backward kernels, which increases the bandwidth of data reading and thus improves performance.

Is there any data to support the conclusion "which increases the bandwidth of data reading and thus improves performance"? Please show the data.

@EikanWang (Contributor)

ditto - Optimize GroupReduceSum function with parallel reduction, which improves computational efficiency.
