Group Norm Backward Optimization with vectorization and parallel reduction#1652
Conversation
Please show performance impact.
Pull Request Overview
This PR adds a vectorized functor version for the Group Norm Backward kernel to improve performance on systems supporting vectorized operations. Key changes include:
- Addition of ComputeInternalGradientsVectorizedFunctor with vectorized reduction logic.
- Conditional kernel launch based on vectorization capability.
- Updated work-group size computation to accommodate the vectorized implementation.
    sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
    sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);

It appears that inside the inner loop the value of sum1_vec[v] is overwritten on each iteration rather than accumulated. Consider using '+=' to aggregate results across iterations if that was the intended behavior.

Suggested change:

    sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
    sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);
    sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
    sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);

Similar to the sum1_vec update, sum2_vec[v] is overwritten on each iteration of the inner loop instead of accumulating the results. If accumulation is intended, replace '=' with '+='.

Suggested change:

    sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
    sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);
Pls. update the PR description to elaborate on why these changes improve performance, and include the detailed performance data.
EikanWang
left a comment
Informative PR description and comments are required.
EikanWang
left a comment
In general, the optimization looks good to me. However, pls. address two common issues.
- Pls. avoid using non-common abbreviations
- Update the PR description by elaborating on the detailed optimization ideas and detailed performance improvements
    using vec_t = memory::aligned_vector<T, VEC_SIZE>;
    using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;

    [[intel::reqd_sub_group_size(SIMD)]] void operator()(
@xytintel , @fengyuan14 , @gujinghui , could you help check the behavior of [[intel::reqd_sub_group_size(SIMD)]] on the latest XE?
    using T_ACC = acc_type_device<T, kXPU>;
    using vec_t = memory::aligned_vector<T, VEC_SIZE>;
    using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
What's the rule for using UPPER and lower case in these `using` type aliases?
    using T_ACC = acc_type_device<T, kXPU>;
    using vec_t = memory::aligned_vector<T, VEC_SIZE>;
    using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
What do the suffixes _t and _td mean, respectively?
Use acc_vec_t instead to align with the overall code. vec_t and acc_vec_t represent vectors created with the corresponding data type.
    sycl::nd_item<1> item) const {
      vec_td sum1_vec = {};
      vec_td sum2_vec = {};
      auto g_start = item.get_group(0) * VEC_SIZE;
What's the meaning of g_? group or global?
It means group, use group_start instead.
    #pragma unroll
    for (int v = 0; v < VEC_SIZE; ++v) {
      const int64_t nc = g_start + v;
v is a variable, why is nc a constant variable?
What's the abbreviation of nc?
nc is not an abbreviation; it means n*c in NCHW, and CUDA also uses this variable name in the same context.
Although v is a variable, it remains unchanged within a single loop iteration, so nc is constant there.
Then why is nc defined within the loop?!
In terms of nc, pls. add comments.
All the requested changes have been updated.
@xytintel , I requested changes for this PR. May I know why you landed it directly? Meanwhile, my comments are not fully addressed.
Any data to support the conclusion - "which increases the bandwidth of data reading and thus improves performance."? Show me the data.
ditto