cpu: aarch64: Support for Bf16 LUT Eltwise Operations#4827
cpu: aarch64: Support for Bf16 LUT Eltwise Operations#4827nikhil-arm wants to merge 7 commits intouxlfoundation:mainfrom
Conversation
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
|
Performance Improvements on 128 bit machine BenchDNN commands
|
1707aa2 to
2e02dd6
Compare
This reverts commit 67d089324e046072bc76b5244f1572e9ec1f0393.
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
jondea
left a comment
There was a problem hiding this comment.
Lovely speedups and such an elegant idea, thank you!
| VDISPATCH_ELTWISE(data_type == ::dnnl::impl::data_type::bf16, | ||
| VERBOSE_UNSUPPORTED_DT); | ||
|
|
||
| const auto *spec = get_bf16_fwd_lut_spec_(desc()->alg_kind); |
There was a problem hiding this comment.
Do we need all this spec machinery? Can we replace it all with
if (!swish){
alpha = 0;
}
beta = 0;
| const float alpha = spec->ignore_alpha_beta ? 0.f : desc()->alpha; | ||
| const float beta = spec->ignore_alpha_beta ? 0.f : desc()->beta; | ||
| bf16_lut_.resize(1u << 16); | ||
| for (uint32_t raw = 0; raw < (1u << 16); ++raw) { |
There was a problem hiding this comment.
raw is a bit vague, consider raw -> x_u16
| CPU_INSTANCE_X64(jit_uni_eltwise_int_fwd_t<avx512_core>) | ||
| CPU_INSTANCE_X64(jit_uni_eltwise_int_fwd_t<avx2>) | ||
| CPU_INSTANCE_X64(jit_uni_eltwise_int_fwd_t<sse41>) | ||
| CPU_INSTANCE_AARCH64(ref_eltwise_lut_fwd_t<bf16>) |
There was a problem hiding this comment.
I'm not sure I'd call it a ref, when we see ref in a verbose output, it usually implies it is unoptimized. I'd just call it eltwise_lut_fwd_t, I don't think we even need the template, it's not really doing anything. If we come to extend this to other data types, I think we can reconsider how.
| const auto *src = reinterpret_cast<const data_t *>(src_u8); | ||
| auto *dst = reinterpret_cast<data_t *>(dst_u8); | ||
|
|
||
| static_assert(sizeof(data_t) == sizeof(bfloat16_t), |
There was a problem hiding this comment.
I don't think this is needed, validation should be kept to the pd_t::init() if possible
|
|
||
| const auto offset_bytes = types::elements_to_bytes( | ||
| pd()->src_md()->data_type, data_d.offset0()); | ||
| src_u8 += offset_bytes; |
There was a problem hiding this comment.
I don't think we need to cast back and forward to u8. We can just use bf16 and do
src += data_d.offset0();
etc
| VDISPATCH_ELTWISE(everyone_is(data_type, src_md()->data_type, | ||
| dst_md()->data_type), | ||
| VERBOSE_UNSUPPORTED_DT); | ||
| VDISPATCH_ELTWISE(platform::has_data_type_support(data_type), |
There was a problem hiding this comment.
I don't think we need to enforce this. I think this check is for bfdot, but your code doesn't need to use any bf16 instructions. The calculation of the LUT should automatically fallback, and the LUT access doesn't use bf16.
| VDISPATCH_ELTWISE( | ||
| set_default_formats_common(), VERBOSE_UNSUPPORTED_TAG); | ||
| VDISPATCH_ELTWISE(src_d.is_dense(true), VERBOSE_NONTRIVIAL_STRIDE); | ||
| VDISPATCH_ELTWISE( |
| VDISPATCH_ELTWISE( | ||
| attr()->has_default_values(), VERBOSE_UNSUPPORTED_ATTR); | ||
|
|
||
| VDISPATCH_ELTWISE(data_type == ::dnnl::impl::data_type::bf16, |
| return status::success; | ||
| } | ||
|
|
||
| std::vector<bfloat16_t> bf16_lut_; |
There was a problem hiding this comment.
I think that the LUT should be stored in the primitive rather than primitive::pd_t, similar to how the JIT primitives do codegen.
Description
Please include a summary of the change. Please also include relevant motivation and context. See contribution guidelines for more details. If the change fixes an issue not documented in the project's Github issue tracker, please document all steps necessary to reproduce it.
Fixes # (github issue)
Checklist
General
make testandmake test_benchdnn_*) pass locally for each commit?Performance improvements
New features
Bug fixes
RFC PR