
Release fixes/fix nhwc support #3657

Open
wants to merge 11 commits into base: release/rocm-rel-6.4
Conversation

BrianHarrisonAMD
Contributor

Enable NHWC support for BN, and include performance fixes.
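
As a quick orientation for reviewers: MIOpen expresses NHWC through tensor strides rather than a separate layout flag. The sketch below is illustrative only (sizes hard-coded, error handling omitted; it is not code from this PR) and shows how an NHWC activation tensor would be described before calling the batch norm API.

```cpp
// Illustrative sketch, not part of this PR: describe an NHWC tensor to MIOpen
// via explicit strides. Dims stay in N,C,H,W order; the strides carry the layout.
#include <miopen/miopen.h>

miopenTensorDescriptor_t MakeNhwcDesc(int n, int c, int h, int w)
{
    miopenTensorDescriptor_t desc;
    miopenCreateTensorDescriptor(&desc);

    int dims[4]    = {n, c, h, w};
    // Packed NHWC layout: offset = n*H*W*C + h*W*C + w*C + c.
    int strides[4] = {h * w * c, 1, w * c, c};

    miopenSetTensorDescriptor(desc, miopenFloat, 4, dims, strides);
    return desc;
}
```

A descriptor like this would then be passed as xDesc/yDesc to miopenBatchNormalizationForwardTraining (or the inference call) together with miopenBNSpatial or miopenBNPerActivation.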

bghimireamd and others added 8 commits March 27, 2025 16:34
* enable nhwc in ocl

* disable nhwc for ocl kernel and only enable ocl kernel for nchw and variant = 2

* enable nhwc in bn infer, address review comments

* fix clang format

* fix hip tidy issue

* hip cleanups

* fix index for nchw and nhwc for MIOpenBatchNormFwdInferPerAct.cl

* disable ck bn solver

* bg/enable_nhwc_in_ocl: fix tensor initialization in the driver; make it similar to gtest

* 3 test cases for backward spatial single and multiple

* fix clang format

* fix type issue

* remove nchw guard for fwd train multiple

* Fix tolerance issue for bwd FP32 test

* fix mixed precision bwd variant 2

* fixed clang format

* place guard around fp32

* clang format

* Enable fwd training variant 2 for NHWC

* Enable support for variant 2 bwd (FP32 works, mixed precision fails)

* Optimize memory accesses in fwd inference for NHWC

* Fix timers in batch norm

* Optimize variant 2 of bwd for NHWC

* the kernels now support xlocalsize > 1, usually it's 64;
* the "Final" kernels are launched with one workgroup in y dim;
* variant 2 is enabled by default for NHWC and float because other variants are not optimized yet.
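
As a rough illustration of the launch-size notes above (helper names are made up; this is not the actual solver code), the host side could derive the sizes like this:

```cpp
// Hedged sketch of the launch sizing described in the commit notes:
// xlocalsize of 64 and exactly one workgroup in the y dimension for the
// "Final" reduction kernel. Names and formulas are illustrative.
#include <cstddef>

struct LaunchDims { size_t lx, ly, gx, gy; };

inline size_t RoundUp(size_t v, size_t m) { return (v + m - 1) / m * m; }

LaunchDims FinalKernelDims(size_t c, size_t ylocalsize)
{
    LaunchDims d{};
    d.lx = 64;                // xlocalsize > 1, typically 64
    d.ly = ylocalsize;
    d.gx = RoundUp(c, d.lx);  // cover the channel dimension in x (illustrative)
    d.gy = d.ly;              // a single workgroup in y
    return d;
}
```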

* Optimize variant 2 of fwd for NHWC

Same strategy as in bwd

* Fix stash indices for bwd with fp16 and NHWC

* Fix fwd variant 2 with fp16

* Use variant 2 as default for fwd with NHWC

* Fix 3D case for NHWC

* Fix fwd logic

* Add restriction checking for bwd variant 2

* BnBwdTrainingSpatialSingle checks first if BnBwdTrainingSpatialMultiple is applicable (variant 2);
* Change stash indices for NHWC and mixed precision;
* Add restrictions on c and h*w so values are stashed correctly;
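
A hedged sketch of what such an applicability check could look like (the field names, limits, and helpers are assumptions, not MIOpen's actual restrictions):

```cpp
// Illustrative only: variant 2 stashes fp32 mean/inv-variance into the output
// buffer, so C and H*W must be large enough (and suitably aligned) to hold them.
#include <cstddef>

struct BnProblem { size_t c, h, w; bool nhwc, mixed_precision; };

bool Variant2Applicable(const BnProblem& p)
{
    const size_t hw = p.h * p.w;
    if (p.mixed_precision && (hw % 2 != 0))  // fp32 stash needs a 32-bit aligned slot
        return false;
    return p.c >= 1 && hw >= 2;              // placeholder lower bounds
}

bool UseSpatialSingle(const BnProblem& p)
{
    // SpatialSingle only runs when the multiple-kernel variant 2 path does not apply.
    return !Variant2Applicable(p);
}
```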

* remove unused macro

* clang format

* Refactor fwd restriction checking

Consistency with bwd and clearer logic

* fix clang format

* remove unused variable from fwd train ocl bn

* [gTest] Integrate Per Activation (#3498)

* Added miopenBNPerActivation for bn_fwd, bn_infer, bn_fwd_train

* Update CPU Inference and Fwd Train

* Unified the interface of batchNormSpatialHostBwdTrain and batchNormPerActHostBwdTrain

* Update dx_out data type

---------

Co-authored-by: Bibek Ghimire <[email protected]>

* add comments

* address review comments partially

* Fix conditions for 2d reduction; gcn_reduce2 was never used even when possible

* Unify stash index calculation

* Reuse stash indices

fwd needs indices 0 and 1; bwd can reuse the same indices.
This is possible because the new Final kernels run with one
workgroup in the y dim, so there are no load/store conflicts between workgroups.
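
As an illustration only (the formula and names are assumptions, not the kernels' real indexing), a unified stash-index helper could look like:

```cpp
// Hedged sketch: compute the offset of the k-th stashed statistic
// (e.g. k = 0 for mean, k = 1 for inverse variance) for channel c.
// With one workgroup in the y dimension, fwd and bwd can safely reuse
// the same low indices.
#include <cstddef>

size_t StashIndex(bool nhwc, size_t c, size_t C, size_t HW, size_t k)
{
    // NHWC: consecutive stats for a channel sit k*C apart;
    // NCHW: they sit k elements apart within the channel's H*W plane.
    return nhwc ? (k * C + c) : (c * HW + k);
}
```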

* Use correct out tensor in the driver for bwd per act

* fix bn driver lambda init
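
For context, a lambda-based buffer init in the spirit of what these driver commits describe might look like this (the distribution, range, and seed are assumptions):

```cpp
// Hedged sketch: fill a host buffer with small signed random values via a
// lambda generator, mirroring the gtest-style initialization.
#include <algorithm>
#include <random>
#include <vector>

template <typename T>
void InitBuffer(std::vector<T>& buf, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);  // signed values
    std::generate(buf.begin(), buf.end(),
                  [&] { return static_cast<T>(dist(gen)); });
}
```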

* clang format

* add more coverage and a few cleanups

* add bn edge test cases

* use double as accumulator and fix minor issues
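
A minimal sketch of the double-accumulator idea for the host/reference statistics (illustrative, not the driver's actual code):

```cpp
// Hedged sketch: carry the sums in double even when the tensor elements are
// float or half, to avoid precision drift over large N*H*W reductions.
#include <vector>

template <typename T>
void MeanVariance(const std::vector<T>& x, double& mean, double& variance)
{
    double sum = 0.0, sumsq = 0.0;  // double accumulators
    for (const T& v : x)
    {
        const double d = static_cast<double>(v);
        sum   += d;
        sumsq += d * d;
    }
    const double n = static_cast<double>(x.size());
    mean     = sum / n;
    variance = sumsq / n - mean * mean;
}
```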

* fix bn driver for activation and cleanups

* fix driver's activation issue

---------

Co-authored-by: xinlipn <[email protected]>
Co-authored-by: Enrico Degregori <[email protected]>
Co-authored-by: Anton Gorenko <[email protected]>
Co-authored-by: BrianHarrisonAMD <[email protected]>
* undo code change and fix issue from cmake

* separate large tensor test in batch norm to run serially

* follow up on review comments
* Vectorization fwd variant 2 for NHWC and NCHW

* Vectorization bwd variant 2 for NHWC and NCHW

* Vectorization inference for NHWC and NCHW
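
A hedged sketch of the dispatch side of this vectorization (names and widths are illustrative): pick the widest vector that divides the contiguous dimension and fall back to the scalar kernel otherwise.

```cpp
// Illustrative only: NHWC is contiguous along C, NCHW along H*W, so the
// vector width is chosen from whichever dimension is innermost in memory.
#include <cstddef>

int ChooseVectorSize(bool nhwc, size_t c, size_t hw)
{
    const size_t contiguous = nhwc ? c : hw;
    if (contiguous % 4 == 0) return 4;
    if (contiguous % 2 == 0) return 2;
    return 1;
}
```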

* Fix warning

* Fwd and Bwd dispatch with vectorization (NHWC)

* fwd train variant 2: remove barrier(GLOBAL_MEM) (not needed)

* Remove some restrictions in variant 2 by using different stash methods

Currently all tests with NHWC use variant 2 (both fwd and bwd)

* Extract common code for spatial multiple

* Bug fix: vectorization not supported for inference activation

* Add vectorization to inference per activation

* clang format

* Fix stashing for mixed NCHW when H*W is not even

The address must be 32-bit aligned for the f32 mean/variance.
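
Illustrative only (the exact rounding rule is an assumption): when fp32 statistics are stashed into an fp16 buffer, the element offset has to be even so the byte address stays 32-bit aligned.

```cpp
// Hedged sketch: round an odd H*W up to the next even element index so a
// stashed f32 value starts on a 4-byte boundary in an fp16 buffer.
#include <cstddef>

size_t AlignedStashOffset(size_t hw)
{
    return (hw + 1) & ~size_t{1};
}
```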

* Use smaller workgroups when H*W is small

* Fix clang format

---------

Co-authored-by: Enrico Degregori <[email protected]>
* remove variant calculation in batch norm's network config

* add more info to the network config
* fix bn tuning issue with CK batchnorm solvers

* workaround for CK's 1184

* make the batch norm driver's default iteration count the same as the conv driver's

* comment out CK BN for now
* Add tuning for batch norm OpenCL kernels:

 - during tuning all applicable variants are run and the fastest is picked by the tuning system
 - spatial single and multiple are merged into a single solver
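
A hedged sketch of the tuning idea (placeholder types and timing hook; this is not MIOpen's tuning framework): enumerate the applicable variants, time each one, and keep the fastest.

```cpp
// Illustrative only: run every applicable variant and pick the fastest.
#include <limits>
#include <vector>

struct Variant { int id; };

int PickFastestVariant(const std::vector<Variant>& applicable,
                       double (*timeVariant)(const Variant&))
{
    int best      = -1;
    double bestMs = std::numeric_limits<double>::max();
    for (const auto& v : applicable)
    {
        const double ms = timeVariant(v);  // launch and time the kernel
        if (ms < bestMs) { bestMs = ms; best = v.id; }
    }
    return best;
}
```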

* General improvements from review

* Update submodule fin to point to develop branch

* clang tidy clean ups

* clang tidy stuff

* fix fin interface test

---------

Co-authored-by: Enrico Degregori <[email protected]>
* fix bn driver signed init

* fix comment
@BrianHarrisonAMD BrianHarrisonAMD requested a review from a team as a code owner April 1, 2025 13:37
Contributor

@amd-jnovotny amd-jnovotny left a comment

Changelog OK.

Collaborator

@BradPepersAMD BradPepersAMD left a comment


good to go!
