
Release fixes/fix nhwc support #3657

Open
wants to merge 11 commits into base: release/rocm-rel-6.4
Conversation

BrianHarrisonAMD
Contributor

Enable NHWC support for BN, and include performance fixes.
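
As a quick orientation for reviewers: MIOpen expresses NHWC through tensor strides rather than a separate layout flag. The sketch below is illustrative only (sizes hard-coded, error handling omitted; it is not code from this PR) and shows how an NHWC activation tensor would be described before calling the batch norm API.

```cpp
// Illustrative sketch, not part of this PR: describe an NHWC tensor to MIOpen
// via explicit strides. Dims stay in N,C,H,W order; the strides carry the layout.
#include <miopen/miopen.h>

miopenTensorDescriptor_t MakeNhwcDesc(int n, int c, int h, int w)
{
    miopenTensorDescriptor_t desc;
    miopenCreateTensorDescriptor(&desc);

    int dims[4]    = {n, c, h, w};
    // Packed NHWC layout: offset = n*H*W*C + h*W*C + w*C + c.
    int strides[4] = {h * w * c, 1, w * c, c};

    miopenSetTensorDescriptor(desc, miopenFloat, 4, dims, strides);
    return desc;
}
```

A descriptor like this would then be passed as xDesc/yDesc to miopenBatchNormalizationForwardTraining (or the inference call) together with miopenBNSpatial or miopenBNPerActivation.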

bghimireamd and others added 8 commits March 27, 2025 16:34
* enable nhwc in ocl

* disable nhwc for ocl kernel and only enable ocl kernel for nchw and variant = 2

* enable nhwc in bn infer, address review comments

* fix clang format

* fix hip tidy issue

* hip cleanups

* fix index for nchw and nhwc for MIOpenBatchNormFwdInferPerAct.cl

* disable ck bn solver

* bg/enable_nhwc_in_ocl: fix tensor initialization in the driver; make it similar to gtest

* 3 test cases for backward spatial single and multiple

* fix clang format

* fix type issue

* remove nchw guard for fwd train multiple

* Fix tolerance issue for bwd FP32 test

* fix mixed precision bwd variant 2

* fixed clang format

* place guard around fp32

* clang format

* Enable fwd training variant 2 for NHWC

* Enable support for variant 2 bwd (FP32 works, mixed precision fails)

* Optimize memory accesses in fwd inference for NHWC

* Fix timers in batch norm

* Optimize variant 2 of bwd for NHWC

* the kernels now support xlocalsize > 1, usually it's 64;
* the "Final" kernels are launched with one workgroup in y dim;
* variant 2 is enabled by default for NHWC and float because other variants are not optimized yet.
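
As a rough illustration of the launch-size notes above (helper names are made up; this is not the actual solver code), the host side could derive the sizes like this:

```cpp
// Hedged sketch of the launch sizing described in the commit notes:
// xlocalsize of 64 and exactly one workgroup in the y dimension for the
// "Final" reduction kernel. Names and formulas are illustrative.
#include <cstddef>

struct LaunchDims { size_t lx, ly, gx, gy; };

inline size_t RoundUp(size_t v, size_t m) { return (v + m - 1) / m * m; }

LaunchDims FinalKernelDims(size_t c, size_t ylocalsize)
{
    LaunchDims d{};
    d.lx = 64;                // xlocalsize > 1, typically 64
    d.ly = ylocalsize;
    d.gx = RoundUp(c, d.lx);  // cover the channel dimension in x (illustrative)
    d.gy = d.ly;              // a single workgroup in y
    return d;
}
```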

* Optimize variant 2 of fwd for NHWC

Same strategy as in bwd

* Fix stash indices for bwd with fp16 and NHWC

* Fix fwd variant 2 with fp16

* Use variant 2 as default for fwd with NHWC

* Fix 3D case for NHWC

* Fix fwd logic

* Add restriction checking for bwd variant 2

* BnBwdTrainingSpatialSingle checks first if BnBwdTrainingSpatialMultiple is applicable (variant 2);
* Change stash indices for NHWC and mixed precision;
* Add restrictions on c and h*w so values are stashed correctly;
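
A hedged sketch of what such an applicability check could look like (the field names, limits, and helpers are assumptions, not MIOpen's actual restrictions):

```cpp
// Illustrative only: variant 2 stashes fp32 mean/inv-variance into the output
// buffer, so C and H*W must be large enough (and suitably aligned) to hold them.
#include <cstddef>

struct BnProblem { size_t c, h, w; bool nhwc, mixed_precision; };

bool Variant2Applicable(const BnProblem& p)
{
    const size_t hw = p.h * p.w;
    if (p.mixed_precision && (hw % 2 != 0))  // fp32 stash needs a 32-bit aligned slot
        return false;
    return p.c >= 1 && hw >= 2;              // placeholder lower bounds
}

bool UseSpatialSingle(const BnProblem& p)
{
    // SpatialSingle only runs when the multiple-kernel variant 2 path does not apply.
    return !Variant2Applicable(p);
}
```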

* remove unused macro

* clang format

* Refactor fwd restriction checking

Consistency with bwd and clearer logic

* fix clang format

* remove unused variable from fwd train ocl bn

* [gTest] Integrate Per Activation (#3498)

* Added miopenBNPerActivation for bn_fwd, bn_infer, bn_fwd_train

* Update CPU Inference and Fwd Train

* Unified the interface of batchNormSpatialHostBwdTrain and batchNormPerActHostBwdTrain

* Update dx_out data type

---------

Co-authored-by: Bibek Ghimire <[email protected]>

* add comments

* address review comments partially

* Fix conditions for 2d reduction; gcn_reduce2 was never used even when possible

* Unify stash index calculation

* Reuse stash indices

fwd needs indices 0 and 1; bwd can reuse the same indices.
This is possible because the new Final kernels run with one
workgroup in the y dim, so there are no load/store conflicts between workgroups.
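
As an illustration only (the formula and names are assumptions, not the kernels' real indexing), a unified stash-index helper could look like:

```cpp
// Hedged sketch: compute the offset of the k-th stashed statistic
// (e.g. k = 0 for mean, k = 1 for inverse variance) for channel c.
// With one workgroup in the y dimension, fwd and bwd can safely reuse
// the same low indices.
#include <cstddef>

size_t StashIndex(bool nhwc, size_t c, size_t C, size_t HW, size_t k)
{
    // NHWC: consecutive stats for a channel sit k*C apart;
    // NCHW: they sit k elements apart within the channel's H*W plane.
    return nhwc ? (k * C + c) : (c * HW + k);
}
```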

* Use correct out tensor in the driver for bwd per act

* fix bn driver lambda init
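
For context, a lambda-based buffer init in the spirit of what these driver commits describe might look like this (the distribution, range, and seed are assumptions):

```cpp
// Hedged sketch: fill a host buffer with small signed random values via a
// lambda generator, mirroring the gtest-style initialization.
#include <algorithm>
#include <random>
#include <vector>

template <typename T>
void InitBuffer(std::vector<T>& buf, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);  // signed values
    std::generate(buf.begin(), buf.end(),
                  [&] { return static_cast<T>(dist(gen)); });
}
```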

* clang format

* add more coverage and a few cleanups

* add bn edge test cases

* use double as accumulator and fix minor issues
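
A minimal sketch of the double-accumulator idea for the host/reference statistics (illustrative, not the driver's actual code):

```cpp
// Hedged sketch: carry the sums in double even when the tensor elements are
// float or half, to avoid precision drift over large N*H*W reductions.
#include <vector>

template <typename T>
void MeanVariance(const std::vector<T>& x, double& mean, double& variance)
{
    double sum = 0.0, sumsq = 0.0;  // double accumulators
    for (const T& v : x)
    {
        const double d = static_cast<double>(v);
        sum   += d;
        sumsq += d * d;
    }
    const double n = static_cast<double>(x.size());
    mean     = sum / n;
    variance = sumsq / n - mean * mean;
}
```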

* fix bn driver for activation and cleanups

* fix driver's activation issue

---------

Co-authored-by: xinlipn <[email protected]>
Co-authored-by: Enrico Degregori <[email protected]>
Co-authored-by: Anton Gorenko <[email protected]>
Co-authored-by: BrianHarrisonAMD <[email protected]>
* undo code change and fix issue from cmake

* separate large tensor test in batch norm to run serially

* follow up on review comments
* Vectorization fwd variant 2 for NHWC and NCHW

* Vectorization bwd variant 2 for NHWC and NCHW

* Vectorization inference for NHWC and NCHW
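
A hedged sketch of the dispatch side of this vectorization (names and widths are illustrative): pick the widest vector that divides the contiguous dimension and fall back to the scalar kernel otherwise.

```cpp
// Illustrative only: NHWC is contiguous along C, NCHW along H*W, so the
// vector width is chosen from whichever dimension is innermost in memory.
#include <cstddef>

int ChooseVectorSize(bool nhwc, size_t c, size_t hw)
{
    const size_t contiguous = nhwc ? c : hw;
    if (contiguous % 4 == 0) return 4;
    if (contiguous % 2 == 0) return 2;
    return 1;
}
```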

* Fix warning

* Fwd and Bwd dispatch with vectorization (NHWC)

* fwd train variant 2: remove barrier(GLOBAL_MEM) (not needed)

* Remove some restrictions in variant 2 by using different stash methods

Currently all tests with NHWC use variant 2 (both fwd and bwd)

* Extract common code for spatial multiple

* Bug fix: vectorization not supported for inference activation

* Add vectorization to inference per activation

* clang format

* Fix stashing for mixed NCHW when H*W is not even

The address must be 32-bit aligned for the f32 mean/variance.
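
Illustrative only (the exact rounding rule is an assumption): when fp32 statistics are stashed into an fp16 buffer, the element offset has to be even so the byte address stays 32-bit aligned.

```cpp
// Hedged sketch: round an odd H*W up to the next even element index so a
// stashed f32 value starts on a 4-byte boundary in an fp16 buffer.
#include <cstddef>

size_t AlignedStashOffset(size_t hw)
{
    return (hw + 1) & ~size_t{1};
}
```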

* Use smaller workgroups when H*W is small

* Fix clang format

---------

Co-authored-by: Enrico Degregori <[email protected]>
* remove variant calculation in batch norm's network config

* add more info to the network config
* fix bn tuning issue with CK batchnorm solvers

* workaround for CK's 1184

* make the batch norm driver's default iteration count the same as the conv driver's

* comment out CK BN for now
* Add tuning for batch norm OpenCL kernels:

 - during tuning all applicable variants are run and the fastest is picked by the tuning system
 - spatial single and multiple are merged into a single solver
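
A hedged sketch of the tuning idea (placeholder types and timing hook; this is not MIOpen's tuning framework): enumerate the applicable variants, time each one, and keep the fastest.

```cpp
// Illustrative only: run every applicable variant and pick the fastest.
#include <limits>
#include <vector>

struct Variant { int id; };

int PickFastestVariant(const std::vector<Variant>& applicable,
                       double (*timeVariant)(const Variant&))
{
    int best      = -1;
    double bestMs = std::numeric_limits<double>::max();
    for (const auto& v : applicable)
    {
        const double ms = timeVariant(v);  // launch and time the kernel
        if (ms < bestMs) { bestMs = ms; best = v.id; }
    }
    return best;
}
```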

* General improvements from review

* Update submodule fin to point to develop branch

* clang tidy clean ups

* clang tidy stuff

* fix fin interface test

---------

Co-authored-by: Enrico Degregori <[email protected]>
* fix bn driver signed init

* fix comment
@BrianHarrisonAMD BrianHarrisonAMD requested a review from a team as a code owner April 1, 2025 13:37
Contributor

@amd-jnovotny amd-jnovotny left a comment

Changelog OK.

Collaborator

@BradPepersAMD BradPepersAMD left a comment


good to go!
