Single pass downsampling #954
base: mortons
Conversation
#define NBL_CONCEPT_PARAM_1 (val, V)
#define NBL_CONCEPT_PARAM_2 (index, I)
NBL_CONCEPT_BEGIN(3)
what did you do, this is needed!
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_arithmetic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_ballot.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_basic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_quad.hlsl")
forgot to add files?
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_arithmetic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_ballot.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_basic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_quad.hlsl")
forgot to add files?
struct SPD
{
you need to template your struct on a Config struct like the workgroup scans or BxDFs have, and in it you'll have:

- arithmetic texel type (what you call your binop with)
- binop type, e.g. `nbl::hlsl::plus<arithmetic_texel_t>`
- storage texel type (what you pump into the output image and scratch)
- conversion method between the arithmetic and storage texel types
- input "tile" size (how many mip levels you can reduce with a single workgroup)
- output mipmap count (absolute max is 15, because that's the max HW texture size)
- how many rounds of workgroup reduction are needed to downsample the whole image (e.g. if a workgroup can do 6 or 7 at once, you simply divide the output mipmap count by this number and round up)
- how many workgroups output to a single input in the final round
- number of DWORDs (uints) reserved for the scheduler (to do "last one out closes the door" single pass downsampling)
- subgroup size
- workgroup size

The last two you may store indirectly, because a Workgroup2 Reduction Config will be needed, so subgroup size and workgroup size come into play there. Although it might end up being that each round needs its own workgroup2 reduction config. See the sketch right below this list.
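Something along these lines, perhaps (every name here is illustrative, nothing below is existing Nabla code; it just mirrors the bullet list above and assumes the HLSL 2021 templates already used throughout nbl/builtin/hlsl):

```hlsl
template<
    typename ArithmeticTexelT,   // type the binop operates on
    typename BinOp,              // e.g. nbl::hlsl::plus<ArithmeticTexelT>
    typename StorageTexelT,      // type written to the output image and global scratch
    typename Converter,          // ArithmeticTexelT <-> StorageTexelT conversion
    uint32_t TileSizeLog2,       // 6 means a 64x64 input tile, i.e. 6 output mips per workgroup per round
    uint32_t OutputMipCount,     // absolute max 15
    uint32_t SchedulerDWORDs,    // uints reserved for the "last one out closes the door" scheduler
    uint32_t SubgroupSizeLog2,   // these two would really live in the (possibly per-round) workgroup2
    uint32_t WorkgroupSizeLog2   // reduction Configs, so they may only be stored indirectly
>
struct SPDConfig
{
    static const uint32_t MipsPerRound = TileSizeLog2;
    // rounds of workgroup reduction needed to downsample the whole image
    static const uint32_t Rounds = (OutputMipCount + MipsPerRound - 1) / MipsPerRound;
    // how many workgroups' outputs form a single input tile of the next round
    static const uint32_t WorkgroupsPerPatch = 1u << (2 * TileSizeLog2);
};
```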
Then your `__call` needs to be templated on, and take as arguments:

- Input/Output Accessor (a Loadable and Storable Mip Mapped Image, but also a Global/Device Scope memory barrier method)
- Global Scratch Accessor (has to have an atomicAdd supporting Acquire/Release semantics and scope flags - those can be template args instead of regular args - and a `set<type_of_your_texel>` method)
- Workgroup Scratch for the `workgroup2::reduce`

A rough sketch of that signature is shown right after this list.
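Under the same assumptions as the Config sketch above (accessor method names are placeholders, not Nabla API), the shape could be:

```hlsl
template<typename Config>
struct SPD
{
    template<typename IOAccessor, typename ScratchAccessor, typename WorkgroupScratchAccessor>
    static void __call(
        inout IOAccessor io,                      // loadable/storable mipmapped image + device-scope memory barrier method
        inout ScratchAccessor globalScratch,      // atomicAdd with Acquire/Release semantics and scope, plus set<storage_texel_t>
        inout WorkgroupScratchAccessor wgScratch) // shared-memory scratch handed to workgroup2::reduce
    {
        // per round: reduce the tile, store this workgroup's output texel, issue a global memory barrier,
        // bump the scheduler counter, and only the "last one out" workgroup continues to the next round
    }
};
```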
In the actual usage of the algorithm you can assume that the user will do the first mip level reduction themselves, because of cheap tricks like textureGather plus applying the binary operation manually, or tapping in between 2x2 pixels and using a bilinear or Min/Max sampler.
This first user-space mipmapping step is not taken into account by the SPD algorithm, so if you have a 2048 input, you only launch SPD with a tile input of 1024.
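For illustration, that user-space first-mip pass could look something like the following (single-channel max reduction shown; all names and bindings are made up, and a real pass would use whatever binop the SPD Config is instantiated with):

```hlsl
Texture2D<float> inputImage : register(t0);      // e.g. the 2048x2048 source
SamplerState clampSampler : register(s0);
RWTexture2D<float> firstMip : register(u0);      // the 1024x1024 level that SPD then treats as its input

[numthreads(16, 16, 1)]
void firstMipCS(uint3 id : SV_DispatchThreadID)  // dispatch 64x64 groups for a 1024x1024 output
{
    // Gather taps the 2x2 quad around the shared texel corner in one instruction,
    // then the binary operation is applied manually.
    const float2 uv = (float2(id.xy) * 2.0f + 1.0f) / 2048.0f;
    const float4 taps = inputImage.Gather(clampSampler, uv);
    firstMip[id.xy] = max(max(taps.x, taps.y), max(taps.z, taps.w));
}
```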
You need to document that the Global Scratch Accessor needs to have its first `Config::SchedulerDWORDs` cleared to 0s, because that's required for the "last one out closes the door" single dispatch - basically, all workgroups increment that counter AFTER they're done writing their output.
For example, for a 32k x 16k downsample, after the one-off user-space downsample you need to perform SPD on 16k x 8k.
This means 14 output mip-maps.
Now suppose your workgroup can do 4096 inputs at once, and reduce a 64x64 patch. That's 6 output mip levels per round.
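Putting numbers on that example (using the hypothetical constants from the Config sketch above):

```hlsl
// 16k x 8k SPD input -> log2(16384) = 14 output mips below it
static const uint32_t OutputMipCount = 14;
static const uint32_t MipsPerRound = 6;   // a 64x64 tile halves 6 times: 64 -> 1
static const uint32_t Rounds = (OutputMipCount + MipsPerRound - 1) / MipsPerRound; // ceil(14/6) = 3
static const uint32_t WorkgroupsPerPatch = 64 * 64; // 4096 workgroups feed one round-2 input tile
```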
If you use Morton codes properly for your 1D Global Virtual Invocation Index, then your first 4096 WORKGROUPS will output one texel each at mip level 6 relative to the base (which is the 16k x 8k).
To run a second round of SPD, you need a patch of 64x64 workgroups to store their values to mip level 6. Now you make the LAST WORKGROUP that stores its texel to mip level 6 perform the SPD on that 64x64 patch!
How do you do this?
With an Atomic + Barrier! Everyone stores to mip level 6, issues a global memory barrier (not an execution barrier) on the Input/Output accessor (1), and only then increments the atomic assigned to the 64x64 workgroup output patch with Device Scope ACQUIRE+RELEASE semantics.
The workgroup for which this atomicAdd(1)/atomicIncr returns 4095 (SPIR-V atomics always return the pre-modification value) is the last one, and can now begin to read the 64x64 values the other workgroups wrote.
P.S. This is why I'd make a __round(MortonCodeInMip, GlobalSchedulerOffset) method to build the __call out of.
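For what it's worth, a rough sketch of that building block under the assumptions above (accessor and helper names like globalMemoryBarrier, atomicAdd and workgroupBroadcastElected are placeholders made up for illustration, not Nabla API):

```hlsl
// hypothetical helper assumed to exist elsewhere: broadcasts the elected invocation's value
// to the whole workgroup (e.g. via the workgroup scratch)
uint32_t workgroupBroadcastElected(uint32_t value);

// returns true only for the "last one out" workgroup that should run the next round
template<typename Config, typename IOAccessor, typename ScratchAccessor>
bool __round(inout IOAccessor io, inout ScratchAccessor scheduler,
             uint64_t mortonCodeInMip, uint32_t globalSchedulerOffset, bool electedInvocation)
{
    // ... workgroup2::reduce over the tile selected by mortonCodeInMip and store
    //     this workgroup's single output texel via `io` ...

    // global MEMORY barrier (not an execution barrier) so the texel stores are device-visible
    // before anyone looks at the scheduler counter
    io.globalMemoryBarrier();

    // one invocation per workgroup bumps the counter assigned to this 64x64 output patch,
    // with Device scope and Acquire+Release semantics baked into the accessor
    uint32_t previous = 0u;
    if (electedInvocation)
        previous = scheduler.atomicAdd(globalSchedulerOffset, 1u);
    previous = workgroupBroadcastElected(previous);

    // SPIR-V atomics return the pre-modification value, so out of Config::WorkgroupsPerPatch
    // (4096 in the example) workgroups only the last one sees 4095
    return previous == Config::WorkgroupsPerPatch - 1u;
}
```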
P.P.S. I can see how "adjusting" the SPD size per round could be more efficient. In the example I gave, after the first round the relative mip level 6 (real mip level 7 of the 32k x 16k) has a 256x128 resolution, and if done with another 64x64 round it will produce 4x2, which will severely underutilize the last workgroup in round 3, and that workgroup forms the critical path. So just like with the workgroup2 scans and reductions, while it makes sense to go as aggressive as possible on the first round of a 3+ round algorithm, when you only have 2 rounds remaining it pays off to split the workload more equally, e.g. in the example given, use 16x16 on rounds 2 and 3 if you have a workgroup size of 256, or 32x32 if you have a workgroup size of 512 or more.
Description
Implementation of AMD Single Pass Downsampling (SPD) in Nabla HLSL.
Testing
Will be tested via the example tests by downsampling textures of different sizes (PoT texture, non-PoT texture, texture cube).
TODO list: