
Conversation

@kevyuu
Contributor

@kevyuu kevyuu commented Nov 26, 2025

Description

Implementation of AMD Single Pass Downsampling (SPD) in Nabla and HLSL

Testing

Will be tested via the example tests by downsampling textures of different sizes (PoT textures, non-PoT textures, texture cubes)

TODO list:

@devshgraphicsprogramming devshgraphicsprogramming changed the base branch from master to mortons December 1, 2025 09:44
#define NBL_CONCEPT_PARAM_1 (val, V)
#define NBL_CONCEPT_PARAM_2 (index, I)
NBL_CONCEPT_BEGIN(3)


what did you do, this is needed!

LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_arithmetic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_ballot.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_basic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_quad.hlsl")


forgot to add files?

LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_arithmetic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_ballot.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_basic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_quad.hlsl")


forgot to add files?

Comment on lines +163 to +165

struct SPD
{


you need to template your struct on a Config struct like the workgroup scans or BxDFs have, but yours will have:

  • arithmetic texel type (what you call your binop with)
  • binop type, e.g. nbl::hlsl::plus<arithmetic_texel_t>
  • storage texel type (what you pump into the output image and scratch)
  • conversion method between the arithmetic and storage texel types
  • input "tile" size (how many mip levels you can reduce with a single workgroup)
  • output mipmap count (absolute max is 15, because that's the max HW texture size)
  • how many rounds of workgroup reduction are needed to downsample the whole image (e.g. if a workgroup can do 6 or 7 at once, you simply divide the output mipmap count by this number and round up)
  • how many workgroups output to a single input in the final round
  • number of DWORDs (uints) reserved for the scheduler (to do "last one out closes the door" single pass downsampling)
  • subgroup size
  • workgroup size

The last two you may store indirectly, because a Workgroup2 Reduction Config will be needed, so subgroup size and workgroup size come into play there. Although it might end up being that each round needs its own Workgroup2 Reduction Config.
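The requirements above could be sketched roughly like this; all names here (`SPDConfig`, `MipsPerRound`, `Rounds`, etc.) are hypothetical illustrations, not Nabla's actual API, and the texel/binop parameters are shown but unused in the derived constants:

```cpp
#include <cstdint>

// helper: compile-time integer log2 (assumes power-of-two input)
constexpr uint32_t log2u(uint32_t v) { uint32_t r = 0; while (v > 1) { v >>= 1; ++r; } return r; }

// Hypothetical sketch of the Config struct described above.
template<
    typename ArithmeticTexelT,   // what you call your binop with
    typename BinOp,              // e.g. nbl::hlsl::plus<ArithmeticTexelT>
    typename StorageTexelT,      // what you pump into the output image and scratch
    uint32_t TileSize,           // side length of the input patch one workgroup reduces
    uint32_t OutputMipCount,     // absolute max 15
    uint32_t SchedulerDWORDs,    // uints reserved for "last one out closes the door"
    uint32_t SubgroupSize,
    uint32_t WorkgroupSize
>
struct SPDConfig
{
    // a TileSize x TileSize input patch yields log2(TileSize) output mip levels per round
    static constexpr uint32_t MipsPerRound = log2u(TileSize);
    // divide the output mipmap count by mips-per-round and round up
    static constexpr uint32_t Rounds = (OutputMipCount + MipsPerRound - 1u) / MipsPerRound;
    // workgroups contributing to a single output patch in a full round
    static constexpr uint32_t WorkgroupsPerPatch = TileSize * TileSize;
};

// the 16k x 8k example from this comment: 14 output mips, 64x64 tiles -> 3 rounds
struct Plus { float operator()(float a, float b) const { return a + b; } };
using ExampleCfg = SPDConfig<float, Plus, float, 64, 14, 1, 32, 256>;
static_assert(ExampleCfg::MipsPerRound == 6, "64x64 tile reduces 6 mip levels");
static_assert(ExampleCfg::Rounds == 3, "ceil(14/6) = 3 rounds");
```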

Then your __call needs to be templated on, and take as arguments:

  1. Input/Output Accessor (Loadable and Storable Mip Mapped Image, but also a Global/Device Scope memory barrier method)
  2. Global Scratch Accessor (has to have atomicAdd supporting Acquire/Release semantic and scope flags - can be template args instead of regular args, and a set<type_of_your_texel> method)
  3. Workgroup Scratch for the workgroup2::reduce
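As a rough illustration of requirement (2), here is a hypothetical host-side mock of the Global Scratch Accessor shape (names are illustrative; on the GPU the atomicAdd would carry the Acquire/Release semantics and Device scope, possibly as template args rather than runtime args):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical mock, not Nabla's real accessor interface.
struct GlobalScratchAccessor
{
    std::vector<std::atomic<uint32_t>> scheduler; // the first SchedulerDWORDs, must start cleared to 0
    std::vector<float> spill;                     // inter-workgroup texel storage

    GlobalScratchAccessor(uint32_t dwords, uint32_t texels) : scheduler(dwords), spill(texels) {}

    // returns the pre-modification value, mirroring SPIR-V atomics
    uint32_t atomicAdd(uint32_t dword, uint32_t v)
    {
        return scheduler[dword].fetch_add(v, std::memory_order_acq_rel);
    }

    // the set<type_of_your_texel> method mentioned above
    template<typename TexelT>
    void set(uint32_t index, TexelT value) { spill[index] = value; }
};
```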

In the actual usage of the algorithm you can assume that the user will do the first mip level reduction themselves, because of cheap tricks like textureGather plus applying the binary operation themselves, or tapping in-between 2x2 pixels and using a bilinear or Min/Max sampler.

This first user-space mipmapping step is not taken into account by the SPD algorithm, so if you do 2048 input, then you only launch SPD with a tile input of 1024.

You need to document that the Global Scratch Accessor needs to have the first Config::SchedulerDWORDs cleared to 0s, because that's needed for "last one out closes the door" single dispatch - basically all workgroups increment that counter AFTER they're done writing the output

For example for a 32k x 16k downsample, after the one-off userspace downsample you need to perform SPD on 16k x 8k

This means 14 output mip-maps.
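The mip-count arithmetic can be checked with a tiny helper (name hypothetical):

```cpp
#include <algorithm>
#include <cstdint>

// Number of output mips SPD must produce: one per halving of the larger
// dimension until reaching 1x1. Hypothetical helper for the example above.
uint32_t spdOutputMipCount(uint32_t width, uint32_t height)
{
    uint32_t mips = 0;
    for (uint32_t d = std::max(width, height); d > 1; d >>= 1)
        ++mips;
    return mips;
}
```

For the 16k x 8k input from the example, `spdOutputMipCount(16384, 8192)` gives 14, matching the statement above.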

Now suppose your workgroup can do 4096 inputs at once, and reduce a 64x64 patch. That's 6 output mip levels per round.

If you use Morton codes properly for your 1D Global Virtual Invocation Index, then your first 4096 WORKGROUPS will output one texel each at mip level 6 relative to the base (which is the 16k x 8k).

To run a second round of SPD, you need a patch of 64x64 workgroups to store their values to the mip level 6. Now you make the LAST WORKGROUP which stores its texel to mip level 6, perform the SPD on that 64x64 patch!
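The Morton-code property relied on here is that consecutive 1D indices cover square patches; a standard 2D bit-interleave (a generic sketch, not Nabla's morton utilities) shows why the first 4096 workgroups tile exactly one 64x64 patch:

```cpp
#include <cstdint>

// spread the low 16 bits of x so they occupy the even bit positions
uint32_t part1By1(uint32_t x)
{
    x &= 0x0000ffffu;
    x = (x | (x << 8)) & 0x00ff00ffu;
    x = (x | (x << 4)) & 0x0f0f0f0fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

// classic 2D Morton encode: x bits on even positions, y bits on odd positions
uint32_t mortonEncode2D(uint32_t x, uint32_t y)
{
    return part1By1(x) | (part1By1(y) << 1);
}
```

Every (x, y) in [0, 64) x [0, 64) maps to a code in [0, 4096), so 1D workgroup indices 0..4095 are exactly one 64x64 patch.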

How do you do this?

With an Atomic + Barrier! Everyone stores to mip level 6, issues a global memory barrier (not execution barrier) on the Input/Output accessor (1), and only then increments the atomic assigned to the 64x64 workgroup output patch with Device Scope ACQUIRE+RELEASE semantics.

The workgroup for which this atomicAdd(1)/atomicIncr returns 4095 (SPIR-V atomic always returns pre-modification value) is the last one, and can now begin to read the 64x64 values other workgroups wrote.
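The election logic can be simulated host-side (hypothetical helper; on the GPU this would be the atomicAdd on the global scratch counter with Device scope and Acquire+Release semantics, after the memory barrier on accessor (1)):

```cpp
#include <atomic>
#include <cstdint>

// True only for the single workgroup whose increment observes all others done.
// fetch_add returns the pre-modification value, like SPIR-V OpAtomicIAdd.
bool isLastWorkgroupInPatch(std::atomic<uint32_t>& counter, uint32_t workgroupsPerPatch)
{
    return counter.fetch_add(1, std::memory_order_acq_rel) == workgroupsPerPatch - 1;
}
```

Running it 4096 times against one counter yields exactly one `true`, on the final increment - that workgroup then performs the next SPD round on the 64x64 patch.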

P.S. This is why I'd make a __round(MortonCodeInMip, GlobalSchedulerOffset) method to build the __call out of.

P.P.S. I can see how "adjusting" the SPD size per round could be more efficient, because in the example I gave, after the first round the relative mip level 6 (real mip level 7 of the 32k x 16k) has 256x128 resolution, and if done with a 64x64 round it will produce 4x2, which will severely underutilize the last workgroup in round 3 which forms the critical path. So just like the workgroup2 scans and reductions, while it makes sense to go as aggressive as possible on the first round of a 3+ round algorithm, when you only have 2 rounds remaining it pays off to split the workload more equally, e.g. in the example given, use 16x16 on rounds 2 and 3 if you have a workgroup size of 256, or 32x32 if you have a workgroup size of 512 or more.
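That load-balancing heuristic could be sketched like so (a hypothetical planner, one possible reading of the P.P.S. above: stay maximally aggressive while 3+ rounds remain, then split the leftover evenly across the final two rounds):

```cpp
#include <cstdint>
#include <vector>

// Returns the number of mip levels each round should reduce.
std::vector<uint32_t> planRounds(uint32_t totalMips, uint32_t maxMipsPerRound)
{
    std::vector<uint32_t> plan;
    uint32_t remaining = totalMips;
    // while more than 2 rounds of work remain, go as aggressive as possible
    while (remaining > 2 * maxMipsPerRound)
    {
        plan.push_back(maxMipsPerRound);
        remaining -= maxMipsPerRound;
    }
    if (remaining > maxMipsPerRound)
    {
        // exactly 2 rounds left: split roughly evenly instead of max + remainder
        plan.push_back((remaining + 1) / 2);
        plan.push_back(remaining / 2);
    }
    else if (remaining > 0)
        plan.push_back(remaining);
    return plan;
}
```

For 14 total mips and 6 mips per round this yields {6, 4, 4} - i.e. an aggressive 64x64 first round, then 16x16 (4 mip levels) on rounds 2 and 3, matching the suggestion above for a workgroup size of 256.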

