Single pass downsampling #954
base: mortons
Conversation
#define NBL_CONCEPT_PARAM_1 (val, V)
#define NBL_CONCEPT_PARAM_2 (index, I)
NBL_CONCEPT_BEGIN(3)
what did you do, this is needed!
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_arithmetic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_ballot.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_basic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/spirv_intrinsics/subgroup_quad.hlsl")
forgot to add files?
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_arithmetic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_ballot.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_basic.hlsl")
LIST_BUILTIN_RESOURCE(NBL_RESOURCES_TO_EMBED "hlsl/glsl_compat/subgroup_quad.hlsl")
forgot to add files?
struct SPD
{
you need to template your struct on a Config struct like the workgroup scans or BxDFs have, and in it you'll have:

- arithmetic texel type (what you call your binop with)
- binop type, e.g. `nbl::hlsl::plus<arithmetic_texel_t>`
- storage texel type (what you pump into the output image and scratch)
- conversion method between the arithmetic and storage texel types
- input "tile" size (how many mip levels you can reduce with a single workgroup)
- output mipmap count (absolute max is 15, because that's the max HW texture size)
- how many rounds of workgroup reduction are needed to downsample the whole image (e.g. if a workgroup can do 6 or 7 at once, you simply divide the output mipmap count by this number and round up)
- how many workgroups output to a single input in the final round
- number of DWORDs (uints) reserved for the scheduler (to do "last one out closes the door" single pass downsampling)
- subgroup size
- workgroup size

The last two you may store indirectly, because a Workgroup2 Reduction Config will be needed, so subgroup size and workgroup size come into play there. Although it might end up being that each round needs its own workgroup2 reduction config. See the sketch right below this list.
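Something along these lines, perhaps (every name here is illustrative, nothing below is existing Nabla code; it just mirrors the bullet list above and assumes the HLSL 2021 templates already used throughout nbl/builtin/hlsl):

```hlsl
template<
    typename ArithmeticTexelT,   // type the binop operates on
    typename BinOp,              // e.g. nbl::hlsl::plus<ArithmeticTexelT>
    typename StorageTexelT,      // type written to the output image and global scratch
    typename Converter,          // ArithmeticTexelT <-> StorageTexelT conversion
    uint32_t TileSizeLog2,       // 6 means a 64x64 input tile, i.e. 6 output mips per workgroup per round
    uint32_t OutputMipCount,     // absolute max 15
    uint32_t SchedulerDWORDs,    // uints reserved for the "last one out closes the door" scheduler
    uint32_t SubgroupSizeLog2,   // these two would really live in the (possibly per-round) workgroup2
    uint32_t WorkgroupSizeLog2   // reduction Configs, so they may only be stored indirectly
>
struct SPDConfig
{
    static const uint32_t MipsPerRound = TileSizeLog2;
    // rounds of workgroup reduction needed to downsample the whole image
    static const uint32_t Rounds = (OutputMipCount + MipsPerRound - 1) / MipsPerRound;
    // how many workgroups' outputs form a single input tile of the next round
    static const uint32_t WorkgroupsPerPatch = 1u << (2 * TileSizeLog2);
};
```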
Then your `__call` needs to be templated on, and take as arguments:

- Input/Output Accessor (a Loadable and Storable Mip Mapped Image, but also a Global/Device Scope memory barrier method)
- Global Scratch Accessor (has to have an atomicAdd supporting Acquire/Release semantics and scope flags - those can be template args instead of regular args - and a `set<type_of_your_texel>` method)
- Workgroup Scratch for the `workgroup2::reduce`

A rough sketch of that signature is shown right after this list.
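Under the same assumptions as the Config sketch above (accessor method names are placeholders, not Nabla API), the shape could be:

```hlsl
template<typename Config>
struct SPD
{
    template<typename IOAccessor, typename ScratchAccessor, typename WorkgroupScratchAccessor>
    static void __call(
        inout IOAccessor io,                      // loadable/storable mipmapped image + device-scope memory barrier method
        inout ScratchAccessor globalScratch,      // atomicAdd with Acquire/Release semantics and scope, plus set<storage_texel_t>
        inout WorkgroupScratchAccessor wgScratch) // shared-memory scratch handed to workgroup2::reduce
    {
        // per round: reduce the tile, store this workgroup's output texel, issue a global memory barrier,
        // bump the scheduler counter, and only the "last one out" workgroup continues to the next round
    }
};
```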
In the actual usage of the algorithm you can assume that the user will do the first mip level reduction themselves, because of cheap tricks like textureGather plus applying the binary operation manually, or tapping in between 2x2 pixels and using a bilinear or Min/Max sampler.
This first user-space mipmapping step is not taken into account by the SPD algorithm, so if you have a 2048 input, you only launch SPD with a tile input of 1024.
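For illustration, that user-space first-mip pass could look something like the following (single-channel max reduction shown; all names and bindings are made up, and a real pass would use whatever binop the SPD Config is instantiated with):

```hlsl
Texture2D<float> inputImage : register(t0);      // e.g. the 2048x2048 source
SamplerState clampSampler : register(s0);
RWTexture2D<float> firstMip : register(u0);      // the 1024x1024 level that SPD then treats as its input

[numthreads(16, 16, 1)]
void firstMipCS(uint3 id : SV_DispatchThreadID)  // dispatch 64x64 groups for a 1024x1024 output
{
    // Gather taps the 2x2 quad around the shared texel corner in one instruction,
    // then the binary operation is applied manually.
    const float2 uv = (float2(id.xy) * 2.0f + 1.0f) / 2048.0f;
    const float4 taps = inputImage.Gather(clampSampler, uv);
    firstMip[id.xy] = max(max(taps.x, taps.y), max(taps.z, taps.w));
}
```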
You need to document that the Global Scratch Accessor needs to have its first `Config::SchedulerDWORDs` cleared to 0s, because that's required for the "last one out closes the door" single dispatch - basically, all workgroups increment that counter AFTER they're done writing their output.
For example, for a 32k x 16k downsample, after the one-off user-space downsample you need to perform SPD on 16k x 8k.
This means 14 output mip-maps.
Now suppose your workgroup can do 4096 inputs at once, and reduce a 64x64 patch. That's 6 output mip levels per round.
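Putting numbers on that example (using the hypothetical constants from the Config sketch above):

```hlsl
// 16k x 8k SPD input -> log2(16384) = 14 output mips below it
static const uint32_t OutputMipCount = 14;
static const uint32_t MipsPerRound = 6;   // a 64x64 tile halves 6 times: 64 -> 1
static const uint32_t Rounds = (OutputMipCount + MipsPerRound - 1) / MipsPerRound; // ceil(14/6) = 3
static const uint32_t WorkgroupsPerPatch = 64 * 64; // 4096 workgroups feed one round-2 input tile
```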
If you use Morton codes properly for your 1D Global Virtual Invocation Index, then your first 4096 WORKGROUPS will output one texel each at mip level 6 relative to the base (which is the 16k x 8k).
To run a second round of SPD, you need a patch of 64x64 workgroups to store their values to mip level 6. Now you make the LAST WORKGROUP that stores its texel to mip level 6 perform the SPD on that 64x64 patch!
How do you do this?
With an Atomic + Barrier! Everyone stores to mip level 6, issues a global memory barrier (not an execution barrier) on the Input/Output accessor (1), and only then increments the atomic assigned to the 64x64 workgroup output patch with Device Scope ACQUIRE+RELEASE semantics.
The workgroup for which this atomicAdd(1)/atomicIncr returns 4095 (SPIR-V atomics always return the pre-modification value) is the last one, and can now begin to read the 64x64 values the other workgroups wrote.
P.S. This is why I'd make a __round(MortonCodeInMip, GlobalSchedulerOffset) method to build the __call out of.
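For what it's worth, a rough sketch of that building block under the assumptions above (accessor and helper names like globalMemoryBarrier, atomicAdd and workgroupBroadcastElected are placeholders made up for illustration, not Nabla API):

```hlsl
// hypothetical helper assumed to exist elsewhere: broadcasts the elected invocation's value
// to the whole workgroup (e.g. via the workgroup scratch)
uint32_t workgroupBroadcastElected(uint32_t value);

// returns true only for the "last one out" workgroup that should run the next round
template<typename Config, typename IOAccessor, typename ScratchAccessor>
bool __round(inout IOAccessor io, inout ScratchAccessor scheduler,
             uint64_t mortonCodeInMip, uint32_t globalSchedulerOffset, bool electedInvocation)
{
    // ... workgroup2::reduce over the tile selected by mortonCodeInMip and store
    //     this workgroup's single output texel via `io` ...

    // global MEMORY barrier (not an execution barrier) so the texel stores are device-visible
    // before anyone looks at the scheduler counter
    io.globalMemoryBarrier();

    // one invocation per workgroup bumps the counter assigned to this 64x64 output patch,
    // with Device scope and Acquire+Release semantics baked into the accessor
    uint32_t previous = 0u;
    if (electedInvocation)
        previous = scheduler.atomicAdd(globalSchedulerOffset, 1u);
    previous = workgroupBroadcastElected(previous);

    // SPIR-V atomics return the pre-modification value, so out of Config::WorkgroupsPerPatch
    // (4096 in the example) workgroups only the last one sees 4095
    return previous == Config::WorkgroupsPerPatch - 1u;
}
```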
P.P.S. I can see how "adjusting" the SPD size per round could be more efficient. In the example I gave, after the first round the relative mip level 6 (real mip level 7 of the 32k x 16k) has a 256x128 resolution, and if done with another 64x64 round it will produce 4x2, which will severely underutilize the last workgroup in round 3, and that workgroup forms the critical path. So just like with the workgroup2 scans and reductions, while it makes sense to go as aggressive as possible on the first round of a 3+ round algorithm, when you only have 2 rounds remaining it pays off to split the workload more equally, e.g. in the example given, use 16x16 on rounds 2 and 3 if you have a workgroup size of 256, or 32x32 if you have a workgroup size of 512 or more.
Description
Implementation of AMD Single Pass Downsampling (SPD) in Nabla HLSL.
Testing
Will be tested via the example tests by downsampling textures of different sizes (PoT texture, non-PoT texture, texture cube).
TODO list: