
Add small compute examples illustrating new WGSL primitives for AI #350

Open
kenrussell opened this issue Jan 24, 2024 · 10 comments
Labels: sample request (Request for a new sample), sample wanted (We definitely want to add this sample; contributions welcome)

Comments

@kenrussell

@beaufortfrancois requested that some small examples be published here showing how to use the new WGSL primitives aimed at AI/ML workloads: shader-f16, DP4A, and soon, subgroups. Could we consider this?

Not sure what would be most compelling. Perhaps something with some visual output, plus a microbenchmark against the fallback WGSL code, assuming the feature is actually supported?
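
For context, a minimal sketch (not taken from any existing sample) of how a sample might request shader-f16 and fall back when it is unavailable; wgslWithF16 and wgslWithF32 are hypothetical shader strings:

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU is not supported');

// Only request optional features the adapter actually exposes.
const requiredFeatures: GPUFeatureName[] = [];
if (adapter.features.has('shader-f16')) {
  requiredFeatures.push('shader-f16');
}
const device = await adapter.requestDevice({ requiredFeatures });

// Pick the f16 shader or the f32 fallback, and time both for the microbenchmark.
const useF16 = device.features.has('shader-f16');
const code = useF16 ? wgslWithF16 : wgslWithF32; // hypothetical shader strings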

@kenrussell
Author

CC @dneto0

@dneto0

dneto0 commented Feb 29, 2024

For dp4a:

  • They're good for ML image processing: segmentation, object identification. But putting a full ML model into the samples is involved.
  • Maybe we can do something really simple, like a Sobel filter on a grayscale image (a rough sketch follows below).
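
A rough sketch of what that grayscale Sobel pass could look like (hedged; the binding layout, texture formats, and the sobelWGSL name are assumptions, and edge handling is ignored):

const sobelWGSL = /* wgsl */ `
  @group(0) @binding(0) var inputTex : texture_2d<f32>;
  @group(0) @binding(1) var outputTex : texture_storage_2d<r32float, write>;

  fn luma(p : vec2<i32>) -> f32 {
    return textureLoad(inputTex, p, 0).r;
  }

  @compute @workgroup_size(8, 8)
  fn main(@builtin(global_invocation_id) id : vec3<u32>) {
    let p = vec2<i32>(id.xy);
    // Horizontal and vertical 3x3 Sobel kernels applied to the single gray channel.
    let gx = luma(p + vec2(1, -1)) - luma(p + vec2(-1, -1))
           + 2.0 * (luma(p + vec2(1, 0)) - luma(p + vec2(-1, 0)))
           + luma(p + vec2(1, 1)) - luma(p + vec2(-1, 1));
    let gy = luma(p + vec2(-1, 1)) - luma(p + vec2(-1, -1))
           + 2.0 * (luma(p + vec2(0, 1)) - luma(p + vec2(0, -1)))
           + luma(p + vec2(1, 1)) - luma(p + vec2(1, -1));
    textureStore(outputTex, p, vec4f(sqrt(gx * gx + gy * gy), 0.0, 0.0, 1.0));
  }`;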

@kainino0x
Collaborator

dp4a accelerates matrix multiplication, right? Even a basic matrix multiplication could be enough for a sample; it could even just display some text. But Sobel or any other simple convolution would make it more compelling.
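
For reference, a minimal, hedged sketch of such an int8 matmul kernel, assuming A and B are pre-packed four u8 values per u32 (with B stored transposed so both operands are read along K):

const matmulWGSL = /* wgsl */ `
  // dims = (M, N, K/4); a is M x K/4, b is N x K/4 (B transposed), c is M x N.
  @group(0) @binding(0) var<storage, read> a : array<u32>;
  @group(0) @binding(1) var<storage, read> b : array<u32>;
  @group(0) @binding(2) var<storage, read_write> c : array<u32>;
  @group(0) @binding(3) var<uniform> dims : vec3<u32>;

  @compute @workgroup_size(8, 8)
  fn main(@builtin(global_invocation_id) id : vec3<u32>) {
    let row = id.y;
    let col = id.x;
    if (row >= dims.x || col >= dims.y) { return; }
    var acc = 0u;
    // Each dot4U8Packed consumes four packed u8 values from each operand at once.
    for (var k = 0u; k < dims.z; k++) {
      acc += dot4U8Packed(a[row * dims.z + k], b[col * dims.z + k]);
    }
    c[row * dims.y + col] = acc;
  }`;

The fallback path for comparison could unpack with unpack4xU8, or do the same math on plain u32s.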

@kainino0x added the "sample request" and "sample wanted" labels on Mar 5, 2024
@kainino0x
Collaborator

Also it should have a toggle to enable/disable dp4a and hopefully see some performance improvement.
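
Something like this, perhaps (a hedged sketch; dp4aModule and fallbackModule are hypothetical shader modules built from the two WGSL variants):

const dp4aPipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: dp4aModule, entryPoint: 'main' },
});
const fallbackPipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: fallbackModule, entryPoint: 'main' },
});
// With layout 'auto' each pipeline gets its own bind group layout, so a real
// sample would either share an explicit pipeline layout or keep two bind groups.

const settings = { useDp4a: true }; // wired to a GUI checkbox
// ...inside the per-frame encode:
pass.setPipeline(settings.useDp4a ? dp4aPipeline : fallbackPipeline);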

@cmhhelgeson
Contributor

Is dp4a available in the current version of WebGPU?

@austinEng
Collaborator

dp4a is available starting in Chromium M123. So today, that would be Chrome Beta and newer.
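
A hedged sketch of detecting it at runtime: the packed dot-product built-ins surface as a WGSL language feature rather than a device feature, so the check is on navigator.gpu (dp4aSobelWGSL and fallbackSobelWGSL are hypothetical shader strings):

const hasDp4a =
  navigator.gpu.wgslLanguageFeatures?.has('packed_4x8_integer_dot_product') ?? false;

// Choose the packed-integer shader or the plain-arithmetic fallback accordingly.
const code = hasDp4a ? dp4aSobelWGSL : fallbackSobelWGSL;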

@cmhhelgeson
Contributor

cmhhelgeson commented Mar 6, 2024

If possible, I'd like to try my hand at this issue, at least for the next week (sorry about the timeline, day job is gonna day job). A Sobel filter is a good place to start, I think.

EDIT: Should this go in the 'GPGPU Compute' category or the 'Features' category?

@cmhhelgeson
Contributor

Just want to make sure I understand the assignment and the intended use of dp4a here. Instead of writing, say, something like this for our Sobel filter (pseudo-WGSL below)...

// Pixels loaded from the texture using textureLoad and global_invocation_id
let result = 1 * pixel1.r
           + 2 * pixel2.r
           + 1 * pixel3.r
           - 1 * textureLoad(inputTexture, vec2<u32>(id.x + 1, id.y - 1), 0).r;

textureStore(output, id.xy, result);

Should we instead do something like this?

let pixelPack = pack4xU8Clamp(vec4u(pixel1.r, pixel2.r, pixel3.r, pixel4.r));
let kernelPack = pack4xU8Clamp(vec4u(1, 2, 1, -1)); // note: -1 won't survive an unsigned pack; signed weights need pack4xI8 / dot4I8Packed
let result = dot4U8Packed(pixelPack, kernelPack);
textureStore(output, id.xy, result);

@dneto0

dneto0 commented Mar 6, 2024

> Also it should have a toggle to enable/disable dp4a and hopefully see some performance improvement.

I suspect a Sobel filter is simple enough that it's limited by memory bandwidth instead of computation.
So I wouldn't get hung up on perf improvement for this sample.

@cmhhelgeson
Contributor

cmhhelgeson commented Mar 12, 2024

Somebody else should take this on. I understand the functionality, but I'm struggling with quantizing the dp4a result back into something usable.
