wgsl: Implement AbstractFloat matrix multiplcation tests #3446
Force-pushed from 63dfbbe to 9e9521e.
PTAL, this depends on the implementation of dot, so all of the new code is in the later diff.
src/webgpu/listing_meta.json
@@ -918,6 +918,11 @@
"webgpu:shader,execution,expression,binary,af_multiplication:scalar_vector:*": { "subcaseMS": 2025.534 },
"webgpu:shader,execution,expression,binary,af_multiplication:vector:*": { "subcaseMS": 710.667 },
"webgpu:shader,execution,expression,binary,af_multiplication:vector_scalar:*": { "subcaseMS": 2085.300 },
"webgpu:shader,execution,expression,binary,af_matrix_matrix_multiplication:matrix_matrix:*": { "subcaseMS": 0 },
Those binary files are huge. I'd be curious to get a sense of how long these take to run. My spidey-sense is telling me that we have too many permutations here.
Took a bit of time to look at the case generation code. I can't see any obvious place where dense ranges are accidentally used instead of sparse ones, relative to the other floating point types.
Looking at the size of the .bins, the AbstractFloat ones are actually slightly smaller than the equivalent f32 ones, probably because, although each value is inflated from 32 to 64 bits, there are fewer total cases due to OOB being trimmed.
Time to generate the case cache doesn't appear to be significantly affected by whether or not this PR is present.
With a pre-built case cache, running locally on Lavapipe,
-j 1 'webgpu:shader,execution,expression,binary,af_matrix_scalar_multiplication,*'
takes 5m30.410694302s
and
-j 1 'webgpu:shader,execution,expression,binary,f32_matrix_scalar_multiplication,*'
takes 14.364973476s
So yeah, AF is much, much slower than f32, but I think that is because of all of the extraction logic that needs to be run in the shader, rather than the cache size.
Thank you for doing investigations here.
5½ minutes is a huge amount of time for a single test case. I strongly suspect that's going to cause timeouts for our test runner and likely others. I think we're going to have to take a look at reducing the case permutations.
Force-pushed from 9e9521e to c362232.
import { makeCaseCache } from '../case_cache.js';

const sf = new StochasticFilter(0);
const filter_percentage = 50;
Please document.
GitHub just threw my comment on the floor. Re-writing...
  'finite',
  FP.abstract.multiplicationMatrixMatrixInterval
)
.filter(c => filter_percentage <= crc32(c.input.toString()) % 100);
It might be preferable to specify an absolute target value instead of a percentage. The target value could then be used to calculate the relative threshold (0xffffffff*n/count >= crc32(c.input.toString()) or something).
This could also be put in a reusable helper function.
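For illustration, the suggestion above could be sketched roughly as follows. This is only a sketch under assumptions: `selectNCases` is a hypothetical name, the `crc32` here is a small stand-in (ASCII-only, standard reflected CRC-32) rather than the helper the PR uses, and cases are assumed to expose an `input` with a `toString()`.

```typescript
// Stand-in CRC-32 (reflected polynomial 0xedb88320); only the low byte of
// each UTF-16 code unit is hashed, so it is only meaningful for ASCII input.
function crc32(str: string): number {
  let crc = 0xffffffff;
  for (let i = 0; i < str.length; i++) {
    crc ^= str.charCodeAt(i) & 0xff;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0; // unsigned 32-bit result
}

// Hypothetical helper: keep roughly `n` of `cases`, chosen deterministically
// by comparing each case's hash against a threshold scaled by n / count.
function selectNCases<T extends { input: { toString(): string } }>(
  cases: T[],
  n: number
): T[] {
  const count = cases.length;
  if (n >= count) {
    return cases; // nothing to trim
  }
  const threshold = n * (0xffffffff / count);
  return cases.filter(c => threshold > crc32(c.input.toString()));
}
```

Because the selection depends only on the hash of each input, the same cases are kept on every run, but the number kept is only approximately `n`; it depends on how the hashes happen to be distributed.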
Done
Force-pushed from 8f702cf to c8800ee.
PTAL, this should be ready for review.
  return cases;
}

return cases.filter(c => n * (0xffff_ffff / count) > crc32(c.input.toString()));
Idea: Should we xor-in another crc32 of a string parameter (i.e. cache key name)? I'm thinking that we'll get the same N-cases selected for different tests with the same parameters. XOR-ing in something unique to the test would prevent this.
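A rough sketch of what that might look like, not necessarily what ends up landing: `selectNCasesSalted` and its `key` parameter are hypothetical names, and the `crc32` is a minimal ASCII-only stand-in for whatever hash helper the test code actually uses.

```typescript
// Stand-in CRC-32 (reflected polynomial 0xedb88320); ASCII-only.
function crc32(str: string): number {
  let crc = 0xffffffff;
  for (let i = 0; i < str.length; i++) {
    crc ^= str.charCodeAt(i) & 0xff;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Hypothetical helper: like the unsalted filter, but each case hash is
// XOR-ed with a hash of `key` (e.g. the cache key name), so two tests that
// share the same inputs do not necessarily keep the same subset.
function selectNCasesSalted<T extends { input: { toString(): string } }>(
  key: string,
  cases: T[],
  n: number
): T[] {
  const count = cases.length;
  if (n >= count) {
    return cases;
  }
  const salt = crc32(key);
  const threshold = n * (0xffffffff / count);
  // `>>> 0` converts the signed XOR result back to an unsigned 32-bit value.
  return cases.filter(c => threshold > ((crc32(c.input.toString()) ^ salt) >>> 0));
}
```

The selection stays deterministic for a given `key`, which keeps the case cache reproducible, while different keys shuffle which hashes fall under the threshold.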
That is a good point
Approving - although if you like the xor idea, it might be good to do that as part of this PR, to avoid bloating the git history.
Force-pushed from c8800ee to efa7b4f.
Issue #1626
Requirements for PR author:
.unimplemented()
/** documented */ and new helper files are found in helper_index.txt.
Requirements for reviewer sign-off:
When landing this PR, be sure to make any necessary issue status updates.