Add case cache to repo #3162

Merged: 1 commit into gpuweb:main from ben-clayton:commit-cache on Nov 16, 2023
Conversation

@ben-clayton (Contributor) commented Nov 13, 2023

Add case cache to repo

  • Add the cache files to the CTS repo under src/resources/cache.
  • Compress & decompress the cache with gzip using the Compression Streams API (see the sketch after this list). All the cache files sum to just over 4 MB.
  • Update gen_cache --validate to check that all the cache files are present and up to date, and to error if there are old cache files that are no longer needed.
  • Update pr.yml to run gen_cache --validate before building with all (which regenerates the cache).
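
For context, a gzip round-trip with the Compression Streams API is only a few lines. A minimal sketch (the helper names are illustrative, not the CTS's actual code):

```ts
// Minimal sketch of a gzip round-trip via the Compression Streams API,
// which is available as a global in Node >= 18 and in browsers.
// gzipCompress/gzipDecompress are illustrative names.
async function gzipCompress(data: Uint8Array): Promise<Uint8Array> {
  const stream = new Blob([data]).stream().pipeThrough(new CompressionStream('gzip'));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}

async function gzipDecompress(compressed: Uint8Array): Promise<Uint8Array> {
  const stream = new Blob([compressed]).stream().pipeThrough(new DecompressionStream('gzip'));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}
```

Presumably the appeal of the standard API is that the same code paths work in node and in the browser without adding a compression dependency.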

Requirements for PR author:

  • All missing test coverage is tracked with "TODO" or .unimplemented().
  • New helpers are /** documented */ and new helper files are found in helper_index.txt.
  • Test behaves as expected in a WebGPU implementation. (If not passing, explain above.)

Requirements for reviewer sign-off:

  • Tests are properly located in the test tree.
  • Test descriptions allow a reader to "read only the test plans and evaluate coverage completeness", and accurately reflect the test code.
  • Tests provide complete coverage (including validation control cases). Missing coverage MUST be covered by TODOs.
  • Helpers and types promote readability and maintainability.

When landing this PR, be sure to make any necessary issue status updates.

@ben-clayton (Contributor, Author) commented:

Looks like I need to update the version of node that the presubmits use.

@ben-clayton (Contributor, Author) commented Nov 13, 2023

This is fun. The cache seems to differ based on the version of node used. Investigating.

Update: Looks like gzip compresses differently between node versions. I'll change the validation to test post-decompression tomorrow. Even so, there's one 4-byte diff between node 18 and 20 where a float value (looks non-finite, possibly f16) is encoded differently.
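
(For illustration only: a post-decompression comparison could look something like the sketch below. gzipDecompress stands in for an assumed helper wrapping DecompressionStream('gzip'); this is not gen_cache's actual logic.)

```ts
// Sketch: compare cache contents after decompression, so that differences in
// the gzip encoder between node versions don't register as cache mismatches.
// gzipDecompress is an assumed helper, e.g. the one sketched earlier.
async function cacheFileMatches(onDisk: Uint8Array, regenerated: Uint8Array): Promise<boolean> {
  const a = await gzipDecompress(onDisk);
  const b = await gzipDecompress(regenerated);
  return a.length === b.length && a.every((byte, i) => byte === b[i]);
}
```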

@kainino0x (Collaborator) left a comment:

> This is fun. The cache seems to differ based on the version of node used. Investigating.
>
> Update: Looks like gzip compresses differently between node versions. I'll change the validation to test post-decompression tomorrow. Even so, there's one 4-byte diff between node 18 and 20 where a float value (looks non-finite, possibly f16) is encoded differently.

I think it would be much preferable if we can make the compression results stable. If they're not stable we'll end up with different developers' machines fighting over the compressed files.

Another idea - not sure if this would work - would be to store in src/ in a non-compressed format, compress into gen/, and copy into out*/.
Or even store in a text format (to make git happier) and serialize+compress into gen/.
Never mind if this defeats the whole point of checking the cache into the repository. (Though come to think of it, I don't actually understand why we can't generate the cache at compile time.)

(Inline review threads on Gruntfile.js (3), src/common/runtime/server.ts, and package.json, all resolved.)
@ben-clayton (Contributor, Author) commented:

> This is fun. The cache seems to differ based on the version of node used. Investigating.
> Update: Looks like gzip compresses differently between node versions. I'll change the validation to test post-decompression tomorrow. Even so, there's one 4-byte diff between node 18 and 20 where a float value (looks non-finite, possibly f16) is encoded differently.

> I think it would be much preferable if we can make the compression results stable. If they're not stable we'll end up with different developers' machines fighting over the compressed files.

Understood. I'm not entirely sure it can be, though. What if we pinned the generation to a node version, erroring if you try to generate with the wrong version of node? It would only affect the developers who regenerate the cases, which so far has been ~3 people.
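
(Such a pin would only take a few lines at the top of gen_cache; a hypothetical sketch, with the pinned major version chosen arbitrarily:)

```ts
// Hypothetical node-version pin for gen_cache. The pinned major version and
// the error text are invented for illustration; this is not what actually landed.
const REQUIRED_NODE_MAJOR = 20;
const actualMajor = Number(process.versions.node.split('.')[0]);
if (actualMajor !== REQUIRED_NODE_MAJOR) {
  console.error(
    `gen_cache must run under node ${REQUIRED_NODE_MAJOR}.x to produce stable cache output ` +
      `(found ${process.versions.node})`
  );
  process.exit(1);
}
```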

> Another idea - not sure if this would work - would be to store in src/ in a non-compressed format, compress into gen/, and copy into out*/. Or even store in a text format (to make git happier) and serialize+compress into gen/. Never mind if this defeats the whole point of checking the cache into the repository. (Though come to think of it, I don't actually understand why we can't generate the cache at compile time.)

We did generate the cache at compile time. That was the first version. It was deemed too expensive to do each time we run the CTS.

@kainino0x (Collaborator) commented:

> We did generate the cache at compile time. That was the first version. It was deemed too expensive to do each time we run the CTS.

OK. I'd like to understand this a bit more. We already build all of Chromium in order to run the CTS in Dawn - does the cache take so long to build that we don't want to add it onto that build stage?

@ben-clayton (Contributor, Author) commented:

> We did generate the cache at compile time. That was the first version. It was deemed too expensive to do each time we run the CTS.

> OK. I'd like to understand this a bit more. We already build all of Chromium in order to run the CTS in Dawn - does the cache take so long to build that we don't want to add it onto that build stage?

It did. It previously took around 6 minutes, then spiked to 50, which caused major upset and forced me to drop everything to get the cache build removed from the CTS runs. Since then we've all been optimizing the CTS, and this has been brought down to less than 2 minutes on a moderately powerful single-core machine.

Clearly we've brought this down to something almost negligible in the grand scheme of things, however:

  • This is only going to climb again as we add more tests.
  • This was being run for each CTS bot configuration, and the data is purely static: a bunch of immutable constant-expression values that can be computed once, ahead of time.
  • CTS runs would repeat the generation N times to build this, as would each and every presubmit and CI run.
  • (I could be wrong, but) I believe the bots that were doing the cache calculation were the GPU bots, which are limited in number. Infra were not happy that we were spending their cycles doing non-GPU work.

Obviously it would be nicer if we could just build this when we build the CTS, but we've yet to find a place to do that without repeating the past infra impact or raising future scalability concerns.

@ben-clayton (Contributor, Author) commented:

I've added a follow-up change that adds a cache/hashes.json file, which holds the CRC32 hashes of all the source files used to build each cache binary file. This massively speeds up validation, and means we only have to rebuild the cases that have been modified.
It also fixes the validation issue between node versions, as the output binary is no longer regenerated as part of validation.
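
(Roughly, the staleness check amounts to something like the sketch below. The shape of hashes.json and the helper names are assumptions, not gen_cache's actual format, and the CRC32 is hand-rolled to keep the sketch self-contained:)

```ts
import * as fs from 'fs';

// Bitwise CRC32 (IEEE polynomial), written out so the sketch is self-contained.
function crc32(data: Uint8Array): number {
  let crc = 0xffffffff;
  for (const byte of data) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// A cache file only needs rebuilding if the recorded hash of any of its
// source files no longer matches. hashes.json is assumed to map source
// paths to their CRC32 values.
function isStale(sourceFiles: string[], hashes: Record<string, number>): boolean {
  return sourceFiles.some(path => hashes[path] !== crc32(fs.readFileSync(path)));
}
```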

@ben-clayton force-pushed the commit-cache branch 2 times, most recently from c8d7774 to 876c02d on November 14, 2023 at 13:48
@ben-clayton (Contributor, Author) commented:

PTAL

@kainino0x (Collaborator) left a comment:

Sorry for the delay, LGTM

(Inline review threads on src/resources/cache/hashes.json, src/unittests/parse_imports.spec.ts, and src/common/tools/gen_cache.ts, all resolved.)
@kainino0x (Collaborator) commented:

> It did. It previously took around 6 minutes, then spiked to 50, which caused major upset and forced me to drop everything to get the cache build removed from the CTS runs. Since then we've all been optimizing the CTS, and this has been brought down to less than 2 minutes on a moderately powerful single-core machine.

This is a major improvement that sounds like it came from one particular change. If I knew what that change was, I could better judge how likely this is to become a problem again in the future.

> Clearly we've brought this down to something almost negligible in the grand scheme of things, however:
>
>   • This is only going to climb again as we add more tests.

It's not clear to me how much this will climb. We could triple the number of tests, and if the cache then takes 6 minutes to build, I think we're still in an OK realm.

>   • This was being run for each CTS bot configuration, and the data is purely static: a bunch of immutable constant-expression values that can be computed once, ahead of time.

This doesn't concern me that much if it's a few minutes of compiler bot time. It should be just once per platform, not once per swarming config.

I worry slightly about the maintainability of the solution you've built, though I'm pretty sure it will be fine. But if it does turn out to cause difficulties then we can reevaluate.

>   • CTS runs would repeat the generation N times to build this, as would each and every presubmit and CI run.

True on GitHub. On Dawn it's probably cached?

>   • (I could be wrong, but) I believe the bots that were doing the cache calculation were the GPU bots, which are limited in number. Infra were not happy that we were spending their cycles doing non-GPU work.

I hope not. Theoretically the C++ compile should definitely not be happening on GPU bots, so it would be OK if it's part of that build.

> Obviously it would be nicer if we could just build this when we build the CTS, but we've yet to find a place to do that without repeating the past infra impact or raising future scalability concerns.

@kainino0x (Collaborator) commented:

> I worry slightly about the maintainability of the solution you've built

I'll add that I was pleasantly surprised by how little code this nice hashing solution took, so if it turns out to be robust then that's great.

@kainino0x (Collaborator) commented:

I just remembered one more thought I had while reviewing (not blocking this, but I'll let you decide when to land it).

What if we checked in the uncompressed cache, and did only the compression in the build step? This would be fast and free us from variations in gzip results.

Depending on how these files change over time, Git might also do a better job storing incremental changes to the binary files if they're uncompressed. https://stackoverflow.com/a/20690323

@ben-clayton (Contributor, Author) commented:

> What if we checked in the uncompressed cache, and did only the compression in the build step? This would be fast and free us from variations in gzip results.

We could investigate dropping the compression.

The main purpose of adding the compression was to reduce the bloat of the repo. The binary files are substantially larger when uncompressed, but yes, git will probably end up compressing the files with a very similar compression method anyway. However, the files in the working directory will remain uncompressed, and objects outside the packfiles may possibly be stored uncompressed too.

Let's see how this goes. I'm happy to tweak things if they become problematic (4th iteration so far!).

Commit message (ec19459):

* Add the cache files to the CTS repo under `src/resources/cache`.
* Compress & decompress the cache with gzip using the Compression
  Streams API. All the cache files sum to just over 4 MB.
* Add a `src/resources/cache/hashes.json` to track the hash of the
  source files used to generate each of the cache files. Only rebuild
  the cases that are stale.
* Update `gen_cache --validate` to check that all the cache files are
  present and up to date. Also error if there are old cache files that
  are no longer needed. This is run by `npm test`, which is also called
  by the GitHub presubmits.
* Add a `wpt lint` exemption for `*.bin` files.

Optimization: Add a cache hash file

This file contains a hash of each source file used to build the cache.
Only cache files whose source files have changed need to be rebuilt, and
the validation step no longer has to perform the expensive build step to
know whether files are stale.
@ben-clayton merged commit ec19459 into gpuweb:main on Nov 16, 2023. 1 check passed.
@ben-clayton deleted the commit-cache branch on November 16, 2023 at 10:01.