Skip to content

Clean up file hash usage #10102

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

niloc132
Copy link
Member

GWT currently uses a mix of md5 and sha1, though all use cases are non-cryptographic (and both hash functions are discouraged, and have been since before they were added to GWT). This patch proposes replacing them with Guava's murmur3 128 implementation, as it is slightly faster than md5 (and much faster than sha1), and can still be used to produce the same output format (16 upper-case hex digits).

As part of this refactor, some structured content now has size or field delimiters hashed to avoid collisions, and some streamed content is hashed as it is read to avoid holding the entire payload in memory at once. The Util method dealt with the size delimiters in some cases, but not uniformly, but also ignored exceptions - this patch also cleans up those errors to correctly throw with meaningful errors.

Fixes #10090

Before merging:

  • Ensure that collisions are never fatal - by the pigeonhole principle collisions are guaranteed to happen eventually, but GWT seems to assume that collisions are effectively impossible. Compiler caching looks safe at a glance, but storing .cache.js etc resources doesn't appear to have any defense against multiple different files having the same hash. The risk might be higher using mumur3 than md5.
  • Consider putting the hash function(s) back into a single shared class among various parts of the codebase (though not also in a single class that also manages IO, templating, classnames, xml serialization...). I don't love the slightly-different snippets sprinkled throughout the project, but those differences help some use cases to be more efficient.

@niloc132 niloc132 added this to the 2.13 milestone Feb 26, 2025
@niloc132 niloc132 requested a review from vegegoku February 26, 2025 21:19
Copy link
Member Author

@niloc132 niloc132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick self review, will resolve these and merge conflicts.

Copy link
Member Author

@niloc132 niloc132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pass at self-review complete, I'm removing the "draft". As far as I can tell, the three mentioned changes have the risk of breaking the compiler if an individual user makes a change locally that causes a hash collision. The likelihood for this is very low even for a non-cryptographically secure hash like murmur3 - if we were distributing partly compiled files between developers (like git commits, etc) I might be more conservative with this change.

That said, I'm definitely open to feedback on reverting back to md5 for at least these three changes, and possibly having a single class that picks the hash function for the rest of the compiler, so that a project/team can decide for themselves to pay a little more for compilation in exchange for longer/better hashes.

Besides those three, the full build seems to still pass with collisions in all other calls to produce hashes - that is, I replaced the result in all other places with "1234567890ABCDEF". This might point to insufficient tests - more research here would be a good idea long term.

Regardless of what hashing algorithm is used, I think there is value in this patch in making sure we don't make multiple copies of data just to hash it, but hash as we read/write.

+ permutationDescriptionString
+ optionsDescriptionString)
.getBytes()));
String consistentHash = Hashing.murmur3_128().newHasher()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collisions from this are especially bad for the compiler, and should be revisited (so that collisions don't result in broken code).

@@ -60,7 +61,7 @@ public CompileDependencyVisitor() {
}

public String getSignature() {
return Util.computeStrongName(Util.getBytes(getRawString()));
return Hashing.murmur3_128().hashString(getRawString(), StandardCharsets.UTF_8).toString();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collisions in this result in the compiler failing to generated code correctly in a variety of ways - this should be revisited to find a safer way to handle collisions.

Comment on lines +155 to +157
byte[] sourceBytes = source.getBytes(StandardCharsets.UTF_8);
strongHash = Hashing.murmur3_128().hashBytes(sourceBytes).toString().toUpperCase(Locale.ROOT);
sourceToken = diskCache.writeByteArray(sourceBytes);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collisions from this are especially bad for the correct output, and should be revisited (so that collisions don't result in broken code).

@niloc132 niloc132 marked this pull request as ready for review April 30, 2025 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consolidate/cleanup hash functions
1 participant