Clean up file hash usage #10102
base: main
Conversation
Quick self-review; will resolve these and the merge conflicts.
Another pass at self-review complete; I'm removing the "draft". As far as I can tell, the three changes mentioned risk breaking the compiler if an individual user makes a local change that causes a hash collision. The likelihood of this is very low even for a non-cryptographically secure hash like murmur3 - if we were distributing partly compiled files between developers (like git commits, etc.) I might be more conservative with this change.
That said, I'm definitely open to feedback on reverting to md5 for at least these three changes, and possibly on having a single class that picks the hash function for the rest of the compiler, so that a project/team can decide for themselves to pay a little more for compilation in exchange for longer/better hashes.
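For illustration, that single class might look something like this - a minimal sketch, with a made-up class name and opt-in property, not code from this patch:

```java
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

// Hypothetical single chooser for the compiler's hash function, so a
// project/team could opt into a longer hash at some compile-time cost.
final class CompilerHashing {
  private static final HashFunction HASH =
      "sha256".equals(System.getProperty("gwt.hashFunction"))
          ? Hashing.sha256()       // opt-in: longer/stronger, slower
          : Hashing.murmur3_128(); // default: fast, 128-bit, non-cryptographic

  static HashFunction function() {
    return HASH;
  }

  private CompilerHashing() {}
}
```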
Besides those three, the full build still seems to pass with collisions in all other calls that produce hashes - that is, I replaced the result in all other places with "1234567890ABCDEF". This might point to insufficient tests; more research here would be a good idea long term.
Regardless of which hashing algorithm is used, I think there is value in this patch in making sure we don't make multiple copies of data just to hash it, but instead hash as we read/write.
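As a sketch of the read side of that idea (using Guava's stream wrappers; not code from this patch), the hash can be computed while the bytes flow past instead of buffering the payload first:

```java
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.hash.HashingInputStream;
import com.google.common.io.ByteStreams;
import java.io.IOException;
import java.io.InputStream;

// Wrap the source stream so every byte read also feeds the hasher;
// no second copy of the payload is ever materialized.
static HashCode hashWhileReading(InputStream in) throws IOException {
  try (HashingInputStream hashingIn =
      new HashingInputStream(Hashing.murmur3_128(), in)) {
    ByteStreams.exhaust(hashingIn); // or hand hashingIn to the real consumer
    return hashingIn.hash();
  }
}
```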
```java
        + permutationDescriptionString
        + optionsDescriptionString)
        .getBytes()));
String consistentHash = Hashing.murmur3_128().newHasher()
```
Collisions from this are especially bad for the compiler, and should be revisited (so that collisions don't result in broken code).
```diff
@@ -60,7 +61,7 @@ public CompileDependencyVisitor() {
   }

   public String getSignature() {
-    return Util.computeStrongName(Util.getBytes(getRawString()));
+    return Hashing.murmur3_128().hashString(getRawString(), StandardCharsets.UTF_8).toString();
```
Collisions in this result in the compiler failing to generate code correctly in a variety of ways - this should be revisited to find a safer way to handle collisions.
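One hypothetical shape for safer handling (illustrative only, not part of this patch): remember which input produced each signature and fail fast on a mismatch instead of silently generating broken code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard: track which input produced each signature and
// throw on a collision rather than emitting incorrect output.
final class CollisionGuard {
  private final Map<String, String> seen = new ConcurrentHashMap<>();

  String checked(String signature, String input) {
    String previous = seen.putIfAbsent(signature, input);
    if (previous != null && !previous.equals(input)) {
      throw new IllegalStateException("Hash collision on signature " + signature);
    }
    return signature;
  }
}
```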
```java
byte[] sourceBytes = source.getBytes(StandardCharsets.UTF_8);
strongHash = Hashing.murmur3_128().hashBytes(sourceBytes).toString().toUpperCase(Locale.ROOT);
sourceToken = diskCache.writeByteArray(sourceBytes);
```
Collisions from this are especially bad for the correct output, and should be revisited (so that collisions don't result in broken code).
GWT currently uses a mix of md5 and sha1, though all use cases are non-cryptographic (and both hash functions have been discouraged since before they were added to GWT). This patch proposes replacing them with Guava's murmur3 128 implementation, as it is slightly faster than md5 (and much faster than sha1), and can still be used to produce the same output format (32 upper-case hex digits).
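As a concrete illustration (mirroring the hashString/hashBytes calls in the snippets above, not new API), a murmur3-based strong name in the existing format:

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// murmur3_128 is 128 bits wide, the same as md5, so the familiar
// 32-character upper-case strong-name format is unchanged.
static String strongName(String source) {
  return Hashing.murmur3_128()
      .hashString(source, StandardCharsets.UTF_8)
      .toString()                 // 32 lower-case hex digits
      .toUpperCase(Locale.ROOT);  // match the existing strong-name casing
}
```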
As part of this refactor, some structured content now has size or field delimiters hashed in to avoid collisions, and some streamed content is hashed as it is read to avoid holding the entire payload in memory at once. The Util method handled the size delimiters in some cases but not uniformly, and it also ignored exceptions - this patch cleans those paths up to throw meaningful errors.
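A minimal sketch of the delimiter idea (illustrative, not the patch's exact code): without a size prefix, the field lists ("ab", "c") and ("a", "bc") concatenate to the same bytes and therefore hash identically; length-prefixing each field removes the ambiguity.

```java
import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

// Prefix each field with its byte length so different field splits
// can never produce the same hash input.
static String hashFields(String... fields) {
  Hasher hasher = Hashing.murmur3_128().newHasher();
  for (String field : fields) {
    byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
    hasher.putInt(bytes.length); // size delimiter
    hasher.putBytes(bytes);
  }
  return hasher.hash().toString();
}
```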
Fixes #10090
Before merging: