Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish a design rationale / justification #88

Open
cessen opened this issue Feb 2, 2025 · 8 comments
Open

Publish a design rationale / justification #88

cessen opened this issue Feb 2, 2025 · 8 comments

Comments

@cessen
Copy link

cessen commented Feb 2, 2025

Large-output hashes like MeowHash aren't feasible to positively test for quality empirically due to their enormous state space. @NoHatCoder discussed this as well in an old issue, and it's fantastic to see that MeowHash isn't designed against SMHasher like so many other hash functions.

However, as the authors of a large-output hash, it's important not only to avoid relying on empirical test suites in the design process, but also to not rely on them when making quality claims. Instead, alongside the hash itself, you should provide an analysis that justifies the hash design with respect to quality. It doesn't need to be fancy, but it should be reasonably complete.

This is (very unfortunately) not yet standard practice for non-cryptographic hashes, but for large-output hashes it absolutely should be.

Since MeowHash is not yet declared final, I don't think this is urgent. But I wanted to poke you guys to see if providing a design justification is at least on the roadmap as a 1.0 target. It would be great if MeowHash could help set an example for how this ought to be done.


As an aside:

I've also developed a hash (TentHash) that's trying to set an example here. However, its design focuses more on simplictity and portability than performance, so people looking for extreme performance are unlikely to reach for it.

Part of my motivation for filing this issue is that I would really like to be able to recommend MeowHash to people who need or want higher performance than TentHash offers, but at the moment I can't do that. People currently often misguidedly reach for xxHash3's 128-bit variant in such cases, which is really unfortunate because it has some questionable design decisions and really shouldn't be used. Having MeowHash as a properly justified alternative would be a huge boon.

Aside-to-the-aside:

My own very cursory analysis of MeowHash 0.5 suggests that it's not conservative about quality, lacking sufficient mixing between the incorporation of input blocks. I may have missed something elsewhere in the design that accounts for that. If so, then that's an example of something that ought to be in a design justification document. If not, then that should be fixed. There's not much point in a large-output hash that's not conservative about quality.

I don't know if the 0.6 candidate hash functions (#82) address that or not, as I haven't taken a look yet.

@NoHatCoder
Copy link

I think it is going to be hard to provide a good design rationale for Meow 0.5, we ended up with a hack that really wasn't as effective as we thought it would be, and yeah, there is 0 design headroom. Practically I don't think it is a big issue, but it is hard to claim that it is perfect.

The 0.6 candidate eventually turned into Tjald, I think that might be what you want to recommend. I haven't had enough eyes on it to recommend cryptographic use yet, but for non-cryptographic use I don't see any issues. I have a big write-up on the design process https://github.com/NoHatCoder/Meow-Hash-0.6-Candidate/tree/main/Patterncheck I don't think anybody actually understand what I wrote, but the whole thing at least makes me very confident about the design.

I think your design rationale for TentHash spends a bit much time essentially saying "I didn't make this newbie mistake". What I would want from a design rationale is things that I can't just read in the code. For me the part on your diffusion test and how you chose the rotation counts is the most relevant. It is not that I think it is wrong to write about the rest of the design rationale, but it doesn't really do anything for my confidence in the construction. Overall I think TentHash is fine, but I wouldn't call it conservative, you pass your own statistical tests with 7 rounds, but I assume 6 rounds wouldn't pass.

I think your "blocks per full diffusion" statistic needs a revamp. Mainly 1 block doesn't have the same significance for all constructions. The way TentHash is designed it definitely needs a good avalanche between each block, but in Meow 0.5 and Tjald it takes many blocks before the effect of a previous block could potentially be undone. A more fair statistic would be how much diffusion has occurred while the function takes input data equal to it's internal state size, but even that doesn't fully describe the advantage of taking all input twice (which Meow 0.5 doesn't fully do, that might fool your statistic a bit, the avalanching speed of the single digested bytes can't be exploited on its own).

@cessen
Copy link
Author

cessen commented Feb 2, 2025

I think it is going to be hard to provide a good design rationale for Meow 0.5

To be clear, my recommendation is to provide a design rationale when declaring MeowHash final. I don't necessarily expect a design rationale write-up for a still WIP progress hash (although it's certainly nice).

I think your "blocks per full diffusion" statistic needs a revamp. Mainly 1 block doesn't have the same significance for all constructions.

I address that in the rather lengthy footnote for that column of the table. Indeed, there are hash constructions that make it an ill-suited measure, and its inclusion in the table is meant to be suggestive rather than definitive.

It's perhaps a little cheeky, but part of my reason for including it in the table also amounts to "If this measure doesn't apply to a given hash, and the hash actually lives up to its output size, it's on the creators of that hash to demonstrate that fact by publishing a design rationale."

but in Meow 0.5 and Tjald it takes many blocks before the effect of a previous block could potentially be undone

I can't speak for Tjald, of course, since I haven't looked at it yet.

But for MeowHash 0.5 the entry in the table actually (attempts to) take that into account. I didn't go that far with the other hashes in the table, and their entries do reflect just a single-block measure. The reason I did so for MeowHash is because I would very much like to see MeowHash succeed, and I wanted to give it as fair a shake as I could.

That's also why I bothered to submit this issue: I'm not trying to challenge MeowHash, I actually want it to become something I can recommend.

(EDIT: the table where I specifically took that into account was actually the table in "Hash Design and Goodhart's Law". My brain got wires crossed as I was responding. If we consider 3 blocks to be the new "1 block" as suggested in the linked source code comment, then the table in the TentHash readme would list 2 blocks rather than 6. Sorry for the mix up.)

The 0.6 candidate eventually turned into Tjald

Does that mean that MeowHash is essentially a dead project at this point? If so, that's really unfortunate. Unless I've missed something major, I don't think it's appropriate to recommend it even for non-cryptographic use in its current state.

I think your design rationale for TentHash spends a bit much time essentially saying "I didn't make this newbie mistake".

My goal in doing that was to try to make the design rationale accessible to people not as familiar with hash design. In retrospect, it's not clear to me how well I succeeded in that. But I don't think it hurts, at least.

I agree that the sections about the mix function are the actually important ones. TentHash uses a construction where its quality (I believe) provably reduces to the quality of the mix function, and indeed I suspect that's fairly obvious to anyone knowledgeable about hash design.

I wouldn't call it conservative, you pass your own statistical tests with 7 rounds, but I assume 6 rounds wouldn't pass.

That's correct, and that's fair.

I think we might be using "conservative" in a bit of a different way, though. As you're well aware, there is an ocean of non-cryptographic hashes out there that are essentially designed by trial-and-error against things like SMHasher. I categorize large-output hashes like that as "maybe okay, but it's hard to say because you can't positively test for quality at those output sizes".

When I say "conservative" I'm contrasting with those hashes.

"Conservative" of course would mean something different if you're only looking at hashes that are properly designed through analysis. Among such hashes, I absolutely agree that TentHash only qualifies as pretty run-of-the-mill, not conservative.

Nevertheless, I still think it's appropriate to call it conservative in the context of what is (unfortunately) currently normal in the non-cryptographic hash landscape.

@NoHatCoder
Copy link

Does that mean that MeowHash is essentially a dead project at this point? If so, that's really unfortunate.

Casey should have a say in exactly how this goes, but at least to some degree Tjald is the successor. Quality-wise it is clear to me that Tjald2 is better, and while I may not have fully justified this conclusion in writing, I do believe that the function itself should address all your concerns. Mainly it does 50% more AES rounds, and I have the write-up that nobody understands justifying that I spend them in a really good way.

Please do run Tjald2 through your Goodhart test, I think it will do a lot better. And have a go at seeing if you can understand the rationale, ask questions if you need.

I should add that I don't think a non-cryptographic function needs to be conservative in the way a cryptographic one does. We try to guard cryptographic code against future mathematical advances by doing more computation than what we know to be necessary. But the non-cryptographic space really doesn't have the same risk of new discoveries, if your justification for the strength of TentHash turns out not to hold, it is not going to be by a huge margin, and the practical impact will be virtually zero.

@cmuratori
Copy link
Owner

I don't have an opinion on this, and am happy to defer to @NoHatCoder. My goal originally was just to have a very fast hash that could work on large art assets. The original Meow actually worked fine for this, but when Jacob suggested he might be able to make a construction that was more sound, I was in favor of trying, anyway. It sort-of worked out, and sort-of didn't, which left Meow in a bit of a weird place.

Since I am not a cryptographer, I don't have a strong opinion about what happens from here. I have left Meow 0.5 up because it is useful as-is, so if people want it, they can use it. But I do think that if Jacob's new hash (Tjald) gets properly put through its paces and turns out to be more resistant, then as long as there aren't any unforeseen performance regressions, I can't see much of a reason why people should use Meow in new projects. It could presumably be sunset as a legacy hash, and people OK with the AES requirement would switch to Tjald going forward.

If that is too confusing for some reason, I also don't object to making Tjald be Meow 0.6, if that needs to happen. Meow changed constructions previously, so it could certainly do so again. But I assumed it would be clearer this time to just let it be a new hash name - Tjald in this case - since there are a lot of people using Meow, and Tjald is not backwards-compatible. But if the consensus was that it was better for some reason to maintain the Meow hash naming, then I have no objections to doing that and I can help do that update as necessary.

That's my thinking at the moment.

- Casey

@cessen
Copy link
Author

cessen commented Feb 3, 2025

My goal originally was just to have a very fast hash that could work on large art assets. The original Meow actually worked fine for this

Was the intended use case error detection, or content addressing?

If the former, then I agree that the original MeowHash was probably fine. You don't need 128 bits for that. But for content addressing / fingerprinting / etc. (the use cases that actually need 128+ bits), I think even the current MeowHash can't be recommended, for the reasons I outlined in Hash Design and Goodhart's Law.

It's nothing at all to do with cryptographic security—I'm also not a cryptographer, and have no meaningful expertise in that area. My concerns are all from a level-1 hash standpoint (to use @NoHatCoder's classification system). It's purely about statistical soundness, and being able to have confidence that a hash lives up to its output size.

If that is too confusing for some reason, I also don't object to making Tjald be Meow 0.6, if that needs to happen.

I think it's fine for Tjald to take the use cases that need security claims. But I think it would be good if MeowHash continues to serve the use case of level-1 hashing. So a 0.6 release that addresses issues relevant to that would IMO be a very good thing. MeowHash already has a lot of brand recognition, so to speak, and I suspect it would be easier to get people to use it.

@NoHatCoder
Copy link

We could just call Tjald2 Meow. I should add that the reason there is no Tjald1 is that I found no real way to make it faster than Tjald2. The 50% more AES rounds than Meow 0.5 does not actually turn into 50% more execution time on modern CPUs as we are not limited by AES throughput. The more relevant stat is 7 SIMD instructions per round for Tjald2, and 6 for Meow 0.5, but on my old Zen2 Tjald2 also scheduled better due to the bigger state, so they were neck and neck.

If we want to make changes it should probably be to get better small input speed, where Tjald2 isn't amazing.

@cmuratori
Copy link
Owner

OK. I will try to set aside some time to study Tjald and understand the performance, etc., so we can figure out a merge plan. Meow also wasn't great at small inputs, so that part may not matter - I would hope that anyone choosing Meow is choosing it for things where they expect the majority of their inputs to be more than 1k, etc. We could certainly add some warnings to that effect on 0.6 as well.

- Casey

@cessen
Copy link
Author

cessen commented Feb 3, 2025

I'll try to make time to look at Tjald in the next few weeks or so as well, mainly coming at it from a statistical soundness standpoint.

Meow also wasn't great at small inputs, so that part may not matter - I would hope that anyone choosing Meow is choosing it for things where they expect the majority of their inputs to be more than 1k, etc. We could certainly add some warnings to that effect on 0.6 as well.

My expectation is that anyone who actually needs a 128-bit hash output probably doesn't have small inputs as their typical case. But it never hurts to make performance characteristics clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants