-
-
Notifications
You must be signed in to change notification settings - Fork 350
Ensure deterministic ordering of consolidated metadata #3288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure deterministic ordering of consolidated metadata #3288
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3288 +/- ##
==========================================
- Coverage 60.73% 60.72% -0.01%
==========================================
Files 78 78
Lines 9407 9408 +1
==========================================
Hits 5713 5713
- Misses 3694 3695 +1
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good change 👍 Please could you add a test that checks the order of metadata, to make sure we don't accidentally break this in the future.
Agreed. Could you also update the docs (probably both the docs in |
And if you're feeling especially ambitious, you could note this as a desirable feature at zarr-developers/zarr-specs#309. There could possibly be some indication in the Zarr metadata that the consolidated metadata has been sorted. |
221b95f
to
1cdf29a
Compare
I added a test that checks the proper sorting of a (nested) store structure. I also included a short paragraph in the user guide, that explains the new sorting of the consolidated metadata keys. I couldn't find a specific doc string to modify, and while thinking about it, the new behaviour seems less important on the individual function level, and more appropriate for the user guide. |
@@ -100,6 +100,11 @@ With nested groups, the consolidated metadata is available on the children, recu | |||
>>> consolidated['child'].metadata.consolidated_metadata | |||
ConsolidatedMetadata(metadata={'child': GroupMetadata(attributes={'kind': 'grandchild'}, zarr_format=3, consolidated_metadata=ConsolidatedMetadata(metadata={}, kind='inline', must_understand=False), node_type='group')}, kind='inline', must_understand=False) | |||
|
|||
The keys in the consolidated metadata are sorted in ascending order by path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify whether this is on the write side or read side (I think write?). That implies that users can rely on this when reading files written by zarr-python, but it's not necessarily safe to assume, unless we always sort when reading, since other systems writing consolidated metadata might not sort.
And perhaps note that this is new, by putting this in a .. versionadded:: 3.1.1
(or whatever the next version is, I'm not sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier in this doc, we do a pprint(dict(sorted(consolidated_metadata.items())))
. Is that sorted
unnecessary now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, the to_dict()
method is only used before writing. Therefore, I assume that reading from a non-sorted store will maintain the same key order as in the store. I have updated the user guide accordingly.
I removed the sorted()
function from the documentation, as I assume it is no longer needed. There are also instances in the tests where sorted()
is used before matching against the expected outcome. Technically, these are also no longer needed. However, I would keep them to make those tests specific; a regression in the sort shouldn't cause those tests to fail.
I added a versionadded
directive, but the result is rather bold – Sphinx puts the entire paragraph in a green box, as if it were a warning. I am happy to keep it that way, but personally, I don't like the aesthetic.
1cdf29a
to
8ceb880
Compare
8ceb880
to
1fb326e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think the codecov failure is spurious.
@dstansby are you able to take this the rest of the way? Your review is blocking the merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation and docs look 👍 I have one minor doc comment inline, which I think is worth fixing before merging to avoid confusion.
Head branch was pushed to by a user without write access
1fb326e
to
dfddbec
Compare
…nsolidated metadata
…data (#3322) Co-authored-by: Lukas Kluft <[email protected]>
This PR ensures that the contents of the consolidated metadata are sorted.
As a result, the consolidated metadata for the same input array becomes deterministic across all file systems, enabling, e.g., reliable checksum computation.
Keys are sorted in ascending order by path length (where a path is a sequence of strings joined by "/"). For keys with the same path length, lexicographic order is used as a tie-breaker (as suggested in #3281)
This sorting could be implemented in several places; for example,
flattened_metadata
could already be sorted. I chose to implement it in theto_dict()
method, which seemed the least invasive and closest to the end of the internal metadata processing pipeline. However, I'm open to moving this elsewhere if there’s a better place for it.I didn’t include a unit test because this behavior is difficult to test directly. It appears many existing tests assume sorted metadata. I suppose they pass due to the use of memory stores. That said, I'm happy to add a test if you have ideas on how best to validate this determinism.
TODO:
docs/user-guide/*.rst
changes/