Ensure deterministic ordering of consolidated metadata #3288

Merged
8 commits merged into zarr-developers:main on Aug 1, 2025

Conversation

lkluft
Contributor

@lkluft lkluft commented Jul 23, 2025

This PR ensures that the contents of the consolidated metadata are sorted.

As a result, the consolidated metadata for the same input array becomes deterministic across all file systems, enabling, e.g., reliable checksum computation.
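To illustrate why key order matters for checksums, here is a minimal sketch that does not use zarr at all. The `digest` helper and the sample metadata are hypothetical, and plain alphabetical sorting stands in for "some fixed ordering"; it is not meant to reproduce the rule this PR implements.

```python
import hashlib
import json


def digest(metadata: dict) -> str:
    """Hash the JSON serialization, which preserves dict insertion order."""
    payload = json.dumps(metadata).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same entries, discovered in two different orders (e.g. on two file systems).
listing_a = {"air": {"node_type": "array"}, "lat": {"node_type": "array"}}
listing_b = {"lat": {"node_type": "array"}, "air": {"node_type": "array"}}

# Without a fixed order the serialized bytes differ, so the digests differ.
unsorted_match = digest(listing_a) == digest(listing_b)

# Applying any fixed ordering before serializing makes the digests agree.
sorted_match = digest(dict(sorted(listing_a.items()))) == digest(
    dict(sorted(listing_b.items()))
)
```

Here `unsorted_match` is `False` and `sorted_match` is `True`, which is the property the sorted consolidated metadata is meant to provide.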

Keys are sorted in ascending order by path length (where a path is a sequence of strings joined by "/"). For keys with the same path length, lexicographic order is used as a tie-breaker (as suggested in #3281).
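The ordering described above can be sketched in a few lines of plain Python. The helper name `path_sort_key` is hypothetical, and using Python's default string comparison for the tie-break is an assumption; the merged code may normalize keys differently.

```python
def path_sort_key(path: str) -> tuple[int, str]:
    """Sort by number of path segments first, then by the path string."""
    depth = len(path.split("/"))
    return (depth, path)


keys = ["b/x", "a", "a/y/z", "b", "a/y"]
ordered = sorted(keys, key=path_sort_key)
print(ordered)  # ['a', 'b', 'a/y', 'b/x', 'a/y/z']
```

Shallow paths come first, so a parent group always precedes its children regardless of the order in which the store listed them.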

This sorting could be implemented in several places; for example, flattened_metadata could already be sorted. I chose to implement it in the to_dict() method, which seemed the least invasive and closest to the end of the internal metadata processing pipeline. However, I'm open to moving this elsewhere if there’s a better place for it.

I didn’t include a unit test because this behavior is difficult to test directly. It appears many existing tests assume sorted metadata. I suppose they pass due to the use of memory stores. That said, I'm happy to add a test if you have ideas on how best to validate this determinism.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

codecov bot commented Jul 23, 2025

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.72%. Comparing base (4eda04e) to head (a65e598).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/core/group.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3288      +/-   ##
==========================================
- Coverage   60.73%   60.72%   -0.01%     
==========================================
  Files          78       78              
  Lines        9407     9408       +1     
==========================================
  Hits         5713     5713              
- Misses       3694     3695       +1     
Files with missing lines Coverage Δ
src/zarr/core/group.py 70.12% <0.00%> (-0.08%) ⬇️

Contributor

@dstansby dstansby left a comment

This looks like a good change 👍 Please could you add a test that checks the order of metadata, to make sure we don't accidentally break this in the future.

@TomAugspurger
Contributor

Agreed.

Could you also update the docs (probably both the docs in open_consolidated and https://github.com/zarr-developers/zarr-python/blob/main/docs/user-guide/consolidated_metadata.rst) stating what users can rely on?

@TomAugspurger
Contributor

And if you're feeling especially ambitious, you could note this as a desirable feature at zarr-developers/zarr-specs#309. There could possibly be some indication in the Zarr metadata that the consolidated metadata has been sorted.

@lkluft lkluft force-pushed the sort-consolidated-metadata branch 3 times, most recently from 221b95f to 1cdf29a Compare July 24, 2025 16:14
@lkluft
Contributor Author

lkluft commented Jul 24, 2025

I added a test that checks the proper sorting of a (nested) store structure.

I also included a short paragraph in the user guide that explains the new sorting of the consolidated metadata keys. I couldn't find a specific docstring to modify and, on reflection, the new behaviour seems less relevant at the level of individual functions and more appropriate for the user guide.

@@ -100,6 +100,11 @@ With nested groups, the consolidated metadata is available on the children, recu
>>> consolidated['child'].metadata.consolidated_metadata
ConsolidatedMetadata(metadata={'child': GroupMetadata(attributes={'kind': 'grandchild'}, zarr_format=3, consolidated_metadata=ConsolidatedMetadata(metadata={}, kind='inline', must_understand=False), node_type='group')}, kind='inline', must_understand=False)

The keys in the consolidated metadata are sorted in ascending order by path
Contributor

Can you clarify whether this is on the write side or read side (I think write?). That implies that users can rely on this when reading files written by zarr-python, but it's not necessarily safe to assume, unless we always sort when reading, since other systems writing consolidated metadata might not sort.

And perhaps note that this is new, by putting this in a .. versionadded:: 3.1.1 (or whatever the next version is, I'm not sure).

Contributor

Earlier in this doc, we do a pprint(dict(sorted(consolidated_metadata.items()))). Is that sorted() call unnecessary now?

Contributor Author

As I understand it, the to_dict() method is only used before writing. Therefore, I assume that reading from a non-sorted store will maintain the same key order as in the store. I have updated the user guide accordingly.
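Since sorting happens only on the write side, a reader that needs a guaranteed order for stores written by other tools could re-sort defensively. The following is a minimal sketch on a plain dict; `ensure_sorted` is a hypothetical helper, not part of zarr-python, and the key function is an assumed reading of the rule described in this PR.

```python
def ensure_sorted(metadata: dict) -> dict:
    """Return metadata ordered by (path depth, path); reuse it if already sorted."""

    def key_fn(path: str) -> tuple[int, str]:
        return (path.count("/"), path)

    keys = list(metadata)
    expected = sorted(keys, key=key_fn)
    if keys == expected:
        # Already in the expected order, e.g. written by a sorting writer.
        return metadata
    return {k: metadata[k] for k in expected}


# Keys as they might appear in a store written by a tool that does not sort.
foreign = {"b/x": 1, "a": 2, "b": 3}
result = ensure_sorted(foreign)
print(list(result))  # ['a', 'b', 'b/x']
```

Calling `ensure_sorted` on an already sorted mapping returns the same object, so the check costs one sort and no extra copy.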

I removed the sorted() function from the documentation, as I assume it is no longer needed. There are also instances in the tests where sorted() is used before matching against the expected outcome. Technically, these are also no longer needed. However, I would keep them to make those tests specific; a regression in the sort shouldn't cause those tests to fail.

I added a versionadded directive, but the result is rather bold – Sphinx puts the entire paragraph in a green box, as if it were a warning. I am happy to keep it that way, but personally, I don't like the aesthetic.

@lkluft lkluft force-pushed the sort-consolidated-metadata branch from 1cdf29a to 8ceb880 Compare July 25, 2025 09:36
@lkluft lkluft force-pushed the sort-consolidated-metadata branch from 8ceb880 to 1fb326e Compare July 25, 2025 09:55
Contributor

@TomAugspurger TomAugspurger left a comment

Thanks! I think the codecov failure is spurious.

@TomAugspurger TomAugspurger requested a review from dstansby July 29, 2025 13:18
@TomAugspurger TomAugspurger enabled auto-merge (squash) July 29, 2025 13:18
@dstansby dstansby added this to the 3.1.2 milestone Jul 31, 2025
@TomAugspurger
Contributor

@dstansby are you able to take this the rest of the way? Your review is blocking the merge.

Contributor

@dstansby dstansby left a comment

Implementation and docs look 👍 I have one minor doc comment inline, which I think is worth fixing before merging to avoid confusion.

auto-merge was automatically disabled August 1, 2025 10:23

Head branch was pushed to by a user without write access

@lkluft lkluft force-pushed the sort-consolidated-metadata branch from 1fb326e to dfddbec Compare August 1, 2025 10:23
@dstansby dstansby enabled auto-merge (squash) August 1, 2025 10:34
@dstansby dstansby merged commit e410173 into zarr-developers:main Aug 1, 2025
29 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/zarr-python that referenced this pull request Aug 1, 2025
@lkluft lkluft deleted the sort-consolidated-metadata branch August 1, 2025 13:07
dstansby pushed a commit that referenced this pull request Aug 1, 2025