[do not merge] proof of concept for unified v2 / v3 codecs #3276

Draft
wants to merge 129 commits into
base: main

@d-v-b d-v-b commented Jul 21, 2025

This PR is a proof of concept that demonstrates how we can augment our Codec API so that the same codec object works with both v2 and v3 arrays.

Here are two motivating examples that run under this PR and fail on main:

gzip compression

This example shows that we can use the exact same codec (GzipCodec or numcodecs.GZip) with zarr v2 or zarr v3 to create an array. On main, this fails if you try to use GzipCodec with zarr v2, or numcodecs.GZip with zarr v3, even though the underlying gzip compression is identical.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/d-v-b/zarr-python.git@a2bc6555",
#   "pytest"
# ]
# ///

import pytest
import numpy as np
import zarr
from zarr.codecs import GzipCodec
from numcodecs.gzip import GZip

@pytest.mark.parametrize("zarr_format", [2, 3])
@pytest.mark.parametrize('codec_cls', [GzipCodec, GZip])
def test_gzip_compression(zarr_format, codec_cls) -> None:
    store = {}
    z_w = zarr.create_array(
        store=store,
        dtype="int",
        shape=(1,),
        chunks=(10,),
        zarr_format=zarr_format,
        compressors=codec_cls(),
    )
    z_w[:] = 5

    z_r = zarr.open_array(store=store, zarr_format=zarr_format)
    assert np.all(z_r[:] == 5)

if __name__ == "__main__":
    pytest.main([__file__, f"-c {__file__}", "-s"])

jpeg compression

This example demonstrates using the Jpeg codec defined in imagecodecs as a compressor for a v2 array and a serializer for a v3 array. Data can be written and read back.

  • The codec is passed to create_array as an object that implements the numcodecs.abc.Codec API.
  • We internally wrap that object in a class that emulates the zarr-python Codec API.
  • The same object creates the v2 or v3 JSON metadata, depending on the zarr format of the array.
  • When decoding, we check the numcodecs registry, but only if numcodecs is installed. This could be improved: we could register the codec directly in our own codec registry and remove numcodecs from this example entirely.
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/d-v-b/zarr-python.git@a2bc6555",
#   "imagecodecs==2025.3.30",
#   "pytest"
# ]
# ///

#   "zarr @ git+https://github.com/zarr-developers/zarr-python.git@main",
from typing import Literal

import numcodecs
import numpy as np
import pytest
from imagecodecs.numcodecs import Jpeg

import zarr

numcodecs.register_codec(Jpeg)
jpg_codec = Jpeg()


@pytest.mark.parametrize("zarr_format", [2, 3])
def test(zarr_format: Literal[2, 3]) -> None:
    store = {}
    if zarr_format == 2:
        z_w = zarr.create_array(
            store=store,
            data=np.zeros((100, 100, 3), dtype=np.uint8),
            compressors=jpg_codec,
            zarr_format=zarr_format,
        )
    else:
        z_w = zarr.create_array(
            store=store,
            data=np.zeros((100, 100, 3), dtype=np.uint8),
            serializer=jpg_codec,
            zarr_format=zarr_format,
        )
    z_w[:] = 2
    z_r = zarr.open_array(store=store, zarr_format=zarr_format)
    assert np.all(z_r[:] == 2)
    if zarr_format == 2:
        print(z_r.metadata.to_dict()["compressor"])
    else:
        print(z_r.metadata.to_dict()["codecs"])


if __name__ == "__main__":
    pytest.main([__file__, f"-c {__file__}", "-s"])
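
The wrapping described in the bullets before this script can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: NumcodecLike, ToyZlib, and NumcodecAdapter are hypothetical names, the protocol only covers the subset of numcodecs.abc.Codec used here (codec_id, encode, decode, get_config), and the v3 name chosen by to_json is an assumption that the real implementation may handle differently.

```python
import zlib
from typing import Any, Literal, Protocol, runtime_checkable


@runtime_checkable
class NumcodecLike(Protocol):
    """Structural stand-in for (a subset of) the numcodecs.abc.Codec interface."""

    codec_id: str

    def encode(self, buf: Any) -> Any: ...

    def decode(self, buf: Any, out: Any = None) -> Any: ...

    def get_config(self) -> dict[str, Any]: ...


class ToyZlib:
    """Hypothetical codec that satisfies NumcodecLike without importing numcodecs."""

    codec_id = "zlib"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def encode(self, buf: bytes) -> bytes:
        return zlib.compress(buf, self.level)

    def decode(self, buf: bytes, out: Any = None) -> bytes:
        return zlib.decompress(buf)

    def get_config(self) -> dict[str, Any]:
        # numcodecs-style config: the codec id lives next to the parameters.
        return {"id": self.codec_id, "level": self.level}


class NumcodecAdapter:
    """Hypothetical wrapper: delegates encode/decode, emits v2 or v3 metadata."""

    def __init__(self, codec: NumcodecLike) -> None:
        if not isinstance(codec, NumcodecLike):
            raise TypeError(f"{codec!r} does not look like a numcodecs codec")
        self.codec = codec

    def encode(self, buf: Any) -> Any:
        return self.codec.encode(buf)

    def decode(self, buf: Any) -> Any:
        return self.codec.decode(buf)

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        config = dict(self.codec.get_config())
        if zarr_format == 2:
            # v2 metadata is the flat numcodecs config, keyed by "id".
            return config
        # v3 metadata separates the name from the configuration (naming assumed).
        name = config.pop("id")
        return {"name": name, "configuration": config}


adapter = NumcodecAdapter(ToyZlib(level=3))
print(adapter.to_json(2))
print(adapter.to_json(3))
```

Using a structural Protocol here means the adapter never has to import numcodecs, which is the point of the third item in the list of required changes below.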

What this requires

This functionality requires a set of changes that I would like to introduce in a series of PRs:

  • Adding to_json(zarr_format: Literal[2,3]) methods to all the codecs, so the same object can generate zarr v2 or zarr v3 metadata. example.
  • Adding from_json(data, zarr_format: Literal[2,3]) methods to all the codecs, so the same object can be created from zarr v2 metadata or zarr v3 metadata. example.
  • Adding a protocol that models the numcodecs.abc.Codec API, so we can interact with numcodecs objects type-safely without a numcodecs dependency. example.
  • Adding a Codec class that wraps a numcodec-like object and endows it with full codec powers. These numcodec-adapter objects can be cast to array-array, array-bytes, or bytes-bytes codecs as needed. example.
  • Adding logic in the codec pipeline to handle codecs that are a mix of zarr-python native codecs and numcodecs-adapter codecs. This ensures that we have a valid codec pipeline. example.

There might be more changes that I'm forgetting right now, but these are the big ones.
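
To make the first two items concrete, here is a minimal sketch of a codec whose to_json and from_json take a zarr_format argument. GzipSketch is a hypothetical stand-in, not the GzipCodec in this PR, and the real methods may differ in signature and validation. The two JSON shapes follow the usual conventions: a flat dict keyed by "id" for v2, and "name" plus "configuration" for v3.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical codec that can describe itself as v2 or v3 metadata."""

    level: int = 5

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        if zarr_format == 2:
            return {"id": "gzip", "level": self.level}
        if zarr_format == 3:
            return {"name": "gzip", "configuration": {"level": self.level}}
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    @classmethod
    def from_json(cls, data: dict[str, Any], zarr_format: Literal[2, 3]) -> GzipSketch:
        if zarr_format == 2:
            if data.get("id") != "gzip":
                raise ValueError(f"not v2 gzip metadata: {data!r}")
            return cls(level=data["level"])
        if zarr_format == 3:
            if data.get("name") != "gzip":
                raise ValueError(f"not v3 gzip metadata: {data!r}")
            return cls(level=data["configuration"]["level"])
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")


codec = GzipSketch(level=7)
for fmt in (2, 3):
    # The same object round-trips through either metadata format.
    assert GzipSketch.from_json(codec.to_json(fmt), fmt) == codec
```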

We also need checks to ensure that a codec claiming to be "gzip" generates gzip-compatible metadata. There are a few ways of doing this: inspect the metadata it generates, or replace it with the in-house codec of the same name. I haven't implemented either option in this PR.
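
As a rough illustration of the first option (inspecting the generated metadata), a check could look like the sketch below. It is not implemented in this PR. The function name is_gzip_compatible_v3 is hypothetical, and the rule it applies (a single integer "level" between 0 and 9) is my reading of the v3 gzip codec metadata, so treat it as an assumption.

```python
from typing import Any


def is_gzip_compatible_v3(meta: dict[str, Any]) -> bool:
    """Return True if `meta` looks like valid v3 metadata for the gzip codec."""
    if meta.get("name") != "gzip":
        return False
    config = meta.get("configuration")
    # Assumed rule: the configuration holds exactly one key, "level".
    if not isinstance(config, dict) or set(config) != {"level"}:
        return False
    level = config["level"]
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(level, int) and not isinstance(level, bool) and 0 <= level <= 9


print(is_gzip_compatible_v3({"name": "gzip", "configuration": {"level": 5}}))
print(is_gzip_compatible_v3({"name": "gzip", "configuration": {"level": 5, "extra": 1}}))
```

A wrapped third-party codec that calls itself "gzip" but fails such a check could then be rejected, or swapped for the in-house codec as in the second option.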

What I would like to do next

If we all agree on this strategy, I would like to start breaking this PR into separate segments and getting them merged. I think we can do all of this in a non-breaking manner, so hopefully we don't need this to be part of 3.2.

@github-actions github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Jul 21, 2025
@d-v-b d-v-b changed the title from "[do not merge] proof of concept for a unified v2 / v3 codecs" to "[do not merge] proof of concept for unified v2 / v3 codecs" Jul 21, 2025

codecov bot commented Jul 21, 2025

Codecov Report

Attention: Patch coverage is 36.15819% with 339 lines in your changes missing coverage. Please review.

Project coverage is 58.15%. Comparing base (abbdbf2) to head (50c6b48).

Files with missing lines        Patch %   Lines
src/zarr/codecs/numcodec.py     33.33%    52 Missing ⚠️
src/zarr/codecs/blosc.py        30.64%    43 Missing ⚠️
src/zarr/codecs/vlen_utf8.py    28.26%    33 Missing ⚠️
src/zarr/codecs/bytes.py        36.17%    30 Missing ⚠️
src/zarr/codecs/sharding.py     22.58%    24 Missing ⚠️
src/zarr/codecs/transpose.py    20.68%    23 Missing ⚠️
src/zarr/codecs/zstd.py         34.28%    23 Missing ⚠️
src/zarr/abc/codec.py           16.66%    20 Missing ⚠️
src/zarr/codecs/crc32c_.py      25.92%    20 Missing ⚠️
src/zarr/codecs/gzip.py         35.48%    20 Missing ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3276      +/-   ##
==========================================
- Coverage   59.56%   58.15%   -1.42%     
==========================================
  Files          78       79       +1     
  Lines        8684     9057     +373     
==========================================
+ Hits         5173     5267      +94     
- Misses       3511     3790     +279     
Files with missing lines          Coverage Δ
src/zarr/core/codec_pipeline.py   62.61% <ø> (ø)
src/zarr/codecs/_v2.py            0.00% <0.00%> (-59.58%) ⬇️
src/zarr/core/metadata/v3.py      54.48% <62.50%> (+0.85%) ⬆️
src/zarr/core/array.py            68.45% <59.09%> (-0.58%) ⬇️
src/zarr/core/common.py           35.48% <0.00%> (-2.45%) ⬇️
src/zarr/core/metadata/v2.py      56.96% <59.37%> (+0.29%) ⬆️
src/zarr/registry.py              65.71% <70.21%> (+1.68%) ⬆️
src/zarr/abc/codec.py             20.45% <16.66%> (-0.76%) ⬇️
src/zarr/codecs/crc32c_.py        32.72% <25.92%> (-8.45%) ⬇️
src/zarr/codecs/gzip.py           30.00% <35.48%> (+1.42%) ⬆️
... and 7 more