[do not merge] proof of concept for unified v2 / v3 codecs #3276

d-v-b · 2025-07-21T20:59:28Z

This PR is a proof of concept that demonstrates how we can augment our Codec API to support v2 or v3 codecs.

Here are two motivating examples that run under this PR and fail in main:

gzip compression

This example shows that we can use the exact same codec (GzipCodec or numcodecs.GZip) with zarr v2 or zarr v3 to create an array. This fails on main if you try to use GzipCodec with zarr v2, or numcodecs.GZip with zarr v3, even though the underlying gzip compression is identical.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/d-v-b/zarr-python.git@a2bc6555",
#   "pytest"
# ]
# ///
import pytest
import numpy as np
import zarr
from zarr.codecs import GzipCodec
from numcodecs.gzip import GZip

@pytest.mark.parametrize("zarr_format", [2, 3])
@pytest.mark.parametrize("codec_cls", [GzipCodec, GZip])
def test_gzip_compression(zarr_format, codec_cls) -> None:
    store = {}
    z_w = zarr.create_array(
        store=store,
        dtype="int",
        shape=(1,),
        chunks=(10,),
        zarr_format=zarr_format,
        compressors=codec_cls(),
    )
    z_w[:] = 5

    z_r = zarr.open_array(store=store, zarr_format=zarr_format)
    assert np.all(z_r[:] == 5)

if __name__ == "__main__":
    pytest.main([__file__, f"-c {__file__}", "-s"])

jpeg compression

This example demonstrates using the Jpeg codec defined in imagecodecs as a compressor for a v2 array and a serializer for a v3 array. Data can be written and read back.

The codec comes in to create_array as an object that implements the numcodecs.abc.Codec API.
We internally wrap that object in a class that emulates the zarr-python Codec API.
v2 and v3 JSON metadata is created by the same object, depending on the zarr format of the array.
When decoding, we check the numcodecs registry, but only if numcodecs is installed. This could be improved: we could register the codec directly in our own codec registry and remove numcodecs entirely from this example.
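
The wrapping and lookup steps above could look roughly like the following sketch. It is not the code in this PR: `NumcodecsAdapter`, `_REGISTRY`, and `resolve_v2_codec` are hypothetical names, the v3 JSON shape for a wrapped codec (`name` plus `configuration`) is an assumption, and `FakeZlib` only mimics the `numcodecs.abc.Codec` methods so the example runs without numcodecs installed.

```python
import zlib
from typing import Any, Literal


class FakeZlib:
    """Stand-in that mimics the numcodecs.abc.Codec surface (hypothetical)."""

    codec_id = "zlib"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def encode(self, buf: bytes) -> bytes:
        return zlib.compress(buf, self.level)

    def decode(self, buf: bytes, out: Any = None) -> bytes:
        return zlib.decompress(buf)

    def get_config(self) -> dict[str, Any]:
        return {"id": self.codec_id, "level": self.level}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "FakeZlib":
        return cls(level=config.get("level", 1))


class NumcodecsAdapter:
    """Hypothetical wrapper: one object, two metadata dialects."""

    def __init__(self, codec: Any) -> None:
        self.codec = codec

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        config = dict(self.codec.get_config())
        if zarr_format == 2:
            # v2 metadata is the numcodecs config as-is.
            return config
        # Assumed v3 shape: the id becomes "name", the rest "configuration".
        name = config.pop("id")
        return {"name": name, "configuration": config}

    def encode(self, data: bytes) -> bytes:
        return self.codec.encode(data)

    def decode(self, data: bytes) -> bytes:
        return self.codec.decode(data)


# Hypothetical in-house registry, consulted before any numcodecs lookup.
_REGISTRY: dict[str, type] = {"zlib": FakeZlib}


def resolve_v2_codec(config: dict[str, Any]) -> NumcodecsAdapter:
    """Look up a codec by id, falling back to numcodecs only if installed."""
    codec_cls = _REGISTRY.get(config["id"])
    if codec_cls is not None:
        return NumcodecsAdapter(codec_cls.from_config(config))
    try:
        import numcodecs  # optional dependency
    except ImportError as e:
        raise KeyError(f"unknown codec id: {config['id']!r}") from e
    return NumcodecsAdapter(numcodecs.get_codec(config))
```

Checking the in-house registry first means the numcodecs import is only attempted for codec ids we do not already know about.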

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/d-v-b/zarr-python.git@a2bc6555",
#   "imagecodecs==2025.3.30",
#   "pytest"
# ]
# ///

#   "zarr @ git+https://github.com/zarr-developers/zarr-python.git@main",
from typing import Literal

import numcodecs
import numpy as np
import pytest
from imagecodecs.numcodecs import Jpeg

import zarr

numcodecs.register_codec(Jpeg)
jpg_codec = Jpeg()


@pytest.mark.parametrize("zarr_format", [2, 3])
def test(zarr_format: Literal[2, 3]) -> None:
    store = {}
    if zarr_format == 2:
        z_w = zarr.create_array(
            store=store,
            data=np.zeros((100, 100, 3), dtype=np.uint8),
            compressors=jpg_codec,
            zarr_format=zarr_format,
        )
    else:
        z_w = zarr.create_array(
            store=store,
            data=np.zeros((100, 100, 3), dtype=np.uint8),
            serializer=jpg_codec,
            zarr_format=zarr_format,
        )
    z_w[:] = 2
    z_r = zarr.open_array(store=store, zarr_format=zarr_format)
    assert np.all(z_r[:] == 2)
    if zarr_format == 2:
        print(z_r.metadata.to_dict()["compressor"])
    else:
        print(z_r.metadata.to_dict()["codecs"])


if __name__ == "__main__":
    pytest.main([__file__, f"-c {__file__}", "-s"])

What this requires

This functionality requires a set of changes that I would like to introduce in a series of PRs:

Adding to_json(zarr_format: Literal[2, 3]) methods to all the codecs, so the same object can generate zarr v2 or zarr v3 metadata.
Adding from_json(data, zarr_format: Literal[2, 3]) methods to all the codecs, so the same object can be created from zarr v2 metadata or zarr v3 metadata.
Adding a protocol that models the numcodecs.abc.Codec API, so we can interact with numcodecs objects type-safely without a numcodecs dependency.
Adding a Codec class that wraps a numcodecs-like object and gives it the full zarr-python codec interface. These numcodecs-adapter objects can be cast to an array-array codec, array-bytes codec, or bytes-bytes codec as needed.
Adding logic in the codec pipeline to handle pipelines that mix zarr-python native codecs and numcodecs-adapter codecs, so that the resulting codec pipeline is valid.
There might be more changes that I'm forgetting right now, but these are the big ones.
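
As a rough illustration of the first three items, here is a minimal sketch of a protocol plus a codec with format-aware JSON methods. `NumcodecLike` and `GzipSketch` are hypothetical names and not the classes in this PR. The v2 shape follows the numcodecs config convention (`id` plus parameters) and the v3 shape follows the zarr v3 codec convention (`name` plus `configuration`).

```python
import gzip
from dataclasses import dataclass
from typing import Any, Literal, Protocol, runtime_checkable


@runtime_checkable
class NumcodecLike(Protocol):
    """Structural stand-in for numcodecs.abc.Codec (hypothetical name)."""

    codec_id: str

    def encode(self, buf: Any) -> Any: ...
    def decode(self, buf: Any, out: Any = None) -> Any: ...
    def get_config(self) -> dict[str, Any]: ...


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical codec that can emit and parse both metadata dialects."""

    level: int = 5
    codec_id = "gzip"  # plain class attribute, not a dataclass field

    def to_json(self, zarr_format: Literal[2, 3]) -> dict[str, Any]:
        if zarr_format == 2:
            return {"id": "gzip", "level": self.level}
        return {"name": "gzip", "configuration": {"level": self.level}}

    @classmethod
    def from_json(
        cls, data: dict[str, Any], zarr_format: Literal[2, 3]
    ) -> "GzipSketch":
        if zarr_format == 2:
            return cls(level=data["level"])
        return cls(level=data["configuration"]["level"])

    # numcodecs-style methods, so the same object satisfies NumcodecLike.
    def encode(self, buf: Any) -> bytes:
        return gzip.compress(bytes(buf), compresslevel=self.level)

    def decode(self, buf: Any, out: Any = None) -> bytes:
        return gzip.decompress(bytes(buf))

    def get_config(self) -> dict[str, Any]:
        return self.to_json(zarr_format=2)
```

Because the protocol is structural, `isinstance(GzipSketch(), NumcodecLike)` holds without importing numcodecs, which is the point of modelling the API this way.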

We also need checks to ensure that a codec claiming to be "gzip" generates gzip-compatible metadata. There are a few ways of doing this (inspect the metadata it generates, or replace it with the in-house codec with the same name), but I haven't implemented either option in this PR.
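
The "inspect the metadata" option might look something like the sketch below. This is purely illustrative and not implemented in this PR: `check_gzip_metadata` is a hypothetical helper, and the accepted key set (`level` only) and the 0 to 9 range are assumptions about what "gzip-compatible" should mean.

```python
from typing import Any, Literal


def check_gzip_metadata(meta: dict[str, Any], zarr_format: Literal[2, 3]) -> None:
    """Raise ValueError if `meta` does not look like gzip codec metadata."""
    if zarr_format == 2:
        # v2: flat dict with an "id" key plus parameters.
        name = meta.get("id")
        config = {k: v for k, v in meta.items() if k != "id"}
    else:
        # v3: "name" plus a nested "configuration" dict.
        name = meta.get("name")
        config = meta.get("configuration", {})
    if name != "gzip":
        raise ValueError(f"expected codec name 'gzip', got {name!r}")
    if set(config) != {"level"}:
        raise ValueError(f"unexpected gzip configuration keys: {sorted(config)}")
    level = config["level"]
    if isinstance(level, bool) or not isinstance(level, int) or not 0 <= level <= 9:
        raise ValueError(f"gzip level must be an int in [0, 9], got {level!r}")
```

A check like this could run once when a codec is wrapped, so a mismatch is reported at array creation rather than when the metadata is read back.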

What I would like to do next

If we all agree on this strategy, I would like to start breaking this PR into separate segments and getting them merged. I think we can do all of this in a non-breaking manner, so hopefully we don't need this to be part of 3.2

codecov · 2025-07-21T21:11:40Z

Codecov Report

Attention: Patch coverage is 36.15819% with 339 lines in your changes missing coverage. Please review.

Project coverage is 58.15%. Comparing base (abbdbf2) to head (50c6b48).

Files with missing lines	Patch %	Lines
src/zarr/codecs/numcodec.py	33.33%	52 Missing ⚠️
src/zarr/codecs/blosc.py	30.64%	43 Missing ⚠️
src/zarr/codecs/vlen_utf8.py	28.26%	33 Missing ⚠️
src/zarr/codecs/bytes.py	36.17%	30 Missing ⚠️
src/zarr/codecs/sharding.py	22.58%	24 Missing ⚠️
src/zarr/codecs/transpose.py	20.68%	23 Missing ⚠️
src/zarr/codecs/zstd.py	34.28%	23 Missing ⚠️
src/zarr/abc/codec.py	16.66%	20 Missing ⚠️
src/zarr/codecs/crc32c_.py	25.92%	20 Missing ⚠️
src/zarr/codecs/gzip.py	35.48%	20 Missing ⚠️
... and 6 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3276      +/-   ##
==========================================
- Coverage   59.56%   58.15%   -1.42%     
==========================================
  Files          78       79       +1     
  Lines        8684     9057     +373     
==========================================
+ Hits         5173     5267      +94     
- Misses       3511     3790     +279

Files with missing lines	Coverage Δ
src/zarr/core/codec_pipeline.py	`62.61% <ø> (ø)`
src/zarr/codecs/_v2.py	`0.00% <0.00%> (-59.58%)`	⬇️
src/zarr/core/metadata/v3.py	`54.48% <62.50%> (+0.85%)`	⬆️
src/zarr/core/array.py	`68.45% <59.09%> (-0.58%)`	⬇️
src/zarr/core/common.py	`35.48% <0.00%> (-2.45%)`	⬇️
src/zarr/core/metadata/v2.py	`56.96% <59.37%> (+0.29%)`	⬆️
src/zarr/registry.py	`65.71% <70.21%> (+1.68%)`	⬆️
src/zarr/abc/codec.py	`20.45% <16.66%> (-0.76%)`	⬇️
src/zarr/codecs/crc32c_.py	`32.72% <25.92%> (-8.45%)`	⬇️
src/zarr/codecs/gzip.py	`30.00% <35.48%> (+1.42%)`	⬆️
... and 7 more

d-v-b added 30 commits July 20, 2025 20:22

modernize typing

32e60d2

lint

f104f27

new dtypes

be6dedd

rename base dtype, change type to kind

f0dfbbf

start working on JSON serialization

06db4f6

get json de/serialization largely working, and start making tests pass

2bb4707

tweak json type guards

edcb7eb

fix dtype sizes, adjust fill value parsing in from_dict, fix tests

3fd0bf8

mid-refactor commit

404a71c

working form for dtype classes

aaeeb98

remove unused code

ec934b8

use wrap / unwrap instead of to_dtype / from_dtype; push into v2 codebase

8369ffc

push into v2

0aa1e49

remove endianness kwarg to methods, make it an instance variable instead

de24a14

make wrapping safe by default

31a39d6

dtype-specific tests

2079efe

more tests, fix void type default value logic

46a761b

fix dtype mechanics in bytescodec

3507eff

remove __post_init__ magic in favor of more explicit declaration

53205ca

fix tests

ba9c06e

refactor data types

04f3b84

start design doc

925b9e2

more design doc

e2fce7f

update docs

a583cd3

fix sphinx warnings

e0b662d

tweak docs

ed0c76b

info about v3 data types

79a8fd2

adjust note

5e15369

fix: use unparametrized types in direct assignment

14da662

start fixing config

a050f3b

d-v-b added 25 commits July 20, 2025 20:23

remove vestigial use of to_dtype().itemsize()

2b725ee

remove another vestigial use of to_dtype().itemsize()

03259c6

emit warning about unstable dtype when serializing Structured dtype to JSON

9a87b3d

put string dtypes in the strings module

de76df0

make tests isomorphic to source code

b4f1063

remove old string logic

7b6c78c

use scale_factor and unit in cast_value for datetime

63ad7f5

add regression testing against v2.18

e0b5a64

truncate U and S scalars in _cast_value_unsafe

6437c8d

docstrings and simplification for regression tests

d9ab8da

changes necessary for linting with regression tests

3302161

improve method names, refactor type hints with typeddictionaries, fix registry load frequency, add object_codec_id for v2 json deserialization

4a301d9

fix storage info discrepancy in docs

12bbb07

fix docstring that was troubling sphinx

463789b

wip: add vlen-bytes

e665cef

add vlen-bytes

35116af

wip

73c3c45

wip

6295578

add image codecs test

64f234e

wip

6eb3298

pass tests

e463d0a

expand example

2cfc848

revert to main

60939c2

recover from bad rebase

31c95ca

remove off-target changes

a2bc655

github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 21, 2025

d-v-b changed the title ~~[do not merge] proof of concept for a unified v2 / v3 codecs~~ [do not merge] proof of concept for unified v2 / v3 codecs Jul 21, 2025

d-v-b mentioned this pull request Jul 21, 2025

a codec simplification plan #3162

Open

update imagecodecs example

50c6b48
