


nbren12 commented Oct 30, 2025

Writing to sharded arrays was up to 10x slower for large-ish chunk sizes because the _ShardBuilder object makes many calls to np.concatenate, each of which re-copies the shard data accumulated so far. This commit coalesces these into a single concatenate call, and improves write performance by a factor of 10 on the benchmarking script in #3560.

Added a new core.Buffer.combine API

Resolves #3560
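
For context, here is a minimal sketch of the idea (a simplified stand-in Buffer; the actual core.Buffer.combine signature is inferred from the call site in the discussion below and may differ):

```python
import numpy as np


class Buffer:
    """Simplified stand-in for zarr.core.buffer.Buffer, for illustration only."""

    def __init__(self, data) -> None:
        self._data = np.asarray(data, dtype=np.uint8)

    def combine(self, others: list["Buffer"]) -> "Buffer":
        # One concatenate over all pending chunk buffers, instead of growing
        # the shard buffer chunk by chunk (which re-copies everything each time).
        return Buffer(np.concatenate([self._data] + [o._data for o in others]))


# Old pattern (quadratic copying):  buf = concat(buf, chunk), once per chunk
# New pattern (one linear copy):    buf = buf.combine(all_chunk_buffers), once per shard
```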

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Writing to sharded arrays was up to 10x slower for largish chunk sizes
because the _ShardBuilder object has many calls to np.concatenate. This
commit coalesces these into a single concatenate call, and improves write
performance by a factor of 10 on the benchmarking script in zarr-developers#3560.

Added a new core.Buffer.combine API

Resolves zarr-developers#3560

Signed-off-by: Noah D. Brenowitz <[email protected]>
github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Oct 30, 2025

nbren12 commented Oct 30, 2025

Benchmarking results
Before:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev34+gb3e9aed30
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5280 ± 0.0161 seconds
  Throughput: 722.44 MB/s
  Times: ['0.5442', '0.5119']

Read (no sharding, full):
  Mean time: 0.3106 ± 0.0060 seconds
  Throughput: 1228.23 MB/s
  Times: ['0.3165', '0.3046']

Read (no sharding, partial):
  Mean time: 0.0267 ± 0.0001 seconds
  Throughput: 142.88 MB/s
  Times: ['0.0266', '0.0268']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 4.2136 ± 0.1268 seconds
  Throughput: 90.53 MB/s
  Times: ['4.3404', '4.0867']

Read (with sharding, full):
  Mean time: 0.3533 ± 0.0036 seconds
  Throughput: 1079.64 MB/s
  Times: ['0.3569', '0.3498']

Read (with sharding, partial):
  Mean time: 0.0317 ± 0.0003 seconds
  Throughput: 120.33 MB/s
  Times: ['0.0320', '0.0314']

================================================================================
COMPARISON
================================================================================
Write: 0.13x speedup with sharding (or 7.98x slower)
Read: 0.88x speedup with sharding (or 1.14x slower)

================================================================================
Benchmark complete!
================================================================================

After:

================================================================================
Zarr Sharding Benchmark
================================================================================
Zarr version: 3.1.4.dev34+gb3e9aed30
Array shape: (1000, 100000)
Chunks: (1, 100000)
Shard chunks: (100, 1000000)
Data type: <class 'numpy.float32'>
Total data size: 381.47 MB
Number of iterations: 2
================================================================================

Generating test data...
Test data generated: 381.47 MB

================================================================================
BENCHMARK: Without Sharding
================================================================================

Write (no sharding):
  Mean time: 0.5144 ± 0.0080 seconds
  Throughput: 741.65 MB/s
  Times: ['0.5223', '0.5064']

Read (no sharding, full):
  Mean time: 0.3128 ± 0.0018 seconds
  Throughput: 1219.45 MB/s
  Times: ['0.3147', '0.3110']

Read (no sharding, partial):
  Mean time: 0.0287 ± 0.0002 seconds
  Throughput: 133.12 MB/s
  Times: ['0.0288', '0.0285']

================================================================================
BENCHMARK: With Sharding
================================================================================

Write (with sharding):
  Mean time: 0.6107 ± 0.0017 seconds
  Throughput: 624.61 MB/s
  Times: ['0.6125', '0.6090']

Read (with sharding, full):
  Mean time: 0.2908 ± 0.0030 seconds
  Throughput: 1311.91 MB/s
  Times: ['0.2938', '0.2877']

Read (with sharding, partial):
  Mean time: 0.0238 ± 0.0006 seconds
  Throughput: 160.59 MB/s
  Times: ['0.0244', '0.0231']

================================================================================
COMPARISON
================================================================================
Write: 0.84x speedup with sharding (or 1.19x slower)
Read: 1.08x speedup with sharding (or 0.93x slower)

================================================================================
Benchmark complete!
================================================================================


d-v-b commented Oct 30, 2025

thanks so much for tackling this problem!


nbren12 commented Oct 30, 2025

hmmm, lots of failing tests. I'm having some trouble grokking the implementation of the sharding codec.


d-v-b commented Oct 30, 2025

I'll have a look. I don't know this section of the codebase very well, but maybe I can figure something out. And BTW if you see any opportunities to refactor / simplify the logic here, feel free to move in that direction. I think making this code easier to understand would be a big win.


d-v-b commented Oct 30, 2025

diff --git a/src/zarr/codecs/sharding.py b/src/zarr/codecs/sharding.py
index 0e36f50f..31d68027 100644
--- a/src/zarr/codecs/sharding.py
+++ b/src/zarr/codecs/sharding.py
@@ -227,6 +227,7 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
     buffers: list[Buffer]
     index: _ShardIndex
     buf: Buffer
+    _chunk_buffers: dict[tuple[int, ...], Buffer]  # Map chunk_coords to buffers
 
     @classmethod
     def merge_with_morton_order(
@@ -255,13 +256,24 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
         obj = cls()
         obj.buf = buffer_prototype.buffer.create_zero_length()
         obj.buffers = []
+        obj._chunk_buffers = {}
         obj.index = _ShardIndex.create_empty(chunks_per_shard)
         return obj
 
+    def __getitem__(self, chunk_coords: tuple[int, ...]) -> Buffer:
+        # Override to use _chunk_buffers instead of self.buf
+        if chunk_coords in self._chunk_buffers:
+            return self._chunk_buffers[chunk_coords]
+        raise KeyError
+
     def __setitem__(self, chunk_coords: tuple[int, ...], value: Buffer) -> None:
         chunk_start = sum(len(buf) for buf in self.buffers)
         chunk_length = len(value)
-        self.buffers.append(value)
+        # Store the buffer for later retrieval
+        self._chunk_buffers[chunk_coords] = value
+        # Only append non-empty buffers to avoid messing up offset calculations
+        if chunk_length > 0:
+            self.buffers.append(value)
         self.index.set_chunk_slice(chunk_coords, slice(chunk_start, chunk_start + chunk_length))
 
     def __delitem__(self, chunk_coords: tuple[int, ...]) -> None:
@@ -280,7 +292,8 @@ class _ShardBuilder(_ShardReader, ShardMutableMapping):
             self.buffers.insert(0, index_bytes)
         else:
             self.buffers.append(index_bytes)
-        self.buf = self.buf.combine(self.buffers)

makes this work for me locally


d-v-b commented Oct 30, 2025

@nbren12 please take a look at https://github.com/d-v-b/zarr-python/blob/chore/sharding-refactor/src/zarr/codecs/sharding.py, it's a refactor of the sharding logic that might be easier to reason about.

remove inheritance, hide the index attribute, and remove some indirection

Signed-off-by: Noah D. Brenowitz <[email protected]>
just use dicts

Signed-off-by: Noah D. Brenowitz <[email protected]>

nbren12 commented Oct 30, 2025

Thanks @d-v-b. I ended up pursuing an alternative refactor. I removed the various shard builder objects and things seem to be working now.
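
For illustration, the dict-based assembly looks roughly like this (a hypothetical helper; the real codec also encodes chunks and serializes the shard index into the shard itself):

```python
import numpy as np


def build_shard(chunk_bytes: dict[tuple[int, ...], bytes]) -> tuple[bytes, dict]:
    """Assemble a shard from per-chunk encoded bytes with a single copy.

    Hypothetical sketch: returns the shard payload and a map of
    chunk coords -> (offset, length) that would feed the shard index.
    """
    offsets: dict[tuple[int, ...], tuple[int, int]] = {}
    parts: list[np.ndarray] = []
    pos = 0
    for coords, data in chunk_bytes.items():
        offsets[coords] = (pos, len(data))
        parts.append(np.frombuffer(data, dtype=np.uint8))
        pos += len(data)
    # A single concatenate replaces the per-chunk appends of the old builder.
    shard = np.concatenate(parts).tobytes() if parts else b""
    return shard, offsets
```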


nbren12 commented Oct 30, 2025

@d-v-b Hopefully my latest commit fixes the tests.

For a future PR, the partial write case could definitely use further optimization. When the index is stored at the end, it should be possible to write just the new chunks rather than rewriting the whole shard.


codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (b3e9aed) to head (a32e86a).

Files with missing lines        Patch %   Lines
src/zarr/codecs/sharding.py     86.11%    5 Missing ⚠️
src/zarr/core/buffer/core.py     0.00%    3 Missing ⚠️
src/zarr/core/buffer/cpu.py     85.71%    1 Missing ⚠️
src/zarr/core/buffer/gpu.py     88.88%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3561      +/-   ##
==========================================
+ Coverage   61.77%   61.87%   +0.10%     
==========================================
  Files          85       85              
  Lines       10165    10134      -31     
==========================================
- Hits         6279     6270       -9     
+ Misses       3886     3864      -22     
Files with missing lines        Coverage           Δ
src/zarr/core/buffer/cpu.py     41.37% <85.71%>    (+3.19%) ⬆️
src/zarr/core/buffer/gpu.py     43.37% <88.88%>    (+2.12%) ⬆️
src/zarr/core/buffer/core.py    30.34% <0.00%>     (-0.65%) ⬇️
src/zarr/codecs/sharding.py     62.10% <86.11%>    (+3.02%) ⬆️
