[CW/R2] Tensorstore Malformed StorageGeneration #4373
Description
Describe the bug
The cw-ci-test integration test fails with a tensorstore Malformed StorageGeneration error when reading tokenized zarr3 data from R2 (S3-compatible) storage. The failure is in the quickstart-tests/tokenized_2fdf2156 step of the marin-itest pipeline.
This failure is unrelated to the PR under test (#4370, K8s stream_logs OOM fix); it appears to be a flaky interaction between tensorstore and R2.
To Reproduce
- Push any PR that triggers the cw-ci-test workflow.
- The marin-itest pipeline submits a quickstart tokenization job.
- A subsequent step tries to open the tokenized zarr3 array and fails with Malformed StorageGeneration.
Expected behavior
The tokenized zarr3 array should be readable after the tokenization step completes.
Additional context
Error from CI run:
ValueError: Error opening "zarr3" driver: Error reading
"temp/ci/marin-itest-22712054/quickstart-tests/tokenized-20383a/train/part-00000-of-00001/input_ids/offsets/zarr.json":
Malformed StorageGeneration
The tensorstore spec uses "recheck_cached_data": false, which may cause stale cache entries to collide with freshly written data. The kvstore driver is s3 pointed at the R2 endpoint (74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com, bucket marin-na).
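For reference, a minimal sketch of what such a spec might look like, plus a variant that revalidates cached data on open. The endpoint, bucket, and path are copied from this issue; the `https://` scheme and the `build_spec` helper are assumptions for illustration, and the real spec used by the pipeline may contain more fields.

```python
import json

# Values copied from the issue text; the "https://" scheme is an assumption.
R2_ENDPOINT = "https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com"
BUCKET = "marin-na"
ARRAY_PATH = (
    "temp/ci/marin-itest-22712054/quickstart-tests/tokenized-20383a/"
    "train/part-00000-of-00001/input_ids/offsets/"
)


def build_spec(path, recheck_cached_data=False):
    """Return a tensorstore-style spec dict for a zarr3 array on R2.

    Hypothetical helper: it only builds the JSON spec and does not touch storage.
    """
    return {
        "driver": "zarr3",
        "kvstore": {
            "driver": "s3",
            "bucket": BUCKET,
            "endpoint": R2_ENDPOINT,
            "path": path,
        },
        "recheck_cached_data": recheck_cached_data,
    }


# Spec matching the setting reported in this issue.
failing_spec = build_spec(ARRAY_PATH, recheck_cached_data=False)
# Variant that asks tensorstore to revalidate cached data when the array is opened.
fresh_spec = build_spec(ARRAY_PATH, recheck_cached_data="open")

print(json.dumps(fresh_spec, indent=2))
```

In an environment with tensorstore installed and R2 credentials configured, either dict could be passed to `tensorstore.open(spec).result()` to compare behavior. Because the failing key is a zarr.json metadata file, the separate `recheck_cached_metadata` option may also be worth checking.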
Possible causes:
- R2 returning an unexpected ETag format that tensorstore cannot parse as a StorageGeneration.
- A race between the writer finishing and the reader opening the array when caching is aggressive (recheck_cached_data: false).
- Transient R2 API inconsistency during the CI window.
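One way to tell the race and transient-inconsistency causes apart from a persistent ETag parsing problem is to retry the open with backoff and see whether a later attempt succeeds. The sketch below is a diagnostic aid under that assumption, not a proposed fix; `open_with_retry` and its parameters are hypothetical, and `open_fn` stands in for whatever call opens the array in the pipeline.

```python
import time


def open_with_retry(open_fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call open_fn(), retrying only on the 'Malformed StorageGeneration' ValueError.

    Hypothetical helper: open_fn is any zero-argument callable that opens the array.
    Other errors, and the final failed attempt, are re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except ValueError as exc:
            transient = "Malformed StorageGeneration" in str(exc)
            if not transient or attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

If a retry succeeds after a short delay, that points toward a writer/reader race or transient R2 behavior. If every attempt fails with the same error, the ETag format explanation becomes more plausible.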