[CW/R2] Tensorstore Malformed StorageGeneration #4373

@ravwojdyla-agent


Describe the bug

The cw-ci-test integration test fails with a tensorstore Malformed StorageGeneration error when reading tokenized zarr3 data from R2 (S3-compatible) storage. The failure is in the quickstart-tests/tokenized_2fdf2156 step of the marin-itest pipeline.

This failure is unrelated to the PR under test (#4370, K8s stream_logs OOM fix). It appears to be a flaky interaction between tensorstore and R2.

To Reproduce

  1. Push any PR that triggers the cw-ci-test workflow.
  2. The marin-itest pipeline submits a quickstart tokenization job.
  3. A subsequent step tries to open the tokenized zarr3 array and fails with Malformed StorageGeneration.

Expected behavior

The tokenized zarr3 array should be readable after the tokenization step completes.

Additional context

Error from CI run:

ValueError: Error opening "zarr3" driver: Error reading
  "temp/ci/marin-itest-22712054/quickstart-tests/tokenized-20383a/train/part-00000-of-00001/input_ids/offsets/zarr.json":
  Malformed StorageGeneration

The tensorstore spec uses "recheck_cached_data": false, which may cause stale cache entries to collide with freshly written data. The kvstore driver is s3 pointed at the R2 endpoint (74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com, bucket marin-na).
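For reference, the spec described above might look roughly like the sketch below. This is a reconstruction from the error message and the details in this issue, not the spec copied from the pipeline: the helper name build_zarr3_r2_spec is hypothetical, the https:// scheme on the endpoint is an assumption, and any credentials or region settings the real job passes are omitted. The helper only builds a plain dict, so it runs without tensorstore installed.

```python
import json


def build_zarr3_r2_spec(endpoint, bucket, path, recheck_cached_data=False):
    """Build a tensorstore-style spec dict for a zarr3 array on S3-compatible storage.

    Hypothetical helper for illustration; the dict is not validated against
    tensorstore here.
    """
    return {
        "driver": "zarr3",
        "kvstore": {
            "driver": "s3",        # S3 kvstore driver, pointed at R2 via `endpoint`
            "endpoint": endpoint,  # assumption: scheme-qualified R2 endpoint URL
            "bucket": bucket,
            "path": path,
        },
        # Setting reported in this issue; False means cached data is not revalidated.
        "recheck_cached_data": recheck_cached_data,
    }


spec = build_zarr3_r2_spec(
    endpoint="https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com",
    bucket="marin-na",
    path=(
        "temp/ci/marin-itest-22712054/quickstart-tests/tokenized-20383a/"
        "train/part-00000-of-00001/input_ids/offsets/"
    ),
)
print(json.dumps(spec, indent=2))
```

One experiment worth trying is passing recheck_cached_data="open" (revalidate at open time) instead of False, to see whether the failure still reproduces. That would help separate the caching hypothesis from the ETag one.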

Possible causes:

  • R2 returning an unexpected ETag format that tensorstore cannot parse as a StorageGeneration.
  • A race between the writer finishing and the reader opening the array when caching is aggressive (recheck_cached_data: false).
  • Transient R2 API inconsistency during the CI window.
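If the race or the transient-inconsistency hypothesis holds, a retry around the open call may be enough to unblock CI while the root cause is investigated. The sketch below is a possible workaround, not a fix and not existing project code: open_with_retry is a hypothetical helper, and open_fn stands in for whatever callable opens the array (for example a lambda wrapping the tensorstore open call). It retries only when the ValueError message contains "Malformed StorageGeneration", matching the error text above, and re-raises everything else immediately.

```python
import time


def open_with_retry(open_fn, attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call open_fn(), retrying on the 'Malformed StorageGeneration' ValueError.

    Hypothetical helper: open_fn is any zero-argument callable that opens the
    array. Other errors propagate on the first occurrence.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except ValueError as err:
            if "Malformed StorageGeneration" not in str(err):
                raise  # unrelated failure: do not mask it
            last_error = err
            if attempt < attempts:
                sleep(delay_s * attempt)  # linear backoff between attempts
    raise last_error
```

If the error persists across every attempt, that points away from a transient race and toward the ETag-format hypothesis, since a malformed generation value would fail deterministically.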

Labels

agent-generated, bug
