@BrianMichell BrianMichell commented Dec 4, 2025

Resolves #241

This PR adds support for structured data types to the zarr3 driver, along with an `open_as_void` option that exposes the full array as raw bytes so that client code does not need to manage multiple Stores. I have also implemented `open_as_void` in the zarr driver for feature parity between the two formats. I am happy to revisit the question raised in #125 about more general Zarr support for this behavior.

Below is an example workflow that this change supports.

The Python snippet generates a structured array whose elements contain several fields of different types and byte orders. A real-world example is an array that stores the 240-byte SEG-Y trace headers.

Python example:
import numpy as np
import zarr

store = "foo.zarr"

np_dtype = np.dtype(
    [
        ("field_1", "<i4"),
        ("field_2", ">i4"),
        # More theoretical fields here...
        ("field_n", "<f4"),
    ]
)

z = zarr.create_array(
    store=store,
    shape=(128, 128),
    dtype=np_dtype,
    chunks=(32, 32),
)

arr = np.zeros((128, 128), dtype=np_dtype)

f1 = np.arange(128, dtype="<i4")
# NOTE: f2 is built as little-endian for the demonstration; NumPy converts
# the values to big-endian (">i4") when they are assigned to field_2 below.
f2 = np.arange(128, dtype="<i4")
fn = np.arange(128, dtype="<f4") / 10

arr["field_1"][:] = f1[:, None]
arr["field_2"][:] = f2[:, None]
arr["field_n"][:] = fn[:, None]

z[:] = arr
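
The C++ example later in this description reads `field_2` from the raw bytes at a fixed offset of 4. As a minimal sketch of where that number comes from, the snippet below recomputes the packed layout of the three fields shown above with Python's standard `struct` module. The `FIELDS` table and the `packed_layout` helper are hypothetical names used only for illustration, and the sketch assumes the dtype is packed without padding (the NumPy default for a dtype built from a list of tuples).

```python
import struct

# (name, struct format) pairs mirroring the dtype above.
# "<" is little-endian, ">" is big-endian; "i" is int32, "f" is float32.
FIELDS = [("field_1", "<i"), ("field_2", ">i"), ("field_n", "<f")]


def packed_layout(fields):
    """Return ({name: byte offset}, record size) for a packed record."""
    offsets = {}
    cursor = 0
    for name, fmt in fields:
        offsets[name] = cursor
        cursor += struct.calcsize(fmt)
    return offsets, cursor


offsets, record_bytes = packed_layout(FIELDS)
print(offsets, record_bytes)
```

With only these three fields, each record occupies 12 bytes and `field_2` starts at byte 4. Adding fields in the elided middle of the dtype would change the record size and the offset of `field_n`.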

The use case continues with an HPC C++ application that consumes either a single field of the structured dtype or the entire record to perform its workload. To keep the example simple, it reads the same field (`field_2`) twice: once through field access and once out of the raw bytes. A simple data loader could reasonably extend the raw-byte approach to 70+ fields per element.
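
As a rough sketch of what such a table-driven loader might look like, the snippet below decodes every field of each record from a packed byte buffer. It uses only Python's `struct` module and synthetic data; `FIELD_TABLE`, `RECORD_BYTES`, `decode_record`, and `encode_record` are hypothetical names, and the buffer merely stands in for bytes obtained through `open_as_void`.

```python
import struct

# Hypothetical field table for the three-field example dtype:
# (name, byte offset within the record, struct format).
FIELD_TABLE = [
    ("field_1", 0, "<i"),  # little-endian int32
    ("field_2", 4, ">i"),  # big-endian int32
    ("field_n", 8, "<f"),  # little-endian float32
]
RECORD_BYTES = 12  # packed size of one record


def decode_record(buf, index):
    """Decode every field of record `index` from a packed byte buffer."""
    base = index * RECORD_BYTES
    return {
        name: struct.unpack_from(fmt, buf, base + offset)[0]
        for name, offset, fmt in FIELD_TABLE
    }


def encode_record(f1, f2, fn):
    """Build one packed record; used here only to make synthetic data."""
    return struct.pack("<i", f1) + struct.pack(">i", f2) + struct.pack("<f", fn)


# Two synthetic records standing in for bytes read through open_as_void.
raw = encode_record(1, 2, 0.5) + encode_record(3, 4, 1.5)
records = [decode_record(raw, i) for i in range(len(raw) // RECORD_BYTES)]
print(records)
```

Extending the loader to more fields only requires more rows in the table, which is why a 70+ field header stays manageable.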

C++ example:
// Minimal example demonstrating structured Zarr v3 data reading.
// This example reads the 'foo.zarr' output generated by generate_struct.py
//
// Usage:
//   ./read_foo_zarr --zarr_path=/path/to/foo.zarr
#include <stdint.h>

#include <algorithm>  // std::min
#include <cstring>
#include <iostream>
#include <string>
#include <utility>  // std::move

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"
#include "absl/status/status.h"
#include <nlohmann/json.hpp>
#include "tensorstore/array.h"
#include "tensorstore/context.h"
#include "tensorstore/data_type.h"
#include "tensorstore/index.h"
#include "tensorstore/open.h"
#include "tensorstore/open_mode.h"
#include "tensorstore/spec.h"
#include "tensorstore/tensorstore.h"
#include "tensorstore/util/result.h"
#include "tensorstore/util/status.h"


ABSL_FLAG(std::string, zarr_path,
          "/foo.zarr",
          "Path to the foo.zarr directory");

namespace {

using ::tensorstore::Index;

// Helper function to read and display data from a tensorstore
// T: the logical value type you want to interpret the data as
// offset_bytes:
//   -1 (default): interpret array.data() as a T* over the logical elements
//   >=0        : treat array.data() as a byte buffer and start reading T
//                values at `offset_bytes`, stepping by sizeof(T).
template <typename T>
absl::Status ReadAndDisplayData(const tensorstore::TensorStore<>& store,
                                const std::string& description,
                                Index offset_bytes = -1) {
  std::cout << "\n=== " << description << " ===" << std::endl;

  // Get array information
  auto domain = store.domain();
  std::cout << "Domain: " << domain << std::endl;
  std::cout << "Data type: " << store.dtype() << std::endl;

  // Read all data
  TENSORSTORE_ASSIGN_OR_RETURN(
      auto array,
      tensorstore::Read<tensorstore::zero_origin>(store).result());

  std::cout << "Successfully read array with " << array.num_elements()
            << " elements" << std::endl;

  auto shape = domain.shape();
  if (shape.size() < 2) {
    std::cout << "Rank < 2, skipping pretty-print of first 2D block."
              << std::endl;
    return absl::OkStatus();
  }

  Index rows = std::min(shape[0], Index{5});
  Index cols = std::min(shape[1], Index{5});

  std::cout << "First " << rows << "x" << cols << " elements";
  if (offset_bytes >= 0) {
    std::cout << " starting at byte offset " << offset_bytes;
  }
  std::cout << " interpreted as " << sizeof(T) * 8 << "-bit values:"
            << std::endl;

  // No offset: “normal” interpretation as T[]
  if (offset_bytes < 0) {
    const T* data = reinterpret_cast<const T*>(array.data());

    for (Index i = 0; i < rows; ++i) {
      for (Index j = 0; j < cols; ++j) {
        Index idx = i * shape[1] + j;
        if (idx >= array.num_elements()) break;
        std::cout << data[idx] << "\t";
      }
      std::cout << std::endl;
    }

    return absl::OkStatus();
  }

  // Offset mode: interpret as raw bytes and then per-record struct.
  const auto* bytes = reinterpret_cast<const uint8_t*>(array.data());

  // Total bytes in buffer (works for any dtype).
  std::size_t dtype_size = store.dtype().size();
  Index total_bytes =
      static_cast<Index>(array.num_elements()) *
      static_cast<Index>(dtype_size == 0 ? 1 : dtype_size);

  if (total_bytes == 0) {
    std::cout << "[empty buffer]" << std::endl;
    return absl::OkStatus();
  }

  // Determine "record" size in bytes.
  //
  // For open_as_void on a structured array, we get something like:
  //   shape = [N0, N1, record_bytes]
  //   dtype = byte
  //
  // In that case, treat the last dimension as the record layout.
  Index record_bytes = static_cast<Index>(store.dtype().size());
  if (store.dtype() == tensorstore::dtype_v<tensorstore::dtypes::byte_t> &&
      shape.size() >= 3) {
    record_bytes = shape.back();  // e.g. 12 bytes for the three-field dtype
  }

  if (offset_bytes < 0 || offset_bytes >= record_bytes) {
    std::cout << "[offset outside record size (" << record_bytes
              << " bytes); nothing to display]" << std::endl;
    return absl::OkStatus();
  }

  for (Index i = 0; i < rows; ++i) {
    for (Index j = 0; j < cols; ++j) {
      Index record_index_2d = i * shape[1] + j;
      Index base_offset =
          record_index_2d * record_bytes + offset_bytes;

      if (base_offset + static_cast<Index>(sizeof(T)) > total_bytes) {
        std::cout << "[out]\t";
        continue;
      }

      T value{};
      std::memcpy(&value,
                  bytes + base_offset,
                  sizeof(T));
      std::cout << value << "\t";
    }
    std::cout << std::endl;
  }

  return absl::OkStatus();
}


absl::Status Run(const std::string& zarr_path) {
  std::cout << "=== Structured Zarr v3 Example ===" << std::endl;
  std::cout << "Reading from: " << zarr_path << std::endl;

  auto context = tensorstore::Context::Default();

  // Open the structured Zarr v3 array with field access
  ::nlohmann::json spec = ::nlohmann::json::object();
  spec["driver"] = "zarr3";
  spec["kvstore"] = ::nlohmann::json::object();
  spec["kvstore"]["driver"] = "file";
  spec["kvstore"]["path"] = zarr_path + "/";
  spec["field"] = "field_2";

  auto open_result = tensorstore::Open(spec, context, tensorstore::OpenMode::open,
                                      tensorstore::ReadWriteMode::read).result();

  if (!open_result.ok()) {
    std::cout << "Failed to open structured array: " << open_result.status() << std::endl;
    return open_result.status();
  }

  auto store = std::move(open_result).value();
  TENSORSTORE_RETURN_IF_ERROR(ReadAndDisplayData<int32_t>(store, "Structured Array"));

  ::nlohmann::json void_spec = spec;
  void_spec["open_as_void"] = true;
  void_spec.erase("field");

  auto void_open_result = tensorstore::Open(void_spec, context, tensorstore::OpenMode::open,
                                           tensorstore::ReadWriteMode::read).result();

  if (!void_open_result.ok()) {
    std::cout << "Failed to open with open_as_void: " << void_open_result.status() << std::endl;
    return void_open_result.status();
  }

  auto void_store = std::move(void_open_result).value();

  constexpr Index kFieldOffsetBytes = 4;
  TENSORSTORE_RETURN_IF_ERROR(ReadAndDisplayData<int32_t>(void_store, "Raw Bytes (open_as_void)", kFieldOffsetBytes));

  return absl::OkStatus();
}

}  // namespace

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);

  std::string zarr_path = absl::GetFlag(FLAGS_zarr_path);
  if (zarr_path.empty()) {
    std::cerr << "Error: --zarr_path is required" << std::endl;
    return 1;
  }

  auto status = Run(zarr_path);
  if (!status.ok()) {
    std::cerr << "\nFinal status: " << status << std::endl;
    return 1;
  }

  return 0;
}
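
One detail worth checking in the raw-byte path: `field_2` is declared as `>i4`. If `open_as_void` exposes the bytes exactly as stored (an assumption; this description does not state it), a plain `memcpy` into `int32_t` on a little-endian host would yield byte-swapped values rather than the values returned by field access. The sketch below illustrates the difference with Python's `struct` module; it does not call TensorStore.

```python
import struct

# Stored bytes for field_2 == 1 under ">i4" (big-endian int32).
stored = struct.pack(">i", 1)

# What reinterpreting those bytes as a little-endian int32 produces,
# which is what a memcpy into int32_t sees on a little-endian host.
reinterpreted = struct.unpack("<i", stored)[0]

# Decoding with the declared byte order recovers the original value.
decoded = struct.unpack(">i", stored)[0]

print(stored.hex(), reinterpreted, decoded)
```

If the two outputs of the C++ example disagree for `field_2`, a byte swap after the `memcpy` for fields whose byte order differs from the host's would be the likely fix.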

Implement shim for `open_as_void` driver level flag
* Begin removing void field shim

* Fully removed void string shim

* Cleanup debug prints

* Remove shimmed validation

* Remove unnecessary comment

* Prefer false over zero for ternary clarity
* Implement a more general and portable example set

* Fix driver cache bug

* Update example for template

* Cleanup example

* Remove testing examples from source
* Use the appropriate fill value for open_as_void structured data

* Cleanup
@BrianMichell BrianMichell changed the title from "V3 structs" to "Add support for structured dtypes to zarr3 driver" Dec 4, 2025
@BrianMichell BrianMichell changed the title from "Add support for structured dtypes to zarr3 driver" to "Add support for structured dtypes to zarr3 driver, open zarr 2 and 3 structs as void" Dec 4, 2025
Development

Successfully merging this pull request may close these issues.

Zarr v3 Struct Support & Tensorstore
