Skip to content

Latest commit

 

History

History
86 lines (65 loc) · 3.2 KB

2022-08-10.md

File metadata and controls

86 lines (65 loc) · 3.2 KB

Binary Sparse Format Working Group Meeting #1 - August 10, 2022

Attendees

  • Ben Brock
  • Tim Davis
  • Jim Kitchen
  • Tim Mattson
  • Scott McMillan
  • Erik Welch
  • Isaac Virshup

Agenda

  • Discuss the possibility of creating a binary storage format for sparse data.
  • Establish a general course of direction for creating a spec, enumerate desired features.

Summary of Discussion

It was pointed out that there are two components that need to be created, based on the proposed approach in Erik Welch's HPEC keynote, which proposed JSON metadata segment describing the sparse matrix layout combined with a series of binary blobs.

  1. The specification for valid metadata. This includes supported formats (CSR, COO, Dense, etc.), as well as which types of data are supported (vectors, matrices, n-dimensional tensors) and the specific syntax to be used.

  2. How this metadata maps on to the stored binary blobs, as well as how they are actually stored. This gets into implementation details (are we using HDF5, Zarr, or some other storage format).

We began by enumerating some of the storage options available.

  • HDF5
  • Zarr
  • ASDF
  • Possibly others

As well as the features we would like to see supported by any future spec. We established that a version 1.0 of any spec would likely support storing:

  • Vectors, presumably with sparse and dense formats
  • Matrices with a variety of sparse storage formats
  • N-D Tensors, initially limited to COO, but likely CSF in future versions

The following matrix formats were discussed:

  • COO
  • CSR
  • CSR
  • Hyper CSR (DCSR)
  • Hyper COO (DCSC)
  • ISO or single-value matrix
  • Dense mask (?)

We discussed various implementation complexities of a dense mask format, which may leave it to be left out of an initial spec.

We discussed the important of supporting blocked formats, which could potentially allow for very efficient loading and storing on distributed memory systems.

We discussed the complexities of supporting user-defined (and even built-in) types. We concluded that we should depend on a binary storage library like HDF5 to handle this for us. Some storage formats, like potentially Zarr, may allow the use of user-defined types.

We discussed the need to support different varieties of index types, at least encompassing C integer types.

We discussed TileDB's sparse storage format, acknowledging that we should perhaps take its design into consideration, but that certain aspects are not compatible with our vision for supporting multiple sparse matrix formats.

Ben expressed interest in developing a prototype implementation for C++. Jim Kitchen expressed interest in expanding on his prototype implementation in Python. Tim Davis expressed interest in supporting a cross-platform binary spec in SuiteSparse and possibly providing the format in the SuiteSparse Collection once the spec becomes mature.

Outcomes

  • Ben to open new GitHub repo for developing and discussing binary sparse format specification.

  • We will begin by agreeing on a feature set on GitHub, then working on the spec as prototype implementations develop.

  • The calendar invite will be updated to once monthly, with the possibility of meeting sooner if there is progress on the GitHub.