diff --git a/README.md b/README.md new file mode 100644 index 0000000..4c7bafa --- /dev/null +++ b/README.md @@ -0,0 +1,4 @@ +# Binary Sparse Format Specification +This is part of a new effort to create a binary storage format for storing sparse matrices and other sparse data to disk. + +[Minutes from our meetings](minutes) are available. diff --git a/minutes/aug-10-2022.md b/minutes/aug-10-2022.md new file mode 100644 index 0000000..7da8b84 --- /dev/null +++ b/minutes/aug-10-2022.md @@ -0,0 +1,86 @@ +# Binary Sparse Format Working Group Meeting #1 - August 10, 2022 + +## Attendees +- [X] Ben Brock +- [X] Tim Davis +- [X] Jim Kitchen +- [X] Tim Mattson +- [ ] Scott McMillan +- [X] Erik Welch +- [X] Isaac Virshup + +## Agenda + +- Discuss the possibility of creating a binary storage format for sparse data. +- Establish a general course of direction for creating a spec, enumerate desired features. + +## Summary of Discussion + +It was pointed out that there are two components that need to be created, based +on the proposed approach in Erik Welch's HPEC keynote, which proposed JSON metadata +segment describing the sparse matrix layout combined with a series of binary +blobs. + +1. The specification for valid metadata. This includes supported formats (CSR, + COO, Dense, etc.), as well as which types of data are supported (vectors, matrices, + n-dimensional tensors) and the specific syntax to be used. + +2. How this metadata maps on to the stored binary blobs, as well as how they are + actually stored. + This gets into implementation details (are we using HDF5, Zarr, or some other + storage format). + +We began by enumerating some of the storage options available. + +- HDF5 +- Zarr +- ASDF +- Possibly others + +As well as the features we would like to see supported by any future spec. +We established that a version 1.0 of any spec would likely support storing: + +- Vectors, presumably with sparse and dense formats +- Matrices with a variety of sparse storage formats +- N-D Tensors, initially limited to COO, but likely CSF in future versions + +The following matrix formats were discussed: +- COO +- CSR +- CSR +- Hyper CSR (DCSR) +- Hyper COO (DCSC) +- ISO or single-value matrix +- Dense mask (?) + +We discussed various implementation complexities of a dense mask format, which +may leave it to be left out of an initial spec. + +We discussed the important of supporting blocked formats, which could potentially +allow for very efficient loading and storing on distributed memory systems. + +We discussed the complexities of supporting user-defined (and even built-in) types. +We concluded that we should depend on a binary storage library like HDF5 to handle +this for us. Some storage formats, like potentially Zarr, may allow the use of +user-defined types. + +We discussed the need to support different varieties of index types, at least +encompassing C integer types. + +We discussed TileDB's sparse storage format, acknowledging that we should perhaps +take its design into consideration, but that certain aspects are not compatible +with our vision for supporting multiple sparse matrix formats. + +Ben expressed interest in developing a prototype implementation for C++. Jim Kitchen +expressed interest in expanding on his prototype implementation in Python. Tim Davis +expressed interest in supporting a cross-platform binary spec in SuiteSparse and +possibly providing the format in the SuiteSparse Collection once the spec becomes mature. + +## Outcomes +- Ben to open new GitHub repo for developing and discussing binary sparse format specification. + +- We will begin by agreeing on a feature set on GitHub, then working on the spec + as prototype implementations develop. + +- The calendar invite will be updated to once monthly, with the possibility of + meeting sooner if there is progress on the GitHub.