- Author: Erik Welch
- Date: 2022-09-07
- Purpose: Strawman proposal to help discuss PR #3 and its proposal
- https://github.com/GraphBLAS/binsparse-specification
Let's call our binary sparse storage format binsparse
in this document.
For high level design, let's assume we will use existing data storage formats that can store metadata and arrays including the shape and data type of arrays:
flowchart TB
high[binsparse: high level metadata]-->mid[containers for metadata and arrays]-->low[(files)]
Example data storage formats:
flowchart TB
binsparse-->NetCDF4
binsparse-->HDF5
binsparse-->Zarr
binsparse-->n5
binsparse-->ASDF
binsparse<-->anndata
These containers are higher level than e.g. ZIP, LMDB, and Tar files, and handle data types, multidimensional arrays, endianness, compression, etc.
Our current design goal, then, is to determine the high level metadata and arrays for sparse arrays.
We should also consider using binsparse
as an embedded spec to be used as components in other storage formats.
flowchart LR
subgraph Container
a[embedded binsparse 'foo']
b[embedded binsparse 'bar']
c[dense array 'baz']
d[other stuff]
end
- Very limited scope
- Simple and opinionated with no options
- Easy to explain and understand
- Smooth transition to version 2.0
- Cover most of MatrixMarket format
- Sparse formats: COO, CSR, CSC, DCSR, DCSC, sparse vector (VEC)
- Indices must be lexicographically sorted by dim0, dim1
- No duplicate indices allowed
- Applies to COO
- Applies to row indices in DCSR
- May support dense and bitmap formats
- May allow attributes such as
is_symmetric
,ndiag
, etc.
Assume all metadata contains format type and version, such as:
{
"spec_version": "https://purl.org/binsparse/spec/core/1.0"
}
CSR | CSC | DCSR | DCSC | COO | VEC |
---|---|---|---|---|---|
{
"format": "CSR",
"shape": [m, n]
} |
{
"format": "CSC",
"shape": [m, n]
} |
{
"format": "DCSR",
"shape": [m, n]
} |
{
"format": "DCSC",
"shape": [m, n]
} |
{
"format": "COO",
"shape": [m, n]
} |
{
"format": "COO",
"shape": [n]
} |
|
|
|
|
|
|
Inluding some reactions from the September 7 meeting:
- Should we save datatypes or defer to the storage container?
- e.g.,
{"data_types": {"indices_0": "uint64", "values": "float64"}
- Jim says yes
- Keep v1 simple
- Researching data types is an action item
- If so, should we indicate endianness such as
"<i4"
?- Jim and Ben say no
- e.g.,
- Should we add
"COOR"
and"COOC"
(or"COO-R"
and"COO-C"
) formats that are sorted by row then column and column then row?- Jim, yes for COOR, COOC
- Should
"COO"
format be "special" and allowed to be unsorted and/or have duplicates?- Ben: COO defaults to COOR
- Should we always include
dim_types
(see below)?- Nah
- Should we add a field for arbitrary user metadata such as comments?
- Yes, a field such as
"attrs"
that can save key-value pairs - Would be nice to store SuiteSparse collection w/o losing any info
- Erik: follow Zarr's example and don't allow arbitrary key-value pairs--have a specified place for user metadata
- Yes, a field such as
- Should bitmap and full formats be in v1?
- Jim, Ben, Erik: gut reactions--no
- Isaac would like to consider memory-friendliness
Easy datatypes
int8 int16 int32 int64
uint8 uint16 uint32 uint64
float32 float64
Harder datatypes
bool
complex64
int128
uint128
float16
fixed-width UDTs (void or composed?)
strings or variable-length
datetimes
Single values array:
{
"format": "CSR",
"shape": [m, n],
"data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": "float64"},
}
Iso-value:
{
"format": "CSR",
"shape": [m, n],
"data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": "float64"},
"isovalue": 1.0
"data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": {"iso[float64"]: 1.0}},
}
No value, structure-only:
{
"format": "CSR",
"shape": [m, n],
"data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": null}
"is_structure_only": true
}
- Version 2.x is an extension of and can read version 1.x
- Support multidimensional sparse arrays
- Support duplicated indices
- common for COO
- Support unordered indices
- common for COO
- "jumbled" columns in SuiteSparse:GraphBLAS CSR and DCSR
- Support multiple value arrays, no value arrays, and iso-valued arrays
- Support per-dimension, per-values, and global attributes
- Allow extensions
CSR | CSC |
---|---|
{
"format": "CSR",
"shape": [m, n],
"dim_types": ["compressed", "sparse"],
"dim_order": [0, 1],
"dim_properties": {"1": {"is_ordered": true}},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["dense", "compressed"]
} |
{
"format": "CSC",
"shape": [m, n],
"dim_types": ["compressed", "sparse"],
"dim_order": [1, 0],
"dim_properties": {"1": {"is_ordered": true}},
"properties": {"has_duplicates": false},
"dim_sizes": [n, m],
"dim_taco": ["dense", "compressed"]
} |
DCSR | DCSC |
{
"format": "DCSR",
"shape": [m, n],
"dim_types": ["doubly_compressed", "sparse"],
"dim_order": [0, 1],
"dim_properties": {
"0": {"is_ordered": true, "is_unique": true},
"1": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["compressed", "compressed"]
} |
{
"format": "DCSC",
"shape": [m, n],
"dim_types": ["doubly_compressed", "sparse"],
"dim_order": [1, 0],
"dim_properties": {
"0": {"is_ordered": true, "is_unique": true},
"1": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [n, m],
"dim_taco": ["compressed", "compressed"]
} |
COO | Vector |
{
"format": "COO",
"shape": [m, n],
"dim_types": ["sparse", "sparse"],
"dim_order": [0, 1],
"dim_properties": {
"0": {"is_ordered": true},
"1": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["compressed-nonunique", "singleton"]
} |
{
"format": "COO",
"shape": [n],
"dim_types": ["sparse"],
"dim_order": [0],
"dim_properties": {
"0": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [n],
"dim_taco": ["compressed"]
}
|
- Each dimension is one of
"sparse"
,"compressed"
, or"doubly_compressed"
as described here: - Design choices
- required, even in version 1.0
- required except for formats in version 1.0
- not required if it can be inferred from the format name
- Example:
"format": "SCSR"
-->"dim_types": ["sparse", "compressed", "sparse"]
- Example:
- make
format
optional ifdim_types
is given?
- Logically reorders dimensions
- Like
dimOrdering
in MLIR sparse tensor shape = [dim_sizes[i] for i in dim_order]
dim_sizes = [shape[i] for i in argsort(dim_order)]
- Like
- Optional; defaults to
[0, 1, ..., N-1]
- (choice) May infer from last character in
format
ifdim_types
can also be inferred:"R"
results in[0, 1, ..., N-1]
"C"
results in[N-1, ..., 1, 0]
- otherwise
dim_order
is required (or choose a character like"X"
to mean this)
- (choice) Make required like
dim_types
?
- Optional; dictionary of properties for specified dimensions
- Default properties:
"compressed"
:{}
"sparse"
:{"is_ordered": true}
"doubly_compressed"
:{"is_ordered": true, "is_unique": true}
is_ordered
property- Are the indices in
indices_i
in order for"sparse"
or"doubly_compressed"
dimensions? - Indicates whether
indices_0
is sorted - Indicates whether
indices_i
with the same previous indices (i.e., betweenpointers_{i-1}
boundaries) is sorted - TACO (paper) may call this
ordered
- SuiteSparse:GraphBLAS uses
jumbled
- Are the indices in
is_unique
property- Indicates whether
indices_i
has no duplicates for"doubly_compressed"
dimensions - TACO (paper) may call this
unique
- If
false
, then this conceptually splits a single index tree into multiple trees
- Indicates whether
- Overall properties of the entire sparse index structure
- Optional; defaults to
{"has_duplicates": false}
has_duplicates
property- Indicates whether the sparse array has values with the same indices
- May add
is_symmetric
,ndiag
,fill_value
, etc
- Optional; allowed for clarity
- Determined from
dim_order
andshape
shape = [dim_sizes[i] for i in dim_order]
dim_sizes = [shape[i] for i in argsort(dim_order)]
- Dimension names for TACO and MLIR sparse
- Optional; allowed for clarity
- Determined by
dim_types
(when possible) - Each dimension is one of
"compressed"
,"dense"
,"singleton"
, or"compressed-nonunique"
Still being explored; don't yet have a complete strawman proposal.
Some ideas:
Full v1 | Bitmap v1 |
---|---|
{
"format": "Full",
"shape": [m, n],
} |
{
"format": "Bitmap",
"shape": [m, n],
} |
|
|
Full v2 | Bitmap v2 |
{
"format": "Full",
"shape": [m, n],
"dim_types": ["full", "full"],
"dim_order": [0, 1],
"dim_properties": {},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["dense", "dense"]
}
|
{
"format": "Bitmap",
"shape": [m, n],
"dim_types": ["full", "full"],
"dim_order": [0, 1],
"dim_properties": {},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["dense", "dense"],
"value_bitmaps": {"values": "bitmap"}
} |
|
|
CSR (bitmap) | DCSC (iso) |
{
"format": "CSR",
"shape": [m, n],
"dim_types": ["compressed", "sparse"],
"dim_order": [0, 1],
"dim_properties": {"1": {"is_ordered": true}},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["dense", "compressed"],
"value_bitmaps": {"values": "bitmap"}
}
|
{
"format": "DCSC",
"shape": [m, n],
"dim_types": ["doubly_compressed", "sparse"],
"dim_order": [1, 0],
"dim_properties": {
"0": {"is_ordered": true, "is_unique": true},
"1": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [n, m],
"dim_taco": ["compressed", "compressed"],
"value_isovalues": {"values": 0} // needs data_type!
} |
|
|
CSC (many value arrays) | DCSC (structure-only) |
{
"format": "CSC",
"shape": [m, n],
"dim_types": ["compressed", "sparse"],
"dim_order": [1, 0],
"dim_properties": {"1": {"is_ordered": true}},
"properties": {"has_duplicates": false},
"dim_sizes": [n, m],
"dim_taco": ["dense", "compressed"],
"value_names": ["foo", "bar"],
}
|
{
"format": "DCSR",
"shape": [m, n],
"dim_types": ["doubly_compressed", "sparse"],
"dim_order": [0, 1],
"dim_properties": {
"0": {"is_ordered": true, "is_unique": true},
"1": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["compressed", "compressed"],
"value_names": [],
} |
|
|
semi-COO | CSR (array values) |
{
"format": "SFR",
"shape": [m, n],
"dim_types": ["sparse", "full"],
"dim_order": [0, 1],
"dim_properties": {"0": {"is_ordered": true}},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["compressed-nonunique", "dense"]
}
|
{
"format": "CSFR",
"shape": [m, n, o],
"dim_types": ["compressed", "sparse", "full"],
"dim_order": [0, 1, 2],
"dim_properties": {
"0": {"is_ordered": true},
"1": {"is_ordered": true}
},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n, o],
"dim_taco": ["dense", "compressed", "dense"],
} |
Data:
- indices_0
- values // shape = [len(indices_0), n] |
Data:
- pointers_0
- indices_1
- values // shape = [len(indices_1), o] |
CSR (multigraph) | CSR (variable-size datatype) |
{
"format": "CSR",
"shape": [m, n],
"dim_types": ["compressed", "doubly_compressed"],
"dim_order": [0, 1],
"dim_properties":
{"1": {"is_ordered": true, "is_unique": true}
},
"properties": {"has_duplicates": true},
"dim_sizes": [m, n],
"dim_taco": null,
}
|
{
"format": "CSR",
"shape": [m, n],
"dim_types": ["compressed", "sparse"],
"dim_order": [0, 1],
"dim_properties": {"1": {"is_ordered": true}},
"properties": {"has_duplicates": false},
"dim_sizes": [m, n],
"dim_taco": ["dense", "compressed"],
"value_pointers": {"values": "valptr"}
}
|
|
|