Skip to content

Latest commit

 

History

History
executable file
·
888 lines (701 loc) · 15.2 KB

2022-09-07-strawman.md

File metadata and controls

executable file
·
888 lines (701 loc) · 15.2 KB

High level perspective

Let's call our binary sparse storage format binsparse in this document.

For high level design, let's assume we will use existing data storage formats that can store metadata and arrays including the shape and data type of arrays:

flowchart TB
    high[binsparse: high level metadata]-->mid[containers for metadata and arrays]-->low[(files)]
Loading

Example data storage formats:

flowchart TB
    binsparse-->NetCDF4
    binsparse-->HDF5
    binsparse-->Zarr
    binsparse-->n5
    binsparse-->ASDF
    binsparse<-->anndata
Loading

These containers are higher level than e.g. ZIP, LMDB, and Tar files, and handle data types, multidimensional arrays, endianness, compression, etc.

Our current design goal, then, is to determine the high level metadata and arrays for sparse arrays.

We should also consider using binsparse as an embedded spec to be used as components in other storage formats.

flowchart LR
  subgraph Container
  a[embedded binsparse 'foo']
  b[embedded binsparse 'bar']
  c[dense array 'baz']
  d[other stuff]
  end
Loading

Version 1.0

  • Very limited scope
    • Simple and opinionated with no options
    • Easy to explain and understand
    • Smooth transition to version 2.0
    • Cover most of MatrixMarket format
  • Sparse formats: COO, CSR, CSC, DCSR, DCSC, sparse vector (VEC)
  • Indices must be lexicographically sorted by dim0, dim1
  • No duplicate indices allowed
    • Applies to COO
    • Applies to row indices in DCSR
  • May support dense and bitmap formats
  • May allow attributes such as is_symmetric, ndiag, etc.

Assume all metadata contains format type and version, such as:

{
  "spec_version": "https://purl.org/binsparse/spec/core/1.0"
}
CSR CSC DCSR DCSC COO VEC
{
  "format": "CSR",
  "shape": [m, n]
}
{
  "format": "CSC",
  "shape": [m, n]
}
{
  "format": "DCSR",
  "shape": [m, n]
}
{
  "format": "DCSC",
  "shape": [m, n]
}
{
  "format": "COO",
  "shape": [m, n]
}
{
  "format": "COO",
  "shape": [n]
}
Data:
 - pointers_0
 - indices_1
 - values

Data:
 - pointers_0
 - indices_1
 - values

Data:
 - indices_0
 - pointers_0
 - indices_1
 - values
Data:
 - indices_0
 - pointers_0
 - indices_1
 - values
Data:
 - indices_0
 - indices_1
 - values

Data:
 - indices_0
 - values


Questions

Inluding some reactions from the September 7 meeting:

  • Should we save datatypes or defer to the storage container?
    • e.g., {"data_types": {"indices_0": "uint64", "values": "float64"}
      • Jim says yes
      • Keep v1 simple
      • Researching data types is an action item
    • If so, should we indicate endianness such as "<i4"?
      • Jim and Ben say no
  • Should we add "COOR" and "COOC" (or "COO-R" and "COO-C") formats that are sorted by row then column and column then row?
    • Jim, yes for COOR, COOC
  • Should "COO" format be "special" and allowed to be unsorted and/or have duplicates?
    • Ben: COO defaults to COOR
  • Should we always include dim_types (see below)?
    • Nah
  • Should we add a field for arbitrary user metadata such as comments?
    • Yes, a field such as "attrs" that can save key-value pairs
    • Would be nice to store SuiteSparse collection w/o losing any info
    • Erik: follow Zarr's example and don't allow arbitrary key-value pairs--have a specified place for user metadata
  • Should bitmap and full formats be in v1?
    • Jim, Ben, Erik: gut reactions--no
    • Isaac would like to consider memory-friendliness
Easy datatypes
    int8 int16 int32 int64
    uint8 uint16 uint32 uint64
    float32 float64

Harder datatypes
    bool
    complex64
    int128
    uint128
    float16
    fixed-width UDTs (void or composed?)
    strings or variable-length
    datetimes

Ideas for handling data types and values

Single values array:

{
  "format": "CSR",
  "shape": [m, n],
  "data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": "float64"},
}

Iso-value:

{
  "format": "CSR",
  "shape": [m, n],
  "data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": "float64"},
  "isovalue": 1.0

  "data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": {"iso[float64"]: 1.0}},
}

No value, structure-only:

{
  "format": "CSR",
  "shape": [m, n],
  "data_types": {"pointers_0": "int32", "indices_1": "uint64", "values": null}

  "is_structure_only": true
}

Version 2.0

  • Version 2.x is an extension of and can read version 1.x
  • Support multidimensional sparse arrays
  • Support duplicated indices
    • common for COO
  • Support unordered indices
    • common for COO
    • "jumbled" columns in SuiteSparse:GraphBLAS CSR and DCSR
  • Support multiple value arrays, no value arrays, and iso-valued arrays
  • Support per-dimension, per-values, and global attributes
  • Allow extensions
CSR CSC
{
  "format": "CSR",
  "shape": [m, n],

  "dim_types": ["compressed", "sparse"],
  "dim_order": [0, 1],

  "dim_properties": {"1": {"is_ordered": true}},
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["dense", "compressed"]
}
{
  "format": "CSC",
  "shape": [m, n],

  "dim_types": ["compressed", "sparse"],
  "dim_order": [1, 0],

  "dim_properties": {"1": {"is_ordered": true}},
  "properties": {"has_duplicates": false},

  "dim_sizes": [n, m],
  "dim_taco": ["dense", "compressed"]
}
DCSR DCSC
{
  "format": "DCSR",
  "shape": [m, n],

  "dim_types": ["doubly_compressed", "sparse"],
  "dim_order": [0, 1],

  "dim_properties": {
    "0": {"is_ordered": true, "is_unique": true},
    "1": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["compressed", "compressed"]
}
{
  "format": "DCSC",
  "shape": [m, n],

  "dim_types": ["doubly_compressed", "sparse"],
  "dim_order": [1, 0],

  "dim_properties": {
    "0": {"is_ordered": true, "is_unique": true},
    "1": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [n, m],
  "dim_taco": ["compressed", "compressed"]
}
COO Vector
{
  "format": "COO",
  "shape": [m, n],

  "dim_types": ["sparse", "sparse"],
  "dim_order": [0, 1],

  "dim_properties": {
    "0": {"is_ordered": true},
    "1": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["compressed-nonunique", "singleton"]
}
{
  "format": "COO",
  "shape": [n],

  "dim_types": ["sparse"],
  "dim_order": [0],

  "dim_properties": {
    "0": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [n],
  "dim_taco": ["compressed"]
}

dim_types

dim_order

  • Logically reorders dimensions
    • Like dimOrdering in MLIR sparse tensor
    • shape = [dim_sizes[i] for i in dim_order]
    • dim_sizes = [shape[i] for i in argsort(dim_order)]
  • Optional; defaults to [0, 1, ..., N-1]
  • (choice) May infer from last character in format if dim_types can also be inferred:
    • "R" results in [0, 1, ..., N-1]
    • "C" results in [N-1, ..., 1, 0]
    • otherwise dim_order is required (or choose a character like "X" to mean this)
  • (choice) Make required like dim_types?

dim_properties

  • Optional; dictionary of properties for specified dimensions
  • Default properties:
    • "compressed": {}
    • "sparse": {"is_ordered": true}
    • "doubly_compressed": {"is_ordered": true, "is_unique": true}
  • is_ordered property
    • Are the indices in indices_i in order for "sparse" or "doubly_compressed" dimensions?
    • Indicates whether indices_0 is sorted
    • Indicates whether indices_i with the same previous indices (i.e., between pointers_{i-1} boundaries) is sorted
    • TACO (paper) may call this ordered
    • SuiteSparse:GraphBLAS uses jumbled
  • is_unique property
    • Indicates whether indices_i has no duplicates for "doubly_compressed" dimensions
    • TACO (paper) may call this unique
    • If false, then this conceptually splits a single index tree into multiple trees

properties

  • Overall properties of the entire sparse index structure
  • Optional; defaults to {"has_duplicates": false}
  • has_duplicates property
    • Indicates whether the sparse array has values with the same indices
  • May add is_symmetric, ndiag, fill_value, etc

dim_sizes

  • Optional; allowed for clarity
  • Determined from dim_order and shape
    • shape = [dim_sizes[i] for i in dim_order]
    • dim_sizes = [shape[i] for i in argsort(dim_order)]

dim_taco

  • Dimension names for TACO and MLIR sparse
  • Optional; allowed for clarity
  • Determined by dim_types (when possible)
  • Each dimension is one of "compressed", "dense", "singleton", or "compressed-nonunique"

Value arrays

Still being explored; don't yet have a complete strawman proposal.

Some ideas:

Full v1 Bitmap v1
{
  "format": "Full",
  "shape": [m, n],
}
{
  "format": "Bitmap",
  "shape": [m, n],
}
Data:
 - values

Data:
 - values
 - bitmap
Full v2 Bitmap v2
{
  "format": "Full",
  "shape": [m, n],

  "dim_types": ["full", "full"],
  "dim_order": [0, 1],

  "dim_properties": {},
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["dense", "dense"]
}

{
  "format": "Bitmap",
  "shape": [m, n],

  "dim_types": ["full", "full"],
  "dim_order": [0, 1],

  "dim_properties": {},
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["dense", "dense"],

  "value_bitmaps": {"values": "bitmap"}
}
Data:
 - values

Data:
 - values
 - bitmap
CSR (bitmap) DCSC (iso)
{
  "format": "CSR",
  "shape": [m, n],

  "dim_types": ["compressed", "sparse"],
  "dim_order": [0, 1],

  "dim_properties": {"1": {"is_ordered": true}},
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["dense", "compressed"],

  "value_bitmaps": {"values": "bitmap"}
}


{
  "format": "DCSC",
  "shape": [m, n],

  "dim_types": ["doubly_compressed", "sparse"],
  "dim_order": [1, 0],

  "dim_properties": {
    "0": {"is_ordered": true, "is_unique": true},
    "1": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [n, m],
  "dim_taco": ["compressed", "compressed"],

  "value_isovalues": {"values": 0}  // needs data_type!
}
Data:
 - pointers_0
 - indices_1
 - values
 - bitmap
Data:
 - indices_0
 - pointers_0
 - indices_1

CSC (many value arrays) DCSC (structure-only)
{
  "format": "CSC",
  "shape": [m, n],

  "dim_types": ["compressed", "sparse"],
  "dim_order": [1, 0],

  "dim_properties": {"1": {"is_ordered": true}},
  "properties": {"has_duplicates": false},

  "dim_sizes": [n, m],
  "dim_taco": ["dense", "compressed"],

  "value_names": ["foo", "bar"],
}


{
  "format": "DCSR",
  "shape": [m, n],

  "dim_types": ["doubly_compressed", "sparse"],
  "dim_order": [0, 1],

  "dim_properties": {
    "0": {"is_ordered": true, "is_unique": true},
    "1": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["compressed", "compressed"],

  "value_names": [],
}
Data:
 - pointers_0
 - indices_1
 - foo
 - bar
Data:
 - indices_0
 - pointers_0
 - indices_1

semi-COO CSR (array values)
{
  "format": "SFR",
  "shape": [m, n],

  "dim_types": ["sparse", "full"],
  "dim_order": [0, 1],

  "dim_properties": {"0": {"is_ordered": true}},
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["compressed-nonunique", "dense"]
}


{
  "format": "CSFR",
  "shape": [m, n, o],

  "dim_types": ["compressed", "sparse", "full"],
  "dim_order": [0, 1, 2],

  "dim_properties": {
    "0": {"is_ordered": true},
    "1": {"is_ordered": true}
  },
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n, o],
  "dim_taco": ["dense", "compressed", "dense"],
}
Data:
 - indices_0
 - values  // shape = [len(indices_0), n]
Data:
 - pointers_0
 - indices_1
 - values  // shape = [len(indices_1), o]
CSR (multigraph) CSR (variable-size datatype)
{
  "format": "CSR",
  "shape": [m, n],

  "dim_types": ["compressed", "doubly_compressed"],
  "dim_order": [0, 1],

  "dim_properties":
    {"1": {"is_ordered": true, "is_unique": true}
  },
  "properties": {"has_duplicates": true},

  "dim_sizes": [m, n],
  "dim_taco": null,
}
{
  "format": "CSR",
  "shape": [m, n],

  "dim_types": ["compressed", "sparse"],
  "dim_order": [0, 1],

  "dim_properties": {"1": {"is_ordered": true}},
  "properties": {"has_duplicates": false},

  "dim_sizes": [m, n],
  "dim_taco": ["dense", "compressed"],

  "value_pointers": {"values": "valptr"}
}
Data:
 - pointers_0
 - indices_1
 - pointers_1
 - values
Data:
 - pointers_0
 - indices_1
 - values
 - valptr