
Demonstrate N-dimensional sparse arrays in Python #3

Merged: 8 commits into GraphBLAS:main on Aug 30, 2022

Conversation

@eriknw (Member) commented Aug 24, 2022

I've been having some fun!

Given a COO-like structure for N dimensions, I can automatically create SVG diagrams such as these:

Sparse structure: DC-DC-DC-S (i.e., CSF) [SVG diagram]

Sparse structure: S-C-DC-S [SVG diagram]

See all possible sparse structures for this dataset here (also included in this PR):
https://nbviewer.org/github/eriknw/binsparse-specification/blob/spz/spz_python/notebooks/Example_Rank4.ipynb

Phew! Okay, so this PR is a tangible continuation of #2.

The code in this PR creates any supported sparse structure in any number of dimensions. This is meant as a reference implementation that we can explore and use to educate. It's not meant to be fast. But, it appears to be passing what I hope are robust tests.

I largely follow the semantics I first laid out in #2 where each dimension can be compressed as:

  • S, "sparse": like COO; indices are "aligned" to the following dimension indices or values.
    • Uses: indices_i
  • C, "compressed sparse": like CSR; fast lookup by index into pointers to the next dimension.
    • Uses: pointers_i
  • DC, "doubly compressed sparse": like DCSR; store unique index values and pointers to the next dimension.
    • Uses: pointers_i, indices_i

I really like using S, C, and DC (optionally separated by hyphens) for dimension compression type. Why? Because it lets us call CSR, CSC, DCSR, and DCSC by the same names! This naming convention also gives us SSR and SSC for COO structures that are lexicographically sorted by dimensions (0, 1) and (1, 0) respectively.
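To make the S/C/DC arrays concrete, here is a minimal sketch (not the code in this PR) that builds the C-S (CSR) and DC-S (DCSR) arrays for a small matrix from COO input, using the `pointers_i`/`indices_i` naming above:

```python
import numpy as np

# A small 4x5 matrix with values at (0,1), (0,3), (2,2), (3,0), (3,4).
# COO-style input, sorted lexicographically (an S-S structure).
rows = np.array([0, 0, 2, 3, 3])
cols = np.array([1, 3, 2, 0, 4])
nrows = 4

# C-S (i.e., CSR): dimension 0 is compressed into pointers_0;
# dimension 1 stays sparse as indices_1.
pointers_0 = np.zeros(nrows + 1, dtype=int)
np.add.at(pointers_0, rows + 1, 1)   # count values per row
pointers_0 = np.cumsum(pointers_0)
indices_1 = cols
print(pointers_0)  # [0 2 2 3 5] -- row 1 is empty, so a pointer repeats

# DC-S (i.e., DCSR): dimension 0 is doubly compressed; store only the
# unique row indices plus pointers with no duplicate values.
indices_0, counts = np.unique(rows, return_counts=True)
pointers_0_dc = np.concatenate([[0], np.cumsum(counts)])
print(indices_0)      # [0 2 3]
print(pointers_0_dc)  # [0 2 3 5]
```

The same recipe extends dimension by dimension to rank N, which is what the reference implementation does in general form.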

I hope the diagrams in the notebook will help y'all understand how N-dimensional sparse array compression works. I know I found it enlightening. Please let me know if you have any questions or suggestions. The code for generating the diagrams is kind of wild (please report any bugs), but we have good control over it, and I think the result is fairly compact and clean.

I think my next step will be to start writing prose in markdown files to give us a base to build upon and revise as needed.

@eriknw (Member, Author) commented Aug 29, 2022

Should I succumb and use the word "tensor" instead of rank N array, multidimensional array, N-D array, N-way array, or order-N array? Tensor is a loaded word that means lots of different things and is very context-dependent, but it is the jargon that is typically used by researchers.

@eriknw (Member, Author) commented Aug 29, 2022

For easier reading, here's a link to the design document in this PR:

https://github.com/eriknw/binsparse-specification/blob/spz/design_docs/01_rankN_arrays.md

Please post comments inline in this PR, though.

@eriknw (Member, Author) commented Aug 29, 2022

@Wimmerer pointed out a paper from the TACO compiler project that supports more compression formats:

  • Like MLIR: dense, compressed, singleton
  • New formats: range, offset, hashed

This supports DIA-style compression, among others. I'm happy to punt on these for now.

@rayegun commented Aug 29, 2022

I believe the TACO compiler also punts on these right now, development on those was done in a branch that wasn't merged into the core AFAIK.

@BenBrock merged commit 930d6e3 into GraphBLAS:main on Aug 30, 2022
@BenBrock (Contributor) commented:

This is super cool. I'm going to go ahead and merge in your current status so that it shows up by default in the repo.

I am slightly confused by a few details after my original read-through. (Shouldn't "doubly compressed" actually be the combination of two compressed dimensions, since the "doubly" part refers to different dimensions? Also, I'm still a little confused about sparse vs. compressed---it seems like they are both compressed, but "sparse" here refers to dimensions that store the index of a nonzero instead of an offset into an inner sparse dimension.)

@eriknw (Member, Author) commented Aug 31, 2022

Thanks for taking a look!

> Shouldn't doubly compressed actually be the combination of two compressed dimensions? Since the "doubly" part refers to different dimensions?

Perhaps (but, with principles, I say no). I think it's a matter of perspective. Do you consider any dimensions in COO to be "compressed"? Please think about what compression means to you.

In CSR, I consider the row indices to be "compressed". Each row needs to only store one pointer value no matter how many values are in the row. This is compression (to me). Row indices are "doubly compressed" in DCSR because the pointers are "compressed" to not have duplicate values.

Think of it in terms of how we group an index in one dimension to indices in the following dimension. In COO--which I describe as having two "sparse" dimensions--each row index points to exactly one column index. It is a one-to-one relationship, and we need to duplicate a row index if there are multiple values in the row, so there is no compression (although it is still sparse in that we don't store missing values). In CSR, each row index is grouped with any number of column indices. This is very similar to run-length encoding--a type of compression--where the runs are consecutive row indices in COO format. Similarly, in DCSR, each row index that we store is grouped with at least one column index.

In lay terms, I view compression as not storing duplicate things. In COO, row indices are duplicated. CSR compresses COO by not storing duplicate row indices, but the resulting pointers array may have duplicates. DCSR compresses CSR by not storing duplicate pointers. Hence, I consider "rows" to be "doubly compressed" in DCSR.
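That duplicate-removal chain can be shown directly. This is an illustrative sketch (not the PR's implementation): start from CSR-style pointers for a matrix with an empty row, then derive DCSR by dropping the duplicated pointer while keeping the unique row indices so nothing is lost.

```python
import numpy as np

# COO row indices for a 4-row matrix (row 1 is empty): duplicates present.
rows = np.array([0, 0, 2, 3, 3])

# First compression (CSR): one pointer per row instead of one index per
# value -- but the empty row leaves a duplicate pointer value.
pointers = np.zeros(4 + 1, dtype=int)
np.add.at(pointers, rows + 1, 1)
pointers = np.cumsum(pointers)
print(pointers)  # [0 2 2 3 5] -- note the duplicated 2

# Second compression (DCSR): drop duplicate pointers, keeping the unique
# row indices alongside them.
keep = np.flatnonzero(np.diff(pointers))  # rows with at least one value
indices_dc = keep
pointers_dc = np.concatenate([[0], pointers[keep + 1]])
print(indices_dc)   # [0 2 3]
print(pointers_dc)  # [0 2 3 5]
```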

Is this clearer? Do you view things differently? Any other questions?

@rayegun commented Aug 31, 2022

I think it's useful to look at what TACO does here. There are some formats where a dimension is represented by multiple stacked "index formats", but if I recall correctly they view DCSR and DCSC as Compressed x Compressed. I would have to go back and read the Kjolstad thesis to be absolutely clear though.

@eriknw (Member, Author) commented Aug 31, 2022

Right, I've looked, and DCSR in TACO is [compressed, compressed], but this gives an extra pointers array. I don't like this. I think it is awkward and imprecise for an on-disk format. I think DCSR could also be [singleton, compressed]. So, yes, I am proposing something different than, but compatible with, TACO and MLIR.

We should consider following what TACO, MLIR sparse, and COMET do. COMET spells some things differently than any others. I don't feel compelled to exactly follow what they do, though, so I think we should also consider my current proposal. I prefer a product growth mindset, so being clear for future users is very important to me.

I actually found the TACO and MLIR sparse tensor formats really hard to understand. I don't think they're intuitive. I also find it interesting that many fairly knowledgeable people discuss TACO with a lot of doubt. I've seen forum comments in which people I know to be very smart ask "what does ... format mean?". I hope the diagrams we can create from this PR will help remove the mysticism.

In terms of prose, I think we can first introduce how to store sparse vectors and matrices. This includes metadata that I hope will appear natural to most people. For example, CSR is [compressed, sparse]--great, just like the name! I think this is 1000x more clear than [dense, compressed]. DCSR is [doubly compressed, sparse]--again, just like the name! Also, we store exactly the arrays that people expect for CSR and DCSR. These may be little things, but I think they are everything. I suspect most people will only use vectors and matrices.
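For instance, SciPy's CSR matrix already stores exactly the [compressed, sparse] arrays described here. A quick check (assuming SciPy is available; the `pointers_0`/`indices_1` labels are the naming from this proposal, not SciPy's):

```python
import numpy as np
from scipy import sparse

A = np.array([
    [ 0, 10,  0, 20,  0],
    [ 0,  0,  0,  0,  0],
    [ 0,  0, 30,  0,  0],
    [40,  0,  0,  0, 50],
])
csr = sparse.csr_matrix(A)

# Under the proposed naming, CSR is the 2-D structure C-S:
# dimension 0 is "compressed" (pointers), dimension 1 is "sparse" (indices).
print(csr.indptr)   # pointers_0: [0 2 2 3 5]
print(csr.indices)  # indices_1:  [1 3 2 0 4]
print(csr.data)     # values:     [10 20 30 40 50]
```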

The wording of my proposal in this PR is not yet the prose of the documentation. It can be clearer. And we can use diagrams (hooray!) to make it extra clear. Going to sparse multidimensional arrays should be super-easy once you understand how matrices are handled. I will also have a section to discuss going to and from TACO-style since we are compatible, and I think diagrams will make it super-easy for people who know TACO to understand our approach.

I also think it's nice that we can natively store multi-graph data "for free" as [doubly compressed, doubly compressed]. TACO can't do this.

Remember, we're doing something different than TACO, et al. We are storing things in a file. We are not building a compiler. We can choose to do something different that better fits our goals, and it doesn't need to be a rapprochement of TACO/MLIR.

Thanks for the engagement. I know there's a lot here to absorb. Please feel welcome to criticize or "think out loud"--I won't take offense.

@eriknw (Member, Author) commented Aug 31, 2022

> There are some formats where a dimension is represented by multiple stacked "index formats"

Also, I'm curious about this. Please share if you find out more :)
