
Demonstrate N-dimensional sparse arrays in Python #3

Merged: 8 commits into GraphBLAS:main on Aug 30, 2022

Conversation

@eriknw (Member) commented Aug 24, 2022

I've been having some fun!

Given a COO-like structure for N dimensions, I can automatically create SVG diagrams such as these:

Sparse structure: DC-DC-DC-S (i.e., CSF) [SVG diagram]

Sparse structure: S-C-DC-S [SVG diagram]

See all possible sparse structures for this dataset here (also included in this PR):
https://nbviewer.org/github/eriknw/binsparse-specification/blob/spz/spz_python/notebooks/Example_Rank4.ipynb

Phew! Okay, so this PR is a tangible continuation of #2.

The code in this PR creates any supported sparse structure in any number of dimensions. This is meant as a reference implementation that we can explore and use to educate. It's not meant to be fast. But, it appears to be passing what I hope are robust tests.

I largely follow the semantics I first laid out in #2 where each dimension can be compressed as:

  • S, "sparse": like COO; indices are "aligned" to the following dimension indices or values.
    • Uses: indices_i
  • C, "compressed sparse": like CSR; fast lookup by index into pointers to the next dimension.
    • Uses: pointers_i
  • DC, "doubly compressed sparse": like DCSR; store unique index values and pointers to the next dimension.
    • Uses: pointers_i, indices_i

I really like using S, C, and DC (optionally separated by hyphens) for dimension compression type. Why? Because it lets us call CSR, CSC, DCSR, and DCSC by the same names! This naming convention also gives us SSR and SSC for COO structures that are lexicographically sorted by dimensions (0, 1) and (1, 0) respectively.
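To make the S/C/DC arrays concrete, here is a minimal sketch (not the code in this PR) that builds the C-S (CSR) and DC-S (DCSR) arrays for a small matrix from COO input, using the `pointers_i`/`indices_i` naming above:

```python
import numpy as np

# A small 4x5 matrix with values at (0,1), (0,3), (2,2), (3,0), (3,4).
# COO-style input, sorted lexicographically (an S-S structure).
rows = np.array([0, 0, 2, 3, 3])
cols = np.array([1, 3, 2, 0, 4])
nrows = 4

# C-S (i.e., CSR): dimension 0 is compressed into pointers_0;
# dimension 1 stays sparse as indices_1.
pointers_0 = np.zeros(nrows + 1, dtype=int)
np.add.at(pointers_0, rows + 1, 1)   # count values per row
pointers_0 = np.cumsum(pointers_0)
indices_1 = cols
print(pointers_0)  # [0 2 2 3 5] -- row 1 is empty, so a pointer repeats

# DC-S (i.e., DCSR): dimension 0 is doubly compressed; store only the
# unique row indices plus pointers with no duplicate values.
indices_0, counts = np.unique(rows, return_counts=True)
pointers_0_dc = np.concatenate([[0], np.cumsum(counts)])
print(indices_0)      # [0 2 3]
print(pointers_0_dc)  # [0 2 3 5]
```

The same recipe extends dimension by dimension to rank N, which is what the reference implementation does in general form.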

I hope the diagrams in the notebook will help y'all understand how N-dimensional sparse array compression works. I know I found it enlightening. Please let me know if you have any questions or suggestions. The code for generating the diagrams is kind of wild (please report any bugs), but we have good control over it, and I think the result is fairly compact and clean.

I think my next step will be to start writing prose in markdown files to give us a base to build upon and revise as needed.

@eriknw (Member, Author) commented Aug 29, 2022

Should I succumb and use the word "tensor" instead of rank N array, multidimensional array, N-D array, N-way array, or order-N array? Tensor is a loaded word that means lots of different things and is very context-dependent, but it is the jargon that is typically used by researchers.

@eriknw (Member, Author) commented Aug 29, 2022

For easier reading, here's a link to the design document in this PR:

https://github.com/eriknw/binsparse-specification/blob/spz/design_docs/01_rankN_arrays.md

Please post comments inline in this PR, though.

@eriknw (Member, Author) commented Aug 29, 2022

@Wimmerer pointed out a paper from the TACO compiler project that supports more compression formats:

  • Like MLIR: dense, compressed, singleton
  • New formats: range, offset, hashed

This supports DIA-style compression, among others. I'm happy to punt on these for now.

@rayegun commented Aug 29, 2022

I believe the TACO compiler also punts on these right now, development on those was done in a branch that wasn't merged into the core AFAIK.

@BenBrock merged commit 930d6e3 into GraphBLAS:main on Aug 30, 2022
@BenBrock (Contributor) commented:

This is super cool. I'm going to go ahead and merge in your current status so that it shows up by default in the repo.

I am slightly confused by a few details after my original read-through. (Shouldn't "doubly compressed" actually be the combination of two compressed dimensions, since the "doubly" part refers to different dimensions? Also, I'm still a little confused about sparse vs. compressed---it seems like they are both compressed, but "sparse" here refers to dimensions that store the index of a nonzero instead of an offset into an inner sparse dimension.)

@eriknw (Member, Author) commented Aug 31, 2022

Thanks for taking a look!

> Shouldn't doubly compressed actually be the combination of two compressed dimensions? Since the "doubly" part refers to different dimensions?

Perhaps (but, with principles, I say no). I think it's a matter of perspective. Do you consider any dimensions in COO to be "compressed"? Please think about what compression means to you.

In CSR, I consider the row indices to be "compressed". Each row needs to only store one pointer value no matter how many values are in the row. This is compression (to me). Row indices are "doubly compressed" in DCSR because the pointers are "compressed" to not have duplicate values.

Think of it in terms of how we group an index in one dimension to indices in the following dimension. In COO--which I describe as having two "sparse" dimensions--each row index points to exactly one column index. It is a one-to-one relationship, and we need to duplicate a row index if there are multiple values in the row, so there is no compression (although it is still sparse in that we don't store missing values). In CSR, each row index is grouped with any number of column indices. This is very similar to run-length encoding--a type of compression--where the runs are consecutive row indices in COO format. Similarly, in DCSR, each row index that we store is grouped with at least one column index.

In lay terms, I view compression as not storing duplicate things. In COO, row indices are duplicated. CSR compresses COO by not storing duplicate row indices, but the resulting pointers array may have duplicates. DCSR compresses CSR by not storing duplicate pointers. Hence, I consider "rows" to be "doubly compressed" in DCSR.
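That duplicate-removal chain can be shown directly. This is an illustrative sketch (not the PR's implementation): start from CSR-style pointers for a matrix with an empty row, then derive DCSR by dropping the duplicated pointer while keeping the unique row indices so nothing is lost.

```python
import numpy as np

# COO row indices for a 4-row matrix (row 1 is empty): duplicates present.
rows = np.array([0, 0, 2, 3, 3])

# First compression (CSR): one pointer per row instead of one index per
# value -- but the empty row leaves a duplicate pointer value.
pointers = np.zeros(4 + 1, dtype=int)
np.add.at(pointers, rows + 1, 1)
pointers = np.cumsum(pointers)
print(pointers)  # [0 2 2 3 5] -- note the duplicated 2

# Second compression (DCSR): drop duplicate pointers, keeping the unique
# row indices alongside them.
keep = np.flatnonzero(np.diff(pointers))  # rows with at least one value
indices_dc = keep
pointers_dc = np.concatenate([[0], pointers[keep + 1]])
print(indices_dc)   # [0 2 3]
print(pointers_dc)  # [0 2 3 5]
```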

Is this clearer? Do you view things differently? Any other questions?

@rayegun commented Aug 31, 2022

I think it's useful to look at what TACO does here. There are some formats where a dimension is represented by multiple stacked "index formats", but if I recall correctly they view DCSR and DCSC as Compressed x Compressed. I would have to go back and read the Kjolstad thesis to be absolutely clear though.

@eriknw (Member, Author) commented Aug 31, 2022

Right, I've looked, and DCSR in TACO is [compressed, compressed], but this gives an extra pointers array. I don't like this. I think it is awkward and imprecise for an on-disk format. I think DCSR could also be [singleton, compressed]. So, yes, I am proposing something different than, but compatible with, TACO and MLIR.

We should consider following what TACO, MLIR sparse, and COMET do. COMET spells some things differently than any others. I don't feel compelled to exactly follow what they do, though, so I think we should also consider my current proposal. I prefer a product growth mindset, so being clear for future users is very important to me.

I actually found the TACO and MLIR sparse tensor formats really hard to understand. I don't think they're intuitive. I also find it interesting that many fairly knowledgeable people discuss TACO with a lot of doubt. I've seen forum comments in which people I know to be very smart ask "what does ... format mean?". I hope the diagrams we can create from this PR will help remove the mysticism.

In terms of prose, I think we can first introduce how to store sparse vectors and matrices. This includes metadata that I hope will appear natural to most people. For example, CSR is [compressed, sparse]--great, just like the name! I think this is 1000x more clear than [dense, compressed]. DCSR is [doubly compressed, sparse]--again, just like the name! Also, we store exactly the arrays that people expect for CSR and DCSR. These may be little things, but I think they are everything. I suspect most people will only use vectors and matrices.
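For instance, SciPy's CSR matrix already stores exactly the [compressed, sparse] arrays described here. A quick check (assuming SciPy is available; the `pointers_0`/`indices_1` labels are the naming from this proposal, not SciPy's):

```python
import numpy as np
from scipy import sparse

A = np.array([
    [ 0, 10,  0, 20,  0],
    [ 0,  0,  0,  0,  0],
    [ 0,  0, 30,  0,  0],
    [40,  0,  0,  0, 50],
])
csr = sparse.csr_matrix(A)

# Under the proposed naming, CSR is the 2-D structure C-S:
# dimension 0 is "compressed" (pointers), dimension 1 is "sparse" (indices).
print(csr.indptr)   # pointers_0: [0 2 2 3 5]
print(csr.indices)  # indices_1:  [1 3 2 0 4]
print(csr.data)     # values:     [10 20 30 40 50]
```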

The wording of my proposal in this PR is not yet the prose of the documentation. It can be clearer. And we can use diagrams (hooray!) to make it extra clear. Going to sparse multidimensional arrays should be super-easy once you understand how matrices are handled. I will also have a section to discuss going to and from TACO-style since we are compatible, and I think diagrams will make it super-easy for people who know TACO to understand our approach.

I also think it's nice that we can natively store multi-graph data "for free" as [doubly compressed, doubly compressed]. TACO can't do this.

Remember, we're doing something different than TACO, et al. We are storing things in a file. We are not building a compiler. We can choose to do something different that better fits our goals, and it doesn't need to be a rapprochement of TACO/MLIR.

Thanks for the engagement. I know there's a lot here to absorb. Please feel welcome to criticize or "think out loud"--I won't take offense.

@eriknw (Member, Author) commented Aug 31, 2022

> There are some formats where a dimension is represented by multiple stacked "index formats"

Also, I'm curious about this. Please share if you find out more :)
