Demonstrate N-dimensional sparse arrays in Python #3
Conversation
Should I succumb and use the word "tensor" instead of rank-N array, multidimensional array, N-D array, N-way array, or order-N array? Tensor is a loaded word that means lots of different things and is very context-dependent, but it is the jargon typically used by researchers.
For easier reading, here's a link to the design document in this PR: https://github.com/eriknw/binsparse-specification/blob/spz/design_docs/01_rankN_arrays.md Please post comments inline in this PR, though.
@Wimmerer pointed out a paper from the TACO compiler group that supports more compression formats. Like MLIR: dense, compressed, singleton. This supports DIA-style compression, among others. I'm happy to punt on these for now.
I believe the TACO compiler also punts on these right now; development on those was done in a branch that wasn't merged into the core, AFAIK.
This is super cool. I'm going to go ahead and merge in your current status so that it shows up by default in the repo. I am slightly confused by a few details after my original read-through. (Shouldn't doubly compressed actually be the combination of two compressed dimensions, since the "doubly" part refers to different dimensions? Also, I'm still a little confused about sparse vs. compressed--it seems like they are both compressed, but "sparse" here refers to dimensions that store an index of a nonzero instead of an offset into an inner sparse dimension.)
Thanks for taking a look!
Perhaps (but, on principle, I say no). I think it's a matter of perspective. Do you consider any dimensions in COO to be "compressed"? Please think about what compression means to you.

In CSR, I consider the row indices to be "compressed". Each row needs to store only one pointer value no matter how many values are in the row. This is compression (to me). Row indices are "doubly compressed" in DCSR because the pointers are "compressed" to not have duplicate values.

Think of it in terms of how we group an index in one dimension to indices in the following dimension. In COO--which I describe as having two "sparse" dimensions--each row index points to exactly one column index. It is a one-to-one relationship, and we need to duplicate a row index if there are multiple values in the row, so there is no compression (although it is still sparse in that we don't store missing values). In CSR, each row index is grouped with any number of column indices. This is very similar to run-length encoding--a type of compression--where the runs are consecutive row indices in COO format. Similarly, in DCSR, each row index that we store is grouped with at least one column index.

In lay terms, I view compression as not storing duplicate things. In COO, row indices are duplicated. CSR compresses COO by not storing duplicate row indices, but the resulting pointers array may have duplicates. DCSR compresses CSR by not storing duplicate pointers. Hence, I consider rows to be "doubly compressed" in DCSR.

Is this clearer? Do you view things differently? Any other questions?
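To make this duplicate-elimination view concrete, here is a tiny illustration (my own sketch, not code from this PR) of one 4x4 matrix stored as COO, CSR, and DCSR:

```python
# A 4x4 matrix with nonzeros at (0, 1), (0, 3), (2, 0), (2, 3).
# COO: row indices are duplicated once per nonzero (two "sparse" dimensions).
coo_rows = [0, 0, 2, 2]
coo_cols = [1, 3, 0, 3]
values = [10, 20, 30, 40]

# CSR: rows are "compressed". pointers[i]:pointers[i+1] slices the nonzeros
# of row i, so each row costs one pointer entry no matter how many values
# it holds -- but empty rows (1 and 3) leave duplicate pointer values.
csr_pointers = [0, 2, 2, 4, 4]
csr_cols = coo_cols

# DCSR: rows are "doubly compressed". Only nonempty rows are listed, so the
# pointers array has no duplicate values either.
dcsr_rows = [0, 2]
dcsr_pointers = [0, 2, 4]
dcsr_cols = coo_cols

# Read row 2 out of each structure to check that they agree.
assert [c for r, c in zip(coo_rows, coo_cols) if r == 2] == [0, 3]
assert csr_cols[csr_pointers[2]:csr_pointers[3]] == [0, 3]
i = dcsr_rows.index(2)
assert dcsr_cols[dcsr_pointers[i]:dcsr_pointers[i + 1]] == [0, 3]
```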
I think it's useful to look at what TACO does here. There are some formats where a dimension is represented by multiple stacked "index formats", but if I recall correctly they view DCSR and DCSC as Compressed x Compressed. I would have to go back and read the Kjolstad thesis to be absolutely clear, though.
Right, I've looked, and DCSR in TACO is Compressed x Compressed, as you describe.

We should consider following what TACO, MLIR sparse, and COMET do. COMET spells some things differently than the others do. I don't feel compelled to follow them exactly, though, so I think we should also consider my current proposal.

I prefer a product-growth mindset, so being clear for future users is very important to me. I actually found TACO and MLIR sparse tensor formats really hard to understand; I don't think they're intuitive. I also find it interesting that many fairly knowledgeable people discuss TACO with a lot of doubt--I've seen forum comments by people I know are very smart asking "what does ... format mean?". I hope the diagrams we can create from this PR will help remove the mysticism.

In terms of prose, I think we can first introduce how to store sparse vectors and matrices, including metadata that I hope will appear natural to most people--for example, CSR. The wording of my proposal in this PR is not yet the prose of the documentation; it can be clearer, and we can use diagrams (hooray!) to make it extra clear. Going to sparse multidimensional arrays should be super-easy once you understand how matrices are handled. I will also have a section discussing going to and from TACO-style formats since we are compatible, and I think diagrams will make our approach super-easy to understand for people who know TACO. I also think it's nice that we can natively store multi-graph data "for free".

Remember, we're doing something different than TACO et al.: we are storing things in a file, not building a compiler. We can choose to do something different that better fits our goals, and it doesn't need to be a reproduction of TACO/MLIR.

Thanks for the engagement. I know there's a lot here to absorb. Please feel welcome to criticize or "think out loud"--I won't take offense.
Also, I'm curious about this. Please share if you find out more :)
I've been having some fun!
Given a COO-like structure for N dimensions, I can automatically create SVG diagrams such as these:
Sparse structure: DC-DC-DC-S (i.e., CSF)

Sparse structure: S-C-DC-S
See all possible sparse structures for this dataset here (also included in this PR):
https://nbviewer.org/github/eriknw/binsparse-specification/blob/spz/spz_python/notebooks/Example_Rank4.ipynb
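As a rough sanity check on the notebook's scope, here is a sketch of the enumeration (my own guess, assuming--as in the two examples above--that the innermost dimension is always stored as `S`, and ignoring dimension orderings):

```python
from itertools import product

def sparse_structures(ndim):
    # Each outer dimension may be S, C, or DC; the innermost dimension is
    # assumed to always be S (as in DC-DC-DC-S and S-C-DC-S above).
    for outer in product(["S", "C", "DC"], repeat=ndim - 1):
        yield "-".join(outer + ("S",))

names = list(sparse_structures(4))
print(len(names))             # 3**3 == 27 candidate structures for rank 4
assert "DC-DC-DC-S" in names  # CSF
assert "S-C-DC-S" in names
```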
Phew! Okay, so this PR is a tangible continuation of #2.
The code in this PR creates any supported sparse structure for any number of dimensions. This is meant as a reference implementation that we can explore and use to educate; it's not meant to be fast. But it appears to be passing what I hope are robust tests.
I largely follow the semantics I first laid out in #2, where each dimension can be compressed as one of the following (a code sketch follows the list):

- `S`, "sparse": like COO; `indices_i` are "aligned" to the following dimension's indices or values.
- `C`, "compressed sparse": like CSR; fast lookup by index into `pointers_i` to the next dimension.
- `DC`, "doubly compressed sparse": like DCSR; store unique index values (`indices_i`) and pointers (`pointers_i`) to the next dimension.
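Here is the sketch promised above: a rough, unoptimized illustration (mine, not the PR's reference implementation) of the three per-dimension compressions, applied to one dimension of a lexicographically sorted COO array:

```python
def compress_dim(indices, size, kind):
    """Compress one dimension of a sorted COO array.

    indices: sorted dimension indices, one per nonzero (COO-style).
    size: the extent of this dimension.
    kind: "S", "C", or "DC".
    """
    if kind == "S":
        # "sparse": keep the indices as-is, aligned one-to-one with the
        # following dimension's indices or values.
        return {"indices": list(indices)}
    if kind == "C":
        # "compressed sparse": one pointer entry per possible index value;
        # pointers[i]:pointers[i+1] spans the nonzeros with index i.
        pointers = [0] * (size + 1)
        for i in indices:
            pointers[i + 1] += 1
        for i in range(size):
            pointers[i + 1] += pointers[i]
        return {"pointers": pointers}
    if kind == "DC":
        # "doubly compressed sparse": store only the unique index values,
        # plus pointers that never repeat.
        unique = sorted(set(indices))
        pointers = [0]
        for u in unique:
            pointers.append(pointers[-1] + list(indices).count(u))
        return {"indices": unique, "pointers": pointers}
    raise ValueError(f"unknown compression kind: {kind}")

rows = [0, 0, 2, 2]  # row indices of a 4x4 matrix with nonzeros in rows 0 and 2
print(compress_dim(rows, 4, "S"))   # {'indices': [0, 0, 2, 2]}
print(compress_dim(rows, 4, "C"))   # {'pointers': [0, 2, 2, 4, 4]}
print(compress_dim(rows, 4, "DC"))  # {'indices': [0, 2], 'pointers': [0, 2, 4]}
```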
I really like using `S`, `C`, and `DC` (optionally separated by hyphens) for the dimension compression types. Why? Because it lets us call CSR, CSC, DCSR, and DCSC by the same names! This naming convention also gives us SSR and SSC for COO structures that are lexicographically sorted by dimensions `(0, 1)` and `(1, 0)`, respectively.
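Spelled out explicitly (this is my reading of the convention--an assumption, since the PR text doesn't state it this way), a 2-D name is just the per-dimension codes concatenated in storage order plus an R or C suffix for the dimension order:

```python
# Hypothetical helper illustrating the naming convention described above:
# concatenate per-dimension compression codes, then append "R" for
# row-major dimension order (0, 1) or "C" for column-major (1, 0).
def structure_name(codes, dim_order=(0, 1)):
    suffix = "R" if dim_order == (0, 1) else "C"
    return "".join(codes) + suffix

assert structure_name(("C", "S")) == "CSR"           # classic CSR
assert structure_name(("C", "S"), (1, 0)) == "CSC"   # classic CSC
assert structure_name(("DC", "S")) == "DCSR"         # doubly compressed rows
assert structure_name(("S", "S")) == "SSR"           # COO sorted by (0, 1)
assert structure_name(("S", "S"), (1, 0)) == "SSC"   # COO sorted by (1, 0)
```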
I hope the diagrams in the notebook will help y'all understand how N-dimensional sparse array compression works; I know I found it enlightening. Please let me know if you have any questions or suggestions. The code for generating the diagrams is kind of wild (please report any bugs), but we have good control over it, and I think the result is fairly compact and clean.
I think my next step will be to start writing prose in markdown files to give us a base to build upon and revise as needed.