Towards N-dimensional sparse arrays #2
Comments
I think this is ultimately the right way to go for storing sparse tensors of arbitrary dimensions. You can provide a descriptor (pretty much like TACO's descriptor) for how each dimension is stored, along with a dimension reordering, and then that will define what each of the pointer and indices arrays means. And there's a pretty clear naming scheme for the pointers and indices arrays already. I do think this seems a little complicated to necessarily include in a version 1.0, but am open to it after we've sorted out the regular matrices first if you want to iron out all the kinks. Also of note in this space is the MLIR SparseTensor dialect which has a similar descriptor.
Thanks for the quick feedback. @jim22k and I are way too familiar with the MLIR SparseTensor dialect (Jim probably more so than I). Part of the exercise in considering N-dim is to come up with a generalized naming scheme and approach, and to see how well these apply to vectors and matrices. I agree that ranks 1 and 2 are what we should prioritize and implement, but I think it is possible to come up with a scheme that "feels" good enough for ranks 1 and 2, but also generalizes to higher rank. To do this, we mostly need to agree on two things:
Consider:
compared to
If we can get comfortable with one of the former options, then I would argue we can support the most common N-dim structures "for free" later on.
Yeah, it would be nice to capture a description of how each dimension is stored as a string with one character per dimension. I'm not sure the best way to spell this. Here's an idea:
I remember these as
Note that I think we can still name things such as
I am very familiar with the MLIR sparse_tensor dialect. As a matter of interest, I'm working with the main developer to implement semirings in the algebra. I already have unary and binary operators working. Once reduce and select are implemented, we should have all the necessary pieces to have an MLIR version of GraphBLAS.
Possible properties of
Globally, are there any duplicate indices (such as duplicate This isn't purely academic, because I think we'll want to consider these properties for rank 1 and 2 too. I suspect we'll want the vanilla file format to be as dense as possible and sorted, which matches our normal understanding of CSR and DCSR. But, COO is typically unsorted. Whatever we choose for "vanilla", we ought to have a way to add extension properties and indicate whether these properties violate "vanilla" format (so that readers that only read "vanilla" can complain loudly).
I'm excited to hear that the two of you are already well aware of the MLIR sparse_tensor dialect. An MLIR backend for the Python GraphBLAS would be awesome, @jim22k! I like the idea of having a level-based syntax that supports arbitrary-size tensors and then having most of our labels ("CSR," "COO," etc.) be syntactic sugar for an expression in that syntax. (I would also point out that we could launch a version 1.0 with just the syntactic sugar labels, without having a full implementation and spec for arbitrary dimensions.) I think the naming scheme you have ( As for sorting, my gut reaction is that rows and indices should be unique. I think there are good reasons for having repeat rows or indices in a binary storage format in memory when you're doing computation, but for file IO I think allowing repeats makes things too complicated for no clear win. This is assuming we are sticking with a blocking strategy to support large file inputs.
Yeah, that's exactly the idea. I'm actually not very fond of the names MLIR sparse uses. IIRC, I think it uses "singleton" for COO-like (just And, yeah, I would actually prefer more explicit, descriptive names for the compression type of the dimension. How about this:
I think this is nice, because it uses language that is commonly used and understood. Here are examples using the single-character description for each dimension of the sparse structure:
To compare, I think MLIR sparse_tensor is:
If you look closely, you'll see my proposal is slightly different than, but is compatible with, MLIR sparse. I plan to work on a document that explains my proposal in more detail and with diagrams. I can highlight the differences there, and why I think it's justified to not match MLIR sparse exactly (short answer: I think it'll be easier to explain and generally more intuitive). Also, I'm totally okay with having version 1.0 be simple versions, and 2.0 be more expressive.
In anticipation of our meeting on Monday, I'd like to point out the draft implementation of an N-dimensional hdf5 output for my tensor compiler, Finch. There are two relevant links:
Let me know what you think!
I think N-dimensional arrays may be achievable without too much consternation. Let me share a partial proposal that I think strikes a good balance between simplicity and expressiveness. This is largely inspired by TACO formats.
For the rest of this post, I will only assume row-oriented (aka C-oriented) structures such as CSR. Specific names are also open to change.
Instead of names like `col_indices` and `columns`, let's see how far we can get with using only `indices{i}` and `pointers{i}`, such as `indices0`, `indices1`, `pointers0`, `pointers1`, etc.

COO format is easy. It uses `indices0`, `indices1`, ..., `indicesN`.

CSR can use `pointers0` and `indices1` instead of `indptr` and `col_indices`.

HyperCSR (DCSR) can use `indices0`, `pointers0`, and `indices1` instead of `rows`, `indptr`, and `col_indices`.

Let's look at all options for up to rank 3:
Ranks 1-3
I hope this is self-explanatory (although it may take some thinking). Please ask if you would like me to clarify anything.
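To make the naming concrete, here is a minimal sketch in plain Python that stores one small matrix three ways using only `indices{i}`, `pointers{i}`, and `values`. The helper names (`to_coo`, `to_csr`, `to_dcsr`, `csr_row`) are hypothetical, and the pointer convention (one entry per stored row plus a trailing total, as in the usual `indptr`) is my assumption rather than something settled in this thread.

```python
def to_coo(dense):
    """COO: one coordinate array per dimension, each as long as values."""
    indices0, indices1, values = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                indices0.append(i)
                indices1.append(j)
                values.append(v)
    return {"indices0": indices0, "indices1": indices1, "values": values}


def to_csr(dense):
    """CSR: pointers0 replaces indices0 and has len(dense) + 1 entries (assumed)."""
    coo = to_coo(dense)
    pointers0 = [0]
    for i in range(len(dense)):
        pointers0.append(pointers0[-1] + coo["indices0"].count(i))
    return {"pointers0": pointers0, "indices1": coo["indices1"], "values": coo["values"]}


def to_dcsr(dense):
    """DCSR: indices0 lists only the non-empty rows; pointers0 covers just those rows."""
    csr = to_csr(dense)
    p = csr["pointers0"]
    nonempty = [i for i in range(len(dense)) if p[i + 1] > p[i]]
    pointers0 = [p[i] for i in nonempty] + [p[-1]]
    return {"indices0": nonempty, "pointers0": pointers0,
            "indices1": csr["indices1"], "values": csr["values"]}


def csr_row(csr, i):
    """Look up row i directly through pointers0, with no search needed."""
    start, stop = csr["pointers0"][i], csr["pointers0"][i + 1]
    return list(zip(csr["indices1"][start:stop], csr["values"][start:stop]))


# Example matrix; row 1 is empty so that CSR and DCSR differ.
dense = [
    [0, 10, 0, 20],
    [0, 0, 0, 0],
    [30, 0, 0, 0],
    [0, 0, 40, 50],
]
coo, csr, dcsr = to_coo(dense), to_csr(dense), to_dcsr(dense)
```

In all three layouts `indices1` and `values` are identical; only the way dimension 0 is described changes.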
Observation 1
If `indicesN` appears without `pointersN`, then `len(indicesN) == len(values)`, and all subsequent indices (such as `indices{N+1}`) must also appear without the corresponding pointers (such as `pointers{N+1}`).

Edit: upon further consideration, this is not a strict rule. A format with `indices0` + `indices1` + `pointers1` + `indices2` is a valid rank 3 array and actually not that weird. `indices0` + `pointers1` + `indices2` is pretty weird though.

Observe how rank 2 formats all have `indices1`, but no `pointers1`, because `indices1` is the final coordinate that must match with values.

Observation 2
If `pointersN` appears without `indicesN`, then `pointersN` is an array of pointers that can be fast-indexed with any given coordinate index. For example, this is like `indptr` in CSR that lets you quickly index into it using a row id.

This case typically happens only at the beginning (as for CSR), and only once. Let's consider the two highlighted cases above that aren't typical:
`pointers1` array will be of length `2 * shape[1]` and can be fast-indexed into.
`i` and `j`.
`pointers1` array will be of length `shape[0] * shape[1]`.

Ranks 1-5
For "easy-to-think-about" options (see typical cases from "observation 2" above)
I think this illustrates the pattern of "reasonable" cases nicely. For rank `R`, there are `2*R - 1` such "reasonable" cases. Overall, though, there are ~~`2**R - 1`~~ `3**(R-1)` total possibilities given my current rules.

Okay, it's late. I'll continue discussing this some other day.
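The counts above can be checked with a short enumeration. This is only a sketch of my reading of the rules: each dimension except the last is stored as indices only (`"I"`), pointers only (`"P"`), or both (`"IP"`), and the last dimension is always indices only. The labels, the function names, and the exact definition of "reasonable" (Observation 1 treated as strict, plus pointers-without-indices allowed only in dimension 0) are assumptions made for illustration.

```python
from itertools import product

# Hypothetical shorthand for how one dimension is stored.
CHOICES = ("I", "P", "IP")  # indices only, pointers only, indices + pointers


def all_formats(rank):
    """Every combination; the final dimension is always indices-only."""
    return [combo + ("I",) for combo in product(CHOICES, repeat=rank - 1)]


def is_strict(fmt):
    """Observation 1 as a strict rule: after an indices-only dimension,
    every later dimension is indices-only too."""
    seen_plain = False
    for level in fmt:
        if level == "I":
            seen_plain = True
        elif seen_plain:
            return False
    return True


def is_reasonable(fmt):
    """Strict, and pointers-without-indices may only occur in dimension 0."""
    return is_strict(fmt) and all(level != "P" for level in fmt[1:])


def array_names(fmt):
    """Spell out the arrays a format needs, e.g. DCSR -> indices0, pointers0, indices1."""
    names = []
    for n, level in enumerate(fmt):
        if "I" in level:
            names.append(f"indices{n}")
        if "P" in level:
            names.append(f"pointers{n}")
    return names


for rank in range(1, 6):
    formats = all_formats(rank)
    strict = [f for f in formats if is_strict(f)]
    reasonable = [f for f in formats if is_reasonable(f)]
    print(rank, len(formats), len(strict), len(reasonable))
```

Under these assumptions the strict reading of Observation 1 gives `2**R - 1` formats, which is where the earlier figure comes from, while relaxing it gives `3**(R-1)`.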
CC @ivirshup