
Towards N-dimensional sparse arrays #2

Open
eriknw opened this issue Aug 11, 2022 · 7 comments
eriknw commented Aug 11, 2022

I think N-dimensional arrays may be achievable without too much consternation. Let me share a partial proposal that I think strikes a good balance between simplicity and expressiveness. This is largely inspired by TACO formats.

For the rest of this post, I will only assume row-oriented (aka C-oriented) structures such as CSR. Specific names are also open to change.

Instead of names like col_indices and columns, let's see how far we can get with using only indices{i} and pointers{i}, such as indices0, indices1, pointers0, pointers1, etc.

  • COO format is easy. It uses indices0, indices1, ..., indicesN.
  • CSR can use pointers0 and indices1 instead of indptr and col_indices.
  • HyperCSR (DCSR) can use indices0, pointers0, and indices1 instead of rows, indptr, and col_indices.
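To make the CSR case concrete, here is a minimal sketch in plain Python (the helper name `to_csr` is mine, not part of the proposal) that builds `pointers0` and `indices1` from a small dense matrix:

```python
def to_csr(dense):
    """Convert a dense 2-D list to (pointers0, indices1, values).

    pointers0 plays the role of the usual indptr; indices1 plays the
    role of col_indices, following the generalized naming above.
    """
    pointers0 = [0]
    indices1 = []
    values = []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                indices1.append(j)
                values.append(v)
        pointers0.append(len(values))  # one entry per row, plus the leading 0
    return pointers0, indices1, values

pointers0, indices1, values = to_csr([
    [10, 0, 0],
    [0, 0, 20],
    [0, 30, 0],
])
print(pointers0, indices1, values)  # [0, 1, 2, 3] [0, 2, 1] [10, 20, 30]
```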

Let's look at all options for up to rank 3:

Ranks 1-3

[image: table of all storage options for ranks 1-3]

I hope this is self-explanatory (although it may take some thinking). Please ask if you would like me to clarify anything.

Observation 1

If indicesN appears without pointersN, then len(indicesN) == len(values), and all subsequent indices (such as indices{N+1}) must also appear without the corresponding pointers (such as pointers{N+1}).

Edit: upon further consideration, this is not a strict rule. A format with indices0 + indices1 + pointers1 + indices2 is a valid rank 3 array and actually not that weird. indices0 + pointers1 + indices2 is pretty weird though.

Observe how rank 2 formats all have indices1, but no pointers1, because indices1 is the final coordinate that must match with values.

Observation 2

If pointersN appears without indicesN, then pointersN is an array of pointers that can be fast-indexed with any given coordinate index. For example, this is like indptr in CSR, which lets you quickly index into it using a row id.

This case typically happens only at the beginning (as for CSR), and only once. Let's consider the two highlighted cases above that aren't typical:

  • "weird, but valid": the first index is compressed as in HyperCSR. Imaging only two rows have data. Then, the pointers1 array will be of length 2 * shape[1] and can be fast-indexed into.
  • "fast lookup": this let's us fast-index using the first two indices i and j. pointers1 array will be of length shape[0] * shape[1].

Ranks 1-5

For "easy-to-think-about" options (see typical cases from "observation 2" above)
[image: table of "reasonable" storage options for ranks 1-5]

I think this illustrates the pattern of "reasonable" cases nicely. For rank R, there are 2*R - 1 such "reasonable" cases. Overall, though, there are 3**(R-1) total possibilities given my current rules (it was 2**R - 1 under the stricter rule from "observation 1" before the edit).
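Both counts can be checked by brute force. A sketch in plain Python, encoding each dimension as one letter (I = indices only, P = pointers only, D = both; the final dimension must be plain indices), with the stricter pre-edit rule as a filter:

```python
from itertools import product

def patterns(rank):
    """All level patterns whose final dimension is bare indices."""
    return ["".join(p) + "I" for p in product("IPD", repeat=rank - 1)]

def strict(pattern):
    """Original observation 1: once a bare-I level appears,
    every later level must also be bare I."""
    first_i = pattern.find("I")  # always found, since the last level is I
    return all(ch == "I" for ch in pattern[first_i:])

for rank in range(1, 6):
    ps = patterns(rank)
    assert len(ps) == 3 ** (rank - 1)                  # relaxed rule
    assert sum(strict(p) for p in ps) == 2 ** rank - 1  # strict rule
```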

Okay, it's late. I'll continue discussing this some other day.

CC @ivirshup

@BenBrock
Contributor

I think this is ultimately the right way to go for storing sparse tensors of arbitrary dimensions. You can provide a descriptor (pretty much like TACO's descriptor) for how each dimension is stored, along with a dimension reordering, and then that will define what each of the pointer and indices arrays means. And there's a pretty clear naming scheme for the pointers and indices arrays already.

I do think this seems a little complicated to include in a version 1.0, but I'm open to it after we've sorted out regular matrices first, if you want to iron out all the kinks.

Also of note in this space is the MLIR SparseTensor dialect which has a similar descriptor.


eriknw commented Aug 11, 2022

Thanks for the quick feedback. @jim22k and I are way too familiar with the MLIR SparseTensor dialect (Jim probably more so than I).

Part of the exercise in considering N-dim is to come up with a generalized naming scheme and approach, and to see how well these apply to vectors and matrices. I agree that ranks 1 and 2 are what we should prioritize and implement, but I think it is possible to come up with a scheme that "feels" good enough for ranks 1 and 2, but also generalizes to higher rank. To do this, we mostly need to agree on two things:

  • What to name indices0 and pointers0 (and we'll need to decide how to name things regardless)
    • indices0, index0, ind0, idx0, coord0, fids0, indices_0, etc.
    • pointers0, pointer0, indptr0, fptr0, pointers_0, etc.
  • How to distinguish row-major and column-major
    • For example, by having axis_order=[1, 0] to indicate CSC in the metadata.

Consider:

Vector: indices_0
COO: indices_0, indices_1
CSR: pointers_0, indices_1
DCSR: indices_0, pointers_0, indices_1
CSC: pointers_0, indices_1, axis_order=[1, 0]
DCSC: indices_0, pointers_0, indices_1, axis_order=[1, 0]

or

Vector: ind0
COO: ind0, ind1
CSR: ptr0, ind1
DCSR: ind0, ptr0, ind1
CSC: ptr0, ind1, axis_order=[1, 0]
DCSC: ind0, ptr0, ind1, axis_order=[1, 0]

compared to

Vector: indices
COO: rows, columns
CSR: indptr, column_indices
DCSR: rows, indptr, column_indices
CSC: indptr, row_indices
DCSC: columns, indptr, row_indices

If we can get comfortable with one of the former options, then I would argue we can support the most common N-dim structures "for free" later on.

You can provide a descriptor (pretty much like TACO's descriptor) for how each dimension is stored

Yeah, it would be nice to capture a description of how each dimension is stored as a string with one character per dimension. I'm not sure the best way to spell this. Here's an idea:

  • I: for COO-like dimensions that only have e.g. indices_i
  • P: for CSR-like dimensions that only have e.g. pointers_i
  • D: for DCSR-like dimensions that have both indices_i and pointers_i

I remember these as I for "indices", P for "pointers", and D for "doubly-compressed". Descriptions are then:

Vector: I
COO: II
CSR: PI
DCSR: DI
CSF: DDDDDDI
CSF-alt: PDDDDDI
COO-N: IIIIIII

Note that I think we can still name things such as CSR, DCSR, HyperCSR, etc.
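A hypothetical helper (the name `arrays_for` is mine, not part of the proposal) mapping such a one-character-per-dimension description to the array names each dimension contributes:

```python
LEVEL_ARRAYS = {
    "I": ("indices",),             # COO-like: indices only
    "P": ("pointers",),            # CSR-like: pointers only
    "D": ("indices", "pointers"),  # doubly-compressed: both
}

def arrays_for(description):
    """List the binary array names implied by a description string."""
    return [
        f"{base}_{dim}"
        for dim, ch in enumerate(description)
        for base in LEVEL_ARRAYS[ch]
    ]

print(arrays_for("PI"))  # CSR: ['pointers_0', 'indices_1']
print(arrays_for("DI"))  # DCSR: ['indices_0', 'pointers_0', 'indices_1']
```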


jim22k commented Aug 12, 2022

I am very familiar with the MLIR sparse_tensor dialect.

As a matter of interest, I'm working with the main developer to implement semirings in the algebra. I already have unary and binary operators working. Once reduce and select are implemented, we should have all the necessary pieces to have an MLIR version of GraphBLAS.


eriknw commented Aug 12, 2022

Possible properties of indices_i:

  • Are the values sorted?
    • Example 1: for CSR, are all the column indices in each row sorted?
    • Example 2: are the indices of COO lexicographically sorted?
  • Are the values unique?
    • Example: dense rows in CSR can be split by repeating an index.
    • Duplicating indices for better load-balancing is the idea behind "Balanced CSF" or B-CSF: https://arxiv.org/pdf/1904.03329.pdf

Globally, are there any duplicate indices (such as duplicate i, j pairs in COO)?

This isn't purely academic, because I think we'll want to consider these properties for rank 1 and 2 too. I suspect we'll want the vanilla file format to be as dense as possible and sorted, which matches our normal understanding of CSR and DCSR. But, COO is typically unsorted. Whatever we choose for "vanilla", we ought to have a way to add extension properties and indicate whether these properties violate "vanilla" format (so that readers that only read "vanilla" can complain loudly).
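As a sketch of what a "vanilla" check could look like for CSR, assuming the vanilla properties are sorted and unique column indices within each row (the function name is hypothetical):

```python
def is_vanilla_csr(pointers_0, indices_1):
    """Check that column indices are strictly increasing within each row.

    Strictly increasing implies both sorted and unique, matching the
    "vanilla" properties discussed above.
    """
    for row in range(len(pointers_0) - 1):
        cols = indices_1[pointers_0[row]:pointers_0[row + 1]]
        if any(a >= b for a, b in zip(cols, cols[1:])):
            return False  # unsorted or duplicate column index in this row
    return True

print(is_vanilla_csr([0, 2, 3], [0, 2, 1]))  # True
print(is_vanilla_csr([0, 2, 3], [2, 0, 1]))  # False (row 0 unsorted)
print(is_vanilla_csr([0, 2, 3], [1, 1, 0]))  # False (duplicate in row 0)
```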

@BenBrock
Contributor

I'm excited to hear that the two of you are already well-aware of the MLIR sparse_tensor dialect. An MLIR backend for the Python GraphBLAS would be awesome, @jim22k!

I like the idea of having a level-based syntax that supports arbitrary-size tensors and then having most of our labels ("CSR," "COO," etc.) be syntactic sugar for an expression in that syntax. (I would also point out that we could launch a version 1.0 with just the syntactic sugar labels, without having a full implementation and spec for arbitrary dimensions.) I think the naming scheme you have (pointers_i, indices_i) is good for the actual names of the binary arrays. For the descriptor part of the metadata, I assume we would use something closer to MLIR's actual syntax? (i.e. dimLevelType = ['compressed', 'compressed'], dimOrdering = affineMap<(i, j) -> (j, i)> for hyper CSC.)

As for sorting, my gut reaction is that rows and indices should be unique. I think there are legitimate reasons for having repeated rows or indices in an in-memory binary format when you're doing computation, but for file IO I think allowing repeats makes things too complicated for no clear win. This is assuming we are sticking with a blocking strategy to support large file inputs.


eriknw commented Aug 14, 2022

Yeah, that's exactly the idea.

I'm actually not very fond of the names MLIR sparse uses. IIRC, I think it uses "singleton" for COO-like (just index_i), "compressed" for DCSR-like (both index_i and pointers_i), and "dense" for dense dimensions, which can be kinda weird to use (for example, I think CSR in MLIR is specified as "dense" + "compressed"). @jim22k, am I remembering this correctly?

And, yeah, axis_order as I had above does the same thing as MLIR sparse dimOrdering.

I would actually prefer more explicit, descriptive names for the compression type of the dimension. How about this:

  • "sparse"/"S": for COO-like dimensions that only have e.g. indices_i
  • "compressed sparse"/"C": for CSR-like dimensions that only have e.g. pointers_i
  • "doubly compressed sparse"/"D": for DCSR-like dimensions that have both indices_i and pointers_i

I think this is nice, because it uses language that is commonly used and understood. Here are examples using the single-character description for each dimension of the sparse structure:

Vector: S
COO: SS
CSR: CS
DCSR: DS
CSF: DDDDDDS
CSF-alt: CDDDDDS
COO-N: SSSSSSS

To compare, I think MLIR sparse_tensor is:

Vector: singleton
COO: singleton, singleton
CSR: dense, compressed
DCSR: compressed, compressed
CSF: compressed, compressed, ..., compressed
CSF-alt: dense, compressed, compressed, ..., compressed
COO-N: singleton, singleton, ..., singleton

If you look closely, you'll see my proposal is slightly different than, but is compatible with, MLIR sparse. I plan to work on a document that explains my proposal in more detail and with diagrams. I can highlight the differences there, and why I think it's justified to not match MLIR sparse exactly (short answer: I think it'll be easier to explain and generally more intuitive).

Also, I'm totally okay with having version 1.0 be simple versions, and 2.0 be more expressive.

@willow-ahrens
Collaborator

In anticipation of our meeting on Monday, I'd like to point out the draft implementation of an N-dimensional hdf5 output for my tensor compiler, Finch. There are two relevant links:

  1. The high-level description of levels in Finch:
    willowahrens.io/Finch.jl/dev/level/
  2. The PR with the draft file format, and some tests with example usage for dense, COO, and CSC
    add hdf5 io finch-tensor/Finch.jl#126

Let me know what you think!
