Max-plus matrix multiplication #365

Open
Jogima-cyber opened this issue Mar 18, 2024 · 6 comments

@Jogima-cyber

Hi there! I'm currently looking for an autodiff approach to perform a max-plus matrix multiplication on CUDA:

[Screenshot of the target operation: $C_{ij} = \max_k \,(a_{ik} + b_{jk})$]

I've been experimenting with KeOps, but it seems it might not be the ideal tool for this task. Do you think it could be tweaked to accommodate this specific operation?
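For concreteness, here is the operation written as a dense PyTorch baseline (the maxplus_dense helper name is just for illustration; it materializes the full (I, J, K) tensor, so it cannot scale to my sizes):

import torch

def maxplus_dense(a, b):
    # a: (I, K), b: (J, K); returns C with C[i, j] = max_k (a[i, k] + b[j, k]).
    # Differentiable, but builds an (I, J, K) intermediate tensor: reference only.
    return (a[:, None, :] + b[None, :, :]).max(dim=2).values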

@jeanfeydy (Contributor) commented Mar 18, 2024

Hi @Jogima-cyber ,

Thanks for your interest in our library.
I agree with you that KeOps does not seem to be a great fit here, unless the number of k-indices is small (< 100) and you want to perform a symbolic computation on your matrix $C_{ij}$. The code below computes the dense (I, J) matrix, but it is far from optimal on GPU as it cannot leverage the block structure efficiently:

import torch
from pykeops.torch import LazyTensor

I, J, K = 1000, 2000, 3000
a = torch.randn(I, K, requires_grad=True)
b = torch.randn(J, K, requires_grad=True)

# Encode I and J as batch dimensions, with K as the reduction axis:
a_ = LazyTensor(a.view(I, 1, 1, K, 1))
b_ = LazyTensor(b.view(1, J, 1, K, 1))
maxplus = (a_ + b_).max(dim=3)
print(maxplus.shape)  # (I, J, 1, 1)

# Use argmax + indexing to get a differentiable result, e.g.:
k_star = (a_ + b_).argmax(dim=3).view(I, J).long()  # best index k for each (i, j)
rows = torch.arange(I).view(I, 1)  # broadcasts against k_star
cols = torch.arange(J).view(1, J)
maxplus_diff = a[rows, k_star] + b[cols, k_star]  # (I, J), differentiable

If you want to write an optimal CUDA kernel for this task, I would suggest looking at CUTLASS and Triton.
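For reference, a rough (untested) Triton sketch of such a kernel could look like the following: a tiled contraction where the multiply-accumulate of a standard matmul is replaced by add-maximum. Block sizes, masking and autotuning would all need care, and the backward pass would still go through argmax + indexing as above:

import torch
import triton
import triton.language as tl

@triton.jit
def _maxplus_kernel(a_ptr, b_ptr, c_ptr, I, J, K,
                    BLOCK_I: tl.constexpr, BLOCK_J: tl.constexpr,
                    BLOCK_K: tl.constexpr):
    # One program computes a (BLOCK_I, BLOCK_J) tile of C, with
    # C[i, j] = max_k (a[i, k] + b[j, k]); a: (I, K), b: (J, K),
    # both assumed contiguous float32.
    pid_i = tl.program_id(0)
    pid_j = tl.program_id(1)
    offs_i = pid_i * BLOCK_I + tl.arange(0, BLOCK_I)
    offs_j = pid_j * BLOCK_J + tl.arange(0, BLOCK_J)
    acc = tl.full((BLOCK_I, BLOCK_J), float("-inf"), tl.float32)
    for k0 in range(0, K, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_i[:, None] * K + offs_k[None, :],
                    mask=(offs_i[:, None] < I) & (offs_k[None, :] < K),
                    other=float("-inf"))
        b = tl.load(b_ptr + offs_j[:, None] * K + offs_k[None, :],
                    mask=(offs_j[:, None] < J) & (offs_k[None, :] < K),
                    other=float("-inf"))
        # (BLOCK_I, 1, BLOCK_K) + (1, BLOCK_J, BLOCK_K), then reduce over k:
        acc = tl.maximum(acc, tl.max(a[:, None, :] + b[None, :, :], axis=2))
    mask = (offs_i[:, None] < I) & (offs_j[None, :] < J)
    tl.store(c_ptr + offs_i[:, None] * J + offs_j[None, :], acc, mask=mask)

def maxplus_triton(a, b, BLOCK=32):
    # a: (I, K), b: (J, K), contiguous float32 CUDA tensors.
    I, K = a.shape
    J, _ = b.shape
    c = torch.empty((I, J), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(I, BLOCK), triton.cdiv(J, BLOCK))
    _maxplus_kernel[grid](a, b, c, I, J, K,
                          BLOCK_I=BLOCK, BLOCK_J=BLOCK, BLOCK_K=BLOCK)
    return c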

Going forward, we could support a better (but still sub-optimal) implementation by allowing KeOps variables to be indexed by other, integer-valued variables.
It would be a very useful and idiomatic improvement for KeOps, and the code could look like:

a_ = LazyTensor(a.view(I, 1, K, 1))
b_ = LazyTensor(b.view(1, J, 1, K))
k_ = LazyTensor(torch.arange(K).view(1, 1, K, 1))
# Proposed syntax: index b_ by the integer-valued LazyTensor k_
# (not supported at the time of writing):
maxplus = (a_ + b_[k_]).max(dim=2)
print(maxplus.shape)  # (I, J, 1)

Performance would be OK if K < 100, and we could cut the problem into smaller chunks for larger values of K, as sketched below.
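In plain PyTorch (not KeOps; the maxplus_chunked helper name is just for illustration), the chunking idea amounts to a running maximum over K-chunks, which caps the intermediate buffer at (I, J, chunk):

import torch

def maxplus_chunked(a, b, chunk=64):
    # a: (I, K), b: (J, K); running max over chunks of the K axis,
    # so the intermediate buffer is (I, J, chunk) instead of (I, J, K).
    I, J = a.shape[0], b.shape[0]
    c = a.new_full((I, J), float("-inf"))
    for k0 in range(0, a.shape[1], chunk):
        block = a[:, None, k0:k0 + chunk] + b[None, :, k0:k0 + chunk]
        c = torch.maximum(c, block.max(dim=2).values)
    return c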
I don't know if @joanglaunes wants to comment on this?

Best regards,
Jean

@jeanfeydy jeanfeydy added the enhancement New feature or request label Mar 18, 2024
@joanglaunes (Contributor)

Hello,
I have thought about another option, which is to write

a_ = LazyTensor(a.view(I, 1, 1, K))
b_ = LazyTensor(b.view(1, J, 1, K))
# max(dim=3) is a symbolic max over the feature dimension K;
# sum(dim=2) is a dummy reduction over a dimension of size 1:
maxplus = (a_ + b_).max(dim=3).sum(dim=2)

Here the summation along dim 2 does nothing since that dimension has size 1; it is just a dummy reduction that converts the LazyTensor into an actual tensor. Trying it out, this option appears to be faster, but there actually seems to be an issue with this dummy summation that we need to investigate.
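For small sizes, this version can be sanity-checked against a dense PyTorch reference, which is how the issue shows up (a minimal sketch):

import torch
from pykeops.torch import LazyTensor

I, J, K = 50, 60, 70
a = torch.randn(I, K)
b = torch.randn(J, K)
a_ = LazyTensor(a.view(I, 1, 1, K))
b_ = LazyTensor(b.view(1, J, 1, K))
res = (a_ + b_).max(dim=3).sum(dim=2).view(I, J)
ref = (a[:, None, :] + b[None, :, :]).max(dim=2).values
print(torch.allclose(res, ref))  # False here would reveal the bug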

@joanglaunes (Contributor)

Hello @Jogima-cyber and @jeanfeydy ,
Investigating my previous comment further, it seems that the good timings I got with this version were only caused by a bug.
I was misled by the fact that I had correct outputs in almost all cases, but I understand where it comes from now, and there is no reason why this version should be faster.
So please forget my previous comment: it was not a good idea for your problem, and for us it means we are facing an issue in some cases with this type of dummy reduction. I should open a separate issue for this.

Coming back to the current topic, I agree with Jean that implementing indexing by other variables the way he proposes would be a very nice improvement, although I don't know whether it can be done easily!

@Jogima-cyber (Author)

Hello! @jeanfeydy @joanglaunes thank you very much for the time you spent trying several approaches to solve this issue! I tried @jeanfeydy's solution for my use case, and while it was quite slow, it was sufficient to test an idea. If the idea shows interesting results, I can always write an optimized CUDA kernel as suggested by @jeanfeydy. Thank you again for your time.

@jeanfeydy (Contributor)

Hi @Jogima-cyber , you're very welcome! As far as I can tell, Taichi would be a good tool to implement such an operation efficiently. It is lower-level than KeOps (see e.g. this example that implements the same tiled reduction scheme as we do), but still much more readable than C++ and easier to deploy. Best of luck :-)
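For illustration, a naive (untested) Taichi sketch of the operation, without the tiling that a production kernel would need, might look like:

import taichi as ti
import torch

ti.init(arch=ti.gpu)

@ti.kernel
def maxplus(a: ti.types.ndarray(dtype=ti.f32, ndim=2),
            b: ti.types.ndarray(dtype=ti.f32, ndim=2),
            c: ti.types.ndarray(dtype=ti.f32, ndim=2)):
    # c[i, j] = max_k (a[i, k] + b[j, k]); the outer (i, j) loop is
    # parallelized by Taichi, the inner k loop runs serially per thread.
    for i, j in ti.ndrange(c.shape[0], c.shape[1]):
        best = a[i, 0] + b[j, 0]
        for k in range(1, a.shape[1]):
            best = ti.max(best, a[i, k] + b[j, k])
        c[i, j] = best

I, J, K = 1000, 2000, 3000
a = torch.randn(I, K, device="cuda")
b = torch.randn(J, K, device="cuda")
c = torch.empty(I, J, device="cuda")
maxplus(a, b, c)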

@joanglaunes (Contributor)

Hello @Jogima-cyber , @jeanfeydy ,
I have added support for indexing a LazyTensor by another LazyTensor, as Jean suggested. This was also requested in issue #367. It is currently in a branch called Index. Jean, maybe you can try your idea to see if it works as expected?
