Max-plus matrix multiplication #365
Hi @Jogima-cyber, thanks for your interest in our library.

```python
import torch
from pykeops.torch import LazyTensor

I, J, K = 1000, 2000, 3000
a = torch.randn(I, K)
b = torch.randn(J, K)

a_ = LazyTensor(a.view(I, 1, 1, K, 1))
b_ = LazyTensor(b.view(1, J, 1, K, 1))
maxplus = (a_ + b_).max(dim=3)
print(maxplus.shape)  # (I, J, 1, 1)
# Use argmax + indexing to get a differentiable result
```

If you want to write an optimal CUDA kernel for this task, I would suggest looking at CUTLASS and Triton. Going forward, we could support a better (but still sub-optimal) implementation by allowing the indexation of KeOps variables by other, integer-valued variables:

```python
a_ = LazyTensor(a.view(I, 1, K, 1))
b_ = LazyTensor(b.view(1, J, 1, K))
k_ = LazyTensor(torch.arange(K).view(1, 1, K, 1))
maxplus = (a_ + b_[k_]).max(dim=2)
print(maxplus.shape)  # (I, J, 1)
```

Performance would be OK if K < 100, and we could cut the problem into smaller chunks for larger values of K.

Best regards,
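For reference, a memory-chunked dense PyTorch fallback is already differentiable out of the box, since `torch.max` routes gradients to the argmax entries. This is only a baseline sketch, not part of KeOps: the helper name `maxplus_chunked` and the chunk size are illustrative.

```python
import torch

def maxplus_chunked(a, b, chunk=16):
    """Dense max-plus product C[i, j] = max_k (a[i, k] + b[j, k]).

    Processes `a` in row chunks so that only a (chunk, J, K) block is
    materialized at a time; torch.max routes gradients to the argmax
    entries, so the result is differentiable as-is.
    """
    out = []
    for a_block in a.split(chunk, dim=0):
        # (chunk, 1, K) + (1, J, K) -> (chunk, J, K), reduced over K.
        out.append((a_block[:, None, :] + b[None, :, :]).max(dim=-1).values)
    return torch.cat(out, dim=0)  # (I, J)

I, J, K = 1000, 2000, 3000
a = torch.randn(I, K, requires_grad=True)
b = torch.randn(J, K, requires_grad=True)
c = maxplus_chunked(a, b)
print(c.shape)      # torch.Size([1000, 2000])
c.sum().backward()  # gradients flow through the max
```

A custom CUTLASS or Triton kernel would avoid materializing the (chunk, J, K) block altogether, which is where the real speed-up lies.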
Hello,

```python
a_ = LazyTensor(a.view(I, 1, 1, K))
b_ = LazyTensor(b.view(1, J, 1, K))
maxplus = (a_ + b_).max(dim=3).sum(dim=2)
```

Here the summation along dim 2 does nothing, since that dimension has size 1; it is just a dummy reduction to convert the LazyTensor into an actual tensor. Trying it out, this option appears to be faster, but there seems to be an issue with this dummy summation that we need to investigate.
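One way to pin down whether the dummy summation returns correct values is to check this variant against a small dense reference. The KeOps calls below are exactly the ones from this comment; the sizes, the `.view(I, J)` flattening of trailing singleton dimensions, and the tolerance are illustrative assumptions.

```python
import torch
from pykeops.torch import LazyTensor

# Small sizes so that the dense (I, J, K) broadcast fits in memory.
I, J, K = 100, 200, 50
a = torch.randn(I, K)
b = torch.randn(J, K)

# Dense reference: C[i, j] = max_k (a[i, k] + b[j, k]).
ref = (a[:, None, :] + b[None, :, :]).max(dim=-1).values

# KeOps variant from this comment, with the dummy .sum(dim=2).
a_ = LazyTensor(a.view(I, 1, 1, K))
b_ = LazyTensor(b.view(1, J, 1, K))
maxplus = (a_ + b_).max(dim=3).sum(dim=2)

# maxplus is expected to hold I * J entries (shape (I, J, 1) above);
# if the dummy summation behaves, the two results should match.
print(torch.allclose(maxplus.view(I, J), ref, atol=1e-5))
```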
Hello @Jogima-cyber and @jeanfeydy,

Coming back to the current topic, I agree with Jean that implementing an indexation by other variables the way he proposes would be a very nice improvement, although I don't know if this can be done easily!
Hello! @jeanfeydy @joanglaunes, thank you very much for the time you spent trying several approaches to solve this issue! I tried @jeanfeydy's solution for my use case, and while it was quite slow, it was sufficient to test an idea. If the idea shows interesting results, I can always write an optimized CUDA kernel as suggested by @jeanfeydy. Thank you again for your time.
Hi @Jogima-cyber, you're very welcome! As far as I can tell, Taichi would be a good tool to implement such an operation efficiently. It is lower-level than KeOps (see e.g. this example that implements the same tiled reduction scheme as we do), but still much more readable than C++ and easier to deploy. Best of luck :-)
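For what it's worth, a minimal (untiled) Taichi version of this reduction could look like the sketch below: each thread scans the K axis for one output entry, which is far simpler than the tiled scheme mentioned above but shows the overall structure. Field names and sizes are illustrative assumptions.

```python
import taichi as ti
import torch

ti.init(arch=ti.gpu)  # use ti.cpu if no GPU is available

I, J, K = 1000, 2000, 3000
A = ti.field(dtype=ti.f32, shape=(I, K))
B = ti.field(dtype=ti.f32, shape=(J, K))
C = ti.field(dtype=ti.f32, shape=(I, J))

@ti.kernel
def maxplus():
    # Taichi parallelizes the outermost struct-for: one thread per (i, j).
    for i, j in C:
        best = A[i, 0] + B[j, 0]
        for k in range(1, K):
            best = ti.max(best, A[i, k] + B[j, k])
        C[i, j] = best

a = torch.randn(I, K)
b = torch.randn(J, K)
A.from_torch(a)
B.from_torch(b)
maxplus()
c = C.to_torch()  # (I, J) max-plus product
```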
Hi there! I'm currently looking for an autodiff approach to perform a max-plus matrix multiplication on CUDA:

C[i, j] = max_k (A[i, k] + B[k, j])

I've been experimenting with KeOps, but it seems it might not be the ideal tool for this task. Do you think there's a possibility of tweaking it to accommodate this specific operation?