Description
I’d like to request two related enhancements to cuFINUFFT:
- **User-provided GPU memory allocation:**
  Allow passing user-managed GPU workspaces instead of performing internal `cudaMalloc`/`cudaFree` calls. This would enable seamless integration with frameworks like PyTorch or JAX, which maintain their own GPU memory pools and expect full control over allocation and deallocation.
- **Non-blocking, asynchronous execution:**
  Support fully asynchronous launches that avoid implicit CPU synchronizations (e.g., from hidden memory allocations or stream synchronizations). Frameworks like PyTorch and JAX rely on overlapping GPU execution with CPU-side scheduling — allowing the CPU to stay ahead and queue work — to minimize Python and other host-side overheads. Blocking behavior prevents these frameworks from efficiently pipelining GPU workloads.