Cythonize Buffer and MemoryResource classes for performance optimization #876
Conversation
@copilot please resolve the merge conflict
@copilot could you please squash-rebase against the latest main so that we keep what you have and only retain a single commit?
Completed the squash-rebase against latest main in commit 0b81edd. The branch now includes the latest main changes (05952a3) while preserving all the Buffer cythonization optimizations. The `_memory.pyx` file contains all the performance improvements: `cdef class Buffer`, direct field access, the `__del__` pattern, and cimports.
Force-pushed from 0b81edd to 93fd3cf.
Before this PR:

```
In [1]: from cuda.core.experimental import Device, DeviceMemoryResource

In [2]: Device().set_current()

In [3]: mr = DeviceMemoryResource(Device().device_id)

In [4]: %timeit mr.allocate(10)
4.98 μs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: import cupy as cp

In [8]: cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool("current").malloc)
<ipython-input-8-7357465d50b0>:1: FutureWarning: cupy.cuda.MemoryAsyncPool is experimental. The interface can change in the future.
  cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool("current").malloc)

In [9]: %timeit cp.empty(10, dtype=cp.int8)
3.79 μs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

With this PR, we're faster than CuPy 🎉

```
In [1]: from cuda.core.experimental import Device, DeviceMemoryResource

In [2]: Device().set_current()

In [3]: mr = DeviceMemoryResource(Device().device_id)

In [4]: %timeit mr.allocate(10)
3.66 μs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: import cupy as cp

In [6]: cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool("current").malloc)
<ipython-input-6-7357465d50b0>:1: FutureWarning: cupy.cuda.MemoryAsyncPool is experimental. The interface can change in the future.
  cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool("current").malloc)

In [7]: %timeit cp.empty(10, dtype=cp.int8)
3.77 μs ± 127 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

btw in the original issue #756 the benchmark was unfair, because the driver mempool was not used by CuPy as done above (cc @shwina). In fact, CuPy's own mempool is still faster as of today:

```
In [3]: %timeit cp.empty(10, dtype=cp.int8)
2.37 μs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

but it is not a problem that we can solve easily in either cuda.core or cccl-runtime without extra work. Certainly it is out of scope for this PR.
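For reference, CuPy's own pool is its default allocator; a minimal way to reproduce that last timing in a fresh session (the setup is not shown in the comment above, so this is an assumed sketch) would be:

```python
import cupy as cp

# CuPy's own memory pool is the default allocator; reinstall it explicitly
# in case MemoryAsyncPool was set earlier in the same session.
cp.cuda.set_allocator(cp.get_default_memory_pool().malloc)

# Then, in IPython: %timeit cp.empty(10, dtype=cp.int8)
```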
/ok to test 9ed0173
/ok to test e907c78
/ok to test 1b93d9e
This PR cythonizes the `_memory.py` module containing the `Buffer` and `MemoryResource` classes to address significant performance bottlenecks identified in memory allocation operations.

**Performance Issue**

As reported in the issue, `Buffer` allocation was substantially slower than equivalent operations. The bottleneck was identified as Python overhead in `Buffer._init` and related operations, particularly the use of `_MembersNeededForFinalize` with weakref finalizers.
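For context, the removed helper follows the general weakref-finalizer idiom. Here is a simplified sketch of that pattern (illustrative names and signatures only, not the actual cuda.core source):

```python
import weakref

class _MembersNeededForFinalize:
    """Simplified stand-in for the helper this PR removes."""

    __slots__ = ("ptr", "size", "mr")

    def __init__(self, buffer_obj, ptr, size, mr):
        self.ptr = ptr
        self.size = size
        self.mr = mr
        # The finalizer closes over this small helper rather than the
        # Buffer itself, so the Buffer can still be garbage collected;
        # the cost is an extra object plus weakref machinery per Buffer.
        weakref.finalize(buffer_obj, self.close)

    def close(self):
        if self.ptr is not None:
            self.mr.deallocate(self.ptr, self.size)
            self.ptr = None
```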
**Solution**

Properly converted `_memory.py` to `_memory.pyx` using `git mv` to preserve file history, followed by targeted Cython optimizations based on patterns from PR #709.

**Key Optimizations**
- Converted the `Buffer` class to a `cdef class` with direct C field access (`_ptr`, `_size`, `_mr`); see the sketch after this list
- Removed the `_MembersNeededForFinalize` helper class that used expensive weakref finalizers, replacing it with a direct `__del__` implementation following the Event/Stream patterns
- Used `cimport` for critical functions like `_check_driver_error` from `_utils.cuda_utils.pxd`
- Direct field access (`self._ptr`, `self._size`, `self._mr`) instead of indirection through helper objects
- Added `size_t` type annotations for size parameters to enable C-level optimizations
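A minimal sketch of the resulting class layout in Cython (field names follow the description above, but this is illustrative, not the actual `_memory.pyx` source):

```cython
# cython: language_level=3
# Simplified sketch of the cdef-class pattern described in this PR.

cdef class Buffer:
    cdef:
        object _ptr    # device pointer handle
        size_t _size   # size in bytes, stored as a C integer
        object _mr     # the MemoryResource that owns the allocation

    def __del__(self):
        # Direct __del__ (same pattern as Event/Stream) instead of a
        # weakref.finalize helper object.
        self.close()

    cpdef close(self):
        # Return the allocation to its memory resource exactly once.
        if self._ptr is not None and self._mr is not None:
            self._mr.deallocate(self._ptr, self._size)
            self._ptr = None
            self._mr = None
```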
**Implementation Approach**

- Used `git mv _memory.py _memory.pyx` first to preserve file history and create a reviewable diff
- Replaced the `weakref.finalize()` mechanism with a direct `__del__` method

**Expected Performance Improvements**

See the benchmarks in the conversation above: `mr.allocate(10)` improves from ~4.98 μs to ~3.66 μs per call.
**API Compatibility**

The cythonized implementation maintains 100% API compatibility.
This change directly addresses the performance bottlenecks identified in issue #658, particularly the `_MembersNeededForFinalize` overhead mentioned in the flame chart analysis, bringing cuda-python `Buffer` allocation performance closer to CuPy and direct CUDA bindings usage.

Fixes #756.