Skip to content

Fixed StridedBufferView::to_numpy on Metal devices #207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

fangjunzhou
Copy link
Contributor

A draft fix for matrix buffer alignment issue mentioned in #206

This is an extremely naive implementation and might be slow as the buffer is copied in a for loop. Fix for copy_from_numpy is not implemented for now. This matrix alignment issue is also impacting BufferCursor.to_numpy and potentially Buffer.to_numpy() as well. A general fix for to_numpy alignment issue on Metal is required in the future. I'll keep looking into this issue.

The to_numpy() now produce correct 3x3 matrix buffers:

(base) ~/Documents/stanford/bvhgs (main ✗) pytest tests --log-cli-level=DEBUG --benchmark-skip
============================================================================================================= test session starts ==============================================================================================================
platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/fangjun/Documents/stanford/bvhgs
configfile: pyproject.toml
plugins: anyio-4.9.0, benchmark-5.1.0
collecting ...
------------------------------------------------------------------------------------------------------------- live log collection --------------------------------------------------------------------------------------------------------------
INFO     bvhgs:__init__.py:9 Slang device created: Device(
  type = metal,
  adapter_name = "default",
  adapter_luid = 00000000000000000000000000000000,
  enable_debug_layers = false,
  supported_shader_model = sm_6_0,
  shader_cache_enabled = false,
  shader_cache_path = ""
)
collected 11 items

tests/math/test_quaternion.py::test_quaternion_multiply[4] PASSED                                                                                                                                                                        [  9%]
tests/math/test_quaternion.py::test_quaternion_conjugate[4] PASSED                                                                                                                                                                       [ 18%]
tests/math/test_quaternion.py::test_quaternion_inverse[4] PASSED                                                                                                                                                                         [ 27%]
tests/math/test_quaternion.py::test_quaternion_from_axis_angle[4] PASSED                                                                                                                                                                 [ 36%]
tests/math/test_quaternion.py::test_quaternion_to_axis_angle[4] PASSED                                                                                                                                                                   [ 45%]
tests/math/test_quaternion.py::test_rotate_vector[4] PASSED                                                                                                                                                                              [ 54%]
tests/math/test_quaternion.py::test_quaternion_as_rotation_matrix[4]
---------------------------------------------------------------------------------------------------------------- live log call -----------------------------------------------------------------------------------------------------------------
DEBUG    test_quaternion:test_quaternion.py:206 [[[-0.43706563 -0.7254428   0.53170127]
  [ 0.754113    0.02661575  0.65620506]
  [-0.49019092  0.6877675   0.5354331 ]]

 [[ 0.87763655 -0.04220945  0.47746468]
  [ 0.41413367  0.5683256  -0.71098477]
  [-0.24134511  0.8217204   0.51626366]]

 [[-0.55375576 -0.4672291  -0.68923986]
  [ 0.07288799 -0.8517591   0.51883894]
  [-0.8294829   0.23707275  0.5057219 ]]

 [[ 0.03341324 -0.91043794  0.41229412]
  [ 0.9988376   0.04476     0.01789208]
  [-0.03474391  0.41121697  0.9108751 ]]]
DEBUG    test_quaternion:test_quaternion.py:207 [[[-0.43706562 -0.72544289  0.53170129]
  [ 0.75411305  0.02661575  0.65620509]
  [-0.49019094  0.68776756  0.53543312]]

 [[ 0.87763651 -0.04220944  0.47746467]
  [ 0.41413366  0.56832555 -0.7109848 ]
  [-0.2413451   0.82172041  0.5162636 ]]

 [[-0.55375578 -0.46722909 -0.68923981]
  [ 0.072888   -0.85175905  0.51883895]
  [-0.8294829   0.23707276  0.50572189]]

 [[ 0.03341322 -0.91043788  0.41229409]
  [ 0.99883753  0.04475997  0.01789208]
  [-0.0347439   0.41121698  0.9108751 ]]]
PASSED                                                                                                                                                                                                                                   [ 63%]
tests/math/test_quaternion.py::test_quaternion_as_rotation_matrix_benchmark[1024] SKIPPED (Skipping benchmark (--benchmark-skip active).)                                                                                                [ 72%]
tests/math/test_quaternion.py::test_quaternion_as_rotation_matrix_benchmark[2048] SKIPPED (Skipping benchmark (--benchmark-skip active).)                                                                                                [ 81%]
tests/math/test_quaternion.py::test_quaternion_as_rotation_matrix_benchmark[4096] SKIPPED (Skipping benchmark (--benchmark-skip active).)                                                                                                [ 90%]
tests/math/test_quaternion.py::test_quaternion_as_rotation_matrix_benchmark[8192] SKIPPED (Skipping benchmark (--benchmark-skip active).)                                                                                                [100%]

========================================================================================================= 7 passed, 4 skipped in 0.87s =========================================================================================================

Fixed 4 elements alignment issue on Metal devices.
@fangjunzhou

This comment was marked as outdated.

@fangjunzhou

This comment was marked as outdated.

Fix the issue in to_ndarra instead. This version should be more
efficient and flexible.
@fangjunzhou
Copy link
Contributor Author

fangjunzhou commented May 11, 2025

@kaizhangNV This new fix should solve the issue in a more elegant way. 3xn matrix buffers can be converted to numpy and torch correctly on macOS now. However, due to the alignment on metal devices, a continuous StridedBuffer will be converted to a non-continuous ndarray. And copy_from_numpy requires both ndarray and buffer to be continuous in current implementation. This can be problematic for us to refactor fix it with current implementation.

I don't have a clear idea of how to solve this issue. One possible simple solution would be removing SGL_CHECK(is_ndarray_contiguous(data), "Source Numpy array must be contiguous"); as the underlying buffer is not continuous anyway. But this requires the users to allocate matrix aligned numpy buffer manually on different platform which is confusing. Another idea is we implement a new version of copy_from_numpy with strided numpy array support from ground up and we implement platform specific matrix alignment inside slangpy_ext. This could also make it easier to send numpy array slice to StridedBufferView.

This is a demonstration of what would happen if we copy a numpy array to a float3x3 buffer on macOS currently:

image

Fixed matrix alignment stride by copying the data from ndarray to a
temperary buffer.
@fangjunzhou

This comment was marked as outdated.

@fangjunzhou fangjunzhou marked this pull request as ready for review May 11, 2025 01:03
@kaizhangNV
Copy link
Contributor

As I mentioned in discord, the fix should not be in SGL. As the alignment on Metal is known to be different from other platforms. So we recommend developers to use shader cursor to write/read data to GPU.

We provide helper functions for array and vector, but not matrix, but I think's it's trivial to just extend.

We should always avoid such special case handling on metal, as shader cursor already handles those alignment issue correctly. And users should already avoid using raw data directly.

@fangjunzhou fangjunzhou marked this pull request as draft May 12, 2025 03:42
@ccummingsNV
Copy link
Contributor

I've been thinking on this one for the past couple of days. The intelligence of these to_numpy style functions is the first thing we need to decide. There's a perfectly valid argument for saying they should detect when (regardless of platform) the user is attempting to do something that isn't a simple memcpy, and throw an exception explaining that the user would need to use BufferCursor to copy effectively.

The flip side is that you could argue our goal should be to be as 'cross platform' as possible, and if it works on one platform, we should do everything possible to make it work on the others.

If we were to address it in to_numpy, the proposed fixes are probably not the right ones. I'd suggest that we should fall back on a reflection based mechanism (potentially using buffer cursor) for this, as we can then aim to handle all none-trivial use cases.

For now I suggest we keep this branch around for reference, and modify the 2 functions to detect when the copy would return invalid data and throw an exception accordingly.

@fangjunzhou
Copy link
Contributor Author

fangjunzhou commented May 15, 2025

I've changed the to_numpy fix to reflection based mechanism this Monday. Now it's using buffer_type_layout to reflect the stride: 8e12752

I think the current version of to_numpy is doing what you proposed (using buffer cursor for reflection is unnecessary here as the buffer_desc already contains the layout reflection needed for stride calculation)

The only issue is copy_from_numpy is a little bit tricky to fix. Because it doesn't support non-continuous copying at all and all it's doing is just memcpy the entire buffer. My current fix for that is a little bit wacky but it's still migrated to reflection mechanism instead of platform dependent code.

You can take a look at my commits this Monday and decide if it's a good idea to do it (I think to_numpy fix is ready to go, but copy_from_numpy still needs more work)

possible copy_from_numpy fix

One thing I thought about fixing copy_from_numpy is this should be fairly easy in numpy. If we have a ndarray of size (b1, b2, n, 3), we should just expand the size to (b1, b2, n, 4) for alignment and copy the original buffer to the slice [:, :, :, :3] in this expanded array. The underlying buffer should now be correct. However, I'm not familiar with nanobind and I didn't find any document about how to do array slicing read and write in nanobind ndarray. If we are implementing sliced memcpy in c++, it might be a lot of work and we need to take a look at how numpy implement it.

@mkeshavaNV
Copy link

@kaizhangNV - Can we have another review pass here?

oliver-batchelor pushed a commit to oliver-batchelor/slangpy that referenced this pull request Jun 3, 2025
* Buffer+tensor use full type name for signature

* Try fixing pyright issue with declrefs
@ccummingsNV
Copy link
Contributor

I would still rather we favour simply throwing an exception for invalid operations, and to detect it efficiently. Whilst I can see the logic in where we've got, the suggested modifications will make all construction of buffers and tensors more expensive and complex, to handle float3x3 on Mac. That doesn't seem like a good trade off to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants