Currently, when we resolve a SceneEntityCfg with a set of body, joint, or site names, we end up with body_ids, joint_ids, etc. as a Python list[int]. We typically then do something like:
# Grab the positions of the joints I care about
entity.data.joint_pos[:, entity_cfg.joint_ids]
Because entity_cfg.joint_ids is a Python list, this indexing hits an implicit CUDA synchronization: https://docs.nvidia.com/dl-cuda-graph/torch-cuda-graph/sync-free-code.html#indexing-tensors (specifically the x_gpu[idx_list] # cuStreamSynchronize, implicit blocking .to() case). Every time we hit that line, the joint_ids list has to be converted into a brand-new tensor on the GPU, which adds an unnecessary allocation and a synchronization point.
We've been working around this by ditching the functional reward/observation format and caching self._joint_ids: Tensor in a reward/observation class. But it would be pretty nice not to have to do that.
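For reference, here is a minimal sketch of the workaround described above. The class name JointPosTerm and the stand-in joint_pos tensor are hypothetical and only illustrate the pattern; they are not part of the actual API. It falls back to the CPU when no GPU is available, so it only demonstrates that the two indexing forms return the same values, not the difference in synchronization.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for entity.data.joint_pos with shape (num_envs, num_joints).
joint_pos = torch.arange(4 * 6, dtype=torch.float32, device=device).reshape(4, 6)

# Stand-in for entity_cfg.joint_ids as it is resolved today.
joint_ids = [1, 3, 4]


class JointPosTerm:
    """Hypothetical class-based term that caches the index tensor once."""

    def __init__(self, joint_ids: list[int], device: torch.device) -> None:
        # One-time host-to-device copy, done at construction instead of per step.
        self._joint_ids = torch.tensor(joint_ids, dtype=torch.long, device=device)

    def __call__(self, joint_pos: torch.Tensor) -> torch.Tensor:
        # The index is already a tensor on the device, so no list is converted here.
        return joint_pos[:, self._joint_ids]


term = JointPosTerm(joint_ids, device)

from_list = joint_pos[:, joint_ids]  # list index: converted to a tensor on every call
from_tensor = term(joint_pos)        # cached tensor index
```

Both forms select the same columns; the cached version just pays the conversion cost once, at the price of needing a class to hold the state.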
Is there some performance trap that we're avoiding by using a plain list[int] here? Or could we switch this to tensors and get rid of these implicit CUDA syncs in common cases?
cc @bd-pdomanico who pointed this out originally.