
Conversation


@thowell thowell commented Dec 16, 2025

This PR is part of the effort to implement sparse Jacobians (#88).

When is_sparse==True, instead of iterating over all dofs to construct contact constraints, we traverse the dof tree for each contact body.

mujoco reference: https://github.com/google-deepmind/mujoco/blob/08b4b4144d70c69206f96cf329d5044ae686a1e6/src/engine/engine_core_util.c#L55
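For intuition, here is a minimal pure-Python sketch of the traversal idea (not the Warp kernel in this PR; array names follow MuJoCo's mjModel convention, where `dof_parentid[i]` is the parent of dof i or -1 at the root, and the tiny model below is hypothetical):

```python
# Sketch of dof-tree traversal, assuming MuJoCo-style model arrays.
# Rather than looping over all nv dofs, walk only the ancestor chain
# of the dofs belonging to a contact body.

def ancestor_dofs(dof_parentid, body_dofadr, body_dofnum, body):
    """Return the dofs that can affect `body`, ordered leaf to root."""
    dofs = []
    # start at the body's last dof and follow parent links to the root
    i = body_dofadr[body] + body_dofnum[body] - 1
    while i >= 0:
        dofs.append(i)
        i = dof_parentid[i]
    return dofs

# Hypothetical 5-dof model: dofs 3 and 4 both branch off dof 2.
dof_parentid = [-1, 0, 1, 2, 2]
body_dofadr = [0, 1, 2, 3, 4]   # one dof per body
body_dofnum = [1, 1, 1, 1, 1]

print(ancestor_dofs(dof_parentid, body_dofadr, body_dofnum, 4))
# → [4, 2, 1, 0]  (dof 3, on the other branch, is skipped)
```

Only the dofs on the kinematic chain of the contact body contribute nonzero Jacobian entries, which is why the traversal pays off in the sparse case.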


humanoid

performance for dense should be unchanged

mjwarp-testspeed ./benchmark/humanoid/humanoid.xml --nconmax=24 --njmax=64 --nworld=8192 --event_trace=True

this pr:

Summary for 8192 parallel rollouts

Total JIT time: 0.33 s
Total simulation time: 2.96 s
Total steps per second: 2,767,435
Total realtime factor: 13,837.18 x
Total time per step: 361.35 ns
Total converged worlds: 8192 / 8192

step: 359.65
  forward: 357.14
    fwd_position: 89.85
      kinematics: 16.37
      com_pos: 5.85
      camlight: 1.75
      flex: 0.17
      crb: 13.22
      tendon_armature: 0.17
      collision: 9.35
        nxn_broadphase: 3.71
        convex_narrowphase: 0.17
        primitive_narrowphase: 4.57
      make_constraint: 39.00

main (bb81495):

Total JIT time: 0.32 s
Total simulation time: 2.96 s
Total steps per second: 2,767,848
Total realtime factor: 13,839.24 x
Total time per step: 361.29 ns
Total converged worlds: 8192 / 8192

step: 359.55
  forward: 357.04
    fwd_position: 89.84
      kinematics: 16.36
      com_pos: 5.85
      camlight: 1.75
      flex: 0.17
      crb: 13.23
      tendon_armature: 0.17
      collision: 9.35
        nxn_broadphase: 3.71
        convex_narrowphase: 0.18
        primitive_narrowphase: 4.57
      make_constraint: 38.99

performance for sparse should be improved

mjwarp-testspeed ./benchmark/humanoid/humanoid.xml --nconmax=24 --njmax=64 --nworld=8192 --event_trace=True -o "opt.is_sparse=True"

this pr:

Total JIT time: 0.81 s
Total simulation time: 3.10 s
Total steps per second: 2,641,695
Total realtime factor: 13,208.47 x
Total time per step: 378.54 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 376.86
  forward: 374.33
    fwd_position: 77.87
      kinematics: 16.36
      com_pos: 5.84
      camlight: 1.74
      flex: 0.17
      crb: 9.98
      tendon_armature: 0.17
      collision: 9.34
        nxn_broadphase: 3.71
        convex_narrowphase: 0.17
        primitive_narrowphase: 4.57
      make_constraint: 30.65

main (bb81495):

Total JIT time: 0.25 s
Total simulation time: 3.17 s
Total steps per second: 2,586,971
Total realtime factor: 12,934.85 x
Total time per step: 386.55 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 384.86
  forward: 382.33
    fwd_position: 86.29
      kinematics: 16.36
      com_pos: 5.84
      camlight: 1.74
      flex: 0.17
      crb: 9.99
      tendon_armature: 0.17
      collision: 9.34
        nxn_broadphase: 3.71
        convex_narrowphase: 0.17
        primitive_narrowphase: 4.56
      make_constraint: 38.73

notes:

  • performance should be further improved once efc.J is represented in a sparse format and it is not necessary to zero memory on each call to make_constraint
  • replacing repeated code in contact_pyramidal and contact_elliptic with a wp.func appeared to introduce overhead; as a result, to maintain performance there is duplicated code for the dense and sparse cases for now.

todo

  • improve tree traversal performance with dense

@adenzler-nvidia

I think the tree traversal strategy could be faster even with dense Jacobians. What do you think? We would need to zero the memory in that case, but I'm sure we can find a good way to do it.


thowell commented Dec 19, 2025

@adenzler-nvidia yes, I think the tree traversal could be faster with dense as well. Added a todo for making the dense version performant with tree traversal.

@erikfrey

@thowell usually I'm in favor of incremental changes and TODOs, but in this case we're adding some complexity (cache_kernel, wp.static, etc.) that we might remove if it turns out that dof tree traversal makes sense for both dense and sparse.

Would you mind having a go at seeing whether it helps in both cases and if so we can simplify the changes in this PR?

In general I'm not a huge fan of cache_kernel, nested_kernel - I really try to use them sparingly when there's no other choice.

@thowell

thowell commented Jan 5, 2026

@erikfrey I think ultimately all of the constraint functions will need this complexity in order to support dense and sparse without additional overhead; see #934

@thowell

thowell commented Jan 10, 2026

One reason why it might not make sense to perform dof traversal for dense: the efc.J row needs the elements not visited by the traversal to be zeroed, and the traversal plus zeroing might be more expensive than simply iterating over all dofs. In the sparse case it is not necessary to zero elements.
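To make the write-count argument concrete, here is a minimal sketch (hypothetical sizes and values, plain Python rather than the Warp kernels) of filling one efc.J row both ways:

```python
# Dense vs. sparse writes for one contact-constraint Jacobian row.
# Dense: all nv entries must end up written — dofs not on the contact
# body's ancestor chain still need explicit zeros, so tree traversal
# plus zeroing can cost more than a plain loop over all dofs.
# Sparse: only the visited dofs are stored, and nothing is zeroed.

nv = 6
chain = [5, 3, 1, 0]             # dofs visited by tree traversal
values = [0.5, -0.2, 1.0, 0.3]   # hypothetical Jacobian entries

# dense row: nv writes no matter how short the chain is
row_dense = [0.0] * nv
for d, v in zip(chain, values):
    row_dense[d] = v
print(row_dense)   # the zeros at dofs 2 and 4 had to be written too

# sparse row: len(chain) index/value pairs, no zeroing required
row_idx, row_val = list(chain), list(values)
print(row_idx, row_val)
```

With a short chain and a large nv the sparse representation touches far less memory per row, which matches the make_constraint timings above.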

