WebGPURenderer performance trace analysis #32916

@mrdoob

Description

This week I learnt I can ask Claude to analyze Chrome DevTools performance traces.

I tried it with a 2 second trace of this example and these are the findings:


▗ ▗   ▖ ▖  Claude Code v2.1.29
           Opus 4.5 · Claude Max
  ▘▘ ▝▝    ~/Work/git/three.js

❯ Can you investigate Trace-20260201T095112.json.gz and see what's taking
  the most CPU time? The code for this is in the /src folder (WebGPURenderer).
  Let's see if we can find any possible performance improvements.

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | head -c 50000)
  ⎿  {"metadata": {
       "source": "DevTools",
       "startTime": "2026-02-01T00:51:12.270Z",
     … +874 lines (ctrl+o to expand)

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | jq -r '.traceEvents[] |
      select(.cat == "devtools.timeline" and .name == "FunctionCall") | "\(.dur
      // 0) \(.args.data.f…)
  ⎿  311556 update
     854
     120 MutationObserver.observe.childList

⏺ Let me get more detailed profiling data from the trace.

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | jq -r '.traceEvents[] |
      select(.name == "Profile" or .name == "ProfileChunk")' 2>/dev/null | head
      -c 100000)
  ⎿  {
       "args": {
         "data": {
     … +6502 lines (ctrl+o to expand)

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | jq -c '.traceEvents[] |
      select(.name == "ProfileChunk") | .args.data.cpuProfile.nodes[]? |
      select(.callFrame.url != nu…)
  ⎿  (No content)

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | jq '.traceEvents[] |
      select(.name == "ProfileChunk") | .args.data.cpuProfile.nodes' 2>/dev/null
       | jq -s 'add | .[] | s…)
  ⎿  (No content)

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | jq '[.traceEvents[] |
      select(.name == "ProfileChunk") | .args.data.cpuProfile.nodes // empty] |
      add | map(select(.call…)
  ⎿  (No content)

⏺ Read 1 file (ctrl+o to expand)

⏺ Bash(gunzip -c Trace-20260201T095112.json.gz | python3 -c "
      import json…)
  ⎿ Top functions by self time (microseconds):
    ================================================================================
    … +46 lines (ctrl+o to expand)


⏺ Excellent! Now let me map these hot functions to the source code in /src.

⏺ Searched for 7 patterns, read 2 files (ctrl+o to expand)

⏺ Let me also check the ChainMap and RenderList which are high on the list.

⏺ Searched for 1 pattern, read 1 file (ctrl+o to expand)

⏺ Let me check the Renderer's _renderObjectDirect method and getCacheKey which
  are also high on CPU time.

⏺ Searched for 1 pattern, read 1 file (ctrl+o to expand)

⏺ Based on my analysis of the trace file, here's a summary of the top CPU 
  consumers in the WebGPURenderer:

  Performance Analysis Results
  ┌────────────────┬────────────────────────┬───────────────────────┐
  │ Self Time (µs) │        Function        │       Location        │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 37,647         │ get (ChainMap)         │ ChainMap.js           │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 34,074         │ setPipelineAndBindings │ WebGPUBackend.js:1505 │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 22,994         │ _renderObjectDirect    │ Renderer.js:3301      │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 19,725         │ get (DataMap)          │ DataMap/other         │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 17,913         │ setVertexBuffer        │ WebGPU native         │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 13,202         │ updateMatrixWorld      │ Object3D.js           │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 11,141         │ sort                   │ RenderList.js:362     │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 10,693         │ draw                   │ WebGPUBackend.js:1486 │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 9,943          │ setBindGroup           │ WebGPU native         │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 9,033          │ needsRenderUpdate      │ WebGPUBackend.js:1799 │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 8,469          │ _projectObject         │ Renderer.js           │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 7,421          │ getCacheKey            │ RenderObject.js       │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 7,092          │ intersectsSphere       │ Frustum.js            │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 3,187          │ painterSortStable      │ RenderList.js:12      │
  ├────────────────┼────────────────────────┼───────────────────────┤
  │ 1,993          │ cyrb53                 │ hash function         │
  └────────────────┴────────────────────────┴───────────────────────┘
  Key Hotspots & Potential Improvements

  1. ChainMap get lookups (37.6ms) — Highest cost

  The ChainMap iterates through multiple nested WeakMaps per lookup. For scenes
  with many objects, this compounds.

  Potential improvement: Consider caching the result or using a simpler
  single-key lookup when possible (e.g., combining keys into a single object
  reference).
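  To make the lookup cost concrete, here is a minimal sketch of a chain of nested WeakMaps plus a one-entry cache in front of it. `NestedWeakMap` is a simplified stand-in and not the three.js `ChainMap` source, and `LastLookupCache` is a hypothetical wrapper invented for illustration. Note that the cache holds strong references to the last key tuple, which a real implementation would need to account for.

```javascript
// Simplified stand-in for a chain of nested WeakMaps (an assumption, not the three.js source).
class NestedWeakMap {
  constructor() {
    this.root = new WeakMap();
  }

  // One WeakMap lookup per key in the tuple.
  get(keys) {
    let map = this.root;
    for (let i = 0; i < keys.length - 1; i++) {
      map = map.get(keys[i]);
      if (map === undefined) return undefined;
    }
    return map.get(keys[keys.length - 1]);
  }

  set(keys, value) {
    let map = this.root;
    for (let i = 0; i < keys.length - 1; i++) {
      let next = map.get(keys[i]);
      if (next === undefined) {
        next = new WeakMap();
        map.set(keys[i], next);
      }
      map = next;
    }
    map.set(keys[keys.length - 1], value);
    return this;
  }
}

// Hypothetical wrapper: remembers the most recent key tuple and its value,
// so repeated lookups with the same keys skip the chain walk.
class LastLookupCache {
  constructor(inner) {
    this.inner = inner;
    this.lastKeys = null;   // strong references to the last keys
    this.lastValue = undefined;
    this.hits = 0;
  }

  get(keys) {
    const last = this.lastKeys;
    if (last !== null && last.length === keys.length) {
      let same = true;
      for (let i = 0; i < keys.length; i++) {
        if (last[i] !== keys[i]) { same = false; break; }
      }
      if (same) {
        this.hits++;
        return this.lastValue;
      }
    }
    const value = this.inner.get(keys);
    this.lastKeys = keys.slice();
    this.lastValue = value;
    return value;
  }

  set(keys, value) {
    this.inner.set(keys, value);
    this.lastKeys = null; // invalidate the cached entry
    return this;
  }
}

const map = new LastLookupCache(new NestedWeakMap());
const object = {}, material = {}, context = {};
map.set([object, material, context], 'renderObject');
map.get([object, material, context]); // walks the chain
map.get([object, material, context]); // served from the one-entry cache
```

  Whether this helps depends on how often consecutive lookups repeat the same keys, which the trace alone does not show.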

  2. needsRenderUpdate (9ms) — Expensive comparison

  At WebGPUBackend.js:1799-1828, there's a massive comparison of ~25 material
  properties every frame:
  if ( data.material !== material || data.materialVersion !== material.version ||
       data.transparent !== material.transparent || data.blending !== ... // 25+ comparisons

  Potential improvement: Use a single version number or hash on the material
  rather than comparing every property individually.
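  The version-number idea could look roughly like the sketch below. `TrackedMaterial` and `materialChanged` are hypothetical names, not three.js APIs. The sketch assumes every render-relevant property goes through a setter that bumps `version`; properties mutated without that bump would be missed, which may be why the per-property comparison exists in the first place.

```javascript
// Hypothetical material whose setters bump a version counter on every real change.
class TrackedMaterial {
  constructor() {
    this.version = 0;
    this._transparent = false;
    this._opacity = 1;
  }

  get transparent() { return this._transparent; }
  set transparent(value) {
    if (value !== this._transparent) {
      this._transparent = value;
      this.version++;
    }
  }

  get opacity() { return this._opacity; }
  set opacity(value) {
    if (value !== this._opacity) {
      this._opacity = value;
      this.version++;
    }
  }
}

// Returns true when the cached snapshot is stale, and refreshes the snapshot.
// Two comparisons replace the long property-by-property chain.
function materialChanged(data, material) {
  if (data.material === material && data.materialVersion === material.version) {
    return false;
  }
  data.material = material;
  data.materialVersion = material.version;
  return true;
}

const material = new TrackedMaterial();
const data = {};
const first = materialChanged(data, material);  // empty snapshot, update needed
const second = materialChanged(data, material); // nothing changed since last call
material.transparent = true;
const third = materialChanged(data, material);  // setter bumped the version
```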

  3. painterSortStable (3.2ms) — Called per comparison

  The sort comparator at RenderList.js:12-31 has 4 levels of nested if-else
  checks. With many objects, this adds up.

  Potential improvement: Pre-compute a single sort key (e.g.,
  (groupOrder << 24) | (renderOrder << 16) | z_encoded) to allow a single
  numeric comparison.
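  A minimal sketch of the packed sort key, under loud assumptions: `groupOrder` and `renderOrder` are integers in 0..255, and `z` is quantized to 16 bits between hypothetical `near` and `far` bounds. `makeSortKey` and `sortByKey` are illustrative names, not part of RenderList.js. Quantizing `z` loses precision, so nearly equal depths compare as ties and fall back to `id`; negative or larger `renderOrder` values would need a different encoding.

```javascript
const Z_MAX = 0xffff; // 16 bits reserved for the quantized depth

// Packs the three sort criteria into one unsigned 32-bit integer.
// Assumes groupOrder and renderOrder are integers in 0..255.
function makeSortKey(groupOrder, renderOrder, z, near, far) {
  const t = Math.min(Math.max((z - near) / (far - near), 0), 1);
  const zEncoded = Math.round(t * Z_MAX);
  // >>> 0 reinterprets the signed 32-bit result as unsigned.
  return ((groupOrder << 24) | (renderOrder << 16) | zEncoded) >>> 0;
}

// Computes each key once, then sorts with a single numeric comparison
// (plus an id tie-break to keep the order deterministic).
function sortByKey(items, near, far) {
  for (const item of items) {
    item.sortKey = makeSortKey(item.groupOrder, item.renderOrder, item.z, near, far);
  }
  return items.sort((a, b) => a.sortKey - b.sortKey || a.id - b.id);
}

const items = [
  { id: 1, groupOrder: 0, renderOrder: 1, z: 5 },
  { id: 2, groupOrder: 0, renderOrder: 0, z: 9 },
  { id: 3, groupOrder: 0, renderOrder: 0, z: 2 },
  { id: 4, groupOrder: 1, renderOrder: 0, z: 1 },
];
const order = sortByKey(items, 0, 10).map((item) => item.id);
```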

  4. getCacheKey (7.4ms)

  Cache key generation involves string concatenation, which is expensive.

  5. WebGPU Native Calls (28ms total)

  setVertexBuffer, setBindGroup, setIndexBuffer are already optimized by
  tracking current state to avoid redundant calls. This is working correctly.
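  For readers unfamiliar with the pattern, redundant-call tracking can be sketched as below. `CountingEncoder` is a mock standing in for a render pass encoder so the example runs outside a browser, and `StateTrackingPass` is a hypothetical wrapper, not the WebGPUBackend.js code. The sketch ignores buffer offsets and dynamic offsets, which a real tracker would also have to compare.

```javascript
// Mock encoder that only counts how many calls reach it.
class CountingEncoder {
  constructor() {
    this.calls = 0;
  }
  setVertexBuffer(slot, buffer) { this.calls++; }
  setBindGroup(index, bindGroup) { this.calls++; }
}

// Hypothetical wrapper: forwards a call only when the slot's value changes.
class StateTrackingPass {
  constructor(encoder) {
    this.encoder = encoder;
    this.vertexBuffers = []; // last buffer bound per slot
    this.bindGroups = [];    // last bind group bound per index
  }

  setVertexBuffer(slot, buffer) {
    if (this.vertexBuffers[slot] === buffer) return; // already bound, skip
    this.vertexBuffers[slot] = buffer;
    this.encoder.setVertexBuffer(slot, buffer);
  }

  setBindGroup(index, bindGroup) {
    if (this.bindGroups[index] === bindGroup) return; // already bound, skip
    this.bindGroups[index] = bindGroup;
    this.encoder.setBindGroup(index, bindGroup);
  }
}

const encoder = new CountingEncoder();
const pass = new StateTrackingPass(encoder);
const positions = { label: 'positions' };
const normals = { label: 'normals' };

pass.setVertexBuffer(0, positions); // forwarded
pass.setVertexBuffer(0, positions); // skipped, same buffer in slot 0
pass.setVertexBuffer(1, normals);   // forwarded, different slot
pass.setVertexBuffer(0, normals);   // forwarded, slot 0 changed
```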

  Would you like me to implement any of these optimizations? The most impactful
  would likely be:
  1. Simplifying needsRenderUpdate with a version/hash approach
  2. Optimizing the sort comparator with pre-computed keys

✻ Cooked for 2m 12s

/fyi @sunag @Mugen87

Solution

Reduce CPU usage.

Alternatives

Not reduce CPU usage.

Additional context

No response
