2 changes: 1 addition & 1 deletion benchmarking.html
@@ -15,7 +15,7 @@
window.location.protocol === "file:";
const script = document.createElement("script");
script.type = "module";
script.src = (isLocalhost ? window.location.origin : "https://gridwise-webgpu.github.io") + "/gridwise/benchmarking_chrome.mjs";
script.src = (isLocalhost ? window.location.origin : "https://gridwise-js.github.io") + "/gridwise/benchmarking_chrome.mjs";
document.body.appendChild(script);
// <script src="http://localhost:8000/gridwise/benchmarking_chrome.mjs" type="module"></script>
</script>
1 change: 0 additions & 1 deletion docs/Gemfile
@@ -15,7 +15,6 @@ gem "minima", "~> 2.5"
gem "github-pages", "~> 232", group: :jekyll_plugins
# If you have any plugins, put them here!
group :jekyll_plugins do
gem "jekyll-feed", "~> 0.12"
end

# Windows and JRuby does not include zoneinfo files, so bundle the tzinfo-data gem
63 changes: 63 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,63 @@
# Welcome to Jekyll!
#
# This config file is meant for settings that affect your whole blog, values
# which you are expected to set up once and rarely edit after that. If you find
# yourself editing this file very often, consider using Jekyll's data files
# feature for the data you need to update frequently.
#
# For technical reasons, this file is *NOT* reloaded automatically when you use
# 'bundle exec jekyll serve'. If you change this file, please restart the server process.
#
# If you need help with YAML syntax, here are some quick references for you:
# https://learn-the-web.algonquindesign.ca/topics/markdown-yaml-cheat-sheet/#yaml
# https://learnxinyminutes.com/docs/yaml/
#
# Site settings
# These are used to personalize your new site. If you look in the HTML files,
# you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
# You can create any custom variable you would like, and they will be accessible
# in the templates via {{ site.myvariable }}.

title: "Gridwise: WebGPU Compute Primitives in JavaScript"
email: [email protected]
description: >- # this means to ignore newlines until "baseurl:"
Gridwise defines WebGPU compute primitives using JavaScript.
baseurl: "/gridwise" # the subpath of your site, e.g. /blog

# Minima settings
show_excerpts: true

exclude:
- misc/
url: "https://github.com/gridwise-js/gridwise" # the base hostname & protocol for your site, e.g. http://example.com
github_username: gridwise

# Build settings
theme: minima
plugins: []

# Minima theme settings
minima:
social_links: []

rss: false

# Exclude from processing.
# The following items will not be processed, by default.
# Any item listed under the `exclude:` key here will be automatically added to
# the internal "default list".
#
# Excluded items can be processed by explicitly listing the directories or
# their entries' file path in the `include:` list.
#
# exclude:
# - .sass-cache/
# - .jekyll-cache/
# - gemfiles/
# - Gemfile
# - Gemfile.lock
# - node_modules/
# - vendor/bundle/
# - vendor/cache/
# - vendor/gems/
# - vendor/ruby/
37 changes: 37 additions & 0 deletions docs/_layouts/home.html
@@ -0,0 +1,37 @@
---
layout: default
---

<div class="home">
{%- if page.title -%}
<h1 class="page-heading">{{ page.title }}</h1>
{%- endif -%}

{{ content }}

{%- if site.posts.size > 0 -%}
<h2 class="post-list-heading">{{ page.list_title | default: "Posts" }}</h2>
<ul class="post-list">
{%- assign ordered_titles = "Gridwise Architecture,Gridwise WebGPU Object Caching Strategy,Gridwise WebGPU
Timing Strategy,Gridwise WebGPU Primitive Strategy wrt Subgroups,Gridwise WebGPU @builtins Strategy,Gridwise
WebGPU Subgroup Emulation Strategy,Abstraction Challenges in Writing a WebGPU/WGSL Workgroup Reduce
Function,Gridwise's Binary Operator Class,Gridwise Scan and Reduce,Gridwise Sort" | split: "," -%}
{%- for title in ordered_titles -%}
{%- assign found_post = site.posts | where: "title", title | first -%}
{%- if found_post -%}
<li>
<h3>
<a class="post-link" href="{{ found_post.url | relative_url }}">
{{ found_post.title | escape }}
</a>
</h3>
{%- if site.show_excerpts -%}
{{ found_post.excerpt }}
{%- endif -%}
</li>
{%- endif -%}
{%- endfor -%}
</ul>
{%- endif -%}

</div>
41 changes: 41 additions & 0 deletions docs/_posts/ 2025-08-12-timing-strategy.md
@@ -0,0 +1,41 @@
---
layout: post
title: "Gridwise WebGPU Timing Strategy"
date: 2025-08-12
---

Building a high-performance WebGPU primitive library requires careful timing measurements to ensure that primitives deliver high performance to their users. In developing Gridwise, we designed both CPU and GPU timing methodologies and describe them here.

## GPU Timing

Gridwise's GPU timing uses WebGPU's GPU timestamp queries. Gregg Tavares's "[WebGPU Timing Performance](https://webgpufundamentals.org/webgpu/lessons/webgpu-timing.html)" is both an excellent overview of timestamp queries and the basis for our implementation. We began with his [timing helper](https://webgpufundamentals.org/webgpu/lessons/webgpu-timing.html#a-timing-helper) and augmented it to support calling multiple kernels (which is useful for benchmarking purposes).

In our implementation, the `Primitive` class has a static `__timingHelper` member that is initialized to `undefined`. The timing helper is enabled by passing `enableGPUTiming: true` as a member of the argument object to `primitive.execute()` (i.e., `primitive.execute({ enableGPUTiming: true, ... })`).

When GPU timing is enabled in this way, `primitive.execute` checks whether `__timingHelper` is defined and, if not, initializes one, counting the number of kernels in the current primitive for the timing helper's initialization. As explained in the [timing helper](https://webgpufundamentals.org/webgpu/lessons/webgpu-timing.html#a-timing-helper)'s documentation, the timing helper replaces `beginComputePass` with its own version that attaches the timestamp queries to the pass.

We record a separate timing value for each kernel in the primitive and thus return a list of kernel time durations, one list element per kernel.

`primitive.execute`'s argument object also takes a `trials` argument that defaults to 1. If `trials` is _n_, then every kernel dispatch within the primitive is replaced with _n_ consecutive dispatches. Running more trials yields more reliable timing measurements.
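To illustrate the effect of `trials`, here is a small sketch (the function name and structure are illustrative, not Gridwise's real internals): each kernel dispatch becomes `trials` consecutive dispatches, while the timing helper still records one duration entry per kernel.

```js
// Illustrative sketch: how `trials` multiplies dispatches.
// A primitive with kernelCount kernels run with `trials` = n
// issues kernelCount * n dispatches in total.
function dispatchPlan(kernelCount, trials = 1) {
  const plan = [];
  for (let k = 0; k < kernelCount; k++) {
    for (let t = 0; t < trials; t++) {
      plan.push({ kernel: k, trial: t });
    }
  }
  return plan;
}

// A 3-kernel primitive run with trials = 10 issues 30 dispatches.
console.log(dispatchPlan(3, 10).length); // 30
```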

## CPU timing

CPU timing is only possible if the primitive creates its own encoder, ensuring that the resulting command buffer reflects only the work within the primitive. It is enabled in the same way as GPU timing (`primitive.execute({ enableCPUTiming: true, ... })`). In our internal discussions, we settled on the following strategy for measuring CPU timing, which is what Gridwise implements, though we are open to better suggestions here: clear the device's queue, record the current time as `cpuStartTime`, submit the command buffer and wait for the queue to clear, then record the current time as `cpuEndTime`. The elapsed time is the difference between the two CPU timestamps.

```js
const commandBuffer = encoder.finish();
if (args?.enableCPUTiming) {
await this.device.queue.onSubmittedWorkDone();
this.cpuStartTime = performance.now();
}
this.device.queue.submit([commandBuffer]);
if (args?.enableCPUTiming) {
await this.device.queue.onSubmittedWorkDone();
this.cpuEndTime = performance.now();
}
```

## Returning timing information

The primitive object has an `async getTimingResult` member function that returns the CPU and GPU timing results as `const { gpuTotalTimeNS, cpuTotalTimeNS } = await primitive.getTimingResult();`. For GPU timing, this call returns a list of total elapsed times in ns, one per kernel. For CPU timing, it returns a single elapsed time for the entire primitive in ns. Neither accounts for `trials`, so the caller of `getTimingResult` should divide any returned values by the number of trials.
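The per-trial normalization might be sketched as follows (the helper name is ours, not part of Gridwise's API):

```js
// Hedged sketch: normalize getTimingResult output by `trials`.
// `gpuTotalTimeNS` is a per-kernel list; `cpuTotalTimeNS` is a scalar.
function perTrial({ gpuTotalTimeNS, cpuTotalTimeNS }, trials) {
  return {
    gpuTimeNS: gpuTotalTimeNS.map((t) => t / trials),
    cpuTimeNS: cpuTotalTimeNS / trials,
  };
}

// e.g., with two kernels timed over 2 trials:
console.log(perTrial({ gpuTotalTimeNS: [100, 200], cpuTotalTimeNS: 600 }, 2));
// { gpuTimeNS: [ 50, 100 ], cpuTimeNS: 300 }
```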
67 changes: 67 additions & 0 deletions docs/_posts/ 2025-08-21-primitive-design.md
@@ -0,0 +1,67 @@
---
layout: post
title: "Gridwise WebGPU Primitive Strategy wrt Subgroups"
date: 2025-08-21
---

We wish to implement WebGPU sort, scan, and reduce. The fastest known GPU techniques for these operations are single-pass (chained) algorithms that minimize overall memory bandwidth. However, these techniques have historically been written using warp/subgroup support, and that subgroup support appears to be critical for their high performance. This document examines the considerations behind different primitive implementation strategies.

## Brief overview of single-pass (chained) algorithm

Most GPU algorithms are limited by memory bandwidth. This is also true of sort, scan, and reduce. The fastest algorithms are those that require the least memory traffic. This description is high-level and elides many details.

The single-pass (chained) algorithms work as follows:

- The input is divided into tiles. We launch one workgroup per tile. Each workgroup can process its entire tile in parallel and in its own local workgroup memory.
- Each tile maintains a globally visible region of memory where it posts its results. Those results can be in one of three states: “invalid”, “local reduction” (reflecting a result that is only the function of the current tile’s input), and “global reduction” (reflecting a result that is a function of the current tile’s input AND all previous tiles’ inputs).
- Each tile follows these steps:
- Consume its input tile and post a globally visible “local reduction” of that input.
- Compute a globally visible “global reduction”. The global reduction is the result of ALL tiles up to and including the current tile. In practice, this is computed by fetching the previous tile’s global reduction result and combining it with the current tile’s local reduction result.
- Post that globally visible global reduction result to globally visible memory.
- This approach computes tiles in parallel, but then serializes the reduction of the results of each of those tiles. Serializing this reduction seems like a performance bottleneck. However, the reduction operation itself is cheap, and all inputs and outputs for this reduction operation are likely in cache. In practice, this reduction is not a bottleneck on modern GPUs.
- Refinements to this process include:
- If your workgroup is waiting on the previous tile’s global reduction result, look farther back in the serialization chain to aggressively accumulate enough results for your tile to reconstruct the global summary your workgroup needs. (“Lookback”)
- If your workgroup is waiting on the previous tile’s local reduction to post, just (redundantly) recompute that local reduction in your own workgroup. (“Fallback”)

The benefit to a single-pass (chained) algorithm is that it requires a minimum of global memory traffic: each input element is only read once, each output element is only written once, and intermediate memory traffic is negligible compared to reads/writes of input/output elements.
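The data flow above can be simulated serially on the CPU. This is a sketch, not Gridwise code: real GPU implementations run one workgroup per tile with atomically published tile records, while here the three-state records are updated in a simple loop to show how local and global reductions chain.

```js
// CPU simulation of the chained (single-pass) reduction described above.
// Each tile record is in one of three states.
const INVALID = 0, LOCAL = 1, GLOBAL = 2;

function chainedReduce(input, tileSize) {
  const numTiles = Math.ceil(input.length / tileSize);
  const records = Array.from({ length: numTiles }, () => ({
    state: INVALID, value: 0,
  }));
  for (let t = 0; t < numTiles; t++) {
    // Step 1: consume the input tile and post a "local reduction".
    const tile = input.slice(t * tileSize, (t + 1) * tileSize);
    records[t] = { state: LOCAL, value: tile.reduce((a, b) => a + b, 0) };
    // Step 2: fetch the previous tile's global reduction (in this serial
    // simulation it is always already posted; on a GPU the workgroup may
    // have to wait, look back, or fall back).
    const prev = t > 0 ? records[t - 1].value : 0;
    // Step 3: post the global reduction for this tile.
    records[t] = { state: GLOBAL, value: prev + records[t].value };
  }
  return records[numTiles - 1].value;
}

console.log(chainedReduce([1, 2, 3, 4, 5, 6, 7], 3)); // 28
```

Note that each input element is read exactly once; the only other traffic is the (cache-friendly) tile records.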

### How can we make chained algorithms fast?

The most important implementation focus in these algorithms is to ensure that the entire implementation is memory-bound. The computation per tile typically requires reading a tile of data from memory, computing results based on that input tile data, and writing results back to memory. For maximum overall throughput, the computation throughput must be faster than the memory throughput. In practice, on today’s GPUs, this requires careful kernel design with respect to computation; simple kernels are likely to become performance bottlenecks. Kernels with less memory demand (specifically, reduction kernels) are especially performance-sensitive because there is less memory traffic to cover the cost of the computation.

Specifically, it appears likely that kernels must take advantage of subgroup instructions to achieve sufficient throughput. Without these primitives, kernels require numerous workgroup barriers that inhibit performance. Subgroup instructions are particularly challenging because different hardware has different subgroup sizes; writing subgroup-size-agnostic kernels is complex.

## Design choice: Use subgroups \+ emulation vs. no-subgroups

### Use subgroup instructions everywhere, emulate subgroup instructions in software where not available

- \+: One code base (easier maintenance)
- –: Emulated code is not likely to be performance-competitive
- –: Current subgroup support is fragile

### Never use subgroups anywhere

- \+: Most portable
- –: Unlikely to deliver top performance

## Design choice: Use chained algorithms vs. hybrid vs. not

### Always use chained algorithms

- \+: Likely to be highest-performance option
- \+: Maintain only one code base
- –: Most complex implementation
- –: Unlikely to deliver good performance without subgroups, which present definite fragility challenges in the chained context

### Hybrid approach: sometimes use chained algorithms, sometimes don’t

- \+: Allows the most flexible tradeoffs between performance and capabilities
- –: Must maintain two different code bases (little overlap)
- –: Little ability to specialize beyond “has subgroups vs. no subgroups”

### Never use chained algorithms

- \+: Well-known and well-tested implementation strategy
- \+: Maintain one code base
- \+: Simplest code
- –: Will not achieve top performance (theoretically, ⅔ the performance of chained approaches)