
WebGPU fitness for ML frameworks #66

Open
dontcallmedom opened this issue Aug 19, 2020 · 13 comments
Labels
Discussion topic (Topic discussed at the workshop) · Opportunities and Challenges (Opportunities and Challenges of Browser-Based Machine Learning)

Comments

@dontcallmedom
Member

@jasonmayes raises the question of whether WebGPU exposes the right API surface to support ML frameworks' interactions with GPUs.

@jasonmayes, do you have a list of specific asks from the TFJS experience?

@grorg @tidoust any insights on this?

@tidoust
Member

tidoust commented Aug 19, 2020

Cc @Kangz

I'm afraid I don't have any insights on this for now.

@Kangz

Kangz commented Aug 19, 2020

WebGPU provides compute shaders. By themselves, these allow using "shared workgroup memory", which is nice but not the best you can do in native GPU ML today. Next are subgroup operations, which could become a WebGPU extension and which some people are already looking at. And finally there's cooperative matrix multiply (marketed as "tensor cores" by NVIDIA): it might become a WebGPU extension if it gains support from more than one hardware vendor.
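
For concreteness, a minimal sketch of the "shared workgroup memory" style of reduction that compute shaders already allow, written in current WGSL (which postdates this discussion) embedded in a TypeScript string; layout and sizes are illustrative:

```ts
// Tree reduction in shared workgroup memory: the baseline technique that
// WebGPU compute shaders enable without any extensions.
const reduceWGSL = /* wgsl */ `
var<workgroup> scratch : array<f32, 64>;

@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partials : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) lid : u32,
        @builtin(workgroup_id) wid : vec3<u32>,
        @builtin(global_invocation_id) gid : vec3<u32>) {
  scratch[lid] = input[gid.x];
  workgroupBarrier();
  // Halve the number of active invocations each step, summing pairs.
  for (var stride = 32u; stride > 0u; stride = stride / 2u) {
    if (lid < stride) {
      scratch[lid] = scratch[lid] + scratch[lid + stride];
    }
    workgroupBarrier();
  }
  if (lid == 0u) {
    partials[wid.x] = scratch[0]; // one partial sum per workgroup
  }
}`;
```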

@dontcallmedom
Member Author

dontcallmedom commented Aug 19, 2020

Thanks @Kangz, very useful! Here is the link to the current discussion on subgroup operations, for others' benefit.

@jasonmayes

So one thing we wanted to find out is whether there is a way to have garbage collection for GPU-related activities too, much like JS currently has. Right now we provide TF.tidy() to somewhat deal with releasing memory when finished, but newer users take time to realise it exists. It would be better if this were consistent with how JS generally functions: most JS devs do not even think about memory management, as they are used to the JS garbage collector doing its thing.
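
For reference, a minimal sketch of the TF.tidy() pattern described above:

```ts
// TensorFlow.js: tf.tidy() disposes intermediate tensors when the callback
// returns, standing in for the GC that GPU memory does not get.
import * as tf from '@tensorflow/tfjs';

const result = tf.tidy(() => {
  const a = tf.randomNormal([1024, 1024]); // small JS object, large GPU allocation
  const b = a.square();                    // intermediate tensor
  return b.sum();                          // only the returned tensor survives
});

// a and b were disposed when tidy() returned; the result must still be
// released explicitly, since the JS garbage collector will not do it.
result.dispose();
```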

@jasonmayes

Adding @pyu10055 @dsmilkov @nsthorat @annxingyuan @tafsiri @lina128 (TF.js team) for any wish-list items related to this topic.

@Kangz

Kangz commented Aug 20, 2020

> So one thing we wanted to find out is whether there is a way to have garbage collection for GPU-related activities too, much like JS currently has. [...]

There isn't really a way to do automatic GC of GPU resources, and you can see this in WebGL's gl.deleteX, WebGPU's resource.destroy(), or even ImageBitmap.close(). That's because a very small number of JS objects can hold on to large amounts of GPU memory. Either the GC knows about this and runs often to try to reclaim that memory (bad for realtime applications and overall perf), or it just sees the small JS objects and lets the GPU objects leak. It's also not possible to trigger the JavaScript GC when the GPU runs out of memory, for many reasons, including that GPU objects live in a different process and can't round-trip to JS to ask for the GC to run.
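
To illustrate the explicit-destroy pattern this implies, a minimal WebGPU sketch (sizes and usage flags are illustrative):

```ts
// A small JS wrapper object can pin a large GPU allocation that the
// garbage collector never sees.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const buffer = device.createBuffer({
  size: 256 * 1024 * 1024, // 256 MiB of GPU memory behind a tiny JS object
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

// ... record and submit work that uses the buffer ...

// The application must release the GPU memory deterministically.
buffer.destroy();
```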

@anssiko
Member

anssiko commented Aug 26, 2020

Newly added is a "SIMD operations in WebGPU for ML" talk by @mehmetoguzderin discussing the proposed subgroup operations @Kangz mentioned.

@mehmetoguzderin, feel free to provide your further perspectives in this issue for the workshop discussions. Please also review the other workshop talks relevant to this issue, as well as the WebNN API spec and its open issues. In particular, see WebNN API issue webmachinelearning/webnn#6, which discusses custom ops using lower-level APIs such as WebGPU.

@mehmetoguzderin

mehmetoguzderin commented Aug 26, 2020

@anssiko WebNN is very interesting; I will have a look at it. And I will provide input in this repository on anything workshop-related. Thanks for the mention.

@anssiko anssiko added the Opportunities and Challenges Opportunities and Challenges of Browser-Based Machine Learning label Sep 3, 2020
@mehmetoguzderin

Sample code that uses SIMD operations is now available in the repository for my talk. For the speed benchmark chart that compares SIMD to alternative methods, please check out the main README.md; for the code itself, please check out the samples folder. (The code is written in Vulkan and GLSL, but it is structured clearly enough to give a general idea.)
https://github.com/mehmetoguzderin/webgpu-20200828-simdgroup

@wchao1115

wchao1115 commented Sep 12, 2020

I'd like to offer a different take in response to @jasonmayes' question in his talk:

> What lower level support do we need for efficient ML when using the graphics card?

As we know, most meaningful ML acceleration is rooted in the underlying hardware, and the work to surface such capabilities has been concentrated in the OS layer, where the platform tech and the hardware drivers actually meet. This is true for Windows, Linux, Android, and macOS. It is done this way because the hardware in the ecosystem is diverse, and hardware abstraction is a problem the OS is very good at solving.

WebNN is designed to provide an ML-specific path for the web platform to leverage OS-native ML functionality that makes use of this hardware acceleration in a more consistent and manageable way. So instead of relying on low-level, general-purpose compute constructs such as WebGL or WebGPU shaders, an ML framework could leverage native ML constructs more directly through an ML-specific web API like WebNN, letting it carry out platform-specific acceleration in the OS layer under the hood.

In the case of DirectML, in addition to providing a highly optimized version of the compute-based ML implementation, being an OS component also lets it use fine-grained interaction with the underlying compute drivers in the OS stack to maximize runtime performance and reduce latency; when appropriate, it provides shortcuts to operation-specific hardware capabilities based on their availability. As discussed in my talk, we've so far been reasonably successful with the integration of DirectML into both ONNX and TensorFlow. DirectML functionality can be mapped through WebNN.
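
For a sense of what this looks like to a framework, a hedged sketch of the graph-builder approach WebNN proposes; the API surface was still in flux at the time, so the names below follow the MLGraphBuilder pattern in the spec drafts and may differ in detail:

```ts
// Build and compile a tiny graph (C = A x B) once; the browser can hand the
// whole graph to the OS ML runtime (e.g. DirectML on Windows) instead of
// executing individual general-purpose shaders.
const context = await navigator.ml.createContext({ deviceType: 'gpu' });
const builder = new MLGraphBuilder(context);

const A = builder.input('A', { dataType: 'float32', shape: [2, 4] });
const B = builder.input('B', { dataType: 'float32', shape: [4, 2] });
const C = builder.matmul(A, B);

const graph = await builder.build({ C });
```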

@jeffhammond
Collaborator

During the Zoom call, I asked whether subgroups were the right level at which to seek portability, and whether it might be better to target a DSL like Halide or MLIR as the portable abstraction layer.

The challenges of making anything at the level of OpenCL subgroups portable are:

  1. the long-standing differences in how SIMD and SIMT are implemented in CPU and GPU hardware, and the lack of consistency in e.g. shuffle instructions.
  2. the introduction of multidimensional SIMD instructions, e.g. NVIDIA Tensor Cores, Intel AMX and Apple AMX.

At least for some ML workloads, the second category is more useful and a better target than vector operations.

Background

  • Halide is a Domain-Specific Language (DSL) for image processing and other data-parallel computations, including neural network operations (e.g. https://people.csail.mit.edu/tzumao/gradient_halide/).
  • LIBXSMM is a library for small matrix multiplication and convolution operations that uses lightweight just-in-time (JIT) compilation to generate optimal code for each supported architecture. It was created by Intel and is focused on the AVX-512 and AMX instruction sets.
  • OpenCL subgroups shows the subgroup interface Intel/Khronos added to OpenCL. Even though OpenCL is portable, usage of this API is hardware-dependent, which is one of my motivations for wondering whether a higher-level API would be better.
  • Apple AMX (sorry I cannot find official documentation yet) is a set of CPU-based matrix extensions.

@mehmetoguzderin

Thanks a lot for the feedback, @jeffhammond. An essential aspect of the SIMD proposal for WebGPU is the restricted set of operations it exposes. For example, shuffle operations and indexed accesses don't exist at all; this stems from the concerns they raise, and from the fact that not all target native APIs have those operations.

As demonstrated in the sample I provided for this workshop, even with a safer subset that requires uniform control flow, the performance gain can approach 10x. As people said on the call, they want GPU execution time to be as short as possible, especially on embedded or mobile targets, and SIMD operations enable that for very realistic use cases such as exploratory data analysis. The rougher terrain of these operations is not that extreme (some driver bugs exist), given that atomics and writeable buffers are already available in WebGPU. I believe that if they are available in the MVP, people who work on fantastic higher-level abstractions similar to Halide will squeeze the benefit out of SIMD operations and pass it on to users who can't invest the time to work on SIMD reductions themselves. But even for those users, SIMD operations bring a benefit: for reductions, atomic operations only work on integers, whereas SIMD operations give access to more types, and they outperform atomics even on integers (see the sketch below).

I think exposing tensor-core-style functionality is independent of the SIMD operations discussion, because those units are much more recent and their API surface is somewhat different.
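
As a point of comparison with the shared-memory reduction sketched earlier in this thread, a minimal sketch of a subgroup reduction (not the talk's actual sample), assuming the WGSL subgroup built-ins land roughly as proposed; the indexing assumes a linear mapping of invocations to subgroups, which is a simplification:

```ts
const subgroupReduceWGSL = /* wgsl */ `
enable subgroups;

@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partials : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(subgroup_invocation_id) lane : u32,
        @builtin(subgroup_size) sg_size : u32) {
  // One instruction replaces a whole shared-memory reduction tree, and it
  // works on f32, which integer-only atomics cannot reduce.
  let sum = subgroupAdd(input[gid.x]);
  if (lane == 0u) {
    partials[gid.x / sg_size] = sum; // lane 0 writes one partial per subgroup
  }
}`;
```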

@kvark

kvark commented Sep 16, 2020

For a structured capture of the WebGPU debate on subgroups, one can also have a look at the argdown-plain and argdown-component views.

@anssiko anssiko added this to the 2020-09-16 Live Session #1 milestone Sep 17, 2020
@dontcallmedom dontcallmedom added the Discussion topic Topic discussed at the workshop label Oct 9, 2020