
WebGPU fitness for ML frameworks #66

Open
dontcallmedom opened this issue Aug 19, 2020 · 13 comments
Labels
Discussion topic (Topic discussed at the workshop) · Opportunities and Challenges (Opportunities and Challenges of Browser-Based Machine Learning)

Comments

@dontcallmedom
Member

@jasonmayes raises the question of whether WebGPU exposes the right API surface to support ML frameworks' interactions with GPUs.

@jasonmayes, do you have a list of specific asks from the TFJS experience?

@grorg @tidoust any insights on this?

@tidoust
Member

tidoust commented Aug 19, 2020

Cc @Kangz

I'm afraid I don't have any insights on this for now.

@Kangz

Kangz commented Aug 19, 2020

WebGPU provides compute shaders. By themselves, these allow using "shared workgroup memory", which is nice but not the best you can do in native GPU ML today. Next are subgroup operations, which could become a WebGPU extension and which some people are already looking at. And finally there's cooperative matrix multiply (marketed as "tensor cores" by NVIDIA): it might become a WebGPU extension if it gains support from more than one hardware vendor.
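
For concreteness, a minimal sketch of the "shared workgroup memory" style of reduction that compute shaders already allow, written in current WGSL (which postdates this discussion) embedded in a TypeScript string; layout and sizes are illustrative:

```ts
// Tree reduction in shared workgroup memory: the baseline technique that
// WebGPU compute shaders enable without any extensions.
const reduceWGSL = /* wgsl */ `
var<workgroup> scratch : array<f32, 64>;

@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partials : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) lid : u32,
        @builtin(workgroup_id) wid : vec3<u32>,
        @builtin(global_invocation_id) gid : vec3<u32>) {
  scratch[lid] = input[gid.x];
  workgroupBarrier();
  // Halve the number of active invocations each step, summing pairs.
  for (var stride = 32u; stride > 0u; stride = stride / 2u) {
    if (lid < stride) {
      scratch[lid] = scratch[lid] + scratch[lid + stride];
    }
    workgroupBarrier();
  }
  if (lid == 0u) {
    partials[wid.x] = scratch[0]; // one partial sum per workgroup
  }
}`;
```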

@dontcallmedom
Member Author

dontcallmedom commented Aug 19, 2020

Thanks @Kangz, very useful! Here is the link to the current discussion on subgroup operations, for others' benefit.

@jasonmayes

So one thing we wanted to find out is whether there is a way to have garbage collection for GPU-related activities too, much like JS currently has. Right now we provide TF.tidy() to somewhat deal with releasing memory when finished, but newer users take time to realise it exists. It would be better if this were consistent with how JS generally functions: most JS devs do not even think about memory management, as they are used to the JS garbage collector doing its thing.
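
For reference, a minimal sketch of the TF.tidy() pattern described above:

```ts
// TensorFlow.js: tf.tidy() disposes intermediate tensors when the callback
// returns, standing in for the GC that GPU memory does not get.
import * as tf from '@tensorflow/tfjs';

const result = tf.tidy(() => {
  const a = tf.randomNormal([1024, 1024]); // small JS object, large GPU allocation
  const b = a.square();                    // intermediate tensor
  return b.sum();                          // only the returned tensor survives
});

// a and b were disposed when tidy() returned; the result must still be
// released explicitly, since the JS garbage collector will not do it.
result.dispose();
```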

@jasonmayes

Adding @pyu10055 @dsmilkov @nsthorat @annxingyuan @tafsiri @lina128 (TF.js team) for any wish-list items related to this topic.

@Kangz

Kangz commented Aug 20, 2020

> So one thing we wanted to find out is whether there is a way to have garbage collection for GPU-related activities too, much like JS currently has. [...]

There isn't really a way to do automatic GC of GPU resources, and you can see this in WebGL's gl.deleteX, WebGPU's resource.destroy(), or even ImageBitmap.close(). That's because a very small number of JS objects can hold on to large amounts of GPU memory. Either the GC knows about this and runs often to try to reclaim that memory (bad for realtime applications and overall perf), or it just sees the small JS objects and lets the GPU objects leak. It's also not possible to trigger the JavaScript GC when the GPU runs out of memory, for many reasons, including that GPU objects live in a different process and can't round-trip to JS to ask for the GC to run.
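
To illustrate the explicit-destroy pattern this implies, a minimal WebGPU sketch (sizes and usage flags are illustrative):

```ts
// A small JS wrapper object can pin a large GPU allocation that the
// garbage collector never sees.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const buffer = device.createBuffer({
  size: 256 * 1024 * 1024, // 256 MiB of GPU memory behind a tiny JS object
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

// ... record and submit work that uses the buffer ...

// The application must release the GPU memory deterministically.
buffer.destroy();
```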

@anssiko
Member

anssiko commented Aug 26, 2020

Newly added is a "SIMD operations in WebGPU for ML" talk by @mehmetoguzderin discussing the proposed subgroup operations @Kangz mentioned.

@mehmetoguzderin, feel free to provide your further perspectives in this issue for the workshop discussions. Please also review the other workshop talks relevant to this issue, as well as the WebNN API spec and its open issues. In particular, see WebNN API issue webmachinelearning/webnn#6, which discusses custom ops using lower-level APIs such as WebGPU.

@mehmetoguzderin

mehmetoguzderin commented Aug 26, 2020

@anssiko WebNN is very interesting; I will have a look at it. And I will provide input in this repository on anything workshop-related. Thanks for the mention.

@anssiko anssiko added the Opportunities and Challenges Opportunities and Challenges of Browser-Based Machine Learning label Sep 3, 2020
@mehmetoguzderin

Sample code that uses SIMD operations is now available in the repository for my talk. For the speed benchmark chart that compares SIMD to alternative methods, please check out the main README.md; for the code itself, please check out the samples folder. (The code is written in Vulkan and GLSL, but it is structured clearly enough to give a general idea.)
https://github.com/mehmetoguzderin/webgpu-20200828-simdgroup

@wchao1115

wchao1115 commented Sep 12, 2020

I'd like to offer a different take in response to @jasonmayes' question in his talk:

> What lower level support do we need for efficient ML when using the graphics card?

As we know, most meaningful ML acceleration is rooted in the underlying hardware, and the work to surface such capabilities has been concentrated in the OS layer, where the platform tech and the hardware drivers actually meet. This is true for Windows, Linux, Android, and macOS. It is done this way because the hardware in the ecosystem is diverse, and hardware abstraction is a problem the OS is very good at solving.

WebNN is designed to provide an ML-specific path for the web platform to leverage OS-native ML functionality that makes use of this hardware acceleration in a more consistent and manageable way. So instead of relying on low-level, general-purpose compute constructs such as WebGL or WebGPU shaders, an ML framework could leverage native ML constructs more directly through an ML-specific web API like WebNN, letting it carry out platform-specific acceleration in the OS layer under the hood.

In the case of DirectML, in addition to providing a highly optimized version of the compute-based ML implementation, being an OS component also lets it use fine-grained interaction with the underlying compute drivers in the OS stack to maximize runtime performance and reduce latency; when appropriate, it provides shortcuts to operation-specific hardware capabilities based on their availability. As discussed in my talk, we've so far been reasonably successful with the integration of DirectML into both ONNX and TensorFlow. DirectML functionality can be mapped through WebNN.
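
For a sense of what this looks like to a framework, a hedged sketch of the graph-builder approach WebNN proposes; the API surface was still in flux at the time, so the names below follow the MLGraphBuilder pattern in the spec drafts and may differ in detail:

```ts
// Build and compile a tiny graph (C = A x B) once; the browser can hand the
// whole graph to the OS ML runtime (e.g. DirectML on Windows) instead of
// executing individual general-purpose shaders.
const context = await navigator.ml.createContext({ deviceType: 'gpu' });
const builder = new MLGraphBuilder(context);

const A = builder.input('A', { dataType: 'float32', shape: [2, 4] });
const B = builder.input('B', { dataType: 'float32', shape: [4, 2] });
const C = builder.matmul(A, B);

const graph = await builder.build({ C });
```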

@jeffhammond
Collaborator

During the Zoom call, I asked whether subgroups were the right level at which to seek portability, and whether it might be better to target a DSL like Halide or MLIR as the portable abstraction layer.

The challenges of making anything at the level of OpenCL subgroups portable are:

  1. the long-standing differences in how SIMD and SIMT are implemented in CPU and GPU hardware, and the lack of consistency in e.g. shuffle instructions.
  2. the introduction of multidimensional SIMD instructions, e.g. NVIDIA Tensor Cores, Intel AMX and Apple AMX.

At least for some ML workloads, the second category is more useful and a better target than vector operations.

Background

  • Halide is a Domain-Specific Language (DSL) for image processing and other data-parallel computations, including neural network operations (e.g. https://people.csail.mit.edu/tzumao/gradient_halide/).
  • LIBXSMM is a library for small matrix multiplication and convolution operations that uses lightweight just-in-time (JIT) compilation to generate optimal code for each supported architecture. It was created by Intel and is focused on the AVX-512 and AMX instruction sets.
  • OpenCL subgroups shows the subgroup interface Intel/Khronos added to OpenCL. Even though OpenCL is portable, usage of this API is hardware-dependent, which is one of my motivations for wondering whether a higher-level API would be better.
  • Apple AMX (sorry I cannot find official documentation yet) is a set of CPU-based matrix extensions.

@mehmetoguzderin

Thanks a lot for the feedback, @jeffhammond. An essential aspect of the SIMD proposal for WebGPU is the restricted set of operations it exposes. For example, shuffle operations and indexed accesses don't exist at all; this stems from the concerns they raise, and from the fact that not all target native APIs have those operations.

As demonstrated in the sample I provided for this workshop, even with a safer subset that requires uniform control flow, the performance gain can approach 10x. As people said on the call, they want GPU execution time to be as short as possible, especially on embedded or mobile targets, and SIMD operations enable that for very realistic use cases such as exploratory data analysis. The rougher terrain of these operations is not that extreme (some driver bugs exist), given that atomics and writeable buffers are already available in WebGPU. I believe that if they are available in the MVP, people who work on fantastic higher-level abstractions similar to Halide will squeeze the benefit out of SIMD operations and pass it on to users who can't invest the time to work on SIMD reductions themselves. But even for those users, SIMD operations bring a benefit: for reductions, atomic operations only work on integers, whereas SIMD operations give access to more types, and they outperform atomics even on integers (see the sketch below).

I think exposing tensor-core-style functionality is independent of the SIMD operations discussion, because those units are much more recent and their API surface is somewhat different.
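
As a point of comparison with the shared-memory reduction sketched earlier in this thread, a minimal sketch of a subgroup reduction (not the talk's actual sample), assuming the WGSL subgroup built-ins land roughly as proposed; the indexing assumes a linear mapping of invocations to subgroups, which is a simplification:

```ts
const subgroupReduceWGSL = /* wgsl */ `
enable subgroups;

@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partials : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(subgroup_invocation_id) lane : u32,
        @builtin(subgroup_size) sg_size : u32) {
  // One instruction replaces a whole shared-memory reduction tree, and it
  // works on f32, which integer-only atomics cannot reduce.
  let sum = subgroupAdd(input[gid.x]);
  if (lane == 0u) {
    partials[gid.x / sg_size] = sum; // lane 0 writes one partial per subgroup
  }
}`;
```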

@kvark

kvark commented Sep 16, 2020

For a structured capture of the WebGPU debate on subgroups, one can also have a look at the argdown-plain and argdown-component views.

@anssiko anssiko added this to the 2020-09-16 Live Session #1 milestone Sep 17, 2020
@dontcallmedom dontcallmedom added the Discussion topic Topic discussed at the workshop label Oct 9, 2020