Custom operations #6
I agree we should provide a way to create custom ops. More specifically, I guess that developers would like to write custom ops in a similar manner to writing custom ops in Python ML frameworks, e.g. using tf.math in TensorFlow, numpy, etc. The current charter states that basic algebra ops using WebGL, WebGPU shaders and WebAssembly SIMD are out of scope. On the other hand, platform-level neural network APIs seem to support several basic arithmetic ops according to the ops mapping table by @huningxin. IMO, we need to survey which ops are typically needed to write custom ops and which of them are or are not already provided by platform-level APIs. It might also be important to know how browser vendors will implement the built-in ops.
@dsmilkov @tomoyukilabs - updated API proposal, please check whether this would suit your needs: https://github.com/DanielMazurkiewicz/WebAI (the custom operations and atomic ML operations API are described at the end of the document)
For starters, the ergonomics of algebraic operations (especially with custom units) would be problematic, since there is no operator overloading. (There have been attempts in the past, but none have been particularly successful.) The pipeline operator proposal (which will also take a decent amount of time) might help, but only in a limited way, since the precedence rules aren't straightforward. The operator list in the document linked above is definitely not enough to build proper custom layers. I'm of the opinion that this should probably not be in scope for the initial deliverable. It's a pretty big pond to boil.
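To make the ergonomics point concrete, here is a minimal, purely illustrative sketch (the Tensor class and its methods are hypothetical, not part of any proposal) of what custom-op algebra looks like in JavaScript today without operator overloading:

// Hypothetical tensor wrapper (illustrative only): without operator
// overloading, algebraic expressions must be spelled as method chains.
class Tensor {
  constructor(data) { this.data = Float32Array.from(data); }
  mul(other) { return new Tensor(this.data.map((v, i) => v * other.data[i])); }
  add(other) { return new Tensor(this.data.map((v, i) => v + other.data[i])); }
  relu() { return new Tensor(this.data.map((v) => Math.max(0, v))); }
}

const x = new Tensor([1, -2, 3]);
const w = new Tensor([0.5, 0.5, 0.5]);
const b = new Tensor([1, 1, 1]);

// There is no way to write `relu(x * w + b)` with infix operators on objects;
// it has to be a chain of method calls instead:
const y = x.mul(w).add(b).relu();
console.log(y.data); // Float32Array [1.5, 0, 2.5]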
The current charter says, "The APIs in scope of this group will not be tied to any particular platform and will be implementable on top of existing major platform APIs." In other words, I guess that we may need to consider how operations not included in major platform-level APIs should be provided to framework implementors or web developers. There may be a couple of possible ways to provide such operations:
I agree with that. However, if custom operations are not supported, the available DNN models depend entirely on the operations supported by the WebNN API. We should carefully consider what the built-in and custom operations should be, in terms of both application-level use cases and ease of implementation for browser vendors.
@dsmilkov emphasized high performance: falling back to the main thread to do the operation in JS means 1) a tensor copy from the accelerator to the CPU over the bus and back, which is expensive, and 2) an operation that is not friendly to vectorization.
OK, I don't know exactly what is going on under the hood of TF; I assumed it falls back to JS and copies data between VRAM and regular RAM for every operation (or the first and last operation in a chain). But if not, this API proposal is still valid; it doesn't force any particular way of using the operations. It can be used to build code for the GPU "programmatically" from JavaScript (that is the only other way it could work that comes to my mind, but please correct me if I'm wrong here too); an appropriate set of operators (or additional operators) would just need to be provided.
@dsmilkov wrote:
I agree extensibility is an important requirement for the low-level API, whose major consumers would be libraries. @tomoyukilabs wrote:
Today's libraries implement neural network ops with WebGL and WebAssembly. I suppose they will continue to adopt new features, like WebGPU compute shaders and WebAssembly SIMD/threading, for performance enhancement. So it makes sense to me that this group leaves the implementation of custom ops out of scope. @dsmilkov wrote:
I agree high performance is the main reason to create this API, as the current charter states that WebNN is for neural network inference hardware acceleration. To access hardware acceleration through WebNN, a library can partition a neural network into sub-graphs based on the ops WebNN supports. Sub-graphs whose ops are all supported can be executed by WebNN; the other sub-graphs need to be executed by custom ops within the library. The WebNN sub-graphs and the custom ops should exchange data (tensors) in an efficient way; otherwise, the performance gain of WebNN execution is easily lost to memory movement/reordering overhead. We observed this issue when experimenting with TensorFlow.js optimization on our WebNN POC earlier. @dsmilkov wrote:
It sounds like a good approach. Based on our POC and platform API investigation, WebNN may execute the neural network on different devices, e.g. CPU, GPU or an accelerator. The API for implementing custom ops would therefore be related to the WebNN execution device. For instance, a library may select WebAssembly-based custom ops when WebNN executes sub-graphs on the CPU and select WebGPU-based custom ops when WebNN executes on the GPU. Given that WebGPU is still a work in progress, I propose to look at WebAssembly-based custom ops first. Within our POC, the existing graph execution API accepts inputs and outputs in … Any thoughts?
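To illustrate the partitioning and backend-selection idea in the comments above, here is a minimal sketch under stated assumptions: isSupportedByWebNN and the returned backend names are hypothetical placeholders, not proposed API.

// Split a linear chain of ops into alternating segments: runs that WebNN can
// execute as a sub-graph and runs that need library-provided custom kernels.
function partition(ops, isSupportedByWebNN) {
  const segments = [];
  for (const op of ops) {
    const supported = isSupportedByWebNN(op);
    const last = segments[segments.length - 1];
    if (last && last.supported === supported) {
      last.ops.push(op);
    } else {
      segments.push({ supported, ops: [op] });
    }
  }
  return segments;
}

// Pick the custom-kernel backend that matches the WebNN execution device, so
// tensors don't have to cross the CPU/GPU boundary between segments.
function pickCustomKernelBackend(webnnDevice) {
  return webnnDevice === 'cpu' ? 'wasm' : 'webgpu';
}

// Example usage with a toy op list:
const segments = partition(
    ['conv2d', 'add', 'relu', 'my_custom_op'],
    (op) => op !== 'my_custom_op');
console.log(segments);
// [{ supported: true, ops: ['conv2d', 'add', 'relu'] },
//  { supported: false, ops: ['my_custom_op'] }]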
LGTM, @huningxin. Thanks for your detailed explanation.
LGTM as well. It comes down to high-performance data exchange between the built-in operations and user code. "Custom operations" follow from that and are the responsibility of user libraries to implement.
Thanks @tomoyukilabs @dsmilkov! As mentioned in the 14 Feb 2019 call, I will investigate "custom operations" support on CPU as the first step and hopefully report back in the March call.
Doesn't this essentially require standardizing the in-memory representation of tensors?
Here is the investigation report. I'll give an update on it in today's CG meeting. @gramalingam wrote:
I agree this is an essential requirement. According to our investigation, we also need to pay attention to memory re-layout overhead, e.g. from a plain tensor layout format (such as NHWC) to the layouts used by native CPU backends (MKLDNN, for example, uses a blocked layout).
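To make the re-layout cost concrete, here is a small illustrative sketch (not framework code) that reorders an NHWC tensor into NCHW; a conversion to a blocked layout such as MKLDNN's would similarly touch every element, which is where the overhead comes from.

// Reorder a dense NHWC tensor into NCHW. Every element is read and written
// once, so each layout conversion costs a full pass over the tensor.
function nhwcToNchw(src, [n, h, w, c]) {
  const dst = new Float32Array(src.length);
  for (let ni = 0; ni < n; ++ni) {
    for (let hi = 0; hi < h; ++hi) {
      for (let wi = 0; wi < w; ++wi) {
        for (let ci = 0; ci < c; ++ci) {
          dst[((ni * c + ci) * h + hi) * w + wi] =
              src[((ni * h + hi) * w + wi) * c + ci];
        }
      }
    }
  }
  return dst;
}

// Example: a 1x2x2x3 tensor.
const nhwc = new Float32Array([...Array(12).keys()]);
console.log(nhwcToNchw(nhwc, [1, 2, 2, 3]));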
As a case study of custom ops support for frameworks, @pinzhenx and I prototyped a WebNN backend for ONNX.js. There are some findings:
As a follow-up to the 8 Aug 2019 call, I've done a very initial investigation of WebGPU and WebNN memory sharing. The investigation is based on the WebGPU backend of TF.js, the WebGPU Dawn project and the WebNN POC. In particular, I only touched the Metal backend of Dawn and the MPS backend of the WebNN POC.

Compilation

For sub-graph compilation, WebNN may need to support setting the framework's GPUDevice as the compilation target, as the snippet below does:

// Create a Compilation object for the constructed sub-graph model.
const nn_compilation = await model.createCompilation();
// webgpu_backend: framework's WebGPUBackend
// Get the GPUDevice of WebGPUBackend and set that as WebNN compilation target.
nn_compilation.setGPUDevice(webgpu_backend.device);
// Finish the compilation.
await nn_compilation.finish();
// Create an Execution object for the compiled model.
const nn_execution = await nn_compilation.createExecution();

In the WebNN implementation, for example the MPS one, it would get the underlying device from the GPUDevice and compile the sub-graph for that device.

Execution

The framework implements custom kernels in WebGPU compute shaders and uses WebGPUBuffers to exchange data with the WebNN execution:

// webgpu_backend: framework's WebGPUBackend
// pre_input, pre_output, post_input, post_output: framework's Tensor backed by WebGPUBuffer
// pre_program, post_program: framework's WebGPUProgram for pre and post processing
// Write the input_data from CPU to GPU.
// input_data: Float32Array
webgpu_backend.write(pre_input, input_data);
// Compile and run pre-processing kernel in WebGPU compute shader.
webgpu_backend.compileAndRun(pre_program, [pre_input], pre_output);
// Set pre processing kernel's output WebGPUBuffer as input of WebNN execution.
nn_execution.setInput(0, tensorToGPUBuffer(pre_output));
// Set post processing kernel's input WebGPUBuffer as output of WebNN execution.
nn_execution.setOutput(0, tensorToGPUBuffer(post_input));
// Start the WebNN sub-graph execution.
nn_execution.startCompute();
// Compile and run the post-processing kernel in WebGPU compute shader.
webgpu_backend.compileAndRun(post_program, [post_input], post_output);
// Get the output data from GPU to CPU.
// output_data: Float32Array
let output_data = await webgpu_backend.read(post_output);

In the WebNN implementation, for example MPS, it would:
Layout

Depending on the tensor layout definitions of the framework and of WebNN, the framework may need to reorder the tensor before and after the WebNN execution.
Here's a thought about a way to define custom ops that sidesteps the issue of data layout: what if we could define them in a more abstract way? Consider Halide: you can define things in a pointwise way. Abstractly, for us it might look like the sketch below:
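As a hypothetical sketch of that idea (definePointwiseOp and the rest of the names below are illustrative only, not proposed API): the op author writes only the per-element math, and the layout, loop structure and target device are left entirely to the implementation. A naive JS loop stands in here for the compiler that would lower the definition to a shader or Wasm kernel.

// Hypothetical, Halide-inspired pointwise definition: the function only says
// what each output element is; the "backend" below is an eager JS interpreter
// standing in for a real lowering pass.
function definePointwiseOp(elementFn) {
  return (input) => input.map(elementFn);
}

// The op author only states the per-element math:
const leakyRelu = definePointwiseOp((v) => Math.max(v, 0.01 * v));

console.log(leakyRelu([1, -2, 3])); // [1, -0.02, 3]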
It would take a lot of work (an entire compiler!) to generate peak-performance {shaders,programs} for things defined in this way, but since we want peak-performance ops to be natively supported by the API (e.g. conv2d/matmul), this may be OK.
At the WebML CG F2F on 17 Sep, I shared the initial investigation results on WebNN-WebGPU interoperability. Based on the feedback, I'd like to share more details of the test code and a proposal for API changes.

Tests

The test code is hosted at https://github.com/huningxin/webnn_webgpu_interop_test. It uses TensorFlow.js (tensorflow/tfjs@b3eed68) as an example framework. All tests use the WebGPU backend of TensorFlow.js.

Test 1 - conv2d/add/relu in WebGPU

This test executes all ops with WebGPU compute shaders. No WebNN acceleration. (source)

// input, filter, bias are tf.tensor
let convOutput = tf.conv2d(input, filter, 1, 'same');
let addOutput = tf.add(convOutput, bias);
let reluOutput = tf.relu(addOutput);
let result = await reluOutput.data();

Test 2 - conv2d (WebNN) -> ArrayBufferView -> add/relu (WebGPU)

This test executes conv2d with WebNN and add/relu with WebGPU. The WebNN conv2d result is read back into a TypedArray and uploaded to a WebGPUBuffer for the WebGPU add/relu. (source)

// Create a WebNN model that contains conv2d
const model = await createWebNNConv(filterValue, noBias, noRelu);
const compilation = await model.createCompilation();
compilation.setPreference(nn.PREFER_SUSTAINED_SPEED);
await compilation.finish();
const execution = await compilation.createExecution();
// input and output are TypedArray
execution.setInput(0, input);
execution.setOutput(0, output);
// Wait for computation done and data is read back
await execution.startCompute();
// Upload to WebGPUBuffer
let outputTensor = tf.tensor(output, inputDims);
let addOutput = tf.add(outputTensor, biasTensor);
let reluOutput = tf.relu(addOutput);
let result = await reluOutput.data();

Test 3 - conv2d (WebNN) -> WebGPUBuffer -> add/relu (WebGPU)

This test executes conv2d with WebNN and add/relu with WebGPU. The WebNN conv2d result is written to a WebGPUBuffer which is used as the input for the WebGPU add/relu. (source)

// Create a WebNN model that contains conv2d
const model = await createWebNNConv(filterValue, noBias, noRelu);
const compilation = await model.createCompilation();
// Set WebNN compilation for the same WebGPUDevice as tfjs's
compilation.setGPUDevice(tf.backend().device);
compilation.setPreference(nn.PREFER_SUSTAINED_SPEED);
await compilation.finish();
const execution = await compilation.createExecution();
// input, output, bias are tf.tensor
// Get underlying WebGPUBuffer
const inputBuffer = tf.backend().getBuffer(input.dataId);
const outputBuffer = tf.backend().getBuffer(output.dataId);
// Set WebGPUBuffer as input and output to WebNN execution
execution.setInputGPUBuffer(0, inputBuffer);
execution.setOutputGPUBuffer(0, outputBuffer);
// Enqueue the execution to command buffer, no need to wait
execution.startCompute();
let addOutput = tf.add(output, bias);
let reluOutput = tf.relu(addOutput);
// Read back result from GPU
let result = await reluOutput.data();

Test 4 - conv2d/bias/relu (WebNN)

This test executes conv2d/bias/relu all in WebNN. The input and output of the WebNN execution are WebGPUBuffers that WebGPU compute shaders can produce or consume. (source)

// Create a WebNN model that contains conv2d, bias and relu
const model = await createWebNNConv(filterValue, biasValue, fuseRelu);
const compilation = await model.createCompilation();
// Set WebNN compilation for the same WebGPUDevice as tfjs's
compilation.setGPUDevice(tf.backend().device);
compilation.setPreference(nn.PREFER_SUSTAINED_SPEED);
await compilation.finish();
const execution = await compilation.createExecution();
// Get underlying WebGPUBuffer of input and output (tf.tensor)
const inputBuffer = tf.backend().getBuffer(input.dataId);
const outputBuffer = tf.backend().getBuffer(output.dataId);
// Set WebGPUBuffer as input and output to WebNN execution
execution.setInputGPUBuffer(0, inputBuffer);
execution.setOutputGPUBuffer(0, outputBuffer);
// Enqueue the execution to command buffer, no need to wait
execution.startCompute();
// Read back result from GPU
let result = await output.data();

Result

The prototype of WebNN-WebGPU interop support is based on Chromium 78.0.3891.0. The current prototype only supports macOS, where WebNN ops are implemented with MPSCNN kernels and WebGPU is implemented with the Metal API. The source code is hosted at https://github.com/huningxin/chromium-src/tree/webnn_webgpu_interop. The results were collected on a MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports) running macOS 10.14.6. The test log:
Summary:
Proposal (according to #6 (comment) by @dsmilkov)
(Context: this topic is being discussed in the workshop GH, e.g. w3c/machine-learning-workshop#68 (comment).) @wchao1115 @huningxin does operator composition in the WebNN API address this issue adequately? Is there anything we'd like to spin off into a separate issue? It would be helpful if you could briefly summarize here the proposed way forward for the WebNN API so we can then confirm with @dsmilkov whether the issue can be closed.
@anssiko It has been our goal to make sure that for every operator we define, if there is a semantically equivalent composition graph of lower-level operators, we also provide definitions for all of the lower-level operators in that graph. The latest and perhaps most illustrative example of that practice is the GRU operators in #83; note that I also added definitions for …
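As a small illustration of that composition principle (a sketch only; the builder methods here follow the spirit of the WebNN builder API but are not quoted from the spec), a higher-level operator can be defined as a graph of the lower-level operators it decomposes into:

// Sketch: if a higher-level activation such as silu(x) = x * sigmoid(x) were
// not built in, a framework could compose it from lower-level ops, assuming a
// graph builder that exposes `sigmoid` and `mul`.
function silu(builder, x) {
  return builder.mul(x, builder.sigmoid(x));
}

// Usage sketch within a larger graph under construction (hypothetical names):
// const hidden = builder.matmul(input, weights);
// const activated = silu(builder, hidden);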
Putting a v2 label on this to check the WG's latest thinking as we look at our longer-term aspirations. No immediate action required.
I'll note that WebGPU interop is only one approach to supporting custom ops. While I agree we should support WebGPU interop, this approach is not free of drawbacks - e.g. we don't yet have an idea of the performance impact of communication/synchronization and data copies between the … There are other approaches to supporting custom ops which WebNN might consider, such as StableHLO's recently-added …
@a-sully Understood. I would recommend we open a separate issue when (if ever) WebGPU interop performance becomes a concern, and an entirely new issue, alongside a concrete proposal, to support custom ops within WebNN itself. This issue is unactionable and, until recently, was inactive, so I'd vote we close it.
Custom operators could also be implemented in WebAssembly and exchange tensor data with the WebNN CPU execution context. This is used by frameworks today, like ONNXRuntime Web. We should continue to support this usage even after we have WebGPU interop later.
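A rough sketch of that flow under stated assumptions (the Wasm module's alloc and customOp exports are hypothetical, and execution stands for a WebNN execution object like the one created in the POC snippets above): the custom op runs over a typed array in Wasm linear memory, and the same view is handed to the WebNN CPU execution, so no GPU round trip is involved.

// Hypothetical: run a WebAssembly custom op, then feed its output to a WebNN
// CPU execution via the ArrayBufferView path shown earlier in this thread.
async function runCustomOpThenWebNN(wasmBytes, execution, inputData, outputLength) {
  const { instance } = await WebAssembly.instantiate(wasmBytes, {});

  // Copy the input tensor into Wasm linear memory and run the custom kernel.
  // `alloc` and `customOp` are assumed exports of the module, for illustration.
  const ptr = instance.exports.alloc(inputData.length * 4);
  const view = new Float32Array(instance.exports.memory.buffer, ptr, inputData.length);
  view.set(inputData);
  instance.exports.customOp(ptr, inputData.length);

  // Hand the same typed-array view to WebNN; on the CPU path no device copy
  // or GPU synchronization is needed.
  const output = new Float32Array(outputLength);
  execution.setInput(0, view);
  execution.setOutput(0, output);
  await execution.startCompute();
  return output;
}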
Starting a thread to open the discussion for supporting custom operations.
The ML field is fast moving, and model architectures and operations are evolving quickly. In TensorFlow.js, we have around 200 ops, and we still run into missing ops when someone tries to port a new model to the browser. I believe that the number of built-in ops will be relatively small and will grow very slowly due to standardization.
Thus, it is important to provide a way for library authors to write custom ops that can interoperate with the built-in neural network ops. That means having high-performance data exchange between custom ops and built-in ops. I stress the importance of high performance; otherwise, library authors would revert to implementing all of the ops using lower-level APIs (e.g. WebGPU).
A good way to start is to understand the scope and complexity of the problem. We can look at the technical details of how browser vendors plan to implement built-in ops, which tells us where these ops run and where the data lives.