Define graph execution methods used in different threading models. #255
Conversation
I may be confused, but how would a developer know which execution method their MLContext supports?

Generally speaking, branching on the type of the processing unit and support for WebGPU significantly raises the complexity of coding against the API; maybe that's acceptable given our targets (framework developers), but it's probably worth confirming.

Bikeshedding: to align with the existing pattern in the file API of …
@dontcallmedom The assumption here is that the caller choosing the graph execution method is also responsible for how the context is created. For instance, if a context is created specifically for the CPU, the graph can be executed either synchronously (via MLExecution) or asynchronously (via MLAwaitedExecution).
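To illustrate, here is a minimal sketch of that model using the interfaces proposed in this PR; the context-options shape, the buildGraph helper, and the input/output views are hypothetical placeholders:

```js
// Hypothetical options shape; the caller explicitly picks the CPU.
const context = navigator.ml.createContext({ devicePreference: 'cpu' });
const graph = buildGraph(context); // placeholder for MLGraphBuilder code

// Because the caller knows this is a CPU context, the synchronous
// method is valid (on a worker thread)...
const execution = new MLExecution(context);
execution.compute(graph, { x: inputView }, { y: outputView });

// ...and the asynchronous method works on the same context
// (inside an async context).
const awaited = new MLAwaitedExecution(context);
await awaited.compute(graph, { x: inputView }, { y: outputView });
```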
The {{MLAwaitedExecution}} interface represents a method of execution that asynchronously carries out the computational workload of a compiled graph {{MLGraph}} on a worker thread, to avoid blocking the calling thread while producing the results defined by the operations in the graph. This method of execution requires an {{MLContext}} created with {{MLContextOptions}} or {{WebGLRenderingContext}}.

```
<script type=idl>
typedef (MLBufferView or WebGLTexture or GPUTexture) MLResource;
```
I suppose MLResource should not support WebGPU resources, e.g., GPUBuffer (of MLBufferView) and GPUTexture.
Should MLResource be limited to just ArrayBufferView? It could be just an async version of MLExecution.
Are you suggesting that async execution of a GPU context would always automatically upload/download the result to/from the CPU memory? I can see that. How will async execution be used by TF.js in the readback scenario?
> Are you suggesting that async execution of a GPU context would always automatically upload/download the result to/from the CPU memory?

Yes.

> How will async execution be used by TF.js in the readback scenario?

TF.js has a Tensor::data() method which "asynchronously downloads the values".
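For context, the TF.js readback pattern in question looks like this (this uses the actual TF.js API; how it maps onto an async WebNN compute is exactly what's being discussed):

```js
import * as tf from '@tensorflow/tfjs';

const t = tf.tensor2d([[1, 2], [3, 4]]);
// data() asynchronously downloads the tensor values from wherever the
// backend holds them (e.g., GPU memory) into a TypedArray on the CPU.
const values = await t.data();
console.log(values); // Float32Array [1, 2, 3, 4]
```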
The {{MLExecution}} interface represents a method of execution that carries out the computational workload of the graph immediately on the calling thread, which must also be a worker thread. The execution is carried out by the {{MLExecution/compute()}} method, which produces the results of the computation from all the inputs bound to the graph on the bound outputs. This type of execution is only supported when the computational device bound to the ML context is a CPU.
I suppose it should support GPU as well. It would allow a caller who uses the sync API to also access the GPU.
That was my original thought, but @RafaelCintron believes that we should not allow sync execution on a GPU context, because it means you need to block the calling thread until the GPU executes and produces the results in the output tensors, which could create a jarring user experience.
If this is restricted to DedicatedWorker, I'm not sure I understand where the risk with blocking on a GPU execution would be. Having it applicable to GPU contexts would help with the concern I raised.
> which could create a jarring user experience.

Because it is limited to a worker, I suppose the UI thread (main thread) won't be blocked in this way.
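A sketch of that arrangement, assuming MLExecution is exposed only in dedicated workers (the file name, message shape, and the graph/context setup are illustrative):

```js
// main.js — the UI thread only posts work and renders results.
const worker = new Worker('inference-worker.js');
worker.postMessage({ input });
worker.onmessage = (e) => render(e.data.output);

// inference-worker.js — a blocking compute() stalls only this worker,
// not the main thread.
onmessage = (e) => {
  const execution = new MLExecution(context); // CPU context set up earlier
  execution.compute(graph, { x: e.data.input }, { y: outputView });
  postMessage({ output: outputView });
};
```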
```
interface MLCommandEncoder {
  constructor(MLContext context);

  undefined initializeGraph(MLGraph graph, MLNamedGPUInputs inputs);
```
Should this method be named initialize? MLGraph is the first argument.
Should MLExecution and MLAwaitedExecution expose an initialize method as well? With that, we could remove MLGraphBuilder::constant and move all constant-buffer binding to the specialized execution interfaces.
Normally, an initialize method on an interface implies that the operation initializes the host interface object, which makes the host object stateful. I explicitly named this method initializeGraph because the operation initializes the graph input param to the method, and not the host object itself; i.e., it implies that the graph is stateful, and not the encoder.
> Should MLExecution and MLAwaitedExecution expose initialize

MLExecution is CPU-only, and so there isn't a need to initialize and pre-process the weights; i.e., graph constants and inputs are treated the same way.

MLAwaitedExecution, when used for a GPU context, will need to initialize and pre-process the weights before the first inference call. But that can be done without defining the initializeGraph method, because weights are simply constant operands bound to the graph at construction time. The implementation can already distinguish the weight constants from the rest of the named inputs, so initializing them could technically be done before the first inference call (i.e., compute).

The real reason that only MLCommandEncoder needs a separate initializeGraph method is that, unlike MLExecution and MLAwaitedExecution, the command encoder does not actually carry out the execution of the compute workload. It merely records the workload onto the command buffer on behalf of the caller. This means that it is the caller's responsibility to track the weight constant tensors and make sure that the constant initialization workload is recorded separately by the initializeGraph call.

Another way to think about it is that the command encoder is a lower-level API than the execution API. While the execution API can "cache" the constants' initialization result before making the first inference call to compute the output result, the encoder API cannot; it can only record the act of initialization, as a workload separate from the inference workload captured by the subsequent dispatch call.
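A sketch of that split, using the interfaces in this PR; the dispatch() signature and the buffer names are assumptions, since they are not shown in this hunk:

```js
const encoder = new MLCommandEncoder(context);

// Recorded once: pre-processing of the weight constants, tracked and
// recorded explicitly by the caller.
encoder.initializeGraph(graph, { weights: weightGpuBuffer });

// Recorded per inference: the compute workload itself. Nothing runs
// yet; both workloads execute only when the command buffer is submitted.
encoder.dispatch(graph, { x: inputGpuBuffer }, { y: outputGpuBuffer });
```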
A CPU backend may also need to pre-process the weights, e.g., conversion to a SIMD-instruction-optimized block memory layout. And per the discussion above, I would suggest MLExecution support both CPU and GPU.

The graph initialization, e.g., constants pre-processing, may be time consuming. Doing it in a sync MLGraphBuilder::build may block the main thread. Hosting a sync version in MLExecution and an async version in MLAwaitedExecution may be a solution.

Regarding the resource type, the initialize of MLExecution and MLAwaitedExecution could accept only ArrayBufferView, while initializeGraph of MLCommandEncoder accepts only GPUBuffer. This would align with the respective execution methods (with my suggestion in the comment above) and simplify the usage.

In this way, the graph-building code could be shared across the different execution methods. Otherwise, callers need to call MLGraphBuilder::constant to bind constants to a graph for MLExecution and MLAwaitedExecution, but call MLGraphBuilder::input and bind the constants in MLCommandEncoder::initializeGraph. That leads to different graph-building code, as sketched below.
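A sketch of the divergence (the builder methods are the ones named above; the descriptor, buffers, matmul op, and build() shape are illustrative assumptions):

```js
// Path A: MLExecution / MLAwaitedExecution — weights baked in as constants.
let builder = new MLGraphBuilder(context);
let x = builder.input('x', desc);
let w = builder.constant(desc, weightArrayBufferView);
const graphA = builder.build({ y: builder.matmul(x, w) });

// Path B: MLCommandEncoder — weights declared as inputs and bound later.
builder = new MLGraphBuilder(context);
x = builder.input('x', desc);
w = builder.input('w', desc);
const graphB = builder.build({ y: builder.matmul(x, w) });
new MLCommandEncoder(context).initializeGraph(graphB, { w: weightGpuBuffer });
```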
```
interface MLAwaitedExecution {
  constructor(MLContext context);

  Promise<MLNamedOutputs> compute(MLGraph graph, MLNamedInputs inputs, MLNamedOutputs outputs);
```
Should it just return Promise<void>?
My understanding is that Promise<Return-Type> is the semantic used for a deferred async result on the return value. In this case, the result will be placed on the outputs param, which should therefore also be the return type. Let me know if I got it wrong.
Promise<Return-Type> means that a developer would write code à la:

```js
const outputs = await executor.compute(graph, inputs, namedoutputs);
```

I agree with @huningxin that this feels wrong, since the expectation is instead that namedoutputs will be modified in place by the compute method; having the same object also returned via the promise is at best confusing.
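Concretely, the two shapes being compared (executor, readResults, and the argument names are placeholders):

```js
// Shape 1 — Promise<void>: outputs is filled in place, which matches
// the reviewers' expectation.
await executor.compute(graph, inputs, namedoutputs);
readResults(namedoutputs);

// Shape 2 — Promise<MLNamedOutputs>, as the IDL above specifies:
// the resolved value is the very object that was passed in.
const outputs = await executor.compute(graph, inputs, namedoutputs);
console.assert(outputs === namedoutputs); // redundant, and easy to misread
```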
@wchao1115 I understood the model you propose that way. My question is: imagine you create a default MLContext with:

```js
// Default context created
const context = navigator.ml.createContext();
```

That means the developer leaves entire control of where the model will be computed to the user agent; the developer has no way to know whether the resulting context implies the sync approach (MLExecution) or one of the others.

Presumably, this could be surfaced somehow on the MLContext. Even if it gets surfaced, it still means that WebNN users would have to maintain 3 code pathways to deal with the general case of executing a graph (i.e., one where the developer wants to leave the choice of the context to the user agent); 3 pathways means that many opportunities for additional bugs, that much additional need for testing, etc. That may very well be OK, but I think we should check whether this is an acceptable trade-off.
In the default case, the caller can use either …
OK, I'm starting to understand more clearly the intent here: …

If that's indeed the model, I think it needs to be surfaced in the API. And I still don't understand, with the … Generally speaking, I am worried that the generic interface …
This type of execution is limited to only when the computational device bound to the ML context is a CPU.

The {{MLAwaitedExecution}} interface represents a way the execution of the graph is performed asynchronously
on a separate worker thread. The call to the {{MLAwaitedExecution/compute()}} method returns instantly without
Suggested change:
```diff
- on a separate worker thread. The call to the {{MLAwaitedExecution/compute()}} method returns instantly without
+ in the background. The call to the {{MLAwaitedExecution/compute()}} method returns instantly without
```
or maybe just say "separate thread" (but "worker thread" makes it feel that it will be run in a developer-exposed worker)
on a separate worker thread. The call to the {{MLAwaitedExecution/compute()}} method returns instantly without blocking the calling thread. This execution method is appropriate when the responsiveness of the calling thread is critical to good user experience. The computation results will be placed at the bound outputs at the time the operation is completed on a worker thread, at which time the calling thread is signaled. This type of
Suggested change:
```diff
- the operation is completed on a worker thread at which time the calling thread is signaled. This type of
+ the operation is completed in the background at which time the calling thread is signaled. This type of
```
Thanks @dontcallmedom. There is an alternative design discussed in issue #230, which I think is what you alluded to here. Let me try to implement that in a PR, and see which one is more palatable. The entire matrix of required use cases is complex, and so there may not be a "perfect" design in this case. But maybe we can try to pick the least bad one.

I must point out, though, that the scenario where the developer leaves the choice of context entirely to the user agent is not something we aim to support; i.e., callers that use WebNN to create and execute the compiled ML graph also have the responsibility for context creation. The act of executing or recording the execution of the models is highly dependent on the type of context and the underlying device used to create and execute them.
@dontcallmedom Please take a look at an alternative design in PR #257.
Is the intent of …?

WebNN does not necessarily need to introduce WebGPU's device execution model (deferred vs. immediate) in order to execute ML work on the GPU asynchronously. Before "command encoders", the context itself would encode and schedule GPU work by generating an (internal) single command buffer, each on a separate thread/worker, to effectively allow simultaneous encoding. The control/flexibility of explicit ML command encoders (deferred) does not seem useful if the WebNN developer can't broadly share it with WebGPU.
@bbernhar Maybe I misunderstood your previous comment. If the WebNN context were to record its ML workloads onto an internal command buffer, who is responsible for executing that buffer? If that's also the responsibility of the context, then is the execution a blocking operation? The purpose of the …
Either … Notice that the use of an implicit ML command buffer allows for a simple interop model of "queue sharing" (vs. sharing command lists), avoiding the need for WebGPU to broadly support non-native concepts like … Hope this helps.
@bbernhar @RafaelCintron Following our discussion, should we go ahead and close this discussion around …?
I'd like to understand how resource sharing and execution ordering happen in this design. Take the WebGPU/WebNN background blur sample for the full-GPU-only video processing use case: it involves dispatching compute shaders and an ML graph, and sharing GPUBuffers between them. The processing steps include:

1. Convert inputTexture to inputTensor (pre-processing compute shader).
2. The ML graph segments the background, from inputTensor to outputTensor.
3. Blend inputTexture into outputTexture based on outputTensor (post-processing compute shader).
4. Render outputTexture onto the canvas.
The pseudo code looks like:

```js
const commandEncoder = device.createCommandEncoder();

// Convert inputTexture to inputTensor
let computePass = commandEncoder.beginComputePass();
computePass.setPipeline(preprocessingPipeline);
computePass.setBindGroup();
computePass.dispatch();
computePass.end();

// ML graph segments background from inputTensor to outputTensor
// Should it record the command into commandEncoder? Like:
mlGraph.record(commandEncoder, inputBuffer, outputBuffer);

// Blend inputTexture to outputTexture based on outputTensor
computePass = commandEncoder.beginComputePass();
computePass.setPipeline(postprocessingPipeline);
computePass.setBindGroup();
computePass.dispatch();
computePass.end();

// Render the outputTexture onto canvas
const passEncoder = commandEncoder.beginRenderPass();
passEncoder.setPipeline(fullscreenQuadPipeline);
passEncoder.setBindGroup();
passEncoder.draw(6, 1, 0, 0);
passEncoder.end();

device.queue.submit([commandEncoder.finish()]);
```
If we do the "execute WebGPU indirect" method:

```js
const commandEncoder1 = device.createCommandEncoder();
let computePass1 = commandEncoder1.beginComputePass();
// ...
computePass1.dispatch();
computePass1.end();

const mlcontext = navigator.ml.createContext(device);
let graph = /* Build graph */

const mlcommandBuffer = mlcontext.createCommandBuffer(graph, inputBuffer, outputBuffer);

const commandEncoder2 = device.createCommandEncoder();
let computePass2 = commandEncoder2.beginComputePass();
// ...
computePass2.dispatch();
computePass2.end();

const renderPass1 = commandEncoder2.beginRenderPass();
// ...
renderPass1.draw(...);
renderPass1.end();

mlcontext.submit(queue, [commandEncoder1.finish(), mlcommandBuffer, commandEncoder2.finish()]);
```

WDYT?
Get (create) a …

My question is whether we can leverage GPUQueue to submit it, like:

```js
device.queue.submit([commandEncoder1.finish(), mlcommandBuffer, commandEncoder2.finish()]);
```
Thanks! Yes, it's possible, but there are no plans to support WebGPU interop through shared command buffers, and it may not be allowed unless it's explicitly defined in the WebGPU specification (e.g., Dawn forbids this). Of course, going the other way is permitted.
```
interface GPUQueue {
  undefined submit(sequence<GPUCommandBuffer> commandBuffers);
};
```

So if we define … Did I miss anything?
@bbernhar, @RafaelCintron and myself did discuss this exact design at great length and agreed not to pursue it, mainly because it creates a strong requirement that a WebNN implementation must produce a WebGPU primitive (i.e., a GPUCommandBuffer).

Although I think it is conceptually sound, and am also a proponent of it, I agreed that it could make implementing WebNN much harder, especially when it's implemented using an OS API that has no design provision for interoperating with the GPU stack used to implement WebGPU in the same environment; e.g., it isn't at all clear how it's going to be implemented by a Khronos API, Apple CoreML, or even hardware-vendor APIs like OpenVINO.

The new consensus now is to leave the ML workloads generation and the command buffer submission as an implementation detail behind a public interface that simply accepts the WebGPU device's default queue. Please take a look at my latest commit on #257.
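A sketch of what an interface that "simply accepts the WebGPU device's default queue" could look like; the method name and shape here are assumptions — the actual design is in #257:

```js
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const context = navigator.ml.createContext(device);

// Hypothetical method: the implementation generates and submits its ML
// command buffers internally; the caller only hands over the default queue.
context.compute(graph, { x: inputGpuBuffer }, { y: outputGpuBuffer }, device.queue);
```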
I suppose this might not be an issue. Because, as you mentioned, only when WebNN is implemented using an OS API that supports interop with the GPU stack used to implement WebGPU in the same environment would the creation of a GPUCommandBuffer need to be supported.
GPUCommandBuffer is also expected to contain only "valid commands". This is a hard requirement for WebGPU implementations. Either WebGPU needs to know what these commands are (and where to find them) or be told how to validate/handle them; otherwise Queue.submit will always fail. =(
The WebNN implementation also needs to be able to produce an implementation of GPUCommandBuffer …

Additionally, if the MLContext is created with GPU device type, but without a GPUDevice, …
It makes sense. If the …

It's true. However, with the proposal of "execute WebGPU indirect", as in:

```js
mlcontext.submit(queue, [commandEncoder1.finish(), mlcommandBuffer, commandEncoder2.finish()]);
```

it requires the WebNN implementation to validate/handle other types of GPUCommandBuffer, say rendering and compute, and submit them into the device queue. This part of the work is out of scope for WebNN and duplicates the WebGPU implementation.

Another concern of mine is about developer experience: with the …
I'm sorry, I'm not following. My latest commit on PR #257 doesn't call for a new …
Nothing is duplicated. WebNN would have its own (internal) queue implementation that re-uses the WebGPU native API so it may co-execute GPU and ML commands in a single call. Going the other way requires WebGPU implementers to do the same.
Closing this one. See #257; the evolution of this PR continues there.
Based on the discussion in issue #230, we define three methods of execution for the compiled graph as follows:

- Immediate Execution through the MLExecution interface
- Async Execution through the MLAwaitedExecution interface
- Queued Execution through the MLCommandEncoder interface