Should WebNN support async APIs? #230
Comments
I can help fill in the GPU story a bit. GPUDevice does not require WebNN compute() to be sync. WebGPU schedules work on the GPU/queue timeline, which is async by design. WebNN's behavior here is non-normative; the spec should clarify that it shares ownership of WebGPU's (implicit) queue, so calling compute() goes back on the queue timeline, which never blocks the main thread. This will be important for interop because we cannot have compute() work being scheduled alongside WebGPU work on two different timelines.
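For readers less familiar with the WebGPU side, a minimal sketch of the queue-timeline behavior described above; the `device`, `pipeline`, and `bindGroup` setup is assumed and not part of this issue:

```js
// Work is scheduled on the GPU queue timeline; submit() returns immediately
// and never blocks the main thread.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);        // assumed GPUComputePipeline
pass.setBindGroup(0, bindGroup);   // assumed GPUBindGroup
pass.dispatchWorkgroups(64);
pass.end();
device.queue.submit([encoder.finish()]);

// Completion is only observable asynchronously.
await device.queue.onSubmittedWorkDone();
```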
Thanks @bbernhar for clarifying the GPU story. I agree this is important for WebNN-WebGPU interop and we need to improve this part.
I suppose you mean the case where the MLContext is created from a GPUDevice. I think this aligns with what @RafaelCintron shared in the WebML WG Teleconference – 18 Nov 2021:
This is true if the inputs and outputs are GPU resources, such as GPUBuffer. However, I am not sure about the case where the inputs and outputs are CPU resources. According to WebNN ...
Sounds right to me!
GPUBuffer can also be CPU-visible (readback) and still be a GPU resource. What matters is that compute() stays async until the CPU reads the GPU resource (e.g. map/copy). You can upload (tensor) data to the GPU using only the CPU timeline, so long as the GPU is not waiting to execute using it (or "in flight" in GPU lingo). Hope this helps.
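A sketch of that distinction in WebGPU terms (buffer sizes and the `output` buffer are illustrative): uploads can be issued from the CPU timeline, and readback is the only point that has to be asynchronous:

```js
// Upload: CPU-timeline write into a GPU buffer, safe as long as the buffer
// is not currently "in flight" on the GPU.
const input = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(input, 0, new Float32Array(256));

// ... compute work producing `output` is submitted here ...

// Readback: copy into a mappable buffer and map it asynchronously, so the
// main thread only touches the data after the GPU work has completed.
const readback = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(output, 0, readback, 0, 1024);
device.queue.submit([encoder.finish()]);

await readback.mapAsync(GPUMapMode.READ);
const result = new Float32Array(readback.getMappedRange().slice(0));
readback.unmap();
```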
@huningxin @bbernhar For most models, sync execution is sufficient. But models with control flow require reading back intermediate tensors to determine the rest of the execution plan, which would mean the JS main thread is blocked on GPU resources and computation. Having an async compute would allow those types of models to be executed efficiently.
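To make the control-flow case concrete, here is a purely hypothetical sketch (the async `compute()` signature, `condGraph`, `bodyGraph`, and the output shapes are all illustrative, not spec text):

```js
// A loop whose continuation depends on a value computed on the GPU: the
// decision cannot be made until the intermediate result is read back.
let state = initialTensor;
for (;;) {
  // With a sync compute(), this readback would stall the JS main thread.
  const { keepGoing } = await context.compute(condGraph, { input: state });
  if (!keepGoing[0]) break;
  ({ output: state } = await context.compute(bodyGraph, { input: state }));
}
```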
@pyu10055 If we want to use WebGPU, I don't think sync execution for most models will be sufficient. Even if the CPU and GPU are serial, that doesn't prevent WebGPU and WebNN work from being executed in the wrong order on the GPU. =(
@bbernhar Can you elaborate on that? Why would the GPU execute the graph in the wrong order?
If WebNN and WebGPU share the same GPUDevice (= same GPU queue), then GPU-side synchronization is only guaranteed by submission order. WebNN must submit work BEFORE WebGPU does if graph execution depends on it.
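A small sketch of that constraint (the WebNN call and the way it lands on the shared queue are illustrative; the point is only the submission order):

```js
// 1. WebNN work that produces `mlOutput` must reach the shared queue first ...
mlContext.compute(graph, inputs, { output: mlOutput });

// 2. ... before the WebGPU pass that consumes it is submitted.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(postprocessPipeline);      // reads mlOutput via its bind group
pass.setBindGroup(0, postprocessBindGroup);
pass.dispatchWorkgroups(64);
pass.end();
device.queue.submit([encoder.finish()]);
```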
@huningxin Yes, agreed. @bbernhar I think the main issue here is that on the GPU device the graph execution is recorded into a command buffer. However, on the CPU device there is no such concept as a command buffer, and the execution can happen on any timeline depending on the calling thread.
@wchao1115 Yup, that sounds right to me. @huningxin FYI. =)
Thank you @wchao1115 @bbernhar for the explanation. It seems everyone agrees that there will be an async API to access results from the GPU. My point is that the compute API can be just a sync call for most serializable models, given that it does not perform the actual computation; but for models that require reading back intermediate results while the commands are being collected, a sync compute method would not work.
@pyu10055 I think there is some confusion here about what the current compute call does. Regarding WebNN's ability to interop well with WebGPU, my suggestion is that it remains a sync call but with an altered API contract, so that it only records the command dispatches without actually executing the command queue when the given device is a GPU device shared with the WebGPU context. This way the execution of the command queue and the order of submission can be controlled by the caller of both WebGPU and WebNN. That said, the change would have an impact on the CPU device case, and that is the part I think we should still think through.
@wchao1115 Got it. In TensorFlow.js, we unify the APIs for CPU/GPU devices by separating the compute and data-access APIs. The data access is an async call: for CPU it returns an already-resolved promise, while for GPU it returns a promise that uses a fence or polling to wait for the GPU.
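For reference, this is how that separation looks in TensorFlow.js today (the model URL and input shape are illustrative):

```js
import * as tf from '@tensorflow/tfjs';

const model = await tf.loadGraphModel('model.json');      // illustrative URL
// predict() only schedules the kernels; it returns a tensor handle without
// waiting for the backend to finish.
const output = model.predict(tf.zeros([1, 224, 224, 3]));

// data() is the async data-access API: on the CPU backend it resolves
// immediately, on the GPU backends it resolves once the result is readable.
const values = await output.data();
```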
+1
Probably we could define a new type of context and graph for the WebGPU interop case, while the generic context and graph keep the existing behavior.
I agree with this proposal; it is what we've discussed in previous meetings.
As currently spec'd, WebNN's compute API would need to submit work to the WebGPU default queue, which I admit I haven't felt 100% comfortable with. If WebGPU adds multi-queue in the future, it will be even more confusing when and where work is happening. To alleviate the confusion, we can either:
I don't think we need to fork the graph type. We can also achieve what we want here by just extending the existing interfaces: an additional sync method could record the GPU work without executing it. It might be easier to see this in code. I'll put together a PR that implements this change for review.
Looking at this from both a developer-ergonomics and a future-proofing perspective, and setting my chair hat aside, this suggestion from @wchao1115 strongly resonates with me:
Forking may be an easy solution now but may bite us back later. Also, thanks for helping put together a PR. Please consider #250, which adds a [[contextType]] internal slot to MLContext. This internal state may be useful in other places when decisions need to be made based on how the context was created.
@wchao1115 ML work can't be encoded into compute passes because it cannot be dispatched, only submitted through the shared queue. A new "ML pass" type would be needed to encode ML work, or, more simply, pass/get the queue to record into.
@bbernhar The key idea here is to separate the act of submitting the ML GPU work into the command buffer from executing the commands in the queue. How would you suggest we define this behavior for a WebGPU context? i.e. what is the right WebGPU "currency" that should be passed into the proposed record method?
Why not treat MLGraph like a D3D11-style immediate context? MLGraph.record(queue) just records the ML commands into its (internal) command buffer but does not execute them until GPUQueue.submit is called. Using passes just allows for finer-grained scheduling (interleaving 3D and ML work), but that requires non-trivial changes to WebGPU.
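A sketch of that pattern, treating `MLGraph.record()` as the proposed (not yet spec'd) method:

```js
// Recording only: the ML commands land in the graph's internal command
// buffer, tied to the shared queue, but nothing executes yet.
graph.record(device.queue);

// Regular WebGPU work can be encoded on the same device in the meantime.
const encoder = device.createCommandEncoder();
// ... encode 3D or compute passes here ...

// Submission is what actually kicks off both the ML and the WebGPU work,
// in submission order.
device.queue.submit([encoder.finish()]);
```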
Thanks. That works. |
+1 to limit the
If the If we leave the
It sounds good.
It looks like we can support them (MLGraph.record(queue) and an ML pass) one by one for different usages. I am just wondering whether the first one (MLGraph.record(queue)) is capable of supporting the full GPU-only video processing pipeline use case. Any insights?
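For discussion, here is what that use case might look like per frame with only `MLGraph.record(queue)`. Everything here — the preprocess/postprocess pipelines, bind-group helpers, and workgroup counts — is illustrative:

```js
function processFrame() {
  // Zero-copy import of the current video frame.
  const frameTexture = device.importExternalTexture({ source: video });

  // Preprocess: video frame -> ML input buffer, as a WebGPU compute pass.
  const preEncoder = device.createCommandEncoder();
  const pre = preEncoder.beginComputePass();
  pre.setPipeline(preprocessPipeline);
  pre.setBindGroup(0, makePreprocessBindGroup(frameTexture));
  pre.dispatchWorkgroups(workgroupsX, workgroupsY);
  pre.end();
  device.queue.submit([preEncoder.finish()]);

  // Inference: ML work recorded onto the same queue (proposed record()).
  graph.record(device.queue);

  // Postprocess / render the ML output, still with no CPU readback.
  const postEncoder = device.createCommandEncoder();
  encodePostprocess(postEncoder);                 // assumed helper
  device.queue.submit([postEncoder.finish()]);

  video.requestVideoFrameCallback(processFrame);
}
video.requestVideoFrameCallback(processFrame);
```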
If you fork the graph type, then both the graph builder type and the context that creates them will have to fork too. This will create two parallel tracks of interface hierarchy that are largely similar but not really the same. It could be very confusing. One way to reduce the number of type specializations is to decouple the type hierarchy where it matters. In this case we can keep everything else polymorphic, but instead introduce a separate "graph executor" notion (e.g. MLGPUExecutor).
Although forking the graph builder type is not a good solution, we probably should consider whether to specialize the graph execution interface.
It makes sense.
It does. However, because it essentially depends on the type of the context, should we instead consider specializing the context types and grouping the context-dependent "graph execution / recording" methods accordingly, e.g. a generic context and a GPU-specific context?
Resource sharing through a shared queue is the preferred path for GPU interop (zero copy or GPU-only processing). But there is a limitation on the WebGPU side; it needs ...
@huningxin, maybe it's easier to show it in code. (Not exactly WebNN syntax; some details are omitted here for simplicity.)

```js
const context = ml.createContext(gpuDevice);
const builder = new MLGraphBuilder(context);
const conv2d = builder.conv2d(...);
const graph = builder.build(conv2d);

// Record the ML workload to a given queue, so it can be interleaved with the
// rest of the WebGPU workload.
const gpuExecutor = new MLGPUExecutor(context);
gpuExecutor.record(graph, gpuQueue);
```

The only time an interface specialization is needed here is when a graph executor is needed to execute the payload in the graph. I prefer not to specialize the context type for this, because there may be methods we want to add to the context interface in the future that make sense for all types of context; specializing it now could potentially fragment the interface over time.

The specialization for the executor is more future-proof, since it is truly the point in the API call pattern, as we know it right now, where callers do need to know what they want to do next, on what kind of context they operate, and what threading model they are bound to. This gives them that flexibility without polluting the rest of the API calling pattern. For example, if a caller wants to use WebNN just to construct a graph out of a given context, regardless of what kind of context they may be given, they can do that without having to know the difference between the different context types.
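A short follow-up sketch of that last point: graph construction stays context-agnostic, and only the execution step picks an executor (descriptors and weights are illustrative):

```js
// Works the same for any kind of MLContext.
function buildModel(context) {
  const builder = new MLGraphBuilder(context);
  const x = builder.input('x', inputDesc);
  const w = builder.constant(weightDesc, weights);
  return builder.build(builder.conv2d(x, w));
}

// Only at execution time does the caller care which kind of context it has.
const graph = buildModel(gpuContext);
const gpuExecutor = new MLGPUExecutor(gpuContext);
gpuExecutor.record(graph, device.queue);
```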
@wchao1115, thanks for the code example, it really helps.
Should the graph initialization be specialized as well? If the graph builder is created from a GPU context, it may accept GPU resources as constants and initialize the graph with those GPU resources. Should the graph initialization commands be recorded into the GPU queue? Should it be decoupled into a "graph initializer"?
Graph constants such as weights are normally uploaded from CPU memory, even for the GPU context. Initializers actually need to be treated like graph inputs that are bound at execution time. I think the current support for GPU resource view constants is probably not needed.
It's true. However, to work with the WebGPU implementation of a framework, like the TensorFlow.js WebGPU backend, tensor data may already have been uploaded to a GPUBuffer. Should we decouple the graph initialization from the graph build? It would mean the constant GPU buffers are bound and uploaded in a separate initialization step, for example:

```js
const context = ml.createContext(gpuDevice);
const builder = new MLGraphBuilder(context);
const filter = builder.constant();
const conv2d = builder.conv2d(filter);
const graph = builder.build(conv2d);

// Record the ML workload to a given queue, so it can be interleaved with the
// rest of the WebGPU workload.
const gpuExecutor = new MLGPUExecutor(context);
gpuExecutor.init(graph, constantGpuBuffers, gpuQueue);
gpuExecutor.record(graph, inputGpuBuffers, outputGpuBuffers, gpuQueue);
```

The downside is the ...
There isn't a need to define an init step.
The weight resource might be bound once and owned by the runtime, e.g. by setting ...
Did I read it correctly?
I suppose this is fixed by #257.
Closing with a note that async context creation is discussed in its own issue #272. |
As mentioned in #229, the existing WebNN graph building (MLGraphBuilder.build) and execution (MLGraph.compute) APIs are sync. This is required by Wasm (C++) based ML frameworks' backend implementations. To avoid blocking the main thread, the good practice is to call these synchronous APIs in a worker context.
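A sketch of that worker pattern, using the sync shapes discussed in this issue (the message protocol and descriptors are illustrative):

```js
// worker.js: the sync WebNN calls block only this worker, not the main thread.
let graph = null;

self.onmessage = (e) => {
  if (e.data.type === 'build') {
    const context = navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);
    const x = builder.input('x', e.data.inputDesc);
    const w = builder.constant(e.data.weightDesc, e.data.weights);
    graph = builder.build(builder.conv2d(x, w));          // sync build
    self.postMessage({ type: 'ready' });
  } else if (e.data.type === 'compute') {
    const outputs = { y: new Float32Array(e.data.outputSize) };
    graph.compute({ x: e.data.input }, outputs);          // sync compute
    self.postMessage({ type: 'result', y: outputs.y }, [outputs.y.buffer]);
  }
};
```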
There are JavaScript-based ML frameworks, like TensorFlow.js, that are mainly used on the main thread. Should WebNN support async APIs on the main thread? This would help not only the JS ML frameworks but also broader JS adoption of the API.
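For comparison, one possible async shape on the main thread could look like the following (purely illustrative; the method names are hypothetical, not a concrete proposal):

```js
// Hypothetical promise-based variants of the same calls, usable directly on
// the main thread without blocking it.
const context = await navigator.ml.createContextAsync();
const builder = new MLGraphBuilder(context);
const graph = await builder.buildAsync(builder.relu(builder.input('x', inputDesc)));
const results = await context.computeAsync(graph, { x: inputData });
```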
/cc @pyu10055