Query mechanism for supported devices #815
It would work, but perhaps of limited usefulness? I am not sure how we answer the question from #809 (comment) without providing some fine-grained limits/capabilities as noted in #749. The concern there, mentioned in #749 (comment), was the privacy/security aspect, from my understanding.
Fine-grained signals are not necessary to answer the question from #809 (comment). A developer who has developed GPU- and CPU-optimized models wants to implement this flowchart:

flowchart TD
A[Device has GPU?] -->|Yes| B(Load GPU model)
A -->|No| C(Load CPU model)
B --> D(Can execute on GPU?)
D -->|Yes| E(Execute on GPU)
D -->|No| C
C --> F(Execute on CPU)
This only requires knowing whether the GPU is an option for a given context (to avoid loading a GPU model if it has no chance of working) and whether a graph can run (or has run) on the GPU or not. The actual flowchart may be more complex, as it could include non-WebNN fallbacks such as Wasm and WebGPU, as well as models with different system performance requirements.
If we remove […]

Just to note: earlier Chai (?) mentioned we need a way to easily create a GPU context without a WebGPU device, and we are losing that with #809. Querying context capabilities of generic contexts will not solve that. It seems there is still a strong use case to query the platform capabilities and match them with the requirements of the model at hand. Moreover, I could imagine a more complex flowchart where we also want to query other context-specific constraints. We could go and create a generic context using the current options and then query for capabilities, but this does not guarantee getting a GPU context, even when one would be available. I propose we walk a middle path, e.g. by saying:

const context = await navigator.ml.createContext({
powerPreference: 'high-performance',
devicePreference: 'gpu'
});

This would only guarantee that a GPU will be involved (e.g. a combination of GPU and NPU/CPU); otherwise it will fail. Note that passing […]

const context = await navigator.ml.createContext({
powerPreference: 'low-power',
devicePreference: 'npu',
supportLimits: { output: [ "float16" ] }
});

I am not sure whether we also need the query function proposed there (the one that might make fingerprinting easier):

partial interface ML {
record<USVString, MLOpSupportLimits> opSupportLimitsPerAdapter();
};

If possible, I'd try the options above without this query function in a first attempt. If yes, we could do the following: […]

WDYT?
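For illustration, here is a minimal sketch of how the earlier flowchart could be written against this proposal; note that `devicePreference` is only a proposal from this thread (not part of the spec), the rejection behavior is assumed, and `loadGpuModel`/`loadCpuModel` are hypothetical app-side helpers:

```js
// Sketch only: 'devicePreference' is the option proposed above, not specced;
// rejection on an unsatisfiable preference is an assumption.
let context, model;
try {
  context = await navigator.ml.createContext({
    powerPreference: 'high-performance',
    devicePreference: 'gpu'
  });
  model = await loadGpuModel(); // hypothetical app-side helper
} catch (e) {
  // No GPU-backed context available: fall back to a default context + CPU model.
  context = await navigator.ml.createContext();
  model = await loadCpuModel(); // hypothetical app-side helper
}
```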
Thanks for your swift responses and suggestions, everyone. (Markdown-friendly Mermaid is a great tool for quick diagrams; we should use it more often in spec land.) @zolkis, I see consensus emerging on PR #809, so I'd like to see that through and work on the query mechanism on top of it. If we need to revive some bits from that PR later, that is OK.
Technically, there is nothing in the WebGPU specification which requires a GPU (it was intentionally designed this way).
The concept that an npu device has certain operations and a gpu device has others, or that a cpu device can run all operations, is something WebKit thinks should remain out of the WebNN specification, leaving these details to the UA. Foreseeably, one vendor's npu device could have the capabilities of another vendor's gpu device, or one vendor's gpu could have all the same operations/capabilities as the cpu with different performance characteristics. I understand there are privacy concerns, but achieving this use case seems better suited to something like WebGPU's supportedLimits / supportedFeatures. Perhaps we could have a similar approach, with allowances to maintain privacy?
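For reference, this is roughly the WebGPU pattern being alluded to; the calls below are existing WebGPU API and are shown only to illustrate the shape a privacy-conscious WebNN equivalent might take:

```js
// WebGPU exposes capabilities as a setlike of feature names plus a limits
// object on the adapter, without naming the underlying hardware.
const adapter = await navigator.gpu.requestAdapter();
if (adapter) {
  if (adapter.features.has('shader-f16')) {
    // fp16 shaders are supported on this adapter.
  }
  console.log(adapter.limits.maxComputeInvocationsPerWorkgroup);
}
```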
I imagine the "Device has GPU?" check would be implemented by running:

const context = await navigator.ml.createContext({
powerPreference: 'high-performance'
});
if (context.devices.has('gpu')) {
// Yes, load GPU model.
} else {
// No, load CPU model.
}

Given the constraints of the Core ML framework, the "Can execute on GPU?" check needs to actually invoke the graph:

const [test_input, test_output] = await Promise.all([
context.createTensor({...}),
context.createTensor({...}),
]);
context.dispatch(graph, { input: test_input }, { output: test_output });
await context.readTensor(test_output);
if (graph.devices.has('gpu')) {
// Yes.
} else {
// No, load CPU model.
}

The naming and how exactly […]
OK, so there seems to be a preference for creating a device-agnostic context and leaving it to the UA to select the device(s) based on context options (currently only […]).
Sounds good, especially the latter, for our use case. So what about the following:

enum MLDeviceType {
"cpu",
"gpu",
"npu"
};
interface MLDevices {
readonly setlike<MLDeviceType>;
};
partial interface MLContext {
[SameObject] readonly attribute MLDevices devices;
};

So then the examples by @anssiko and @reillyeon would work.
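To make the shape concrete, here is a small usage sketch of the proposed `devices` attribute (a proposal in this thread, not the current spec), echoing the earlier examples:

```js
// Sketch only: 'context.devices' is the setlike attribute proposed above.
const context = await navigator.ml.createContext({
  powerPreference: 'high-performance'
});
if (context.devices.has('gpu') || context.devices.has('npu')) {
  // An accelerator is associated with this context.
} else {
  // CPU only.
}
for (const device of context.devices) {
  console.log(device); // "cpu", "gpu", or "npu"
}
```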
Thanks for the discussion. I'd like to understand more about the use case.
As I commented, a WebNN implementation could use WebGPU shaders to execute a graph on the GPU for some devices where IHV-optimized GPU kernels are not available. Native ML frameworks are doing a similar thing, for example the ONNX Runtime native WebGPU execution provider. I suppose that when high-performance execution is preferred, silently falling back to sub-optimal CPU execution is an implementation issue.
May I know what aspects make a model more CPU-friendly? Does it use CPU-friendly operators or data types? Are there any signals indicating that a WebNN implementation should select the CPU device when the app loads a CPU-friendly model?
I am not an expert here, but my understanding is that a CPU-friendly model may use different data types or simply be smaller, so that it can meet the application's performance needs while providing acceptably reduced inference quality.
I think there was some discussion of adding a specific "cpu-only" flag to context creation if the current power preference flags are insufficient.
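Purely to make that idea concrete, a hypothetical sketch; neither the option name nor the behavior below is specified anywhere:

```js
// Hypothetical 'cpuOnly' option illustrating the "cpu-only" flag idea;
// assumed here to constrain execution to the CPU. Not in the WebNN spec.
const context = await navigator.ml.createContext({
  powerPreference: 'low-power',
  cpuOnly: true
});
```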
@zolkis What information does the 'cpu', 'gpu', 'npu' label really provide? The concern is that it does not provide information about the chip's capabilities. Instead, we could have a mechanism for saying that a given UA can, for example: […]

Whether or not the UA is running on a device with an npu or gpu is not really meaningful. One vendor's NPU could be equal in ML capabilities to another vendor's GPU, for instance.
@mwyrzykowski yes, the idea of exposing capabilities has been recorded. My thinking was to remove device types (PR #809), then experiment with how to expose the capabilities and/or other mappings between context options/properties and device selection. But it seems that #809 made people rethink what we'd lose and what we'd need, spawning this issue.

IIUC, the capabilities approach does not necessarily conflict with having another mechanism to poll whether a given type (as a standardized string label) of accelerator/device is associated with the context just created on the given platform/environment. I agree that knowing the current device types is not enough information, but I don't think it is useless information, at least today.

If you have a strong view on avoiding CPU/GPU/NPU device names altogether and focusing only on capabilities, we need to discuss this during the next call (@anssiko please check). The problem with capabilities is that they are not well defined at the moment and appear to be (far) more complex to use. We need time to figure that out. Yet I'd agree this might be the most flexible approach, and also the most exact one.
Given discussion today on the telecon and @mwyrzykowski's comments above, WDYT about something like this:

const support = await context.querySupport({
dataTypes: ['float16', 'int8'],
maximumRank: 6,
operators: ['lstm', 'hardSwish'],
});
console.log(support); // "optimized" or "fallback"
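A sketch of how an app might act on that result, assuming `querySupport` were adopted roughly as proposed above (it is not shipped API) and that "fallback" means some of the listed operators or data types would be emulated; `runWebNNModel` and `runWasmFallback` are hypothetical app-side helpers:

```js
// Sketch only: 'querySupport' and its "optimized"/"fallback" result are the
// proposal above, not part of the current WebNN spec.
const support = await context.querySupport({
  dataTypes: ['float16', 'int8'],
  maximumRank: 6,
  operators: ['lstm', 'hardSwish'],
});
if (support === 'optimized') {
  await runWebNNModel(context); // hypothetical app-side helper
} else {
  await runWasmFallback();      // hypothetical app-side helper
}
```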
Thanks for the discussion yesterday on the telecon! Just to repeat the clarification of the use case that we have for real-time video processing: […]
@inexorabletash, @mwyrzykowski - that looks really interesting, and close to how MediaCapabilities works! I'd imagine […] In terms of naming here, perhaps "accelerated" is a term to consider to mean execution on NPU or GPU. It's widely used inside Chrome and elsewhere in other contexts.
Another clarification about our use case is that we'd strongly recommend that the API not do internal fallbacks from "accelerated" to CPU at runtime. We'd prefer that the API error out in that case, to enable logic in the app to choose what to try next, which could include using tech other than WebNN.
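Concretely, that preference amounts to app-level logic along these lines; the `fallback: 'none'` option is purely hypothetical shorthand for "error out instead of silently falling back to CPU", and the helpers are app-side placeholders:

```js
// Sketch only: 'fallback: "none"' is a hypothetical option, not in the spec.
try {
  const context = await navigator.ml.createContext({
    powerPreference: 'high-performance',
    fallback: 'none'
  });
  await runWithWebNN(context); // hypothetical app-side helper
} catch (e) {
  // The app decides what to try next: WebGPU compute, Wasm, lower quality, etc.
  await runWithWebGPU();       // hypothetical app-side helper
}
```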
Pasting here the use cases summarized by @anssiko in the call yesterday, plus adding @mwyrzykowski's original proposal as number 0. […]
@inexorabletash this looks good to me, as this would elegantly combine 0 and 2. We need to work on the content of the dictionary. So the developer use case / flow is: […]
@inexorabletash this looks neat - but will the backend actually need more info (the whole model graph?) to be able to accurately determine feasibility? E.g. in the example, will the result be "optimized" only if the implementation supports the Cartesian product of data types and ops (even though that may not strictly be required)?
The problem with this proposal is that the frameworks on which WebNN is implemented don't expose enough information to determine ahead of time whether a given operation will be optimized or not. At best, in our current implementation we can tell whether the framework supports an op or whether the browser needs to emulate it, but we can't tell whether the framework itself emulates the op on a given hardware configuration.
@reillyeon What is your take on […]? (As there have been issues with "gpu" and "npu" naming (especially NPU, which is very diverse), we could replace a) with checking whether a context is "accelerated", based on this suggestion, but we should be able to well-define in the spec what "accelerated" means.)
That's true. On the other hand, even if an op is emulated by the browser, the underlying framework may still be able to fuse the decomposed small ops into an optimized one, which we can't tell either.
It sounds like your use case needs a "non-CPU context" (or "accelerator-only context"). The context created by the DirectML backend is accelerator-only, because DirectML itself doesn't support CPU execution. The TFLite backend may avoid CPU fallback by checking whether the graph is fully delegated to an accelerator delegate. CoreML's MLComputeUnits seems to always include the CPU device, and I am not sure whether there is a way to exclude the CPU. @mwyrzykowski? As another example, ONNX Runtime also allows disabling the fallback to the default CPU EP (thanks @skottmckay for sharing that).
Should we expose a context creation hint/option for preventing CPU fallback? And make the result (status) available for query? Corollary: should we make a distinction in the API between options (which, when not met, result in failure) and hints (which do not result in an error)? Edit: it looks like, due to conflicting requirements and support levels, device preferences may only be hints, like "no-gpu", "no-npu", "no-cpu", "gpu", "npu", "cpu", with post-creation query support for "has" { "gpu", "npu", "cpu" } AND/OR/XOR capabilities.
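To make the hints-plus-query idea concrete, a sketch assuming device preferences are non-binding hints and the resulting device set can be inspected afterwards; `deviceHints` and `context.devices` are illustrative names only, and nothing here is specced:

```js
// Sketch only: a hint that may be ignored without erroring, paired with a
// post-creation query of which device types ended up backing the context.
const context = await navigator.ml.createContext({
  powerPreference: 'high-performance',
  deviceHints: ['no-cpu']
});
if (context.devices.has('cpu')) {
  // The hint could not be honored; the app can tear down and try another path.
}
```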
That is correct: CPU fallback is unavoidable via CoreML. Implementations could use MPS and run on the GPU, but I'm not aware of any mechanism for supporting the NPU / ANE without potential CPU fallback.
From an interop perspective I think the mandatory CPU fallback behavior is a positive; we just need to provide a mechanism for developers who have developed better fallbacks to reliably detect it (as early as practical).
I'm not so sure. To be pedantic, I might call a CPU inference engine using XNNPACK "accelerated", since it is using hand-optimized assembly routines tuned for a handful of CPU architectures rather than a naive implementation in C++. Similarly, is a GPU always "accelerated", or does it need to have dedicated "tensor cores"?
I have a conflict for today's meeting, but Apple's preference regarding #815 (comment) is (b). Specifically, a processing chip is best described in terms of capabilities (which ops it supports, which limits it has, and so on) as opposed to names (which can vary drastically between vendors or even within the same vendor).
[ Spun off from PR #809 review comment by @fdwr and a discussion with @inexorabletash, @reillyeon, @RafaelCintron, @philloooo -- thanks! ]
(Emphasis mine:) […]

To kick off this discussion, I'll start with a simple and naive API further abstracted from the GPUDevice.features setlike interface: […]

To help tease out more informed proposals, I'd like folks to test this imaginary API against the following:

@mwyrzykowski would this work with CoreML MLComputeUnits? A subset of possible MLDeviceType combinations is supported, but there's also a catch-all all?

How about other backends?
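For reference, Core ML's MLComputeUnits has four values (cpuOnly, cpuAndGPU, cpuAndNeuralEngine, all), so a device-set style API could only map onto it approximately. A sketch of one possible mapping, with the set semantics assumed rather than specified:

```js
// Sketch only: an approximate mapping from a requested device set to Core ML's
// MLComputeUnits values; note the CPU is always included on the Core ML side.
function toMLComputeUnits(devices /* Set of 'cpu' | 'gpu' | 'npu' */) {
  const gpu = devices.has('gpu');
  const npu = devices.has('npu');
  if (gpu && npu) return 'all';              // MLComputeUnits.all
  if (gpu) return 'cpuAndGPU';               // MLComputeUnits.cpuAndGPU
  if (npu) return 'cpuAndNeuralEngine';      // MLComputeUnits.cpuAndNeuralEngine
  return 'cpuOnly';                          // MLComputeUnits.cpuOnly
}
```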
This issue welcomes further discussion and proposals on the topic of a post-context-creation query mechanism for supported devices.