Query mechanism for supported devices #815

Open
anssiko opened this issue Feb 4, 2025 · 24 comments

@anssiko
Member

anssiko commented Feb 4, 2025

[ Spun off from PR #809 review comment by @fdwr and a discussion with @inexorabletash, @reillyeon, @RafaelCintron, @philloooo -- thanks! ]

(Emphasis mine:)

This afternoon we were talking with {@inexorabletash, @reillyeon, @RafaelCintron, Phillis} about the likely need for a caller to know whether a particular device is supported or not, because an app may want to (if say GPU is not supported) use a different, more performant fallback rather than have WebNN silently fall back to CPU. For example, if GPU was unavailable (even though you preferred high performance), then it might be faster to execute the model with WebGPU shaders than WebNN CPU, or it might be okay to use CPU, but the app could load a different model that's more CPU-friendly, if it knew that was the case.

This need is the reverse of what we currently have: querying the context for what kind of devices it actually supports rather than prescribing a preferred device type. Now, preferred device type prescription may return someday as a weaker hint (or list of hints, or list of antihints...) as we learn more about usage/needs, but in either case, prescription or post-query, having device enums could be useful. 🤔

To kick off this discussion, I'll start with a simple and naive API further abstracted from the GPUDevice.features setlike interface:

// only MLPowerPreference supported at context creation time
const context = await navigator.ml.createContext({ powerPreference: 'high-performance' });

// query the newly created context
if (context.devices.has('cpu') && context.devices.has('npu')) {
   // context supports both CPU and NPU, choose the model accordingly
   // how the work is split across CPU and NPU is an implementation detail
} else if (!context.devices.has('gpu')) {
   //  no GPU available, perhaps try WebGPU?
} else {
   // ...
}

To help tease out more informed proposals, I'd like folks to test this imaginary API against the following:

@mwyrzykowski would this work with CoreML MLComputeUnits? A subset of possible MLDeviceType combinations is supported but there's also a catch-all all?

How about other backends?

This issue welcomes further discussion and proposals on the topic of a post-context-creation query mechanism for supported devices.

@mwyrzykowski

@mwyrzykowski would this work with CoreML MLComputeUnits? A subset of possible MLDeviceType combinations is supported but there's also a catch-all all?

It would work, but perhaps be of limited usefulness? context.devices.has(...) would return true, but there is no indication which device a particular model will run on. E.g., usage of float32 arithmetic will prevent the model from running on Apple's NPUs, whereas on another computing device, usage of float16 arithmetic could prevent running on a certain processing unit.

I am not sure how we answer the question from #809 (comment) without providing some fine-grained limits / capabilities as noted in #749. The concern there, mentioned in #749 (comment), was the privacy / security aspect, from my understanding.

@reillyeon
Contributor

reillyeon commented Feb 4, 2025

Fine-grained signals are not necessary to answer the question from #809 (comment). A developer who has developed GPU- and CPU-optimized models wants to implement this flowchart:

flowchart TD
    A[Device has GPU?] -->|Yes| B(Load GPU model)
    A -->|No| C(Load CPU model)
    B --> D(Can execute on GPU?)
    D -->|Yes| E(Execute on GPU)
    D -->|No| C
    C --> F(Execute on CPU)

This only requires knowing whether the GPU is an option for a given context (to avoid loading a GPU model if it has no chance of working) and whether a graph can run (or has run) on the GPU or not.

The actual flowchart may be more complex, as it could include non-WebNN fallbacks such as WASM and WebGPU as well as models with different system performance requirements.

@zolkis
Collaborator

zolkis commented Feb 5, 2025

If we remove MLDeviceType (as in #809), the only way to do the flowchart above is trying to create a WebGPU context, and if that fails, then create a generic context - and then this issue argues for adding query support to those contexts.

Just to note, earlier Chai (?) mentioned we need a way to easily create a GPU context without a WebGPU device. We are losing that with #809. Querying context capabilities of generic contexts will not solve that.

It seems there is still a strong use case to query the platform capabilities and match that with the requirements of the model at hand. Moreover, I could imagine a more complex flowchart, where we also want to query other context-specific constraints.

We could go and create a generic context using the current options and then query for capabilities. But this does not guarantee getting a GPU context, even when one would be available.

I propose we walk a middle path, e.g. by saying:

const context = await navigator.ml.createContext({ 
    powerPreference: 'high-performance', 
    devicePreference: 'gpu'
});

This would guarantee only that the GPU will be involved (e.g. in a combination of GPU and NPU/CPU); otherwise it will fail.

Note that passing "npu" alone might be too little information. So I do see the reason for passing other context creation options/constraints, too, as suggested in #749 (comment):

const context = await navigator.ml.createContext({ 
    powerPreference: 'low-power', 
    devicePreference: 'npu',
    supportLimits: { output: [ "float16" ] }
});

I am not sure whether we also need the query function proposed there (the one that might make fingerprinting easier):

partial interface ML {
  record<USVString, MLOpSupportLimits> opSupportLimitsPerAdapter();
};

If possible, I'd first try the options above without this query function.
@mwyrzykowski would that work for you? @reillyeon?

If yes, we could do the following:

  • I close Remove MLDeviceType #809 (i.e. cancel the PR) -- or modify it as follows.
  • keep the MLDeviceType enum
  • support a new context creation option as MLOpSupportLimits supportLimits, as proposed in #749 (comment)
  • write the related algorithmic steps.

WDYT?

@anssiko
Member Author

anssiko commented Feb 5, 2025

Thanks for your swift responses and suggestions, everyone. (Markdown-friendly Mermaid is a great tool for quick diagrams; we should use it more often in spec-land.)

@zolkis, I see consensus emerging on PR #809, so I'd like to see that through and work on the query mechanism on top of it. If we need to revive some bits from that PR later, that is OK.

@mwyrzykowski

If we remove MLDeviceType (as in #809), the only way to do the flowchart above is trying to create a WebGPU context, and if that fails, then create a generic context - and then this issue argues for adding query support to those contexts.

Technically, there is nothing in the WebGPU specification which requires a GPU (it was intentionally designed this way).

const context = await navigator.ml.createContext({
    powerPreference: 'low-power',
    devicePreference: 'npu',
    supportLimits: { output: [ "float16" ] }
});

If possible, I'd first try the options above without this query function. @mwyrzykowski would that work for you? @reillyeon?

The concept that an npu device has certain operations and a gpu device has others, or that a cpu device can run all operations, is something WebKit thinks should remain out of the WebNN specification, leaving these details to the UA.

Foreseeably, one vendor's npu device could have the capabilities of another vendor's gpu device. Or one vendor's gpu could have all the same operations / capabilities as the cpu, with different performance characteristics.

I understand there are privacy concerns, but achieving this use case seems better suited to something like WebGPU's supportedLimits / supportedFeatures:
https://www.w3.org/TR/webgpu/#gpusupportedlimits
https://www.w3.org/TR/webgpu/#gpusupportedfeatures

Perhaps we could have a similar approach with allowances to maintain privacy?
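For reference, this is roughly how the WebGPU pattern referenced above looks from script today (real, shipped WebGPU API; any WebNN analogue discussed in this thread would be new and is not implied by this snippet):

// WebGPU exposes coarse feature flags and numeric limits instead of device names.
const adapter = await navigator.gpu.requestAdapter();
if (adapter?.features.has('shader-f16')) {
  // float16 shader arithmetic is available on this adapter
}
console.log(adapter?.limits.maxBufferSize); // a numeric limit, typically coarsened by the UA for privacy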

@reillyeon
Contributor

I imagine the "Device has GPU?" check would be implemented by running,

const context = await navigator.ml.createContext({
    powerPreference: 'high-performance'
});
if (context.devices.has('gpu')) {
  // Yes, load GPU model.
} else {
  // No, load CPU model.
}

Given the constraints of the Core ML framework the "Can execute on GPU?" check needs to actually invoke the graph,

const [test_input, test_output] = await Promise.all([
    context.createTensor({...}),
    context.createTensor({...}),
]);
context.dispatch(graph, { input: test_input }, { output: test_output });
await context.readTensor(test_output);
if (graph.devices.has('gpu')) {
  // Yes.
} else {
  // No, load CPU model.
}

The naming and how exactly graph.devices gets populated is TBD but would be based on the history of previous inference runs.

@zolkis
Collaborator

zolkis commented Feb 6, 2025

OK, so there seems to be a preference for creating a device-agnostic context, leaving it to the UA to select the device(s) based on context options (currently only powerPreference; any other hints to add to context options about device selection?), and then apps would query the context for more detailed capabilities.

@mwyrzykowski
I understand there are privacy concerns, but achieving this use case seems better suited to something like WebGPU's supportedLimits / supportedFeatures:
https://www.w3.org/TR/webgpu/#gpusupportedlimits
https://www.w3.org/TR/webgpu/#gpusupportedfeatures

Sounds good, especially the latter, for our use case. So what about the following.

enum MLDeviceType {
  "cpu",
  "gpu",
  "npu"
};

interface MLDevices {
    readonly setlike<MLDeviceType>;
};

partial interface MLContext {
    [SameObject] readonly attribute MLDevices devices;
};

So then the examples by @anssiko and @reillyeon would work:

if (context.devices.has('gpu')) {

@huningxin
Contributor

@fdwr

This afternoon we were talking with @inexorabletash, @reillyeon, @RafaelCintron, @philloooo about the likely need for a caller to know whether a particular device is supported or not,

Thanks for the discussion. I'd like to understand more about the use case.

because an app may want to (if say GPU is not supported) use a different, more performant fallback rather than have WebNN silently fall back to CPU. For example, if GPU was unavailable (even though you preferred high performance), then it might be faster to execute the model with WebGPU shaders than WebNN CPU,

As I commented, a WebNN implementation could use WebGPU shaders to execute a graph on the GPU for some devices where IHV-optimized GPU kernels are not available. Native ML frameworks are doing a similar thing, for example the ONNX Runtime native WebGPU execution provider. I suppose that when high-performance execution is preferred, silently falling back to sub-optimal CPU execution is an implementation issue.

or it might be okay to use CPU, but the app could load a different model that's more CPU-friendly, if it knew that was the case.

May I know what aspects make a model more CPU-friendly? Does it use CPU-friendly operators or data types? Are there any gaps in indicating to a WebNN implementation that it should select the CPU device when the app loads a CPU-friendly model?

@reillyeon
Contributor

May I know what aspects make a model more CPU-friendly? Does it use CPU-friendly operators or data types?

I am not an expert here but my understanding is that a CPU-friendly model may use different data types or simply be smaller so that it can meet the application's performance needs while providing acceptably reduced inference quality.

Are there any gaps in indicating to a WebNN implementation that it should select the CPU device when the app loads a CPU-friendly model?

I think there was some discussion of adding a specific "cpu-only" flag to context creation if the current power preference flags are insufficient.
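A minimal sketch of what such a flag could look like at context creation time (the cpuOnly option name is hypothetical, not an agreed proposal):

const context = await navigator.ml.createContext({
    powerPreference: 'low-power',
    cpuOnly: true  // hypothetical option: never schedule work on GPU/NPU
});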

@mwyrzykowski

Sounds good, especially the latter, for our use case. So what about the following.

enum MLDeviceType {
  "cpu",
  "gpu",
  "npu"
};

interface MLDevices {
    readonly setlike<MLDeviceType>;
};

partial interface MLContext {
    [SameObject] readonly attribute MLDevices devices;
};
So then the examples by @anssiko and @reillyeon would work:

if (context.devices.has('gpu')) {

@zolkis What information does the 'cpu', 'gpu', 'npu' label really provide? The concern is that it does not provide information about the chip's capabilities.

Instead, we could have a mechanism for saying that a given UA can, for example:

  • support the following MLOperandDataType values (e.g., float16, float32, int64, float64)
  • create tensors of maximum shape = K
  • other distinguishing limits?

Whether or not the UA is running on a device with an npu or gpu is not really meaningful. One vendor's NPU could be equal in ML capabilities to another vendor's GPU, for instance.
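To illustrate, a rough sketch of a capability-first check along these lines (the capabilities attribute and its members are hypothetical placeholders, not existing or proposed API):

// Hypothetical: describe what the model needs and check capability buckets,
// without ever asking for a 'cpu' / 'gpu' / 'npu' label.
const caps = context.capabilities;                  // hypothetical attribute
if (caps.dataTypes.has('float16') && caps.maxRank >= 4) {
    // load the float16 variant of the model
} else {
    // load a smaller variant, or fall back to WebGPU / WASM
}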

@zolkis
Collaborator

zolkis commented Feb 11, 2025

@mwyrzykowski yes, the idea of exposing capabilities has been recorded. I was thinking of removing device types (PR #809), then experimenting with how to expose the capabilities, and/or other mappings between context options/properties and device selection.

But it seems that #809 made people rethink what we'd lose and what we'd need, spawning this issue.

IIUC, the capabilities approach does not necessarily conflict with having another mechanism to poll whether a given type (as a standardized string label) of accelerator/device is associated with the context just created on the given platform/environment.
In my current understanding, in most use cases CPU/GPU/NPU still have meaning (as well as symbolic mappings like low-power and high-performance), though it might be blurry and conflated with capabilities, and NPU alone is a rather diverse category.

I agree that knowing the current device types is not enough information, but I don't think this is useless information, at least today.

If you have a strong view on avoiding CPU/GPU/NPU device names altogether and focusing only on capabilities, we need to discuss this during the next call (@anssiko please check). The problem with capabilities is that they are not well defined ATM, and they appear to be (far) more complex to use. We need time to figure that out. Yet, I'd agree this might be the most flexible approach, and also the most exact one.

@inexorabletash
Member

Given discussion today on the telecon and @mwyrzykowski's comments above, WDYT about something like this:

const support = await context.querySupport({
  dataTypes: ['float16', 'int8'],
  maximumRank: 6,
  operators: ['lstm', 'hardSwish'],
});
console.log(support); // "optimized" or "fallback" 

@handellm

Thanks for the discussion yesterday on the telecon!

Just to repeat the clarification of the use case that we have for real-time video processing:

  1. If the user chooses functionality like background blur, we want to offer the best quality the device can deliver. So the product has a small set of candidate models and technologies (WebNN, WebGPU, WASM) that it has to choose between. Accelerated technologies come with an allowance for beefier models.
  2. The model/tech chooser algorithm needs to be fast, and we need to avoid spending seconds or even hundreds of milliseconds to figure out whether a given model should be able to run accelerated. So, for example, downloading a model in its entirety (these could be large), compiling it, and try-running it seems infeasible.

@inexorabletash, @mwyrzykowski - that looks really interesting, and close to how MediaCapabilities works! I'd imagine querySupport would potentially be almost instant. It's easy to keep that kind of small metadata with each model, and it is a small download.
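For comparison, the Media Capabilities pattern referenced above (this is the real, shipped API): the caller describes the workload up front and gets back coarse booleans rather than hardware names.

const info = await navigator.mediaCapabilities.decodingInfo({
    type: 'file',
    video: {
        contentType: 'video/webm; codecs="vp9"',
        width: 1920,
        height: 1080,
        bitrate: 2000000,
        framerate: 30
    }
});
console.log(info.supported, info.smooth, info.powerEfficient);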

In terms of naming here, perhaps "accelerated" is a term to consider to mean execution on NPU or GPU. It's widely used inside Chrome and elsewhere in other contexts.

@handellm

Another clarification about our use case is that we'd strongly recommend the API to not do internal fallbacks from "accelerated" to CPU in runtime. We'd prefer that the API error out in that case, enabling logic in the app to choose what to try next, which could include using tech other than WebNN.

@zolkis
Collaborator

zolkis commented Feb 14, 2025

Pasting here the use cases summarized by @anssiko in the call yesterday, plus adding @mwyrzykowski 's original proposal as number 0.

  0. query limits/capabilities before context creation
  1. hints provided at context-creation time (I want a "high-performance" context)
  2. query/request device availability after context creation ("does the context support... (types, rank, ops)?")
  3. query device capabilities after compile ("can you actually run this model?")

@inexorabletash this looks good to me, as this would elegantly combine 0 and 2. We need to work on the content of the dictionary.

So the developer use case / flow is (a sketch follows the list):

  • create an MLContext with options (high-perf, low-power, + discuss adding explicit or specify implicit fallback constraints, e.g. "no CPU fallback wanted") (that is implicit device selection, aka v1). Note that we can create a WebGPU context already at this point.
  • query/request capabilities with the given model's constraints (the current "device selection v2" proposal, with query parameters to be distilled/discussed)
    • if not supported, try another model / query
    • if none works, use something else...
  • if supported, go ahead and build the graph
  • build/compile (should we add any compile options?) (open / future)
  • query capabilities after compile (open / future)
  • run.
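A minimal end-to-end sketch of this flow, assuming the querySupport() shape proposed above (querySupport() and its return values are part of that proposal, not shipped API):

// 1. create a context with hints only
const context = await navigator.ml.createContext({ powerPreference: 'high-performance' });

// 2. ask whether the model's requirements can be met well (proposed API)
const support = await context.querySupport({
    dataTypes: ['float16'],
    maximumRank: 6,
    operators: ['conv2d', 'lstm']
});

if (support === 'optimized') {
    // 3. go ahead: build, compile and run the preferred model with WebNN
} else {
    // 'fallback': try a smaller, more CPU-friendly model, relax the query,
    // or use something else entirely (WebGPU shaders, WASM)
}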

@fsolenberg

@inexorabletash this looks neat - but will the backend actually need more info (the whole model graph?) to be able to accurately determine feasibility? E.g. in the example, will the result be "optimized" only if the implementation supports the cartesian product of data types and ops (even though that may not strictly be required)?

@reillyeon
Contributor

The problem with this proposal is that the frameworks on which WebNN is implemented don't expose enough information to determine ahead of time whether a given operation will be optimized or not. At best in our current implementation we can tell whether the framework supports an op or if the browser needs to emulate it, but we can't tell whether the framework itself emulates the op on a given hardware configuration.

@zolkis
Collaborator

zolkis commented Feb 15, 2025

@reillyeon
Fair enough. AFAICT it doesn't invalidate the given API approach, though... we could possibly just expose the best information available, and we'd need to add a note in the spec warning about this possible issue.
Figuring out workarounds for that problem will take more than a single API call anyway; it will require developer experience on the given platform.
We could alleviate that if we could feed requirements / feedback back to those frameworks, too?

What is your take on
a) checking if a context has "gpu" vs.
b) querying some named capabilities (after context creation, and possibly after compilation)?

(As there have been issues with "gpu" and "npu" naming (especially NPU is very diverse), we could replace a) with checking if a context is "accelerated", based on this suggestion, but we should be able to well-define in the spec what "accelerated" means.)

@huningxin
Contributor

@reillyeon

At best in our current implementation we can tell whether the framework supports an op or if the browser needs to emulate it, but we can't tell whether the framework itself emulates the op on a given hardware configuration.

That's true. On the other hand, even if an op is emulated by the browser, the underlying framework may still be able to fuse the decomposed small ops into an optimized one, which we can't tell either.

@handellm

Another clarification about our use case is that we'd strongly recommend the API to not do internal fallbacks from "accelerated" to CPU in runtime.

It sounds like your use case needs a "non-CPU context" (or "accelerator-only context").

The context created by DirectML backend is accelerator-only, because DirectML itself doesn't support CPU execution.

TFLite backend may avoid CPU fallback by checking whether the graph is fully delegated by an accelerator delegate.

CoreML's MLComputeUnits seems to always include the CPU device, and I'm not sure whether there is a way to exclude CPU. @mwyrzykowski ?

As another example, ONNX Runtime also allows to disable the fallback to default CPU EP (thanks @skottmckay for sharing that).

@zolkis
Collaborator

zolkis commented Feb 17, 2025

@handellm

Another clarification about our use case is that we'd strongly recommend the API to not do internal fallbacks from "accelerated" to CPU in runtime.

It sounds like your use case needs a "non-CPU context" (or "accelerator-only context").

The context created by DirectML backend is accelerator-only, because DirectML itself doesn't support CPU execution.

TFLite backend may avoid CPU fallback by checking whether the graph is fully delegated by an accelerator delegate.

CoreML's MLComputeUnits seems to always include the CPU device, and I'm not sure whether there is a way to exclude CPU. @mwyrzykowski ?

As another example, ONNX Runtime also allows to disable the fallback to default CPU EP (thanks @skottmckay for sharing that).

Should we expose a context creation hint/option for preventing CPU fallback? And make the result (status) available for query?

Corollary: should we make a distinction in the API between options (which, when not met, result in failure) and hints (which do not result in an error)?

Edit: it looks like, due to conflicting requirements and support levels, device preferences may only be hints, like "no-gpu", "no-npu", "no-cpu", "gpu", "npu", "cpu", with post-creation query support for "has" { "gpu", "npu", "cpu" } AND/OR/XOR capabilities.
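A minimal sketch of that hint-plus-query combination (the deviceHints option is hypothetical; context.devices is the setlike proposed earlier in this thread):

const context = await navigator.ml.createContext({
    powerPreference: 'high-performance',
    deviceHints: ['no-cpu']             // hypothetical hint, not a hard requirement
});

if (!context.devices.has('cpu')) {
    // hint was honored: no silent CPU fallback expected for this context
} else {
    // hint could not be honored; the app decides whether to proceed or use WebGPU / WASM
}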

@anssiko anssiko changed the title Post-context creation query mechanism for supported devices Query mechanism for supported devices Feb 20, 2025
@mwyrzykowski

CoreML's MLComputeUnits seems to always include the CPU device, and I'm not sure whether there is a way to exclude CPU. @mwyrzykowski ?

That is correct, CPU fallback is unavoidable via CoreML. Implementations could use MPS and run on the GPU, but I'm not aware of any mechanism for supporting the NPU / ANE without potential CPU fallback.

@reillyeon
Contributor

From an interop perspective I think the mandatory CPU fallback behavior is a positive; we just need to provide a mechanism for developers who have developed better fallbacks to reliably detect it (as early as practical).

@reillyeon
Contributor

[W]e should be able to well-define in the spec what "accelerated" means.

I'm not so sure. To be pedantic, I might call a CPU inference engine using XNNPACK "accelerated" since it is using hand-optimized assembly routines tuned for a handful of CPU architectures rather than a naive implementation in C++. Similarly, is a GPU always "accelerated", or does it need to have dedicated "tensor cores"?

@mwyrzykowski

I have a conflict for today's meeting, but Apple's preference regarding #815 (comment) is (b).

Specifically, a processing chip is best described in terms of capabilities (which ops it supports, which limits it has, and so on) as opposed to names (which can vary drastically between vendors, or even within the same vendor).
