-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use USVString for operand name #713
Comments
To make the problem more concrete: if specified as a DOMString and something like "blah\uD800" is passed, since that has an unpaired UTF-16 surrogate there's no standard way to interpret that as a sequence of Unicode code points, so if back-ends treat it as anything but an opaque sequence of 16-bit code units, then there's a risk of interop issues - the label might come back out from errors as "blah\uD800" or "blah\uFFFD" or "blah\xED\xA0\x80" or ... Using USVString enforces standard behavior when a string like that is passed - it becomes "blah\uFFFD" which can be transcoded to UTF-8. The down sides are that if code were to intentionally try to use "blah\uD800" as a name or label then they wouldn't get what they expected coming out - but as noted, that's already a problem if back-ends do anything with the strings. And if code passed both "blah\uD800" and "blah\uD801" they are both corrected to "blah\uFFFD" so would collide as keys. But again, depending on what back-ends do, that may already be a problem. |
Ah, so USVString, DOMString, and JavaScript String are all just different names of the identical underlying encoding, UTF-16. It's just that USVString is a string which guarantees no unpaired surrogates (any occurrences have been replaced with � U+FFFD).
I like the consistency aspect with WebGPU's existing label.
🤔 I'm really not concerned with such degenerate cases 🐞, because where would the ill-formed sequence come from in the first place? If these names are coming from model files, then I expect they're well-formed to begin with (e.g. in ONNX "All strings are encoded in UTF-8", and SafeTensors "a JSON UTF-8 string representing the header"). Of course, one could call WebNN directly, but using unpaired surrogates like that would be a brittle abuse (and like Joshua says, "depending on what back-ends do, that may already be a problem."). Additionally Python (a likely source language where these models will originate) builds apparently come in flavors of UCS-2 (not UTF-16, meaning it lacks surrogate pair code units anyway) or UCS-4 (which shouldn't have surrogate pairs either because they're reserved code units in UTF-32). |
Signed-off-by: Zoltan Kis <[email protected]>
Signed-off-by: Zoltan Kis <[email protected]> Fix #713
I was reviewing the label implementation and realized the WebGPU uses USVString for label (see previous discussion).
I think we should use USVString for MLOperand name too? Chromium implementation currently utf-8 encode the name and send it to backends, and the names are further serialized to backend specific representations(coreml, tflite).
The text was updated successfully, but these errors were encountered: