Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use USVString for operand name #713

Closed
philloooo opened this issue Jun 28, 2024 · 2 comments · Fixed by #715
Closed

Use USVString for operand name #713

philloooo opened this issue Jun 28, 2024 · 2 comments · Fixed by #715
Assignees
Labels

Comments

@philloooo
Copy link
Contributor

philloooo commented Jun 28, 2024

I was reviewing the label implementation and realized the WebGPU uses USVString for label (see previous discussion).

I think we should use USVString for MLOperand name too? Chromium implementation currently utf-8 encode the name and send it to backends, and the names are further serialized to backend specific representations(coreml, tflite).

@inexorabletash
Copy link
Member

To make the problem more concrete: if specified as a DOMString and something like "blah\uD800" is passed, since that has an unpaired UTF-16 surrogate there's no standard way to interpret that as a sequence of Unicode code points, so if back-ends treat it as anything but an opaque sequence of 16-bit code units, then there's a risk of interop issues - the label might come back out from errors as "blah\uD800" or "blah\uFFFD" or "blah\xED\xA0\x80" or ...

Using USVString enforces standard behavior when a string like that is passed - it becomes "blah\uFFFD" which can be transcoded to UTF-8.

The down sides are that if code were to intentionally try to use "blah\uD800" as a name or label then they wouldn't get what they expected coming out - but as noted, that's already a problem if back-ends do anything with the strings. And if code passed both "blah\uD800" and "blah\uD801" they are both corrected to "blah\uFFFD" so would collide as keys. But again, depending on what back-ends do, that may already be a problem.

@fdwr
Copy link
Collaborator

fdwr commented Jun 28, 2024

Ah, so USVString, DOMString, and JavaScript String are all just different names of the identical underlying encoding, UTF-16. It's just that USVString is a string which guarantees no unpaired surrogates (any occurrences have been replaced with � U+FFFD).

WebGPU uses USVString for label

I like the consistency aspect with WebGPU's existing label.

The down sides are that if code were to intentionally try to use "blah\uD800" as a name or label then they wouldn't get what they expected coming out

🤔 I'm really not concerned with such degenerate cases 🐞, because where would the ill-formed sequence come from in the first place? If these names are coming from model files, then I expect they're well-formed to begin with (e.g. in ONNX "All strings are encoded in UTF-8", and SafeTensors "a JSON UTF-8 string representing the header"). Of course, one could call WebNN directly, but using unpaired surrogates like that would be a brittle abuse (and like Joshua says, "depending on what back-ends do, that may already be a problem."). Additionally Python (a likely source language where these models will originate) builds apparently come in flavors of UCS-2 (not UTF-16, meaning it lacks surrogate pair code units anyway) or UCS-4 (which shouldn't have surrogate pairs either because they're reserved code units in UTF-32).

zolkis added a commit to zolkis/webnn that referenced this issue Jun 29, 2024
huningxin pushed a commit that referenced this issue Jul 5, 2024
Signed-off-by: Zoltan Kis <[email protected]>

Fix #713
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants