diff --git a/proposals/stringref/Overview.md b/proposals/stringref/Overview.md
index c26efbe..9cedcec 100644
--- a/proposals/stringref/Overview.md
+++ b/proposals/stringref/Overview.md
@@ -202,11 +202,11 @@ value reaches any instruction in this proposal. The one exception is
 ### Creating strings

 ```
-(string.new_utf8 $memory ptr:address bytes:i32)
+(string.decode_from_utf8 $memory ptr:address bytes:i32)
   -> str:stringref
-(string.new_lossy_utf8 $memory ptr:address bytes:i32)
+(string.decode_from_lossy_utf8 $memory ptr:address bytes:i32)
   -> str:stringref
-(string.new_wtf8 $memory ptr:address bytes:i32)
+(string.decode_from_wtf8 $memory ptr:address bytes:i32)
   -> str:stringref
 ```
 Create a new string from the *`bytes`* bytes in memory at *`ptr`*.
@@ -215,22 +215,22 @@ Out-of-bounds access will trap. The maximum value for *`bytes`* is

 These three instructions decode the bytes in three different ways:

- * `string.new_utf8` decodes using a strict UTF-8 decoder. If the
+ * `string.decode_from_utf8` decodes using a strict UTF-8 decoder. If the
    bytes are not valid UTF-8, trap.
- * `string.new_lossy_utf8` decodes using a sloppy UTF-8 decoder: all
+ * `string.decode_from_lossy_utf8` decodes using a sloppy UTF-8 decoder: all
    maximal subparts of an invalid subsequence are decoded as if they
    were `U+FFFD` (the replacement character) instead. This
    instruction will never trap due to a decoding error. See the
    section entitled "U+FFFD Substitution of Maximal Subparts" in the
    Unicode standard, version 14.0.0, page 126.
- * `string.new_wtf8` decodes using a strict WTF-8 decoder, which is like
+ * `string.decode_from_wtf8` decodes using a strict WTF-8 decoder, which is like
    UTF-8 but also allows isolated surrogates. If the bytes are not
    valid WTF-8, trap.
 ```
-(string.new_wtf16 $memory ptr:address codeunits:i32)
+(string.decode_from_wtf16 $memory ptr:address codeunits:i32)
   -> str:stringref
 ```
 Create a new string from the *`codeunits`* code units encoded in memory at
@@ -240,14 +240,14 @@ is 2<sup>30</sup>–1; passing a higher value traps. Each code unit is read from
 memory as if with `i32.load16`, and is therefore decoded using
 little-endian byte order.

-#### `string.new` size limits
+#### `string.decode_from_*` size limits

 Creating a string is a form of dynamic allocation and can fail. The
 same implementation running on different machines can have different
 behaviors. The specification can only say that byte/code-unit sizes
 above a certain limit *must* fail; but for sizes within the limits, the
 allocations *may* fail. If an allocation fails, the implementation must
-trap. Fallible `string.new` is a possible future extension.
+trap. Fallible `string.decode_from_*` is a possible future extension.

 ### String literals

@@ -281,7 +281,7 @@ string literal section as a future extension.

 The maximum size for the WTF-8 encoding of an individual string
 literal is 2<sup>31</sup>–1 bytes. Embeddings may impose their own limits which
-are more restricted. But similarly to `string.new_wtf8`, instantiating
+are more restricted. But similarly to `string.decode_from_wtf8`, instantiating
 a module with string literals may fail due to lack of memory resources,
 even if the string size is formally within the limits. However
 `string.const` itself never traps when passed a valid literal offset.
@@ -331,7 +331,7 @@ is 2<sup>30</sup>-1. If an encoding would require more code units than the
 limit, the result is -1.

 ```
-(string.encode_utf8 $memory str:stringref ptr:address)
+(string.encode_to_utf8 $memory str:stringref ptr:address)
   -> codeunits:i32
 ```
 Encode the contents of the string *`str`* as UTF-8 to memory at *ptr*.
@@ -340,11 +340,11 @@ written, which will be the same as returned by the corresponding
 `string.measure_utf8`.
 The maximum number of bytes that can be encoded at once by
-`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+`string.encode_to_utf8` is 2<sup>31</sup>-1. If an encoding would require more
 bytes, it is as if the codepoints can't be encoded (a trap).

 ```
-(string.encode_lossy_utf8 $memory str:stringref ptr:address)
+(string.encode_to_lossy_utf8 $memory str:stringref ptr:address)
   -> codeunits:i32
 ```
 Encode the contents of the string *`str`* as UTF-8 to memory at *`ptr`*.
@@ -353,11 +353,11 @@ character) instead. Return the number of code units written, which will
 be the same as returned by the corresponding `string.measure_wtf8`.

 The maximum number of bytes that can be encoded at once by
-`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+`string.encode_to_lossy_utf8` is 2<sup>31</sup>-1. If an encoding would require more
 bytes, it is as if the codepoints can't be encoded (a trap).

 ```
-(string.encode_wtf8 $memory str:stringref ptr:address)
+(string.encode_to_wtf8 $memory str:stringref ptr:address)
   -> codeunits:i32
 ```
 Encode the contents of the string *`str`* as WTF-8 to memory at *`ptr`*.
@@ -365,11 +365,11 @@ Return the number of code units written, which will be the same as
 returned by the corresponding `string.measure_wtf8`.

 The maximum number of bytes that can be encoded at once by
-`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+`string.encode_to_wtf8` is 2<sup>31</sup>-1. If an encoding would require more
 bytes, it is as if the codepoints can't be encoded (a trap).

 ```
-(string.encode_wtf16 $memory str:stringref ptr:address)
+(string.encode_to_wtf16 $memory str:stringref ptr:address)
   -> codeunits:i32
 ```
 Encode the contents of the string *`str`* as WTF-16 to memory at
@@ -380,7 +380,7 @@ Each code unit is written to memory as if stored by `i32.store16`, so
 WTF-16 code units are in little-endian byte order.

 The maximum number of bytes that can be encoded at once by
-`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+`string.encode_to_wtf16` is 2<sup>31</sup>-1. If an encoding would require more
 bytes, it is as if the codepoints can't be encoded (a trap).

 ### Concatenation
@@ -603,13 +603,13 @@ The instructions below shall be available in WebAssembly implementations
 that support both GC and stringrefs.

 ```
-(string.new_utf8_array codeunits:$t start:i32 end:i32)
+(string.decode_from_utf8_array codeunits:$t start:i32 end:i32)
   if expand($t) => array i8
   -> str:stringref
-(string.new_lossy_utf8_array codeunits:$t start:i32 end:i32)
+(string.decode_from_lossy_utf8_array codeunits:$t start:i32 end:i32)
   if expand($t) => array i8
   -> str:stringref
-(string.new_wtf8_array codeunits:$t start:i32 end:i32)
+(string.decode_from_wtf8_array codeunits:$t start:i32 end:i32)
   if expand($t) => array i8
   -> str:stringref
 ```
@@ -617,12 +617,12 @@ Create a new string from a subsequence of the *`codeunits`* bytes in a
 GC-managed array, starting from offset *`start`* and continuing to but
 not including *`end`*. If *`end`* is less than *`start`* or is greater
 than the array length, trap. The bytes are decoded in the same way as
-`string.new_utf8`, `string.new_lossy_utf8`, and `string.new_wtf8`,
+`string.decode_from_utf8`, `string.decode_from_lossy_utf8`, and `string.decode_from_wtf8`,
 respectively. The maximum value for *`end`*–*`start`* is 2<sup>31</sup>–1;
 passing a higher value traps.

 ```
-(string.new_wtf16_array codeunits:$t start:i32 end:i32)
+(string.decode_from_wtf16_array codeunits:$t start:i32 end:i32)
   if expand($t) => array i16
   -> str:stringref
 ```
@@ -634,16 +634,16 @@ for *`end`*–*`start`* is 2<sup>30</sup>–1; passing a higher value traps.
 ```
-(string.encode_utf8_array str:stringref array:$t start:i32)
+(string.encode_to_utf8_array str:stringref array:$t start:i32)
   if expand($t) => array (mut i8)
   -> codeunits:i32
-(string.encode_lossy_utf8_array str:stringref array:$t start:i32)
+(string.encode_to_lossy_utf8_array str:stringref array:$t start:i32)
   if expand($t) => array (mut i8)
   -> codeunits:i32
-(string.encode_wtf8_array str:stringref array:$t start:i32)
+(string.encode_to_wtf8_array str:stringref array:$t start:i32)
   if expand($t) => array (mut i8)
   -> codeunits:i32
-(string.encode_wtf16_array str:stringref array:$t start:i32)
+(string.encode_to_wtf16_array str:stringref array:$t start:i32)
   if expand($t) => array (mut i16)
   -> codeunits:i32
 ```
@@ -655,8 +655,8 @@ same as the result of a the corresponding `string.measure_wtf8` or
 code units in the array, trap. Note that no `NUL` terminator is ever
 written.

-For `string.encode_utf8_array`, trap if an isolated surrogate is seen.
-For `string.encode_lossy_utf8_array`, replace isolated surrogates with
+For `string.encode_to_utf8_array`, trap if an isolated surrogate is seen.
+For `string.encode_to_lossy_utf8_array`, replace isolated surrogates with
 `U+FFFD`.

 ## Binary encoding
@@ -669,21 +669,21 @@ reftype ::= ...
   | 0x61 ⇒ stringview_iter ; SLEB128(-0x1f)

 instr ::= ...
-  | 0xfb 0x80:u32 $mem:u32 ⇒ string.new_utf8 $mem
-  | 0xfb 0x81:u32 $mem:u32 ⇒ string.new_wtf16 $mem
+  | 0xfb 0x80:u32 $mem:u32 ⇒ string.decode_from_utf8 $mem
+  | 0xfb 0x81:u32 $mem:u32 ⇒ string.decode_from_wtf16 $mem
   | 0xfb 0x82:u32 $idx:u32 ⇒ string.const $idx
   | 0xfb 0x83:u32 ⇒ string.measure_utf8
   | 0xfb 0x84:u32 ⇒ string.measure_wtf8
   | 0xfb 0x85:u32 ⇒ string.measure_wtf16
-  | 0xfb 0x86:u32 $mem:u32 ⇒ string.encode_utf8 $mem
-  | 0xfb 0x87:u32 $mem:u32 ⇒ string.encode_wtf16 $mem
+  | 0xfb 0x86:u32 $mem:u32 ⇒ string.encode_to_utf8 $mem
+  | 0xfb 0x87:u32 $mem:u32 ⇒ string.encode_to_wtf16 $mem
   | 0xfb 0x88:u32 ⇒ string.concat
   | 0xfb 0x89:u32 ⇒ string.eq
   | 0xfb 0x8a:u32 ⇒ string.is_usv_sequence
-  | 0xfb 0x8b:u32 $mem:u32 ⇒ string.new_lossy_utf8 $mem
-  | 0xfb 0x8c:u32 $mem:u32 ⇒ string.new_wtf8 $mem
-  | 0xfb 0x8d:u32 $mem:u32 ⇒ string.encode_lossy_utf8 $mem
-  | 0xfb 0x8e:u32 $mem:u32 ⇒ string.encode_wtf8 $mem
+  | 0xfb 0x8b:u32 $mem:u32 ⇒ string.decode_from_lossy_utf8 $mem
+  | 0xfb 0x8c:u32 $mem:u32 ⇒ string.decode_from_wtf8 $mem
+  | 0xfb 0x8d:u32 $mem:u32 ⇒ string.encode_to_lossy_utf8 $mem
+  | 0xfb 0x8e:u32 $mem:u32 ⇒ string.encode_to_wtf8 $mem
   | 0xfb 0x90:u32 ⇒ string.as_wtf8
   | 0xfb 0x91:u32 ⇒ stringview_wtf8.advance
   | 0xfb 0x92:u32 $mem:u32 ⇒ stringview_wtf8.encode_utf8 $mem
@@ -700,14 +700,14 @@ instr ::= ...
   | 0xfb 0xa2:u32 ⇒ stringview_iter.advance
   | 0xfb 0xa3:u32 ⇒ stringview_iter.rewind
   | 0xfb 0xa4:u32 ⇒ stringview_iter.slice
-  | 0xfb 0xb0:u32 [gc] ⇒ string.new_utf8_array
-  | 0xfb 0xb1:u32 [gc] ⇒ string.new_wtf16_array
-  | 0xfb 0xb2:u32 [gc] ⇒ string.encode_utf8_array
-  | 0xfb 0xb3:u32 [gc] ⇒ string.encode_wtf16_array
-  | 0xfb 0xb4:u32 [gc] ⇒ string.new_lossy_utf8_array
-  | 0xfb 0xb5:u32 [gc] ⇒ string.new_wtf8_array
-  | 0xfb 0xb6:u32 [gc] ⇒ string.encode_lossy_utf8_array
-  | 0xfb 0xb7:u32 [gc] ⇒ string.encode_wtf8_array
+  | 0xfb 0xb0:u32 [gc] ⇒ string.decode_from_utf8_array
+  | 0xfb 0xb1:u32 [gc] ⇒ string.decode_from_wtf16_array
+  | 0xfb 0xb2:u32 [gc] ⇒ string.encode_to_utf8_array
+  | 0xfb 0xb3:u32 [gc] ⇒ string.encode_to_wtf16_array
+  | 0xfb 0xb4:u32 [gc] ⇒ string.decode_from_lossy_utf8_array
+  | 0xfb 0xb5:u32 [gc] ⇒ string.decode_from_wtf8_array
+  | 0xfb 0xb6:u32 [gc] ⇒ string.encode_to_lossy_utf8_array
+  | 0xfb 0xb7:u32 [gc] ⇒ string.encode_to_wtf8_array

 ;; New section. If present, must be present only once, and right before
 ;; the globals section (or where the globals section would be). Each
@@ -733,11 +733,11 @@ operand allows you to elide the memory, in which case it defaults to 0.
   local.get $ptr
   local.get $ptr
   call $strlen
-  string.new_utf8)
+  string.decode_from_utf8)
 ```

 If the bytes being decoded aren't actually valid UTF-8, this function
-will trap. Use `string.new_lossy_utf8` in contexts where replacing
+will trap. Use `string.decode_from_lossy_utf8` in contexts where replacing
 invalid data with `U+FFFD` is a better strategy than trapping.

 ### Make string from an array of WTF-8 code units in memory
@@ -746,20 +746,20 @@ invalid data with `U+FFFD` is a better strategy than trapping.
 (func $string-from-wtf8n (param $ptr i32) (param $len i32) (result stringref)
   local.get $ptr
   local.get $len
-  string.new_wtf8)
+  string.decode_from_wtf8)
 ```

-Note that `string.new_wtf8` (and `string.new_wtf8_array`) are always
+Note that `string.decode_from_wtf8` (and `string.decode_from_wtf8_array`) are always
 strict decoders: if the bytes are not valid WTF-8, the instruction
 traps.

-### Make string from UTF-16 in memory
+### Make string from WTF-16 in memory

 ```wasm
-(func $string-from-utf16 (param $ptr i32) (param $units i32) (result stringref)
+(func $string-from-wtf16n (param $ptr i32) (param $units i32) (result stringref)
   local.get $ptr
   local.get $units
-  string.new_wtf16)
+  string.decode_from_wtf16)
 ```

 This proposal doesn't distinguish between UTF-16 and WTF-16 at all;
@@ -971,7 +971,7 @@ open to considering adding more instructions.

   local.get $str
   local.get $ptr
-  string.encode_utf8 ;; push bytes written, same as $len
+  string.encode_to_utf8 ;; push bytes written, same as $len

   local.get $ptr
   i32.add
@@ -986,8 +986,8 @@ Using `string.measure_utf8` ensures that the encoded string is a valid
 unicode scalar value sequence. How to handle invalid UTF-8 is up to the
 user; instead of `unreachable` we could throw an exception.

-Note that in this case, the subsequent `string.encode_utf8` could just
-as well have been `string.encode_lossy_utf8` or `string.encode_wtf8`, as
+Note that in this case, the subsequent `string.encode_to_utf8` could just
+as well have been `string.encode_to_lossy_utf8` or `string.encode_to_wtf8`, as
 these instructions are all the same for strings that do not contain
 isolated surrogates, and we checked that there were none.

@@ -1012,7 +1012,7 @@ will encode isolated surrogates as WTF-8.
   local.get $cursor
   global.get $buf
   i32.const 1024
-  string.encode_wtf8 ;; push bytes written
+  string.encode_to_wtf8 ;; push bytes written
   local.tee $bytes
   (if i32.eqz (then return)) ;; if no bytes encoded, done
   local.get $bytes
@@ -1445,7 +1445,7 @@ faster than `externref`+imports:
     predictable performance than e.g. an encoder implemented in JS (for
     web embeddings).
  4. Reading string contents, either via
-    `string.encode_wtf8`-then-process-inline or via `stringview_wtf16`,
+    `string.encode_to_wtf8`-then-process-inline or via `stringview_wtf16`,
     is likely faster than calling out to JavaScript to read code units
     one at a time. WebAssembly-to-JavaScript calls are cheap but not
     free.
@@ -1506,8 +1506,8 @@ concrete adapter function specialized to the data representations used
 by the caller and the callee. The instruction set in this proposal can
 be used to implement the adapter function for passing a `stringref` as a
 string; assuming that the adapter function is generated in such a way
-that it has access to the target memory, `string.encode_wtf8` can
-implement the copy and validation at the same time. `string.new_wtf8`
+that it has access to the target memory, `string.encode_to_wtf8` can
+implement the copy and validation at the same time. `string.decode_from_wtf8`
 would be the implementation of getting a `stringref` from an
 interface-typed string value, again assuming UTF-8 encoding for these
 values.