The Big UTF16 String Rewrite #5339

moulins · 2021-09-20T22:51:22Z

Note: This PR is based on #4285.

Currently, there's a mismatch between the type used by Ruffle to represent Flash strings (Rust's String) and the actual semantics of Flash strings
The differences:

String is UTF8, but Flash strings are UTF16, with unpaired surrogates allowed
String can contains null bytes, Flash strings cannot (not entirely sure? there's no way to create a \0 from the AVM, but maybe loading a \0 from the constant pool works?)
String is unlimited in length (in practice), but Flash strings can't have more than 2³¹-1 chars.

This PR aims to fix 1 and 3 (but not 2) by introducing a new UTF16 string type, ruffle_core::string::Str, and by replacing all usages of String in AVM code with it.

This string type comes in several flavors: Str<'a> for immutable slices, StrMut<'a> for mutable slices, StrBuf and BoxedStr for owned strings, and AvmString for GC'd strings in AVM1/2.

These types store internally either a [u8] or a [u16], depending the string contents, and manage conversions between these two forms mostly transparently. This choice was made instead of only storing [u16] for two reasons:

Most strings will be LATIN1-only, and so storing them in [u16] would be a waste of memory;
To limit code churn, we want to be able to reinterpret Rust static UTF8 strings (for built-ins classes, methods, and fields) as AvmStrings without copying the buffer contents. (Another option would be to add a proc-macro to embed UTF16 strings directly, e.g. wstr!("foobar") instead of "foobar");
(bonus third reason: this is what avmplus does).

However, this comes with some drawbacks:

We need some unsafe code to pack the string representation efficiently in two usizes.
We need the types Str<'a> and StrMut<'a>, instead of using a custom DST and having &'a Str and &'a mut Str.
This prevents us from having the owned string types implement Deref or DerefMut, forcing us to duplicate strings function on all string types with a macro.
- We could actually have a custom Str DST, and avoid these two issues, but this requires some black magic that may be unsound, depending on the exact rules of the (future) Rust memory model regarding provenance (see https://github.com/moulins/ruffle/tree/ruffle-string-unsound for a proof-of-concept).
Performance may be slightly worse, as there is an extra branch on the storage type when manipulating strings.

This PR isn't finished yet: the new string types are implemented (modulo some missing APIs), and are being used in some modules of ruffle_core, but many places still use String.

Bale001 · 2021-09-20T23:19:52Z

2. String can contains null bytes, Flash strings cannot (not entirely sure? there's no way to create a \0 from the AVM, but maybe loading a \0 from the constant pool works?)

Here is an example of this:

var ba = new ByteArray();
ba.writeUTFBytes("example");
ba.writeByte(0);
ba.writeUTFBytes("test");
var s = ba.toString();
trace(s);
trace(s.length);

Traces:

example
12

Shows that the strings are null terminated

moulins · 2021-09-20T23:32:02Z

var ba = new ByteArray();
ba.writeUTFBytes("example");
ba.writeByte(0);
ba.writeUTFBytes("test");
var s = ba.toString();
trace(s);
trace(s.length);

Oh, that's clever ^^

Shows that the strings are null terminated

Well, if the length is still 12 it means that the string itself doesn't care about interior nulls, but that some other APIs do.

If we want to reproduce this, we should check all the ways strings can be used, and look where \0s truncate strings and where they don't...

Bale001 · 2021-09-20T23:48:33Z

Yeah your right I got that a bit wrong, they store the length and the underlying buffer separately, so it's not null terminated. I'm looking through AVMPlus, it looks more like this. When we call toString it makes a call to this function to create the string. The null terminating part seems to only happen during the call to trace (as you said).

moulins · 2021-10-07T01:07:34Z

Okay, this is ready for review, at last!

I've renamed the custom string types to WStr, WStrMut, WString, etc (W for Wide), to avoid confusion between Str and str.

There are still improvements that could be made, but this PR is already big enough and I believe extra improvements are better made in future PRs:

Move core::string module into a separate crate.
Add a SwfStr::to_wstr method to directly get a WStr from a string stored in the SWF. (needs some testing to know how Flash handles non-UTF8 strings in SWF)
There are still UTF8 conversions that could be avoided for correctness/performance.
Replace WStr<'a> and WStrMut<'a> with a single dynamically-sized WStr. As said in the PR's main comment, this require unsafe code that is unsound under the current version of Stacked Borrow, but there's hope that it will be sound in the future, and this could bring significant ergonomic improvements for Ruffle (mainly, the ability to implement Deref, AsRef, and other related traits).
Bring WStr's API surface to parity with std::str; this isn't really necessary, but would still be useful in the future.

Herschel

Thanks for your massive effort on this! It's like we have a real VM :-)

My only high-level worry is the ergonomic hit from avoiding DSTs. It'd be a lot of churn re-adding all of the &s if things improve in the future, if we learn for certain that we can soundly use DST wstrs.

So I'm curious what your feeling is, or the Rust community's feeling in general. Is it expected to satisfy stacked borrows now for some potential future where it's finalized; or is it okay to do the unsafe-fu now for some potential future where it can be made sound?

Otherwise, it's mainly a bunch of style nits and some pre-existing issues.

To simplify if avm_str == WStr::from_units(b"literal"), could we, say, impl<'gc, N: usize> PartialEq<&[u8; N]> for AvmString<'gc>? Then hopefully all of those becomeif avm_str == b"literal".
Should WStr indices, lengths etc. be u32 instead of usize? Not sure if this makes it more or less annoying to work with, but recently I've wanted to be more strict with matching Flash's types.

Future thoughts you may already be considering:

I'm sure there'll be more to pack into WStr eventually, so packing the wide bit into len on 32-bit may be a little overkill -- we'll probably want more bytes to work with anyway.
Following from the above, it seems like there's a path to merging WStr and AvmString eventually. Static vs. GCd vs. owned are some more bits to pack in there.

String can contains null bytes, Flash strings cannot (not entirely sure? there's no way to create a \0 from the AVM, but maybe loading a \0 from the constant pool works?)

In SWF and AVM1 bytecode, the strings are null-terminated, so you can't inject a zero byte in the middle of a string. (DefineFont tag font name is the exception, as I've seen these with zero bytes in the wild).
In AVM2 bytecode, the strings are all length-prefixed, so you can have a zero byte in them. For example, you can do var s = "AB\u0000CD";. For display purposes such as trace, this will stop at the zero byte, but doing "AB\0CD" != "AB" will return false.

Sorry for the length of the review, been working on it all day! Thanks for the patience.

core/src/avm1/function.rs

core/src/avm1/globals/string.rs

core/src/avm2/globals/flash/display/graphics.rs

core/src/string/buf.rs

core/src/string/raw.rs

core/src/string/parse.rs

core/src/avm1/value.rs

moulins · 2021-10-29T00:58:32Z

@Herschel I addressed most of the review's comments, and left some explanations in response.

I'll answer your more general remarks and questions in a later message ;).

moulins · 2021-10-31T23:52:42Z

My only high-level worry is the ergonomic hit from avoiding DSTs. It'd be a lot of churn re-adding all of the &s if things improve in the future, if we learn for certain that we can soundly use DST wstrs.

So I'm curious what your feeling is, or the Rust community's feeling in general. Is it expected to satisfy stacked borrows now for some potential future where it's finalized; or is it okay to do the unsafe-fu now for some potential future where it can be made sound?

Here's my take on this subject:

Under the current Stacked Borrows rules, it is impossible to soundly implement WStr as a DST (in short, what goes wrong is that, we need to have WStr { repr: [()] }, or something equivalent, but this has only provenance over 0 bytes, and so we can't use a &WStr to access the string data that it logically points to).
SB also disallows some semi-frequently used patterns related to the one we'd need for WStr, e.g. extern types behind references, or "header widening", were you cast a &Header to a &FullStruct (see these issues on UCG).
SB isn't the ultimate Rust semantics, only the best one we have yet, so we can expect SB to continue evolving, and maybe allow us to have a truly sound WStr DST one day.
In practice, even is SB says WStr DST is unsound, rustc doesn't seem to exploit the unsoundness now, and it looks like the LLVM IR it generates is sound under the LLVM memory model.

Personally, I think the unsoundness is "sufficiently-theoretic", and the ergomonic gains big enough that making WStr a DST is worth it, but if you don't trust rustc to not one day exploit the unsoundness (e.g. by putting more information in the generated LLVM IR) and break everything horribly, I understand completely :)

To simplify if avm_str == WStr::from_units(b"literal"), could we, say, impl<'gc, N: usize> PartialEq<&[u8; N]> for AvmString<'gc>? Then hopefully all of those become if avm_str == b"literal".

Good idea; should I rebase the PR to add this, or wait for another PR?

Should WStr indices, lengths etc. be u32 instead of usize? Not sure if this makes it more or less annoying to work with, but recently I've wanted to be more strict with matching Flash's types.

Personally, I see the 2³¹-length limitation as a sort of implementation detail, so I think keeping consistency with the rest of std is more important here. Additionally, some times you need to work on the raw slices by calling WStr::units, and here you'll need usizes anyways to index the slices, so it seems better to me to keep usizes for WStr.

I'm sure there'll be more to pack into WStr eventually, so packing the wide bit into len on 32-bit may be a little overkill -- we'll probably want more bytes to work with anyway.
Following from the above, it seems like there's a path to merging WStr and AvmString eventually. Static vs. GCd vs. owned are some more bits to pack in there.

My thinking here is that WStr shouldn't change in the future (modulo DST stuff if we do it); it is the equivalent of &str and should be as dumb as possible, so there's nothing more to pack in it. The AvmString type is here to augment WStr with GC support, and any extra flags would go there.

(Actually, there is one optimization possible: on 64-bits targets, there are 32 unused bits in WStrPtr, so I was thinking of adding a pub(crate) extra: u32 to expose them to any wrapper type that wants to use them (e.g. WString could store its capacity here, and AvmString could store its flags). But even here WStrPtr stays "dumb" and doesn't care what is stored in this extra field)

Herschel · 2021-11-01T03:55:45Z

Personally, I think the unsoundness is "sufficiently-theoretic", and the ergomonic gains big enough that making WStr a DST is worth it, but if you don't trust rustc to not one day exploit the unsoundness (e.g. by putting more information in the generated LLVM IR) and break everything horribly, I understand completely :)

Let's go ahead and use the DST, and keep a close eye on how the situation goes with stacked borrows. I feel like it's more likely that we'll be able to guarantee soundness in the long term, and by being another project in the wild relying on this behavior, it adds some gentle pressure for the unsafe WG to provide a way to ensure soundness. Worst case, we can just switch over when it becomes necessary.

If this seems like it'd be very tedious to change, I'd be okay with merging without the DST, and possibly changing later. I'll let you decide on this point.

Good idea; should I rebase the PR to add this, or wait for another PR?

This significantly improves readability, so let's go for it now. And cross fingers that custom literal suffixes comes along soon. 😄

My thinking here is that WStr shouldn't change in the future (modulo DST stuff if we do it); it is the equivalent of &str and should be as dumb as possible, so there's nothing more to pack in it. The AvmString type is here to augment WStr with GC support, and any extra flags would go there.

(Actually, there is one optimization possible: on 64-bits targets, there are 32 unused bits in WStrPtr, so I was thinking of adding a pub(crate) extra: u32 to expose them to any wrapper type that wants to use them (e.g. WString could store its capacity here, and AvmString could store its flags). But even here WStrPtr stays "dumb" and doesn't care what is stored in this extra field)

I see the argument about AvmString augmenting WStr, but in general, I feel the fewer layers of "wrapping", the better, both for maintainability and ease of use. The user code in the player doesn't really care which type it's working with, and the large majority of the code will need to work with both GCd and static strings. So the question for me is what benefit the distinction gives us long term. Specifically WString feels like the extraneous type, as it is nearly always tucked inside an AvmString -- if WString vanishes , then WStr becomes more like AvmStr, a view into an AvmString. In which case maybe RuffleStr/RuffleString is a better name. 😄 The fact that it eases the packing of additional bits is a bonus.

For comparison, avmplus bakes all of this into their one String type (wide flags, static/GCd flags, interned flags, subslice ptrs, etc.) -- not that we necessarily want to mimic what they do. In any case, I think WStr is the right path for now just because it's an easier stepping stone to improving things, and all of the above will require improvements in our GC situation.

Thanks again for your work and patience. All of the changes LGTM!

moulins · 2021-11-01T05:50:07Z

If this seems like it'd be very tedious to change, I'd be okay with merging without the DST, and possibly changing later. I'll let you decide on this point.

Doing the s.borrow() -> &s change in the rest of the codebase should be straightforward, but I'll need a few days to rewrite the string module with the DST; so it depends whether you want to merge the PR now or if you're ok for waiting a little more :).

Specifically WString feels like the extraneous type, as it is nearly always tucked inside an AvmString -- if WString vanishes , then WStr becomes more like AvmStr, a view into an AvmString. In which case maybe RuffleStr/RuffleString is a better name. smile The fact that it eases the packing of additional bits is a bonus.

Well, grepping the codebase for WString gives 150 uses, so it seems it's somewhat used ^^. I think that separating the "UTF16 handling" part and the "GC flags" part is valuable; I don't like much how avmplus' string type smushes everything into a single class. Moreover, one thing I'd like to do in the future is pull out the whole core::string module (without AvmString) into a separate ruffle_wstr GC-agnostic crate (and almost Flash-agnostic too, except for the case handling and the limitation on length), and for this we need the WStr, WString / AvmString split.

moulins · 2021-11-01T09:15:46Z

I've rebased on master and made the PartialEq changes; the PR should be ready to merge, unless you want to include the DST stuff in it.

moulins · 2021-11-05T22:37:07Z

I've made the change from WStr<'_>/WStrMut<'_> to the DST type, it's okay to merge now ;)

…n.rs

This allows `Str::{find, rfind, split}` to accept multiple types

This is a little tricky, because we have to map the utf8 indices returned by the regex engine to utf16 indices usable by Ruffle. To limit the impact on performance, the regex, the string we're currently matching on, and the last known (utf8, utf16) positions are cached, avoiding extra utf8 conversions in common use cases where a single string is repeatedly searched with increasing `lastIndex`.

This has the nice side-effect of reducing string cloning, because we can just pass AvmStrings around instead.

Also remove some useless back-and-forth conversions between AvmString and String

This avoids converting the string to UTF8 if it can't possibly be a float

moulins force-pushed the ruffle-string branch 2 times, most recently from 8b77d63 to 6cd716d Compare September 27, 2021 23:01

moulins force-pushed the ruffle-string branch 5 times, most recently from fb03a07 to 8c56c5a Compare October 7, 2021 00:36

moulins changed the title ~~WIP: The Big UTF16 String Rewrite~~ The Big UTF16 String Rewrite Oct 7, 2021

moulins force-pushed the ruffle-string branch 2 times, most recently from 43337f7 to 989322f Compare October 11, 2021 19:45

Herschel mentioned this pull request Oct 17, 2021

avm1: Display objects are indirectly referenced by string path (fix #1513) #5492

Closed

4 tasks

moulins force-pushed the ruffle-string branch from 989322f to 174295b Compare October 20, 2021 14:29

Herschel self-assigned this Oct 23, 2021

Herschel reviewed Oct 24, 2021

View reviewed changes

moulins force-pushed the ruffle-string branch 3 times, most recently from 8bf6d95 to 3e684bc Compare October 29, 2021 00:55

moulins force-pushed the ruffle-string branch from 3e684bc to a38b8b2 Compare October 29, 2021 01:04

moulins force-pushed the ruffle-string branch from a38b8b2 to b36bb89 Compare November 1, 2021 09:04

moulins force-pushed the ruffle-string branch from b36bb89 to 7de1f45 Compare November 5, 2021 22:34

moulins added 24 commits November 26, 2021 11:17

avm1: Requires a WStr<'_> to access PropertyMap

9472c2a

core: Replace string::utils *_ignore_case functions by methods on WStr

9043234

core: Remove some useless utf8<->utf16 conv. in {avm1,avm2}/activatio…

80c05e7

…n.rs

core: Add string::Pattern trait, based on std::str::Pattern

27b69aa

This allows `Str::{find, rfind, split}` to accept multiple types

avm1: partially rewrite Value::primitive_as_number

f9073c6

avm1: don't use &str methods when parsing enum parameters in MovieClip

d635b9c

avm1: don't use &str for activation's path & variable logic

feaa5a2

core: add FromWStr trait and WStr::parse

07bbea8

avm2: use WStr in date parsing logic

27bb21b

xml: use AvmString in xml code

79a5307

This has the nice side-effect of reducing string cloning, because we can just pass AvmStrings around instead.

avm1: adapt parseInt, escape & unescape to use WStr API

185d772

core: implement FromWStr for various enum types

37f7936

core: use WStrings for display object's frame and scene labels

e8e037f

avm1: remove AvmString::as_str usage in activation.rs and movie_clip.rs

32360c6

core: implement WStr::join

f2ec5e9

avm2: use WString in QName methods

a813739

core: remove some AvmString::as_str calls

6603792

core: adapt TextField, Font & Html handling code to use WStr

332fcbd

core: remove last instances of AvmString::as_str

cc2e021

core: remove AvmString::as_str, and the extra UTF8 String inside

ecb8067

chore: rename AvmString::{new -> new_utf8, new_ucs2 -> new}

0829a4c

core: Reduce allocations in AvmString::new_utf8

5974d71

Also remove some useless back-and-forth conversions between AvmString and String

core: improve f64's FromWStr impl

6042df8

This avoids converting the string to UTF8 if it can't possibly be a float

moulins force-pushed the ruffle-string branch from 5656a71 to d6850ac Compare November 27, 2021 14:56

core: Replace WStr<'_> and WStrMut<'_> by a single DST type WStr

6420227

moulins force-pushed the ruffle-string branch from d6850ac to 6420227 Compare November 27, 2021 15:39

kmeisthax merged commit 9701b81 into ruffle-rs:master Nov 27, 2021

moulins deleted the ruffle-string branch November 28, 2021 13:44

Aaron1011 mentioned this pull request Dec 18, 2021

Collection of cases of deliberate UB rust-lang/unsafe-code-guidelines#158

Closed

Uh oh!

The Big UTF16 String Rewrite #5339

The Big UTF16 String Rewrite #5339

Uh oh!

Conversation

moulins commented Sep 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Bale001 commented Sep 20, 2021

Uh oh!

moulins commented Sep 20, 2021

Uh oh!

Bale001 commented Sep 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

moulins commented Oct 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Herschel left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

moulins commented Oct 29, 2021

Uh oh!

moulins commented Oct 31, 2021

Uh oh!

Herschel commented Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

moulins commented Nov 1, 2021

Uh oh!

moulins commented Nov 1, 2021

Uh oh!

moulins commented Nov 5, 2021

Uh oh!

Uh oh!

moulins commented Sep 20, 2021 •

edited

Loading

Bale001 commented Sep 20, 2021 •

edited

Loading

moulins commented Oct 7, 2021 •

edited

Loading

Herschel left a comment •

edited

Loading

Herschel commented Nov 1, 2021 •

edited

Loading