Skip to content

The Big UTF16 String Rewrite #5339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Nov 27, 2021
Merged

Conversation

moulins
Copy link
Contributor

@moulins moulins commented Sep 20, 2021

Note: This PR is based on #4285.

Currently, there's a mismatch between the type used by Ruffle to represent Flash strings (Rust's String) and the actual semantics of Flash strings
The differences:

  1. String is UTF8, but Flash strings are UTF16, with unpaired surrogates allowed
  2. String can contains null bytes, Flash strings cannot (not entirely sure? there's no way to create a \0 from the AVM, but maybe loading a \0 from the constant pool works?)
  3. String is unlimited in length (in practice), but Flash strings can't have more than 2³¹-1 chars.

This PR aims to fix 1 and 3 (but not 2) by introducing a new UTF16 string type, ruffle_core::string::Str, and by replacing all usages of String in AVM code with it.

This string type comes in several flavors: Str<'a> for immutable slices, StrMut<'a> for mutable slices, StrBuf and BoxedStr for owned strings, and AvmString for GC'd strings in AVM1/2.

These types store internally either a [u8] or a [u16], depending the string contents, and manage conversions between these two forms mostly transparently. This choice was made instead of only storing [u16] for two reasons:

  • Most strings will be LATIN1-only, and so storing them in [u16] would be a waste of memory;
  • To limit code churn, we want to be able to reinterpret Rust static UTF8 strings (for built-ins classes, methods, and fields) as AvmStrings without copying the buffer contents. (Another option would be to add a proc-macro to embed UTF16 strings directly, e.g. wstr!("foobar") instead of "foobar");
  • (bonus third reason: this is what avmplus does).

However, this comes with some drawbacks:

  • We need some unsafe code to pack the string representation efficiently in two usizes.
  • We need the types Str<'a> and StrMut<'a>, instead of using a custom DST and having &'a Str and &'a mut Str.
  • This prevents us from having the owned string types implement Deref or DerefMut, forcing us to duplicate strings function on all string types with a macro.
    • We could actually have a custom Str DST, and avoid these two issues, but this requires some black magic that may be unsound, depending on the exact rules of the (future) Rust memory model regarding provenance (see https://github.com/moulins/ruffle/tree/ruffle-string-unsound for a proof-of-concept).
  • Performance may be slightly worse, as there is an extra branch on the storage type when manipulating strings.

This PR isn't finished yet: the new string types are implemented (modulo some missing APIs), and are being used in some modules of ruffle_core, but many places still use String.

@Bale001
Copy link
Member

Bale001 commented Sep 20, 2021

2. String can contains null bytes, Flash strings cannot (not entirely sure? there's no way to create a \0 from the AVM, but maybe loading a \0 from the constant pool works?)

Here is an example of this:

var ba = new ByteArray();
ba.writeUTFBytes("example");
ba.writeByte(0);
ba.writeUTFBytes("test");
var s = ba.toString();
trace(s);
trace(s.length);

Traces:

example
12

Shows that the strings are null terminated

@moulins
Copy link
Contributor Author

moulins commented Sep 20, 2021

var ba = new ByteArray();
ba.writeUTFBytes("example");
ba.writeByte(0);
ba.writeUTFBytes("test");
var s = ba.toString();
trace(s);
trace(s.length);

Oh, that's clever ^^

Shows that the strings are null terminated

Well, if the length is still 12 it means that the string itself doesn't care about interior nulls, but that some other APIs do.

If we want to reproduce this, we should check all the ways strings can be used, and look where \0s truncate strings and where they don't...

@Bale001
Copy link
Member

Bale001 commented Sep 20, 2021

Yeah your right I got that a bit wrong, they store the length and the underlying buffer separately, so it's not null terminated. I'm looking through AVMPlus, it looks more like this. When we call toString it makes a call to this function to create the string. The null terminating part seems to only happen during the call to trace (as you said).

@moulins moulins force-pushed the ruffle-string branch 2 times, most recently from 8b77d63 to 6cd716d Compare September 27, 2021 23:01
@moulins moulins force-pushed the ruffle-string branch 5 times, most recently from fb03a07 to 8c56c5a Compare October 7, 2021 00:36
@moulins
Copy link
Contributor Author

moulins commented Oct 7, 2021

Okay, this is ready for review, at last!

I've renamed the custom string types to WStr, WStrMut, WString, etc (W for Wide), to avoid confusion between Str and str.

There are still improvements that could be made, but this PR is already big enough and I believe extra improvements are better made in future PRs:

  • Move core::string module into a separate crate.
  • Add a SwfStr::to_wstr method to directly get a WStr from a string stored in the SWF. (needs some testing to know how Flash handles non-UTF8 strings in SWF)
  • There are still UTF8 conversions that could be avoided for correctness/performance.
  • Replace WStr<'a> and WStrMut<'a> with a single dynamically-sized WStr. As said in the PR's main comment, this require unsafe code that is unsound under the current version of Stacked Borrow, but there's hope that it will be sound in the future, and this could bring significant ergonomic improvements for Ruffle (mainly, the ability to implement Deref, AsRef, and other related traits).
  • Bring WStr's API surface to parity with std::str; this isn't really necessary, but would still be useful in the future.

@moulins moulins changed the title WIP: The Big UTF16 String Rewrite The Big UTF16 String Rewrite Oct 7, 2021
@moulins moulins force-pushed the ruffle-string branch 2 times, most recently from 43337f7 to 989322f Compare October 11, 2021 19:45
@Herschel Herschel self-assigned this Oct 23, 2021
Copy link
Member

@Herschel Herschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your massive effort on this! It's like we have a real VM :-)

My only high-level worry is the ergonomic hit from avoiding DSTs. It'd be a lot of churn re-adding all of the &s if things improve in the future, if we learn for certain that we can soundly use DST wstrs.

So I'm curious what your feeling is, or the Rust community's feeling in general. Is it expected to satisfy stacked borrows now for some potential future where it's finalized; or is it okay to do the unsafe-fu now for some potential future where it can be made sound?

Otherwise, it's mainly a bunch of style nits and some pre-existing issues.

  • To simplify if avm_str == WStr::from_units(b"literal"), could we, say, impl<'gc, N: usize> PartialEq<&[u8; N]> for AvmString<'gc>? Then hopefully all of those becomeif avm_str == b"literal".

  • Should WStr indices, lengths etc. be u32 instead of usize? Not sure if this makes it more or less annoying to work with, but recently I've wanted to be more strict with matching Flash's types.

Future thoughts you may already be considering:

  • I'm sure there'll be more to pack into WStr eventually, so packing the wide bit into len on 32-bit may be a little overkill -- we'll probably want more bytes to work with anyway.
  • Following from the above, it seems like there's a path to merging WStr and AvmString eventually. Static vs. GCd vs. owned are some more bits to pack in there.

String can contains null bytes, Flash strings cannot (not entirely sure? there's no way to create a \0 from the AVM, but maybe loading a \0 from the constant pool works?)

In SWF and AVM1 bytecode, the strings are null-terminated, so you can't inject a zero byte in the middle of a string. (DefineFont tag font name is the exception, as I've seen these with zero bytes in the wild).
In AVM2 bytecode, the strings are all length-prefixed, so you can have a zero byte in them. For example, you can do var s = "AB\u0000CD";. For display purposes such as trace, this will stop at the zero byte, but doing "AB\0CD" != "AB" will return false.

Sorry for the length of the review, been working on it all day! Thanks for the patience.

@moulins moulins force-pushed the ruffle-string branch 3 times, most recently from 8bf6d95 to 3e684bc Compare October 29, 2021 00:55
@moulins
Copy link
Contributor Author

moulins commented Oct 29, 2021

@Herschel I addressed most of the review's comments, and left some explanations in response.

I'll answer your more general remarks and questions in a later message ;).

@moulins
Copy link
Contributor Author

moulins commented Oct 31, 2021

My only high-level worry is the ergonomic hit from avoiding DSTs. It'd be a lot of churn re-adding all of the &s if things improve in the future, if we learn for certain that we can soundly use DST wstrs.

So I'm curious what your feeling is, or the Rust community's feeling in general. Is it expected to satisfy stacked borrows now for some potential future where it's finalized; or is it okay to do the unsafe-fu now for some potential future where it can be made sound?

Here's my take on this subject:

  • Under the current Stacked Borrows rules, it is impossible to soundly implement WStr as a DST (in short, what goes wrong is that, we need to have WStr { repr: [()] }, or something equivalent, but this has only provenance over 0 bytes, and so we can't use a &WStr to access the string data that it logically points to).
  • SB also disallows some semi-frequently used patterns related to the one we'd need for WStr, e.g. extern types behind references, or "header widening", were you cast a &Header to a &FullStruct (see these issues on UCG).
  • SB isn't the ultimate Rust semantics, only the best one we have yet, so we can expect SB to continue evolving, and maybe allow us to have a truly sound WStr DST one day.
  • In practice, even is SB says WStr DST is unsound, rustc doesn't seem to exploit the unsoundness now, and it looks like the LLVM IR it generates is sound under the LLVM memory model.

Personally, I think the unsoundness is "sufficiently-theoretic", and the ergomonic gains big enough that making WStr a DST is worth it, but if you don't trust rustc to not one day exploit the unsoundness (e.g. by putting more information in the generated LLVM IR) and break everything horribly, I understand completely :)

To simplify if avm_str == WStr::from_units(b"literal"), could we, say, impl<'gc, N: usize> PartialEq<&[u8; N]> for AvmString<'gc>? Then hopefully all of those become if avm_str == b"literal".

Good idea; should I rebase the PR to add this, or wait for another PR?

Should WStr indices, lengths etc. be u32 instead of usize? Not sure if this makes it more or less annoying to work with, but recently I've wanted to be more strict with matching Flash's types.

Personally, I see the 2³¹-length limitation as a sort of implementation detail, so I think keeping consistency with the rest of std is more important here. Additionally, some times you need to work on the raw slices by calling WStr::units, and here you'll need usizes anyways to index the slices, so it seems better to me to keep usizes for WStr.

I'm sure there'll be more to pack into WStr eventually, so packing the wide bit into len on 32-bit may be a little overkill -- we'll probably want more bytes to work with anyway.
Following from the above, it seems like there's a path to merging WStr and AvmString eventually. Static vs. GCd vs. owned are some more bits to pack in there.

My thinking here is that WStr shouldn't change in the future (modulo DST stuff if we do it); it is the equivalent of &str and should be as dumb as possible, so there's nothing more to pack in it. The AvmString type is here to augment WStr with GC support, and any extra flags would go there.

(Actually, there is one optimization possible: on 64-bits targets, there are 32 unused bits in WStrPtr, so I was thinking of adding a pub(crate) extra: u32 to expose them to any wrapper type that wants to use them (e.g. WString could store its capacity here, and AvmString could store its flags). But even here WStrPtr stays "dumb" and doesn't care what is stored in this extra field)

@Herschel
Copy link
Member

Herschel commented Nov 1, 2021

Personally, I think the unsoundness is "sufficiently-theoretic", and the ergomonic gains big enough that making WStr a DST is worth it, but if you don't trust rustc to not one day exploit the unsoundness (e.g. by putting more information in the generated LLVM IR) and break everything horribly, I understand completely :)

Let's go ahead and use the DST, and keep a close eye on how the situation goes with stacked borrows. I feel like it's more likely that we'll be able to guarantee soundness in the long term, and by being another project in the wild relying on this behavior, it adds some gentle pressure for the unsafe WG to provide a way to ensure soundness. Worst case, we can just switch over when it becomes necessary.

If this seems like it'd be very tedious to change, I'd be okay with merging without the DST, and possibly changing later. I'll let you decide on this point.

Good idea; should I rebase the PR to add this, or wait for another PR?

This significantly improves readability, so let's go for it now. And cross fingers that custom literal suffixes comes along soon. 😄

My thinking here is that WStr shouldn't change in the future (modulo DST stuff if we do it); it is the equivalent of &str and should be as dumb as possible, so there's nothing more to pack in it. The AvmString type is here to augment WStr with GC support, and any extra flags would go there.

(Actually, there is one optimization possible: on 64-bits targets, there are 32 unused bits in WStrPtr, so I was thinking of adding a pub(crate) extra: u32 to expose them to any wrapper type that wants to use them (e.g. WString could store its capacity here, and AvmString could store its flags). But even here WStrPtr stays "dumb" and doesn't care what is stored in this extra field)

I see the argument about AvmString augmenting WStr, but in general, I feel the fewer layers of "wrapping", the better, both for maintainability and ease of use. The user code in the player doesn't really care which type it's working with, and the large majority of the code will need to work with both GCd and static strings. So the question for me is what benefit the distinction gives us long term. Specifically WString feels like the extraneous type, as it is nearly always tucked inside an AvmString -- if WString vanishes , then WStr becomes more like AvmStr, a view into an AvmString. In which case maybe RuffleStr/RuffleString is a better name. 😄 The fact that it eases the packing of additional bits is a bonus.

For comparison, avmplus bakes all of this into their one String type (wide flags, static/GCd flags, interned flags, subslice ptrs, etc.) -- not that we necessarily want to mimic what they do. In any case, I think WStr is the right path for now just because it's an easier stepping stone to improving things, and all of the above will require improvements in our GC situation.

Thanks again for your work and patience. All of the changes LGTM!

@moulins
Copy link
Contributor Author

moulins commented Nov 1, 2021

If this seems like it'd be very tedious to change, I'd be okay with merging without the DST, and possibly changing later. I'll let you decide on this point.

Doing the s.borrow() -> &s change in the rest of the codebase should be straightforward, but I'll need a few days to rewrite the string module with the DST; so it depends whether you want to merge the PR now or if you're ok for waiting a little more :).

Specifically WString feels like the extraneous type, as it is nearly always tucked inside an AvmString -- if WString vanishes , then WStr becomes more like AvmStr, a view into an AvmString. In which case maybe RuffleStr/RuffleString is a better name. smile The fact that it eases the packing of additional bits is a bonus.

Well, grepping the codebase for WString gives 150 uses, so it seems it's somewhat used ^^. I think that separating the "UTF16 handling" part and the "GC flags" part is valuable; I don't like much how avmplus' string type smushes everything into a single class. Moreover, one thing I'd like to do in the future is pull out the whole core::string module (without AvmString) into a separate ruffle_wstr GC-agnostic crate (and almost Flash-agnostic too, except for the case handling and the limitation on length), and for this we need the WStr, WString / AvmString split.

@moulins
Copy link
Contributor Author

moulins commented Nov 1, 2021

I've rebased on master and made the PartialEq changes; the PR should be ready to merge, unless you want to include the DST stuff in it.

@moulins
Copy link
Contributor Author

moulins commented Nov 5, 2021

I've made the change from WStr<'_>/WStrMut<'_> to the DST type, it's okay to merge now ;)

This allows `Str::{find, rfind, split}` to accept multiple types
This is a little tricky, because we have to map the utf8 indices
returned by the regex engine to utf16 indices usable by Ruffle.

To limit the impact on performance, the regex, the string we're
currently matching on, and the last known (utf8, utf16) positions
are cached, avoiding extra utf8 conversions in common use cases
where a single string is repeatedly searched with increasing
`lastIndex`.
This has the nice side-effect of reducing string cloning, because we can
just pass AvmStrings around instead.
Also remove some useless back-and-forth conversions between
AvmString and String
This avoids converting the string to UTF8 if it can't possibly
be a float
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants